I have tried both Llamafile and Ollama...
Ollama has better graphics card support; for example, in my case with Llamafile my GPUs are not being used at all, so the tokens-per-second rate is super slow...
With Ollama it uses 3 of my GPUs, and you can see which card is being used
by monitoring its VRAM usage (I didn't look into the details, but apparently the backend splits the model layers across the cards and balances the pipeline workload for you)... My next setup step is hooking up Open WebUI so I can have a local GUI... Currently I use Ollama directly in the CLI...
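If you have NVIDIA cards, a quick way to watch which card is actually doing the work is nvidia-smi (a minimal sketch; it assumes the NVIDIA driver utilities are available where Ollama runs):
Code:
# list per-card VRAM usage and utilization, refreshing every 2 seconds
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv -l 2
The card whose memory.used jumps while a prompt is generating is the one holding that part of the model.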
With Open WebUI you can feed RAG documents/libraries directly through the GUI.
There is a bug installing it on FreeBSD, but there is always a Linux VM under bhyve to get around that obstacle.
Bug Report
Installation Method
# pkg install -y python311 py311-pip postgresql16-server postgresql16-client portsnap rust py311-onnx
# mkdir -p /var/db/portsnap/ && portsnap fetch extract update
# ...
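Until the FreeBSD install is fixed, here is a rough sketch of the bhyve-Linux fallback I mean: install Open WebUI with pip inside the Linux VM and point it at the Ollama API in the jail (the IP address here is a placeholder, adjust to your setup):
Code:
# inside the Linux bhyve VM (recent Python assumed)
pip install open-webui
# tell the GUI where the Ollama API lives (the FreeBSD jail in my case)
export OLLAMA_BASE_URL=http://192.168.1.50:11434
open-webui serve
Then browse to the VM on port 8080 and add your RAG documents from the GUI.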
tingo the Post error means your cards or CPU don't have enough memory to run the model. Run a smaller model: if the model is 4 GB, you need at MINIMUM one card with that much VRAM, otherwise you're going to hit the Post error (there is a quick way to check the fit, see the sketch after the error output below). How do I know? Because when I try running Mistral-NeMo 12B or Qwen2.5 32B the same issue happens to me... But with smaller models like mistral:latest I never face this issue, and the local LLM can maintain context for much longer, like 24+ hours (depends on the number of prompts)...
Code:
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest
Error: Post "http://127.0.0.1:11434/api/chat": EOF
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest
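Before loading a big model you can sanity-check whether it even fits; the standard Ollama CLI shows the model size and, once loaded, how it was split between CPU and GPU (the model name here is just an example):
Code:
# show size, parameter count and default context of a model
ollama show mistral-nemo:latest
# while/after a prompt, show what is loaded and the CPU/GPU split
ollama ps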
I run this in a FreeBSD jail and pass the GPUs to it through devfs.rules.
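Roughly what that looks like (the ruleset number and the nvidia device names are my assumptions for NVIDIA cards; adjust them for your driver):
Code:
# /etc/devfs.rules on the host
[devfsrules_jail_ollama=100]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add path 'nvidia*' unhide
add path nvidiactl unhide
add path nvidia-uvm unhide

# /etc/jail.conf entry for the jail
# devfs_ruleset = 100;
Then restart devfs and the jail so the /dev/nvidia* nodes show up inside it.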
You can also look at the Ollama log; it shows why the POST error happens. In my case, with over 32 GB of RAM, I face it constantly when running big-parameter models on my hardware (now if I had an H100 or A100 with 40 or 80 GB, I doubt the Post error would happen). You lose the context and have to start the model again after the EOF error (a sketch of how I capture the log is after the warnings below).
Code:
time=2024-12-11T23:14:28.040Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.543873626 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.166Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.669209817 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.446Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.949622423 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
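If you start ollama serve by hand in the jail (the simplest way), the log is just its stdout/stderr, so you can keep it in a file; the path below is only an example:
Code:
# run the API server in the jail and keep its log for later inspection
ollama serve > /var/log/ollama.log 2>&1 &
# watch for the VRAM warnings shown above
tail -f /var/log/ollama.log | grep -i vram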
I also edited the context length (to 15,000 tokens); the default is set to 2048 tokens, and some models allow you to go up to 100,000 tokens of context...
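One way to do that is a Modelfile that raises num_ctx (a minimal sketch; the base model and the 15000 value mirror my setup, adjust to taste):
Code:
# Modelfile - raise the context window from the 2048 default
FROM mistral-nemo:latest
PARAMETER num_ctx 15000
Then ollama create context-mistral-nemo -f Modelfile builds it and ollama run context-mistral-nemo:latest runs it with the bigger context. You can also set it interactively inside ollama run with /set parameter num_ctx 15000, but that only sticks if you /save it as a new model.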
https://llm.extractum.io/list/ <== for model spec info.
With Ollama you can also limit the number of layers offloaded to the GPUs so they can handle bigger models... When you do this, Ollama runs some of the layers on the CPU, resulting in slower tokens per second, BUT at least you can run a model with billions more parameters at the expense of speed.
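The layer limit is just another parameter in the same kind of Modelfile; a sketch (20 layers on the GPUs is an arbitrary number, tune it to your VRAM):
Code:
# Modelfile - offload only 20 layers to the GPUs, the rest runs on the CPU
FROM qwen2.5:32b
PARAMETER num_gpu 20
PARAMETER num_ctx 15000
Build it with ollama create as above; fewer GPU layers means slower tokens per second, but the model actually loads.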
stay curious....