Local LLMs

I have tried both Llamafile and Ollama...

Ollama has better graphics card support; in my case Llamafile was not using my GPUs at all, so it was super slow in tokens per second....

With Ollama, three of my GPUs are used, and you can monitor which one is active with
Code:
nvidia-smi --loop=1
by watching the RAM usage on each card (I didn't look into the details, but apparently the Hugging Face Transformers stack built on PyTorch balances the pipeline workload between the cards)... My next setup step is wiring up Open WebUI so I can have a local GUI.... Currently I use Ollama directly from the CLI....
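
If you only want the memory numbers per card, nvidia-smi's query mode gives a more compact view (a minimal sketch using standard nvidia-smi options):
Code:
# Print index, name and VRAM usage of every GPU once per second.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 1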

With Open WebUI you can feed RAG documents/libraries directly through the GUI.

There is a bug installing it on FreeBSD, but there is always a Linux VM under bhyve for such obstacles.
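
For reference, the Open WebUI side inside the Linux VM is roughly this (a sketch assuming the pip install route; the jail address 192.168.1.50 and port are placeholders for your own network):
Code:
# Inside the bhyve Linux VM (Open WebUI's pip package expects Python 3.11).
pip install open-webui

# Point the GUI at the Ollama instance running in the FreeBSD jail.
# Replace 192.168.1.50 with your jail's address.
OLLAMA_BASE_URL=http://192.168.1.50:11434 open-webui serve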


tingo, that Post error means your cards or CPU are not enough to run the model. Run a smaller model: if the model is 4 GB you need at MINIMUM one card with that much RAM, otherwise you are going to face the Post error. How do I know? Because when I try running Mistral-Nemo 12B or Qwen2.5 32B the same thing happens to me.... But with smaller models like mistral:latest I never face this issue, and the local LLM can maintain context for longer, 24+ hours (depending on the number of prompts)....

Code:
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest
Error: Post "http://127.0.0.1:11434/api/chat": EOF
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest

The commands above I run in a FreeBSD jail, and I pass the GPUs to it through devfs.rules.
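
A quick sanity check that the passthrough worked (assuming the jail is called Secure_Ollama like in the script further down, and that the NVIDIA userland is installed inside it):
Code:
# Run from the host: if the devfs rules are right, the jail sees the cards.
jexec Secure_Ollama nvidia-smi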



You can also look at the Ollama log to see why the POST error happens. In my case, with over 32 GB of RAM, I face it constantly when running big-parameter models on my hardware (now, if I had an H100 or A100 with 40 or 80 GB of RAM, I doubt the Post error would happen). When the EOF error hits you lose the context and need to start the model again.


Code:
time=2024-12-11T23:14:28.040Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.543873626 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.166Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.669209817 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.446Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.949622423 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
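
A quick way to see whether a loaded model actually fits on the cards is the ps subcommand (assuming your Ollama build is recent enough to have it):
Code:
# Inside the jail: shows each loaded model's size and how it is
# split between GPU and CPU.
ollama ps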

I also raised the context window to 15000 tokens; the default is 2048 tokens, and some models allow you to go up to 100,000 tokens of context...
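
This is roughly how I build such a larger-context variant (a sketch using the num_ctx Modelfile parameter; the file name and model tags are just examples):
Code:
# Modelfile that derives a larger-context variant from the base model.
cat > Modelfile.context <<'EOF'
FROM mistral-nemo:latest
PARAMETER num_ctx 15000
EOF

ollama create context-mistral-nemo -f Modelfile.context
ollama run context-mistral-nemo:latest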

https://llm.extractum.io/list/ <== for model spec info.

With Ollama you can also limit the number of layers offloaded to the GPUs so they can run bigger models... When you do this, Ollama runs some of the layers on the CPU, which lowers the tokens per second, BUT at least you can run a bigger multi-billion-parameter model at the expense of speed.
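
The knob for this is the num_gpu parameter (the number of layers offloaded to the GPUs); a minimal sketch, with the layer count and model tags as placeholders you would tune for your own cards:
Code:
# Offload only 20 layers to the GPUs and run the rest on the CPU.
cat > Modelfile.split <<'EOF'
FROM qwen2.5:32b
PARAMETER num_gpu 20
EOF

ollama create qwen32-split -f Modelfile.split
ollama run qwen32-split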

:beer:

stay curious....
You run ollama in a jail with an NVIDIA GPU and it actually uses the GPU?

Can you expand on, or write a how-to for, getting a jail running ollama and using the GPU?

I have a shell script, so here it is.... Honestly it's pretty simple, no magic.

1. Create a jail (I am not going to cover how; there are too many options).
2. Pass the devfs rules to get NVIDIA working (I have 2 GPUs on this one).
Code:
add path 'nvidia*' unhide  # Expose NVIDIA GPU devices if applicable
perm  nvidia0      unprivileged 0666
perm  nvidia1      unprivileged 0666
perm  nvidiactl    unprivileged 0666
perm  nvidia-modeset unprivileged 0666
3. Shell scripting. You need to edit JAIL_PATH to match your jail's data path on your gateway/system. The script below only installs the packages; see the commands right after it for starting Ollama and pulling a model.
Code:
#!/bin/sh

# Variables
JAIL_NAME="Secure_Ollama"
OLLAMA_PKG="ollama"
JAIL_PATH="/usr/jails/jails-data/${JAIL_NAME}-data"


# Step 0: Set up resolv.conf in the jail for DNS resolution
echo "Setting up resolv.conf in the jail for DNS resolution..."
echo "nameserver 8.8.8.8" > "${JAIL_PATH}/etc/resolv.conf"
echo "nameserver 9.9.9.9" >> "${JAIL_PATH}/etc/resolv.conf"


# Step 1: Install Ollama and required packages inside the jail
echo "Installing Ollama and necessary packages in the jail..."
jexec ${JAIL_NAME} /bin/sh -c "pkg update && pkg install -y ${OLLAMA_PKG} git python3 py311-pip"
if [ $? -ne 0 ]; then
    echo "Error: Failed to install Ollama package in the jail."
    exit 1
fi
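
After the install the rest is just starting the server and pulling a model (a sketch; the model tag is an example and you may prefer an rc service over a backgrounded ollama serve):
Code:
# Start the Ollama server inside the jail (backgrounded here for simplicity).
jexec ${JAIL_NAME} /bin/sh -c "ollama serve > /var/log/ollama.log 2>&1 &"

# Pull and run a model.
jexec ${JAIL_NAME} ollama pull mistral:latest
jexec ${JAIL_NAME} ollama run mistral:latest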
 
I'm going to try it... but ollama needs CUDA, so how do you get that sorted? Or am I wrong and no CUDA is needed for ollama?
 

You don't need to install the drivers yourself; the Hugging Face transformer stack that ollama uses is built on top of PyTorch and takes care of this... (AGAIN, this is what I read, I didn't confirm it or care, it just works!)...

Answer from my Local LLM:

Code:
 Ollama is a Python interface for running large language models locally. It doesn't require CUDA
installation as it supports both GPU and CPU usage out of the box. Here's a simplified explanation
of how it works:

1. **Model Download**: First, Ollama downloads the selected model (like Llama 2, Alpaca, etc.)
from Hugging Face's model hub or another supported source.

2. **Quantization**: Ollama uses quantization to reduce the model size and improve inference
speed. It converts the model's weights from float32 to a lower-precision format like float16 or
int8. This step is done on the fly, so you don't need to create quantized versions of models
beforehand.

3. **Model Loading**: Once downloaded and quantized, the model is loaded into memory. Ollama
supports loading models in different formats, such as Hugging Face's `transformers` library format
or ONNX format.

4. **Inference**: With the model now loaded, you can generate text using Ollama's API. It uses a
technique called "in-context learning" where it feeds the model with your input along with some
context (like a few previous tokens) to guide its response generation.

5. **Hardware Acceleration**: Ollama automatically uses GPU if available for faster inference, but
it also works fine on CPUs if no GPU is present. It utilizes libraries like ONNX Runtime and
PyTorch for hardware acceleration.

Here's a simple usage example:

```python
from Ollama import Ollama

ollama = Ollama()
response = ollama.generate("Hello! How are you?", model="llama2:7b")
print(response)
```
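
The part of that answer you can verify directly is the local HTTP API that the ollama CLI talks to (the same 127.0.0.1:11434 endpoint that shows up in the EOF error earlier). A minimal sketch, with the model tag as an example:
Code:
# Generate a completion over Ollama's local HTTP API.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "mistral:latest",
  "prompt": "Hello! How are you?",
  "stream": false
}'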
 
What t/s do you get with your GPU, and which GPU are you running? It looks like under FreeBSD the GPUs are running with Vulkan. mistral (4 GB) gets me 12 t/s, but deepseek-coder:33b (18 GB) gets less than 4 even though the model fits fully in my GPU, and for some reason a bit of swap is used, around 56 MB.
Under Linux I get 17 t/s, so that's a huge drop-off.
 
The mistral (4 GB) mentioned above gets 20-35+ t/s on my setup with some old GPUs... To get the token rate up you need to play a little with how the layers are divided: some of them run on the CPU and the majority on the GPUs.... You need to play with the GPU-layer limit (the num_gpu parameter mentioned earlier).
 
OK, I need to play around and maybe ditch the terminal and use other ways (oatmeal does not work at all for me).
 