Local LLMs

I have tried both Llamafile and Ollama...

Ollama has better graphics card support; in my case Llamafile was not using my GPUs at all, so it was super slow in tokens per second....

With Ollama, three of my GPUs are used, and you can monitor which one is active with
Code:
nvidia-smi --loop=1
by watching the RAM usage on each card (I didn't look into the details, but apparently the Hugging Face Transformers stack built on PyTorch balances the pipeline workload between the cards)... My next setup step is wiring up Open WebUI so I can have a local GUI.... Currently I use Ollama directly from the CLI....
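
If you only want the memory numbers per card, nvidia-smi's query mode gives a more compact view (a minimal sketch using standard nvidia-smi options):
Code:
# Print index, name and VRAM usage of every GPU once per second.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 1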

With Open WebUI you can feed RAG documents/libraries directly through the GUI.

There is a bug installing it on FreeBSD, but there is always a Linux VM under bhyve for such obstacles.
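
For reference, the Open WebUI side inside the Linux VM is roughly this (a sketch assuming the pip install route; the jail address 192.168.1.50 and port are placeholders for your own network):
Code:
# Inside the bhyve Linux VM (Open WebUI's pip package expects Python 3.11).
pip install open-webui

# Point the GUI at the Ollama instance running in the FreeBSD jail.
# Replace 192.168.1.50 with your jail's address.
OLLAMA_BASE_URL=http://192.168.1.50:11434 open-webui serve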


tingo, that Post error means your cards or CPU are not enough to run the model. Run a smaller model: if the model is 4 GB you need at MINIMUM one card with that much RAM, otherwise you are going to face the Post error. How do I know? Because when I try running Mistral-Nemo 12B or Qwen2.5 32B the same thing happens to me.... But with smaller models like mistral:latest I never face this issue, and the local LLM can maintain context for longer, 24+ hours (depending on the number of prompts)....

Code:
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest
Error: Post "http://127.0.0.1:11434/api/chat": EOF
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest

The commands above I run in a FreeBSD jail, and I pass the GPUs to it through devfs.rules.
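
A quick sanity check that the passthrough worked (assuming the jail is called Secure_Ollama like in the script further down, and that the NVIDIA userland is installed inside it):
Code:
# Run from the host: if the devfs rules are right, the jail sees the cards.
jexec Secure_Ollama nvidia-smi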



You can also look at the Ollama log to see why the POST error happens. In my case, with over 32 GB of RAM, I face it constantly when running big-parameter models on my hardware (now, if I had an H100 or A100 with 40 or 80 GB of RAM, I doubt the Post error would happen). When the EOF error hits you lose the context and need to start the model again.


Code:
time=2024-12-11T23:14:28.040Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.543873626 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.166Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.669209817 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.446Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.949622423 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
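
A quick way to see whether a loaded model actually fits on the cards is the ps subcommand (assuming your Ollama build is recent enough to have it):
Code:
# Inside the jail: shows each loaded model's size and how it is
# split between GPU and CPU.
ollama ps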

I also raised the context window to 15000 tokens; the default is 2048 tokens, and some models allow you to go up to 100,000 tokens of context...
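
This is roughly how I build such a larger-context variant (a sketch using the num_ctx Modelfile parameter; the file name and model tags are just examples):
Code:
# Modelfile that derives a larger-context variant from the base model.
cat > Modelfile.context <<'EOF'
FROM mistral-nemo:latest
PARAMETER num_ctx 15000
EOF

ollama create context-mistral-nemo -f Modelfile.context
ollama run context-mistral-nemo:latest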

https://llm.extractum.io/list/ <== for model spec info.

With Ollama you can also limit the number of layers offloaded to the GPUs so they can run bigger models... When you do this, Ollama runs some of the layers on the CPU, which lowers the tokens per second, BUT at least you can run a bigger multi-billion-parameter model at the expense of speed.
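
The knob for this is the num_gpu parameter (the number of layers offloaded to the GPUs); a minimal sketch, with the layer count and model tags as placeholders you would tune for your own cards:
Code:
# Offload only 20 layers to the GPUs and run the rest on the CPU.
cat > Modelfile.split <<'EOF'
FROM qwen2.5:32b
PARAMETER num_gpu 20
EOF

ollama create qwen32-split -f Modelfile.split
ollama run qwen32-split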

:beer:

stay curious....
You run ollama in a jail with an NVIDIA GPU and it actually uses the GPU?

Can you expand on, or write a how-to for, getting a jail running ollama and using the GPU?

I have a shell script, so here it is.... Honestly it's pretty simple, no magic.

1. Create a jail (I am not going to cover how; there are too many options).
2. Pass the devfs rules to get NVIDIA working (I have 2 GPUs on this one).
Code:
add path 'nvidia*' unhide  # Expose NVIDIA GPU devices if applicable
perm  nvidia0      unprivileged 0666
perm  nvidia1      unprivileged 0666
perm  nvidiactl    unprivileged 0666
perm  nvidia-modeset unprivileged 0666
3. Shell scripting. You need to edit JAIL_PATH to match your jail's data path on your gateway/system. The script below only installs the packages; see the commands right after it for starting Ollama and pulling a model.
Code:
#!/bin/sh

# Variables
JAIL_NAME="Secure_Ollama"
OLLAMA_PKG="ollama"
JAIL_PATH="/usr/jails/jails-data/${JAIL_NAME}-data"


# Step 0: Set up resolv.conf in the jail for DNS resolution
echo "Setting up resolv.conf in the jail for DNS resolution..."
echo "nameserver 8.8.8.8" > "${JAIL_PATH}/etc/resolv.conf"
echo "nameserver 9.9.9.9" >> "${JAIL_PATH}/etc/resolv.conf"


# Step 1: Install Ollama and required packages inside the jail
echo "Installing Ollama and necessary packages in the jail..."
jexec ${JAIL_NAME} /bin/sh -c "pkg update && pkg install -y ${OLLAMA_PKG} git python3 py311-pip"
if [ $? -ne 0 ]; then
    echo "Error: Failed to install Ollama package in the jail."
    exit 1
fi
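
After the install the rest is just starting the server and pulling a model (a sketch; the model tag is an example and you may prefer an rc service over a backgrounded ollama serve):
Code:
# Start the Ollama server inside the jail (backgrounded here for simplicity).
jexec ${JAIL_NAME} /bin/sh -c "ollama serve > /var/log/ollama.log 2>&1 &"

# Pull and run a model.
jexec ${JAIL_NAME} ollama pull mistral:latest
jexec ${JAIL_NAME} ollama run mistral:latest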
 
I'm going to try it... but ollama needs CUDA, so how do you get that sorted? Or am I wrong and no CUDA is needed for ollama?
 

You don't need to install the drivers yourself; the Hugging Face transformer stack that ollama uses is built on top of PyTorch and takes care of this... (AGAIN, this is what I read, I didn't confirm it or care, it just works!)...

Answer from my Local LLM:

Code:
 Ollama is a Python interface for running large language models locally. It doesn't require CUDA
installation as it supports both GPU and CPU usage out of the box. Here's a simplified explanation
of how it works:

1. **Model Download**: First, Ollama downloads the selected model (like Llama 2, Alpaca, etc.)
from Hugging Face's model hub or another supported source.

2. **Quantization**: Ollama uses quantization to reduce the model size and improve inference
speed. It converts the model's weights from float32 to a lower-precision format like float16 or
int8. This step is done on the fly, so you don't need to create quantized versions of models
beforehand.

3. **Model Loading**: Once downloaded and quantized, the model is loaded into memory. Ollama
supports loading models in different formats, such as Hugging Face's `transformers` library format
or ONNX format.

4. **Inference**: With the model now loaded, you can generate text using Ollama's API. It uses a
technique called "in-context learning" where it feeds the model with your input along with some
context (like a few previous tokens) to guide its response generation.

5. **Hardware Acceleration**: Ollama automatically uses GPU if available for faster inference, but
it also works fine on CPUs if no GPU is present. It utilizes libraries like ONNX Runtime and
PyTorch for hardware acceleration.

Here's a simple usage example:

```python
from Ollama import Ollama

ollama = Ollama()
response = ollama.generate("Hello! How are you?", model="llama2:7b")
print(response)
```
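
The part of that answer you can verify directly is the local HTTP API that the ollama CLI talks to (the same 127.0.0.1:11434 endpoint that shows up in the EOF error earlier). A minimal sketch, with the model tag as an example:
Code:
# Generate a completion over Ollama's local HTTP API.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "mistral:latest",
  "prompt": "Hello! How are you?",
  "stream": false
}'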
 
What t/s do you get with your GPU, and which GPU are you running? It looks like under FreeBSD the GPUs are running with Vulkan. mistral (4 GB) gets me 12 t/s, but deepseek-coder:33b (18 GB) gets less than 4 even though the model fits fully in my GPU, and for some reason a bit of swap is used, around 56 MB.
Under Linux I get 17 t/s, so that's a huge drop-off.
 
The mistral (4 GB) mentioned above gets 20-35+ t/s on my setup with some old GPUs... To get the token rate up you need to play a little with how the layers are divided: some of them run on the CPU and the majority on the GPUs.... You need to play with the GPU-layer limit (the num_gpu parameter mentioned earlier).
 
OK, I need to play around and maybe ditch the terminal and use other ways (oatmeal does not work at all for me).
 