Local LLMs

Hi mates!

Has anyone tried to deploy an LLM (Large Language Model) server on a FreeBSD box? Please share your experience and any how-tos.
 
I just came here to ask the same question and found your post. I have a local Manjaro desktop machine running Open WebUI with several LLMs installed, and it works really well. But I'd love to have it running on one of my webservers, which are all FreeBSD.

I could easily set up Open WebUI with one of my domains and connect from anywhere over HTTPS.

I wonder what kind of hardware resources I'd need, though. My desktop has an NVIDIA RTX 3070, but I don't put GPUs in my webservers.
 
This is actually not that hard. You could use llama.cpp:

Running 14.0-RELEASE-p6 on a Raspberry Pi 4:

- install gmake
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp; gmake # use -j <number of cores> to speed up the build
- get a model from Hugging Face whose RAM requirements match your machine; I used phi-2.Q4_K_M
- place the model file into the models/ subdirectory of llama.cpp

Use a shell script like this (saved as run-phi2.sh in the llama.cpp directory and made executable) to launch it:

Bash:
#!/usr/local/bin/bash
# Wrap the command-line arguments in phi-2's instruct prompt format
PROMPT="Instruct: $@\nOutput:\n"
# -n -1: generate until end of text; -e: interpret the \n escapes in the prompt
./main -m models/phi-2.Q4_K_M.gguf --color --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$PROMPT" -e

Example:

Code:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
doas pkg install gmake
gmake -j4
mv ~/phi-2.Q4_K_M.gguf models/
./run-phi2.sh "Tell me something about FreeBSD"

It's not very fast here but works:
Code:
... initialization output omitted...

Instruct: Tell me something about FreeBSD
Output:
- FreeBSD is an open source, distributed operating system for Unix-like devices.
- It was created in 1995 and is known for its stability, security, and scalability.
- It is used in a variety of settings, from small enterprises to large organizations.
- It has a number of different distributions, each tailored for different tasks and needs.
- It allows for the customization of the operating system, allowing users to modify and improve it.
- It features a strong password policy and advanced security measures.
<|endoftext|> [end of text]


llama_print_timings:        load time =    1187.23 ms
llama_print_timings:      sample time =     121.36 ms /   108 runs   (    1.12 ms per token,   889.94 tokens per second)
llama_print_timings: prompt eval time =    3147.98 ms /    11 tokens (  286.18 ms per token,     3.49 tokens per second)
llama_print_timings:        eval time =   54504.98 ms /   107 runs   (  509.39 ms per token,     1.96 tokens per second)
llama_print_timings:       total time =   57837.63 ms /   118 tokens
Log end

 
Beautiful, and there actually already is a port/package for that, so no need to compile it yourself:
misc/llama-cpp
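
Something like this should do it (assuming the package name matches the port origin, which I haven't verified):

Code:
doas pkg install llama-cpp
pkg info -l llama-cpp   # list the installed files to see which binaries it ships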

Thanks a lot, I didn't know about llama-cpp. Will try it myself as soon as possible.
 
I was not aware that there is a port/package, but llama.cpp gets updated so frequently (sometimes multiple times per day) that it can still make sense to pull from the repo.

It also includes a server that exposes the LLM over a REST API, plus other tooling. Be sure to check out the README in the git repo.
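
A rough sketch of what that can look like (flags and endpoints change often, so double-check against the current README; the model path is just the phi-2 file from above):

Code:
# start the built-in HTTP server from the llama.cpp build directory
./server -m models/phi-2.Q4_K_M.gguf --host 127.0.0.1 --port 8080

# then, from another shell, request a completion over the REST API
curl -s http://127.0.0.1:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Instruct: Tell me something about FreeBSD\nOutput:\n", "n_predict": 128}'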
 
Now there is misc/ollama too. After watching a video from the latest Valuable News post, I got a bit interested, and now I have some questions:
  • Will AMD graphics cards work on FreeBSD (I see that ollama recently got support for AMD)?
  • How much memory should the graphics card have? Will 12 GB be enough, or should I put in more money and get one with 16 GB? (As I understand it, the amount of memory limits how much of the LLM can be held in memory at once; I've sketched my rough estimate after this list.)
  • Is the CPU in the machine important too (if I run the LLM on the graphics card), or can I run this on an old machine with, say, a cheap AMD CPU and 16 or 32 GB of RAM?
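
For what it's worth, my rough back-of-envelope for the memory question (assuming roughly 0.6 bytes per weight at Q4_K_M quantization plus a couple of GB for the KV cache; corrections welcome):

Code:
# very rough VRAM estimate in GB for common model sizes at Q4_K_M
for b in 7 13 33; do
    echo "${b}B model: roughly $(echo "$b * 0.6 + 2" | bc) GB"
done
# 7B and 13B should fit in 12 GB; 30B-class models need more than 16 GB or partial CPU offload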
 