This is actually not that hard. You could use llama.cpp.
Running FreeBSD 14.0-RELEASE-p6 on a Raspberry Pi 4:
- install gmake (pkg install gmake)
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp; gmake # use -j <n_cores> for a parallel build
- get a model from Hugging Face whose RAM requirements match your machine; I used phi-2.Q4_K_M (one way to fetch it is sketched below)
- place the model file into the models/ subdir of llama.cpp
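If you don't have the model locally yet, fetch(1) can pull it straight into models/. A minimal sketch, assuming the quantized phi-2 GGUF is still published under TheBloke/phi-2-GGUF on Hugging Face (check the repo for the current file name before running this):
Code:
# Assumed URL -- verify on huggingface.co first; the Q4_K_M file is roughly 1.8 GB.
fetch -o models/phi-2.Q4_K_M.gguf \
    https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf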
Use this shell script to launch it (the example below assumes it's saved as run-phi2.sh and made executable):
Bash:
#!/usr/local/bin/bash
# Wrap the command-line arguments in phi-2's instruct prompt format.
PROMPT="Instruct: $*\nOutput:\n"
# -n -1: generate until the model emits end-of-text; -e: interpret the \n escapes in the prompt.
./main -m models/phi-2.Q4_K_M.gguf --color --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$PROMPT" -e
Example:
Code:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
doas pkg install gmake
gmake -j4
mv ~/phi-2.Q4_K_M.gguf models/
chmod +x run-phi2.sh
./run-phi2.sh "Tell me something about FreeBSD"
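If you'd rather keep a session open than pass one prompt per run, main also has an interactive mode. A minimal sketch, assuming the -i and -r flags of the main binary from this era of llama.cpp (check ./main --help on your build):
Code:
# -i keeps the session open after the first reply; -r "Instruct:" stops
# generation at the reverse prompt so you can type the next instruction.
./main -m models/phi-2.Q4_K_M.gguf --color --temp 0.7 -e -i -r "Instruct:" \
    -p "Instruct: Tell me something about FreeBSD\nOutput:\n"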
It's not very fast here, but it works:
Code:
... initialization output omitted...
Instruct: Tell me something about FreeBSD
Output:
- FreeBSD is an open source, distributed operating system for Unix-like devices.
- It was created in 1995 and is known for its stability, security, and scalability.
- It is used in a variety of settings, from small enterprises to large organizations.
- It has a number of different distributions, each tailored for different tasks and needs.
- It allows for the customization of the operating system, allowing users to modify and improve it.
- It features a strong password policy and advanced security measures.
<|endoftext|> [end of text]
llama_print_timings: load time = 1187.23 ms
llama_print_timings: sample time = 121.36 ms / 108 runs ( 1.12 ms per token, 889.94 tokens per second)
llama_print_timings: prompt eval time = 3147.98 ms / 11 tokens ( 286.18 ms per token, 3.49 tokens per second)
llama_print_timings: eval time = 54504.98 ms / 107 runs ( 509.39 ms per token, 1.96 tokens per second)
llama_print_timings: total time = 57837.63 ms / 118 tokens
Log end
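That works out to roughly 2 tokens/s for generation. One knob worth experimenting with (an assumption on my part, not something benchmarked here): pin the thread count to the pi4's four cores with -t, since llama.cpp's default guess isn't always optimal:
Code:
# -t 4: one thread per pi4 core; compare the timings against the default.
./main -m models/phi-2.Q4_K_M.gguf -t 4 --temp 0.7 --repeat_penalty 1.1 -n -1 -e \
    -p "Instruct: Tell me something about FreeBSD\nOutput:\n"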