Powered by the open-source project distributed-llama (https://github.com/b4rtaz/distributed-llama?tab=readme-ov-file), we can run Llama 3.1 8B on a cluster of four Raspberry Pis at about 1.3–2 tokens per second. This is still far from daily usability, but building a more powerful cluster to run LLMs more efficiently is quite feasible. Even though Pis are relatively weak, smaller models such as Google's Gemma2:2b would be a great choice for small clusters handling specific tasks such as translation.
Usage
- Install the project following the guide; you need to `make` the binaries.
- Download the model files using the `launch.py` script prepared by the author. I chose Llama 3.1 8B here (a hedged build-and-download sketch follows the commands in this section).
- Run this command on the worker nodes so they listen for the master node on port 9998:
```sh
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
```
- Run the following command on the master node:
```sh
#!/bin/sh
# chat mode
./dllama chat \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024
```
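For completeness, this is roughly what the build and download steps look like; the `make` targets and the `launch.py` model name are assumptions based on the repository's conventions and should be checked against the current README rather than copied blindly.

```sh
# Hedged sketch: clone and build distributed-llama on every node.
# The make targets (dllama, dllama-api) are assumptions; confirm them in the README.
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama        # worker / inference / chat binary
make dllama-api    # HTTP API binary (only needed on the master node)

# Hedged sketch: download and convert the model on the master node.
# The model identifier is assumed to match the folder names used in the commands below.
python3 launch.py llama3_1_8b_instruct_q40
```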
Benchmark
To measure the performance gain the project gives us, I tested the Llama 3.1 8B model on a cluster of Raspberry Pis: three Pi 4 boards with 8 GB of RAM and one Pi 4 with 4 GB of RAM.
```sh
# Benchmark using inference mode

# 1 node (master only, no workers)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 --max-seq-len 1024 \
  > benchmark/benchmark_1worker.md

# 2 nodes (master + 1 worker)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 \
  --workers 192.168.68.102:9998 \
  --max-seq-len 1024 \
  > benchmark/benchmark_2workers.md

# 4 nodes (master + 3 workers)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024 \
  > benchmark/benchmark_4workers.md
```
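To compare the runs without opening each file, a small helper like the one below can pull the averaged lines out of the saved outputs; it assumes the files contain the "Generated tokens" and "Avg ..." lines shown in the next section.

```sh
# Hedged helper: print the summary lines from each saved benchmark output.
# Assumes the output format shown below ("Generated tokens", "Avg tokens / second", ...).
for f in benchmark/benchmark_*.md; do
  echo "== $f =="
  grep -E "Generated tokens|Avg " "$f"
done
```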
Performance on 1, 2, 4 nodes
```
# 1 node
Generated tokens:    128
Avg tokens / second: 0.42
Avg generation time: 2381.94 ms
Avg inference time:  2379.15 ms
Avg transfer time:   0.35 ms

# 2 nodes
Generated tokens:    128
Avg tokens / second: 0.99
Avg generation time: 1007.52 ms
Avg inference time:  944.58 ms
Avg transfer time:   60.49 ms

# 4 nodes
Generated tokens:    128
Avg tokens / second: 1.38
Avg generation time: 727.14 ms
Avg inference time:  716.78 ms
Avg transfer time:   7.80 ms
```

Going from one node to four raises throughput from 0.42 to 1.38 tokens per second, a bit more than a 3x speedup, while the average transfer time stays small relative to the inference time.
Hosting distributed-llama as a service
```sh
# API
./dllama-api \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024
```
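To keep the API alive after an SSH session ends, one minimal option is a wrapper script started in the background; a systemd unit would be cleaner, but the sketch below stays in plain shell. The install path is an assumption about my setup, not something the project prescribes.

```sh
#!/bin/sh
# Hedged sketch: start dllama-api in the background on the master node.
# /home/pi/distributed-llama is an assumed install location; adjust paths and worker IPs.
cd /home/pi/distributed-llama || exit 1
nohup ./dllama-api \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024 \
  > dllama-api.log 2>&1 &
echo "dllama-api started with PID $!"
```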
The client side:
```sh
curl -X POST "http://192.168.68.101:9990/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are an excellent math teacher." },
      { "role": "user", "content": "What is 1 + 2?" }
    ],
    "temperature": 0.7,
    "stop": ["<|eot_id|>"],
    "max_tokens": 128
  }'
```
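Since the endpoint mimics the OpenAI chat-completions shape, the reply text can be extracted directly with `jq`, assuming the response carries it under `choices[0].message.content`:

```sh
# Hedged: assumes an OpenAI-style response body with choices[0].message.content.
curl -s -X POST "http://192.168.68.101:9990/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1 + 2?" }
    ],
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'
```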
Screenshots
- Running the commands
- Transferring model data
- Transferring model data
- Exposing port 9990 to run API
- Sending a curl command to the API endpoint
- Processing the request
- Returned result