Powered by the open-source project distributed-llama (https://github.com/b4rtaz/distributed-llama?tab=readme-ov-file), we can run Llama 3.1 8B on a cluster of four Raspberry Pis at about 1.3–2 tokens per second. This is still far from daily usability, but building a more powerful cluster to run LLMs more efficiently is quite feasible. Even though Pis are relatively weak, smaller models such as Google's Gemma2:2b would be a great choice for small clusters handling specific tasks such as translation.
Usage
- Install the project following the guide; you need to `make` the binaries.
- Download the model files using the `launch.py` script prepared by the author. I chose Llama 3.1 8B here (a hedged build-and-download sketch follows the commands in this section).
- Run this command on the worker nodes so they listen for the master node on port 9998:
```sh
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
```
- Run the following command on the master node:
```sh
#!/bin/sh
# chat mode
./dllama chat \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024
```
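For completeness, this is roughly what the build and download steps look like; the `make` targets and the `launch.py` model name are assumptions based on the repository's conventions and should be checked against the current README rather than copied blindly.

```sh
# Hedged sketch: clone and build distributed-llama on every node.
# The make targets (dllama, dllama-api) are assumptions; confirm them in the README.
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama        # worker / inference / chat binary
make dllama-api    # HTTP API binary (only needed on the master node)

# Hedged sketch: download and convert the model on the master node.
# The model identifier is assumed to match the folder names used in the commands below.
python3 launch.py llama3_1_8b_instruct_q40
```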
Benchmark
To measure the performance gain the project gives us, I tested the Llama 3.1 8B model on a cluster of Raspberry Pis: three Pi 4 boards with 8 GB of RAM and one Pi 4 with 4 GB of RAM.
```sh
# Benchmark using inference mode

# 1 node (master only, no workers)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 --max-seq-len 1024 \
  > benchmark/benchmark_1worker.md

# 2 nodes (master + 1 worker)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 \
  --workers 192.168.68.102:9998 \
  --max-seq-len 1024 \
  > benchmark/benchmark_2workers.md

# 4 nodes (master + 3 workers)
./dllama inference --steps 128 --prompt "Why is the sky blue?" \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024 \
  > benchmark/benchmark_4workers.md
```
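To compare the runs without opening each file, a small helper like the one below can pull the averaged lines out of the saved outputs; it assumes the files contain the "Generated tokens" and "Avg ..." lines shown in the next section.

```sh
# Hedged helper: print the summary lines from each saved benchmark output.
# Assumes the output format shown below ("Generated tokens", "Avg tokens / second", ...).
for f in benchmark/benchmark_*.md; do
  echo "== $f =="
  grep -E "Generated tokens|Avg " "$f"
done
```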
Performance on 1, 2, 4 nodes
```
# 1 node
Generated tokens:    128
Avg tokens / second: 0.42
Avg generation time: 2381.94 ms
Avg inference time:  2379.15 ms
Avg transfer time:   0.35 ms

# 2 nodes
Generated tokens:    128
Avg tokens / second: 0.99
Avg generation time: 1007.52 ms
Avg inference time:  944.58 ms
Avg transfer time:   60.49 ms

# 4 nodes
Generated tokens:    128
Avg tokens / second: 1.38
Avg generation time: 727.14 ms
Avg inference time:  716.78 ms
Avg transfer time:   7.80 ms
```

Going from one node to four raises throughput from 0.42 to 1.38 tokens per second, a bit more than a 3x speedup, while the average transfer time stays small relative to the inference time.
Hosting distributed-llama as a service
```sh
# API
./dllama-api \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024
```
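To keep the API alive after an SSH session ends, one minimal option is a wrapper script started in the background; a systemd unit would be cleaner, but the sketch below stays in plain shell. The install path is an assumption about my setup, not something the project prescribes.

```sh
#!/bin/sh
# Hedged sketch: start dllama-api in the background on the master node.
# /home/pi/distributed-llama is an assumed install location; adjust paths and worker IPs.
cd /home/pi/distributed-llama || exit 1
nohup ./dllama-api \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 192.168.68.102:9998 192.168.68.103:9998 192.168.68.104:9998 \
  --max-seq-len 1024 \
  > dllama-api.log 2>&1 &
echo "dllama-api started with PID $!"
```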
The client side:
```sh
curl -X POST "http://192.168.68.101:9990/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are an excellent math teacher." },
      { "role": "user", "content": "What is 1 + 2?" }
    ],
    "temperature": 0.7,
    "stop": ["<|eot_id|>"],
    "max_tokens": 128
  }'
```
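Since the endpoint mimics the OpenAI chat-completions shape, the reply text can be extracted directly with `jq`, assuming the response carries it under `choices[0].message.content`:

```sh
# Hedged: assumes an OpenAI-style response body with choices[0].message.content.
curl -s -X POST "http://192.168.68.101:9990/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1 + 2?" }
    ],
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'
```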
Screenshots
- Running the commands
- Transferring model data
- Transferring model data
- Exposing port 9990 to run API
- Sending a curl command to the API endpoint
- Processing the request
- Returned result