feat: update readme, add model. (#85)
Showing 4 changed files with 85 additions and 103 deletions.
@@ -11,26 +11,23 @@ Tensor parallelism is all you need. Run LLMs on weak devices or make powerful de
<sub><sup>Distributed Llama running Llama 2 70B on 8 Raspberry Pi 4B devices</sup></sub>
</p>

-**🔥 Download Model & Build by Single Command**
+### 🔥 Start by Single Command

-Python 3 and GCC required.
+Python 3 and a C++ compiler are required.

-**Chat & API**
+| Model                   | Purpose   | Size    | Command                                   |
+| ----------------------- | --------- | ------- | ----------------------------------------- |
+| Llama 3 8B Q40          | Benchmark | 6.32 GB | `python launch.py llama3_8b_q40`          |
+| Llama 3 8B Instruct Q40 | Chat, API | 6.32 GB | `python launch.py llama3_8b_instruct_q40` |

-* Llama 3 8B Instruct: `python launch.py llama3_instruct`
+### 🛠️ Convert Model Manually

-**Convert Model Manually**
+Supported architectures: Llama, Mixtral, Grok

-* [Llama 2](./docs/LLAMA.md#how-to-run-llama-2)
-* [Llama 3](./docs/LLAMA.md#how-to-run-llama-3)
+* [How to Convert Llama 2, Llama 3](./docs/LLAMA.md)

-**Supported modes:**
+### 🚧 Known Limitations

-- Inference CLI
-- Chat CLI
-- [API Server](./src/apps/dllama-api/README.md)

-**Known limitations:**
* You can run Distributed Llama only on 1, 2, 4... 2^n nodes.
* The maximum number of nodes is equal to the number of KV heads in the model [#70](https://github.com/b4rtaz/distributed-llama/issues/70).
* Optimized for (weights format × buffer format):
@@ -45,7 +42,8 @@ Python 3 and GCC required.
* ❌ Q40 × F32
* ✅ Q40 × Q80
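For orientation, the two node-count limitations above combine into a simple rule: the node count must be a power of two and may not exceed the model's KV-head count. A minimal sketch, assuming the per-model KV-head counts given in the comments (they are not taken from this repository):

```python
# Valid node counts = powers of two up to the model's number of KV heads (issue #70).
def valid_node_counts(n_kv_heads: int) -> list[int]:
    counts = []
    n = 1
    while n <= n_kv_heads:
        counts.append(n)
        n *= 2
    return counts

print(valid_node_counts(8))   # assumed Llama 3 8B (GQA, 8 KV heads)  -> [1, 2, 4, 8]
print(valid_node_counts(32))  # assumed Llama 2 7B (MHA, 32 KV heads) -> [1, 2, 4, 8, 16, 32]
```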

-**Architecture**<br />
+### 👷 Architecture

The project is split up into two parts:
* **Root node** - responsible for loading the model and weights and forwarding them to the workers; it also synchronizes the state of the neural network. The root node is itself a worker and processes its own slice of the neural network.
* **Worker node** - processes its own slice of the neural network. It doesn't require any configuration related to the model.
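As a rough illustration of what a "slice" means here (a sketch only — the names are made up and this is not the project's C++ implementation), each node can be pictured as owning a contiguous range of the model's KV heads:

```python
# Hypothetical illustration: split n_kv_heads evenly across n_nodes.
def assign_head_slices(n_kv_heads: int, n_nodes: int) -> list[range]:
    assert n_kv_heads % n_nodes == 0, "the node count must divide the KV-head count"
    per_node = n_kv_heads // n_nodes
    return [range(i * per_node, (i + 1) * per_node) for i in range(n_nodes)]

# 8 KV heads over 4 nodes -> [[0, 1], [2, 3], [4, 5], [6, 7]]
print([list(r) for r in assign_head_slices(8, 4)])
```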
@@ -54,70 +52,62 @@ You always need the root node and you can add 2^n - 1 worker nodes to speed up t

## 📊 Measurements

-### Average Single Token Generation Time
+### Average Token Generation Time

-All tests below utilized Q40 weights and a Q80 buffer. The generation time encompasses the inference time, network transfer time, sampling time, and multi-thread synchronization time. Number of samples: 16.
+I - inference time of the root node, T - network transfer time of the root node.

**Raspberry Pi 5 8GB**

+<sub><sup>Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.3.1 version</sup></sub>
| Model       | 1 x RasPi 5 8 GB | 2 x RasPi 5 8 GB | 4 x RasPi 5 8 GB |
|-------------|------------------|------------------|------------------|
-| Llama 2 7B  | **441.09 ms**, 2.26 t/s<br><sub><sup>(I: 434.84 ms, T: 5.25 ms)</sup></sub> | **341.46 ms**, 2.92 t/s<br><sub><sup>(I: 257.78 ms, T: 83.27 ms)</sup></sub> | **219.08 ms**, 4.56 t/s<br><sub><sup>(I: 163.42 ms, T: 55.25 ms)</sup></sub> |
-| Llama 3 8B  | **564.31 ms**, 1.77 t/s<br><sub><sup>(I: 556.67 ms, T: 6.17 ms)</sup></sub> | **444.27 ms**, 2.25 t/s<br><sub><sup>(I: 362.73 ms, T: 80.11 ms)</sup></sub> | **331.47 ms**, 3.01 t/s<br><sub><sup>(I: 267.62 ms, T: 62.34 ms)</sup></sub> |
-<sub><sup>I - inference time of the root node, T - network transfer time, tested on 0.3.1 version</sup></sub>
+| Llama 2 7B  | **441.09 ms**, 2.26 t/s<br><sub><sup>I: 434.84 ms, T: 5.25 ms</sup></sub> | **341.46 ms**, 2.92 t/s<br><sub><sup>I: 257.78 ms, T: 83.27 ms</sup></sub> | **219.08 ms**, 4.56 t/s 🔥<br><sub><sup>I: 163.42 ms, T: 55.25 ms</sup></sub> |
+| Llama 3 8B  | **564.31 ms**, 1.77 t/s<br><sub><sup>I: 556.67 ms, T: 6.17 ms</sup></sub> | **444.27 ms**, 2.25 t/s<br><sub><sup>I: 362.73 ms, T: 80.11 ms</sup></sub> | **331.47 ms**, 3.01 t/s 🔥<br><sub><sup>I: 267.62 ms, T: 62.34 ms</sup></sub> |
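The t/s column is simply the reciprocal of the average time per token; for example, taking the first cell of the table above:

```python
# Tokens per second from the average single-token generation time (ms).
avg_ms = 441.09                   # Llama 2 7B on 1 x RasPi 5 8 GB
print(round(1000.0 / avg_ms, 2))  # ~2.27, in line with the 2.26 t/s reported above
```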

**Raspberry Pi 4B 8 GB**

+<sub><sup>Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.1.0 version</sup></sub>

<p align="center">
  <img src=".github/8raspi2.jpg" width="35%" alt="8 x Raspberry Pi 4B 8GB" /><br />
  <sub><sup>8 x Raspberry Pi 4B 8GB</sup></sub>
</p>

-All Raspberry Pi units were connected via Gigabit Ethernet to the TP-Link LS1008G Switch.

| Model       | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB |
|-------------|-------------------|-------------------|-------------------|-------------------|
-| Llama 2 7B  | **1312.50 ms**<br><sub><sup>(I: 1307.94 ms, T: 1.81 ms)</sup></sub> | **793.69 ms**<br><sub><sup>(I: 739.00 ms, T: 52.50 ms)</sup></sub> | **494.00 ms** 🔥 <br><sub><sup>(I: 458.81 ms, T: 34.06 ms)</sup></sub> | **588.19 ms**<br><sub><sup>(I: 296.69 ms, T: 289.75 ms)</sup></sub> |
-| Llama 2 13B | <sub><sup>Not enough RAM</sup></sub> | **1497.19 ms**<br><sub><sup>(I: 1465.06 ms, T: 30.88 ms)</sup></sub> | **848.19 ms** 🔥<br><sub><sup>(I: 746.88 ms, T: 99.50 ms)</sup></sub> | **1114.88 ms**<br><sub><sup>(I: 460.8 ms, T: 652.88 ms)</sup></sub> |
-| Llama 2 70B | <sub><sup>Not enough RAM</sup></sub> | <sub><sup>Not enough RAM</sup></sub> | <sub><sup>Not enough RAM</sup></sub> | **4842.81 ms** 🔥<br><sub><sup>(I: 2121.94 ms, T: 2719.62 ms)</sup></sub> |
-<sub><sup>I - inference time of the root node, T - network transfer time, tested on 0.1.0 version</sup></sub>
+| Llama 2 7B  | **1312.50 ms**<br><sub><sup>I: 1307.94 ms, T: 1.81 ms</sup></sub> | **793.69 ms**<br><sub><sup>I: 739.00 ms, T: 52.50 ms</sup></sub> | **494.00 ms** 🔥 <br><sub><sup>I: 458.81 ms, T: 34.06 ms</sup></sub> | **588.19 ms**<br><sub><sup>I: 296.69 ms, T: 289.75 ms</sup></sub> |
+| Llama 2 13B | <sub><sup>Not enough RAM</sup></sub> | **1497.19 ms**<br><sub><sup>I: 1465.06 ms, T: 30.88 ms</sup></sub> | **848.19 ms** 🔥<br><sub><sup>I: 746.88 ms, T: 99.50 ms</sup></sub> | **1114.88 ms**<br><sub><sup>I: 460.8 ms, T: 652.88 ms</sup></sub> |
+| Llama 2 70B | <sub><sup>Not enough RAM</sup></sub> | <sub><sup>Not enough RAM</sup></sub> | <sub><sup>Not enough RAM</sup></sub> | **4842.81 ms** 🔥<br><sub><sup>I: 2121.94 ms, T: 2719.62 ms</sup></sub> |

**x86_64 CPU Cloud Server**

-All tests below were conducted on c3d-highcpu-30 (30 vCPU, 15 core, 59 GB memory) VMs in Google Cloud. [More details](https://github.com/b4rtaz/distributed-llama/discussions/9).
+<sub><sup>Weights = Q40, Buffer = Q80, nSamples = 16, VMs = [c3d-highcpu-30](https://github.com/b4rtaz/distributed-llama/discussions/9), tested on 0.1.0 version</sup></sub>

| Model       | 1 x VM | 2 x VM | 4 x VM |
|-------------|--------|--------|--------|
-| Llama 2 7B  | **101.81 ms**<br><sub><sup>(I: 101.06 ms, T: 0.19 ms)</sup></sub> | **69.69 ms**<br><sub><sup>(I: 61.50 ms, T: 7.62 ms)</sup></sub> | **53.69 ms** 🔥<br><sub><sup>(I: 40.25 ms, T: 12.81 ms)</sup></sub> |
-| Llama 2 13B | **184.19 ms**<br><sub><sup>(I: 182.88 ms, T: 0.69 ms)</sup></sub> | **115.38 ms**<br><sub><sup>(I: 107.12 ms, T: 7.81 ms)</sup></sub> | **86.81 ms** 🔥<br><sub><sup>(I: 66.25 ms, T: 19.94 ms)</sup></sub> |
-| Llama 2 70B | **909.69 ms**<br><sub><sup>(I: 907.25 ms, T: 1.75 ms)</sup></sub> | **501.38 ms**<br><sub><sup>(I: 475.50 ms, T: 25.00 ms)</sup></sub> | **293.06 ms** 🔥<br><sub><sup>(I: 264.00 ms, T: 28.50 ms)</sup></sub> |
+| Llama 2 7B  | **101.81 ms**<br><sub><sup>I: 101.06 ms, T: 0.19 ms</sup></sub> | **69.69 ms**<br><sub><sup>I: 61.50 ms, T: 7.62 ms</sup></sub> | **53.69 ms** 🔥<br><sub><sup>I: 40.25 ms, T: 12.81 ms</sup></sub> |
+| Llama 2 13B | **184.19 ms**<br><sub><sup>I: 182.88 ms, T: 0.69 ms</sup></sub> | **115.38 ms**<br><sub><sup>I: 107.12 ms, T: 7.81 ms</sup></sub> | **86.81 ms** 🔥<br><sub><sup>I: 66.25 ms, T: 19.94 ms</sup></sub> |
+| Llama 2 70B | **909.69 ms**<br><sub><sup>I: 907.25 ms, T: 1.75 ms</sup></sub> | **501.38 ms**<br><sub><sup>I: 475.50 ms, T: 25.00 ms</sup></sub> | **293.06 ms** 🔥<br><sub><sup>I: 264.00 ms, T: 28.50 ms</sup></sub> |

-<sub><sup>I - inference time of the root node, T - network transfer time, tested on 0.1.0 version</sup></sub>
-### Network Transfer for Generating Single Token
+### Network Transfer for Generating Token

**F32 Buffer**

-| Model       | 2 devices | 4 devices | 8 devices |
-|-------------|-----------|-----------|-----------|
-| Llama 3 8B  | **2048 kB**<br><sub><sup>(S: 1024 kB, R: 1024 kB)</sup></sub> | **6144 kB**<br><sub><sup>(S: 3072 kB, R: 3072 kB)</sup></sub> | **14336 kB**<br><sub><sup>(S: 7168 kB, R: 7168 kB)</sup></sub> |
-<sub><sup>S - sent data from the root node to workers, R - received data by the root node from workers, tested on 0.7.1 version</sup></sub>
+| Model       | 2 devices   | 4 devices   | 8 devices    |
+|-------------|-------------|-------------|--------------|
+| Llama 3 8B  | **2048 kB** | **6144 kB** | **14336 kB** |

**Q80 Buffer**

-| Model       | 2 devices | 4 devices | 8 devices |
-|-------------|-----------|-----------|-----------|
-| Llama 3 8B  | **544 kB**<br><sub><sup>(S: 272 kB, R: 272 kB)</sup></sub> | **1632 kB**<br><sub><sup>(S: 816 kB, R: 816 kB)</sup></sub> | **3808 kB**<br><sub><sup>(S: 1904 kB, R: 1904 kB)</sup></sub> |
+| Model       | 2 devices  | 4 devices   | 8 devices   |
+|-------------|------------|-------------|-------------|
+| Llama 3 8B  | **544 kB** | **1632 kB** | **3808 kB** |

<sub><sup>S - sent data from the root node to workers, R - received data by the root node from workers, tested on 0.7.1 version</sup></sub>
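The totals above grow linearly with the number of extra devices: roughly 2 × 1024 kB × (devices − 1) for the F32 buffer, with the Q80 buffer shrinking that by its block-size ratio. A rough cross-check — the 34-bytes-per-32-values Q80 block size is an assumption about the format, not a value taken from this repository:

```python
# Sanity-check the transfer totals above.
# Assumption: a Q80 block stores 32 values in 34 bytes vs. 128 bytes in F32.
for n_devices in (2, 4, 8):
    f32_kb = 2 * 1024 * (n_devices - 1)      # sent + received, per the F32 table
    q80_kb = f32_kb * 34 / 128
    print(n_devices, f32_kb, round(q80_kb))  # -> 2048/544, 6144/1632, 14336/3808
```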

-## 📟 How to Run on Raspberry Pi Devices
+## 📟 Setup Raspberry Pi Devices

1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't have a desktop environment.
-2. Connect all devices to the Gigabit switch.
+2. Connect all devices to your switch or router.
3. Connect to all devices via SSH.
```
ssh [email protected]
@@ -127,27 +117,24 @@ ssh [email protected]
```sh
sudo apt install git
```
-5. Clone this repository:
+5. Clone this repository and compile Distributed Llama on all devices:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
-```
-6. Compile Distributed Llama:
-```sh
make dllama
```
-7. Transfer weights and the tokenizer file to the root device.
-8. Optional: assign static IP addresses.
+6. Transfer weights and the tokenizer file to the root device.
+7. Optional: assign static IP addresses.
```sh
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
```
-9. Run worker nodes on worker devices:
+8. Run worker nodes on worker devices:
```sh
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
```
-10. Run root node on the root device:
+9. Run root node on the root device:
```sh
-sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
+sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
```

To add more worker nodes, just add more addresses to the `--workers` argument.
@@ -156,70 +143,57 @@ To add more worker nodes, just add more addresses to the `--workers` argument.
```
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
```
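Before starting the root node, it can help to confirm that every worker port is reachable. A small convenience check — this script is not part of Distributed Llama, and the addresses are just the ones from the example above:

```python
# Check that each worker's TCP port accepts connections.
import socket

workers = ["10.0.0.2:9998", "10.0.0.3:9998", "10.0.0.4:9998"]
for worker in workers:
    host, port = worker.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=2):
            print(f"{worker} is reachable")
    except OSError as err:
        print(f"{worker} is NOT reachable: {err}")
```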

-[Share your results](https://github.com/b4rtaz/distributed-llama/discussions)!
+## 💻 Setup computers with MacOS, Linux, or Windows

-## 💻 How to Run on MacOS, Linux, or Windows
+You need an x86_64 CPU with AVX2 support or an ARM CPU. Different devices may have different CPUs.

-You need to have x86_64 AVX2 CPU or ARM CPU. Different devices may have different CPUs. The below instructions are for Debian-based distributions but you can easily adapt them to your distribution, macOS, or Windows.
+#### MacOS or Linux

-### MacOS and Linux
+The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution or macOS.

-1. Install Git and G++:
+1. Install Git and GCC:
```sh
sudo apt install git build-essential
```
-2. Clone this repository:
+2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
-```
-3. Compile Distributed Llama:
-```sh
make dllama
```
-4. Transfer weights and the tokenizer file to the root node.
-5. Run worker nodes on worker devices:
-```sh
-sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
-```
-6. Run root node on the root device:
-```sh
-sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
-```
-7. To run the root node in the chat mode:
-```sh
-sudo nice -n -20 ./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
-```

-### Windows
+Continue to point 3 in the *Run Cluster* section below.

-1. Install Git and Mingw (Chocolatey):
-   - https://chocolatey.org/install
+#### Windows

+1. Install Git and Mingw (via [Chocolatey](https://chocolatey.org/install)):
```powershell
choco install mingw
```
-2. Clone this repository:
+2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
-```
-3. Compile Distributed Llama:
-```sh
make dllama
```
-4. Transfer weights and the tokenizer file to the root node.
-5. Run worker nodes on worker devices:
+Continue to point 3 in the *Run Cluster* section below.

+#### Run Cluster

+3. Transfer weights and the tokenizer file to the root computer.
+4. Run worker nodes on worker computers:
```sh
./dllama worker --port 9998 --nthreads 4
```
-6. Run root node on the root device:
-```sh
-./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
-```
-7. To run the root node in the chat mode:
+5. Run root node on the root computer:
```sh
-./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
+./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
```

-[Share your results](https://github.com/b4rtaz/distributed-llama/discussions)!
+To add more worker nodes, just add more addresses to the `--workers` argument.

+```
+./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
+```
## 💡 License