diff --git a/README.md b/README.md
index 4204878..2d4d7c9 100644
--- a/README.md
+++ b/README.md
@@ -11,26 +11,23 @@ Tensor parallelism is all you need. Run LLMs on weak devices or make powerful de
 Distributed Llama running Llama 2 70B on 8 Raspberry Pi 4B devices

-**🔥 Download Model & Build by Single Command**
+### 🔥 Start by Single Command

-Python 3 and GCC required.
+Python 3 and C++ compiler required.

-**Chat & API**
+| Model                   | Purpose   | Size     | Command                                   |
+| ----------------------- | --------- | -------- | ----------------------------------------- |
+| Llama 3 8B Q40          | Benchmark | 6.32 GB  | `python launch.py llama3_8b_q40`          |
+| Llama 3 8B Instruct Q40 | Chat, API | 6.32 GB  | `python launch.py llama3_8b_instruct_q40` |

-* Llama 3 8B Instruct: `python launch.py llama3_instruct`
+### 🛠️ Convert Model Manually

-**Convert Model Manually**
+Supported architectures: Llama, Mixtral, Grok

-* [Llama 2](./docs/LLAMA.md#how-to-run-llama-2)
-* [Llama 3](./docs/LLAMA.md#how-to-run-llama-3)
+* [How to Convert Llama 2, Llama 3](./docs/LLAMA.md)

-**Supported modes:**
+### 🚧 Known Limitations

-- Inference CLI
-- Chat CLI
-- [API Server](./src/apps/dllama-api/README.md)
-
-**Known limitations:**
 * You can run Distributed Llama only on 1, 2, 4... 2^n nodes.
 * The maximum number of nodes is equal to the number of KV heads in the model [#70](https://github.com/b4rtaz/distributed-llama/issues/70).
 * Optimized for (weights format × buffer format):
@@ -45,7 +42,8 @@ Python 3 and GCC required.
   * ❌ Q40 × F32
   * ✅ Q40 × Q80

-**Architecture**
+### 👷 Architecture + The project is split up into two parts: * **Root node** - it's responsible for loading the model and weights and forward them to workers. Also, it synchronizes the state of the neural network. The root node is also a worker, it processes own slice of the neural network. * **Worker node** - it processes own slice of the neural network. It doesn't require any configuration related to the model. @@ -54,70 +52,62 @@ You always need the root node and you can add 2^n - 1 worker nodes to speed up t ## 📊 Measurements -### Average Single Token Generation Time +### Average Token Generation Time -All tests below utilized Q40 weights and a Q80 buffer. The generation time encompasses the inference time, network transfer time, sampling time, and multi-thread synchronization time. Number of samples: 16. +I - inference time of the root node, T - network transfer time of the root node. **Raspberry Pi 5 8GB** +Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.3.1 version + | Model | 1 x RasPi 5 8 GB | 2 x RasPi 5 8 GB | 4 x RasPi 5 8 GB | |-------------|---------------------------------------------------------------------|---------------------------------------------------------------------|---------------------------------------------------------------------| -| Llama 2 7B | **441.09 ms**, 2.26 t/s
(I: 434.84 ms, T: 5.25 ms) | **341.46 ms**, 2.92 t/s
(I: 257.78 ms, T: 83.27 ms) | **219.08 ms**, 4.56 t/s
(I: 163.42 ms, T: 55.25 ms) | -| Llama 3 8B | **564.31 ms**, 1.77 t/s
(I: 556.67 ms, T: 6.17 ms) | **444.27 ms**, 2.25 t/s
(I: 362.73 ms, T: 80.11 ms) | **331.47 ms**, 3.01 t/s
(I: 267.62 ms, T: 62.34 ms) | - -I - inference time of the root node, T - network transfer time, tested on 0.3.1 version +| Llama 2 7B | **441.09 ms**, 2.26 t/s
I: 434.84 ms, T: 5.25 ms | **341.46 ms**, 2.92 t/s
I: 257.78 ms, T: 83.27 ms | **219.08 ms**, 4.56 t/s 🔥
I: 163.42 ms, T: 55.25 ms | +| Llama 3 8B | **564.31 ms**, 1.77 t/s
I: 556.67 ms, T: 6.17 ms | **444.27 ms**, 2.25 t/s
I: 362.73 ms, T: 80.11 ms | **331.47 ms**, 3.01 t/s 🔥
I: 267.62 ms, T: 62.34 ms | **Raspberry Pi 4B 8 GB** +Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.1.0 version +

8 x Raspberry Pi 4B 8GB

-All Raspberry Pi units were connected via Gigabit Ethernet to the TP-Link LS1008G Switch. - | Model | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB | |-------------|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------|----------------------------------------------------------------------| -| Llama 2 7B | **1312.50 ms**
(I: 1307.94 ms, T: 1.81 ms) | **793.69 ms**
(I: 739.00 ms, T: 52.50 ms) | **494.00 ms** 🔥
(I: 458.81 ms, T: 34.06 ms) | **588.19 ms**
(I: 296.69 ms, T: 289.75 ms) | -| Llama 2 13B | Not enough RAM | **1497.19 ms**
(I: 1465.06 ms, T: 30.88 ms) | **848.19 ms** 🔥
(I: 746.88 ms, T: 99.50 ms) | **1114.88 ms**
(I: 460.8 ms, T: 652.88 ms) | -| Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | **4842.81 ms** 🔥
(I: 2121.94 ms, T: 2719.62 ms) | - -I - inference time of the root node, T - network transfer time, tested on 0.1.0 version +| Llama 2 7B | **1312.50 ms**
I: 1307.94 ms, T: 1.81 ms | **793.69 ms**
I: 739.00 ms, T: 52.50 ms | **494.00 ms** 🔥
I: 458.81 ms, T: 34.06 ms | **588.19 ms**
I: 296.69 ms, T: 289.75 ms | +| Llama 2 13B | Not enough RAM | **1497.19 ms**
I: 1465.06 ms, T: 30.88 ms | **848.19 ms** 🔥
I: 746.88 ms, T: 99.50 ms | **1114.88 ms**
I: 460.8 ms, T: 652.88 ms | +| Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | **4842.81 ms** 🔥
I: 2121.94 ms, T: 2719.62 ms | **x86_64 CPU Cloud Server** -All tests below were conducted on c3d-highcpu-30 (30 vCPU, 15 core, 59 GB memory) VMs in Google Cloud. [More details](https://github.com/b4rtaz/distributed-llama/discussions/9). +Weights = Q40, Buffer = Q80, nSamples = 16, VMs = [c3d-highcpu-30](https://github.com/b4rtaz/distributed-llama/discussions/9), tested on 0.1.0 version | Model | 1 x VM | 2 x VM | 4 x VM | |-------------|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------| -| Llama 2 7B | **101.81 ms**
(I: 101.06 ms, T: 0.19 ms) | **69.69 ms**
(I: 61.50 ms, T: 7.62 ms) | **53.69 ms** 🔥
(I: 40.25 ms, T: 12.81 ms) | -| Llama 2 13B | **184.19 ms**
(I: 182.88 ms, T: 0.69 ms) | **115.38 ms**
(I: 107.12 ms, T: 7.81 ms) | **86.81 ms** 🔥
(I: 66.25 ms, T: 19.94 ms) | -| Llama 2 70B | **909.69 ms**
(I: 907.25 ms, T: 1.75 ms) | **501.38 ms**
(I: 475.50 ms, T: 25.00 ms) | **293.06 ms** 🔥
(I: 264.00 ms, T: 28.50 ms) | +| Llama 2 7B | **101.81 ms**
I: 101.06 ms, T: 0.19 ms | **69.69 ms**
I: 61.50 ms, T: 7.62 ms | **53.69 ms** 🔥
I: 40.25 ms, T: 12.81 ms | +| Llama 2 13B | **184.19 ms**
I: 182.88 ms, T: 0.69 ms | **115.38 ms**
I: 107.12 ms, T: 7.81 ms | **86.81 ms** 🔥
I: 66.25 ms, T: 19.94 ms | +| Llama 2 70B | **909.69 ms**
I: 907.25 ms, T: 1.75 ms | **501.38 ms**
I: 475.50 ms, T: 25.00 ms | **293.06 ms** 🔥
I: 264.00 ms, T: 28.50 ms | -I - inference time of the root node, T - network transfer time, tested on 0.1.0 version - -### Network Transfer for Generating Single Token +### Network Transfer for Generating Token **F32 Buffer** -| Model | 2 devices | 4 devices | 8 devices | -|-------------|------------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------------------| -| Llama 3 8B | **2048 kB**
(S: 1024 kB, R: 1024 kB) | **6144 kB**
(S: 3072 kB, R: 3072 kB) | **14336 kB**
(S: 7168 kB, R: 7168 kB) | - -S - sent data from the root node to workers, R - received data by the root node from workers, tested on 0.7.1 version +| Model | 2 devices | 4 devices | 8 devices | +|-------------|----------------|---------------|---------------| +| Llama 3 8B | **2048 kB** | **6144 kB** | **14336 kB** | **Q80 Buffer** -| Model | 2 devices | 4 devices | 8 devices | -|-------------|---------------------------------------------------------------|----------------------------------------------------------------|-----------------------------------------------------------------| -| Llama 3 8B | **544 kB**
(S: 272 kB, R: 272 kB) | **1632 kB**
(S: 816 kB, R: 816 kB) | **3808 kB**
(S: 1904 kB, R: 1904 kB) | +| Model | 2 devices | 4 devices | 8 devices | +|-------------|--------------|---------------|----------------| +| Llama 3 8B | **544 kB** | **1632 kB** | **3808 kB** | -S - sent data from the root node to workers, R - received data by the root node from workers, tested on 0.7.1 version - -## 📟 How to Run on Raspberry Pi Devices +## 📟 Setup Raspberry Pi Devices 1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't have desktop environment. -2. Connect all devices to the Gigabit switch. +2. Connect all devices to your switch or router. 3. Connect to all devices via SSH. ``` ssh user@raspberrypi1.local @@ -127,27 +117,24 @@ ssh user@raspberrypi2.local ```sh sudo apt install git ``` -5. Clone this repository: +5. Clone this repository and compile Distributed Llama on all devices: ```sh git clone https://github.com/b4rtaz/distributed-llama.git -``` -6. Compile Distributed Llama: -```sh make dllama ``` -7. Transfer weights and the tokenizer file to the root device. -8. Optional: assign static IP addresses. +6. Transfer weights and the tokenizer file to the root device. +7. Optional: assign static IP addresses. ```sh sudo ip addr add 10.0.0.1/24 dev eth0 # 1th device sudo ip addr add 10.0.0.2/24 dev eth0 # 2th device ``` -9. Run worker nodes on worker devices: +8. Run worker nodes on worker devices: ```sh sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4 ``` -10. Run root node on the root device: +9. 
Run root node on the root device: ```sh -sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998 +sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998 ``` To add more worker nodes, just add more addresses to the `--workers` argument. @@ -156,70 +143,57 @@ To add more worker nodes, just add more addresses to the `--workers` argument. ./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 ``` -[Share your results](https://github.com/b4rtaz/distributed-llama/discussions)! +## 💻 Setup computers with MacOS, Linux, or Windows -## 💻 How to Run on MacOS, Linux, or Windows +You need x86_64 AVX2 CPUs or ARM CPUs. Different devices may have different CPUs. -You need to have x86_64 AVX2 CPU or ARM CPU. Different devices may have different CPUs. The below instructions are for Debian-based distributions but you can easily adapt them to your distribution, macOS, or Windows. +#### MacOS or Linux -### MacOS and Linux +The below instructions are for Debian-based distributions but you can easily adapt them to your distribution, macOS. -1. Install Git and G++: +1. Install Git and GCC: ```sh sudo apt install git build-essential ``` -2. Clone this repository: +2. Clone this repository and compile Distributed Llama on all computers: ```sh git clone https://github.com/b4rtaz/distributed-llama.git -``` -3. Compile Distributed Llama: -```sh make dllama ``` -4. Transfer weights and the tokenizer file to the root node. -5. Run worker nodes on worker devices: -```sh -sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4 -``` -6. 
Run root node on the root device: -```sh -sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998 -``` -7. To run the root node in the chat mode: -```sh -sudo nice -n -20 ./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998 -``` -### Windows +Continue to point 3. -1. Install Git and Mingw (Chocolatey): - - https://chocolatey.org/install +#### Windows + +1. Install Git and Mingw (via [Chocolatey](https://chocolatey.org/install)): ```powershell choco install mingw ``` -2. Clone this repository: +2. Clone this repository and compile Distributed Llama on all computers: ```sh git clone https://github.com/b4rtaz/distributed-llama.git -``` -3. Compile Distributed Llama: -```sh make dllama ``` -4. Transfer weights and the tokenizer file to the root node. -5. Run worker nodes on worker devices: + +Continue to point 3. + +#### Run Cluster + +3. Transfer weights and the tokenizer file to the root computer. +4. Run worker nodes on worker computers: ```sh ./dllama worker --port 9998 --nthreads 4 ``` -6. Run root node on the root device: -```sh -./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998 -``` -7. To run the root node in the chat mode: +5. 
Run root node on the root computer:
 ```sh
-./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
+./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
 ```

-[Share your results](https://github.com/b4rtaz/distributed-llama/discussions)!
+To add more worker nodes, just add more addresses to the `--workers` argument.
+
+```
+./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
+```

 ## 💡 License

diff --git a/converter/convert-llama.py b/converter/convert-llama.py
index 73a9b6f..31fae60 100644
--- a/converter/convert-llama.py
+++ b/converter/convert-llama.py
@@ -4,7 +4,7 @@
 import torch
 import math
 import numpy as np
-from writer import writeTensor, writeHeader, parseFloatType, FloatType
+from writer import writeTensor, writeHeader, parseFloatType, strFloatType, FloatType
 from pathlib import Path

 LAYER_CHUNK_SIZE = 48
@@ -107,12 +107,13 @@ def usage():

     modelPath = sys.argv[1]
     targetFloatType = parseFloatType(sys.argv[2])
+    targetFloatTypeStr = strFloatType(targetFloatType)

     modelName = os.path.basename(modelPath)
-    outputFileName = f'dllama_model_{modelName.lower()}_{sys.argv[2]}.m'
+    outputFileName = f'dllama_model_{modelName.lower()}_{targetFloatTypeStr}.m'

     print(f'Model name: {modelName}')
-    print(f'Target float type: {targetFloatType}')
+    print(f'Target float type: {targetFloatTypeStr}')
     print(f'Target file: {outputFileName}')

     convert(modelPath, outputFileName, targetFloatType)
diff --git a/converter/writer.py b/converter/writer.py
index 256eef5..56e6dd4 100644
--- a/converter/writer.py
+++ b/converter/writer.py
@@ -23,6 +23,9 @@ def parseFloatType(type):
         return floatType
     raise Exception(f'{type} is not supported')

+def strFloatType(type):
+    return floatTypeNames[type]
+
 def writeQuantizedQ40Tensor(file, x):
     x = x.to(torch.float32).numpy().astype(np.float32)
     blockSize = 32
@@ -105,7 +108,7 @@ def writeTensor(file, tensor, floatType):
     else:
         raise Exception(f'Unknown float type')
     t1 = time.time()
-    print(f'Saved {floatTypeNames[floatType]} tensor in {t1 - t0:.2f}s, {nBytes} bytes')
+    print(f'Saved {strFloatType(floatType)} tensor in {t1 - t0:.2f}s, {nBytes} bytes')

 def writeHeader(file, params):
     headerKeys = {
diff --git a/launch.py b/launch.py
index c8ebeb3..620270b 100644
--- a/launch.py
+++ b/launch.py
@@ -4,6 +4,11 @@

 # ['model-url', 'tokenizer-url', 'weights-float-type', 'buffer-float-type', 'model-type']
 MODELS = {
+    'llama3_8b_q40': [
+        'https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Distributed-Llama/resolve/main/dllama_model_meta-llama-3-8b_q40.m?download=true',
+        'https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Distributed-Llama/resolve/main/dllama_tokenizer_llama3.t?download=true',
+        'q40', 'q80', 'base'
+    ],
     'llama3_8b_instruct_q40': [
         'https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Instruct-Distributed-Llama/resolve/main/dllama_model_lama3_instruct_q40.m?download=true',
         'https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Instruct-Distributed-Llama/resolve/main/dllama_tokenizer_llama3.t?download=true',
@@ -11,11 +16,12 @@
     ]
 }

-ALIASES = {
-    'llama3_instruct': 'llama3_8b_instruct_q40'
-}
-
 def downloadFile(url: str, path: str):
+    if (os.path.isfile(path)):
+        fileName = os.path.basename(path)
+        result = input(f'❓ {fileName} already exists, do you want to download again? ("Y" if yes): ')
+        if (result.upper() != 'Y'):
+            return
     response = requests.get(url, stream=True)
     response.raise_for_status()
     print(f'📄 {url}')
@@ -64,8 +70,6 @@ def printUsage():
     os.chdir(os.path.dirname(__file__))

     modelName = sys.argv[1].replace('-', '_')
-    if modelName in ALIASES:
-        modelName = ALIASES[modelName]
     if modelName not in MODELS:
         print(f'Model is not supported: {modelName}')
         exit(1)
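
A note on the new guard in `downloadFile`: the prompt is read with `input()` inside the download function, so the skip-or-redownload decision is hard to exercise without a terminal. Below is a minimal sketch of the same check pulled into a small helper. The names `should_download`, `ask`, and `demo` are hypothetical and not part of this change; the prompt text and the `"Y"` comparison follow the diff.

```python
import os
import tempfile
from typing import Callable, List


def should_download(path: str, ask: Callable[[str], str] = input) -> bool:
    """Decide whether the file at `path` should be (re)downloaded.

    Hypothetical helper: mirrors the guard added to downloadFile() in the
    diff, with the prompt function injected so tests can replace it.
    """
    if not os.path.isfile(path):
        # Nothing on disk yet, so there is nothing to ask about.
        return True
    file_name = os.path.basename(path)
    result = ask(f'❓ {file_name} already exists, do you want to download again? ("Y" if yes): ')
    # Same rule as the diff: only "Y" or "y" triggers a new download.
    return result.upper() == 'Y'


def demo() -> List[bool]:
    """Run the three interesting cases against a temporary directory."""
    outcomes = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'dllama_tokenizer_llama3.t')
        outcomes.append(should_download(path, ask=lambda _: 'n'))  # file missing
        with open(path, 'wb') as f:
            f.write(b'cached')
        outcomes.append(should_download(path, ask=lambda _: 'y'))  # user confirms
        outcomes.append(should_download(path, ask=lambda _: ''))   # user declines
    return outcomes


if __name__ == '__main__':
    print(demo())  # [True, True, False]
```

With a helper like this, the guard in `downloadFile` would reduce to `if not should_download(path): return`, and the behavior for an empty answer (skip the download) is pinned down by a test rather than by manual runs.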