This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.
Key changes:
- All nodes in the Distributed Llama cluster are now interconnected using a mesh topology; previously, a star topology was used (see the first sketch after this list).
- Every layer is now distributed across all nodes, including the last layer, which was previously a major bottleneck.
- Norm layers are now calculated redundantly on all nodes. This redundant step is very fast and does not significantly impact performance (see the second sketch after this list).
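To illustrate the mesh topology, here is a minimal, self-contained sketch (not the actual Distributed Llama networking code): each node computes only its own slice of a layer's output and exchanges slices directly with every peer, so no root node has to gather and rebroadcast everything as in the old star topology. The in-memory copy stands in for a peer-to-peer network transfer.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t nNodes = 4;     // e.g. 4 x Raspberry Pi 5
    const std::size_t sliceSize = 2;  // tiny slice size just for the demo

    // Each node computes only its own slice of a layer's output.
    std::vector<std::vector<float>> slices(nNodes, std::vector<float>(sliceSize));
    for (std::size_t node = 0; node < nNodes; node++)
        for (std::size_t i = 0; i < sliceSize; i++)
            slices[node][i] = (float)(node * 10 + i);

    // Mesh-style all-gather: every node receives every other node's slice
    // directly from that peer (the copy below stands in for a peer-to-peer
    // network transfer), so no single root node collects and rebroadcasts.
    std::vector<std::vector<float>> gathered(nNodes, std::vector<float>(nNodes * sliceSize));
    for (std::size_t node = 0; node < nNodes; node++)
        for (std::size_t peer = 0; peer < nNodes; peer++)
            for (std::size_t i = 0; i < sliceSize; i++)
                gathered[node][peer * sliceSize + i] = slices[peer][i];

    // All nodes now hold identical, complete activations; print node 0's copy.
    for (float v : gathered[0])
        std::printf("%.0f ", v);
    std::printf("\n");
    return 0;
}
```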
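The second sketch shows the redundant norm step, assuming each node already holds the full activation vector after the mesh exchange. The `rmsNorm` function below is a generic RMS-norm illustration, not the project's actual implementation; recomputing this small loop on every node avoids an extra network synchronization.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic RMS norm over the full activation vector x, scaled by per-channel
// weights. Each node runs this locally on its own (identical) copy of x.
void rmsNorm(std::vector<float>& x, const std::vector<float>& weight, float eps = 1e-5f) {
    float ss = 0.0f;
    for (float v : x)
        ss += v * v;                                   // sum of squares
    const float scale = 1.0f / std::sqrt(ss / (float)x.size() + eps);
    for (std::size_t i = 0; i < x.size(); i++)
        x[i] = x[i] * scale * weight[i];               // normalize and apply weight
}
```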
Measurements
4 x Raspberry Pi 5 8GB
| Model | Tokens/s - 0.10.6 | Tokens/s - this version | Speedup |
|---|---|---|---|
| Llama 3.2 1B Q40 | 9.90 | 21.42 | 2.1x |
| Llama 3.2 3B Q40 | 3.47 | 9.01 | 2.6x 🚀 |
| Llama 3 8B Q40 | 2.83 | 4.67 | 1.6x |
2 x Raspberry Pi 5 8GB
| Model | Tokens/s - 0.10.6 | Tokens/s - this version | Speedup |
|---|---|---|---|
| Llama 3.2 1B Q40 | 8.44 | 15.31 | 1.8x |
| Llama 3.2 3B Q40 | 3.24 | 6.80 | 2.0x 🚀 |
| Llama 3 8B Q40 | 2.02 | 3.44 | 1.7x |
TODO
- The Mixtral model is temporarily not supported; it will be fixed in an upcoming release.