
0.11.0 🚀

b4rtaz released this 21 Nov 18:59
8b1cf89

This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.

Key changes:

  • All nodes in the Distributed Llama cluster are now interconnected using a mesh topology; previously, a star topology was used (a rough sketch of the difference follows this list).
  • Every layer is now distributed across all nodes, including the last layer, which was previously a major bottleneck.
  • Norm layers are now computed redundantly on all nodes. While redundant, this step is very fast and does not noticeably impact performance.
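
To see why the switch from a star to a mesh topology matters, here is a minimal, hypothetical C++ sketch (not the actual Distributed Llama code) comparing the per-layer traffic through the busiest node under each topology. The node count and slice size are illustrative assumptions only.

```cpp
#include <cstdio>

int main() {
    const long nNodes = 4;         // e.g. 4 x Raspberry Pi 5 (assumption for illustration)
    const long sliceBytes = 2048;  // bytes produced per node per layer (hypothetical value)

    // Star: workers send their slices to the root, which assembles the full
    // vector and sends it back to every worker, so the root's link carries
    // almost all of the traffic for every layer.
    long starRootBytes = (nNodes - 1) * sliceBytes            // slices received from workers
                       + (nNodes - 1) * nNodes * sliceBytes;  // full vector sent to each worker

    // Mesh: every node exchanges its slice directly with every other node,
    // so the load stays flat and no single link becomes the bottleneck.
    long meshNodeBytes = 2 * (nNodes - 1) * sliceBytes;       // sent + received per node

    printf("star: busiest node moves %ld bytes per layer\n", starRootBytes);
    printf("mesh: each node moves %ld bytes per layer\n", meshNodeBytes);
    return 0;
}
```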

Measurements

4 x Raspberry Pi 5 8GB

Model              Tok/s - 0.10.6   Tok/s - this version   Acceleration
Llama 3.2 1B Q40   9.90             21.42                  2.1x
Llama 3.2 3B Q40   3.47             9.01                   2.6x 🚀
Llama 3 8B Q40     2.83             4.67                   1.6x

2 x Raspberry Pi 5 8GB

Model              Tok/s - 0.10.6   Tok/s - this version   Acceleration
Llama 3.2 1B Q40   8.44             15.31                  1.8x
Llama 3.2 3B Q40   3.24             6.80                   2.0x 🚀
Llama 3 8B Q40     2.02             3.44                   1.7x

TODO

  • The Mixtral model is temporarily unsupported; this will be fixed in a future release.