This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.
Key changes:
- All nodes in the Distributed Llama cluster are now interconnected using a mesh topology; previously, a star topology was used (see the first sketch after this list).
- Every layer is now distributed across all nodes, including the last layer, which was previously a major bottleneck.
- Norm layers are now calculated redundantly on all nodes. This redundant step is very fast and does not significantly impact performance (see the second sketch after this list).
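To illustrate the mesh topology, here is a minimal, self-contained sketch (not the actual Distributed Llama networking code): each node computes only its own slice of a layer's output and exchanges slices directly with every peer, so no root node has to gather and rebroadcast everything as in the old star topology. The in-memory copy stands in for a peer-to-peer network transfer.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t nNodes = 4;     // e.g. 4 x Raspberry Pi 5
    const std::size_t sliceSize = 2;  // tiny slice size just for the demo

    // Each node computes only its own slice of a layer's output.
    std::vector<std::vector<float>> slices(nNodes, std::vector<float>(sliceSize));
    for (std::size_t node = 0; node < nNodes; node++)
        for (std::size_t i = 0; i < sliceSize; i++)
            slices[node][i] = (float)(node * 10 + i);

    // Mesh-style all-gather: every node receives every other node's slice
    // directly from that peer (the copy below stands in for a peer-to-peer
    // network transfer), so no single root node collects and rebroadcasts.
    std::vector<std::vector<float>> gathered(nNodes, std::vector<float>(nNodes * sliceSize));
    for (std::size_t node = 0; node < nNodes; node++)
        for (std::size_t peer = 0; peer < nNodes; peer++)
            for (std::size_t i = 0; i < sliceSize; i++)
                gathered[node][peer * sliceSize + i] = slices[peer][i];

    // All nodes now hold identical, complete activations; print node 0's copy.
    for (float v : gathered[0])
        std::printf("%.0f ", v);
    std::printf("\n");
    return 0;
}
```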
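The second sketch shows the redundant norm step, assuming each node already holds the full activation vector after the mesh exchange. The `rmsNorm` function below is a generic RMS-norm illustration, not the project's actual implementation; recomputing this small loop on every node avoids an extra network synchronization.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic RMS norm over the full activation vector x, scaled by per-channel
// weights. Each node runs this locally on its own (identical) copy of x.
void rmsNorm(std::vector<float>& x, const std::vector<float>& weight, float eps = 1e-5f) {
    float ss = 0.0f;
    for (float v : x)
        ss += v * v;                                   // sum of squares
    const float scale = 1.0f / std::sqrt(ss / (float)x.size() + eps);
    for (std::size_t i = 0; i < x.size(); i++)
        x[i] = x[i] * scale * weight[i];               // normalize and apply weight
}
```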
Measurements
4 x Raspberry Pi 5 8GB
| Model | Tokens/s - 0.10.6 | Tokens/s - this version | Speedup |
|---|---|---|---|
| Llama 3.2 1B Q40 | 9.90 | 21.42 | 2.1x |
| Llama 3.2 3B Q40 | 3.47 | 9.01 | 2.6x 🚀 |
| Llama 3 8B Q40 | 2.83 | 4.67 | 1.6x |
2 x Raspberry Pi 5 8GB
| Model | Tokens/s - 0.10.6 | Tokens/s - this version | Speedup |
|---|---|---|---|
| Llama 3.2 1B Q40 | 8.44 | 15.31 | 1.8x |
| Llama 3.2 3B Q40 | 3.24 | 6.80 | 2.0x 🚀 |
| Llama 3 8B Q40 | 2.02 | 3.44 | 1.7x |
TODO
- The Mixtral model is temporarily not supported; it will be fixed in an upcoming release.