Skip to content

Releases: b4rtaz/distributed-llama

0.11.1

09 Dec 22:22
0975af8
Compare
Choose a tag to compare
  • This version disables CPU pinning #141
  • This version introduces help/usage information accessible via the command line #143 (thanks @jkeegan!)

0.11.0 🚀

21 Nov 18:59
8b1cf89
Compare
Choose a tag to compare

This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.

Key changes:

  • All nodes in the Distributed Llama cluster are now interconnected using a mesh topology. Previously, a star topology was used.
  • Now, every layer is distributed across all nodes, including the last layer, which previously caused a major bottleneck.
  • Norm layers are now calculated redundantly on all nodes. While redundant, this step is very fast and does not impact performance significantly.

Measurement

4 x Raspberry Pi 5 8GB

Model Token/s - 0.10.6 Token/s - This version Acceleration
Llama 3.2 1B Q40 9.90 21.42 2.1x
Llama 3.2 3B Q40 3.47 9.01 2.6x 🚀
Llama 3 8B Q40 2.83 4.67 1.6x

2 x Raspberry Pi 5 8GB

Model Tok/s - 0.10.6 Tok/s - This version Acceleration
Llama 3.2 1B Q40 8.44 15.31 1.8x
Llama 3.2 3B Q40 3.24 6.80 2.0x 🚀
Llama 3 8B Q40 2.02 3.44 1.7x

Test details

TODO

  • mixtral model is temporary not supported, it will be fixed in a next release.

0.10.6

17 Nov 13:41
6599db2
Compare
Choose a tag to compare

This version fixes a bug in the rms function for processors with AVX2 instructions #137.

0.10.5

11 Nov 23:12
c09173b
Compare
Choose a tag to compare

This version fixes the bug related to releasing memory #134.

0.10.4

13 Oct 14:22
Compare
Choose a tag to compare

This version adds to the launch.py script two new models:

  • Llama 3.2 1B Instruct Q40
  • Llama 3.2 3B Instruct Q40

0.10.3

10 Aug 22:27
3353d56
Compare
Choose a tag to compare

This version refactors the code to reduce the use of the writeMany and readMany methods.

0.10.2

29 Jul 12:23
71135e6
Compare
Choose a tag to compare

This version introduces a new CLI argument: --max-seq-len <n>. It allows you to reduce the context size and, at the same time, reduce memory consumption. This argument works with the following commands: dllama inference, dllama chat, and dllama-api. You don't need to set it in the worker because the root node will distribute the information to the worker.

Example:

./dllama chat --model ... --nthreads 8 --max-seq-len 1024

0.10.1

28 Jul 14:29
Compare
Choose a tag to compare

Implemented the fallback implementation for the matmulQ40vQ80 operation. Distributed Llama now supports all CPU architectures, with optimizations specifically for ARM and AVX2 CPUs.

0.10.0

25 Jul 11:54
4b8a0ca
Compare
Choose a tag to compare

This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).

Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context
Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context

The quantized Llama 3.1 8B model to Q40 format requires 6.3 GB GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32). For casual devices, this is definitely too high. That's why this version introduces the --kv-cache-storage disc argument (Windows is not supported yet). Once set, the key-value cache will be stored on your disk. If you have a fast SSD, the slowdown should be acceptable. This argument works for the dllama inference, dllama worker, and dllama-api commands. An important fact is that the size of the KV cache is split across all nodes in the cluster. So, for example, with 4 nodes, each needs to have ~8.5 GB of memory (RAM or disk) to keep the KV cache.

How to run Llama 3.1 8B

  1. Download Distributed Llama repository and compile it: make dllama && make dllama-api.
  2. Download model python launch.py llama3_1_8b_instruct_q40
  3. Run model:
    • ./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999 or
    • ./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999

If your worker node does not have enough RAM for the KV cache, you can run the worker with the --kv-cache-storage disc argument.

./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8

TODO

A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.

The 0.10.2 version introduced the --max-seq-len <n> argument.

0.9.2

12 Jul 20:10
90d7ebd
Compare
Choose a tag to compare

This version allows to override the chat template. This may be helpful if a model does not have a tokenizer with a chat template.

How to use:

./dllama... --chat-template llama3
./dllama-api ... --chat-template llama3

Supported values:

  • llama2
  • llama3
  • zephyr
  • chatml