
Releases: b4rtaz/distributed-llama

0.5.2

18 May 09:28, commit 182fdcd
  • feat: use AVX2 to speed up dotProduct (an illustrative sketch follows this list)
  • feat: use AVX2 to speed up matmulF32
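
As a rough illustration of the dotProduct change, the sketch below shows how a float32 dot product can be vectorized with AVX intrinsics, eight elements per step. The function name dotProductF32 and its signature are assumptions for illustration only, not the repository's actual code.

    #include <immintrin.h>
    #include <cstddef>

    // Illustrative only: a vectorized float32 dot product, 8 floats per step.
    // The real dotProduct in the repository may differ.
    float dotProductF32(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb)); // acc += a * b
        }
        // Horizontal sum of the 8 partial sums.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        float result = _mm_cvtss_f32(s);
        for (; i < n; i++) result += a[i] * b[i]; // scalar tail
        return result;
    }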

0.5.1

15 May 14:35, commit d1304c8

0.5.0

13 May 22:07, commit c9bb613
  • feat: splitting attention layers across all nodes (see the sketch below). 🎉 🎉 🎉
  • fix: convert-llama.py supports different max_seq_len values.
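
A minimal sketch of one way attention heads could be partitioned across nodes: each node gets a contiguous slice of heads and computes attention only for that slice. The helper below is hypothetical and only illustrates the idea, not the repository's actual splitting logic.

    // Hypothetical helper: assign node `nodeIndex` a contiguous range of
    // attention heads, assuming nHeads is divisible by nNodes.
    struct HeadSlice { int start; int end; };

    HeadSlice headSliceForNode(int nodeIndex, int nNodes, int nHeads) {
        int headsPerNode = nHeads / nNodes;
        return { nodeIndex * headsPerNode, (nodeIndex + 1) * headsPerNode };
    }

    // Example: 32 heads over 4 nodes -> node 1 computes heads 8..15.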

0.4.0

09 May 17:38, commit e93d1e6
  • feat: support for any number of threads.
  • fix: support for the maximum KV cache length.
  • feat: splitting RoPE across all nodes.
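
For reference, here is a minimal sketch of rotary position embedding (RoPE) applied to a single query head; once each node owns its own slice of heads, it can apply this rotation locally. Variable names and layout are assumptions, not the project's implementation.

    #include <cmath>

    // Illustrative RoPE: rotate consecutive pairs (q[i], q[i+1]) by a
    // position-dependent angle, using the common theta = 10000 base.
    void applyRope(float* q, int headDim, int pos) {
        for (int i = 0; i < headDim; i += 2) {
            float freq = 1.0f / std::pow(10000.0f, (float)i / (float)headDim);
            float angle = (float)pos * freq;
            float c = std::cos(angle);
            float s = std::sin(angle);
            float x0 = q[i], x1 = q[i + 1];
            q[i]     = x0 * c - x1 * s;
            q[i + 1] = x0 * s + x1 * c;
        }
    }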

0.3.1

28 Apr 21:36, commit 37fad6a
  • Changed the order of QKV synchronization (details)
  • All tasks of the Llama architecture are executed in parallel
  • RoPE cache for the Llama architecture

0.3.0

22 Apr 20:57
  • New tokenizer format (old tokenizer files are not supported, please regenerate tokenizer files).
  • Added Llama 3 support.
  • Simple-server mode; see the example nodejs-example.cjs. You can now use Distributed Llama as a simple LLM server.

0.2.0

11 Apr 21:29, commit 620644a

Added Grok-1 support!

Breaking change: you need to re-convert Llama 2 models to the new format.

0.1.1

23 Jan 23:07, commit f2137af

This version introduces partial optimization for x86_64 AVX2 CPUs. It is now possible to run inference with Q40 weights and a Q80 buffer with partial AVX2 acceleration.
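
To give a rough idea of what Q40 weights look like, here is a hedged sketch assuming a llama.cpp-style block layout (32 weights per block, stored as 4-bit values plus one scale). The struct and the dequantization routine are illustrative assumptions, not necessarily the exact format Distributed Llama uses.

    #include <cstdint>

    // Assumed Q40-style block: 32 weights as 4-bit values plus a block scale.
    struct BlockQ40 {
        float d;         // block scale (often stored as fp16 on disk)
        uint8_t qs[16];  // 32 x 4-bit quantized values, two per byte
    };

    // Dequantize one block into 32 floats (illustrative layout).
    void dequantizeQ40(const BlockQ40* b, float* out) {
        for (int i = 0; i < 16; i++) {
            int x0 = (b->qs[i] & 0x0F) - 8; // low nibble
            int x1 = (b->qs[i] >> 4) - 8;   // high nibble
            out[i]      = x0 * b->d;
            out[i + 16] = x1 * b->d;
        }
    }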

0.1.0

23 Jan 22:50

Initial release! 🚢