Releases: b4rtaz/distributed-llama
0.5.2
- feat: use AVX2 to speed up dotProduct
- feat: use AVX2 to speed up matmulF32
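Both changes vectorize the inner float loops with AVX2. Below is a minimal sketch of the idea behind an AVX2 dot product, not the project's actual dotProduct implementation; the function name and the multiple-of-8 length assumption are illustrative only.

```cpp
#include <immintrin.h>
#include <cassert>

// Sketch of an AVX2 float dot product: process 8 floats per iteration,
// then reduce the 8 partial sums horizontally.
// Assumes n is a multiple of 8; the real code also covers quantized types.
float dotProductAvx2(const float* a, const float* b, unsigned n) {
    assert(n % 8 == 0);
    __m256 acc = _mm256_setzero_ps();
    for (unsigned i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal reduction of the 8 partial sums.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

Compiled with `-mavx2`, each loop iteration replaces eight scalar multiply-adds with one vector multiply and one vector add.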
0.5.0
- feat: splitting attention layers across all nodes. 🎉 🎉 🎉 (see the sketch below)
- fix: convert-llama.py now supports different max_seq_len values.
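The general idea behind splitting attention across nodes can be sketched as follows; `sliceHeads` and the even-per-node partition are assumptions for illustration, not necessarily the project's exact scheme. Each node owns a contiguous range of attention heads and computes Q/K/V and attention only for that range, with results synchronized back to the root node.

```cpp
#include <cstdio>

// Sketch: partition attention heads evenly across nodes (assumed scheme,
// not necessarily the exact split used by distributed-llama).
struct HeadSlice { unsigned start; unsigned end; }; // [start, end)

HeadSlice sliceHeads(unsigned nHeads, unsigned nNodes, unsigned nodeIndex) {
    unsigned perNode = nHeads / nNodes; // assumes nHeads % nNodes == 0
    return { nodeIndex * perNode, (nodeIndex + 1) * perNode };
}

int main() {
    // Example: 32 heads split across 4 nodes -> 8 heads per node.
    for (unsigned node = 0; node < 4; node++) {
        HeadSlice s = sliceHeads(32, 4, node);
        std::printf("node %u handles heads [%u, %u)\n", node, s.start, s.end);
    }
}
```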
0.4.0
- feat: support for any number of threads.
- fix: support max KV cache length.
- feat: splitting RoPE across all nodes (see the sketch below).
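For context, RoPE rotates consecutive (even, odd) pairs of a query/key vector by a position-dependent angle. A minimal single-node sketch follows; the function name and the 10000.0f base are assumptions, and the distributed split (and the RoPE cache added in 0.3.1) is not shown.

```cpp
#include <cmath>

// Minimal RoPE sketch: rotate each (x[i], x[i+1]) pair by an angle that
// depends on the token position and the pair index.
void applyRope(float* x, unsigned dim, unsigned pos) {
    for (unsigned i = 0; i < dim; i += 2) {
        float freq = std::pow(10000.0f, -(float)i / (float)dim);
        float angle = (float)pos * freq;
        float c = std::cos(angle);
        float s = std::sin(angle);
        float x0 = x[i];
        float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```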
0.3.1
- Changed the order of QKV synchronization.
- All tasks of the Llama architecture are now executed in parallel.
- RoPE cache for the Llama architecture.
0.3.0
- New tokenizer format (old tokenizer files are no longer supported; please regenerate them).
- Added Llama 3 support.
- Simple-server mode; see the example in nodejs-example.cjs. You can now use Distributed Llama as a simple LLM server.
0.2.0
Added Grok-1 support!
Breaking change: you need to re-convert Llama 2 models to the new format.
0.1.1
This version introduces partial AVX2 optimization for x86_64 CPUs. Inference with Q40 weights and a Q80 buffer now runs with partial AVX2 acceleration.
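Q40 and Q80 are block quantization formats. The layout below is an assumption (the common llama.cpp-style 32-element block with one per-block scale; the real structs store the scale as float16), shown only to illustrate the arithmetic that the AVX2 path accelerates.

```cpp
#include <cstdint>

// Assumed block layout: 32 weights per block with one scale. The scale is
// kept as float here to keep the sketch self-contained.
constexpr int BLOCK_SIZE = 32;

struct BlockQ40 {               // 4-bit weights: two values per byte
    float d;                    // scale
    uint8_t qs[BLOCK_SIZE / 2]; // low nibble = value i, high nibble = value i+16
};

struct BlockQ80 {               // 8-bit activations
    float d;                    // scale
    int8_t qs[BLOCK_SIZE];
};

// Scalar dot product over nBlocks quantized blocks. The AVX2 path vectorizes
// this inner multiply-accumulate loop; this sketch only shows the arithmetic.
float dotQ40Q80(const BlockQ40* w, const BlockQ80* x, int nBlocks) {
    float sum = 0.0f;
    for (int b = 0; b < nBlocks; b++) {
        int32_t acc = 0;
        for (int i = 0; i < BLOCK_SIZE / 2; i++) {
            int v0 = (w[b].qs[i] & 0x0F) - 8; // dequantized 4-bit values
            int v1 = (w[b].qs[i] >> 4) - 8;
            acc += v0 * x[b].qs[i];
            acc += v1 * x[b].qs[i + BLOCK_SIZE / 2];
        }
        sum += (float)acc * w[b].d * x[b].d;
    }
    return sum;
}
```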