Releases: teleprint-me/llama.cpp

b1954 (23 Jan 02:05, commit 011e8ec)
llama : fix not enough space in buffer with Qwen (#5086)

b1893 (16 Jan 18:20, commit bee938d)
nix: remove nixConfig from flake.nix (#4984)

b1886 (16 Jan 17:15, commit 862f5e4)
android : introduce starter project example (#4926)

* Introduce starter project for Android

Based on examples/llama.swiftui.

* Add github workflow

* Set NDK version

* Only build arm64-v8a in CI

* Sync bench code

* Rename CI prop to skip-armeabi-v7a

* Remove unused tests

b1879 (16 Jan 03:57, commit 3e5ca79)
pass cpu-architecture arguments only to host code (C;C++) (#4943)

b1878 (15 Jan 17:03, commit 4483396)
llama : apply classifier-free guidance to logits directly (#4951)
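
For context, classifier-free guidance combines a conditional and an unconditional prediction; applying it "directly to logits" means interpolating the two logit vectors before sampling. A minimal sketch of the standard CFG formula (function and parameter names here are illustrative, not llama.cpp's API):

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance applied directly to logits:
    result = uncond + scale * (cond - uncond).
    A scale of 1.0 reproduces the conditional logits unchanged;
    larger scales push further in the conditional direction."""
    cond = np.asarray(cond_logits, dtype=np.float64)
    uncond = np.asarray(uncond_logits, dtype=np.float64)
    return uncond + guidance_scale * (cond - uncond)

# scale = 1.0 is a no-op relative to the conditional logits:
print(cfg_logits([2.0, 0.0], [1.0, 0.0], 1.0))  # [2. 0.]
# scale = 2.0 extrapolates past them:
print(cfg_logits([2.0, 0.0], [1.0, 0.0], 2.0))  # [3. 0.]
```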

b1874 (15 Jan 07:54, commit 4a3156d)
CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938)

Co-authored-by: Iwan Kawrakow <[email protected]>
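
Q4_0 and Q4_1 are ggml's 4-bit block quantization formats: each block stores a scale d (and, for Q4_1, an additive minimum m) plus 32 four-bit quants. A simplified sketch of the per-element dequantization these kernels perform, with block packing details omitted (formulas follow the ggml quantization scheme; helper names are illustrative):

```python
def dequantize_q4_0(d, quants):
    """Q4_0: x_i = d * (q_i - 8), where q_i is a 4-bit value in [0, 15]
    stored with an implicit offset of 8 (so values center on zero)."""
    return [d * (q - 8) for q in quants]

def dequantize_q4_1(d, m, quants):
    """Q4_1: x_i = d * q_i + m, with an explicit per-block minimum m
    instead of the fixed offset used by Q4_0."""
    return [d * q + m for q in quants]

print(dequantize_q4_0(0.5, [8, 12, 0]))    # [0.0, 2.0, -4.0]
print(dequantize_q4_1(0.5, -1.0, [0, 4]))  # [-1.0, 1.0]
```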

b1873 (14 Jan 16:59, commit a836c8f)
llama : fix missing quotes (#4937)

b1863 (14 Jan 05:10, commit 76484fb)
sync : ggml

b1848 (12 Jan 22:28, commit de473f5)
sync : ggml

b1843 (12 Jan 19:50, commit e7e4df0)
llama : ggml-backend integration (#4766)

* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
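
The --split-mode / -sm parameter introduced in the release above selects how a model is divided across GPUs: none (single device), layer (whole layers assigned per device), or row (each tensor's rows split across devices). A toy sketch of the layer-wise assignment under an even-split assumption (illustrative only, not llama.cpp's actual scheduler, which also weighs per-device memory via tensor_split):

```python
def split_layers(n_layers, n_devices):
    """Assign contiguous layer ranges to devices, as in split-mode 'layer'.
    Earlier devices receive one extra layer when the count does not
    divide evenly. Returns a list of half-open (start, end) ranges."""
    base, extra = divmod(n_layers, n_devices)
    assignment, start = [], 0
    for dev in range(n_devices):
        count = base + (1 if dev < extra else 0)
        assignment.append((start, start + count))
        start += count
    return assignment

print(split_layers(32, 2))  # [(0, 16), (16, 32)]
print(split_layers(33, 2))  # [(0, 17), (17, 33)]
```

Row mode instead slices every weight matrix across devices, which trades finer load balancing for more inter-GPU communication.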