Releases: teleprint-me/llama.cpp

b2234

22 Feb 06:55
973053d
llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci

b2230

21 Feb 16:17
89febfe
examples : do not assume BOS when shifting context (#5622)

b2217

20 Feb 16:44
9c405c9
Server: use llama_chat_apply_template (#5593)

* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <[email protected]>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <[email protected]>

b2181

18 Feb 18:16
c145f8a
server : slots monitoring endpoint (#5550)

b2167

16 Feb 21:55
5bf2b94
cmake : fix VULKAN and ROCm builds (#5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

b2134

12 Feb 19:34
099afc6
llama : fix quantization when tensors are missing (#5423)

b2128

11 Feb 22:44
3bdc4cd
CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

b2116

10 Feb 21:40
f026f81
metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using
`MTLCommandBuffer`; using `@autoreleasepool` is suggested in [1,2].

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

b2112

09 Feb 19:31
4b7b38b
vulkan: Set limit for task concurrency (#5427)

A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with "Too many open files".

    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files

This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.
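
One common way to bound task concurrency in asyncio is a semaphore. The sketch below illustrates that idea only; the limit of 64, the `compile_shader` coroutine, and the `glslc` invocation are illustrative assumptions, not the actual code of `ggml_vk_generate_shaders.py`.

    import asyncio

    # Illustrative cap; the value chosen in the script may differ.
    CONCURRENCY_LIMIT = 64

    async def compile_shader(sem: asyncio.Semaphore, name: str) -> None:
        """Hypothetical stand-in for one shader compilation task."""
        async with sem:
            # At most CONCURRENCY_LIMIT subprocesses exist at any moment,
            # so the process stays well below the open-file limit.
            proc = await asyncio.create_subprocess_exec(
                "glslc", f"{name}.comp", "-o", f"{name}.spv"
            )
            await proc.wait()

    async def main() -> None:
        sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
        names = [f"shader_{i}" for i in range(2000)]
        # gather() still schedules every task, but the semaphore keeps all
        # but CONCURRENCY_LIMIT of them waiting before they open anything.
        await asyncio.gather(*(compile_shader(sem, n) for n in names))

    if __name__ == "__main__":
        asyncio.run(main())

Launching the tasks in fixed-size batches would have a similar effect; either way, only a bounded number of compiler subprocesses hold file descriptors at once.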

b2103

08 Feb 18:11
6e99f2a
Fix f16_sycl cpy call from Arc (#5411)

* fix f16_sycl cpy call

* rm old logic

* add fp16 build CI

* use macro

* format fix