Releases: teleprint-me/llama.cpp

b2234

22 Feb 06:55
973053d
llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci

b2230

21 Feb 16:17
89febfe
examples : do not assume BOS when shifting context (#5622)

b2217

20 Feb 16:44
9c405c9
Server: use llama_chat_apply_template (#5593)

* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <[email protected]>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <[email protected]>

b2181

18 Feb 18:16
c145f8a
server : slots monitoring endpoint (#5550)

b2167

16 Feb 21:55
5bf2b94
cmake : fix VULKAN and ROCm builds (#5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

b2134

12 Feb 19:34
099afc6
llama : fix quantization when tensors are missing (#5423)

b2128

11 Feb 22:44
3bdc4cd
CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

b2116

10 Feb 21:40
f026f81
metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using
`MTLCommandBuffer`; using `@autoreleasepool` is suggested in [1,2].

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

b2112

09 Feb 19:31
4b7b38b
vulkan: Set limit for task concurrency (#5427)

A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with "Too many open files".

    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files

This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.
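
One common way to bound task concurrency in asyncio is a semaphore. The sketch below illustrates that idea only; the limit of 64, the `compile_shader` coroutine, and the `glslc` invocation are illustrative assumptions, not the actual code of `ggml_vk_generate_shaders.py`.

    import asyncio

    # Illustrative cap; the value chosen in the script may differ.
    CONCURRENCY_LIMIT = 64

    async def compile_shader(sem: asyncio.Semaphore, name: str) -> None:
        """Hypothetical stand-in for one shader compilation task."""
        async with sem:
            # At most CONCURRENCY_LIMIT subprocesses exist at any moment,
            # so the process stays well below the open-file limit.
            proc = await asyncio.create_subprocess_exec(
                "glslc", f"{name}.comp", "-o", f"{name}.spv"
            )
            await proc.wait()

    async def main() -> None:
        sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
        names = [f"shader_{i}" for i in range(2000)]
        # gather() still schedules every task, but the semaphore keeps all
        # but CONCURRENCY_LIMIT of them waiting before they open anything.
        await asyncio.gather(*(compile_shader(sem, n) for n in names))

    if __name__ == "__main__":
        asyncio.run(main())

Launching the tasks in fixed-size batches would have a similar effect; either way, only a bounded number of compiler subprocesses hold file descriptors at once.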

b2103

08 Feb 18:11
6e99f2a
Fix f16_sycl cpy call from Arc (#5411)

* fix f16_sycl cpy call

* rm old logic

* add fp16 build CI

* use macro

* format fix