merge from upstream #49

l3utterfly · 2024-12-28T07:22:17Z

Make sure to read the contributing guidelines before submitting a PR

…ov#10419)

* bug-fix: snprintf prints NULL in place of the last character

  We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.

* add comment about extra null-term byte requirement
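The snprintf fix described above comes down to sizing the buffer for the formatted characters plus the terminating null byte. A minimal sketch of that pattern follows; `format_double` is a hypothetical helper written for illustration and is not the code changed in the upstream commit.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration, not from the commit):
// formats a double with two decimals into a std::string.
std::string format_double(double value) {
    // With a null buffer and size 0, snprintf returns the number of
    // characters it would write, excluding the terminating null byte.
    const int len = std::snprintf(nullptr, 0, "%.2f", value);
    if (len < 0) {
        return std::string(); // formatting error
    }
    // Allocate len + 1 bytes so there is room for the last character
    // and the null terminator; with only len bytes, the last character
    // would be replaced by the terminator.
    std::vector<char> buf(static_cast<size_t>(len) + 1);
    std::snprintf(buf.data(), buf.size(), "%.2f", value);
    // Ignore the extra byte when converting to std::string.
    return std::string(buf.data(), static_cast<size_t>(len));
}
```

Calling `format_double(3.14159)` yields the string `3.14`; passing a buffer of only `len` bytes would truncate it to `3.1`.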

…oups for coopmats (ggerganov#10721)

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
* Fix subgroup size control extension support check

  Add accf32 and accf16 checks for coopmats

* Also disable coopmats on amdvlk

* faster uncontiguous concat
* Use a lambda to avoid code duplication

  Co-authored-by: Diego Devesa

* Update ggml/src/ggml-cuda/concat.cu
* add constexpr and static assert

---------

Co-authored-by: Diego Devesa

* Try to reduce some unused and typecast warnings
* Reduce compiler warnings step 2
* add a newline at the end of the file
* Initialize nreduce as size_t
* [SYCL] Remove pragma directives from mmq.cpp
* SYCL: mmq add condition to prevent blocks_per_tile_x_row variable from becoming 0
* SYCL softmax: Initialize nreduce as size_t
* ggml-sycl.cpp: fix some trailing whitespaces
* SYCL: remove the unused variables instead of commenting it out
* SYCL poo2d kernel: set NAN for invalid pooling op
* SYCL gemm.hpp: remove pragma directives
* SYCL gemm.hpp: use const cast to properly support dnnl::memory
* SYCL: wkv6 remove a comment
* SYCL: clean comments step 2
* SYCL: clean comments and variables step 3
* SYCL: Use GGML_UNUSED for unused variables
* SYCL: remove extra empty lines and a comment
* Remove TODO
* cleanup spaces
* add a stdout for unsupported op
* use sycl printf over fprintf
* remove prints for CI
* SYCL ggml-sycl: pool2D use sycl::nan and remove if-else block

---------

Co-authored-by: Abhilash Majumder

* double the number of rows per workgroup
* Update ggml-vulkan.cpp
* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
* only increase the number of rows for amd and subgroup size 64
* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested
* use subgroup min and max to check for gcn (requires ggerganov#10721)
* manual merge ggml-vulkan.cpp
* set min and max subgroup size in any case
* Also double the number of rows for Intel GPUs

…ctivity (ggerganov#10812)

* Fix crash caused by ggml_backend_load_all when launching on AndroidActivity.

  Details: Calling ggml_backend_load_all during initialization in the AndroidActivity project leads to a crash with the error: terminating with uncaught exception of type std::__ndk1::__fs::filesystem::filesystem_error: filesystem error: in directory_iterator::directory_iterator(...): Permission denied [./]. This issue occurs because AndroidActivity restricts file access due to sandboxing.

  Reproduction: In the example folder, the LlamaAndroid project can reproduce the crash by calling ggml_backend_load_all first in Java_android_llama_cpp_LLamaAndroid_backend_1init.

* Update ggml/src/ggml-backend-reg.cpp

---------

Co-authored-by: Diego Devesa

Added support for positional arguments `model` and `prompt`.

Added functionality to download via strings like:

    llama-run llama3
    llama-run ollama://granite-code
    llama-run ollama://granite-code:8b
    llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
    llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
    llama-run https://example.com/some-file1.gguf
    llama-run some-file2.gguf
    llama-run file://some-file3.gguf

Signed-off-by: Eric Curtin
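The model strings in the llama-run commit message differ mainly in their URL scheme. A minimal sketch of how such strings could be classified by prefix is shown here; the names `model_source` and `classify_model_arg` are assumptions made for illustration and are not taken from the llama-run source, and the sketch deliberately leaves open how a bare name such as `llama3` or `some-file2.gguf` is resolved.

```cpp
#include <string>

// Hypothetical classification of the `model` argument (not llama-run's own types).
enum class model_source { huggingface, ollama, https, file, unspecified };

// Returns true if s begins with prefix.
static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Maps a model string to a source based only on its scheme prefix.
model_source classify_model_arg(const std::string & arg) {
    if (starts_with(arg, "hf://") || starts_with(arg, "huggingface://")) {
        return model_source::huggingface;
    }
    if (starts_with(arg, "ollama://")) {
        return model_source::ollama;
    }
    if (starts_with(arg, "https://")) {
        return model_source::https;
    }
    if (starts_with(arg, "file://")) {
        return model_source::file;
    }
    // No recognized scheme: the caller decides whether this is a local
    // path or a name to look up in a default registry.
    return model_source::unspecified;
}
```

For example, `classify_model_arg("ollama://granite-code:8b")` returns `model_source::ollama`, while `classify_model_arg("llama3")` returns `model_source::unspecified`.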

…eno GPUs (ggerganov#10693)

* [cl][adreno] Add Adreno GPU support

  Add new OpenCL backend to support Adreno GPUs

  ---------

  Co-authored-by: Skyler Szot
  Co-authored-by: Shangqing Gu
  Co-authored-by: Alexander Angus
  Co-authored-by: Hongqiang Wang
  Co-authored-by: Max Krasnyansky

* [cl][ci] Add workflow for CL
* [cl][adreno] Fix memory leak for non SMALL_ALLOC path
* opencl: integrate backend dyn.load interface and fix compiler and format warnings
* opencl: remove small-alloc support and fix build errors for non-opencl platforms
* opencl: fixed merge conflict (MUSA added twice in cmake)
* opencl-ci: use RUNNER_TEMP instead of github.workspace
* opencl: fix embed tool invocation with python3
* opencl: CI workflow fixes
* opencl: Clean up small-alloc in CMake files
* opencl: cleanup ggml-opencl2 header file
* opencl: use ulong for offsets and strides in ADD kernel
* opencl: use cl_ulong for all offsets
* opencl: use cl_ulong for sizes and strides
* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`
* opencl: rename backend `opencl2` -> `opencl`
* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`
* opencl: make OpenCL required, remove redundant lib and inc directories
* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`
* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`
* opencl: remove copyright marker since main license already covers
* opencl: replace some more OPENCL2 leftovers
* opencl: remove limits on `tensor_extra`
* opencl: use pools for `tensor_extra`
* opencl: fix compiler warnings with GCC and Clang

  Still getting the warning about clCreateCmdQueue being obsolete. Will fix that separately.

* opencl: fail gracefully if opencl devices are not available

  Also for unsupported GPUs.

* opencl: fix MSVC builds (string length error)
* opencl: check for various requirements, allow deprecated API
* opencl: update log message for unsupported GPUs

---------

Co-authored-by: Skyler Szot
Co-authored-by: Shangqing Gu
Co-authored-by: Alexander Angus
Co-authored-by: Hongqiang Wang
Co-authored-by: Max Krasnyansky

* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review

  Co-authored-by: Georgi Gerganov

* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov

…nov#10808)

* add code highlighting and math formatting
* code cleanup
* build public/index.html
* rebuild public/index.html
* fixed coding style
* fixed coding style
* style fixes
* highlight: smaller bundle size, fix light & dark theme
* remove katex
* add bundle size check
* add more languages
* add php
* reuse some langs
* use gzip
* Revert "remove katex"

  This reverts commit c0e5046.

* use better maintained @vscode/markdown-it-katex
* fix gzip non deterministic
* ability to add a demo conversation for dev
* fix latex rendering
* add comment
* latex codeblock as code

---------

Co-authored-by: Xuan Son Nguyen

* more perfo with llamafile tinyblas on x86_64.
  - add bf16 suport
  - change dispache strategie (thanks: ikawrakow/ik_llama.cpp#71 )
  - reduce memory bandwidth simple tinyblas dispache and more cache freindly
* tinyblas dynamic dispaching
* sgemm: add M blocs.
* - git 2.47 use short id of len 9.
  - show-progress is not part of GNU Wget2
* remove not stable test

Warning types fixed (observed under MSYS2 GCC 14.2.0):

* format '%ld' expects argument of type 'long int', but argument has type 'size_t'
* llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)

CentricStorm and others added 30 commits December 11, 2024 11:47

docs: fix server documentation formatting (ggerganov#10776)

4b4d92b

ci : pin nodejs to 22.11.0 (ggerganov#10779)

92f77a6

Update README.md (ggerganov#10772)

1a31d0d

server : (UI) add tok/s, get rid of completion.js (ggerganov#10786)

235f6e1

* get rid of completion.js
* extract chat bubble to a component
* add tok/s info
* sync
* fix BASE_URL
* only extract timings when it's enabled
* fix auto scroll

gguf-py : bump version to 0.11.0

fb18934

Merge pull request ggerganov#10788 from ggerganov/gg/gguf-py-0.11.0

973f328

docs: update server streaming mode documentation (ggerganov#9519)

5555c0c

Provide more documentation for streaming mode.

common : add missing env var for speculative (ggerganov#10801)

9fdb124

Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (ggerganov#10798)

4064c0e

remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (ggerganov#10797)

cb13ef8

other windows build fixes

contrib : add ngxson as codeowner (ggerganov#10804)

274ec65

common : improve -ctv -ctk CLI arguments (ggerganov#10806)

adffa6f

* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector

ggml : Fix compilation issues on ARM platform when building without fp16 (ggerganov#10811)

d583cd0

gguf-py : numpy 2 newbyteorder fix (ggerganov#9772)

4601a8b

fix: graceful shutdown for Docker images (ggerganov#10815)

11e07fd

Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default (ggerganov#10771)

56eea07

Signed-off-by: Charles Darke
Co-authored-by: Charles Darke

nix: allow to override rocm gpu targets (ggerganov#10794)

e52aba5

This allows to reduce compile time when you are building for a single GPU.

server: Fix has_next_line in JSON response (ggerganov#10818)

89d604f

* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test

gguf-py : bump to v0.13.0

b5ae1dd

scripts : change build path to "build-bench" for compare-commits.sh (ggerganov#10836)

87cf323

dixyes and others added 14 commits December 23, 2024 01:35

llama : support InfiniAI Megrez 3b (ggerganov#10893)

b92a14a

* Support InfiniAI Megrez 3b
* Fix tokenizer_clean_spaces for megrez

rpc-server : add support for the SYCL backend (ggerganov#10934)

86bf31c

server : add system_fingerprint to chat/completion (ggerganov#10917)

485dc01

* server : add system_fingerprint to chat/completion
* update README

server : fix missing model id in /model endpoint (ggerganov#10957)

14b699e

* server : fix missing model id in /model endpoint
* fix ci

ggml : fix const usage in SSE path (ggerganov#10962)

32d6ee6

ggml : fix arm enabled features check (ggerganov#10961)

3327bb0

ggml : use wstring for backend search paths (ggerganov#10960)

60cfa72

ggml-ci

llama : the WPM vocabs use the CLS token as BOS (ggerganov#10930)

30caac3

* llama : the WPM vocabs use the CLS token as BOS

  ggml-ci

* llama : add comment

server: allow filtering llama server response fields (ggerganov#10940)

09fe2e7

* llama_server_response_fields
* llama_server_response_fields_fix_issues
* params fixes
* fix
* clarify docs
* change to "response_fields"

---------

Co-authored-by: Xuan Son Nguyen

server : add support for "encoding_format": "base64" to the */embeddings endpoints (ggerganov#10967)

9ba399d

* add support for base64
* fix base64 test
* improve test

---------

Co-authored-by: Xuan Son Nguyen

vulkan: multi-row k quants (ggerganov#10846)

d79d8f3

* multi row k quant shaders!
* better row selection
* more row choices
* readjust row selection
* rm_kq=2 by default

Merge branch 'layla-build' into merge

5c8aa73

l3utterfly merged commit 3e9852b into layla-build Dec 28, 2024
3 of 6 checks passed

l3utterfly deleted the merge branch December 28, 2024 07:23

github-actions bot added the SYCL, Nvidia GPU, Vulkan, testing, examples, devops, python, android, server, ggml, Kompute, Apple Metal, script, and nix labels Dec 28, 2024
