merge upstream #43

l3utterfly · 2024-10-27T07:53:59Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

* examples : do not use common library in simple example * add command line parser, simplify code

* mtgpu: add docker image support Signed-off-by: Xiaodong Ye <[email protected]> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

* rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server

* common : use common_ prefix for common library functions --------- Co-authored-by: Georgi Gerganov <[email protected]>

* ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print

Signed-off-by: Xiaodong Ye <[email protected]>

…#9798) * llama : improve infill support ggml-ci * llama : add more FIM token strings ggml-ci * server : update prompt on slot restore (ggerganov#9800) * gguf : deprecate old FIM token KVs

* server : remove legacy system_prompt feature ggml-ci * readme : update [no ci] * server : fix non-transformer logic + remove response from /props

* server : remove self-extend ggml-ci * server : fix context limit check to use slot.n_past ggml-ci

ggml-ci

Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04) → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

ggml-ci

* server : accept extra_context for the infill endpoint ggml-ci * server : update readme [no ci] * server : use repo-level FIM pattern if possible ggml-ci

* Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

Co-authored-by: Gimling <[email protected]>

@slaren

* Initial XTC commit
  Adds the XTC sampler. It is not activated by default, but the recommended settings are the defaults.
* Cleanup
* Simplified chances calculation
  To be more in line with the original implementation, chance is calculated once at the beginning.
* First fixes by comments
  Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproducible
  Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
  Moved from conditions to a simple check at the end.
* Fixed broken randomization
  Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
  Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework
  1. Scan tokens from the top until the first non-penalizable one
  2. Remove the last captured token (the least probable above threshold)
  3. Shift all tokens to overwrite the remaining penalizable ones
  4. Penalize and put them at the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
  * probability is at 0 by default, but XTC is included in sampling queue
  * threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
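For readers unfamiliar with XTC, the behaviour described in these commit notes can be illustrated roughly as follows. This is a minimal Python sketch, not the C++ code merged here: the function name `xtc_filter`, the `(token, prob)` pair layout, and the `rng` argument are assumptions made for illustration. It draws the chance once up front, treats a threshold above 0.5 as "off", removes every token at or above the threshold except the least probable of them, and applies the `min_keep` check at the end.

```python
import random


def xtc_filter(candidates, threshold, probability, min_keep=1, rng=None):
    """Illustrative XTC step over a list of (token, prob) pairs.

    Hypothetical helper, not the llama.cpp API. Returns the surviving
    candidates sorted by descending probability (not renormalized).
    """
    rng = rng or random.Random()
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)

    # Probability 0 (the default in the notes) or a threshold above 0.5
    # means the sampler does nothing: two tokens cannot both exceed 0.5.
    if probability <= 0.0 or threshold > 0.5 or len(ranked) < 2:
        return ranked

    # The chance is drawn once, at the beginning.
    if rng.random() >= probability:
        return ranked

    # Scan from the top until the first token below the threshold.
    n_above = 0
    for _, prob in ranked:
        if prob < threshold:
            break
        n_above += 1

    # With fewer than two tokens above the threshold there is nothing to cut.
    if n_above < 2:
        return ranked

    # Keep the least probable token above the threshold and everything below it.
    kept = ranked[n_above - 1:]

    # Simple min_keep check at the end: back out if too few tokens remain.
    return kept if len(kept) >= min_keep else ranked
```

With `probability=1.0` the cut is always attempted, which makes the sketch easy to check by hand; renormalizing the surviving probabilities is left to whatever step follows in the sampling chain.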

ggml-ci

Fix the CANN compilation error that appeared after merging support for dynamically loadable backends into llama.cpp.

This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field, and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

Signed-off-by: Molly Sophia <[email protected]>

…anov#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both).

This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
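The precedence problem described above can be checked mechanically. The following Python sketch is only an illustration and not the converter code touched by the commit: `string_pattern_rule` and `top_level_alternatives` are hypothetical helpers. The first builds the rule text with or without the grouping parentheses; the second splits a rule body on `|` outside quotes and parentheses, which shows which pieces the grammar treats as separate alternatives.

```python
def string_pattern_rule(name, pattern_body, grouped=True):
    """Build a GBNF-style rule for a quoted string (hypothetical helper).

    pattern_body is an already-translated fragment such as '"A" | "B"'.
    """
    body = f"({pattern_body})" if grouped else pattern_body
    return f'{name} ::= "\\"" {body} "\\"" space'


def top_level_alternatives(rule_body):
    """Split a rule body on '|' that is outside quotes and parentheses."""
    alternatives, current = [], ""
    depth, in_string, i = 0, False, 0
    while i < len(rule_body):
        ch = rule_body[i]
        if in_string:
            current += ch
            if ch == "\\":
                # Copy the escaped character so \" does not close the literal.
                i += 1
                current += rule_body[i]
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            current += ch
        elif ch == "|" and depth == 0:
            alternatives.append(current.strip())
            current = ""
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current += ch
        i += 1
    alternatives.append(current.strip())
    return alternatives


fragment = '"A" | "B" | "C" | "D"'
ungrouped = string_pattern_rule("thing", fragment, grouped=False)
grouped = string_pattern_rule("thing", fragment)
```

Splitting the ungrouped body yields four alternatives, of which only the first carries the opening quote and only the last carries the closing quote; the grouped body is a single alternative, so both quotes apply to every branch.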

* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX. The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.

…anov#9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages working
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment

* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code

…rganov#10023)
* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere

…10030) ggml-ci

* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces * llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string

* sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <[email protected]> Co-authored-by: pi6am <[email protected]>

* metal : support permuted matrix multiplications ggml-ci * cont : use nb01 directly for row steps ggml-ci * cont : add comments [no ci] * metal : minor refactor * metal : minor

Co-authored-by: bssrdf <[email protected]>

slaren and others added 30 commits October 10, 2024 19:50

examples : do not use common library in simple example (ggerganov#9803)

c7499c5

* examples : do not use common library in simple example * add command line parser, simplify code

musa: add docker image support (ggerganov#9685)

cf8e0a3

* mtgpu: add docker image support Signed-off-by: Xiaodong Ye <[email protected]> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

rpc : add backend registry / device interfaces (ggerganov#9812)

0e9f760

* rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server

common : use common_ prefix for common library functions (ggerganov#9805)

7eee341

* common : use common_ prefix for common library functions --------- Co-authored-by: Georgi Gerganov <[email protected]>

ggml : move more prints to the ggml log system (ggerganov#9839)

9677640

* ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print

musa : update doc (ggerganov#9856)

943d20b

Signed-off-by: Xiaodong Ye <[email protected]>

llama : improve infill support and special token detection (ggerganov…

11ac980

…#9798) * llama : improve infill support ggml-ci * llama : add more FIM token strings ggml-ci * server : update prompt on slot restore (ggerganov#9800) * gguf : deprecate old FIM token KVs

server : remove legacy system_prompt feature (ggerganov#9857)

95c76e8

* server : remove legacy system_prompt feature ggml-ci * readme : update [no ci] * server : fix non-transformer logic + remove response from /props

server : remove self-extend features (ggerganov#9860)

1bde94d

* server : remove self-extend ggml-ci * server : fix context limit check to use slot.n_past ggml-ci

server : add option to time limit the generation phase (ggerganov#9865)

edc2656

ggml-ci

server : reuse cached context chunks (ggerganov#9866)

c7181bd

ggml-ci

server : accept extra_context for the infill endpoint (ggerganov#9874)

d4c19c0

* server : accept extra_context for the infill endpoint ggml-ci * server : update readme [no ci] * server : use repo-level FIM pattern if possible ggml-ci

server : handle "logprobs" field with false value (ggerganov#9871)

a89f75e

Co-authored-by: Gimling <[email protected]>

readme : update bindings list (ggerganov#9889)

4c42f93

server : update preact (ggerganov#9895)

dcdd535

server : improve infill context reuse (ggerganov#9894)

223c25a

ggml-ci

llama : add infill sampler (ggerganov#9896)

755a9b2

ggml-ci

[CANN] Fix cann compilation error (ggerganov#9891)

becfd38

Fix the CANN compilation error that appeared after merging support for dynamically loadable backends into llama.cpp.

sync : ggml

0e41b30

server : fix the disappearance of the end of the text (ggerganov#9867)

1f66b69

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

llama : add tensor name for "result_norm" (ggerganov#9907)

10433e8

Signed-off-by: Molly Sophia <[email protected]>

llava : fix typo in error message [no ci] (ggerganov#9884)

dbf18e4

fix: allocating CPU buffer with size 0 (ggerganov#9917)

2194200

JohannesGaessler and others added 16 commits October 23, 2024 16:50

CUDA: fix 1D im2col, add tests (ggml/993)

80273a3

llama.vim : bump generation time limit to 3s [no ci]

2d3aba9

sync : ggml

190a37d

server : samplers accept the prompt correctly (ggerganov#10019)

0a1c750

CUDA: fix MMQ for non-contiguous src0, add tests (ggerganov#10021)

c39665f

* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code

CUDA: fix insufficient buffer clearing for MMQ (ggerganov#10032)

167a515

ci : fix cmake flags for SYCL

40f2555

server : check that the prompt fits in the slot's context (ggerganov#…

bc5ba00

…10030) ggml-ci

llamafile : extend sgemm.cpp support for Q5_0 models (ggerganov#10010)

2f8bd2b

llama: string_split fix (ggerganov#10022)

d80fb71

* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces * llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string

metal : support permuted matrix multiplications (ggerganov#10033)

6687503

* metal : support permuted matrix multiplications ggml-ci * cont : use nb01 directly for row steps ggml-ci * cont : add comments [no ci] * metal : minor refactor * metal : minor

scripts : fix amx sync [no ci]

9e4a256

increase cuda_cpy block size (ggml/996)

8c60a8a

Co-authored-by: bssrdf <[email protected]>

sync : ggml

cc2983d

l3utterfly merged commit 35e499a into layla-build Oct 27, 2024
68 checks passed

github-actions bot added the documentation, SYCL, Nvidia GPU, Vulkan, testing, build, examples, devops, python, android, server, ggml, and script labels Oct 27, 2024
