[pull] master from ggerganov:master #145

pull · 2024-10-12T23:13:50Z

See Commits and Changes for more details.



Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>


* Vectorize load instructions in dmmv f16 CUDA kernel
  Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
  Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>


@slaren

* Initial XTC commit
  Adds XTC sampler, not activated by default, but recommended settings by default.
* Cleanup
* Simplified chances calculation
  To be more in line with the original implementation, chance is calculated once at the beginning.
* First fixes by comments
  Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproducible
  Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
  Moved from conditions to a simple check at the end.
* Fixed broken randomization
  Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
  Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework
  1. Scan token from top till the first non-penalizable
  2. Remove the last captured token (the least probable above threshold)
  3. Shift all tokens to override the remaining penalizable
  4. Penalize and put them at the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
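The XTC notes above can be illustrated with a minimal C++ sketch. It is not the code from this commit series: the `Candidate` type, the `xtc_apply` function, and its parameter names are hypothetical, and the behavior is an assumption pieced together from the notes (chance rolled once at the start, `probability` of 0 meaning off, a threshold above 0.5 disabling the sampler, and `min_keep` checked at the end). It assumes the candidates are already normalized and sorted by descending probability, and it drops every candidate at or above the threshold except the least probable of them.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical candidate type for illustration; not the llama.cpp struct.
struct Candidate {
    int   id; // token id
    float p;  // probability, assumed already normalized
};

// Sketch of an XTC-style filter. `cands` must be sorted by descending p.
// Returns true if any candidates were removed.
bool xtc_apply(std::vector<Candidate> & cands, float probability, float threshold,
               std::size_t min_keep, std::mt19937 & rng) {
    // Precondition: candidates are in descending probability order.
    for (std::size_t i = 1; i < cands.size(); ++i) {
        assert(cands[i - 1].p >= cands[i].p);
    }

    // Off when probability is 0. A threshold above 0.5 also disables it,
    // because two candidates cannot both exceed 0.5 in a normalized list.
    if (probability <= 0.0f || threshold > 0.5f || cands.size() < 2) {
        return false;
    }

    // Roll the chance once, up front.
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    if (dist(rng) > probability) {
        return false;
    }

    // Find the last (least probable) candidate still at or above the threshold.
    std::size_t pos_last = 0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].p >= threshold) {
            pos_last = i;
        } else {
            break;
        }
    }

    // Nothing to drop if fewer than two candidates reach the threshold;
    // also refuse to go below min_keep remaining candidates.
    if (pos_last == 0 || cands.size() - pos_last < min_keep) {
        return false;
    }

    // Drop every candidate ranked above the last one that met the threshold.
    cands.erase(cands.begin(), cands.begin() + static_cast<std::ptrdiff_t>(pos_last));
    return true;
}
```

Passing the random engine in by reference keeps the result reproducible for a fixed seed, which matches the intent of the RNG fix listed above. The remaining candidates are not renormalized here; a caller would need to do that before sampling.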


yeahdongcn and others added 5 commits October 12, 2024 08:09

musa : update doc (#9856)

943d20b

Signed-off-by: Xiaodong Ye <[email protected]>

llama : improve infill support and special token detection (#9798)

11ac980

* llama : improve infill support
  ggml-ci
* llama : add more FIM token strings
  ggml-ci
* server : update prompt on slot restore (#9800)
* gguf : deprecate old FIM token KVs

server : remove legacy system_prompt feature (#9857)

95c76e8

* server : remove legacy system_prompt feature
  ggml-ci
* readme : update [no ci]
* server : fix non-transformer logic + remove response from /props

server : remove self-extend features (#9860)

1bde94d

* server : remove self-extend
  ggml-ci
* server : fix context limit check to use slot.n_past
  ggml-ci

server : add option to time limit the generation phase (#9865)

edc2656

ggml-ci

github-actions bot added the documentation, examples, python, and server labels Oct 12, 2024

pull bot added the ⤵️ pull label and removed the documentation, examples, python, and server labels Oct 13, 2024

github-actions bot added the documentation, examples, python, and server labels Oct 13, 2024

ggerganov and others added 3 commits October 13, 2024 18:52

server : reuse cached context chunks (#9866)

c7181bd

ggml-ci

server : accept extra_context for the infill endpoint (#9874)

d4c19c0

* server : accept extra_context for the infill endpoint
  ggml-ci
* server : update readme [no ci]
* server : use repo-level FIM pattern if possible
  ggml-ci

github-actions bot added the Nvidia GPU label Oct 14, 2024

VoidIsVoid and others added 4 commits October 14, 2024 10:04

server : handle "logprobs" field with false value (#9871)

a89f75e

Co-authored-by: Gimling <[email protected]>

readme : update bindings list (#9889)

4c42f93

server : update preact (#9895)

dcdd535

github-actions bot added the testing label Oct 15, 2024

ggerganov added 2 commits October 15, 2024 16:28

server : improve infill context reuse (#9894)

223c25a

ggml-ci

llama : add infill sampler (#9896)

755a9b2

ggml-ci

teleprint-me closed this Oct 15, 2024
