merged from upstream #8

l3utterfly · 2024-04-16T01:57:38Z

No description provided.

…okens usage in stream OAI response (ggerganov#6495)
* ci: bench: support sse and fix prompt processing time
  server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate

* Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements.
* Fixing whitespace errors and cleaning error message alert to be clearer.
* Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API.
* Comment cleanup.
* Reorganizing tests for readability.
* Cleaning up debug message to make a bit more sense.

Flake lock file updates:
• Updated input 'flake-parts':
  'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
  → 'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d' (2024-04-01)
• Updated input 'flake-parts/nixpkgs-lib':
  'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
  → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib' (2024-03-29)
• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
  → 'github:NixOS/nixpkgs/fd281bd6b7d3e32ddfa399853946f782553163b5' (2024-04-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* llama : save and restore kv cache for single seq id
* remove trailing whitespace
* respond error in case there's no space in the kv cache
* add kv seq save restore to test case
* add --slot-save-path arg to enable save restore and restrict save location
* Returning 0 for some cases, instead of asserting.
* cleanup error cases
* rename sequence state functions
* rename state get set functions
* add previous function names back in with DEPRECATED notice
* update doc
* adjust endpoints to preferred style
* fix restoring zero cell count
* handle seq rm return value
* unused param
* keep in the size check
* fix return types
* add server test case for slot save restore
* cleanup
* add cake
* cleanup style
* add special
* removing a whole sequence never fails
* move sequence state file functionality from server to llama to match session api and add version tags
* catch exceptions on save as well
* error log messages
* check types for stricter restore
* update server doc
* readme : update API changes date
* strict filename validation
* move include, reject bom as well
* also reject empty filename
* reject whitespace and trailing dot

Co-authored-by: Martin Evans <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

@compilade

* llama_sampling_sample with default args is more naively usable
* Batches populated by either llama_batch_get_one or llama_batch_add work with default args
* Previously get_one could use the default argument
* Previously add should usually have used the last index where logits[idx] == true
* This hopefully encourages the use of llama_batch_add
* By giving expected results when using default arguments.
* Adds "negative indexing" feature to llama_get_logits_ith and llama_get_embeddings_ith
* Believed to work with any currently well behaved program
* Default arg now works for both cases (previously would give strange results for add case)
* Any non-negative number is unaffected and behaves as previously
* Negative arguments were previously invalid.
* Implemented as a special case of indexing as suggested by @compilade in ggerganov#6519
* Fixed mismatch type errors
* cited in macOS CI tests
* Missed in original updates based on PR feedback in ggerganov#6519
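
The "negative indexing" idea in this commit can be illustrated with a small Python sketch. It is not the llama.cpp implementation: `resolve_output_index` and `n_outputs` are hypothetical names, and the sketch assumes a negative index simply counts back from the last available output row, the way Python list indexing does.

```python
def resolve_output_index(i: int, n_outputs: int) -> int:
    """Map a possibly negative index to a row in [0, n_outputs).

    Assumption for illustration: -1 means "the last output",
    -2 the one before it, and so on. Non-negative indices pass
    through unchanged.
    """
    j = n_outputs + i if i < 0 else i
    if not 0 <= j < n_outputs:
        raise IndexError(f"index {i} out of range for {n_outputs} outputs")
    return j


if __name__ == "__main__":
    print(resolve_output_index(-1, 4))  # last row
    print(resolve_output_index(2, 4))   # unchanged
```

With this convention a caller that always passes -1 reads the last output regardless of how the batch was populated, which is the behaviour the commit message describes for the default argument.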

* llama : fix attention layer count sanity check
* llama : fix parentheses in attention layer count sanity check
  There was otherwise a warning when compiling.

Co-authored-by: Francis Couture-Harpin <[email protected]>

* Add Command R Plus GGUF
* Add Command R Plus GGUF
* Loading works up to LayerNorm2D
* Export new tensors in 1D so they are not quantized.
* Fix embedding layer based on Noeda's example
* Whitespace
* Add line
* Fix unexpected tokens on MPS. Re-add F16 fix. (Noeda)
* dranger003: Fix block index overflow in CUDA dequantizing.
* Reverted blocked multiplication code as it still has issues and could affect other Llama arches
* export norms as f32
* fix overflow issues during quant and other cleanup
* Type convention
  Co-authored-by: Georgi Gerganov <[email protected]>
* dranger003: Fix more int overflow during quant.

Co-authored-by: S <[email protected]>
Co-authored-by: S <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* docs: how to add a model
* docs: model: typo and docs
* docs: model: add prevision on RoPE
* docs: model: rephrasing README.md
* docs: model: rephrasing README.md
* docs: model: README.md fix trailing spaces
* docs : some fixes
* Update README.md

Co-authored-by: Georgi Gerganov <[email protected]>

…/ reuses) (ggerganov#6609)
* grammars: reserve rejects & next candidates
* grammars: reuse new_stacks
* grammars: fix missing sig change in llama.h
* grammars: fix test (api changed)
* grammars: update gbnf-validator.cpp
* grammars: simpler syntax (no swap)

* Refactor Error Handling for CUDA
  Add guidance for setting CUDA_DOCKER_ARCH to match GPU compute capability for CUDA versions < 11.7. Include link to NVIDIA's CUDA GPUs documentation for compute capability reference.
* Update Makefile
  Improved wording
  Co-authored-by: Johannes Gäßler <[email protected]>

Co-authored-by: Johannes Gäßler <[email protected]>

…ganov#6591)
* Remove split metadata when quantize model shards
* Find metadata key by enum
* Correct loop range for gguf_remove_key and code format
* Free kv memory

Co-authored-by: z5269887 <[email protected]>

* infill : add download instructions for model
  This commit adds instructions on how to download a CodeLlama model using the `hf.sh` script. This will download the model and place it in the `models` directory, which is the same model used later by the infill example.
  Signed-off-by: Daniel Bevenius <[email protected]>
* squash! infill : add download instructions for model
  Clarify the reason for using CodeLlama.
  Signed-off-by: Daniel Bevenius <[email protected]>

Signed-off-by: Daniel Bevenius <[email protected]>

…ngs, cap number length (ggerganov#6555)
* json: rename python schema converter to make import easier
* server: skip null json_schema / grammar fields
* json: deps management for primitive rules (+ allow null values)
* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`
* grammars: add troubleshooting section to readme
* json: cap length of numbers to 15 digits before/after decimal point (avoids infinite gen, e.g. "one third" -> `0.333333333333...`)
* json: unify all repetition code (w/ or w/o sep)
* json: support string minLength/maxLength
* server+json: update server/README w/ result_format
* nits
* json: fix type error w/ python 3.8
* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)
* json: simplify DOT `{"type": "string", "pattern": "^.$"}`
* json: remove recursion in opt_repetitions (avoids Python stack overflow)
* json: rm dead code
* json: rm useless assert & ggml.h import
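
A minimal Python sketch of the repetition rewrite mentioned in this commit, assuming the goal is to match between zero and `max_items` copies of an item. The helper names `flat_repetition` and `nested_repetition` are hypothetical and are not taken from the converter's code; the loop-based construction is only one way to avoid recursion.

```python
def flat_repetition(item: str, max_items: int) -> str:
    """Old shape: a sequence of independent optionals, e.g. x? x? x?"""
    return " ".join(f"{item}?" for _ in range(max_items))


def nested_repetition(item: str, max_items: int) -> str:
    """New shape: each optional group wraps the rest, e.g. (x (x (x)?)?)?

    Built iteratively from the innermost group outwards, so no
    recursion is needed even for large max_items.
    """
    rule = ""
    for _ in range(max_items):
        rule = f"({item} {rule})?" if rule else f"({item})?"
    return rule


if __name__ == "__main__":
    print(flat_repetition('"a"', 3))
    print(nested_repetition("a", 3))
```

In the flat form a single `a` can be matched by any one of the three optionals, so the number of candidate parses grows quickly with the bound. In the nested form a second item is only reachable after the first has matched, which leaves one way to match each count.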

* model: dbrx convert to gguf ggerganov#6344
* llama: support dbrx ggerganov#6344
* doc: dbrx: add the model as supported
* scripts: get-wikitext-2 add unzip
* llama: increase maximum experts allowed
* llama: factorize moe graph implementation between grok, mixtral and dbrx

Co-authored-by: Megha Agarwal <[email protected]>

)
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax
* refactor to disable mmap for SYCL backend
* fix compile error in other os
* refactor the solution, use host buf to fix it, instead of disable mmap
* keep to support mmap()
* use host buff to reduce malloc times
* revert to malloc/free solution, for thread safe

* Fix --split-max-size
  Byte size calculation was done on int and overflowed.
* add tests.sh
* add examples test scripts to ci run
  Will autodiscover examples/*/tests.sh scripts and run them.
* move WORK_PATH to a subdirectory
* clean up before and after test
* explicitly define which scripts to run
* add --split-max-size to readme
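
The kind of overflow described for `--split-max-size` can be illustrated in Python by forcing the result into a fixed-width integer with `ctypes`. This is only a sketch: the function names and the gibibyte values are assumptions chosen for illustration, the actual fix lives in C++ code not shown here, and signed overflow in C or C++ is undefined behaviour, so the wrap-around shown is just the typical outcome.

```python
import ctypes

GIB = 1024 ** 3  # bytes per gibibyte


def bytes_as_int32(count: int, unit: int) -> int:
    """Mimic a size computed in a 32-bit signed int (wraps silently)."""
    return ctypes.c_int32(count * unit).value


def bytes_as_int64(count: int, unit: int) -> int:
    """Mimic the same computation in a 64-bit signed int."""
    return ctypes.c_int64(count * unit).value


if __name__ == "__main__":
    for gib in (1, 2, 4):
        print(gib, bytes_as_int32(gib, GIB), bytes_as_int64(gib, GIB))
```

A limit of 2 GiB already exceeds the 32-bit signed range, so any size comparison made on the narrow value is meaningless; widening the arithmetic before multiplying avoids the problem.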

phymbert and others added 30 commits April 6, 2024 05:40

backend : fix typo in scheduler documentation (ggml/781)

b66aec6

Signed-off-by: Daniel Bevenius <[email protected]>

sync : ggml

54ea069

support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS…

d4f220a

…, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (ggerganov#6521)

Run make to build the project (ggerganov#6457)

9472bce

scripts : sync ggml-cuda folder

43e8995

ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020)

f77261a

The `cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1. See this issue for more details: https://github.com/ggerganov/examples/whisper/whisper.cpp/issues/2007

sync : ggml

c372477

Add GritLM as supported models. (ggerganov#6513)

e0717e7

Change Windows AMD example to release build to make inference much fa…

855f544

…ster. (ggerganov#6525)

Adding KodiBot to UI list (ggerganov#6535)

d752327

KodiBot is a free and open-source AI chat app released under the GNU General Public License.

remove row=1 cond (ggerganov#6532)

87fb5b4

quantize : fix precedence of cli args (ggerganov#6541)

b73e564

Comment explaining a decision (ggerganov#6531)

cecd8d3

license : update copyright notice + add AUTHORS (ggerganov#6405)

e11a899

* license : add AUTHORS
* authors : update
* scripts : add LICENSE and gen-authors.sh to sync

server : detect search query to start webchat (ggerganov#6554)

400d5d7

sync : ggml

c4a3a4f

BERT tokenizer fixes (ggerganov#6498)

1b67731

Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does

readme: fix typo in amdgpu target name (ggerganov#6573)

ba5e134

readme : update UI list (ggerganov#6560)

b231b37

readme : fix ROCm link (ggerganov#6579)

29122d3

convert.py : add consolidated.safetensors for mixtral 8x22b (ggergano…

65c64dc

…v#6587)

llama : add model types for mixtral (ggerganov#6589)

4f407a0

ochafik and others added 29 commits April 11, 2024 19:47

As suggested by @slaren, disabling Metal for test to fix CI build on …

f7001cc

…OSX from ggerganov#6576 (ggerganov#6619)

Optimization: eliminate addition of redundant stacks when advancing g…

04a5ac2

…rammar. (ggerganov#6616)

ci : disable Metal for macOS-latest-cmake-x64 (ggerganov#6628)

9ed2737

eval-callback: use ggml_op_desc to pretty print unary operator name (g…

81da18e

…gerganov#6631)

Correct free memory and total memory. (ggerganov#6630)

dee7f8d

Co-authored-by: MasterYi <[email protected]>

imatrix : remove invalid assert (ggerganov#6632)

ef21ce4

chore: Fix markdown warnings (ggerganov#6625)

5c4d767

server : coherent log output for KV cache full (ggerganov#6637)

24ee66e

metal : unify mul_mv_id kernels (ggerganov#6556)

fbbc030

CUDA: fix matrix multiplication logic for tests (ggerganov#6667)

b5e7285

convert : enable the --use-temp-file cli flag (ggerganov#6645)

a4ec34e

[bug fix] convert github repository_owner to lowercase (ggerganov#6673)

e689fc4

Added support for GGML_OP_CLAMP in Metal (ggerganov#6662)

422c2af

* Added support for GGML_OP_CLAMP in Metal
* Corrected size

Co-authored-by: dave-fl <[email protected]>

flake.lock: Update (ggerganov#6669)

f184dd9

Add Command R chat template (ggerganov#6650)

04fbc5f

* Add chat template for command-r model series
* Fix indentation
* Add chat template test for command-r models and update the implementation to trim whitespaces
* Remove debug print

llama : add missing kv clear in llama_beam_search (ggerganov#6664)

1958f7e

fix mul_mat_id() for new input, make the ut pass (ggerganov#6682)

17e98d4

swift : linux support (ggerganov#6590)

7fc16a2

- Package.swift now supports conditional compilation based on OS
- Allows for package to be used by SPM on Non-Apple platforms

Co-authored-by: Steven Prichard <[email protected]>

server : revert "minor layout improvements" (ggerganov#6684)

3272896

This reverts commit b3a96f2.

llama : fix restoring the number of outputs from state files (ggergan…

132f557

…ov#6687)

main: add --json-schema / -j flag (ggerganov#6659)

7593639

* main: add --json-schema / -j
* json: move json-schema-to-grammar to common lib
* json: fix zig build

l3utterfly merged commit f6e7e93 into layla-build Apr 16, 2024
50 of 70 checks passed
