merge from upstream #7

l3utterfly · 2024-04-06T01:15:02Z

No description provided.

* Allow conversion of Mistral HF models
* Homogenize Llama, Mistral, Mixtral under the same entry.
* Fix tokenizer, permute tensors
* Use sentencepiece tokenizer, or fall back to hfft.
* convert-hf : small fix for mypy
* convert-hf : fix duplicated block_count
* convert-hf : add vocab size to metadata
---------
Co-authored-by: Jared Van Bortel <[email protected]>

* llama: remove redundant reshape in build_kv_store
  This commit removes the reshape of the V matrix in the build_kv_store.
  The motivation for this is that the V matrix has the shape:
```console
(gdb) p *v_cur
$46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
  ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
  op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>}, flags = 0,
  grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0,
  view_offs = 0, data = 0x0, name = "Vcur-0", '\000' <repeats 57 times>,
  extra = 0x0, padding = "\000\000\000\000\000\000\000"}
```
  And after reshaping this tensor we get:
```console
(gdb) p *ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens)
$44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
  ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
  op = GGML_OP_RESHAPE, op_params = {0 <repeats 16 times>}, flags = 0,
  grad = 0x0, src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x7ffef1c40e00,
  view_offs = 0, data = 0x0, name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>,
  extra = 0x0, padding = "\000\000\000\000\000\000\000"}
```
  I noticed that the `src` and `view_src` fields are different but that the dimensions are the same. From the code comment it seems like the reshape call is not needed and perhaps the above can motivate the removal of the reshape call.
  Signed-off-by: Daniel Bevenius <[email protected]>
* llama : add assert
---------
Signed-off-by: Daniel Bevenius <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
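The `ne` and `nb` values in the two gdb dumps above can be checked by hand. The following is a minimal sketch in plain Python, not ggml code: `contiguous_strides` and `reshape_2d` are hypothetical helpers that only model how extents and byte strides relate for a densely packed tensor, and the values 4096 and 512 for `n_embd_v_gqa` and `n_tokens` are read off the dump rather than taken from the source.

```python
def contiguous_strides(ne, elem_size):
    """Byte strides (ggml's `nb`) of a densely packed tensor with extents `ne`."""
    nb = [elem_size]
    for extent in ne[:-1]:
        nb.append(nb[-1] * extent)
    return nb


def reshape_2d(ne, ne0, ne1):
    """Model a 2-D reshape: same element count, extents padded with 1s to four dims."""
    count = 1
    for extent in ne:
        count *= extent
    if count != ne0 * ne1:
        raise ValueError("reshape must preserve the number of elements")
    return [ne0, ne1, 1, 1]


# Values taken from the gdb dump of `Vcur-0` (GGML_TYPE_F32, 4 bytes per element).
v_cur_ne = [4096, 512, 1, 1]
v_cur_nb = contiguous_strides(v_cur_ne, 4)

# Assumed from the dump: n_embd_v_gqa = 4096, n_tokens = 512.
reshaped_ne = reshape_2d(v_cur_ne, 4096, 512)
reshaped_nb = contiguous_strides(reshaped_ne, 4)

print(v_cur_nb)                                             # [4, 16384, 8388608, 8388608]
print(reshaped_ne == v_cur_ne and reshaped_nb == v_cur_nb)  # True
```

Under these assumptions the reshaped tensor has the same extents and strides as the input, which is consistent with the commit's observation that only `op`, `src`, and `view_src` differ between the two dumps.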

* Support xverse model convert to gguf format.
* 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md;
* gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter
* llama.cpp: Include the changes from ggerganov#6122 to exclude the unused outputs of the last layers.
* - Fix format issues
  - Remove duplicate set kqv_out to llm_build_kv
* Update llama.cpp
---------
Co-authored-by: willhe <[email protected]>
Co-authored-by: willhe <[email protected]>

…6155)
* Fix Vulkan no kv offload incoherence
* Add k-quant mul mat mat shaders
* Rework working buffer allocation, reduces vram use noticeably
  Clean up cpu assist code, replaced with ggml-backend offload function
* Default to all dedicated GPUs
* Add fallback for integrated GPUs if no dedicated GPUs are found
* Add debug info which device is allocating memory
* Fix Intel dequant issue
  Fix validation issue
* Fix Vulkan GGML_OP_GET_ROWS implementation
* Clean up merge artifacts
* Remove Vulkan warning

* fixed deprecated address
* fixed deprecated address
* fixed deprecated address
* Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions
* Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions
* Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions
* reverted back to only the MIT license

Flake lock file updates:
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
  → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…6387)
* ggml : update mul_mat_id to use the same tensor for all the experts
* update cuda
* minor
* update metal
* update test-backend-ops
* fix cuda
* Update ggml-metal.m
  Co-authored-by: Georgi Gerganov <[email protected]>
* update convert.py
* update convert-hf-to-gguf.py
* update convert.py for mixtral hf models
* Update convert-hf-to-gguf.py
  Co-authored-by: Georgi Gerganov <[email protected]>
* cuda : support non-pow-2 number of experts
* allow quantize to work for split and merged experts models in the same way
* cleanup + disable mmap automatically with split tensors models
* update imatrix
* test-backend-ops : test qwen argsort
* update grok model loading
* llama : add merged experts tensors to the grok tensor map
* minor
* gguf : bump version
* fix quantizing of merged experts
* convert-hf-to-gguf.py : update grok (untested)
* make linter happy
* cuda/argsort : use shared memory instead of pool memory
* convert : fix grok tensor names
* metal : add support for non-pow-2 argsort
* llama : more loader cleanup, better error checking
* cuda : fix warning
* llama : still use mmap for loading old models, but copy the data to a host buffer
* add review note
* llama : remove ffn tensor counting + add sanity check ggml-ci
* convert : fix handling of n_experts == None ggml-ci
* imatrix : fix ncall counters
* llama : produce error if imatrix size does not match
* quantize : terminate on errors + trace logs ggml-ci
* metal : pad shared memory to 16 bytes
---------
Co-authored-by: Georgi Gerganov <[email protected]>

* Add openchat chat template
* Add chat template test for openchat
* Add chat template for vicuna
* Add chat template for orca-vicuna
* Add EOS for vicuna templates
* Combine vicuna chat templates
* Add tests for openchat and vicuna chat templates
* Add chat template for alpaca
* Add separate template name for vicuna-orca
* Remove alpaca, match deepseek with jinja output
* Regenerate chat template test with add_generation_prompt
* Separate deepseek bos from system message
* Match openchat template with jinja output
* Remove BOS token from templates, unprefix openchat

* Create SECURITY.md
  Signed-off-by: Joyce <[email protected]>
* Fix: link on SECURITY.md
  Signed-off-by: Joyce <[email protected]>
* Fix: link on SECURITY.md
  Signed-off-by: Joyce <[email protected]>
* minor
* fix
* fix
---------
Signed-off-by: Joyce <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* initial commit for sealion support
* add sealion support
* minor fix
* q/k ln and pos_embd only if required
* Apply suggestions from code review
  Co-authored-by: Georgi Gerganov <[email protected]>
* minor : clear whitespaces
---------
Co-authored-by: bryan <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* Typo fix to server's README.md
  Fix minor typo ("tonen") in server README.
* server readme grammar/style fixes.
  Quickly went through this file to look for inconsistencies in presentation of defaults, flag options, and looked for typos and grammar issues. Not perfect, but hopefully improved.
* Update README.md
  Remove an extra space before newline.

* ci: bench: change trigger path to not spawn on each PR
* ci: bench: add more file type for phi-2: q8_0 and f16. - do not show the comment by default
* ci: bench: add seed parameter in k6 script
* ci: bench: artefact name perf job
* Add iteration in the commit status, reduce again the autocomment
* ci: bench: add per slot metric in the commit status
* Fix trailing spaces

* server: add cURL support to `full.Dockerfile`
* server: add cURL support to `full-cuda.Dockerfile` and `server-cuda.Dockerfile`
* server: add cURL support to `full-rocm.Dockerfile` and `server-rocm.Dockerfile`
* server: add cURL support to `server-intel.Dockerfile`
* server: add cURL support to `server-vulkan.Dockerfile`
* fix typo in `server-vulkan.Dockerfile`
  Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>

cebtenzzre and others added 30 commits March 28, 2024 11:44

convert : refactor vocab selection logic (ggerganov#6355)

be55134

[SYCL] Revisited & updated SYCL build documentation (ggerganov#6141)

5106ef4

* Revisited & updated SYCL build documentation
* removed outdated comment
* Addressed PR comments
* Trimmed white spaces
* added new end line

readme : add notice for UI list

bfe7daf

cmake : add explicit metal version options (ggerganov#6370)

8093987

* cmake: add explicit metal version options
* Update CMakeLists.txt
---------
Co-authored-by: Georgi Gerganov <[email protected]>

readme : add project (ggerganov#6356)

b910287

* readme: add Android UI binding
* Update README.md

ci : fix BGE wget (ggerganov#6383)

cfde806

ggml-ci

sync : ggml (ggerganov#6351)

d48ccf3

* sync : ggml ggml-ci
* cuda : move GGML_CUDA_DMMV constants to dmmv.cuh
---------
Co-authored-by: slaren <[email protected]>

split: allow --split-max-size option (ggerganov#6343)

f7fc5f6

* split by max size
* clean up arg parse
* split: ok
* add dry run option
* error on 0 tensors
* be positive
* remove next_metadata_size
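As a rough illustration of what splitting by a maximum size can look like, here is a minimal sketch in Python. It is not the gguf-split implementation: `plan_splits`, the tensor names, and the byte sizes are all hypothetical, and the sketch ignores per-file metadata. It groups consecutive tensors greedily so that each shard stays within the limit, rejects an empty tensor list, and only returns a plan, in the spirit of a dry run.

```python
def plan_splits(tensors, max_bytes):
    """Greedily group consecutive (name, size) pairs into shards of at most max_bytes.

    A single tensor larger than max_bytes still gets a shard of its own.
    """
    if not tensors:
        raise ValueError("no tensors to split")
    shards, current, used = [], [], 0
    for name, size in tensors:
        # Start a new shard when adding this tensor would exceed the limit.
        if current and used + size > max_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    shards.append(current)
    return shards


# Hypothetical tensor sizes in bytes; nothing is written to disk.
plan = plan_splits([("a", 4), ("b", 4), ("c", 6), ("d", 2)], max_bytes=8)
print(plan)  # [['a', 'b'], ['c', 'd']]
```

Returning the plan instead of writing files keeps the sizing logic testable on its own, which is one reason a dry-run mode is convenient.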

ci: bench: fix Resource not accessible by integration on PR event (gg…

37e7854

…erganov#6393)

readme : update hot topics

c50a82c

ci: server: verify deps are coherent with the commit (ggerganov#6409)

226e819

* ci: server: verify deps are coherent with the commit
* ci: server: change the ref to build as now it's a pull event target

compare-llama-bench.py: fix long hexsha args (ggerganov#6424)

33a5244

[SYCL] Disable iqx on windows as WA (ggerganov#6435)

5260486

* disable iqx on windows as WA
* array instead of global_memory

readme : update hot topics

076b086

Missing tokenizer.model error during gguf conversion (ggerganov#6443)

db214fa

Co-authored-by: Jared Van Bortel <[email protected]>

readme : add feature-rich rust bindings (ggerganov#6465)

154d4ee

server: add cURL support to server.Dockerfile (ggerganov#6461)

5d4f12e

ci : update checkout, setup-python and upload-artifact to latest (gge…

9f62c01

…rganov#6456)
* CI: Update actions/checkout to v4
* CI: Update actions/setup-python to v5
* CI: Update actions/upload-artifact to v4

server : handle exception on wrong type in request (ggerganov#6452)

60cdf40

Co-authored-by: Jonas Holzner <[email protected]>

HanClinto and others added 18 commits April 4, 2024 09:32

convert : fix for lint error complaining of bare except (ggerganov#6470)

72d73af

server : add option to disable KV offload (ggerganov#6468)

1a43c72

server : remove obsolete --memory-f32 option

4399f13

examples : add GBNF validator program (ggerganov#5948)

9b84ae1

* Revising GBNF validator program to be much simpler.
* Changing from streams to using cstdio
* Adding final newline character.

common: remove duplicate check for curl (ggerganov#6471)

4bcd6b9

This commit removes one of the two identical checks for curl being NULL in llama_load_model_from_url. Signed-off-by: Daniel Bevenius <[email protected]>

Correct README link (ggerganov#6458)

a74401f

README is called README.md.

ci: bench fix concurrency for workflow trigger dispatch with sha1 (gg…

8120efe

…erganov#6478)

server: allow penalizing repetition of newlines on server webpage (gg…

2e66913

…erganov#6431)

build CI: Name artifacts (ggerganov#6482)

c666ba2

Name the artifacts in the build CI, so that they get uploaded with separate names, instead of all put into the same `artifact` ZIP. It might be possible to further simplify the packing step (in future PRs).

ci: exempt master branch workflows from getting cancelled (ggerganov#…

7dda1b7

…6486)
* ci: exempt master branch workflows from getting cancelled
* apply to bench.yml

readme : fix typo (ggerganov#6481)

b660a57

readme : add Dot to UI list (ggerganov#6487)

a307375

[SYCL] Fixed minor bug when enabling FP16 for non intel targets (gger…

1b496a7

…ganov#6464)
* moved INTEL_MKL guard from gemm_impl to gemm (wrapper)
* Update ggml-sycl.cpp
  Co-authored-by: AidanBeltonS <[email protected]>
---------
Co-authored-by: AidanBeltonS <[email protected]>

bench : make n_batch and n_ubatch configurable in Batched bench (gger…

87e21bb

…ganov#6500)
* bench: make n_batch and n_ubatch configurable
* bench: update doc for batched bench

readme : update UI list (ggerganov#6503)

d0f5dee

* Add MindMac to UI list
* Update proprietary description
  Co-authored-by: slaren <[email protected]>
---------
Co-authored-by: slaren <[email protected]>

gguf.py : add licence and version to gguf writer (ggerganov#6504)

a8bd14d

l3utterfly merged commit 607bb9a into layla-build Apr 6, 2024
46 of 61 checks passed
