
merge upstream #39

Merged
merged 22 commits into layla-build on Sep 26, 2024

Conversation

l3utterfly
Owner

ggerganov and others added 22 commits September 23, 2024 11:27
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5+9byVi9BZGWTD4VSbWe8VLpp9B/iYk=' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK+Wk=' (2024-09-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…gerganov#9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing.
This optimization shows performance improvements even for n_threads <= 8.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence, as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical, so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <[email protected]>
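
To make the barrier fix above concrete, here is a minimal, illustrative C++ sketch of the two ideas in this commit message: padding the two counters onto separate cache lines, and having only the last arriving thread write n_barrier_passed. The names and the 64-byte constant are assumptions for the sketch; the actual ggml threadpool code differs.

```cpp
// Minimal sketch only; not the actual ggml threadpool implementation.
#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64; // assumed cache-line size

struct barrier_state {
    // Each counter gets its own cache line, so threads spinning on
    // n_barrier_passed do not bounce the line written through n_barrier.
    alignas(CACHE_LINE) std::atomic<int> n_barrier{0};
    alignas(CACHE_LINE) std::atomic<int> n_barrier_passed{0};
};

void barrier_wait(barrier_state & b, int n_threads) {
    const int passed_old = b.n_barrier_passed.load(std::memory_order_relaxed);
    if (b.n_barrier.fetch_add(1) == n_threads - 1) {
        // Last thread to arrive: reset the entry counter and release everyone.
        // This is the only write to n_barrier_passed per round, avoiding the
        // serialized updates described above.
        b.n_barrier.store(0, std::memory_order_relaxed);
        b.n_barrier_passed.fetch_add(1);
    } else {
        while (b.n_barrier_passed.load() == passed_old) {
            // spin until the last thread bumps n_barrier_passed
        }
    }
}
```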
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <[email protected]>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <[email protected]>
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.

We're missing atomic_thread_fence() in MSVC builds when OpenMP is disabled.
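
As a hedged sketch of one way to fill that gap (the wrapper name here is hypothetical; MSVC lacks C11 <stdatomic.h>, so a Win32 full barrier can stand in for atomic_thread_fence):

```cpp
// Hypothetical shim; names are illustrative and the real ggml code may differ.
#if defined(_MSC_VER)
#include <windows.h>
static void thread_fence_seq_cst() {
    MemoryBarrier(); // full hardware fence via the Win32 macro
}
#else
#include <atomic>
static void thread_fence_seq_cst() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
#endif
```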
…9605)

* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <[email protected]>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <[email protected]>
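
The key observation behind the greedy-sampling change above is that softmax is monotonic, so the argmax over raw logits picks the same token as normalizing first. A minimal illustrative sketch (names are not from the actual llama.cpp sampler):

```cpp
#include <cstddef>
#include <vector>

// Greedy decoding needs only the argmax over logits; softmax preserves the
// ordering, so skipping it saves an exp() per vocabulary entry.
std::size_t greedy_token(const std::vector<float> & logits) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best; // same token that softmax + argmax would select
}
```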
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point, given that a hack in gguf_writer.py
is needed to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: [email protected]

Signed-off-by: Gabe Goodhart <[email protected]>
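
A conceptual sketch of the split described above. The real conversion happens in Python during GGUF export; this sketch assumes input_linear is a row-major [2*n_ff, n_embd] matrix with the gate (w1) rows stored before the up (w3) rows, which should be verified against the actual layout:

```cpp
#include <cstddef>
#include <vector>

struct GateUp {
    std::vector<float> gate; // w1 half, [n_ff, n_embd]
    std::vector<float> up;   // w3 half, [n_ff, n_embd]
};

// Split the concatenated "input_linear" tensor into its two halves so the
// standard mixtral gate_exps/up_exps path can be reused unchanged.
GateUp split_input_linear(const std::vector<float> & input_linear,
                          std::size_t n_ff, std::size_t n_embd) {
    const std::size_t half = n_ff * n_embd;
    return {
        std::vector<float>(input_linear.begin(), input_linear.begin() + half),
        std::vector<float>(input_linear.begin() + half, input_linear.end()),
    };
}
```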

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* Typo fix in docstring

Co-Authored-By: [email protected]

Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
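
A hedged sketch of the env-var-backed flag pattern the bullets above refer to (the helper name is hypothetical; the server's real argument parser differs in detail):

```cpp
#include <cstdlib>

// A CLI flag that can also be enabled through an environment variable,
// with the explicit CLI flag taking precedence.
static bool flag_or_env(bool cli_value, const char * env_name) {
    if (cli_value) {
        return true;
    }
    const char * v = std::getenv(env_name);
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}

// e.g.: bool no_shift = flag_or_env(cli_no_context_shift, "LLAMA_ARG_NO_CONTEXT_SHIFT");
```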
…9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added a fallback mechanism for when the offline re-quantized model is not
optimized for the underlying target (see the sketch after this commit message)

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
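
As the sketch referenced above, here is the fallback idea in hedged, illustrative form; every name is hypothetical and the real ggml kernel-selection code differs:

```cpp
// All names are illustrative; declarations stand in for the real kernels.
bool cpu_supports_repacked_q4_layout(); // runtime capability probe
void gemm_q4_0_optimized(const void * A, const void * B, float * C, int m, int n, int k);
void gemm_q4_0_generic  (const void * A, const void * B, float * C, int m, int n, int k);

// Instead of asserting when the re-quantized (repacked) layout does not match
// what the running CPU supports, route the call to a portable fallback kernel.
void gemm_q4_0(const void * A, const void * B, float * C, int m, int n, int k) {
    if (cpu_supports_repacked_q4_layout()) {
        gemm_q4_0_optimized(A, B, C, m, n, k); // target-tuned path
    } else {
        gemm_q4_0_generic(A, B, C, m, n, k);   // generic fallback, no assert
    }
}
```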
* ci : fix docker build number and tag name

* fine-grained permissions
l3utterfly merged commit 76fe14c into layla-build on Sep 26, 2024
64 of 65 checks passed