forked from ggerganov/llama.cpp
merge upstream #39
Merged
Conversation
l3utterfly (Owner) commented on Sep 26, 2024
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0
* Remove zero vector parameter passing
* Rename functions and rearrange order of macros
* Edit comments
* style : minor adjustments
* Update x to start from 0

Co-authored-by: Georgi Gerganov <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…gerganov#9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the normal case and just use a thread fence as originally intended.

---

Here is the original description and suggestions from Willy Tarreau:

There is currently some false sharing between n_barrier and n_barrier_passed that is amplified in ggml_barrier() by the fact that all threads need to increment n_barrier when entering, while all previous threads continue to read n_barrier_passed, waiting for the last one to release them all. The side effect is that all these readers slow down every newly arriving thread by making the cache line bounce back and forth between readers and writers.

Just placing the two counters in distinct cache lines is sufficient to boost performance by 21% on an 80-core ARM server compared to the no-openmp version, and by 3% compared to the openmp version. Note that the variables could have been spread apart in the structure as well, but the size of this threadpool struct does not seem critical, so here we simply align them.

Finally, the same issue was present when leaving the barrier, since all threads had to update the n_barrier_passed counter even though only one would add a non-zero value. This alone was responsible for half of the cost, due to undesired serialization.

It might be possible that using a small array of n_barrier counters could make things even faster on many-core systems, but it would likely complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <[email protected]>
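For illustration, here is a minimal sketch of the layout fix and barrier pattern described above. It is not the actual ggml code: the struct, the function name, and the 64-byte cache-line size are assumptions.

```c
#include <stdatomic.h>

#define CACHE_LINE 64 /* assumed cache-line size */

struct barrier_counters {
    _Alignas(CACHE_LINE) atomic_int n_barrier;        /* incremented by each thread on entry */
    _Alignas(CACHE_LINE) atomic_int n_barrier_passed; /* generation counter the waiters spin on */
};

/* Simplified barrier: the last of n_threads arrivals resets n_barrier and
 * bumps the generation; everyone else spins on the generation counter.
 * Because the two counters sit on separate cache lines, the spinners no
 * longer invalidate the line that arriving threads must write. */
static void barrier_wait(struct barrier_counters *b, int n_threads) {
    int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add_explicit(&b->n_barrier, 1, memory_order_seq_cst) == n_threads - 1) {
        /* last thread to arrive: reset the entry counter and release the others */
        atomic_store_explicit(&b->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_barrier_passed, 1, memory_order_seq_cst);
    } else {
        while (atomic_load_explicit(&b->n_barrier_passed, memory_order_seq_cst) == passed_old) {
            /* spin; a real implementation would pause or yield here */
        }
    }
}
```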
* server : add --no-context-shift option
* small fix
* Update examples/server/tests/features/embeddings.feature
  Co-authored-by: Georgi Gerganov <[email protected]>
* tests : minor fix
* revert usage of GGML_ASSERT
* update server documentation

Co-authored-by: Georgi Gerganov <[email protected]>
llama : enable K-shift for quantized KV cache

It will fail on unsupported backends or quant types.
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
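A hedged sketch of the kind of shim this fix implies, not necessarily the actual ggml change: MSVC does not provide C11 `<stdatomic.h>` in C mode, so without OpenMP there is no `atomic_thread_fence()` to call and a full memory barrier has to come from somewhere else. The `MemoryBarrier`-based fallback and the helper name are assumptions.

```c
#if defined(_MSC_VER)
#include <windows.h>
static void full_fence(void) {
    MemoryBarrier(); /* Win32 full hardware memory barrier */
}
#else
#include <stdatomic.h>
static void full_fence(void) {
    atomic_thread_fence(memory_order_seq_cst); /* C11 sequentially-consistent fence */
}
#endif
```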
…9605)

* sampling : avoid expensive softmax during greedy sampling (see the sketch below)

  ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs
* Update tests/test-sampling.cpp
  Co-authored-by: slaren <[email protected]>
* sampling : add clarifying comment [no ci]

Co-authored-by: slaren <[email protected]>
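A sketch of the idea behind the first item above, using a hypothetical helper rather than llama.cpp's actual sampler API: greedy decoding only needs the index of the maximum logit, so the softmax pass over the whole vocabulary can be skipped.

```c
#include <stddef.h>

/* Return the token id with the highest logit. Softmax is strictly
 * monotonic, so for greedy (temperature-0) decoding the argmax of the raw
 * logits equals the argmax of the probabilities: no exp() or normalization
 * over the vocabulary is needed. */
static size_t greedy_sample(const float *logits, size_t n_vocab) {
    size_t best = 0;
    for (size_t i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```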
* feat(gguf-py): Add granitemoe architecture

  This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers.

* feat(convert_hf_to_gguf): Add GraniteMoeModel

  GraniteMoe has the same configuration deltas as Granite.

* fix(granitemoe convert): Split the double-sized input layer into gate and up

  After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear". Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route (see the sketch after this list).

* feat(granitemoe): Implement granitemoe

  GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite.

* Typo fix in docstring
* fix(conversion): Simplify tensor name mapping in conversion
* fix(convert): Remove unused tensor name mappings
* fix(convert): Sanity check on merged FFN tensor sizes
* fix: Allow "output" layer in granite moe architecture (convert and cpp)
* fix(granite): Add missing 'output' tensor for Granite

  This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
Co-Authored-By: [email protected]
Co-authored-by: Georgi Gerganov <[email protected]>
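A conceptual sketch of the split described above. The real code lives in the Python converter (convert_hf_to_gguf.py); the function name, the row-major layout, and the gate-rows-first ordering are assumptions for illustration.

```c
#include <string.h>

/* input_linear holds 2*n_ff rows of n_embd floats per expert, row-major:
 * the fused w1/w3 tensor. Copy the first n_ff rows into gate (w1) and the
 * last n_ff rows into up (w3), after which the standard mixtral MoE path
 * (gate_exps/up_exps) applies unchanged. */
static void split_input_linear(const float *input_linear,
                               float *gate, float *up,
                               size_t n_ff, size_t n_embd) {
    memcpy(gate, input_linear,                 n_ff * n_embd * sizeof(float));
    memcpy(up,   input_linear + n_ff * n_embd, n_ff * n_embd * sizeof(float));
}
```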
* server : add more env vars, improve gen-docs
* update server docs
* LLAMA_ARG_NO_CONTEXT_SHIFT
…9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* added a fallback mechanism for when the offline re-quantized model is not optimized for the underlying target (see the sketch below)
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream
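A hedged sketch of the fallback pattern this describes; every name here is hypothetical, not ggml's actual API. Instead of asserting when the re-quantized weight layout does not match the running CPU, a kernel is selected at runtime and the generic path is used as a fallback.

```c
#include <stdbool.h>
#include <stdio.h>

typedef void (*gemm_fn)(void);

static void gemm_q4_0_optimized(void) { puts("optimized AArch64 kernel"); }
static void gemm_q4_0_generic(void)   { puts("generic fallback kernel"); }

/* stand-in for a real CPU-feature probe (e.g. i8mm/SVE detection) */
static bool target_supports_optimized_layout(void) { return false; }

static gemm_fn select_gemm(void) {
    return target_supports_optimized_layout()
        ? gemm_q4_0_optimized   /* fast path when the layout matches the target */
        : gemm_q4_0_generic;    /* correct everywhere, just slower */
}

int main(void) {
    select_gemm()(); /* dispatch instead of asserting on a mismatch */
    return 0;
}
```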
* ci : fix docker build number and tag name
* fine-grained permissions
Signed-off-by: Xiaodong Ye <[email protected]>