merge upstream #38

Merged

merged 48 commits into layla-build from merge on Sep 23, 2024
Conversation

@l3utterfly l3utterfly commented Sep 23, 2024

Xarbirus and others added 30 commits September 15, 2024 19:55
* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed
* added cli arg to disable context shift

* reverted precommit

* updated README.md for main

* white space

* allow disabling context shift in the server

* Update common/arg.cpp

no-context-shift only works for main example

Co-authored-by: Georgi Gerganov <[email protected]>

* added server example to --no-context-shift args

* removed server changes

* white space

---------

Co-authored-by: Georgi Gerganov <[email protected]>
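
A minimal sketch of what a `--no-context-shift` style option can look like in a generation loop. The flag name comes from this PR; the `gen_params` struct, its `ctx_shift` field, and the surrounding logic are illustrative assumptions, not the actual llama.cpp code:

```cpp
// Illustrative only: when context shifting is disabled and the context is full,
// stop generating instead of discarding the oldest tokens.
struct gen_params {
    bool ctx_shift = true;   // assumed field backing --no-context-shift
    int  n_ctx     = 4096;
};

bool can_continue(const gen_params & params, int n_past) {
    if (n_past + 1 <= params.n_ctx) {
        return true;              // still room in the context, keep generating
    }
    if (!params.ctx_shift) {
        return false;             // context full and shifting disabled -> stop
    }
    // Otherwise the real code would drop part of the old tokens and re-position
    // the KV cache before continuing (omitted here).
    return true;
}
```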
* squashed

re-add my iq4_nl sgemm PR ggerganov#8049

have ggml_vec_dot_q4_0 do two blocks per loop for AVX

try out an F16C ggml_vec_dot_iq4_nl, but it's not really faster; as per ggerganov#8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as I can't make it faster than before
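
The "two blocks per loop" change is essentially loop unrolling over quantization blocks so two independent accumulators can overlap. A scalar sketch of the idea (the real code uses AVX intrinsics and the packed q4_0 block layout; `block` here is a simplified stand-in):

```cpp
// Simplified stand-in for a quantized block: one scale plus pre-dequantized values.
struct block {
    float vals[32];
    float scale;
};

// Process two blocks per iteration with separate accumulators.
float vec_dot_unrolled(int nb, const block * x, const block * y) {
    float acc0 = 0.0f, acc1 = 0.0f;
    int i = 0;
    for (; i + 1 < nb; i += 2) {
        float s0 = 0.0f, s1 = 0.0f;
        for (int j = 0; j < 32; ++j) {
            s0 += x[i    ].vals[j] * y[i    ].vals[j];
            s1 += x[i + 1].vals[j] * y[i + 1].vals[j];
        }
        acc0 += s0 * x[i    ].scale * y[i    ].scale;
        acc1 += s1 * x[i + 1].scale * y[i + 1].scale;
    }
    for (; i < nb; ++i) {            // handle an odd trailing block
        float s = 0.0f;
        for (int j = 0; j < 32; ++j) s += x[i].vals[j] * y[i].vals[j];
        acc0 += s * x[i].scale * y[i].scale;
    }
    return acc0 + acc1;
}
```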
* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci
This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix.

The motivation for this change is consistency with the other rwkv6
functions like build_rwkv6 (and other parts of the code base).
* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
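
Roughly, the `*_scale` hparams mentioned above are extra scalar multipliers layered on top of the regular llama computation. A rough sketch of where such scales would plug in; the field names and placement are assumptions based on the commit messages, not the actual implementation:

```cpp
#include <vector>

// Illustrative hparams mirroring the *_scale naming above.
struct granite_hparams {
    float f_embedding_scale = 1.0f;
    float f_residual_scale  = 1.0f;
    float f_attention_scale = 1.0f;
    float f_logit_scale     = 1.0f;
};

// Pseudo forward-pass hook showing where each scale would apply.
void apply_granite_scales(const granite_hparams & hp,
                          std::vector<float> & embd,
                          std::vector<float> & logits) {
    for (float & v : embd)   v *= hp.f_embedding_scale;   // scale token embeddings
    for (float & v : logits) v *= hp.f_logit_scale;       // scale applied to final logits
    // f_attention_scale and f_residual_scale would similarly multiply the
    // attention scores and the residual-branch outputs inside each layer.
}
```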
* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written
from one thread and read from other threads (not a race condition).
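
A minimal sketch of that check, using C++ atomics in place of ggml's own atomic helpers (the `n_threads_cur` and `ith` names follow the commit message; the surrounding struct is illustrative):

```cpp
#include <atomic>

struct threadpool_state {
    std::atomic<int> n_threads_cur{1};   // written by one thread, read by workers
};

// Worker-side check: threads beyond n_threads_cur skip the polling rounds entirely.
bool should_poll(const threadpool_state & tp, int ith) {
    return ith < tp.n_threads_cur.load(std::memory_order_relaxed);
}
```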

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through a
full memory barrier (memory fence) on ggml_barrier entrance and exit.
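
A sketch of that pattern: spin with relaxed loads, but enter and leave through a sequentially consistent operation so every thread passes a full fence. This is a simplified single-use barrier, not ggml's actual implementation (reuse/generation handling is omitted):

```cpp
#include <atomic>

struct barrier {
    std::atomic<int> n_passed{0};
    int              n_threads = 1;
};

void barrier_wait(barrier & b) {
    // The seq_cst fetch_add acts as the full fence on entry.
    const int passed = b.n_passed.fetch_add(1, std::memory_order_seq_cst) + 1;
    if (passed == b.n_threads) {
        return;   // last thread through (counter reset for reuse omitted)
    }
    // Everyone else polls with relaxed loads to keep the spin cheap...
    while (b.n_passed.load(std::memory_order_relaxed) < b.n_threads) { /* spin */ }
    // ...and takes a full fence on exit before touching shared data.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```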

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient; once the new graph is detected, we do a full fence using a
read-modify-write with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic, which makes the thread sanitizer rightfully unhappy about it.

Instead, introduce an atomic threadpool->abort flag for this purpose. This is consistent with
how we handle threadpool->stop and pause.

While at it, add an explicit atomic_load for n_threads_cur for consistency.
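
A sketch of the atomic abort flag replacing the non-atomic exit-code check (the field names follow the commit message; the loop body is illustrative):

```cpp
#include <atomic>

struct threadpool {
    std::atomic<bool> abort{false};   // atomic, unlike the old non-atomic ec check
    std::atomic<bool> stop{false};
};

// Compute loop: exit on abort or stop, both read atomically so TSan stays happy.
void compute_loop(threadpool & tp, int n_chunks) {
    for (int i = 0; i < n_chunks; ++i) {
        if (tp.abort.load(std::memory_order_relaxed) ||
            tp.stop.load(std::memory_order_relaxed)) {
            return;   // bail out without relying on a non-atomic error code
        }
        // ... process chunk i of the graph ...
    }
}
```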

* test-barrier: release threadpool before releasing the context

Fixes a use-after-free detected by the GCC thread sanitizer on x86-64;
for some reason the LLVM sanitizer does not detect this issue.
* llama: fixed n_vocab for `no_vocab` models

* llama: updated error output for `llama_decode_internal` and `llama_encode_internal`

* llama: log warning if there's no vocab_size in metadata

* llama: correct vocab size for logging

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* add env variable for parallel

* Update README.md with env: LLAMA_ARG_N_PARALLEL
…gerganov#9476)

* set context default to avoid memory issue, update guide

* Update docs/backend/SYCL.md

Co-authored-by: Meng, Hengyu <[email protected]>

---------

Co-authored-by: arthw <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.

The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
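
This is the standard reserve-then-emplace pattern; a minimal sketch under the assumption that the struct carries an id, a logit, and a probability (a stand-in, not the actual llama_token_data definition):

```cpp
#include <vector>

// Stand-in for llama_token_data: { id, logit, p }.
struct token_data {
    int   id;
    float logit;
    float p;
};

std::vector<token_data> build_candidates(const float * logits, int n_vocab) {
    std::vector<token_data> cur;
    cur.reserve(n_vocab);   // one allocation, no default construction of n_vocab elements
    for (int i = 0; i < n_vocab; ++i) {
        cur.emplace_back(token_data{i, logits[i], 0.0f});   // construct each entry in place
    }
    return cur;
    // The old approach sized the vector to n_vocab first, default-constructing every
    // element and then immediately overwriting it.
}
```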
* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <[email protected]>
JohannesGaessler and others added 15 commits September 20, 2024 21:15
* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <[email protected]>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <[email protected]>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <[email protected]>

* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <[email protected]>

* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: slaren <[email protected]>
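
Of the items above, gradient accumulation is the one that benefits most from a concrete picture: gradients from several small batches are summed before a single optimizer step. A generic sketch of the idea, not tied to the actual ggml API:

```cpp
#include <algorithm>
#include <vector>

// Accumulate gradients over several micro-batches, then apply one SGD step.
void train_step(std::vector<float> & weights,
                std::vector<float> & grad_accum,
                const std::vector<std::vector<float>> & micro_batch_grads,
                float lr) {
    std::fill(grad_accum.begin(), grad_accum.end(), 0.0f);   // reset accumulators
    for (const auto & g : micro_batch_grads) {
        for (size_t i = 0; i < grad_accum.size(); ++i) {
            grad_accum[i] += g[i];                            // sum micro-batch gradients
        }
    }
    const float scale = lr / static_cast<float>(micro_batch_grads.size());
    for (size_t i = 0; i < weights.size(); ++i) {
        weights[i] -= scale * grad_accum[i];                  // single optimizer step
    }
}
```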
ggml-ci
quantize : do not ignore invalid types in arg parsing

quantize : ignore case of type and ftype arguments
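
A sketch of what "ignore case of the type argument, but reject invalid types" can look like; the type table and function name here are illustrative, not the actual quantize code:

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative table; the real tool supports many more quantization types.
static const std::unordered_map<std::string, int> k_ftypes = {
    {"Q4_0", 2}, {"Q4_1", 3}, {"Q8_0", 7},
};

std::optional<int> parse_ftype(std::string arg) {
    std::transform(arg.begin(), arg.end(), arg.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    const auto it = k_ftypes.find(arg);          // case-insensitive lookup
    if (it == k_ftypes.end()) {
        return std::nullopt;                     // caller reports an error instead of silently ignoring it
    }
    return it->second;
}
```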
…9550)

* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes ggerganov#9451

* clear before resize
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <[email protected]>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
…e Flash Attention on QY1 (MTT S80) (ggerganov#9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
@l3utterfly l3utterfly merged commit b5b05a8 into layla-build Sep 23, 2024
7 of 8 checks passed
@l3utterfly l3utterfly deleted the merge branch September 23, 2024 06:11