Merge from upstream #31

l3utterfly · 2024-08-08T05:11:06Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion

  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16

  Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
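For readers unfamiliar with the conversion discussed in the commit message above, here is a minimal sketch of an fp32 to bf16 conversion that rounds to nearest-even and leaves subnormals alone instead of flushing them to zero. The function names `float_from_bits` and `fp32_to_bf16_bits` are hypothetical and chosen for illustration; this is not presented as the ggml implementation.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as a float.
inline float float_from_bits(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical helper (not the ggml API): convert an IEEE-754 float to
// bfloat16 bits using round-to-nearest-even. Subnormal inputs are rounded
// like any other value and are not flushed to zero.
inline std::uint16_t fp32_to_bf16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    // NaN: keep the upper 16 bits and set a mantissa bit so the result
    // is still a NaN after the low bits are dropped.
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }

    // Adding 0x7fff plus the lowest kept bit makes exact ties round
    // toward the even bf16 value.
    const std::uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}
```

In this sketch a tie such as the bit pattern `0x3f818000` rounds up to the even neighbour, while `0x3f808000` rounds down, and a subnormal input such as `0x00010000` keeps its nonzero mantissa.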

* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets

---------

Co-authored-by: domke <[email protected]>

* [example] batched-bench "segmentation fault"

  When `llama-batched-bench` is invoked _without_ setting `-npl`, "number of parallel prompts", it segfaults.

  The segfault is caused by invoking `max_element()` on a zero-length vector, `n_pl`.

  This commit addresses that by first checking to see if the number of parallel prompts is zero, and if so sets the maximum sequence size to 1; otherwise, sets it to the original, the result of `max_element()`.

  Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

  ```
  * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
      frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
     69       llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
     70
     71       // ensure enough sequences are available
  -> 72       ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
  ```
* Update examples/batched-bench/batched-bench.cpp

  Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>
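The guard described in the commit message above can be sketched as follows. The helper name `max_parallel_sequences` is hypothetical; the actual change lives inline in `batched-bench.cpp` and may be written differently.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: returns the largest requested parallel-prompt count,
// or 1 when the list is empty. For an empty range std::max_element returns
// the end iterator, and dereferencing it is undefined behaviour, which is
// what shows up as the EXC_BAD_ACCESS in the trace.
inline int max_parallel_sequences(const std::vector<int> & n_pl) {
    return n_pl.empty() ? 1 : *std::max_element(n_pl.begin(), n_pl.end());
}
```

With this shape, the result could be assigned to `ctx_params.n_seq_max` regardless of whether `-npl` was supplied.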

This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <[email protected]>

* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op

… Llama 3.1 tool call support (ggerganov#8858)

* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token
* llama : find Llama-3.1 <|eom_id|> token id during vocab loading
* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>

When using CMake to build with Vulkan support, compiling vulkan-shaders-gen fails due to missing a CMakeLists.txt specification to link vulkan-shaders-gen with the threading library, resulting in the following error.

    [5/172] Linking CXX executable bin/vulkan-shaders-gen
    FAILED: bin/vulkan-shaders-gen
    : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && :
    ld: error: undefined symbol: pthread_create
    >>> referenced by vulkan-shaders-gen.cpp
    >>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread**,
    >>> void* (*)(void*), void*))
    c++: error: linker command failed with exit code 1 (use -v to see invocation)
    [6/172] Generating build details from Git
    -- Found Git: /usr/local/bin/git (found version "2.45.2")
    ninja: build stopped: subcommand failed.

Add the CMakeLists.txt specification to link vulkan-shaders-gen with the threading library and fix the above error.

Fixes ggerganov#8834

CISC and others added 30 commits August 2, 2024 15:11

flake.lock: Update (ggerganov#8847)

4b77ea9

baby-llama : remove duplicate vector include

01aae2b

Server: Don't ignore llama.cpp params (ggerganov#8754)

978ba3d

* Don't ignore llama.cpp params
* Add fallback for max_tokens

Install curl in runtime layer (ggerganov#8693)

0d6fb52

cann: support q4_0 model (ggerganov#8822)

c02b0a8

sync : ggml

5587e57

ggml-ci

vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (ggerganov#8855)

064cdc2

* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered

llama : better replace_all (ggerganov#8852)

f1ea514

readme : update model list (ggerganov#8851)

400ae6f

cmake: fix paths for vulkan shaders compilation on Windows (ggerganov#8573)

e31a4f6

* Vulkan-shaders: attempt fix compilation on windows
* fix mismatched parenthesis

py: Add more authorship metadata from model card (ggerganov#8810)

1ef14b3

* py: add more authorship metadata from model card
* fixup! py: add more authorship metadata from model card

ggml : fix overflows in elu function (ggerganov#8866)

b9dfc25

It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.
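As a rough illustration of the `expm1f` substitution described above, the sketch below shows an ELU written both ways. The function names `elu_naive` and `elu_expm1` are hypothetical and not taken from ggml. The example only demonstrates the well-known accuracy benefit of `expm1` near zero: for a tiny negative input, `exp(x)` rounds to `1.0f` in single precision, so the naive subtraction loses the result, whereas `expm1` keeps it.

```cpp
#include <cmath>

// Hypothetical names for illustration; not the ggml functions.

// ELU computed as exp(x) - 1 on the negative branch. For |x| far below the
// float epsilon, exp(x) is rounded to 1.0f and the difference is lost.
inline float elu_naive(float x) {
    return x > 0.0f ? x : std::exp(x) - 1.0f;
}

// ELU computed with expm1, which evaluates exp(x) - 1 directly and stays
// accurate for inputs close to zero.
inline float elu_expm1(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```

Both variants agree on the positive branch and for large negative inputs, where the result approaches -1.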

readme : add ramalama to the availables UI (ggerganov#8811)

b42978e

ramalama is a repo agnostic boring CLI tool that supports pulling from ollama, huggingface and oci registries.

Signed-off-by: Eric Curtin <[email protected]>

cann: fix buffer_num and runtime speed slowly error (ggerganov#8865)

bc0f887

common : Changed tuple to struct (TODO fix) (ggerganov#8823)

0a4ce78

* common : Changed tuple to struct (TODO fix)

  Use struct `llama_init_result` to replace the previous std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
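To show why a named struct can read better than a tuple at call sites, here is a small sketch. Everything in it is a stand-in: `model_t`, `context_t`, `init_result`, and both `init_as_*` functions are hypothetical, and the field names `model` and `context` are an assumption rather than a statement about the real `llama_init_result`.

```cpp
#include <tuple>

// Placeholder types standing in for the real model and context types.
struct model_t {};
struct context_t {};

// Hypothetical result struct: callers access members by name.
struct init_result {
    model_t   * model   = nullptr;
    context_t * context = nullptr;
};

// Tuple style: callers must remember which position holds which pointer.
inline std::tuple<model_t *, context_t *> init_as_tuple(model_t & m, context_t & c) {
    return std::make_tuple(&m, &c);
}

// Struct style: the same data, returned with self-describing field names.
inline init_result init_as_struct(model_t & m, context_t & c) {
    init_result result;
    result.model   = &m;
    result.context = &c;
    return result;
}
```

A caller of the struct version writes `result.model` and `result.context` instead of `std::get<0>(...)` and `std::get<1>(...)`, and new fields can be added later without changing existing call sites.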

[SYCL] correct cmd name (ggerganov#8877)

d4ff847

[CANN]: Fix ggml_backend_cann_buffer_get_tensor (ggerganov#8871)

c21a896

* cann: fix ggml_backend_cann_buffer_get_tensor
  1. fix data ptr offset
  2. enable the acquisition of incomplete tensors
* fix backend cann set_tensor

convert : add support for XLMRoberta embedding models (ggerganov#8658)

cdd1889

* add conversion for bge-m3; small fix in unigram tokenizer
* clean up and simplify XLMRoberta conversion

ggml : add epsilon as a parameter for group_norm (ggerganov#8818)

2d5dd7b

Signed-off-by: Molly Sophia <[email protected]>

contributing : add note about write access

0bf16de

[Vulkan] Fix compilation of vulkan-shaders-gen on w64devkit after `e31a4f6` (ggerganov#8880)

efda90c

* Fix compilation issue in `vulkan-shaders-gen`

  ggerganov@e31a4f6 broke compilation on w64devkit. Including `algorithm` seems to fix that.
* Guard it under `#ifdef _WIN32`

simple : update name of executable to llama-simple (ggerganov#8885)

5f4dcb1

This commit updates the name of the executable in README.md from `simple` to `llama-simple`.

CUDA: fix padding logic for FP16/FP32 (ggerganov#8884)

641f5dd

ngxson and others added 8 commits August 6, 2024 17:33

server : add lora hotswap endpoint (WIP) (ggerganov#8857)

1e6f655

* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style

typo correction (ggerganov#8891)

3195854

quantize : update usage comment in quantize.cpp (ggerganov#8889)

725e3d9

This commit updates the usage comment in quantize.cpp to reflect the new name of the executable, which is llama-quantize.

llama-bench : add support for getting cpu info on Windows (ggerganov#8824)

506122d

* Add support for getting cpu info on Windows for llama_bench
* refactor

---------

Co-authored-by: slaren <[email protected]>

CUDA/HIP: fix tests/test-backend-ops (ggerganov#8896)

a8dbc6f

[SYCL] Updated SYCL device filtering (ggerganov#8901)

0478174

* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme

ggml-backend : fix async copy from CPU (ggerganov#8897)

be55695

* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same

make : use C compiler to build metal embed object (ggerganov#8899)

15fa07a

* make : use C compiler to build metal embed object
* use rm + rmdir to avoid -r flag in rm

l3utterfly merged commit 865d8f3 into layla-build Aug 8, 2024
62 of 78 checks passed

github-actions bot added SYCL Nvidia GPU Vulkan testing examples devops python server ggml script labels Aug 8, 2024
