merge from upstream #19

l3utterfly · 2024-05-16T12:36:45Z

No description provided.

* llama : rename ctx to user_data in progress_callback

  This commit renames the `ctx` parameter to `user_data` in the `llama_progress_callback` typedef. The motivation is that other callbacks use `user_data` or `data`, and `ctx` could be mistaken for a `llama_context`.

---------

Signed-off-by: Daniel Bevenius <[email protected]>

* convert.py: add python logging instead of print()
* convert.py: verbose flag takes priority over dump flag log suppression
* convert.py: named instance logging
* convert.py: use explicit logger id string
* convert.py: convert extra print() to named logger
* convert.py: sys.stderr.write --> logger.error
* *.py: Convert all python scripts to use logging module
* requirements.txt: remove extra line
* flake8: update flake8 ignore and exclude to match ci settings
* gh-actions: add flake8-no-print to flake8 lint step
* pre-commit: add flake8-no-print to flake8 and also update pre-commit version
* convert-hf-to-gguf.py: print() to logger conversion
* *.py: logging basicConfig refactor to use conditional expression
* *.py: removed commented out logging
* fixup! *.py: logging basicConfig refactor to use conditional expression
* constant.py: logger.error then exit should be a raise exception instead
* *.py: Convert logger error and sys.exit() into a raise exception (for atypical error)
* gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar
* verify-checksum-model.py: This is the result of the program, it should be printed to stdout.
* compare-llama-bench.py: add blank line for readability during missing repo response
* reader.py: read_gguf_file() use print() over logging
* convert.py: warning goes to stderr and won't hurt the dump output
* gguf-dump.py: dump_metadata() should print to stdout
* convert-hf-to-gguf.py: print --> logger.debug or ValueError()
* verify-checksum-models.py: use print() for printing table
* *.py: refactor logging.basicConfig()
* gguf-py/gguf/*.py: use `__name__` as logger name, since they will be imported and not run directly.
* python-lint.yml: use .flake8 file instead
* constants.py: logger no longer required
* convert-hf-to-gguf.py: add additional logging
* convert-hf-to-gguf.py: print() --> logger
* *.py: fix flake8 warnings
* revert changes to convert-hf-to-gguf.py for get_name()
* convert-hf-to-gguf-update.py: use triple quoted f-string instead
* *.py: accidentally corrected the wrong line
* *.py: add compilade warning suggestions and style fixes
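The pattern these commits describe (a named logger, `logging.basicConfig()` with a conditional expression for the level, and `print()` kept only for program results) can be sketched roughly as follows. The `--verbose` flag name and the logger id `example-script` are placeholders chosen for illustration, not taken from the actual scripts.

```python
import argparse
import logging
import sys

# Named logger for a script; importable modules would instead use
# logging.getLogger(__name__). The id string here is hypothetical.
logger = logging.getLogger("example-script")


def configure_logging(verbose):
    """Set the root log level with a conditional expression and return it."""
    level = logging.DEBUG if verbose else logging.INFO
    # force=True (Python 3.8+) replaces any handlers installed earlier.
    logging.basicConfig(level=level, force=True)
    return level


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")  # assumed flag name
    args = parser.parse_args(argv)
    level = configure_logging(args.verbose)
    logger.debug("only shown with --verbose")
    logger.info("diagnostics go through the logger (stderr by default)")
    print("program results still go to stdout")
    return level


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping results on stdout and diagnostics on stderr means a dump or table can be piped to another tool without log lines mixed in.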

* tests : add test-tokenizer-0.sh * unicode : add all unicode number ranges * starcoder : fix pre-tokenizer * tests : add test that fails with DeepSeek tokenizers * falcon : fix regex * unicode : regenerate unicode tables * refact : add tokenizer model * lint : fix * tests : disable failing tests ggml-ci * refact : add tests files ggml-ci * convert : print -> logging ggml-ci * lint : fix * unicode : digit -> number * phi-3 : update

* Tidy Android Instructions README.md
  Remove CLBlast instructions (outdated), added OpenBlas.
* don't assume git is installed
  Added apt install git, so that git clone works
* removed OpenBlas
  Linked to Linux build instructions
* fix typo
  Remove word "run"
* correct style
  Co-authored-by: slaren <[email protected]>
* correct grammar
  Co-authored-by: slaren <[email protected]>
* delete reference to Android API
* remove Fdroid reference, link directly to Termux
  Fdroid is not required
  Co-authored-by: slaren <[email protected]>
* Update README.md
  Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

* Disable benchmark on forked repo * only check owner on schedule event * check owner on push also * more readable as multi-line * ternary won't work * style++ * test++ * enable actions debug * test-- * remove debug * test++ * do debug where we can get logs * test-- * this is driving me crazy * correct github.event usage * remove test condition * correct github.event usage * test++ * test-- * event_name is pull_request_target * test++ * test-- * update ref checks

Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01) → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29) → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25) → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Update log text (EOS to EOG) The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT. * Improve log msg. further by using "an" instead of "some". As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).

* Fixed save_imatrix to match old behaviour for MoE
  This fix is simple and clear, but unnecessarily doubles the memory overhead.
* Fixed missing idx variable
* Unconditionally increment ncall
  Co-authored-by: slaren <[email protected]>
* Fixed 2 bugs in save_imatrix()
  - Fixed segfault bug because the counts vector needed to be created.
  - Fixed pre-existing bug that didn't actually add to the counts for the "--combine" option.
* ncall needs summing too
* Trailing whitespace

---------

Co-authored-by: slaren <[email protected]>

* Further tidy on Android instructions README.md
  Fixed some logic when following readme direction
* Clean up redundant information
  A new user arriving will see simple directions on llama.cpp homepage
* corrected punctuation
  Period after cmake, colon after termux
* re-word for clarity
  method seems to be more correct, instead of alternative in this context
* Organized required packages per build type
  building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux.
* README.md corrected title
* fix trailing whitespace

* Introduce bfloat16 support

  Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format.

        ┌sign
        │
        │   ┌exponent
        │   │
        │   │      ┌mantissa
        │   │      │
        │┌──┴───┐┌─┴───┐
      0b0000000000000000 brain16

  This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left.

        ┌sign
        │
        │   ┌exponent
        │   │
        │   │      ┌mantissa
        │   │      │
        │┌──┴───┐┌─┴───────────────────┐
      0b00000000000000000000000000000000 IEEE binary32

  The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16, which in practice ends up being 99.71% of Mistral 7b v0.2's weights; however, there is currently no way other than fp32 to get the others.

        ┌sign
        │
        │  ┌exponent
        │  │
        │  │    ┌mantissa
        │  │    │
        │┌─┴─┐┌─┴──────┐
      0b0000000000000000 IEEE binary16

  This change fixes that by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16.

* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
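The bit-level relationship described in the bfloat16 commit message can be illustrated with a small sketch. It is not the GGML implementation: the function names are made up for this example, and the narrowing direction simply truncates the low 16 bits, whereas a production conversion would normally round.

```python
import struct


def bf16_to_fp32(bits):
    """Widen a bfloat16 bit pattern to float32 by shifting it 16 bits left."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp32_to_bf16_truncate(value):
    """Narrow a float32 value to bfloat16 by dropping the low 16 bits.

    Truncation is an assumption made for brevity; it does not round.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def fits_in_fp16(value):
    """Return True if value survives a round trip through IEEE binary16."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0] == value
    except OverflowError:
        # Magnitude is beyond the binary16 range.
        return False


one = bf16_to_fp32(0x3F80)    # sign 0, exponent 0x7F, mantissa 0
huge = bf16_to_fp32(0x7F00)   # 2**127, far outside the fp16 range
tiny = bf16_to_fp32(0x0080)   # 2**-126, underflows to zero in fp16
```

Values such as `huge` and `tiny` show why a bf16 to fp16 conversion can lose information: both are exactly representable in bf16 and fp32 but not in fp16.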

* ggml : add RPC backend
  The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc).
* set TCP_NODELAY
* add CI workflows
* Address review comments
* fix warning
* implement llama_max_devices() for RPC
* Address review comments
* Address review comments
* wrap sockfd into a struct
* implement get_alignment and get_max_size
* add get_device_memory
* fix warning
* win32 support
* add README
* readme : trim trailing whitespace
* Address review comments
* win32 fix
* Address review comments
* fix compile warnings on macos
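For readers unfamiliar with the `TCP_NODELAY` option mentioned in this commit, here is a minimal sketch of setting it on a client socket. It is written in Python rather than the backend's own language, and the helper name `make_rpc_socket` is hypothetical; it only shows the socket option, not the RPC protocol itself.

```python
import socket


def make_rpc_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY sends small writes immediately instead of coalescing them,
    # which tends to matter for request/response traffic made of many small messages.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

A caller would then `connect()` the returned socket to the remote server; the address and port are deployment-specific and not shown here.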

* optimize for ppc64le using VSX intrinsics * 1. code clean up by removing comments about overflow concern. 2. fix typo in suffix of scaling. * Continue to fix typo in suffix of scaling for QK_K <> 256 --------- Co-authored-by: Georgi Gerganov <[email protected]>

* initial commit with CPU implementation of upscale to shape and test, cuda implementation next
* experimental commit to see if dst shape is correct
* test version
* test
* removed unnecessary params
* refactor
* fixed tests
* ggml : metal impl + cleanup + sycl dev warnings
* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior
* metal : fix upscale op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <[email protected]>
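As a rough illustration of what "upscale to shape" means, the sketch below resizes a 2D grid to an arbitrary target height and width. It assumes nearest-neighbour sampling, which the commit messages do not state, and the function name is invented for this example; it is not the ggml operator.

```python
def upscale_nearest(src, out_h, out_w):
    """Upscale a 2D list to (out_h, out_w) using nearest-neighbour sampling."""
    in_h, in_w = len(src), len(src[0])
    return [
        # Map each destination index back to a source index by integer scaling.
        [src[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


doubled = upscale_nearest([[1, 2], [3, 4]], 4, 4)
stretched = upscale_nearest([[1, 2]], 1, 3)
```

Because the target shape is given directly, the scale factor does not have to be an integer, as the `stretched` case shows.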

As discussed in PR ggerganov#6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.

…anov#6915)

* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to its own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to setup the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration.
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c

---------

… MSVC (ggerganov#7191)

* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS
* build: add CMake Presets and toolchain files for Windows ARM64
* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings
* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM
* matmul-int8: fixed typos in q8_0_q8_0 matmuls
  Co-authored-by: Georgi Gerganov <[email protected]>
* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <[email protected]>

…l. (ggerganov#7288) * chore: add references to the quantisation space. * fix grammer lol. * Update README.md Co-authored-by: Julien Chaumond <[email protected]> * Update README.md Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: Julien Chaumond <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>

danbev and others added 30 commits May 3, 2024 15:24

If first token generated from the server is the stop word the server …

03fb8a0

…will crash (ggerganov#7038) This will reproduce the issue in llama13b { 'prompt': 'Q: hello world \nA: ', 'stop': ['\n'], 'temperature': 0.0, 'n_predict': 10, 'cache_prompt': True, 'n_probs': 10 }
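The reproduction payload above can be packaged as an HTTP request like this. This is only a sketch: the `/completion` path and the `http://localhost:8080` address are assumptions about how the server is being run, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Payload copied from the commit message; '\n' is both the stop word and
# a likely first generated token, which is what triggered the crash.
payload = {
    "prompt": "Q: hello world \nA: ",
    "stop": ["\n"],
    "temperature": 0.0,
    "n_predict": 10,
    "cache_prompt": True,
    "n_probs": 10,
}


def build_request(base_url="http://localhost:8080"):
    """Build (but do not send) a POST request carrying the payload as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + "/completion",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request()
```

Passing `request` to `urllib.request.urlopen()` against a running server would submit it.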

Fix Linux /sys cpu path to guess number of cores (ggerganov#7064)

fcd84a0

gguf-split: add --no-tensor-first-split (ggerganov#7072)

8425001

py : logging and flake8 suppression refactoring (ggerganov#7081)

6fbd432

Set one as executable and add basicConfig() to another. Also added noqa tag to test scripts.

command-r : add BPE pre-tokenization (ggerganov#7063)

889bdd7

* Add BPE pre-tokenization for Command-R/R+. * Bump transformers convert requirement. * command-r : add individual digits regex --------- Co-authored-by: Georgi Gerganov <[email protected]>

readme : add note that LLaMA 3 is not supported with convert.py (gger…

ca36326

…ganov#7065)

Adding support for the --numa argument for llama-bench. (ggerganov#7080)

628b299

minor : fix trailing whitespace

bcdee0d

Add an option to build without CUDA VMM (ggerganov#7067)

858f6b7

Add an option to build ggml cuda without CUDA VMM resolves ggerganov#6889 https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4

ci : add GG_BUILD_EXTRA_TESTS_0 env (ggerganov#7098)

947d3ad

* ci : add GG_BUILD_EXTRA_TESTS_0 env ggml-ci * Update run.sh ggml-ci

docs: fix typos (ggerganov#7124)

04976db

* fix typo * fix typos * fix typo * fix typos * fix typo * fix typos

readme : update hot topics

53d6c52

server : update readme with undocumented options (ggerganov#7013)

260b7c6

Fix OLMo HF to GGUF conversion (ggerganov#6910)

b6aa670

server: fix incorrectly reported token probabilities (ggerganov#7125)

af0a5b6

* server: normalize token probabilities * fix temperature == 0.0f
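To show why normalization and a zero temperature interact, here is a generic sketch of turning logits into probabilities. It is not the server's code: the function name is hypothetical, and treating temperature 0 as a greedy one-hot distribution is an assumption about the intended behaviour.

```python
import math


def token_probabilities(logits, temperature):
    """Convert logits to probabilities that sum to 1."""
    if not logits:
        return []
    if temperature <= 0.0:
        # Dividing by a zero temperature is undefined, so fall back to greedy:
        # all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


sampled = token_probabilities([1.0, 2.0, 3.0], 0.8)
greedy = token_probabilities([1.0, 3.0, 2.0], 0.0)
```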

metal : fix unused warning

c0e6fbf

compare-llama-bench.py: add missing basicConfig (ggerganov#7138)

acdce3c

* compare-llama-bench.py: add missing basicConfig * compare-llama-bench.py: Add line break between error message and print_help() * Add regular print() markdown table

py : also print the normalizers

7e0b6a7

convert : add BPE pre-tokenization for DBRX (ggerganov#7132)

4cd621c

* Add BPE pre-tokenization for DBRX. * Add vocab GGUFs. * Remove test. * Remove GGUFs.

clean up json_value & server_log (ggerganov#7142)

1fd9c17

slaren and others added 28 commits May 14, 2024 17:33

llama : disable pipeline parallelism with nkvo (ggerganov#7265)

5416002

Revert "move ndk code to a new library (ggerganov#6951)" (ggerganov#7282

1265c67

) This reverts commit efc8f76.

server: free sampling contexts on exit (ggerganov#7264)

4f02636

* server: free sampling contexts on exit This cleans up last leak found by the address sanitizer. * fix whitespace * fix whitespace

ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (whisper/…

182adef

…2128)

ggml : try fix ppc64 (whisper/0)

c3c88f2

metal : tune soft_max number of threads (whisper/0)

f308ea7

sync : ggml

a5e3fde

ggml-ci

metal : support FA without mask + add asserts (ggerganov#7278)

e8a7fd4

* ggml : fa without mask + add asserts ggml-ci * metal : support non-contiguous KV ggml-ci

script : sync ggml-rpc

9f77348

server bench: fix bench not waiting for model load (ggerganov#7284)

583fd6b

sync : ggml

29499bb

embedding : free the batch after execution (ggerganov#7297)

ea3b059

Add missing " (ggerganov#7303)

9a17ab9

ggml : tag ggml_tensor::backend as deprecated (ggerganov#7290)

344f912

readme : remove stray double quote (ggerganov#7310)

8f7080b

Signed-off-by: Daniel Bevenius <[email protected]>

ci: fix bin/Release path for windows-arm64 builds (ggerganov#7317)

172b782

Switch to Ninja Multi-Config CMake generator to resurrect bin/Release path that broke artifact packaging in CI.

grammar, json, llama: replace push on emplace if it possible (ggergan…

0350f58

…ov#7273)

convert : get general.name from model dir, not its parent (ggerganov#…

dda64fc

…5615) Co-authored-by: Brian <[email protected]>

rpc : add command line arg for specifying backend memory

3b3963c

ref: ggerganov#7293

rpc : get available mem for the CPU backend

9afdffe

This can be overridden with the -m command line option ref: ggerganov#7293

Merge branch 'layla-build' into merge

37df164

l3utterfly merged commit 1a7dbdd into layla-build May 16, 2024
15 of 21 checks passed

l3utterfly deleted the merge branch May 16, 2024 12:37
