merge from upstream #40

l3utterfly · 2024-10-01T09:28:57Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High


* ggml: Added run-time detection of neon, i8mm and sve

  Adds run-time detection of the Arm instruction set features neon, i8mm and sve for Linux and Apple build targets.
* ggml: Extend feature detection to include non aarch64 Arm arch
* ggml: Move definition of ggml_arm_arch_features to the global data section
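As a rough illustration of what run-time detection of these features can look like, here is a minimal sketch for Linux on aarch64 only, based on `getauxval`. It is not the code from the commit: the names `arm_features`, `decode_hwcaps` and `detect_arm_features` are hypothetical, the Apple path is not covered, and the bit positions are the aarch64 Linux HWCAP values written out as local constants.

```cpp
#include <cstdint>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP, AT_HWCAP2
#endif

// aarch64 Linux HWCAP bit positions, spelled out locally so the sketch
// does not depend on how recent the system headers are.
constexpr unsigned long kHwcapAsimd = 1UL << 1;   // Advanced SIMD (NEON), in AT_HWCAP
constexpr unsigned long kHwcapSve   = 1UL << 22;  // SVE, in AT_HWCAP
constexpr unsigned long kHwcap2I8mm = 1UL << 13;  // Int8 matrix multiply, in AT_HWCAP2

// Hypothetical result type; the real project keeps its own structure.
struct arm_features {
    bool neon = false;
    bool i8mm = false;
    bool sve  = false;
};

// Pure helper: turn raw capability words into feature flags.
inline arm_features decode_hwcaps(unsigned long hwcap, unsigned long hwcap2) {
    arm_features f;
    f.neon = (hwcap  & kHwcapAsimd) != 0;
    f.sve  = (hwcap  & kHwcapSve)   != 0;
    f.i8mm = (hwcap2 & kHwcap2I8mm) != 0;
    return f;
}

// Queries the kernel on aarch64 Linux; reports no features elsewhere.
inline arm_features detect_arm_features() {
#if defined(__linux__) && defined(__aarch64__)
    return decode_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return arm_features{};
#endif
}
```

Keeping the decoding step separate from the `getauxval` call means the bit logic can be checked on any host, including ones that are not Arm.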

@compilade

* convert chameleon hf to gguf
* add chameleon tokenizer tests
* fix lint
* implement chameleon graph
* add swin norm param
* return qk norm weights and biases to original format
* implement swin norm
* suppress image token output
* rem tabs
* add comment to conversion
* fix ci
* check for k norm separately
* adapt to new lora implementation
* fix layer input for swin norm
* move swin_norm in gguf writer
* add comment regarding special token regex in chameleon pre-tokenizer
* Update src/llama.cpp

  Co-authored-by: compilade <[email protected]>
* fix punctuation regex in chameleon pre-tokenizer (@compilade)

  Co-authored-by: compilade <[email protected]>
* fix lint
* trigger ci

---------

Co-authored-by: compilade <[email protected]>

* refactor tokenizer
* llama : make llm_tokenizer more private ggml-ci
* refactor tokenizer
* refactor tokenizer
* llama : make llm_tokenizer more private ggml-ci
* remove unused files
* remove unused fields to avoid unused field build error
* avoid symbol link error
* Update src/llama.cpp
* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>


* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip) [no ci]
* llama : add "rank" pooling type ggml-ci
* server : add rerank endpoint ggml-ci
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style ggml-ci
* server : initiate tests for later ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp

  Co-authored-by: Xuan Son Nguyen <[email protected]>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci] ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>


A return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants when the graph is built, so that the local workgroup could be sized appropriately. But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <[email protected]>




Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)
  → 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>



* convert : refactor rope_freqs generation

  This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion

  MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

NeoZhangJianyu and others added 28 commits September 26, 2024 17:38

[SYCL] add missed dll file in package (ggerganov#9577)

95bc82f

* update oneapi to 2024.2
* use 2024.1

---------

Co-authored-by: arthw <[email protected]>

cmake : add option for common library (ggerganov#9661)

44f59b4

readme : update hot topics

b5de3b7

Enable use to the rebar feature to upload buffers to the device. (gge…

89f9944

…rganov#9251)

readme : add tool (ggerganov#9655)

43bcdd9

llama : add comment about thread-safety [no ci] (ggerganov#9449)

7398427

test-backend-ops : use flops for some performance tests (ggerganov#9657)

1b2f992

* test-backend-ops : use flops for some performance tests
  - parallelize tensor quantization
  - use a different set of cases for performance and correctness tests
  - run each test for at least one second

contrib : add Resources section (ggerganov#9675)

589b48d

py : add model class for Chameleon conversion (ggerganov#9683)

f99d3f8

common : ensure llama_batch size does not exceed max size (ggerganov#…

faac0ba

…9668)

A crash was observed when the number of tokens added to a batch exceeds llama_batch size. An assertion in llama_batch_add was added to protect against llama_batch size overflow.
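The idea behind that guard can be sketched as follows. This is a simplified illustration, not the actual `llama_batch_add` code: `toy_batch` and `toy_batch_add` are hypothetical stand-ins, and the sketch throws an exception where the commit describes an assertion, purely so that the overflow path can be exercised without aborting the process.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a fixed-capacity token batch (not the real llama_batch).
struct toy_batch {
    std::vector<int32_t> token;  // storage allocated once, up front
    int32_t n_tokens = 0;        // number of slots currently in use

    explicit toy_batch(int32_t capacity) : token(capacity) {}

    int32_t capacity() const { return static_cast<int32_t>(token.size()); }
};

// Appends one token id, refusing to write past the allocated capacity.
// Without this check, the write below would land outside the buffer
// once n_tokens reaches the capacity.
inline void toy_batch_add(toy_batch & batch, int32_t id) {
    if (batch.n_tokens >= batch.capacity()) {
        throw std::length_error("toy_batch size exceeded");
    }
    batch.token[batch.n_tokens] = id;
    batch.n_tokens++;
}
```

Checking before the write keeps the failure at the point where the caller adds one token too many, rather than later when corrupted memory is read.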

ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)

6084bfb

vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml…

0de8b20

…/961)

vulkan : multithread pipeline creation (ggml/963)

641002f

CUDA: remove bad assert (ggml/972)

aaa4099

sync : ggml

d0b1d66

ggml : define missing HWCAP flags (ggerganov#9684)

c919d5d

ggml-ci Co-authored-by: Willy Tarreau <[email protected]>

console : utf-8 fix for windows stdin (ggerganov#9690)

8277a81

* utf-8 fix for windows stdin
* Update common/console.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>

py : update transfomers version (ggerganov#9694)

08a43d0

* update transformers version.
* update hfh version.

ci : reduce severity of unused Pyright ignore comments (ggerganov#9697)

511636d

Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (gg…

6f1d9d7

…erganov#9641)

* Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS
* Set ROCM_DOCKER_ARCH as a string, since it otherwise builds incorrectly and exits with an OOM error code

llama : print correct model type for Llama 3.2 1B and 3B

a90484c

l3utterfly merged commit a063017 into layla-build Oct 1, 2024
65 checks passed

github-actions bot added the Nvidia GPU label Oct 1, 2024

github-actions bot added Vulkan testing build examples devops python server ggml script labels Oct 1, 2024
