
merged upstream #5

Merged
merged 16 commits into layla-build on Mar 14, 2024

Conversation

l3utterfly
Owner

No description provided.

GetTuh and others added 16 commits March 11, 2024 14:40
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass cpts as const ref
* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name
* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125 (see the dequantization sketch after this commit group).

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
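
To make the shift described in the iq1_s commit above concrete, here is a minimal sketch of the dequantization rule it implies. The function name and parameters are illustrative placeholders, not the actual `block_iq1_s` layout or ggml API; only the arithmetic (scale * (quant + delta) with delta = +/-0.125, its sign taken from one of the 4 scale bits) is taken from the commit message.

```cpp
// Minimal sketch of the iq1_s idea above: a ternary quant q in {-1, 0, +1}
// plus a shift delta = +/-0.125 (its sign stored in one of the 4 scale bits)
// is mapped back to a float as d * (q + delta).
// All names here are hypothetical; the real block_iq1_s layout is more involved.
#include <cstdio>

static float iq1s_dequant_sketch(int q /* -1, 0 or +1 */, bool delta_negative, float d) {
    const float delta = delta_negative ? -0.125f : 0.125f;
    return d * (static_cast<float>(q) + delta);
}

int main() {
    const float d = 2.0f; // example per-block scale
    const int qs[3] = {-1, 0, 1};
    for (int q : qs) {
        std::printf("q=%+d -> %+.3f (delta=+0.125), %+.3f (delta=-0.125)\n",
                    q, iq1s_dequant_sketch(q, false, d), iq1s_dequant_sketch(q, true, d));
    }
    return 0;
}
```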
* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increased to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage (see the batching sketch after this commit group)

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
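
As a rough illustration of the batching knobs introduced in this commit group, here is a hedged sketch of how a caller might set them through llama.h as it looked around this merge; the field and function names are assumed from that era's API, and the model path is a placeholder. On the command line the same split is exposed via the existing batch-size flag and the new `-ub`/`--ubatch-size` flag, while the number of pipelined input copies is a compile-time option (`LLAMA_SCHED_MAX_COPIES`, default 4 after this change).

```cpp
// Hedged sketch of the n_batch / n_ubatch split described above, assuming the
// llama.h API around the time of this merge (llama_context_params gained
// n_ubatch and n_seq_max in the pipeline-parallelism change).
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 4096;
    cparams.n_batch   = 2048; // logical batch: max tokens submitted per llama_decode() call
    cparams.n_ubatch  = 512;  // physical micro-batch: what each compute graph actually processes
    cparams.n_seq_max = 4;    // max parallel sequences (formerly n_parallel)

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize, llama_decode(), sample ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```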
@l3utterfly l3utterfly merged commit 28d9d38 into layla-build Mar 14, 2024
71 of 119 checks passed
l3utterfly pushed a commit that referenced this pull request May 27, 2024
* main : don't print special tokens with --grammar

The CLI interface was recently changed to print special control tokens
such as the </s> stop token. Such tokens shouldn't be printed when the
grammar flag is passed, unless the grammar itself specifies them, because
that breaks shell-scriptability (see the routing sketch after this commit group).

* main: use separate stream for control characters

* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out

* main: dprintf isn't part of the IEEE POSIX standard. Just use write().

* main: remove --ctrl-token-fd-out in favor of fcntl()-based detection

* common.cpp: restore accidentally removed --interactive-first

* main: only merge stdout and control token if not in conversation or grammar mode

* main: rejig control token descriptor handling

* main: must check pipe status on very top of program

* main: renamed --ctrl-token-no-out to --no-special and other refactoring

* main: refactor ctrl_token_no_out --> no_special

* llama: rename llama_token_is_control_token() to llama_token_is_control()

* main: remove special token file descriptor feature (#5)

---------

Co-authored-by: Brian <[email protected]>
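
The control-token handling described in this commit series lends itself to a short sketch: keep special tokens off the text stream so shell pipelines see only generated text, drop them entirely under --no-special, and decide the destination descriptor once at startup. Everything below is an illustrative stand-in rather than the actual main.cpp code, and isatty() is used as a simpler placeholder for the fcntl()-based detection mentioned above.

```cpp
// Hedged sketch of the routing idea above: control tokens never reach the
// normal stdout stream when it is a pipe, and are dropped under --no-special.
// Names (no_special, ctrl_fd, token_is_control) are illustrative placeholders.
#include <string>
#include <unistd.h>   // write(), isatty(), STDOUT_FILENO, STDERR_FILENO

static void emit_piece(const std::string & piece, bool token_is_control,
                       bool no_special, int ctrl_fd) {
    if (token_is_control) {
        if (no_special) {
            return;                                    // --no-special: drop it entirely
        }
        // control tokens go to a separate descriptor, via write() rather than dprintf()
        (void) write(ctrl_fd, piece.data(), piece.size());
        return;
    }
    (void) write(STDOUT_FILENO, piece.data(), piece.size());
}

int main() {
    // decide the control-token destination up front, before any output happens
    const int  ctrl_fd    = isatty(STDOUT_FILENO) ? STDOUT_FILENO : STDERR_FILENO;
    const bool no_special = false;                     // would come from --no-special

    emit_piece("Hello, world", /*token_is_control=*/false, no_special, ctrl_fd);
    emit_piece("</s>",         /*token_is_control=*/true,  no_special, ctrl_fd);
    return 0;
}
```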