
[pull] master from ggerganov:master #65

Closed
wants to merge 23 commits into master from ggerganov:master

Conversation

pull[bot] commented Mar 22, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

NeoZhangJianyu and others added 14 commits March 22, 2024 15:19
* metal : require ne00 >= 128 for mat-mat kernels

ggml-ci

* llama : pad n_ctx by 32

ggml-ci
* metal : proper assert for mat-mat memory alignment

ggml-ci

* readme : add notice about the bug fix

* metal : fix the fix

ggml-ci
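
For context, padding n_ctx to a multiple of 32 is a simple round-up to the alignment the mat-mat kernels expect. A standalone sketch of the idiom (mirroring ggml's GGML_PAD round-up, shown here purely for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n (n must be a power of two).
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    printf("%u\n", pad_to(4001, 32)); // -> 4032
    printf("%u\n", pad_to(4096, 32)); // already aligned -> 4096
}
```
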
…#6208)

* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy

* add LLAMA_CUDA_NO_PEER_COPY to HIP build
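
A minimal sketch of how such a compile-time switch can guard the copy path. The CUDA runtime calls are real, but the surrounding function is illustrative, not the actual llama.cpp code:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Illustrative only: bypass broken p2p copies when LLAMA_CUDA_NO_PEER_COPY is defined.
static void copy_between_devices(void * dst, int dst_dev,
                                 const void * src, int src_dev,
                                 size_t nbytes, cudaStream_t stream) {
#ifndef LLAMA_CUDA_NO_PEER_COPY
    // Fast path: direct device-to-device peer copy.
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, nbytes, stream);
#else
    // Workaround: stage through host memory, avoiding the p2p path entirely.
    std::vector<unsigned char> staging(nbytes);
    cudaSetDevice(src_dev);
    cudaMemcpyAsync(staging.data(), src, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaSetDevice(dst_dev);
    cudaMemcpy(dst, staging.data(), nbytes, cudaMemcpyHostToDevice);
#endif
}
```
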
* json: use ordered json in the server/schema converter to respect the original key order

* json: ws nits

* json: support non-string const / enums
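
For context: the server code uses nlohmann/json, where plain nlohmann::json stores object keys sorted while nlohmann::ordered_json preserves insertion order. A minimal sketch of the difference:

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    // nlohmann::json sorts object keys alphabetically...
    nlohmann::json sorted = {{"b", 1}, {"a", 2}};
    std::cout << sorted.dump() << "\n";   // {"a":2,"b":1}

    // ...while nlohmann::ordered_json keeps insertion order, which matters
    // when a schema's property order is significant for the converter.
    nlohmann::ordered_json ordered = {{"b", 1}, {"a", 2}};
    std::cout << ordered.dump() << "\n";  // {"b":1,"a":2}
}
```
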
* json: only attempt python & node schema conversion tests if their bins are present

Tests introduced in #5978
disabled in #6198

* json: emit orange (warning) annotations when tests are skipped

* json: ensure py/js schema conv tested on ubuntu-focal-make

* json: print env vars in test
IQ3_XS was not mentioned, while IQ3_S and IQ3_M were present twice.

This PR corrects the listing in the manner that was probably intended initially.
* common : add HF arg helpers

* common : remove defaults
* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <[email protected]>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: address PR feedback:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these tensors optional, and switch some layer-creation tensors to optional

Co-authored-by: Georgi Gerganov <[email protected]>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <[email protected]>

* llama_model_loader: fail if backend cannot allocate buffer

* fix mmap buffer management

* llama_model_loader: map the file to a backend buffer only if the allocation succeeds

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor: use the same variable name for consistency, fix spacing in type casts

* llama_model_loader: fail if any backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <[email protected]>

* fix loop over pointer

Co-authored-by: slaren <[email protected]>

* llama_model_loader: if the declared n_tensors does not match the number of tensors loaded from the splits, throw an exception instead of asserting

* llama_model_loader: ensure mappings vector has the expected size

* llama_model_loader: use at() instead of operator[] where the lookup should never insert into the map

* llama_model_loader: immediately add the backend buffer to the model buffers so it can be freed if an error occurs during the next allocation; reserve the expected size

* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer

* llama_model_loader: fix map -> unordered map

* llama_split_prefix: use a clearer signature: pass the destination max length instead of the split path length.

Co-authored-by: Xuan Son Nguyen <[email protected]>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model sharding to the hot topics

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <[email protected]>

* fix llama_split_prefix

---------

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
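
For context, split files follow a naming scheme like ggml-model-q4_0-00001-of-00003.gguf. A standalone sketch of building such a path and recovering the prefix (illustrative helpers modeled on, but not identical to, the llama_split_path / llama_split_prefix API):

```cpp
#include <cstdio>
#include <cstring>

// Build "prefix-00001-of-00003.gguf" from a prefix and split indices.
static int split_path(char * dst, size_t maxlen,
                      const char * prefix, int split_no, int split_count) {
    return snprintf(dst, maxlen, "%s-%05d-of-%05d.gguf",
                    prefix, split_no + 1, split_count);
}

// Recover the prefix from a split path; returns the prefix length, 0 on failure.
static int split_prefix(char * dst, size_t maxlen,
                        const char * path, int split_no, int split_count) {
    char postfix[32];
    snprintf(postfix, sizeof(postfix), "-%05d-of-%05d.gguf",
             split_no + 1, split_count);
    const size_t path_len = strlen(path), post_len = strlen(postfix);
    // The path must end with the expected postfix.
    if (path_len <= post_len ||
        strcmp(path + path_len - post_len, postfix) != 0) {
        return 0;
    }
    const size_t prefix_len = path_len - post_len;
    if (prefix_len + 1 > maxlen) return 0;
    // snprintf always null-terminates, unlike a bare strncpy.
    snprintf(dst, maxlen, "%.*s", (int) prefix_len, path);
    return (int) prefix_len;
}

// usage: char buf[128]; split_path(buf, sizeof buf, "ggml-model-q4_0", 0, 3);
//        -> "ggml-model-q4_0-00001-of-00003.gguf"
```
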
pull[bot] added the ⤵️ pull label Mar 22, 2024
ikawrakow and others added 9 commits March 22, 2024 20:47
* quantize: be able to specify the output tensor type

* quantize: be able to specify the token embedding tensor type

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
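
A sketch of how the new per-tensor overrides might be used through the quantization API. The field names output_tensor_type / token_embedding_type are assumed from this commit's description, so check llama.h before relying on them:

```cpp
#include "llama.h"

int main() {
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype                = LLAMA_FTYPE_MOSTLY_Q4_K_M; // base quantization mix
    params.output_tensor_type   = GGML_TYPE_Q8_0; // keep output layer higher precision
    params.token_embedding_type = GGML_TYPE_Q8_0; // same for token embeddings

    // Returns 0 on success.
    return llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);
}
```
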
* convert-llama2c-to-ggml: enable conversion of multi-query models, #5608

* add test in build action

* Update build.yml

* Update build.yml

* Update build.yml

* gg patch
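
For context, multi-query / grouped-query attention means n_head_kv < n_head, with each KV head shared by a group of query heads. The standard head mapping, as a sketch (not the converter's actual code):

```cpp
#include <cassert>
#include <cstdio>

// Each group of n_head / n_head_kv query heads shares one KV head.
static int kv_head_for(int q_head, int n_head, int n_head_kv) {
    assert(n_head % n_head_kv == 0);
    return q_head / (n_head / n_head_kv);
}

int main() {
    // e.g. 8 query heads over 2 KV heads: q heads 0-3 -> kv 0, 4-7 -> kv 1
    for (int h = 0; h < 8; ++h) {
        printf("q %d -> kv %d\n", h, kv_head_for(h, 8, 2));
    }
}
```
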

* lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens
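
The lookup tooling is built around n-gram speculation: index n-grams from a corpus or from previous generations, then propose stored continuations as drafts. A toy sketch of the core idea (hypothetical, not the actual implementation, which tracks frequencies and multiple n-gram sizes):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using llama_token = int32_t;

// Toy n-gram cache: maps a fixed-size n-gram to the token that last followed it.
struct ngram_cache {
    static const size_t N = 3;
    std::map<std::vector<llama_token>, llama_token> next;

    // Index every n-gram -> following token pair in a token sequence.
    void update(const std::vector<llama_token> & toks) {
        for (size_t i = 0; i + N < toks.size(); ++i) {
            std::vector<llama_token> key(toks.begin() + i, toks.begin() + i + N);
            next[key] = toks[i + N];
        }
    }

    // Returns true and sets `draft` if the trailing n-gram has been seen before.
    bool propose(const std::vector<llama_token> & ctx, llama_token & draft) const {
        if (ctx.size() < N) return false;
        const std::vector<llama_token> key(ctx.end() - N, ctx.end());
        const auto it = next.find(key);
        if (it == next.end()) return false;
        draft = it->second;
        return true;
    }
};
```
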
* Add support for Grok model architecture

* Revert convert-hf-to-gguf to default options

* Fixed f_norm_rms_eps bug

* Fix whitespaces

* llama : fix grok rope type

* llama : minor

---------

Co-authored-by: Georgi Gerganov <[email protected]>
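
Adding an architecture like Grok largely comes down to registering its tensor names and reading hyperparameters such as f_norm_rms_eps from GGUF metadata. A sketch using the public gguf API; the metadata key follows the usual {arch}.attention.layer_norm_rms_epsilon pattern but is an assumption here:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // Open the file for metadata only, without allocating tensor data.
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file("grok-f16.gguf", params);
    if (!ctx) return 1;

    // Key name assumed from the GGUF naming convention for this hparam.
    const int kid = gguf_find_key(ctx, "grok.attention.layer_norm_rms_epsilon");
    if (kid >= 0) {
        printf("f_norm_rms_eps = %g\n", gguf_get_val_f32(ctx, kid));
    }
    gguf_free(ctx);
}
```
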
* llama: llama_split_prefix: fix strncpy not null-terminating the string (see the sketch after this commit list)
common: llama_load_model_from_url:
 - fix case-sensitive header name matching
 - support downloading additional splits in parallel
 - hide the password in the logged url

* common: EOL EOF

* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition

* common: change the max url length

* common: minor comment

* server: support HF URL options

* llama: llama_model_loader fix log

* common: use a constant for max url length

* common: clean up curl if file cannot be loaded in gguf

* server: tests: add split tests, and HF options params

* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda

* server: tests: re-enable the Release test on PR

* spacing

Co-authored-by: Georgi Gerganov <[email protected]>

* spacing

Co-authored-by: Georgi Gerganov <[email protected]>

* spacing

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
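
Two of these fixes are classic patterns worth spelling out: strncpy does not null-terminate when the source fills the buffer, and URLs should have their userinfo masked before logging. A standalone sketch of both (illustrative, not the exact llama.cpp code):

```cpp
#include <cstdio>
#include <string>

// strncpy leaves dst unterminated when strlen(src) >= maxlen; snprintf
// always terminates, which is the safe replacement for prefix copies.
static void copy_prefix(char * dst, size_t maxlen, const char * src) {
    snprintf(dst, maxlen, "%s", src);
}

// Mask the password part of a URL before logging, e.g.
// "https://user:secret@host/x" -> "https://user:********@host/x".
static std::string hide_password_in_url(const std::string & url) {
    const size_t scheme = url.find("://");
    const size_t at     = url.find('@');
    if (scheme == std::string::npos || at == std::string::npos || at < scheme) {
        return url; // no userinfo present
    }
    const size_t colon = url.find(':', scheme + 3);
    if (colon == std::string::npos || colon > at) {
        return url; // userinfo has no password part
    }
    return url.substr(0, colon + 1) + "********" + url.substr(at);
}
```
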