Releases: teleprint-me/llama.cpp
b1731
clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes
  ggml-ci
* server : add log message
b1708
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
* update: awq support llama-7b model
* update: change order
* update: benchmark results for llama2-7b
* update: mistral 7b v1 benchmark
* update: support 4 models
* fix: Readme
* update: ready for PR
* update: readme
* fix: readme
* update: change order import
* black
* format code
* update: work for both mpt and awqmpt
* update: readme
* Rename to llm_build_ffn_mpt_awq
* Formatted other files
* Fixed params count
* fix: remove code
* update: more detail for mpt
* fix: readme
* fix: readme
* update: change folder architecture
* fix: common.cpp
* fix: readme
* fix: remove ggml_repeat
* update: cicd
* update: cicd
* update: remove use_awq arg
* update: readme
* llama : adapt plamo to new ffn
  ggml-ci
---------
Co-authored-by: Trần Đức Nam
Co-authored-by: Le Hoang Anh
Co-authored-by: Georgi Gerganov
b1707
finetune : fix output formatting in print_params (#4653)
This commit fixes the output formatting in the print_params function, which currently looks like this:
```console
print_params: n_vocab: 32000
print_params: n_ctx: 128
print_params: n_embd: 4096
print_params: n_ff: 11008
print_params: n_head: 32
print_params: n_head_kv: 32
print_params: n_layer: 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
With this commit, the output will look like this:
```console
print_params: n_vocab : 32000
print_params: n_ctx : 128
print_params: n_embd : 4096
print_params: n_ff : 11008
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
Signed-off-by: Daniel Bevenius
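The change above amounts to giving every parameter name the same `name : value` layout. As a rough illustration only (this is a Python sketch, not the C/C++ code in finetune, and the helper name `format_params` is hypothetical), padding each name to a common width lines the separators up:

```python
def format_params(prefix, params):
    """Return one 'prefix: name : value' line per parameter, with the
    ' : ' separator aligned in the same column on every line."""
    if not params:
        return []
    # Pad every name to the length of the longest one.
    width = max(len(name) for name in params)
    return [f"{prefix}: {name:<{width}} : {value}" for name, value in params.items()]


if __name__ == "__main__":
    # Values are taken from the console output quoted in the commit message.
    example = {
        "n_vocab": "32000",
        "n_ctx": "128",
        "norm_rms_eps": "0.000010",
        "rope_freq_scale": "1.000000",
    }
    for line in format_params("print_params", example):
        print(line)
```

Computing the width from the data keeps the columns aligned when a longer parameter name is added later; a fixed field width in a format string would be the closer analogue of a `printf`-style fix.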
b1703
cuda : fix vmm pool with multi GPU (#4620)
* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
  ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
b1702
Update comment for AdamW implementation reference. (#4604)
Co-authored-by: Will Findley
b1699
simplify bug issue template (#4623)
b1696
fallback to CPU buffer if host buffer alloc fails (#4610)
b1695
ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)
b1691
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations
* main loop finished, starting to debug
* BUG: generates gibberish/repeating tokens after a while
* kv_cache management
* Added colors to distinguish drafted tokens (--color). Updated README
* lookup : fix token positions in the draft batch
* lookup : use n_draft from CLI params
* lookup : final touches
---------
Co-authored-by: Leon Ericsson
Co-authored-by: Georgi Gerganov
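For readers unfamiliar with the technique named in this release, prompt lookup decoding drafts candidate tokens by finding an earlier occurrence of the most recent n-gram in the existing context and proposing the tokens that followed it; the model then verifies the draft. The sketch below shows only the drafting step, in Python, under stated assumptions: the function name `find_draft_tokens` and the parameters `ngram_size` and `n_draft` are illustrative, and this is not the code of the llama.cpp `lookup` example.

```python
def find_draft_tokens(tokens, ngram_size=3, n_draft=10):
    """Draft up to n_draft tokens by matching the last ngram_size tokens
    against earlier context. Returns [] when no earlier match exists."""
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan from the most recent candidate position backwards, so the
    # latest earlier occurrence of the n-gram wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            # Propose the tokens that followed the matched n-gram.
            return tokens[end:end + n_draft]
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(find_draft_tokens(context, ngram_size=3, n_draft=2))
```

The drafted tokens are only guesses. In a full decoder they would be evaluated by the model in a single batch, and only the prefix that agrees with the model's own predictions would be kept.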
b1680
ggml : change ggml_scale to take a float instead of tensor (#4573)
* ggml : change ggml_scale to take a float instead of tensor
* ggml : fix CPU implementation
* tests : fix test-grad0
  ggml-ci