[pull] master from ggerganov:master #142

Closed
wants to merge 34 commits into from

@pull pull bot commented Aug 14, 2024

See Commits and Changes for more details.


Created by pull[bot]


ggerganov and others added 2 commits August 14, 2024 09:14
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces
kylo5aby and others added 13 commits August 15, 2024 10:23
* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused white space
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and re-added debug code for causes
* Add nemotron GGUF conversion & inference support

* Fix formatting issues

* Remove unnecessary write_tensors()

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* Address comments by @compilade

* Replace ggml_mul_mat()->llm_build_lora_mm()

* Remove mutable variable

* Use  for bias tensors

* Cover corner case for role_scaling not in config.json

---------

Co-authored-by: compilade <[email protected]>
* Add support for cpu_get_num_physical_cores() on Windows

* fix build bug on msys2-clang64 and ucrt64

* avoid adding new function

* add new macros to avoid windows+mingw64

* Add error checking to return default value
* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <[email protected]>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <[email protected]>

* fix lint

Co-Authored-By: compilade <[email protected]>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: compilade <[email protected]>
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update cmakelist

* update cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove the directory at line 33 of /cmakelists.txt (in the main dir, not in example)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* move load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <[email protected]>
Co-authored-by: harvestingmoon <[email protected]>
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Add printing to check weights match torch version

* minor code style changes

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
ggerganov and others added 4 commits August 18, 2024 07:43
Add more checks which prevent RPC server from crashing if invalid input
is received from client
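
As an illustration of the kind of check this commit message describes, here is a minimal sketch of validating a client-supplied offset and size before touching a buffer. The names `range_in_buffer` and `handle_read` are hypothetical and are not taken from the RPC server code; the sketch only shows an overflow-safe bounds test that rejects a bad request instead of crashing.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True only if [offset, offset + size) lies inside a buffer of buffer_size bytes.
// The subtraction form avoids the unsigned overflow that offset + size could hit.
bool range_in_buffer(uint64_t offset, uint64_t size, uint64_t buffer_size) {
    return offset <= buffer_size && size <= buffer_size - offset;
}

// Hypothetical request handler: copies the requested range into `out`,
// or returns false without reading anything when the range is invalid.
bool handle_read(const std::vector<uint8_t> & buffer, uint64_t offset, uint64_t size,
                 std::vector<uint8_t> & out) {
    if (!range_in_buffer(offset, size, buffer.size())) {
        return false;
    }
    const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(offset);
    out.assign(first, first + static_cast<std::ptrdiff_t>(size));
    return true;
}
```

Writing the test as `size <= buffer_size - offset` rather than `offset + size <= buffer_size` matters because a malicious client could pick values whose sum wraps around.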
Co-authored-by: xuedinge233 <[email protected]>
Co-authored-by: hipudding <[email protected]>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 19, 2024
fairydreaming and others added 2 commits August 20, 2024 12:09
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <[email protected]>

* sycl: refine convert

Signed-off-by: zhentaoyu <[email protected]>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <[email protected]>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <[email protected]>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <[email protected]>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <[email protected]>

---------

Signed-off-by: zhentaoyu <[email protected]>
@github-actions github-actions bot added the SYCL label Aug 20, 2024
airMeng and others added 8 commits August 20, 2024 23:50
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
…LAVA CLIP model. (#8984)

* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <[email protected]>

* fix-up coding style.

Signed-off-by: Changyeon Kim <[email protected]>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <[email protected]>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <[email protected]>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <[email protected]>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
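
The idea in the commit messages above can be sketched as follows: `std::priority_queue::top()` returns a const reference, so taking the element out normally copies it, while a derived class can reach the protected container and move the element instead. This is a minimal sketch, not the code from the commit; `bigram`, `bigram_greater`, and `movable_priority_queue` are illustrative names, and `bigram` is a simplified stand-in for `llm_bigram_bpe`.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for llm_bigram_bpe (assumption: only text and rank matter here).
struct bigram {
    std::string text;
    int         rank;
};

// Orders the queue so that the lowest rank is on top.
struct bigram_greater {
    bool operator()(const bigram & a, const bigram & b) const {
        return a.rank > b.rank;
    }
};

// std::priority_queue keeps its container `c` and comparator `comp` as protected
// members, so a derived class can move the top element out instead of copying it.
template <typename T, typename Container = std::vector<T>,
          typename Compare = std::less<typename Container::value_type>>
class movable_priority_queue : public std::priority_queue<T, Container, Compare> {
public:
    T pop_move() {
        // Steal the top element, then let pop_heap repair the heap and drop the husk.
        T item = std::move(this->c.front());
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        this->c.pop_back();
        return item;
    }
};
```

With this in place, a caller would write `bigram b = queue.pop_move();` in place of a `top()` copy followed by `pop()`, which avoids copying the `std::string` member.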
* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <[email protected]>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <[email protected]>
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
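
To picture what reading arguments from environment variables involves, here is a minimal sketch of a lookup with a fallback. The helper names and the precedence order (explicit command-line value first, then environment, then default) are assumptions for illustration and are not taken from the server code.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: value of environment variable `name`, or `fallback` when unset.
std::string env_or(const char * name, const std::string & fallback) {
    const char * value = std::getenv(name);
    return value != nullptr ? std::string(value) : fallback;
}

// Hypothetical helper: integer value of `name`, or `fallback` when the variable
// is unset, not a whole number, or outside the range of int.
int env_or_int(const char * name, int fallback) {
    const char * value = std::getenv(name);
    if (value == nullptr) {
        return fallback;
    }
    errno = 0;
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    const bool malformed    = end == value || *end != '\0';
    const bool out_of_range = errno == ERANGE || parsed < INT_MIN || parsed > INT_MAX;
    return (malformed || out_of_range) ? fallback : static_cast<int>(parsed);
}

// Assumed precedence: a value given on the command line (non-null pointer) wins,
// otherwise the environment is consulted, otherwise the default is used.
int resolve_int(const int * cli_value, const char * env_name, int fallback) {
    return cli_value != nullptr ? *cli_value : env_or_int(env_name, fallback);
}
```

Falling back on malformed values keeps a stray environment setting from aborting startup; a stricter design could report an error instead.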
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <[email protected]>
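
The equal-sequence-length batch splits mentioned in the "llama : advanced batch splits" message above can be pictured with a small sketch. It is a simplification under stated assumptions, not the implementation from the commit: the types `seq_tokens` and `equal_split` and the function `split_equal` are hypothetical, and the sketch ignores any upper limit on split size. Each split takes the same number of tokens from every sequence that still has tokens left.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical input: how many tokens each sequence still has to process.
struct seq_tokens {
    int seq_id;
    int n_tokens;
};

// Hypothetical output: one split in which every listed sequence
// contributes exactly n_seq_tokens tokens.
struct equal_split {
    std::vector<int> seq_ids;
    int              n_seq_tokens;
};

std::vector<equal_split> split_equal(std::vector<seq_tokens> seqs) {
    std::vector<equal_split> splits;
    while (true) {
        // Find the shortest remaining length among sequences that are not done.
        int shortest = 0;
        for (const auto & s : seqs) {
            if (s.n_tokens > 0) {
                shortest = shortest == 0 ? s.n_tokens : std::min(shortest, s.n_tokens);
            }
        }
        if (shortest == 0) {
            break; // every sequence is fully consumed
        }
        // Take `shortest` tokens from each unfinished sequence.
        equal_split split;
        split.n_seq_tokens = shortest;
        for (auto & s : seqs) {
            if (s.n_tokens > 0) {
                split.seq_ids.push_back(s.seq_id);
                s.n_tokens -= shortest;
            }
        }
        splits.push_back(split);
    }
    return splits;
}
```

Because every sequence in a split has the same length, an operator over that split can treat the tokens as a regular sequences-by-tokens grid, which is the simplification for recurrent models that the commit message refers to.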
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc