Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gg/flash attn #9

Merged
merged 116 commits into from
Apr 19, 2024
Merged

Conversation

l3utterfly
Copy link
Owner

No description provided.

danbev and others added 29 commits April 16, 2024 09:13
This commit adds special token metadata for Fill-In-the-Middle
(FIM)/Infill to the GGUF model.

The motivation for this is that currently there is support for CodeLlama
but other models exist now like CodeGemma, but the different models use
different token ids for the special tokens and this commit allows for
supporting multiple models.

Signed-off-by: Daniel Bevenius <[email protected]>
This commit updates the hf.sh script usage to include the --outdir option
and specifies the models directory as the output directory.

The motivation for this is to avoid cluttering the root directory with
model files.

Signed-off-by: Daniel Bevenius <[email protected]>
* support qwen2moe

* fix-review

* metal : support unary ops for nelements % 4 != 0

* metal : require contiguousness for float4 unary kernels

* metal : require contiguousness for float4 unary kernels (cont)

* fix-review

* names : for brevity "SHARED_EXP" -> "SHEXP"

* llama : reuse build_moe_ffn()

* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* StableLM2 12B support for huggingface -> GGUF

* StableLM12 tensormapping and constants

* StableLM-2-12b model support

* fix

* Added 12B support

* Removed autoformatting; resolved bug where model_arch was not selecting StableLM2

* Formatting

* Do QK norm stacking in model conversion step

* Converge StableLM and StableLM2 code to simplify graph construction

* Fix accidental removal

* Removed warnings

* Revert formatter

* Move QK norm stack to private function so it's easier to read

* refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently

* Proper check for None type for new_name to avoid crash; formatting; revert change to base class `write_tensors()`

* Format

* Formatting

* format

Co-authored-by: compilade <[email protected]>

* Fix incorrect check for K norm

* space after commas; Keep indentation multiple of 4 spaces

* Flake8 format

* Removed unnecessary conditional branches

* Removed unused comment

* Fixed incorrect tensor passing

* Format

---------

Co-authored-by: compilade <[email protected]>
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
* fix autoawq quantized gemma model convert error

using autoawq to quantize gemma model will include a lm_head.weight tensor in model-00001-of-00002.safetensors. it result in this situation that convert-hf-to-gguf.py can't map lm_head.weight. skip loading this tensor could prevent this error.

* change code to full string match and print necessary message

change code to full string match and print a short message to inform users that lm_head.weight has been skipped.

---------

Co-authored-by: Zheng.Deng <[email protected]>
* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* build : sgemm.o only when needed

ggml-ci

* llamafile : tmp disable due to MoE bug

ggml-ci
* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel
@l3utterfly l3utterfly merged commit b03b419 into l3utterfly:test-flash-attn Apr 19, 2024
12 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.