Gg/flash attn #9

l3utterfly · 2024-04-19T07:45:46Z

No description provided.


This commit adds special token metadata for Fill-In-the-Middle (FIM)/Infill to the GGUF model. The motivation is that only CodeLlama is currently supported, while other models such as CodeGemma now exist and use different token ids for the special tokens. Storing the ids as metadata allows multiple models to be supported.

Signed-off-by: Daniel Bevenius <[email protected]>
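To illustrate why per-model metadata helps, here is a minimal sketch of assembling an infill prompt from token ids read out of a metadata mapping. The key names, the token id values, and the prefix/suffix/middle ordering are all assumptions made for illustration; they are not taken from this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimTokens:
    """Special token ids needed to build an infill prompt."""
    prefix: int
    suffix: int
    middle: int


def fim_tokens_from_metadata(metadata):
    # Key names are assumed for this sketch, not confirmed by the commit.
    return FimTokens(
        prefix=metadata["tokenizer.ggml.prefix_token_id"],
        suffix=metadata["tokenizer.ggml.suffix_token_id"],
        middle=metadata["tokenizer.ggml.middle_token_id"],
    )


def build_infill_prompt(tokens, prefix_ids, suffix_ids):
    # Assumed ordering: <PRE> prefix <SUF> suffix <MID>; the model then
    # generates the missing middle part.
    return [tokens.prefix, *prefix_ids, tokens.suffix, *suffix_ids, tokens.middle]


# Hypothetical token ids for two different models.
model_a = {
    "tokenizer.ggml.prefix_token_id": 1001,
    "tokenizer.ggml.suffix_token_id": 1002,
    "tokenizer.ggml.middle_token_id": 1003,
}
model_b = {
    "tokenizer.ggml.prefix_token_id": 7,
    "tokenizer.ggml.suffix_token_id": 9,
    "tokenizer.ggml.middle_token_id": 8,
}

prompt_a = build_infill_prompt(fim_tokens_from_metadata(model_a), [11, 12], [21])
prompt_b = build_infill_prompt(fim_tokens_from_metadata(model_b), [11, 12], [21])
```

The same prompt-building code serves both models because the ids come from the model file rather than from constants in the source.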

This commit updates the hf.sh script usage to include the --outdir option and specifies the models directory as the output directory. The motivation for this is to avoid cluttering the root directory with model files. Signed-off-by: Daniel Bevenius <[email protected]>

* support qwen2moe
* fix-review
* metal : support unary ops for nelements % 4 != 0
* metal : require contiguousness for float4 unary kernels
* metal : require contiguousness for float4 unary kernels (cont)
* fix-review
* names : for brevity "SHARED_EXP" -> "SHEXP"
* llama : reuse build_moe_ffn()
* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* StableLM2 12B support for huggingface -> GGUF
* StableLM12 tensormapping and constants
* StableLM-2-12b model support
* fix
* Added 12B support
* Removed autoformatting; resolved bug where model_arch was not selecting StableLM2
* Formatting
* Do QK norm stacking in model conversion step
* Converge StableLM and StableLM2 code to simplify graph construction
* Fix accidental removal
* Removed warnings
* Revert formatter
* Move QK norm stack to private function so it's easier to read
* refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently
* Proper check for None type for new_name to avoid crash; formatting; revert change to base class `write_tensors()`
* Format
* Formatting
* format
  Co-authored-by: compilade <[email protected]>
* Fix incorrect check for K norm
* space after commas; Keep indentation multiple of 4 spaces
* Flake8 format
* Removed unnecessary conditional branches
* Removed unused comment
* Fixed incorrect tensor passing
* Format

---------

Co-authored-by: compilade <[email protected]>

This change upstreams llamafile's CPU matrix multiplication kernels, which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster, thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support, since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels, but also because I added support for correctly counting the number of cores on Alder Lake, so the default thread count discounts Intel's new efficiency cores. Core counting currently works only on Linux.

This work was sponsored by Mozilla, which has given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/
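As a rough picture of what an outer product kernel does, the sketch below computes a matrix product one small output tile at a time, updating the whole tile with a rank-1 (outer product) step for each position along the shared dimension. It is plain Python written for clarity, not the tinyBLAS code; the function name and tile sizes are assumptions, and the real kernels keep the tile in vector registers rather than in a dictionary.

```python
def matmul_outer_product(a, b, tile_m=2, tile_n=2):
    """Compute C = A @ B tile by tile using rank-1 updates.

    a is an m x k list of rows, b is a k x n list of rows.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            rows = range(i0, min(i0 + tile_m, m))
            cols = range(j0, min(j0 + tile_n, n))
            # The tile accumulator stands in for a block of registers.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                # Outer product of column p of A and row p of B,
                # restricted to this tile.
                for i in rows:
                    a_ip = a[i][p]
                    for j in cols:
                        acc[(i, j)] += a_ip * b[p][j]
            # Each output element is written back exactly once.
            for (i, j), value in acc.items():
                c[i][j] = value
    return c


product = matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The point of the tiling is that each loaded element of A and B is reused across a whole tile of outputs, and a larger register file allows a larger tile.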


* fix autoawq quantized gemma model convert error

  Quantizing a gemma model with autoawq includes an lm_head.weight tensor in model-00001-of-00002.safetensors, which convert-hf-to-gguf.py cannot map. Skipping this tensor prevents the error.

* change code to full string match and print necessary message

  Use a full string match and print a short message to inform users that lm_head.weight has been skipped.

---------

Co-authored-by: Zheng.Deng <[email protected]>
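The skip described above might look roughly like the following sketch. The function name and message wording are hypothetical and are not copied from convert-hf-to-gguf.py; the point is only that the comparison is an exact match on the full tensor name rather than a substring test.

```python
def filter_tensor_names(names, skipped=("lm_head.weight",)):
    """Return the tensor names to convert, dropping exact matches in `skipped`."""
    kept = []
    for name in names:
        if name in skipped:  # tuple membership compares whole strings
            print(f"Skipping tensor {name!r}")
            continue
        kept.append(name)
    return kept


kept_names = filter_tensor_names([
    "model.embed_tokens.weight",
    "lm_head.weight",
    "model.layers.0.lm_head.weight_scale",  # hypothetical name, kept: not an exact match
])
```

An exact match avoids silently dropping unrelated tensors whose names merely contain the same substring.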


ggerganov added 30 commits January 18, 2024 18:55

ggml : add ggml_flash_attn_ext API

a1c004e

ggml : fix GQA support in ggml_flash_attn_ext

fa7ebcc

Merge branch 'master' into gg/flash-attn

c3cdfff

ggml : online attention (CPU)

a9681fe

metal : initial implementation

1173f49

metal : f16 precision

528da75

metal : reduce branches

52ae085

metal : specialize for head size

b973258

wip : 8 rows per simd group

8cde449

wip : 4 rows per simd group

f31955f

wip : template for rows per warp

a4b6341

metal : parallelize across KV size

77d08f3

metal : parallel reduce across heads

17720fa

metal : efficient flash_attn_f16 implementation

1446a12

metal : avoid redundant loads of the attention

d917746

metal : scale and mask in matrix form

432ad04

metal : fix comment

40ea8cd

llama : avoid ggml_cast, use F32 query

f9ca5dc

metal : add parallel reduce version (disabled)

6fea843

Merge branch 'master' into gg/flash-attn

b3dd7d9

metal : move output into local memory + optimize

77f6976

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
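The idea of skipping -INF blocks can be sketched for a single query row as below. This is an illustrative pure-Python version of block-wise attention with a running (online) softmax, assuming an additive mask in which fully masked positions hold negative infinity; it is not the Metal kernel, and the function name and block size are made up for the example.

```python
import math


def attention_row_blockwise(q, keys, values, mask, block=2, scale=1.0):
    """Compute softmax(scale * q.K^T + mask) @ V for one query row, block by block."""
    d_v = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * d_v
    for start in range(0, len(keys), block):
        idx = range(start, min(start + block, len(keys)))
        # A block whose mask is entirely -inf contributes nothing: skip it.
        if all(mask[j] == -math.inf for j in idx):
            continue
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q, keys[j])) + mask[j]
            for j in idx
        ]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first processed block this is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for j, s in zip(idx, scores):
            p = math.exp(s - new_max)
            running_sum += p
            for t in range(d_v):
                acc[t] += p * values[j][t]
        running_max = new_max
    if running_sum == 0.0:
        # Every position was masked; return zeros instead of dividing by zero.
        return acc
    return [a / running_sum for a in acc]


NEG_INF = -math.inf
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
row = attention_row_blockwise([1.0, 0.0], keys, values, [0.0, 0.0, NEG_INF, NEG_INF])
```

With this mask the second block is skipped entirely, and the result matches an ordinary softmax over the two unmasked positions.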

metal : add tests, fix scaling, support C > 32

ecc466a

metal : improve precision

3a428a1

ggml : fix f16 mad

8612864

Merge branch 'master' into gg/flash-attn

0ad44ba

metal : minor

134c81c

metal : support Q > 8

1db22d7

tests : add ATTN tests

4794821

metal : disable buffer allocation logs

abeaf0d

tests : more

c6c1132

danbev and others added 29 commits April 16, 2024 09:13

perplexity : require positive --ctx-size arg (#6695)

58227ff

ggml : fix llamafile sgemm wdata offsets (#6710)

666867b

ggml-ci

llama : make general.name optional (#6709)

532c173

Merge branch 'master' into gg/flash-attn

2c41180

llama : flash_attn cparam + fix defrag

599ce84

server: support flash_attn param

4053857

readme : add UI (#6724)

8dd1ec8

* Update README.md
* Update README.md

---------

Co-authored-by: Georgi Gerganov <[email protected]>

llamafile : tmp disable + build sgemm.o when needed (#6716)

3b8f1ec

* build : sgemm.o only when needed

  ggml-ci

* llamafile : tmp disable due to MoE bug

  ggml-ci

server: bench: enable flash_attn param

5668c79

llama : fix compatibility with old 2 expert models (#6735)

c71bfd7

CUDA: refactor host code, dyn. par. blocks

34f93bb

fix flash_attn_vec_f16 race condition

6a3b842

flush softmax exp below threshold to 0

ef9e159
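A minimal sketch of what flushing small softmax terms to zero could look like, assuming the exponent is taken relative to the row maximum. The threshold value and function names here are hypothetical and chosen only for illustration; the commit itself is a CUDA change whose code is not shown on this page.

```python
import math

# Hypothetical cutoff for illustration only.
FLUSH_THRESHOLD = -20.0


def exp_flush_to_zero(x, threshold=FLUSH_THRESHOLD):
    """Return exp(x), or exactly 0.0 when x is at or below the threshold."""
    return 0.0 if x <= threshold else math.exp(x)


def softmax_flushed(scores, threshold=FLUSH_THRESHOLD):
    """Softmax over finite scores, with tiny terms flushed to zero."""
    m = max(scores)
    # The maximum element always yields exp(0) == 1.0, so the sum is >= 1.
    weights = [exp_flush_to_zero(s - m, threshold) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


probs = softmax_flushed([0.0, -100.0])
```

Terms that far below the maximum would contribute a negligible amount anyway, so replacing them with an exact zero changes the result very little.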

store temp KQ in registers

a5b0e2d

Calculate KQ as FP32 if KQV has GGML_PREC_F32

0bc67dd

Add __hgt2_mask implementation for CUDA 11

2f538b9

fix KQ FP32 precision for parallel_blocks > 1

87968de

llama-bench : add -fa,--flash-attn arg

260cdb2

metal : add BS=1 kernel for flash attention (#6508)

105332c

* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel

Qwen2 : assume tied weights if lm_head/output weights is missing (#6738)

e11b2e6

Merge branch 'master' into gg/flash-attn

fa9e8c6

metal : use F32 attention accumulators

c16a7c2

batched-bench : add fattn arg

9ca8698

l3utterfly merged commit b03b419 into l3utterfly:test-flash-attn Apr 19, 2024
12 of 33 checks passed
