
CUDA: add BF16 support #11093

Merged: 4 commits merged into ggerganov:master on Jan 6, 2025

Conversation

JohannesGaessler
Collaborator

This PR adds BF16 support for CUDA/HIP. For large batch sizes the BF16 data is converted to FP32 and an FP32 cuBLAS GEMM is used; unfortunately cuBLAS does not seem to support BF16 tensor cores. For batch size 1 I added a template parameter to mul_mat_vec that specifies the input type as either FP16 or BF16. The calculations are done using FP32 arithmetic, since BF16 hardware support for operations other than matrix multiplications is only available with compute capability 9.0 and the highest that I own is 8.9. I will purchase a Blackwell GPU in a few weeks when they come out and revisit this.
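
As a rough illustration of the batch-size-1 approach (this is only a sketch with made-up names, not the actual mul_mat_vec code from this PR), the kernel can be templated on the storage type and widen every value to FP32 before doing any arithmetic:

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical helpers: widen the storage type to FP32 before doing any math.
static __device__ float to_float(const half x)          { return __half2float(x); }
static __device__ float to_float(const __nv_bfloat16 x) { return __bfloat162float(x); }

// Sketch of a matrix-vector kernel templated on the input type T (half or __nv_bfloat16).
// It is assumed to be launched with one warp (32 threads) per output row; all arithmetic
// is done in FP32, so no BF16 math instructions are required from the hardware.
template <typename T>
__global__ void mul_mat_vec_sketch(const T * __restrict__ x, const float * __restrict__ y,
                                   float * __restrict__ dst, const int ncols) {
    const int row = blockIdx.x;
    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += 32) {
        sum += to_float(x[(size_t) row*ncols + col]) * y[col];
    }
    for (int offset = 16; offset > 0; offset >>= 1) { // warp-level reduction of partial sums
        sum += __shfl_down_sync(0xFFFFFFFF, sum, offset, 32);
    }
    if (threadIdx.x == 0) {
        dst[row] = sum;
    }
}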

Performance:

| Model | CPU | GPU | FlashAttention | Test | t/s master | t/s cuda-bf16-3 | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma 2B BF16 | EPYC 7742 | RTX 4090 | No | pp512 | 565.29 | 12552.04 | 22.20 |
| gemma 2B BF16 | EPYC 7742 | RTX 4090 | No | tg128 | 22.47 | 158.23 | 7.04 |
| gemma 2B BF16 | EPYC 7742 | RTX 4090 | Yes | pp512 | 564.51 | 12662.16 | 22.43 |
| gemma 2B BF16 | EPYC 7742 | RTX 4090 | Yes | tg128 | 22.43 | 157.44 | 7.02 |
| gemma 2B BF16 | Ryzen 9 5950X | RTX 3090 | No | pp512 | 256.16 | 5901.91 | 23.04 |
| gemma 2B BF16 | Ryzen 9 5950X | RTX 3090 | No | tg128 | 8.01 | 142.12 | 17.73 |
| gemma 2B BF16 | Ryzen 9 5950X | RTX 3090 | Yes | pp512 | 267.93 | 5901.28 | 22.03 |
| gemma 2B BF16 | Ryzen 9 5950X | RTX 3090 | Yes | tg128 | 8.09 | 140.63 | 17.39 |
| gemma 2B BF16 | Xeon E5-2683 v4 | RX 6800 | No | pp512 | 130.89 | 1428.06 | 10.91 |
| gemma 2B BF16 | Xeon E5-2683 v4 | RX 6800 | No | tg128 | 10.19 | 53.77 | 5.28 |
| gemma 2B BF16 | Xeon E5-2683 v4 | RX 6800 | Yes | pp512 | 128.57 | 1195.52 | 9.30 |
| gemma 2B BF16 | Xeon E5-2683 v4 | RX 6800 | Yes | tg128 | 10.09 | 51.66 | 5.12 |
| gemma 2B BF16 | Xeon E5-2683 v4 | P40 | No | pp512 | 128.99 | 1625.54 | 12.60 |
| gemma 2B BF16 | Xeon E5-2683 v4 | P40 | No | tg128 | 10.80 | 55.97 | 5.18 |
| gemma 2B BF16 | Xeon E5-2683 v4 | P40 | Yes | pp512 | 130.86 | 1476.22 | 11.28 |
| gemma 2B BF16 | Xeon E5-2683 v4 | P40 | Yes | tg128 | 10.77 | 54.01 | 5.02 |

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jan 5, 2025
@JohannesGaessler JohannesGaessler merged commit 46e3556 into ggerganov:master Jan 6, 2025
48 checks passed
@sorasoras

@JohannesGaessler
I am a bit confused. Doesn't the 4090 support BF16 tensor cores?
"BF16 hardware support for operations other than matrix multiplications is only available with compute capability 9.0"
Which exact operations are supported by Blackwell but not by Ada?

@JohannesGaessler
Collaborator Author

Addition, multiplication, division, etc.

@teihome
Contributor

teihome commented Jan 7, 2025

With this commit it stops working for me.

I get many warnings when compiling:

 cmake --build build --config Release -j10
[  0%] Generating build details from Git
[  0%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
[  1%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
-- Found Git: /usr/bin/git (found version "2.47.0")
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  3%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  3%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[  4%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[  4%] Built target sha1
[  4%] Built target xxhash
[  4%] Built target sha256
[  4%] Linking CXX shared library libggml-base.so
[  4%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[  4%] Built target ggml-base
[  4%] Built target build_info
[  4%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o
[  4%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o
[  4%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-traits.cpp.o
[  5%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-quants.c.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o
[  7%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o
[  7%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/arange.cu.o
[  8%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o
[  9%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/binbcast.cu.o
[  9%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/clamp.cu.o
[ 10%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/conv-transpose-1d.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 10%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/convert.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 10%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/concat.cu.o
[ 11%] Linking CXX shared library libggml-cpu.so
[ 11%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argsort.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 11%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/count-equal.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 12%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 12%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cross-entropy-loss.cu.o
[ 12%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/diagmask.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 12%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f32.cu.o
[ 13%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 13%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/getrows.cu.o
[ 13%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/ggml-cuda.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 14%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f16.cu.o
[ 15%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/im2col.cu.o
[ 15%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmq.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 15%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmv.cu.o
[ 16%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 17%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/norm.cu.o
[ 17%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/opt-step-adamw.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 17%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/out-prod.cu.o
[ 17%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pad.cu.o
[ 18%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pool2d.cu.o
[ 18%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/rope.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 19%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/scale.cu.o
[ 19%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/quantize.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 20%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/tsembd.cu.o
[ 20%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/sum.cu.o
[ 20%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/softmax.cu.o
[ 21%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/unary.cu.o
[ 21%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/sumrows.cu.o
[ 21%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/upscale.cu.o
[ 21%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/wkv6.cu.o
[ 22%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu.o
[ 22%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 23%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu.o
[ 23%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 24%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu.o
[ 24%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq1_s.cu.o
[ 24%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_xs.cu.o
[ 25%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_xxs.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 25%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq3_s.cu.o
[ 25%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq3_xxs.cu.o
[ 25%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_s.cu.o
[ 26%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq4_nl.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 26%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq4_xs.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 26%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q4_0.cu.o
[ 26%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q3_k.cu.o
[ 27%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q4_1.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 27%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q4_k.cu.o
[ 28%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q2_k.cu.o
[ 28%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q5_0.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 29%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q5_k.cu.o
[ 29%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
[ 30%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q5_1.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 30%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu.o
[ 30%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q8_0.cu.o
[ 31%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 31%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu.o
[ 31%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 32%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu.o
[ 32%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 33%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 33%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu.o
[ 33%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 34%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu.o
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
[ 34%] Linking CUDA shared library libggml-cuda.so
[ 34%] Built target ggml-cpu
[ 34%] Built target ggml-cuda
[ 34%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o
[ 34%] Linking CXX shared library libggml.so
[ 34%] Built target ggml
[ 35%] Building CXX object examples/gguf/CMakeFiles/llama-gguf.dir/gguf.cpp.o
[ 35%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 36%] Building CXX object src/CMakeFiles/llama.dir/llama-adapter.cpp.o
[ 37%] Building CXX object src/CMakeFiles/llama.dir/llama-batch.cpp.o
[ 37%] Building CXX object src/CMakeFiles/llama.dir/llama-chat.cpp.o
[ 37%] Building CXX object src/CMakeFiles/llama.dir/llama-context.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/llama-hparams.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/llama-arch.cpp.o
[ 38%] Building CXX object examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/gguf-hash.cpp.o
[ 38%] Linking CXX executable ../../bin/llama-gguf
[ 38%] Building CXX object src/CMakeFiles/llama.dir/llama-impl.cpp.o
[ 39%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache.cpp.o
[ 39%] Building CXX object src/CMakeFiles/llama.dir/llama-mmap.cpp.o
[ 40%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
[ 41%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
[ 41%] Building CXX object src/CMakeFiles/llama.dir/llama-quant.cpp.o
[ 41%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 41%] Building CXX object src/CMakeFiles/llama.dir/llama-model.cpp.o
[ 42%] Linking CXX executable ../../bin/llama-gguf-hash
[ 42%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 43%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 43%] Linking CXX shared library libllama.so
[ 43%] Built target llama-gguf
[ 43%] Built target llama-gguf-hash
[ 43%] Built target llama
[ 43%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
[ 43%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 44%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 44%] Building CXX object examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o
[ 44%] Building CXX object examples/simple-chat/CMakeFiles/llama-simple-chat.dir/simple-chat.cpp.o
[ 45%] Building CXX object common/CMakeFiles/common.dir/arg.cpp.o
[ 45%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 46%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 46%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 46%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 46%] Building CXX object common/CMakeFiles/common.dir/log.cpp.o
[ 46%] Linking CXX executable ../../bin/llama-quantize-stats
[ 46%] Built target llava
[ 46%] Linking CXX executable ../../bin/llama-simple-chat
[ 47%] Linking C executable ../bin/test-c
[ 48%] Linking CXX executable ../../bin/llama-simple
[ 49%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 49%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 49%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o
[ 50%] Linking CXX shared library libllava_shared.so
[ 50%] Linking CXX static library libllava_static.a
[ 51%] Linking CXX static library libcommon.a
[ 51%] Built target test-c
[ 51%] Built target llava_static
[ 51%] Built target llama-quantize-stats
[ 51%] Built target llama-simple
[ 51%] Built target common
[ 51%] Built target llava_shared
[ 51%] Built target llama-simple-chat
[ 51%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-tokenizer-0.dir/test-tokenizer-0.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-tokenizer-1-bpe.dir/test-tokenizer-1-bpe.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/test-json-schema-to-grammar.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/test-grammar-integration.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
[ 52%] Building CXX object tests/CMakeFiles/test-tokenizer-1-spm.dir/test-tokenizer-1-spm.cpp.o
[ 54%] Building CXX object tests/CMakeFiles/test-sampling.dir/get-model.cpp.o
[ 54%] Building CXX object tests/CMakeFiles/test-arg-parser.dir/test-arg-parser.cpp.o
[ 56%] Linking CXX executable ../bin/test-tokenizer-0
[ 56%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/get-model.cpp.o
[ 57%] Building CXX object tests/CMakeFiles/test-log.dir/test-log.cpp.o
[ 57%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/get-model.cpp.o
[ 58%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/get-model.cpp.o
[ 58%] Linking CXX executable ../bin/test-tokenizer-1-bpe
[ 59%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/get-model.cpp.o
[ 59%] Linking CXX executable ../bin/test-sampling
[ 59%] Building CXX object tests/CMakeFiles/test-arg-parser.dir/get-model.cpp.o
[ 59%] Linking CXX executable ../bin/test-grammar-parser
[ 59%] Building CXX object tests/CMakeFiles/test-log.dir/get-model.cpp.o
[ 59%] Linking CXX executable ../bin/test-tokenizer-1-spm
[ 60%] Linking CXX executable ../bin/test-grammar-integration
[ 60%] Linking CXX executable ../bin/test-json-schema-to-grammar
[ 60%] Linking CXX executable ../bin/test-log
[ 60%] Linking CXX executable ../bin/test-arg-parser
[ 60%] Linking CXX executable ../bin/test-llama-grammar
[ 60%] Built target test-tokenizer-0
[ 60%] Built target test-tokenizer-1-bpe
[ 60%] Built target test-sampling
[ 60%] Built target test-log
[ 60%] Built target test-grammar-integration
[ 60%] Building CXX object tests/CMakeFiles/test-chat-template.dir/test-chat-template.cpp.o
[ 60%] Built target test-json-schema-to-grammar
[ 60%] Built target test-grammar-parser
[ 60%] Built target test-llama-grammar
[ 60%] Building CXX object tests/CMakeFiles/test-gguf.dir/test-gguf.cpp.o
[ 60%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/test-backend-ops.cpp.o
[ 60%] Building CXX object tests/CMakeFiles/test-gguf.dir/get-model.cpp.o
[ 60%] Building CXX object tests/CMakeFiles/test-chat-template.dir/get-model.cpp.o
[ 60%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/get-model.cpp.o
[ 61%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/test-model-load-cancel.cpp.o
[ 61%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/get-model.cpp.o
[ 62%] Building CXX object tests/CMakeFiles/test-autorelease.dir/test-autorelease.cpp.o
[ 63%] Linking CXX executable ../bin/test-chat-template
[ 64%] Linking CXX executable ../bin/test-gguf
[ 65%] Linking CXX executable ../bin/test-backend-ops
[ 65%] Building CXX object tests/CMakeFiles/test-barrier.dir/test-barrier.cpp.o
[ 65%] Linking CXX executable ../bin/test-model-load-cancel
[ 66%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 66%] Building CXX object tests/CMakeFiles/test-autorelease.dir/get-model.cpp.o
[ 66%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 66%] Building CXX object tests/CMakeFiles/test-barrier.dir/get-model.cpp.o
[ 66%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/get-model.cpp.o
[ 67%] Linking CXX executable ../bin/test-autorelease
[ 67%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/get-model.cpp.o
[ 68%] Linking CXX executable ../bin/test-quantize-fns
[ 69%] Linking CXX executable ../bin/test-barrier
[ 69%] Built target test-arg-parser
[ 70%] Linking CXX executable ../bin/test-quantize-perf
[ 70%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 70%] Built target test-tokenizer-1-spm
[ 70%] Building CXX object tests/CMakeFiles/test-rope.dir/get-model.cpp.o
[ 71%] Linking CXX executable ../bin/test-rope
[ 71%] Building CXX object examples/batched-bench/CMakeFiles/llama-batched-bench.dir/batched-bench.cpp.o
[ 71%] Built target test-model-load-cancel
[ 71%] Built target test-gguf
[ 71%] Built target test-backend-ops
[ 72%] Linking CXX executable ../../bin/llama-batched-bench
[ 72%] Built target test-autorelease
[ 72%] Building CXX object examples/batched/CMakeFiles/llama-batched.dir/batched.cpp.o
[ 72%] Building CXX object examples/embedding/CMakeFiles/llama-embedding.dir/embedding.cpp.o
[ 73%] Building CXX object examples/eval-callback/CMakeFiles/llama-eval-callback.dir/eval-callback.cpp.o
[ 73%] Built target test-quantize-fns
[ 74%] Linking CXX executable ../../bin/llama-batched
[ 74%] Built target test-quantize-perf
[ 74%] Built target test-barrier
[ 74%] Linking CXX executable ../../bin/llama-embedding
[ 74%] Linking CXX executable ../../bin/llama-eval-callback
[ 74%] Building CXX object examples/gbnf-validator/CMakeFiles/llama-gbnf-validator.dir/gbnf-validator.cpp.o
[ 74%] Building CXX object examples/gguf-split/CMakeFiles/llama-gguf-split.dir/gguf-split.cpp.o
[ 75%] Linking CXX executable ../../bin/llama-gbnf-validator
[ 75%] Built target test-chat-template
[ 77%] Building CXX object examples/gritlm/CMakeFiles/llama-gritlm.dir/gritlm.cpp.o
[ 77%] Linking CXX executable ../../bin/llama-gguf-split
[ 77%] Building CXX object examples/imatrix/CMakeFiles/llama-imatrix.dir/imatrix.cpp.o
[ 77%] Linking CXX executable ../../bin/llama-imatrix
[ 77%] Linking CXX executable ../../bin/llama-gritlm
[ 77%] Building CXX object examples/infill/CMakeFiles/llama-infill.dir/infill.cpp.o
[ 77%] Built target test-rope
[ 78%] Linking CXX executable ../../bin/llama-infill
[ 78%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 78%] Built target llama-batched-bench
[ 78%] Linking CXX executable ../../bin/llama-bench
[ 78%] Built target llama-gbnf-validator
[ 78%] Built target llama-batched
[ 79%] Building CXX object examples/lookahead/CMakeFiles/llama-lookahead.dir/lookahead.cpp.o
[ 79%] Built target llama-embedding
[ 79%] Built target llama-eval-callback
[ 79%] Built target llama-gguf-split
[ 79%] Linking CXX executable ../../bin/llama-lookahead
[ 80%] Building CXX object examples/lookup/CMakeFiles/llama-lookup.dir/lookup.cpp.o
[ 80%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-merge.dir/lookup-merge.cpp.o
[ 81%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-create.dir/lookup-create.cpp.o
[ 81%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-stats.dir/lookup-stats.cpp.o
[ 82%] Building CXX object examples/main/CMakeFiles/llama-cli.dir/main.cpp.o
[ 83%] Linking CXX executable ../../bin/llama-lookup-merge
[ 83%] Linking CXX executable ../../bin/llama-lookup-stats
[ 83%] Linking CXX executable ../../bin/llama-lookup-create
[ 83%] Linking CXX executable ../../bin/llama-lookup
[ 83%] Linking CXX executable ../../bin/llama-cli
[ 83%] Built target llama-gritlm
[ 83%] Built target llama-infill
[ 83%] Built target llama-imatrix
[ 83%] Building CXX object examples/passkey/CMakeFiles/llama-passkey.dir/passkey.cpp.o
[ 83%] Building CXX object examples/parallel/CMakeFiles/llama-parallel.dir/parallel.cpp.o
[ 83%] Built target llama-bench
[ 84%] Building CXX object examples/perplexity/CMakeFiles/llama-perplexity.dir/perplexity.cpp.o
[ 84%] Linking CXX executable ../../bin/llama-passkey
[ 85%] Linking CXX executable ../../bin/llama-parallel
[ 85%] Linking CXX executable ../../bin/llama-perplexity
[ 85%] Built target llama-lookahead
[ 85%] Building CXX object examples/quantize/CMakeFiles/llama-quantize.dir/quantize.cpp.o
[ 85%] Built target llama-lookup-merge
[ 86%] Linking CXX executable ../../bin/llama-quantize
[ 87%] Building CXX object examples/retrieval/CMakeFiles/llama-retrieval.dir/retrieval.cpp.o
[ 87%] Generating loading.html.hpp
[ 87%] Built target llama-lookup
[ 87%] Built target llama-lookup-stats
[ 87%] Built target llama-lookup-create
[ 87%] Linking CXX executable ../../bin/llama-retrieval
[ 88%] Generating index.html.gz.hpp
[ 88%] Built target llama-cli
[ 88%] Building CXX object examples/save-load-state/CMakeFiles/llama-save-load-state.dir/save-load-state.cpp.o
[ 88%] Building CXX object examples/run/CMakeFiles/llama-run.dir/run.cpp.o
[ 89%] Building CXX object examples/speculative/CMakeFiles/llama-speculative.dir/speculative.cpp.o
[ 89%] Linking CXX executable ../../bin/llama-save-load-state
[ 90%] Linking CXX executable ../../bin/llama-run
[ 90%] Linking CXX executable ../../bin/llama-speculative
[ 90%] Building CXX object examples/speculative-simple/CMakeFiles/llama-speculative-simple.dir/speculative-simple.cpp.o
[ 91%] Linking CXX executable ../../bin/llama-speculative-simple
[ 91%] Built target llama-passkey
[ 91%] Built target llama-parallel
[ 91%] Building CXX object examples/tokenize/CMakeFiles/llama-tokenize.dir/tokenize.cpp.o
[ 91%] Building CXX object examples/tts/CMakeFiles/llama-tts.dir/tts.cpp.o
[ 91%] Built target llama-quantize
[ 92%] Linking CXX executable ../../bin/llama-tokenize
[ 92%] Linking CXX executable ../../bin/llama-tts
[ 92%] Building CXX object examples/gen-docs/CMakeFiles/llama-gen-docs.dir/gen-docs.cpp.o
[ 92%] Built target llama-retrieval
[ 92%] Linking CXX executable ../../bin/llama-gen-docs
[ 92%] Built target llama-run
[ 92%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/llama-convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
[ 92%] Building CXX object examples/cvector-generator/CMakeFiles/llama-cvector-generator.dir/cvector-generator.cpp.o
[ 92%] Built target llama-save-load-state
[ 92%] Built target llama-speculative
[ 93%] Linking CXX executable ../../bin/llama-convert-llama2c-to-ggml
[ 94%] Linking CXX executable ../../bin/llama-cvector-generator
[ 94%] Built target llama-perplexity
[ 94%] Building CXX object examples/llava/CMakeFiles/llama-llava-cli.dir/llava-cli.cpp.o
[ 94%] Built target llama-speculative-simple
[ 94%] Building CXX object examples/export-lora/CMakeFiles/llama-export-lora.dir/export-lora.cpp.o
[ 94%] Built target llama-tokenize
[ 94%] Linking CXX executable ../../bin/llama-llava-cli
[ 95%] Building CXX object examples/llava/CMakeFiles/llama-minicpmv-cli.dir/minicpmv-cli.cpp.o
[ 96%] Building CXX object examples/llava/CMakeFiles/llama-qwen2vl-cli.dir/qwen2vl-cli.cpp.o
[ 97%] Linking CXX executable ../../bin/llama-export-lora
[ 98%] Building CXX object pocs/vdot/CMakeFiles/llama-vdot.dir/vdot.cpp.o
[ 98%] Linking CXX executable ../../bin/llama-qwen2vl-cli
[ 98%] Linking CXX executable ../../bin/llama-minicpmv-cli
[ 98%] Built target llama-tts
[ 98%] Linking CXX executable ../../bin/llama-vdot
[ 99%] Building CXX object pocs/vdot/CMakeFiles/llama-q8dot.dir/q8dot.cpp.o
[ 99%] Linking CXX executable ../../bin/llama-q8dot
[ 99%] Built target llama-convert-llama2c-to-ggml
[ 99%] Built target llama-cvector-generator
[ 99%] Built target llama-llava-cli
[ 99%] Built target llama-export-lora
[ 99%] Built target llama-qwen2vl-cli
[ 99%] Built target llama-vdot
[ 99%] Built target llama-gen-docs
[ 99%] Built target llama-q8dot
[ 99%] Built target llama-minicpmv-cli
[100%] Building CXX object examples/server/CMakeFiles/llama-server.dir/server.cpp.o
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server

and running also fails:

[home@toolbx llama.cpp]$ ./build/bin/llama-server --metrics --slots --port 8084 --gpu-layers 57 --ctx-size 16000  --model /home/home/Models/Cydonia-22B-v1.2-Q5_K_L.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 4419 (46e3556e) with cc (GCC) 13.3.1 20240913 (Red Hat 13.3.1-3) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 20

system_info: n_threads = 8 (n_threads_batch = 8) / 20 | CUDA : ARCHS = 520 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8084, http threads: 19
main: loading model
srv    load_model: loading model '/home/home/Models/Cydonia-22B-v1.2-Q5_K_L.gguf'
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 22085 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 507 tensors from /home/home/Models/Cydonia-22B-v1.2-Q5_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Cydonia 22B V2F
llama_model_loader: - kv   3:                            general.version str              = v1.2
llama_model_loader: - kv   4:                       general.organization str              = BeaverAI
llama_model_loader: - kv   5:                           general.finetune str              = v2f
llama_model_loader: - kv   6:                           general.basename str              = Cydonia
llama_model_loader: - kv   7:                         general.size_label str              = 22B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                          llama.block_count u32              = 56
llama_model_loader: - kv  10:                       llama.context_length u32              = 32768
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 6144
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 16384
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 48
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 17
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,32768]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,32768]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {%- if messages[0]["role"] == "system...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = /models_out/Cydonia-22B-v1.2-GGUF/Cyd...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 392
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 148
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q5_K:  336 tensors
llama_model_loader: - type q6_K:   56 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 56
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 22.25 B
llm_load_print_meta: model size       = 14.76 GiB (5.70 BPW) 
llm_load_print_meta: general.name     = Cydonia 22B V2F
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 56 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 57/57 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   204.00 MiB
llm_load_tensors:        CUDA0 model buffer size = 14907.96 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16000
llama_new_context_with_model: n_ctx_per_seq = 16000
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16000) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 16000, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 56, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3500.00 MiB
llama_new_context_with_model: KV self size  = 3500.00 MiB, K (f16): 1750.00 MiB, V (f16): 1750.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1579.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    43.26 MiB
llama_new_context_with_model: graph nodes  = 1798
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16000
main: model loaded
main: chat template, chat_template: (built-in), example_format: '[INST] You are a helpful assistant

Hello[/INST] Hi there</s>[INST] How are you?[/INST]'
main: server is listening on http://127.0.0.1:8084 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16000, n_keep = 0, n_prompt_tokens = 16460
slot update_slots: id  0 | task 0 | input truncated, n_ctx = 16000, n_keep = 0, n_left = 16000, n_prompt_tokens = 8460
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.242080
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
/var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA errorCUDA error: unspecified launch failure
  current device: 0, in function launch_mul_mat_q at /var/home/home/dev/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2771
  cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)

[New LWP 49627]
[New LWP 49626]
[New LWP 49625]
[New LWP 49624]
[New LWP 49623]
[New LWP 49622]
[New LWP 49621]
[New LWP 49620]
[New LWP 49619]
[New LWP 49618]
[New LWP 49617]
[New LWP 49616]
[New LWP 49615]
[New LWP 49614]
[New LWP 49613]
[New LWP 49612]
[New LWP 49611]
[New LWP 49610]
[New LWP 49609]
[New LWP 49608]
[New LWP 49607]
[New LWP 49606]
[New LWP 49605]
[New LWP 49595]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
0x00007f2a6cac9023 in wait4 () from /usr/lib64/libc.so.6
#0  0x00007f2a6cac9023 in wait4 () from /usr/lib64/libc.so.6
#1  0x00007f2a6ff57ff0 in ggml_abort () from /var/home/home/dev/llama.cpp/build/ggml/src/libggml-base.so
#2  0x00007f2a7005d8b3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /var/home/home/dev/llama.cpp/build/ggml/src/ggml-cuda/libggml-cuda.so
#3  0x00007f2a701f84e6 in void mul_mat_q_case<(ggml_type)14>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) () from /var/home/home/dev/llama.cpp/build/ggml/src/ggml-cuda/libggml-cuda.so
#4  0x00007f2a70071b35 in ggml_cuda_op_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) () from /var/home/home/dev/llama.cpp/build/ggml/src/ggml-cuda/libggml-cuda.so
#5  0x00007f2a700634f1 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) () from /var/home/home/dev/llama.cpp/build/ggml/src/ggml-cuda/libggml-cuda.so
#6  0x00007f2a7006d9fd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /var/home/home/dev/llama.cpp/build/ggml/src/ggml-cuda/libggml-cuda.so
#7  0x00007f2a6ff706c3 in ggml_backend_sched_graph_compute_async () from /var/home/home/dev/llama.cpp/build/ggml/src/libggml-base.so
#8  0x00007f2a731bfde0 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) () from /var/home/home/dev/llama.cpp/build/src/libllama.so
#9  0x00007f2a731c7608 in llama_decode_internal(llama_context&, llama_batch) () from /var/home/home/dev/llama.cpp/build/src/libllama.so
#10 0x00007f2a731c86c7 in llama_decode () from /var/home/home/dev/llama.cpp/build/src/libllama.so
#11 0x000000000046ed81 in server_context::update_slots() ()
#12 0x0000000000468d66 in server_queue::start_loop() ()
#13 0x000000000042149b in main ()
[Inferior 1 (process 49594) detached]
Aborted (core dumped)

@JohannesGaessler
Collaborator Author

Just to make sure: you are saying that you nailed down the exact commit that caused the issue, and that b56f079 was still working correctly but 46e3556 is not, yes? Because the problem you're describing does not at all look like something that would be caused by this PR.
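
For reference, the usual way to nail down the offending commit is a git bisect between the two commits mentioned above (a generic sketch of the workflow, not commands taken from this thread):

git bisect start
git bisect bad 46e3556e    # first suspected-bad commit (this PR's merge)
git bisect good b56f079e   # last known-good commit
# at each step: rebuild, test, then run `git bisect good` or `git bisect bad`
git bisect reset           # when finished, return to the original HEAD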

@teihome
Contributor

teihome commented Jan 7, 2025

It compiles and works on my system at commit b56f079:

⬢ [home@toolbx llama.cpp]$ git checkout b56f079e28fda692f11a8b59200ceb815b05d419
Previous HEAD position was 46e3556e CUDA: add BF16 support (#11093)
HEAD is now at b56f079e Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (#11074)
⬢ [home@toolbx llama.cpp]$ git clean -ffxd

@JohannesGaessler
Collaborator Author

Okay, I don't at all understand why this is happening. The problem is that, for whatever reason, your 3090 is not detected during compilation, so the code is instead compiled for compute capability 5.2 and you later get an error when you try to run it. The only thing this PR changes that could maybe have any effect is the inclusion of cuda_bf16.h, but I really don't see why it would. As a workaround you should be able to fix the issue by explicitly setting the right CUDA architecture (8.6 for an RTX 3090), or by compiling for all CUDA architectures with GGML_NATIVE=OFF.
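
Concretely, the workaround would look something like the following configure commands (a sketch, assuming a standard CMake build of llama.cpp with the CUDA backend enabled):

# build only for the RTX 3090 (compute capability 8.6)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j10

# or build for all supported CUDA architectures instead of the autodetected one
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release -j10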

@teihome
Contributor

teihome commented Jan 7, 2025

Could we separate the refactoring from the inclusion of #include <cuda_bf16.h>? I would be happy to test such a patch.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jan 8, 2025