Quantized much slower than llama.cpp with same model and settings... #1939

Open
oddpxl opened this issue Mar 26, 2024 · 23 comments

@oddpxl

oddpxl commented Mar 26, 2024

Quantized example compiled using `cargo build --example quantized -r --features metal`

I'm unsure how many layers are accelerated or how many threads are used, and the sampling stages are clearly different...

...yet I presume the speed should be on par?

CANDLE
./quantized --model mistral-7b-instruct-v0.2.Q4_K_S.gguf --which 7b-mistral-instruct-v0.2 --prompt "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" --sample-len 2048 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64

--> 31.83 t/s

avx: false, neon: true, simd128: false, f16c: false / temp: 0.10 repeat-penalty: 1.10 repeat-last-n: 64 / loaded 291 tensors (4.14GB) in 0.09s


LLAMA.CPP
./main -p "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" -m mistral-7b-instruct-v0.2.Q4_K_S.gguf -n 128 -ngl 33 --threads 8 --seed 1337

--> 51.30 t/s

sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 1

oddpxl changed the title from "Quantized example slower than llama.cpp with same model and settings..." to "Quantized much slower than llama.cpp with same model and settings..." on Mar 26, 2024
@tomsanbear
Collaborator

Just adding some more data to this as I investigate the root cause of the differences; I should have some results later on showing where we are slowing down in comparison to llama.cpp.

CANDLE (metal)

`target/release/examples/quantized --model mistral-7b-instruct-v0.2.Q4_K_S.gguf --which 7b-mistral-instruct-v0.2 --prompt 'Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?' --sample-len 2048 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.10 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.08s
model built
Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Let's compare the cost of each type of berry:
1. Blueberries cost more than strawberries.
2. Blueberries cost less than raspberries.

To determine if the third statement, "Raspberries cost more than strawberries and blueberries," is true or false, we need to compare the cost of raspberries with both strawberries and blueberries:

1. Strawberries are cheaper than blueberries.
2. Blueberries are cheaper than raspberries.
3. Raspberries are more expensive than both strawberries and blueberries.

Therefore, the third statement is true.

  43 prompt tokens processed: 38.91 token/s
 136 tokens generated: 20.42 token/s

CANDLE (metal,accelerate)

`target/release/examples/quantized --model mistral-7b-instruct-v0.2.Q4_K_S.gguf --which 7b-mistral-instruct-v0.2 --prompt 'Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?' --sample-len 2048 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.10 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.08s
model built
Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Let's compare the cost of each type of berry:
1. Blueberries cost more than strawberries.
2. Blueberries cost less than raspberries.

To determine if the third statement, "Raspberries cost more than strawberries and blueberries," is true or false, we need to compare the cost of raspberries with both strawberries and blueberries:

1. Strawberries are cheaper than blueberries.
2. Blueberries are cheaper than raspberries.
3. Raspberries are more expensive than both strawberries and blueberries.

Therefore, the third statement is true.

  43 prompt tokens processed: 38.62 token/s
 136 tokens generated: 20.40 token/s

LLAMA.CPP

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 128, n_keep = 1


 Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is? The third statement is true because raspberries generally cost more than both strawberries and blueberries. However, it's important to note that prices can vary greatly depending on the location, season, and availability of each fruit. [end of text]

llama_print_timings:        load time =    4075.79 ms
llama_print_timings:      sample time =       1.09 ms /    47 runs   (    0.02 ms per token, 43119.27 tokens per second)
llama_print_timings: prompt eval time =     289.48 ms /    43 tokens (    6.73 ms per token,   148.54 tokens per second)
llama_print_timings:        eval time =    1481.11 ms /    46 runs   (   32.20 ms per token,    31.06 tokens per second)
llama_print_timings:       total time =    1776.77 ms /    89 tokens
ggml_metal_free: deallocating

@EricLBuehler
Member

Our mistral.rs achieves llama.cpp speeds with CUDA on the gpu_sampling branch. We anticipate merging that branch within the next day or two, which will allow for fast serving on CUDA! Metal optimizations are coming too.

@LaurentMazare
Collaborator

LaurentMazare commented Apr 4, 2024

On the CUDA backend, performance is now roughly comparable with llama.cpp. Most of the change came from #1978, which was merged earlier this week. My timings on an RTX 2080 are below (before the change, candle was at ~34 token/s):

  • Candle: 69.4 token/s.
target/release-with-debug/examples/quantized \
    --model mistral-7b-v0.1.Q4_K_S.gguf \
    --prompt 'Building a website can be done in 10 simple steps:\nStep 1:' -n 100 --which 7b-mistral

...

  19 prompt tokens processed: 74.91 token/s
  99 tokens generated: 69.43 token/s

  • Llama.cpp: 73.2 token/s.
 ./main \
    -m  mistral-7b-v0.1.Q4_K_S.gguf  \
    -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 100 -ngl 33
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 100, n_keep = 1


llama_print_timings:        load time =    1456.63 ms
llama_print_timings:      sample time =       3.12 ms /   100 runs   (    0.03 ms per token, 32041.01 tokens per second)
llama_print_timings: prompt eval time =      92.55 ms /    19 tokens (    4.87 ms per token,   205.29 tokens per second)
llama_print_timings:        eval time =    1353.12 ms /    99 runs   (   13.67 ms per token,    73.16 tokens per second)
llama_print_timings:       total time =    1480.64 ms /   118 tokens

So my guess is that there is something specific to metal that we're not getting right at the moment.

@tomsanbear
Collaborator

Screenshot 2024-04-04 at 7 56 12 PM

I was finally able to get a capture of this going. My current theory is that we are overwhelming it with instructions: the screenshot above shows that, for the majority of our inference run, the instruction limiter is pegged at 100%. I haven't been able to get the equivalent from llama.cpp going, but my suspicion is that they've simply done a better job at fusing operations.

@tomsanbear
Collaborator

One idea I have is to reduce the number of calls to the encoder through the use of Metal's argument buffers; that seems like a nice starting point to see whether it leads to a reduction in the pressure on instructions. https://developer.apple.com/documentation/metal/buffers/improving_cpu_performance_by_using_argument_buffers?language=objc

@LaurentMazare
Collaborator

Thanks for putting these up.
I have next to no experience optimizing on Metal, so this may well not apply here, but just to mention that on the CUDA side it's actually expected behavior to have a phase where the equivalent of the command buffer is saturated. This means that you're sending instructions to the GPU faster than it actually processes them, which is a good thing. What you would care about is the percentage of usage of the GPU (my guess is that it's the "Compute" row in your UI, but I'm really not sure), and it doesn't look bad in this sample, which might mean that the kernels themselves are inefficient?
Anyway, it would certainly help a lot to see a comparison with llama.cpp; could you provide some details on how you ran this?

@tomsanbear
Collaborator

Let me look into a better way to export data from this; the debugger Xcode has is nice... but it's brutally slow even for a trace I captured with 10 output tokens. It also does not have a great/obvious way to query or aggregate information to see which operations are taking the most time. I'll do some more reading in the meantime on how to use the debugger more effectively and try to find the appropriate workflow to identify bottlenecks.

I went down a rabbit hole of trying to find half-decent documentation or sources on optimizations for MPS; Apple did a great job of throwing information at the wall without any kind of map to link it together. I'll put a PR up later today or this weekend with the changes I made to the quantized model to enable outputting the gputrace information, along with documentation, so that others can instrument any example, model, or tensor operations they want.

A nice resource I found that does a more code-level optimization review:
https://devstreaming-cdn.apple.com/videos/wwdc/2016/606oluchfgwakjbymy8/606/606_advanced_metal_shader_optimization.pdf?dl=1

@tomsanbear
Collaborator

One thing to note is that the only two "counters" (the graphs at the bottom) that were saturated were the ones shown in the photo: the instruction throughput limiter and the compute shader launch limiter. I'm struggling to find exactly what these two measure, what a 100% reading on them signifies, or whether it's a red herring.

@tomsanbear
Collaborator

Just another little screenshot: I ran with a smaller sample size and the debugger gave me a much friendlier analysis. I imagine it's expected that we spend the majority of our time in this matmul operation. I would be interested to see how this compares to the CUDA results, @LaurentMazare.

If you would like to replicate, you can do so on this branch: https://github.com/tomsanbear/candle/tree/AddMetalGpuTraceToQuantized with the following command: `METAL_CAPTURE_ENABLED=1 cargo run --example quantized --features="metal,accelerate" --release -- --model mistral-7b-instruct-v0.2.Q4_K_S.gguf --which 7b-mistral-instruct-v0.2 --prompt "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" --sample-len 5 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64 --metal-tracing`

Note: the sample length needs to be that short, or else opening the gputrace will put the debugger into "lite" mode or crash your computer... 😰

Screenshot 2024-04-05 at 11 28 23 AM

@tomsanbear
Collaborator

I should also note that I see drastically different breakdowns of how the compute is allocated between the M1 machine I have and the M3 machine I have.

@tomsanbear
Collaborator

tomsanbear commented Apr 11, 2024

Okay, I was finally able to get a full profile of the llama.cpp process up and running 🎉

LlamaCpp:
Screenshot 2024-04-10 at 9 43 04 PM

Candle:
Screenshot 2024-04-10 at 9 46 22 PM
Screenshot 2024-04-10 at 9 52 20 PM

From this it seems fairly obvious that our extra copy operations are the primary suspect here; after that we should be near llama.cpp performance.

@tomsanbear
Collaborator

Quick debugging locally shows that our calls to reshape take the non-contiguous branch a fair amount; I'm unsure how we could avoid or optimize that from here, if anyone has ideas! A minimal illustration of that branch is sketched below.
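
For anyone following along, here is a minimal sketch of what that branch corresponds to, using candle's public `Tensor` API on CPU with made-up shapes (the real runs use the Metal device, but the contiguity behaviour is the same):

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // toy example; the actual runs use Device::new_metal(0)?

    // A freshly created tensor is contiguous, so reshape is just a metadata
    // change and no data is copied.
    let a = Tensor::arange(0f32, 24., &dev)?.reshape((2, 3, 4))?;
    assert!(a.is_contiguous());

    // A transpose only changes strides, leaving the layout non-contiguous.
    // The reshape that follows can no longer be a view, so it has to
    // materialize the data: this is the strided-copy path showing up in
    // the gputrace.
    let b = a.transpose(1, 2)?;
    assert!(!b.is_contiguous());
    let _c = b.reshape((2, 12))?;

    Ok(())
}
```

So whenever a transpose or broadcast sneaks in right before a reshape during token generation, we pay for an extra copy kernel.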

@LaurentMazare
Collaborator

Is this part of the trace from processing the prompt or from the subsequent tokens? I'm a bit surprised to see a strided copy if it's for the subsequent tokens, as the goal is to avoid these.
It would also be great if you could get a breakdown of the time spent in each kernel on both sides (still separating the prompt processing from the rest).

@tomsanbear
Collaborator

tomsanbear commented Apr 11, 2024

Yes, this is a snapshot from the subsequent token generation. Here is the breakdown of where the time is spent for 4 tokens on candle:
Screenshot 2024-04-11 at 9 07 05 AM
compared to llama.cpp:
Screenshot 2024-04-11 at 9 11 48 AM

Another thing I did not notice earlier is that llama.cpp seems to use the mm kernel rather than the mv kernel for a fair number of operations; I imagine that is also having an impact here? We don't branch our implementation to use that kernel and opt for mv each time. A rough sketch of that kind of branching is shown below.
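
To make the mm/mv point concrete, the kind of branching llama.cpp appears to do could look roughly like the sketch below. This is purely illustrative: the names, the threshold, and the enum are hypothetical and not taken from either codebase; the idea is just that multi-row batches (prompt processing) go to a matrix-matrix kernel while single-token decode goes to a matrix-vector kernel.

```rust
/// Purely illustrative kernel choice; the variants mirror the mm/mv naming
/// from the trace but nothing here is real llama.cpp or candle code.
#[derive(Debug, PartialEq)]
enum MatmulKernel {
    /// General matrix-matrix kernel: worthwhile once there are enough rows
    /// in the activation to keep the GPU busy (e.g. prompt processing).
    Mm,
    /// Matrix-vector kernel: cheaper for the single-row decode case.
    Mv,
}

/// Pick a kernel from the number of rows in the activation matrix.
/// The threshold of 4 is a made-up value for illustration only.
fn pick_kernel(n_rows: usize) -> MatmulKernel {
    if n_rows >= 4 {
        MatmulKernel::Mm
    } else {
        MatmulKernel::Mv
    }
}

fn main() {
    // Prompt processing: 43 tokens at once would go through the mm kernel.
    assert_eq!(pick_kernel(43), MatmulKernel::Mm);
    // Token-by-token generation: one row at a time goes through mv.
    assert_eq!(pick_kernel(1), MatmulKernel::Mv);
}
```

As noted above, our implementation does not branch this way and takes the mv-style path each time.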

@tomsanbear
Collaborator

Here are some instructions to reproduce the llama.cpp gputrace

Check out this branch, which has some tweaks to make Metal compile in the debugger: https://github.com/tomsanbear/llama.cpp/tree/MetalDebugger

Run `make`, then the same command as earlier in this thread for ./main etc. The trace can then be opened with `open /tmp/llamacpp.gputrace`.

@LaurentMazare
Collaborator

Is it still including the prompt processing? (I suspect it does, as the copy_f32_strided should only appear there; otherwise it's likely a bug.)

@tomsanbear
Collaborator

tomsanbear commented Apr 11, 2024

I start the capture right before

        for index in 0..to_sample {

So this should only be capturing after prompt processing; as such, it's likely a bug...

@LaurentMazare
Collaborator

Ah, I think that's actually a Mistral specificity because of MQA. Could you try using the standard Llama 7B model and see what you get? You can do so just by specifying no model in the quantized example.
(I'll take a stab at optimizing this MQA bit too, though.)
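
For context on the MQA/GQA angle: Mistral shares 8 kv heads across 32 query heads, so the cached k/v tensors get repeated before the attention matmul. A minimal sketch of that repetition with candle's `Tensor` API (toy shapes and helper name are illustrative, not the exact code from the quantized example) shows where the strided copy can come from: `expand` yields a broadcast, non-contiguous layout, and the reshape that follows has to copy.

```rust
use candle_core::{DType, Device, Result, Tensor};

/// Repeat kv heads so they line up with the query heads, in the style of the
/// usual grouped-query attention helper. Sketch only.
fn repeat_kv(x: &Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(x.clone());
    }
    let (b, n_kv_head, seq_len, head_dim) = x.dims4()?;
    // `expand` gives a broadcast (non-contiguous) view, so the reshape below
    // cannot be a view and materializes the data, i.e. a strided copy.
    x.unsqueeze(2)?
        .expand((b, n_kv_head, n_rep, seq_len, head_dim))?
        .reshape((b, n_kv_head * n_rep, seq_len, head_dim))
}

fn main() -> Result<()> {
    let dev = Device::Cpu; // toy example
    // Toy cache: batch 1, 8 kv heads, 1 cached token, head_dim 4.
    let k = Tensor::zeros((1, 8, 1, 4), DType::F32, &dev)?;
    let k = repeat_kv(&k, 4)?; // 8 kv heads repeated 4x -> 32 heads
    assert_eq!(k.dims(), &[1, 32, 1, 4]);
    Ok(())
}
```

With the standard Llama 7B model there is no kv-head repetition (n_rep == 1), which is why switching models should remove those strided copies from the trace.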

@tomsanbear
Collaborator

This closes the gap a bit more

Candle

cargo run --release --features="metal,accelerate" --example quantized -- --model ~/models/llama-2-7b.Q4_K_M.gguf --which 7b --prompt 'Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?' --sample-len 200 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64
...
52 prompt tokens processed: 34.38 token/s
 199 tokens generated: 22.99 token/s

Llama.cpp

./main --model ~/models/llama-2-7b.Q4_K_M.gguf --prompt 'Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?' -n 200 -t 0.1 --seed 1337
...
llama_print_timings:        load time =    3184.75 ms
llama_print_timings:      sample time =       6.64 ms /   200 runs   (    0.03 ms per token, 30129.56 tokens per second)
llama_print_timings: prompt eval time =     264.12 ms /    52 tokens (    5.08 ms per token,   196.88 tokens per second)
llama_print_timings:        eval time =    7371.12 ms /   199 runs   (   37.04 ms per token,    27.00 tokens per second)
llama_print_timings:       total time =    7678.86 ms /   251 tokens

@LaurentMazare
Collaborator

I've put together a small hack to get around the strided copy in #2043; on my M2 Pro 16GB it significantly increases the speed (from 25 tokens/s to 31.6) for Mistral q4k generating 150 tokens.
Maybe you could give this a spin and see what the profile looks like with this and Mistral?
Also, this reshaping part is expected to be cleaned up properly soonish (hopefully this weekend), which should help with such situations.
Let's see if it helps for you too, but it's certainly great to have a good way to profile now! (I still have to run it myself though :) )

@tomsanbear
Collaborator

Ran that branch through the profiler and got the following breakdown per pipeline now:

Screenshot 2024-04-11 at 6 40 51 PM

@LaurentMazare
Collaborator

LaurentMazare commented Apr 13, 2024

Here are some updated stats on a MacBook Pro M2 Pro (14,9) with the Mistral model as of the current main branch. Timings do not include the prompt processing, where llama.cpp is still much better.

  • candle (2bf413c): 31.4 token/s.
  • llama.cpp (2f04332): 33.4 token/s.

So it's still slower but not by that much.

target/release/examples/quantized --sample-len 150 --which 7b-mistral --model mistral-7b-v0.1.Q4_K_S.gguf --prompt 'My favorite theorem is '
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
...
  6 prompt tokens processed: 34.50 token/s
  149 tokens generated: 31.42 token/s
./main -m mistral-7b-v0.1.Q4_K_S.gguf --prompt "My favorite theorem is " --verbose-prompt  -n 150
...
system_info: n_threads = 6 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
...
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 150, n_keep = 0
...
llama_print_timings:        load time =    1711.63 ms
llama_print_timings:      sample time =      11.68 ms /   150 runs   (    0.08 ms per token, 12838.07 tokens per second)
llama_print_timings: prompt eval time =     147.24 ms /     6 tokens (   24.54 ms per token,    40.75 tokens per second)
llama_print_timings:        eval time =    4460.35 ms /   149 runs   (   29.94 ms per token,    33.41 tokens per second)
llama_print_timings:       total time =    4637.79 ms

edit: adding -ngl 33 --threads 8 to the llama.cpp command does not seem to affect the timings.

@EricLBuehler
Member

EricLBuehler commented Nov 22, 2024

@tomsanbear I saw this issue; I think #2615 is quite relevant! It boosts the prompt speeds to be comparable to llama.cpp!
