llama-bench : add test measuring token generation rate at given prompt length #11126

fairydreaming · 2025-01-07T16:44:21Z

I needed a test that would measure the token generation rate after processing a prompt of a given length, so I decided to add a new kind of test to the llama-bench tool.

This PR adds a -gp <pp,tg> option that specifies a prompt length and the number of tokens to generate after processing the prompt. The new test works almost the same way as the existing -pg test, but it ignores the prompt length and prompt processing time when calculating the result; only the token generation rate is reported.
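To make the difference between the two metrics concrete, here is a minimal sketch of the arithmetic. It assumes the -pg result is total tokens divided by total time, as the description implies. The struct and function names are hypothetical and are not taken from llama-bench.

```cpp
// Hypothetical timing record for one benchmark repetition; these names are
// illustrative and do not come from the llama-bench source.
struct run_timing {
    int    n_prompt;   // prompt tokens processed
    int    n_gen;      // tokens generated after the prompt
    double t_prompt_s; // seconds spent on prompt processing
    double t_gen_s;    // seconds spent on token generation
};

// -pg style (assumed): all tokens divided by total time, so prompt
// processing speed is averaged into the reported rate.
double rate_pg(const run_timing & r) {
    return (r.n_prompt + r.n_gen) / (r.t_prompt_s + r.t_gen_s);
}

// -gp style: generated tokens divided by generation time only.
double rate_gp(const run_timing & r) {
    return r.n_gen / r.t_gen_s;
}
```

With made-up numbers (128 prompt tokens in 4 s, then 32 generated tokens in 4 s), rate_pg gives 160 / 8 = 20 t/s while rate_gp gives 32 / 4 = 8 t/s, which shows how fast prompt processing can mask a slower generation rate in the averaged figure.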

Test results are labeled differently to avoid confusion with -pg test results. I used the @ character to emphasize that the result indicates the token generation rate AT a given prompt length.

Example:

$ ./bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf -p 0 -n 0 -gp 128,32 -gp 256,32 -r 3

model	size	params	backend	threads	test	t/s
deepseek2 671B Q4_K - Small	353.90 GiB	671.03 B	CPU	32	tg32@pp128	8.94 ± 0.06
deepseek2 671B Q4_K - Small	353.90 GiB	671.03 B	CPU	32	tg32@pp256	8.35 ± 0.01

Hopefully this is more intuitive than the averaged prompt processing + token generation rate in -pg test results.
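The label format seen in the example output (tg32@pp128) could be produced along these lines. This is only a sketch: gp_test_label is a hypothetical helper name, not a function from the PR.

```cpp
#include <string>

// Hypothetical helper: builds a -gp style label such as "tg32@pp128",
// i.e. "tg<generated tokens>@pp<prompt length>".
std::string gp_test_label(int n_prompt, int n_gen) {
    return "tg" + std::to_string(n_gen) + "@pp" + std::to_string(n_prompt);
}
```

Calling gp_test_label(128, 32) and gp_test_label(256, 32) yields the two labels shown in the table above.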


slaren

The other printers (sql, json, etc) would also need to be updated.

fairydreaming · 2025-01-08T13:42:47Z

> The other printers (sql, json, etc) would also need to be updated.

@slaren Can you be more specific?

slaren · 2025-01-08T13:47:01Z

The test type needs to be exported in these printers as well, since n_prompt and n_gen are no longer enough to tell the difference. Another option would be to get rid of test_kind_type and just record and report timings for the prompt and generation steps separately; then the -pg test would be enough.

fairydreaming · 2025-01-08T14:05:33Z

> The test type needs to be exported in these printers as well, since n_prompt and n_gen are no longer enough to tell the difference. Another option would be to get rid of test_kind_type and just record and report timings for the prompt and generation steps separately; then the -pg test would be enough.

I guess another option is to add a "test" column to all printers with the same values as displayed in the default console output. Any specific reason it's not included there?

slaren · 2025-01-08T14:11:16Z

Yes, that's what I meant when I said that the test type would need to be exported in these printers. There isn't a test column/field at the moment because it is not necessary.
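As an illustration of the point under discussion: a -pg 128,32 run and a -gp 128,32 run share the same n_prompt and n_gen, so a machine-readable record needs an extra field to distinguish them. The sketch below is hypothetical; bench_result, its field names, and to_json are assumptions made for illustration and do not reflect the actual llama-bench printers.

```cpp
#include <string>

// Hypothetical result record; field names are illustrative only.
struct bench_result {
    std::string test;     // label as shown in the console "test" column
    int         n_prompt; // prompt length
    int         n_gen;    // number of generated tokens
    double      avg_ts;   // average tokens per second
};

// Serializes one record as a JSON object that includes the "test" field,
// so records with identical n_prompt and n_gen remain distinguishable.
std::string to_json(const bench_result & r) {
    return "{\"test\": \"" + r.test + "\", "
           "\"n_prompt\": " + std::to_string(r.n_prompt) + ", "
           "\"n_gen\": " + std::to_string(r.n_gen) + ", "
           "\"avg_ts\": " + std::to_string(r.avg_ts) + "}";
}
```

The same idea would apply to the other output formats: each would gain one extra column or field carrying the label.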

llama-bench : add -gp <pp,tg> test measuring token generation rate at given prompt length

bb6569e

github-actions bot added the examples label Jan 7, 2025

llama-bench : whitespace formatting

1c69b0e

slaren reviewed Jan 8, 2025
