
Fix load time calculation error in llama_bench #9546

Closed

Conversation

@Septa2112 (Contributor) commented Sep 19, 2024

Fix #9286

Fixes the load time calculation error in llama_bench when running multiple benchmarks with the same model.

The test results below show that, with the fix, load_time stays essentially constant across runs and the change has almost no impact on the benchmark throughput.
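For reference, a bench loop can clear a context's perf counters between runs so that one run's timings do not leak into the next. This is a minimal sketch using the public llama_perf_* API from llama.h; it is not the patch this PR proposes, and whether the reported load time is covered by the reset depends on the library's internals:

```cpp
#include "llama.h"

// Sketch only: run each generation test, print the timings, then reset
// the per-context counters before the next run starts.
static void bench_loop(llama_context * ctx) {
    const int n_gen[] = { 1, 2, 4, 8, 16, 32 };
    for (const int n : n_gen) {
        (void) n;
        // ... build a batch and call llama_decode(ctx, batch) n times ...
        llama_perf_context_print(ctx);  // report load time and t/s
        llama_perf_context_reset(ctx);  // start the next run from zero
    }
}
```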

Test Results:

load_time testing

  • Command: ./build/bin/llama-bench -m ../gguf_models/llama-2-7b.Q2_K.gguf -n 1,2,4,8,16,32 -p 0 -v

Original

llama_perf_context_print:        load time =     172.46 ms
llama_perf_context_print:        load time =     596.39 ms
llama_perf_context_print:        load time =    1362.16 ms
llama_perf_context_print:        load time =    2820.51 ms
llama_perf_context_print:        load time =    5678.16 ms
llama_perf_context_print:        load time =   11295.91 ms

After modification

llama_perf_context_print:        load time =     171.97 ms
llama_perf_context_print:        load time =     171.44 ms
llama_perf_context_print:        load time =     167.02 ms
llama_perf_context_print:        load time =     166.92 ms
llama_perf_context_print:        load time =     168.78 ms
llama_perf_context_print:        load time =     167.03 ms

bench result testing

  • Command: ./build/bin/llama-bench -m ../gguf_models/llama-2-7b.Q2_K.gguf -n 1,2,4,8,16,32 -p 0

Original

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg1 |         14.49 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg2 |         14.48 ± 0.06 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg4 |         14.48 ± 0.02 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg8 |         14.48 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg16 |         14.47 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg32 |         14.45 ± 0.00 |

build: 8a308354 (3782)

After modification

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg1 |         14.49 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg2 |         14.44 ± 0.04 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg4 |         14.42 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg8 |         14.47 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg16 |         14.45 ± 0.02 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg32 |         14.37 ± 0.04 |

build: 216e7d96 (3784)

@slaren (Collaborator) commented Sep 19, 2024

This is not a good solution; we should not add an API to work around a bug. Instead, this needs to be fixed in llama.cpp so that the model loading times are not overwritten when creating additional contexts.
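A hypothetical sketch of that direction (illustrative names, not llama.cpp's actual internals): measure the load time once when the model is loaded, store it on the model, and have contexts report that stored value instead of re-measuring it.

```cpp
#include <chrono>
#include <cstdint>

struct model_timing {
    int64_t t_load_us = 0;  // written exactly once, at load time
};

struct context_timing {
    const model_timing * model;  // contexts only read model->t_load_us
};

static model_timing load_model() {
    const auto t0 = std::chrono::steady_clock::now();
    model_timing m;
    // ... read and map the model tensors from disk ...
    const auto t1 = std::chrono::steady_clock::now();
    m.t_load_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return m;  // creating additional contexts never touches m.t_load_us
}
```

Because the value lives on the model and is written once, creating more contexts cannot overwrite it.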

@slaren (Collaborator) commented Sep 19, 2024

Alternatively, simply removing the update of the load time after the first evaluation would be enough. This was done to improve the accuracy of the load time when using mmap, since the model data might not have been read from disk until it is used, but there are better ways to do that, and I am not sure that it is really that important.
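For context, the first-evaluation update being described looks roughly like the following simplified sketch (field names are assumed, not copied from llama.cpp). Presumably, each subsequent context re-ran this update against a timer origin tied to model load, so the printed load time kept growing between runs:

```cpp
#include <cstdint>

// Simplified sketch of the hack under discussion (assumed field names).
// With mmap, tensor data may not be faulted in from disk until the first
// decode touches it, so the load time was re-measured on the first eval:
struct ctx_timings {
    int64_t t_start_us = 0;         // timer origin shared with model load
    int64_t t_load_us  = 0;         // value printed as "load time"
    bool    has_evaluated_once = false;
};

static void update_load_time_on_first_eval(ctx_timings & t, int64_t t_now_us) {
    if (!t.has_evaluated_once) {
        // overwrites the load time measured during model loading; each new
        // context repeats this, so the printed value grows between runs
        t.t_load_us = t_now_us - t.t_start_us;
        t.has_evaluated_once = true;
    }
}
```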

@ggerganov (Owner) commented

> Alternatively, simply removing the update of the load time after the first evaluation would be enough. This was done to improve the accuracy of the load time when using mmap, since the model data might not have been read from disk until it is used, but there are better ways to do that, and I am not sure that it is really that important.

Yes, it's better to remove this hack.

@Septa2112 (Contributor, Author) commented

OK, thanks for your suggestions! I will reopen this after resolving it in a better way.

@Septa2112 closed this Sep 20, 2024
Linked issue: #9286, "Bug: llama_print_timings seems to accumulate load_time/total_time in llama-bench"