
Fix load time calculation error in llama_bench #9546

Closed

Conversation

@Septa2112 (Contributor) commented Sep 19, 2024

Fix #9286

Fixes the load time calculation error in llama_bench when running multiple benchmarks with the same model.

The test results below show that, with the fix, load_time stays essentially constant across runs and the change has almost no impact on the benchmark throughput.
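For reference, a bench loop can clear a context's perf counters between runs so that one run's timings do not leak into the next. This is a minimal sketch using the public llama_perf_* API from llama.h; it is not the patch this PR proposes, and whether the reported load time is covered by the reset depends on the library's internals:

```cpp
#include "llama.h"

// Sketch only: run each generation test, print the timings, then reset
// the per-context counters before the next run starts.
static void bench_loop(llama_context * ctx) {
    const int n_gen[] = { 1, 2, 4, 8, 16, 32 };
    for (const int n : n_gen) {
        (void) n;
        // ... build a batch and call llama_decode(ctx, batch) n times ...
        llama_perf_context_print(ctx);  // report load time and t/s
        llama_perf_context_reset(ctx);  // start the next run from zero
    }
}
```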

Test Results:

load_time testing

  • Command: ./build/bin/llama-bench -m ../gguf_models/llama-2-7b.Q2_K.gguf -n 1,2,4,8,16,32 -p 0 -v

Original

llama_perf_context_print:        load time =     172.46 ms
llama_perf_context_print:        load time =     596.39 ms
llama_perf_context_print:        load time =    1362.16 ms
llama_perf_context_print:        load time =    2820.51 ms
llama_perf_context_print:        load time =    5678.16 ms
llama_perf_context_print:        load time =   11295.91 ms

After modification

llama_perf_context_print:        load time =     171.97 ms
llama_perf_context_print:        load time =     171.44 ms
llama_perf_context_print:        load time =     167.02 ms
llama_perf_context_print:        load time =     166.92 ms
llama_perf_context_print:        load time =     168.78 ms
llama_perf_context_print:        load time =     167.03 ms

bench result testing

  • Command: ./build/bin/llama-bench -m ../gguf_models/llama-2-7b.Q2_K.gguf -n 1,2,4,8,16,32 -p 0

Original

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg1 |         14.49 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg2 |         14.48 ± 0.06 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg4 |         14.48 ± 0.02 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg8 |         14.48 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg16 |         14.47 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg32 |         14.45 ± 0.00 |

build: 8a308354 (3782)

After modification

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg1 |         14.49 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg2 |         14.44 ± 0.04 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg4 |         14.42 ± 0.03 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |           tg8 |         14.47 ± 0.01 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg16 |         14.45 ± 0.02 |
| llama 7B Q2_K - Medium         |   2.63 GiB |     6.74 B | CPU        |       8 |          tg32 |         14.37 ± 0.04 |

build: 216e7d96 (3784)

@slaren (Collaborator) commented Sep 19, 2024

This is not a good solution; we should not add an API to work around a bug. Instead, this needs to be fixed in llama.cpp so that the model loading times are not overwritten when creating additional contexts.
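A hypothetical sketch of that direction (illustrative names, not llama.cpp's actual internals): measure the load time once when the model is loaded, store it on the model, and have contexts report that stored value instead of re-measuring it.

```cpp
#include <chrono>
#include <cstdint>

struct model_timing {
    int64_t t_load_us = 0;  // written exactly once, at load time
};

struct context_timing {
    const model_timing * model;  // contexts only read model->t_load_us
};

static model_timing load_model() {
    const auto t0 = std::chrono::steady_clock::now();
    model_timing m;
    // ... read and map the model tensors from disk ...
    const auto t1 = std::chrono::steady_clock::now();
    m.t_load_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return m;  // creating additional contexts never touches m.t_load_us
}
```

Because the value lives on the model and is written once, creating more contexts cannot overwrite it.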

@slaren (Collaborator) commented Sep 19, 2024

Alternatively, simply removing the update of the load time after the first evaluation would be enough. This was done to improve the accuracy of the load time when using mmap, since the model data might not have been read from disk until it is used, but there are better ways to do that, and I am not sure that it is really that important.
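For context, the first-evaluation update being described looks roughly like the following simplified sketch (field names are assumed, not copied from llama.cpp). Presumably, each subsequent context re-ran this update against a timer origin tied to model load, so the printed load time kept growing between runs:

```cpp
#include <cstdint>

// Simplified sketch of the hack under discussion (assumed field names).
// With mmap, tensor data may not be faulted in from disk until the first
// decode touches it, so the load time was re-measured on the first eval:
struct ctx_timings {
    int64_t t_start_us = 0;         // timer origin shared with model load
    int64_t t_load_us  = 0;         // value printed as "load time"
    bool    has_evaluated_once = false;
};

static void update_load_time_on_first_eval(ctx_timings & t, int64_t t_now_us) {
    if (!t.has_evaluated_once) {
        // overwrites the load time measured during model loading; each new
        // context repeats this, so the printed value grows between runs
        t.t_load_us = t_now_us - t.t_start_us;
        t.has_evaluated_once = true;
    }
}
```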

@ggerganov (Owner) commented

> Alternatively, simply removing the update of the load time after the first evaluation would be enough. This was done to improve the accuracy of the load time when using mmap, since the model data might not have been read from disk until it is used, but there are better ways to do that, and I am not sure that it is really that important.

Yes, it's better to remove this hack.

@Septa2112 (Contributor, Author) commented

OK, thanks for your suggestions! I will reopen this after resolving it in a better way.

@Septa2112 closed this Sep 20, 2024
Linked issue: #9286, "Bug: llama_print_timings seems to accumulate load_time/total_time in llama-bench"