ggml : synchronize threads using barriers #7993
Conversation
Is there a reason it is not using mutexes?
Since these waits are often very short, spinlocks resulted in better performance. OpenMP is also likely spinning for a while before sleeping.
It's very hard to get faster than most mutex implementations, since they use staged strategies: most will spin for a while and then switch to something else... was this actually measured?
There were several PRs to change the synchronization mechanism in the past, but they always resulted in worse performance.
The fallback seems to perform ok on the M3 Max.
The server thread sanitizer run failed the first time; I don't understand the error and cannot reproduce it locally, and the second run passed. It may have been a fluke, but it's something to look at if it continues failing intermittently after this is merged.
With MSVC:
I was surprised that the performance with MSVC had fallen so much behind, and I found that the reason is llamafile. MSVC without llamafile:
I noticed that the thread sanitizer workflow also failed last week, and it reported a data race in that case. So maybe it's not necessarily related to the changes from this PR.
That one seems to be caused by using OpenMP. Thread sanitizer is not directly compatible with OpenMP and produces false positives. I changed the workflow to avoid using OpenMP with thread sanitizer, so that shouldn't happen anymore. It was already done in build.yml in the initial PR, but I missed the server CI.
Results on M1 Pro and M2 Ultra:
LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 ./scripts/compare-commits.sh master sl/test-omp-sync -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
M2 Ultra seems to respond better to the changes.
Does it help if the
After removing
The results with the M1 Pro seem to vary quite a bit between runs; in some cases the difference between master in the previous run and this one is larger than the difference between master and the PR, even though it is the same test.
Hm, I might be doing something wrong in the benchmarks. I now ran:
LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 make -j && ./llama-bench -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
build: 4d8a0c2 (3185)
I very often have a lot of trouble getting reproducible results on the M3 Max, and I think it is due to thermal throttling. Increasing the number of repetitions with
Indeed, increasing the repetitions to -r 20 makes the results consistent:
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
|---|---|---|---|---|---|---|---|
| M1 Pro | llama 1B F16 | 2.05 | 4 | pp128 | 113.01 | 112.20 | 0.99 |
| M1 Pro | llama 1B F16 | 2.05 | 4 | tg64 | 45.01 | 45.32 | 1.01 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | pp128 | 176.56 | 180.62 | 1.02 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | tg64 | 38.65 | 49.41 | 1.28 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | pp128 | 132.44 | 131.86 | 1.00 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | tg64 | 91.38 | 92.86 | 1.02 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | pp128 | 210.17 | 212.74 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | tg64 | 118.79 | 125.84 | 1.06 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | pp128 | 151.81 | 152.35 | 1.00 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | tg64 | 71.58 | 73.54 | 1.03 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | pp128 | 246.67 | 249.50 | 1.01 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | tg64 | 56.88 | 83.92 | 1.48 |
Initially I removed the updates of
It's quite obsolete at this point, so it's fine to remove it. PR welcome.
I will remove it as part of a refactor to remove the initialize and finalize tasks in favor of using barriers in the ops.
Since it happened again in #8018, I am trying to understand what this failure means.
Yes, it seems to me that the
Replaces the spinlocks in the CPU backend with barriers, using OpenMP if possible.