Check CUDA memory pool support #3931
Conversation
This fixes the error I was seeing.
@cebtenzzre Could you please check with the latest changes, because the device property is not available in CUDA 11.7 and only works on 12+.
That does work, but the command line I need to build unmodified llama.cpp is getting long:
```shell
cmake -B build \
  -DLLAMA_CUBLAS=ON \
  -DCMAKE_CUDA_HOST_COMPILER=gcc-12 -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 \
  -DLLAMA_CUDA_FORCE_MMQ=ON \
  -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL' \
  && make -C build
```
If there is any benefit to the built-in memory pool, I'd like to have it enabled when my GTX 970 is disabled with CUDA_VISIBLE_DEVICES. I'd also like to avoid the extra compile-time option, especially if we don't clearly document that this option is needed on older cards (MMQ isn't documented either). Could we check CUDART_VERSION and use prop.memoryPoolsSupported if available?
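A minimal sketch of what that check could look like, assuming it sits in the ggml-cuda init path; the 12000 threshold just follows the comment above, and `use_cuda_mem_pool` is an illustrative name, not the PR's actual variable:
```cpp
// Hedged sketch, not the PR's code: gate the struct field on CUDART_VERSION so
// toolkits that lack cudaDeviceProp::memoryPoolsSupported still compile, and
// fall back to the custom pool there. CUDA_CHECK and `device` are assumed from
// the surrounding ggml-cuda init code.
bool use_cuda_mem_pool = false;  // illustrative name
#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
use_cuda_mem_pool = prop.memoryPoolsSupported != 0;
#endif
```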
@cebtenzzre Try without GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL
Without that flag I get a similar error to what I was getting originally:
So it somehow loaded the memory pool but fails when allocating from it. I can add an option based on the CUDA version, as you've mentioned.
I get a "cuda buffer pool full" error when I try this out. It doesn't crash anymore, but it outputs one character over and over. Identical behavior for dual 3090s and dual P40s.
… device properties checked.
@cebtenzzre @Ph0rk0z I added different checks (device props, alloc/dealloc test) during the init phase. If one of those checks fails, it uses the custom pool implementation. Please retest with the new changes.
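For reference, an alloc/dealloc probe along those lines could look roughly like the following; this is a hedged sketch with illustrative names (ggml_cuda_mem_pool_works, the 1 KiB probe size), not the PR's actual code:
```cpp
// Sketch: try one small stream-ordered allocation; if cudaMallocAsync or
// cudaFreeAsync fails, this device should use the custom pool instead.
static bool ggml_cuda_mem_pool_works(int device) {
    cudaSetDevice(device);
    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        return false;
    }
    void * ptr = nullptr;
    bool ok = cudaMallocAsync(&ptr, 1024, stream) == cudaSuccess;
    if (ok) {
        ok = cudaFreeAsync(ptr, stream) == cudaSuccess;
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return ok;
}
```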
Well, the check works. It's now the same both with and without GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL. FWIW, I'm using CUDA 11.8 in my Python environment. It's been working for months. Did 12.1 become a requirement at some point, and if so, why?
@Ph0rk0z So it looks like it is related to another bug (not related to the CUDA memory pool) if it shows you the same character over and over. Try different versions and track when it stopped working for you.
The PR that merged YaRN caused this issue for me.
…GPUs and main GPU.
Commit 56e5162 works fine individually on my GTX 970 and Tesla P40, but multi-GPU gives "an illegal memory access was encountered". On 81931b2 I get this with multi-GPU:
edit: I see that you already opened an issue. I don't think what you're seeing is directly related to the YaRN PR.
Please retest once again.
The latest commit seems to work fine with either GPU or with both. Thanks!
@cebtenzzre Hm, so one uses the CUDA pool and the other the custom pool, interesting. I have only one GPU, so this is blind fixing :)
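For that mixed case, a per-device flag is one way to let each GPU independently pick the CUDA pool or the custom pool; this is purely an illustrative sketch (g_device_uses_cuda_pool and ggml_cuda_pool_malloc_custom are assumed names, not the PR's code):
```cpp
// Sketch: per-device dispatch between the CUDA memory pool and the custom pool.
static void * ggml_cuda_pool_malloc_custom(int device, size_t size);  // assumed helper, defined elsewhere

static bool g_device_uses_cuda_pool[GGML_CUDA_MAX_DEVICES] = { false };

static void * ggml_cuda_pool_malloc_dispatch(int device, size_t size, cudaStream_t stream) {
    if (g_device_uses_cuda_pool[device]) {
        void * ptr = nullptr;
        CUDA_CHECK(cudaMallocAsync(&ptr, size, stream));  // stream-ordered alloc from the device's default pool
        return ptr;
    }
    return ggml_cuda_pool_malloc_custom(device, size);
}
```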
Actually, I just tested with multi-GPU and full GPU offload instead of partial offload and got this console spam:
So, not fixed yet. |
I have my RTX 2070 and P106 together in a VM; if you send me your SSH public key and IP address, I can let you in for testing.
I think I will make the CUDA mem pool optional, because many different multi-GPU combinations are possible and it will take me some time to check most cases.
@cebtenzzre @staviq I added the LLAMA_CUDA_USE_CUDA_POOL option.
This is getting more complex than I would like for what is supposed to be a temporary solution, so if we have to resort to disabling this by default, I would prefer to revert the original PR instead.
With 2b0303a I am back to this with multi-GPU (LLAMA_CUDA_USE_CUDA_POOL is OFF):
Yep, definitely an option. Then I will close this PR.
This is the error I was talking about. Did it also output nonsense? After YaRN I had the allocation error I mentioned and the model would never load. Subsequent commits brought it to this point. I only reference it because that is the point at which I stopped being able to span a model across multiple GPUs.
If #3944 works for you, then YaRN is no longer a problem. I can't check it right now because my GPU is busy, but I've basically reverted the same commits without issue on my local fork.
It's finally working now. Also, Pascal prompt processing is fixed without #3816. Speed is mostly the same, only a difference of a few tenths. Not for the better, but oh well.
Some devices don't support memory pools but still overwrite the mempool array element from nullptr to some garbage value. I added an additional check based on device properties.
Includes multi-GPU pool access support and a memory pool access check.
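A hedged sketch of the kind of guard described, with illustrative names (g_cuda_mem_pools is not necessarily the PR's variable): only fetch the default mempool when the device reports support, so the array entry stays nullptr for devices that would otherwise hand back garbage.
```cpp
// Sketch: per-device support check before touching the default memory pool.
int pool_supported = 0;
CUDA_CHECK(cudaDeviceGetAttribute(&pool_supported, cudaDevAttrMemoryPoolsSupported, device));
if (pool_supported) {
    CUDA_CHECK(cudaDeviceGetDefaultMemPool(&g_cuda_mem_pools[device], device));
} else {
    g_cuda_mem_pools[device] = nullptr;  // this device falls back to the custom pool
}
```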