Check CUDA memory pool support #3931

Closed

Conversation

young-developer
Contributor

@young-developer commented Nov 3, 2023

Some devices don't support memory pools but still overwrite the mempool array element from nullptr with a garbage value. I added an additional check based on the device properties.
Also includes multi-GPU pool access support and a memory pool access check.
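For illustration, a minimal sketch of the device-property check (not the exact patch; the helper name is made up, and it assumes a CUDA toolkit new enough that cudaDeviceProp has the memoryPoolsSupported field):

#include <cuda_runtime.h>

// Only configure a CUDA memory pool for devices that report support;
// otherwise the pool entry stays nullptr and the custom pool is used.
static bool ggml_cuda_device_supports_mempool(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.memoryPoolsSupported != 0;
}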

Collaborator

@cebtenzzre left a comment


This fixes the error I was seeing.

@young-developer
Contributor Author

@cebtenzzre Could you please check with the latest changes? The device property is not in CUDA 17 and it only works for 12+.

@cebtenzzre
Collaborator

That does work, but the command line I need to build unmodified llama.cpp is getting long:

cmake -B build \
  -DLLAMA_CUBLAS=ON \
  -DCMAKE_CUDA_HOST_COMPILER=gcc-12 -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 \
  -DLLAMA_CUDA_FORCE_MMQ=ON \
  -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL' \
&& make -C build
  • Force gcc 12 because a recent CUDA update broke support for gcc 13
  • LLAMA_CUDA_FORCE_MMQ to avoid a massive performance hit since I don't have tensor cores
  • GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL since one of my GPUs doesn't have memory pool support

If there is any benefit to the built-in memory pool, I'd like to have it enabled when I have my GTX 970 disabled with CUDA_VISIBLE_DEVICES. I'd also like to avoid the extra compile-time option, especially if we don't clearly document that this option is needed on older cards (MMQ isn't documented either). Could we check CUDART_VERSION and use prop.memoryPoolsSupported if available?
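Something along these lines, sketched under the assumption (per the comment above) that the property is only usable from CUDA 12 on; the version threshold and helper name are assumptions, not code from this PR:

#include <cuda_runtime.h>

// Compile-time guard so older toolkits (whose cudaDeviceProp has no
// memoryPoolsSupported field) still build and simply use the custom pool.
static bool cuda_pool_available(int device) {
#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.memoryPoolsSupported != 0;
#else
    (void) device;
    return false; // toolkit too old: fall back to the custom pool
#endif
}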

@young-developer
Contributor Author

@cebtenzzre Try without GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL

@cebtenzzre
Collaborator

Without that flag I get a similar error to what I was getting originally:

CUDA error 801 at /home/cebtenzzre/src/forks/llama.cpp/ggml-cuda.cu:6807: operation not supported
current device: 0

@young-developer
Contributor Author

So it somehow loaded the memory pool but fails on allocation into it. I can add an option based on the CUDA version, as you mentioned.

@Ph0rk0z

Ph0rk0z commented Nov 3, 2023

I get a "cuda buffer pool full" error when I try this out. It doesn't crash anymore, but it outputs one character over and over.

Identical behavior on dual 3090s and dual P40s.

@young-developer
Contributor Author

@cebtenzzre @Ph0rk0z I added different checks (device props, alloc/dealloc test) during the init phase. If one of those checks fails, the custom pool implementation is used instead. Please retest with the new changes.
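Roughly, the alloc/dealloc test amounts to something like this sketch (not the exact code in the commit; the helper name and allocation size are illustrative, and it assumes a toolkit with the stream-ordered allocator API, CUDA 11.2+):

#include <cuda_runtime.h>

// Try a small allocation from the device's default memory pool; if any step
// fails, report false so the caller switches to the custom pool.
static bool cuda_mempool_smoke_test(int device) {
    if (cudaSetDevice(device) != cudaSuccess) {
        return false;
    }
    cudaMemPool_t pool = nullptr;
    if (cudaDeviceGetDefaultMemPool(&pool, device) != cudaSuccess) {
        return false;
    }
    void * ptr = nullptr;
    cudaStream_t stream = nullptr; // legacy default stream
    if (cudaMallocFromPoolAsync(&ptr, 1024, pool, stream) != cudaSuccess) {
        return false;
    }
    if (cudaFreeAsync(ptr, stream) != cudaSuccess) {
        return false;
    }
    return cudaStreamSynchronize(stream) == cudaSuccess;
}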

@young-developer changed the title from "Check CUDA memory pool support in device properties." to "Check CUDA memory pool support" on Nov 4, 2023
@Ph0rk0z

Ph0rk0z commented Nov 4, 2023

Well, the check works. Now both with and without GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL it scrolls "increase MAX_CUDA_BUFFERS" warnings at me and returns one character over and over.

FWIW, I'm using CUDA 11.8 in my Python environment. It's been working for months. Did 12.1 become a requirement at some point, and if so, why?

@young-developer
Contributor Author

young-developer commented Nov 4, 2023

@Ph0rk0z If it shows you the same character over and over, it looks like it is related to another bug (not to the CUDA memory pool). Try different versions and track down when it stopped working for you.

@Ph0rk0z

Ph0rk0z commented Nov 4, 2023

The PR that merged yarn caused this issue for me.

@cebtenzzre
Collaborator

Commit 56e5162 works fine individually on my GTX 970 and Tesla P40, but multi-GPU gives "an illegal memory access was encountered".

On 81931b2 I get this with multi-GPU:

main: build = 1487 (81931b2e)
main: built with gcc-12 (GCC) 12.3.0 for x86_64-pc-linux-gnu
main: seed  = 1699115827
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, CUDA memory pool is supported
  Device 1: NVIDIA GeForce GTX 970, compute capability 5.2Warning: Device 1 doesnt support CUDA memory pool, skipping pool access config
Cant give access for main device memory pool to device 1

CUDA error 801 at /home/jared/src/forks/llama.cpp/ggml-cuda.cu:5929: operation not supported
current device: 0
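For reference, "give access for main device memory pool to device 1" corresponds roughly to the following CUDA call (a sketch, not this PR's code; the function name is made up):

#include <cuda_runtime.h>

// Allow peer_device to read/write allocations made from main_device's
// default memory pool; this is the step that fails when the peer cannot
// access the pool, as in the log above.
static bool give_pool_access(int main_device, int peer_device) {
    cudaMemPool_t pool = nullptr;
    if (cudaDeviceGetDefaultMemPool(&pool, main_device) != cudaSuccess) {
        return false;
    }
    cudaMemAccessDesc desc = {};
    desc.location.type = cudaMemLocationTypeDevice;
    desc.location.id   = peer_device;
    desc.flags         = cudaMemAccessFlagsProtReadWrite;
    return cudaMemPoolSetAccess(pool, &desc, 1) == cudaSuccess;
}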

@cebtenzzre
Collaborator

cebtenzzre commented Nov 4, 2023

The PR that merged yarn caused this issue for me.

If you can reproduce the problem on latest master then you should open a new issue. There were some mistakes when I merged YaRN, but CUDA and Metal seem to be working fine for most people after the fixup PRs.

edit: I see that you already opened an issue. I don't think what you're seeing is directly related to the YaRN PR.

@young-developer
Contributor Author

Commit 56e5162 works fine individually on my GTX 970 and Tesla P40, but multi-GPU gives "an illegal memory access was encountered".

On 81931b2 I get this with multi-GPU:

main: build = 1487 (81931b2e)
main: built with gcc-12 (GCC) 12.3.0 for x86_64-pc-linux-gnu
main: seed  = 1699115827
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, CUDA memory pool is supported
  Device 1: NVIDIA GeForce GTX 970, compute capability 5.2Warning: Device 1 doesnt support CUDA memory pool, skipping pool access config
Cant give access for main device memory pool to device 1

CUDA error 801 at /home/jared/src/forks/llama.cpp/ggml-cuda.cu:5929: operation not supported
current device: 0

Please retest once again.

@cebtenzzre
Collaborator

Please retest once again.

The latest commit seems to work fine with either GPU or with both. Thanks!

@young-developer
Contributor Author

young-developer commented Nov 4, 2023

@cebtenzzre hm, so one uses CUDA pool and another custom pool , hm interesting. I have only one GPU so it is blind fixing :)

@staviq
Contributor

staviq commented Nov 4, 2023

Still broken for me.

RTX 2070 plus p106-100:

@ 4ff1046: works
@ d606905: CUDA error 1 at ggml-cuda.cu:7036: invalid argument

This PR:
@ 863166b: CUDA error 217 at ggml-cuda.cu:6881: peer access is not supported between these two devices

@cebtenzzre
Collaborator

cebtenzzre commented Nov 4, 2023

Actually, I just tested with multi-GPU and full GPU offload instead of partial offload and got this console spam:

WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
...

So, not fixed yet.
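For context, the custom pool that prints this warning keeps a fixed number of cached buffers per device; a simplified sketch of that scheme (names approximate ggml-cuda.cu but are not verbatim, and the per-device dimension is omitted):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstddef>

#define MAX_CUDA_BUFFERS 256

struct cuda_buffer {
    void * ptr  = nullptr;
    size_t size = 0;
};

static cuda_buffer g_buffer_pool[MAX_CUDA_BUFFERS];

static void pool_free(void * ptr, size_t size) {
    for (int i = 0; i < MAX_CUDA_BUFFERS; ++i) {
        cuda_buffer & b = g_buffer_pool[i];
        if (b.ptr == nullptr) { // free slot: cache the buffer for reuse
            b.ptr  = ptr;
            b.size = size;
            return;
        }
    }
    // every slot occupied: warn and release the memory immediately
    fprintf(stderr, "WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS\n");
    cudaFree(ptr);
}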

@staviq
Contributor

staviq commented Nov 4, 2023

@young-developer

I have only one GPU so it is blind fixing :)

I have my RTX 2070 and P106 together in a VM; if you send me your SSH public key and IP address, I can let you in for testing: staviq at gmail.com

@young-developer
Contributor Author

I think I will make the CUDA memory pool optional, because many different multi-GPU combinations can be used and it will take me some time to check most cases.

@young-developer
Contributor Author

young-developer commented Nov 4, 2023

@cebtenzzre @staviq I added LLAMA_CUDA_USE_CUDA_POOL so you can recompile if you want to test multiple GPUs using CUDA pools. Once it is stable, we can enable it by default.
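A hedged sketch of how such a CMake option usually reaches the CUDA source: the option defines a preprocessor symbol that gates the pool selection. The define name and the helper it calls are assumptions, not the PR's actual code:

// GGML_CUDA_USE_CUDA_POOL here is a hypothetical define set by the CMake
// option; cuda_pool_available() is the property check sketched earlier in
// this thread.
static bool ggml_cuda_want_cuda_pool(int device) {
#if defined(GGML_CUDA_USE_CUDA_POOL)
    return cuda_pool_available(device);
#else
    (void) device;
    return false; // option off: always use the custom pool
#endif
}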

@slaren
Collaborator

slaren commented Nov 4, 2023

This is getting more complex than I would like for what is supposed to be a temporary solution, so if we have to resort to disabling this by default I would prefer to revert the original PR instead.

@cebtenzzre
Collaborator

cebtenzzre commented Nov 4, 2023

I added LLAMA_CUDA_USE_CUDA_POOL so you can recompile if you want to test multiple GPUs using CUDA pools. Once it will be stable we can enable it by default.

With 2b0303a I am back to this with multi-GPU (LLAMA_CUDA_USE_CUDA_POOL is OFF):

CUDA error 700 at /home/jared/src/forks/llama.cpp/ggml-cuda.cu:7178: an illegal memory access was encountered
current device: 1

@young-developer
Contributor Author

young-developer commented Nov 4, 2023

This is getting more complex than I would like for what is supposed to be a temporary solution, so if we have to resort to disabling this by default I would prefer to revert the original PR instead.

Yep, definitely an option. Then I will close this PR.

@Ph0rk0z

Ph0rk0z commented Nov 4, 2023

Actually, I just tested with multi-GPU and full GPU offload instead of partial offload and got this console spam:

WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
WARNING: cuda buffer pool full, increase MAX_CUDA_BUFFERS
...

So, not fixed yet.

This is the error I was talking about. Did it also output nonsense?

After YaRN, I had the allocation error I mentioned and the model would never load. Subsequent commits brought it to this point. I only reference it because that is the point where I stopped being able to span a model across multiple GPUs.

@cebtenzzre
Collaborator

This is the error I was talking about. Did it also output nonsense?

If #3944 works for you then YaRN is no longer a problem. I can't check it right now because my GPU is busy, but I've basically reverted the same commits without issue on my local fork.

@Ph0rk0z

Ph0rk0z commented Nov 5, 2023

It's finally working now. Also, Pascal prompt processing is fixed without #3816.

Speed is mostly the same, only a difference of a few tenths. Not for the better, but oh well.
