Server creates a CPU buffer no matter the VRAM usage for 72B models #11012
DrVonSinistro started this conversation in General

Replies: 1 comment 1 reply
-
QWEN2.5 32B Q8 loads fully in the GPU and creates something called CUDA_Host, which holds only a few MB. There is no significant CPU usage during prompt processing or inference.
QWEN2.5 72B at Q6/Q5/Q4/Q2 also loads fully in the GPU, but even when VRAM is only half full it always creates a CPU buffer and fills it with 600-800 MB of something.
Then a single CPU core runs flat out during prompt processing and inference, which is very annoying. I have tried everything. Please send help.
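For reference, here is a minimal sketch of what a pinned-host ("CUDA_Host") staging buffer looks like at the CUDA level, assuming that is the kind of allocation the backend is making here; the size, names, and copy pattern below are illustrative and are not llama.cpp's actual code:

```cpp
// Illustration only, not llama.cpp's allocator: a pinned (page-locked) host buffer
// of the kind CUDA backends keep for staging CPU<->GPU copies. The size and names
// below are made up.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t staging_bytes = 64 * 1024 * 1024;  // hypothetical staging size

    float *h_staging = nullptr;  // pinned host buffer (the "CPU buffer" in the server log)
    float *d_buf     = nullptr;  // device buffer standing in for data already in VRAM

    // cudaMallocHost returns page-locked memory, so cudaMemcpyAsync can overlap
    // the copy with other work instead of falling back to a synchronous path.
    cudaMallocHost(&h_staging, staging_bytes);
    cudaMalloc(&d_buf, staging_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Typical pattern: write into the pinned buffer on the CPU, then push it to
    // the device asynchronously.
    memset(h_staging, 0, staging_bytes);
    cudaMemcpyAsync(d_buf, h_staging, staging_bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_staging);
    printf("staged %zu bytes through pinned host memory\n", staging_bytes);
    return 0;
}
```

If that assumption holds, the buffer itself is expected even when the weights are fully offloaded; the single busy core is a separate effect (see the reply below).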
-
This is CUDA waiting for the device to finish work: https://forums.developer.nvidia.com/t/100-cpu-usage-when-running-cuda-code/35920 It's normal.
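To make the linked explanation concrete: by default the CUDA runtime can spin-wait while the host waits for the GPU, which is what shows up as one core at 100%. A minimal sketch, assuming the spin comes from the default scheduling policy; the kernel and sizes are made up and this is not llama.cpp code:

```cpp
// Illustration only: why waiting on the GPU can peg one CPU core, and the device
// scheduling flag that switches the host from spinning to blocking.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 100000; ++k) v = v * 1.0000001f + 1e-7f;
        x[i] = v;
    }
}

int main() {
    // The default (cudaDeviceScheduleAuto) often resolves to spin-waiting, which
    // shows up as ~100% on one core while the host waits. Blocking sync makes the
    // host thread sleep instead. Set the flag before any other CUDA work.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    // With blocking sync this wait should not burn a full core; with the default
    // spin policy it typically does.
    cudaDeviceSynchronize();

    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```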