Server creates a CPU buffer no matter the VRAM usage for 72B models #11012
DrVonSinistro started this conversation in General

Replies: 1 comment 1 reply
-
QWEN2.5 32B Q8 loads fully in the GPU and creates something called CUDA_Host, which holds only a few MB. There is no significant CPU usage during prompt processing or inference.
QWEN2.5 72B at Q6/Q5/Q4/Q2 also loads fully in the GPU, but even when VRAM is only half full it always creates a CPU buffer and fills it with 600-800 MB of something.
Then a single CPU core runs flat out during prompt processing and inference, which is very annoying. I have tried everything. Please send help.
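For reference, here is a minimal sketch of what a pinned-host ("CUDA_Host") staging buffer looks like at the CUDA level, assuming that is the kind of allocation the backend is making here; the size, names, and copy pattern below are illustrative and are not llama.cpp's actual code:

```cpp
// Illustration only, not llama.cpp's allocator: a pinned (page-locked) host buffer
// of the kind CUDA backends keep for staging CPU<->GPU copies. The size and names
// below are made up.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t staging_bytes = 64 * 1024 * 1024;  // hypothetical staging size

    float *h_staging = nullptr;  // pinned host buffer (the "CPU buffer" in the server log)
    float *d_buf     = nullptr;  // device buffer standing in for data already in VRAM

    // cudaMallocHost returns page-locked memory, so cudaMemcpyAsync can overlap
    // the copy with other work instead of falling back to a synchronous path.
    cudaMallocHost(&h_staging, staging_bytes);
    cudaMalloc(&d_buf, staging_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Typical pattern: write into the pinned buffer on the CPU, then push it to
    // the device asynchronously.
    memset(h_staging, 0, staging_bytes);
    cudaMemcpyAsync(d_buf, h_staging, staging_bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_staging);
    printf("staged %zu bytes through pinned host memory\n", staging_bytes);
    return 0;
}
```

If that assumption holds, the buffer itself is expected even when the weights are fully offloaded; the single busy core is a separate effect (see the reply below).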
-
This is CUDA waiting for the device to finish work: https://forums.developer.nvidia.com/t/100-cpu-usage-when-running-cuda-code/35920 It's normal.
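To make the linked explanation concrete: by default the CUDA runtime can spin-wait while the host waits for the GPU, which is what shows up as one core at 100%. A minimal sketch, assuming the spin comes from the default scheduling policy; the kernel and sizes are made up and this is not llama.cpp code:

```cpp
// Illustration only: why waiting on the GPU can peg one CPU core, and the device
// scheduling flag that switches the host from spinning to blocking.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 100000; ++k) v = v * 1.0000001f + 1e-7f;
        x[i] = v;
    }
}

int main() {
    // The default (cudaDeviceScheduleAuto) often resolves to spin-waiting, which
    // shows up as ~100% on one core while the host waits. Blocking sync makes the
    // host thread sleep instead. Set the flag before any other CUDA work.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    // With blocking sync this wait should not burn a full core; with the default
    // spin policy it typically does.
    cudaDeviceSynchronize();

    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```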