add available memory check to accelerators #4508

Merged 6 commits on Oct 17, 2023

Conversation

jeffra
Collaborator

@jeffra jeffra commented Oct 12, 2023

There are many scenarios where we need an accurate estimate of the available memory on a device. Currently we rely on the torch memory allocator stats for this information. However, there are several cases where memory may be allocated outside the view of torch, which means that torch.cuda.get_device_properties(device_index).total_memory - torch.cuda.memory_allocated(device_index) is not accurate. This is usually less of a problem on data center GPUs, but it is quite common on consumer-grade GPUs, which are often shared between torch and the operating system.

This PR introduces available_memory to the abstract accelerator interface. On CUDA devices we can rely on pynvml to get the ground truth w.r.t. available memory.

This also introduces a hard dependency on pynvml. I have tested on non-GPU systems: the package appears to install successfully there but fails at runtime at the nvmlInit() call. In cases where pynvml is not functional, we fall back to using torch stats for memory.
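
For readers who want to see the shape of the fallback described above, here is a minimal sketch. It is not the code merged in this PR. The function names (`nvml_free_bytes`, `torch_free_bytes`, `available_memory`) and the injectable `primary`/`fallback` parameters are hypothetical and chosen for illustration. It assumes that the NVML device index matches the torch device index, which may not hold when `CUDA_VISIBLE_DEVICES` remaps devices.

```python
from typing import Callable, Optional


def nvml_free_bytes(device_index: int) -> Optional[int]:
    """Return free device memory in bytes as reported by NVML, or None if NVML is unusable."""
    try:
        # pynvml may import fine on a non-GPU system and only fail at nvmlInit().
        import pynvml

        pynvml.nvmlInit()
    except Exception:
        return None
    try:
        # Assumption: the NVML index equals the torch index (no CUDA_VISIBLE_DEVICES remapping).
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).free)
    except Exception:
        return None
    finally:
        pynvml.nvmlShutdown()


def torch_free_bytes(device_index: int) -> int:
    """Estimate free memory from torch allocator stats; misses allocations made outside torch."""
    import torch

    total = torch.cuda.get_device_properties(device_index).total_memory
    return total - torch.cuda.memory_allocated(device_index)


def available_memory(
    device_index: int = 0,
    primary: Callable[[int], Optional[int]] = nvml_free_bytes,
    fallback: Callable[[int], int] = torch_free_bytes,
) -> int:
    """Prefer the NVML reading; fall back to torch stats when NVML is not functional."""
    free = primary(device_index)
    return free if free is not None else fallback(device_index)
```

Passing the two sources as parameters keeps the fallback logic testable on machines without a GPU; on a CUDA system, `available_memory(0)` would use the defaults.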

@jeffra jeffra requested a review from mrwyattii as a code owner October 12, 2023 21:10
@jeffra jeffra requested review from tjruwase and cmikeh2 October 12, 2023 21:10
@tjruwase
Contributor

@delock, FYI

@delock
Collaborator

delock commented Oct 15, 2023

> @delock, FYI

Thanks for the reminder. I think the CPU part is good. We will add this to the XPU backend as well.

@tjruwase tjruwase added this pull request to the merge queue Oct 16, 2023
Merged via the queue into master with commit 12aedac Oct 17, 2023
15 checks passed
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
* add available memory check to accelerator

* catch case where nvmlInit fails

* add pynvml to reqs

* fix for cpu systems

* Update accelerator/cuda_accelerator.py

Co-authored-by: Michael Wyatt <[email protected]>

* simplify

---------

Co-authored-by: Michael Wyatt <[email protected]>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024