
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() #5464

Merged
harygo2 merged 4 commits into microsoft:master on May 7, 2024

Conversation

harygo2
Contributor

@harygo2 harygo2 commented Apr 25, 2024

Creating a Torch tensor with the parameter `device=get_accelerator().current_device()` can result in a crash when using an NPU.

This issue arises because the `current_device` API across all accelerators is expected to return the device id as an integer, according to the [interface docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103).

However, specifying `device` as an integer when creating tensors directs Torch to use the CUDA backend by default, which leads to a crash on NPUs (and potentially on other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()` instead, which returns the correct device identifier string, such as `"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate context needed for creating tensors on specific hardware accelerators.

I also noticed that `device=get_accelerator().current_device()` is used across several files under `deepspeed/inference` and may also lead to crashes on other accelerators.
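To illustrate the failure mode, here is a minimal, self-contained sketch. It is not DeepSpeed or PyTorch code: `resolve_device` is a toy helper that only mimics how Torch interprets the `device` argument.

```python
# Toy model of PyTorch's device resolution, for illustration only:
# a bare integer is treated as a CUDA ordinal, while a "backend:index"
# string selects that backend explicitly.
def resolve_device(device):
    if isinstance(device, int):
        return ("cuda", device)  # an integer id silently implies CUDA
    backend, _, index = str(device).partition(":")
    return (backend, int(index) if index else 0)

# current_device() returns an int, so the tensor lands on the CUDA
# backend even on an NPU machine -- the source of the crash:
assert resolve_device(0) == ("cuda", 0)

# current_device_name() returns a string such as "npu:0",
# which selects the correct backend:
assert resolve_device("npu:0") == ("npu", 0)
```

Under this model, passing the integer from `current_device()` can never reach the NPU backend, while the name string from `current_device_name()` always carries the backend along with the index.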

@harygo2
Contributor Author

harygo2 commented Apr 25, 2024

@microsoft-github-policy-service agree

@minchao-sun
Contributor

Hi, @tjruwase . Would you please review this PR?

BTW, this issue has happened before. See #3933.

@tjruwase
Contributor

tjruwase commented Apr 26, 2024 via email

tjruwase and others added 2 commits April 26, 2024 12:58
@harygo2
Contributor Author

harygo2 commented Apr 28, 2024

Hi, @tjruwase. I just fixed a formatting issue in a new commit. Would you please approve the workflows to run?

@harygo2
Contributor Author

harygo2 commented Apr 29, 2024

Hi, @tjruwase, the nv-accelerate-v100 and nv-lightening-v100 CI workflows seem to be down.

Could you please take a look? Thx.

@delock
Collaborator

delock commented Apr 30, 2024

I did a search for `current_device()` in the DeepSpeed repo, and it looks like most occurrences of `current_device()` should be `current_device_name()` in order to be compatible with non-CUDA devices. Maybe increasing test coverage on non-CUDA devices would help catch such usage in the future.
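As a rough illustration of such a search, the sketch below runs `grep` over a throwaway directory that stands in for a DeepSpeed checkout; the path and file name are made up for the demo.

```shell
# Build a one-file stand-in for a repo checkout, then list call sites.
mkdir -p /tmp/ds_grep_demo
printf 'x = torch.ones(1, device=get_accelerator().current_device())\n' \
    > /tmp/ds_grep_demo/module.py

# -r: recurse into the directory, -n: print line numbers.
# Parentheses are literal characters in grep's default (basic) regex syntax.
grep -rn "current_device()" /tmp/ds_grep_demo
```

Each reported line is then a candidate to audit by hand, since some call sites (e.g. those passing the integer id to a CUDA-only API) may legitimately want the integer form.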

@harygo2
Contributor Author

harygo2 commented May 3, 2024

Hi, @tjruwase @loadams, all checks have passed. Could you add this PR to the merge queue?

@tjruwase tjruwase added this pull request to the merge queue May 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks May 4, 2024
@harygo2
Contributor Author

harygo2 commented May 6, 2024

Hi, @tjruwase, this PR was removed from the merge queue by the bot, and I'm not sure why. Could you please check it out?

@loadams
Contributor

loadams commented May 6, 2024

> Hi, @tjruwase, this PR was removed from merge queue by the bot, not sure why it was removed, could you please check it out?

Hi @harygo2 - looks like a transient failure on the merge queue system, I'll re-queue.

@loadams loadams enabled auto-merge May 6, 2024 16:20
@loadams loadams added this pull request to the merge queue May 7, 2024
Merged via the queue into microsoft:master with commit 0fc19b6 May 7, 2024
14 checks passed
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (microsoft#5464)
