Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() #5464
Conversation
@microsoft-github-policy-service agree |
Hi, @tjruwase. Would you please review this PR? BTW, this issue happened before. See #3933.
|
fix the formatting issue for commit 2f382d8
Hi, @tjruwase. Just fixed a formatting issue in a new commit. Would you please approve the workflows to run? |
Hi, @tjruwase, the nv-accelerate-v100 and nv-lightening-v100 CI workflows seem to be down. Could you please take a look? Thanks. |
I did a search for |
Hi, @tjruwase, this PR was removed from the merge queue by the bot. I'm not sure why it was removed; could you please check it out? |
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (microsoft#5464)

Creating a Torch tensor with the parameter `device=get_accelerator().current_device()` can result in a crash when using an NPU. This issue arises because the `current_device` API across all accelerators is expected to return a device id as an integer, according to the [interface docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103). However, specifying `device` as an integer when creating tensors directs Torch to use the CUDA backend by default, which leads to a crash on NPUs (and potentially other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()` instead, which returns the correct device identifier string, such as `"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate context needed for creating tensors on specific hardware accelerators.

I also noticed that `device=get_accelerator().current_device()` is used across several files under deepspeed/inference and may also lead to crashes on other accelerators.

---------
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
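The failure mode described above can be sketched without any real DeepSpeed or Torch imports. The sketch below is a minimal, hypothetical model of the accelerator abstraction: `NpuAccelerator` and `resolve_device` are stand-ins invented for illustration, and `resolve_device` only mimics the documented Torch behavior that a bare integer device id is interpreted as a CUDA ordinal.

```python
# Hypothetical sketch (no DeepSpeed/Torch imports): why an integer device
# id is ambiguous while a backend-qualified device name is not.

class NpuAccelerator:
    """Stand-in for DeepSpeed's accelerator abstraction (name assumed)."""

    def current_device(self):
        # Per the interface docs, this returns a bare integer device id,
        # so the backend information (npu vs. cuda vs. xpu) is lost.
        return 0

    def current_device_name(self):
        # Backend-qualified string: unambiguous on any accelerator.
        return "npu:0"


def resolve_device(device):
    """Mimics Torch's device resolution: a bare int defaults to CUDA."""
    if isinstance(device, int):
        return f"cuda:{device}"  # wrong backend on an NPU -> crash
    return device


acc = NpuAccelerator()

# The bug: the integer id silently resolves to the CUDA backend.
assert resolve_device(acc.current_device()) == "cuda:0"

# The fix: the device-name string carries the correct backend.
assert resolve_device(acc.current_device_name()) == "npu:0"
```

The design point is that `current_device_name()` keeps the backend in the value itself, so tensor-creation call sites stay portable across accelerators without backend-specific branching.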