
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() #5464

Merged
harygo2 merged 4 commits into microsoft:master on May 7, 2024

Conversation

harygo2
Contributor

@harygo2 harygo2 commented Apr 25, 2024

Creating a Torch tensor with the parameter `device=get_accelerator().current_device()` can result in a crash when using an NPU.

This issue arises because the `current_device` API across all accelerators is expected to return the device id as an integer, according to the [interface docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103).

However, specifying `device` as an integer when creating tensors directs Torch to use the CUDA backend by default, which leads to a crash on NPUs (and potentially on other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()` instead, which returns the correct device identifier string, such as `"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate context needed for creating tensors on specific hardware accelerators.

I also noticed that `device=get_accelerator().current_device()` is used across several files under `deepspeed/inference` and may also lead to crashes on other accelerators.
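To illustrate the failure mode, here is a minimal, self-contained sketch. It is not DeepSpeed or PyTorch code: `resolve_device` is a toy helper that only mimics how Torch interprets the `device` argument.

```python
# Toy model of PyTorch's device resolution, for illustration only:
# a bare integer is treated as a CUDA ordinal, while a "backend:index"
# string selects that backend explicitly.
def resolve_device(device):
    if isinstance(device, int):
        return ("cuda", device)  # an integer id silently implies CUDA
    backend, _, index = str(device).partition(":")
    return (backend, int(index) if index else 0)

# current_device() returns an int, so the tensor lands on the CUDA
# backend even on an NPU machine -- the source of the crash:
assert resolve_device(0) == ("cuda", 0)

# current_device_name() returns a string such as "npu:0",
# which selects the correct backend:
assert resolve_device("npu:0") == ("npu", 0)
```

Under this model, passing the integer from `current_device()` can never reach the NPU backend, while the name string from `current_device_name()` always carries the backend along with the index.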

@harygo2
Contributor Author

harygo2 commented Apr 25, 2024

@microsoft-github-policy-service agree

@minchao-sun
Contributor

Hi, @tjruwase . Would you please review this PR?

BTW, this issue has happened before. See #3933.

@tjruwase
Contributor

tjruwase commented Apr 26, 2024 via email

tjruwase and others added 2 commits April 26, 2024 12:58
@harygo2
Contributor Author

harygo2 commented Apr 28, 2024

Hi, @tjruwase. I just fixed a formatting issue in a new commit. Would you please approve the workflows to run?

@harygo2
Contributor Author

harygo2 commented Apr 29, 2024

Hi, @tjruwase, the nv-accelerate-v100 and nv-lightening-v100 CI workflows seem to be down.

Could you please take a look? Thx.

@delock
Collaborator

delock commented Apr 30, 2024

I did a search for `current_device()` in the DeepSpeed repo, and it looks like most occurrences of `current_device()` should be `current_device_name()` in order to be compatible with non-CUDA devices. Maybe increasing test coverage on non-CUDA devices would help catch such usage in the future.
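As a rough illustration of such a search, the sketch below runs `grep` over a throwaway directory that stands in for a DeepSpeed checkout; the path and file name are made up for the demo.

```shell
# Build a one-file stand-in for a repo checkout, then list call sites.
mkdir -p /tmp/ds_grep_demo
printf 'x = torch.ones(1, device=get_accelerator().current_device())\n' \
    > /tmp/ds_grep_demo/module.py

# -r: recurse into the directory, -n: print line numbers.
# Parentheses are literal characters in grep's default (basic) regex syntax.
grep -rn "current_device()" /tmp/ds_grep_demo
```

Each reported line is then a candidate to audit by hand, since some call sites (e.g. those passing the integer id to a CUDA-only API) may legitimately want the integer form.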

@harygo2
Contributor Author

harygo2 commented May 3, 2024

Hi, @tjruwase @loadams, all checks have passed. Could you add this PR to the merge queue?

@tjruwase tjruwase added this pull request to the merge queue May 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks May 4, 2024
@harygo2
Contributor Author

harygo2 commented May 6, 2024

Hi, @tjruwase, this PR was removed from the merge queue by the bot, and I'm not sure why. Could you please check it out?

@loadams
Contributor

loadams commented May 6, 2024

> Hi, @tjruwase, this PR was removed from merge queue by the bot, not sure why it was removed, could you please check it out?

Hi @harygo2 - looks like a transient failure on the merge queue system, I'll re-queue.

@loadams loadams enabled auto-merge May 6, 2024 16:20
@loadams loadams added this pull request to the merge queue May 7, 2024
Merged via the queue into microsoft:master with commit 0fc19b6 May 7, 2024
14 checks passed
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (microsoft#5464)
