Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: mark model_parallel tests as cuda specific #35269

Merged
merged 1 commit into from
Jan 7, 2025

Conversation

dvrogozh
Copy link
Contributor

parallelize() API is deprecated in favor of accelerate's device_map="auto" and therefore is not accepting new features. At the same time parallelize() implementation is currently CUDA-specific. This commit marks respective ci tests with @require_torch_gpu.

Fixes: #35252
CC: @ArthurZucker @SunMarc

@@ -3046,6 +3046,7 @@ def test_multi_gpu_data_parallel_forward(self):
with torch.no_grad():
_ = model(**self._prepare_for_class(inputs_dict, model_class))

@require_torch_gpu
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@require_torch_multi_gpu on the next line was previously made non-CUDA specific by 11c27dd. Right now it sets requirement for having any multi-GPUs on the system.

I wonder, maybe it makes sense to rename @require_torch_gpu to @require_torch_cuda to avoid naming collision? I can follow up on that in separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ydshieh on this! 🤗

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dvrogozh I am not sure I follow with was previously made non-CUDA specific by 11c27dd. Is the commit link is the correct one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the commit link is the correct one?

Yes, that's correct commit and link. Unfortunately it's a huge commit with multiple changes inside and one of them was to make @require_torch_multi_gpu non-cuda specific. See a change in src/transformers/testing_utils.py, line 743 in old code, line 788 in new code, Commit changed torch.cuda.device_count() to device_count and the latter is calculated via newly added function (which works for CUDA and XPU at the moment):

def get_device_count():

Currently @require_torch_multi_gpu looks like this and is not CUDA specific (after 11c27dd):
def require_torch_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup (in PyTorch). These tests are skipped on a machine without
multiple GPUs.
To run *only* the multi_gpu tests, assuming all test names contain multi_gpu: $ pytest -sv ./tests -k "multi_gpu"
"""
if not is_torch_available():
return unittest.skip(reason="test requires PyTorch")(test_case)
device_count = get_device_count()
return unittest.skipUnless(device_count > 1, "test requires multiple GPUs")(test_case)

Note that @require_torch_gpu is still CUDA-specific:

def require_torch_gpu(test_case):
"""Decorator marking a test that requires CUDA and PyTorch."""
return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case)

I am not sure I follow with was previously made non-CUDA specific by 11c27dd.

The point I am trying to make is that @require_torch_multi_gpu and @require_torch_gpu are named similarly, but diverged (after 11c27dd) in what they actually signify which leads to confusion. My proposal is to rename @require_torch_gpu to @require_torch_cuda.

Copy link
Collaborator

@ydshieh ydshieh Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a series of work regarding device agnostic, and require_torch_accelerator, require_torch_multi_accelerator etc are added for device agnostic. The ``require_torch_xxx_gpu` should not be modified IMO, but I guess @Titus-von-Koelle wasn't realizing the existing methods #31098 is added at that time.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -3046,6 +3046,7 @@ def test_multi_gpu_data_parallel_forward(self):
with torch.no_grad():
_ = model(**self._prepare_for_class(inputs_dict, model_class))

@require_torch_gpu
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ydshieh on this! 🤗

`parallelize()` API is deprecated in favor of accelerate's `device_map="auto"`
and therefore is not accepting new features. At the same time `parallelize()`
implementation is currently CUDA-specific. This commit marks respective
ci tests with `@require_torch_gpu`.

Fixes: huggingface#35252
Signed-off-by: Dmitry Rogozhkin <[email protected]>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The discussed topic about (potential) naming etc. should be in a follow PR

@ydshieh ydshieh merged commit 9fd123a into huggingface:main Jan 7, 2025
27 checks passed
AlanPonnachan pushed a commit to AlanPonnachan/transformers that referenced this pull request Jan 7, 2025
`parallelize()` API is deprecated in favor of accelerate's `device_map="auto"`
and therefore is not accepting new features. At the same time `parallelize()`
implementation is currently CUDA-specific. This commit marks respective
ci tests with `@require_torch_gpu`.

Fixes: huggingface#35252

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xpu: parallelize() not supported for PyTorch XPU backend
4 participants