ci: mark model_parallel tests as cuda specific #35269

Merged (1 commit, Jan 7, 2025)
2 changes: 2 additions & 0 deletions tests/test_modeling_common.py
```diff
@@ -3046,6 +3046,7 @@ def test_multi_gpu_data_parallel_forward(self):
         with torch.no_grad():
             _ = model(**self._prepare_for_class(inputs_dict, model_class))
 
+    @require_torch_gpu
```
Contributor Author (@dvrogozh):

@require_torch_multi_gpu on the next line was previously made non-CUDA-specific by 11c27dd. Right now it only requires that the system has multiple GPUs of any supported type.

I wonder whether it makes sense to rename @require_torch_gpu to @require_torch_cuda to avoid the naming collision? I can follow up on that in a separate PR.
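With this change, stacking the two decorators means a test runs only when CUDA is the active torch_device and more than one GPU is present. A minimal sketch of the combined effect (the commented conditions summarize the decorator bodies quoted later in this thread):

```python
@require_torch_gpu        # skips unless torch_device == "cuda"
@require_torch_multi_gpu  # skips unless get_device_count() > 1
def test_model_parallelization(self):
    ...
```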

Collaborator:

cc @ydshieh on this! 🤗

Collaborator (@ydshieh):

Hi @dvrogozh, I am not sure I follow "was previously made non-CUDA specific by 11c27dd". Is the commit link the correct one?

Contributor Author (@dvrogozh):

> Is the commit link the correct one?

Yes, that's the correct commit and link. Unfortunately it's a huge commit with multiple changes inside, and one of them was to make @require_torch_multi_gpu non-CUDA-specific. See the change in src/transformers/testing_utils.py (line 743 in the old code, line 788 in the new code): the commit changed torch.cuda.device_count() to device_count, where the latter is computed via the newly added get_device_count() function (which works for CUDA and XPU at the moment).
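For reference, here is a minimal sketch of what such a backend-agnostic counter can look like; this is an illustration only, and the actual get_device_count() in src/transformers/testing_utils.py may differ in detail:

```python
import torch


# Illustrative sketch; the real get_device_count() lives in
# src/transformers/testing_utils.py and may cover more backends.
def get_device_count():
    # Prefer XPU when the Intel GPU backend is present and usable.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.device_count()
    # torch.cuda.device_count() returns 0 when no CUDA device is visible.
    return torch.cuda.device_count()
```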

Currently @require_torch_multi_gpu looks like this and is not CUDA-specific (after 11c27dd):

```python
def require_torch_multi_gpu(test_case):
    """
    Decorator marking a test that requires a multi-GPU setup (in PyTorch). These tests are skipped on a machine
    without multiple GPUs.

    To run *only* the multi_gpu tests, assuming all test names contain multi_gpu: $ pytest -sv ./tests -k "multi_gpu"
    """
    if not is_torch_available():
        return unittest.skip(reason="test requires PyTorch")(test_case)

    device_count = get_device_count()

    return unittest.skipUnless(device_count > 1, "test requires multiple GPUs")(test_case)
```

Note that @require_torch_gpu is still CUDA-specific:

```python
def require_torch_gpu(test_case):
    """Decorator marking a test that requires CUDA and PyTorch."""
    return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case)
```

> I am not sure I follow "was previously made non-CUDA specific by 11c27dd".

The point I am trying to make is that @require_torch_multi_gpu and @require_torch_gpu are named similarly but have diverged (after 11c27dd) in what they actually signify, which leads to confusion. My proposal is to rename @require_torch_gpu to @require_torch_cuda.
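A hypothetical sketch of that follow-up rename, assuming the body stays exactly as in today's require_torch_gpu:

```python
# Hypothetical follow-up (not part of this PR): same CUDA gate, explicit name.
def require_torch_cuda(test_case):
    return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case)
```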

Collaborator (@ydshieh), Jan 7, 2025:

We have a series of work on device-agnostic testing, and require_torch_accelerator, require_torch_multi_accelerator, etc. were added for that purpose. The `require_torch_xxx_gpu` decorators should not be modified IMO, but I guess @Titus-von-Koelle wasn't aware of the existing methods at the time #31098 was added.
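For context, the device-agnostic variants gate on the detected torch_device rather than on CUDA specifically; a simplified sketch (the actual implementation in src/transformers/testing_utils.py may differ in detail):

```python
# Simplified sketch of the device-agnostic decorator mentioned above.
def require_torch_accelerator(test_case):
    """Decorator marking a test that requires any accelerator (CUDA, XPU, MPS, ...)."""
    return unittest.skipUnless(
        torch_device is not None and torch_device != "cpu",
        "test requires an accelerator",
    )(test_case)
```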

```diff
     @require_torch_multi_gpu
     def test_model_parallelization(self):
         if not self.test_model_parallel:
@@ -3108,6 +3109,7 @@ def get_current_gpu_memory_use():
         gc.collect()
         torch.cuda.empty_cache()
 
+    @require_torch_gpu
     @require_torch_multi_gpu
     def test_model_parallel_equal_results(self):
         if not self.test_model_parallel:
```