Add XLA unit tests to pre-submit CI #2545
Comments
The issue is we don't run runners on pre-submit CI, nor do we have any TPUs to run on merge CI, so we have no way of testing TPUs ourselves with accelerate other than running it manually in Colab. (We could maybe look at adding GPU XLA tests post-submit, though.) Note: we also don't run GPU runners on pre-submit; only the main CI and the nightlies have access to those.
Thanks for the response. In that case, can we add GPU XLA tests to post-submit? That will help catch issues earlier.
Certainly. IIUC all that's needed to get this going is to install |
That's correct. Thanks! |
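One way a post-submit GPU XLA job could share a test file with pre-submit runners that have no accelerators is to skip the XLA tests whenever `torch_xla` cannot be imported. The following is only a minimal sketch of that idea using the standard library; the names `xla_available`, `requires_xla`, and `XlaSmokeTest` are hypothetical and are not part of accelerate.

```python
import importlib.util
import unittest


def xla_available() -> bool:
    """Return True when the torch_xla package can be found in this environment."""
    return importlib.util.find_spec("torch_xla") is not None


# Hypothetical decorator: skips a test on runners where torch_xla is not installed.
requires_xla = unittest.skipUnless(xla_available(), "torch_xla is not installed")


class XlaSmokeTest(unittest.TestCase):
    @requires_xla
    def test_add_on_xla_device(self):
        # Imported lazily so the module still loads on runners without torch_xla.
        import torch
        import torch_xla.core.xla_model as xm

        device = xm.xla_device()
        result = torch.ones(2, device=device) + torch.ones(2, device=device)
        self.assertEqual(result.cpu().tolist(), [2.0, 2.0])
```

On a runner without `torch_xla` the test is reported as skipped rather than failed, so the same file can be collected everywhere while only the accelerator job actually exercises it.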
@vanbasten23 do you have a good "hello world" test that can be run on the GPU docker images to check and see if everything works okay? Hitting a few snags just doing
Docker file I'm testing:
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
RUN python3 -m pip install --no-cache-dir \
    git+https://github.com/huggingface/accelerate#egg=accelerate[test_prod,test_integrations] \
    --extra-index-url https://download.pytorch.org/whl/cu117
# Activate the virtualenv
CMD ["/bin/bash"]
Yes. You can use this:
|
BTW just noticing this, we should eventually change the logic so Running
2024-03-13 17:12:13.441463: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441551: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
2024-03-13 17:12:13.441641: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441712: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
if max_size.item() < 1:
RuntimeError: Bad StatusOr access: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Any clue what's going on there? It should certainly not be running out of memory with 2x24GB GPUs and I set |
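The error text itself suggests re-running with `NCCL_DEBUG=INFO`. A minimal sketch of doing that from Python is shown below. It assumes the GPU run is selected with `PJRT_DEVICE=CUDA` and `GPU_NUM_DEVICES`, which is how PyTorch/XLA documents GPU runs for this release line but is not confirmed anywhere in this thread; the helper names `build_debug_env` and `rerun_accelerate_test` are hypothetical.

```python
import os
import subprocess


def build_debug_env(num_gpus, base_env=None):
    """Copy an environment and add the variables for a verbose NCCL re-run."""
    env = dict(os.environ if base_env is None else base_env)
    env["PJRT_DEVICE"] = "CUDA"             # assumption: CUDA PJRT backend
    env["GPU_NUM_DEVICES"] = str(num_gpus)  # assumption: one process per GPU
    env["NCCL_DEBUG"] = "INFO"              # requested by the error message
    return env


def rerun_accelerate_test(num_gpus=2):
    """Run `accelerate test` with NCCL debug logging and return its exit code."""
    completed = subprocess.run(["accelerate", "test"], env=build_debug_env(num_gpus))
    return completed.returncode
```

Calling `rerun_accelerate_test(2)` inside the container should print the NCCL initialisation log, which may show which allocation fails during `ncclCommInitRank`.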
If we can get to a point where I can run them locally via Docker and things make sense on a CUDA runtime, then we can integrate it into a CI. |
Completely agreed.
Do you know your CUDA runtime version (nvcc --version)? I'm using CUDA 12.1 and I got a different error: it seems the XLA devices were accessed before calling the |
I rebased my codebase to get the latest code on the main branch, and here is the new error that I got. It fails at
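To compare CUDA versions between machines without reading the output by eye, the release number can be pulled out of `nvcc --version` programmatically. This is a small sketch only; it assumes the output contains a line of the form `release 12.1, V12.1.105`, and the function names are hypothetical.

```python
import re
import subprocess

# Matches the "release <major>.<minor>" fragment that nvcc prints.
_RELEASE_PATTERN = re.compile(r"release\s+(\d+)\.(\d+)")


def parse_nvcc_release(output):
    """Return (major, minor) parsed from `nvcc --version` text, or None if absent."""
    match = _RELEASE_PATTERN.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def local_cuda_release():
    """Run nvcc and return its release tuple, or None when nvcc is unavailable."""
    try:
        completed = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(completed.stdout)
```

The result can then be checked against the `cuda_12.1` suffix of the Docker image tag used above.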
|
Yes, that's the random sampler part I mentioned could be bad in this PR 😉 #2542 (comment) |
I reverted the change locally in https://github.com/huggingface/accelerate/pull/2542/files#diff-d9858283a2ced902233727f6fddde0a00831ad9a66a069e57231a5057d550bf6 and I still got the same error. |
Hmm okay, I'll try giving it a look tomorrow. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
System Info
Hi team,
Can we add some unit tests for XLA (GPU, TPU v4, TPU v2 or v3) to the pre-submit CI? The test can be as simple as accelerate test. The reason for the request is that we have recently observed a few changes in accelerate that broke accelerate test for TPU (such as #2319 and #2176). It takes longer for the PyTorch/XLA team to fix them because the team is not familiar with the changes. It would be great if the PR author could fix the issue before the PR is merged, as they have the most context, so that users won't see the regression. Thanks!
cc @will-cromar, @JackCaoG, @muellerzr
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
config:
export PJRT_DEVICE=TPU
accelerate test
Expected behavior
na