Training does not start on Vertex AI when using more than one A100 GPU with NCCL; it fails with an unhandled system error. So far the problem occurs only on A100 GPUs, probably because of GPU partitioning in the K8s cluster. Building against the most recent NCCL master (https://github.com/NVIDIA/nccl) seems to fix the issue. Could we update the Marian NCCL fork (https://github.com/marian-nmt/nccl), or might that break something else? @snukky, what are your thoughts?
If the NCCL debug log is enabled, the following warning shows up on A100 GPUs, while it does not show up when using, for example, V100 GPUs:
graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
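To check whether a given run hits the same warning, a captured log can be scanned for it. The following is only a rough sketch: it assumes the log was collected with NCCL debug output enabled (for example via the `NCCL_DEBUG` environment variable) and that warning lines follow the `<file>:<line> NCCL WARN <message>` shape seen above. The helper name `find_pci_path_warnings` is hypothetical and not part of Marian or NCCL.

```python
import re

# Assumed shape of an NCCL warning line: "<file>:<line> NCCL WARN <message>".
# Any prefix before the file name (host, PID, rank) is skipped by search().
NCCL_WARN = re.compile(r"(?P<file>\S+):(?P<line>\d+) NCCL WARN (?P<msg>.*)")


def find_pci_path_warnings(log_text):
    """Return (file, line, message) tuples for 'Could not find real path' warnings.

    Hypothetical helper for illustration; not part of Marian or NCCL.
    """
    hits = []
    for raw in log_text.splitlines():
        match = NCCL_WARN.search(raw)
        if match and match.group("msg").startswith("Could not find real path"):
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("msg"))
            )
    return hits


# The warning line quoted in this issue, used as sample input.
sample = (
    "graph/xml.cc:332 NCCL WARN Could not find real path of "
    "/sys/class/pci_bus/fffffff/../../fffffff:ff:f"
)

for source_file, line_no, message in find_pci_path_warnings(sample):
    print(f"{source_file}:{line_no}: {message}")
```

An empty result on a V100 run and a non-empty one on an A100 run would match the behaviour described here, though the exact log format may differ between NCCL versions.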
I don't see any problems with updating NCCL if it helps. It seems we use vanilla NCCL (NVIDIA/nccl@master...marian-nmt:nccl:master) so updating shouldn't be problematic. Would you like to open a PR?
Sure, here it is, @snukky: marian-nmt/nccl#1 (after that, we need to update the submodule in marian-dev). Alternatively, I could open a PR that points the submodule in marian-dev at the main NVIDIA repo instead of the fork.