🐞 Describe the Bug
A couple of our training runs (specifically annealing runs) show unexplained loss spikes during training.
Some of the affected job IDs:
2af74bc6-1452-4ced-9d6d-c8c1c5353438
1b035c3c-b4f8-4d99-85f1-309aff642dc5
d45b12cb-b799-405f-a48a-76f8155d3f03
🔄 Steps to Reproduce
Steps to reproduce the behavior:
Get the relevant Fast-LLM version: 5e6de1aed9c4d494f28c7b671efd5161c3a603fb
No errors in the logs explicitly point to this behaviour. However, we identified one node, dgx-55, that is common among all failed training jobs and absent from the successful ones. We are now excluding that node and training on the others to test whether it is the cause.
The same job as earlier has been relaunched as 89be0654-35ef-4968-bd0b-41b8485f3bf1 without node dgx-55.
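For anyone who wants to repeat the node comparison on their own runs, here is a minimal sketch of the idea: take the nodes shared by every failed job and subtract any node that also appears in a successful job. The function name `suspect_nodes`, the job labels, and the node names in the example are all hypothetical placeholders, not our actual allocations, and this is not part of Fast-LLM.

```python
def suspect_nodes(failed_jobs, successful_jobs):
    """Return nodes present in every failed job and in no successful job.

    Both arguments map a job label to an iterable of node names.
    """
    if not failed_jobs:
        return set()
    # Nodes shared by all failed jobs.
    common_to_failed = set.intersection(*(set(nodes) for nodes in failed_jobs.values()))
    # Every node that took part in at least one successful job.
    seen_in_successful = set().union(*(set(nodes) for nodes in successful_jobs.values()))
    return common_to_failed - seen_in_successful


# Hypothetical allocations, for illustration only.
failed = {
    "failed-1": ["node-a", "node-b", "node-x"],
    "failed-2": ["node-c", "node-x", "node-a"],
}
successful = {
    "ok-1": ["node-a", "node-b", "node-c"],
}

print(suspect_nodes(failed, successful))
```

A node surviving this filter is only a candidate: it is consistent with a hardware problem, but confirming it still requires rerunning the same job without that node, as we are doing here.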