[bug] Increasing training loss likely due to desynchronization #77

Open
tscholak opened this issue Dec 2, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@tscholak
Collaborator

tscholak commented Dec 2, 2024

🐞 Describe the Bug

A couple of our training runs (specifically annealing runs) have unexplained loss spikes during training.

[Screenshot, 2024-12-02: training loss curves showing the spikes]

Some of the affected job IDs:

  • 2af74bc6-1452-4ced-9d6d-c8c1c5353438
  • 1b035c3c-b4f8-4d99-85f1-309aff642dc5
  • d45b12cb-b799-405f-a48a-76f8155d3f03

🔄 Steps to Reproduce

Steps to reproduce the behavior:

  1. Check out the relevant Fast-LLM commit: 5e6de1aed9c4d494f28c7b671efd5161c3a603fb
  2. The logs contain no errors that explicitly point to this behaviour, but we were able to identify one node, dgx-55, that is common to all failed training jobs (and absent from the successful ones). We are currently isolating this node and training on the others to test the hypothesis. A sanity check for cross-rank parameter desynchronization is sketched after this list.
  3. Relaunched the same job as before (89be0654-35ef-4968-bd0b-41b8485f3bf1) without node dgx-55.
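
If desynchronization is the cause, the parameter replicas on different data-parallel ranks should drift apart over time. Below is a minimal sketch (not part of Fast-LLM) of how one could check for this, assuming a PyTorch setup with `torch.distributed` already initialized and each rank holding a replica of the model; the function name `check_parameter_sync` is hypothetical.

```python
import torch
import torch.distributed as dist


def check_parameter_sync(model: torch.nn.Module, atol: float = 0.0) -> bool:
    """Return True if this rank's parameters match rank 0's within `atol`."""
    in_sync = True
    for name, param in model.named_parameters():
        # Broadcast rank 0's copy of this parameter to all ranks and compare.
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        max_diff = (param.detach() - reference).abs().max().item()
        if max_diff > atol:
            in_sync = False
            print(f"rank {dist.get_rank()}: {name} differs from rank 0 by {max_diff}")
    return in_sync
```

Running a check like this periodically (e.g. every few hundred steps) on the suspect jobs would confirm or rule out desynchronization on dgx-55 independently of the loss curves.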