[bug] Increasing training loss likely due to desynchronization #77

Open
tscholak opened this issue Dec 2, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@tscholak
Collaborator

tscholak commented Dec 2, 2024

🐞 Describe the Bug

A couple of our training runs (specifically annealing runs) have unexplained loss spikes during training.

[Screenshot, 2024-12-02: training loss curves showing the spikes]

Some of the affected job IDs:

  • 2af74bc6-1452-4ced-9d6d-c8c1c5353438
  • 1b035c3c-b4f8-4d99-85f1-309aff642dc5
  • d45b12cb-b799-405f-a48a-76f8155d3f03

🔄 Steps to Reproduce

Steps to reproduce the behavior:

  1. Check out the relevant Fast-LLM commit: 5e6de1aed9c4d494f28c7b671efd5161c3a603fb
  2. The logs contain no errors that explicitly point to this behaviour, but we were able to identify one node, dgx-55, that is common to all failed training jobs (and absent from the successful ones). We are currently isolating this node and training on the others to test the hypothesis. A sanity check for cross-rank parameter desynchronization is sketched after this list.
  3. Relaunched the same job as before (89be0654-35ef-4968-bd0b-41b8485f3bf1) without node dgx-55.
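
If desynchronization is the cause, the parameter replicas on different data-parallel ranks should drift apart over time. Below is a minimal sketch (not part of Fast-LLM) of how one could check for this, assuming a PyTorch setup with `torch.distributed` already initialized and each rank holding a replica of the model; the function name `check_parameter_sync` is hypothetical.

```python
import torch
import torch.distributed as dist


def check_parameter_sync(model: torch.nn.Module, atol: float = 0.0) -> bool:
    """Return True if this rank's parameters match rank 0's within `atol`."""
    in_sync = True
    for name, param in model.named_parameters():
        # Broadcast rank 0's copy of this parameter to all ranks and compare.
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        max_diff = (param.detach() - reference).abs().max().item()
        if max_diff > atol:
            in_sync = False
            print(f"rank {dist.get_rank()}: {name} differs from rank 0 by {max_diff}")
    return in_sync
```

Running a check like this periodically (e.g. every few hundred steps) on the suspect jobs would confirm or rule out desynchronization on dgx-55 independently of the loss curves.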