Detect loss spikes and high losses during training #1473
Merged
joyce-chen-uni force-pushed the main branch 2 times, most recently from fff5576 to e56b797 (August 25, 2024 23:47)
joyce-chen-uni force-pushed the main branch 2 times, most recently from a25a19c to 44787df (August 26, 2024 06:39)
dakinggg reviewed Aug 26, 2024
dakinggg reviewed Aug 27, 2024
looks mostly good to me! a couple small comments
dakinggg approved these changes Aug 28, 2024
joyce-chen-uni changed the title from "Track loss spikes and high losses during training" to "Detect loss spikes and high losses during training" Aug 28, 2024
Adds a KillLossSpike callback that, at the end of every batch (training step), 1) compares the train loss against the running average loss and 2) checks the magnitude of the running losses. These checks are intended to catch 1) rapid, steep spikes and 2) divergent runs, respectively.
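To make the two checks concrete, here is a minimal sketch in plain Python. It is not the KillLossSpike implementation: the class name `LossSpikeDetector`, its parameters (`window_size`, `spike_factor`, `loss_cap`), and the default thresholds are all assumptions chosen for illustration, and the actual callback may define spikes and divergence differently.

```python
from collections import deque
from statistics import fmean
from typing import Deque, Optional


class LossSpikeDetector:
    """Hypothetical helper illustrating the two checks; not the real callback."""

    def __init__(
        self,
        window_size: int = 100,     # assumed: number of recent losses to average
        spike_factor: float = 2.0,  # assumed: multiple of the running average that counts as a spike
        loss_cap: float = 10.0,     # assumed: running average above this counts as divergence
    ) -> None:
        self.window_size = window_size
        self.spike_factor = spike_factor
        self.loss_cap = loss_cap
        self.window: Deque[float] = deque(maxlen=window_size)

    def update(self, loss: float) -> Optional[str]:
        """Call once per batch; returns "spike", "divergence", or None."""
        verdict = None
        # Only judge once a full window of history is available.
        if len(self.window) == self.window_size:
            running_avg = fmean(self.window)
            if loss > self.spike_factor * running_avg:
                # Check 1: the new loss jumped well above the running average.
                verdict = "spike"
            elif running_avg > self.loss_cap:
                # Check 2: recent losses are high in absolute terms.
                verdict = "divergence"
        self.window.append(loss)
        return verdict
```

In this sketch the window has to fill before either check fires, which avoids flagging the noisy first steps of a run. Whether the real callback waits for a full window is not stated in this description.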
If we identify a spike or divergence, we log a message to the run metadata. This message is added to the TRAIN_UPDATED run event in MAPI, so it is displayed to the user in the run events. The message recommends that the user stop the run and resubmit it with a lower learning rate.
This change will make it easier to query spiky runs. We also log the loss window leading up to the identified spike for data analysis purposes.
Once we have collected enough data and feel confident about our cancellation threshold, we will switch to the hard user error LossSpikeError, which cancels the run without retry and prompts the user to resubmit with a lower learning rate.
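The planned move from log-only mode to a hard error could be sketched as follows. Everything here is hypothetical: `SketchLossSpikeError` is a stand-in exception written for this example (not the real LossSpikeError, whose constructor is not shown in this description), and `handle_detection` and its `hard_fail` flag are invented names.

```python
import logging
from typing import Sequence

log = logging.getLogger(__name__)

# Assumed wording; the real message text is not given in the PR description.
RECOMMENDATION = (
    "Training loss spiked or diverged. "
    "Please stop the run and resubmit with a lower learning rate."
)


class SketchLossSpikeError(Exception):
    """Hypothetical stand-in for a non-retryable user error."""

    def __init__(self, loss_window: Sequence[float]) -> None:
        self.loss_window = list(loss_window)
        super().__init__(RECOMMENDATION)


def handle_detection(loss_window: Sequence[float], hard_fail: bool = False) -> str:
    """Log a recommendation (current behavior) or raise (planned behavior)."""
    if hard_fail:
        # Planned: fail the run outright once the threshold is trusted.
        raise SketchLossSpikeError(loss_window)
    # Current: only record the message and the loss window leading up to the event.
    log.warning("%s Loss window: %s", RECOMMENDATION, list(loss_window))
    return RECOMMENDATION
```

Keeping both paths behind one flag means the detection logic does not need to change when the cancellation threshold is finalized; only the response to a detection does.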