Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554
Conversation
fairseq/training.py
Outdated
```python
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), os.path.basename(start_checkpoint)),
)
```
is this needed?
This is handy to have in order to make the actual `start_checkpoint` persist inside the `output/checkpoints/` folder, since `checkpoint_last.pt` will be overwritten in each epoch.
I wonder if this can potentially break things depending on `start_checkpoint`. If its basename has a special meaning to fairseq (similar to `checkpoint_last.pt`), this could maybe result in unexpected behavior, right?
Good point! Maybe we could fix this by checking for all special filenames used by fairseq (which are, if I remember correctly, `checkpoint_last.pt`, `checkpoint_best.pt` and `checkpoint_crashed.pt`). Or would you suggest just leaving that out completely?
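For illustration, such a guard could look roughly like this (a sketch only, reusing `start_checkpoint` and `self.out_checkpoint_dir` from the snippet above; whether `checkpoint_crashed.pt` is actually reserved would still need to be verified against fairseq):

```python
import os

# Checkpoint names that fairseq may treat specially inside the checkpoint
# directory (as listed above; this set is an assumption, not verified).
RESERVED_NAMES = {"checkpoint_last.pt", "checkpoint_best.pt", "checkpoint_crashed.pt"}

basename = os.path.basename(start_checkpoint)
assert basename not in RESERVED_NAMES, (
    f"basename of start_checkpoint ({basename}) collides with a fairseq-reserved name"
)
os.symlink(start_checkpoint, os.path.join(self.out_checkpoint_dir.get_path(), basename))
```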
We could maybe just assign a fixed name here, e.g. `checkpoint_initial.pt` (if that has no special role so far). Since it's a symlink, it's still very easy to see the original filename if that's of interest.
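That variant could look roughly like this (sketch only; `checkpoint_initial.pt` is the proposed name, not something fairseq defines):

```python
import os

# Symlink the start checkpoint under a fixed, non-reserved name so it stays
# visible in output/checkpoints/ without clashing with names fairseq manages
# itself (checkpoint_last.pt, checkpoint_best.pt, ...).
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_initial.pt"),
)
```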
Co-authored-by: michelwi <[email protected]>
Looks good to me in general, just two nitpicks.
Btw, please apply `black` to your code; the test failed. The AppTek test probably only fails because the PR is created from a fork that has no permission to trigger the test.
Correct, we don't use this job, so we do not care.
I recently found out that there is a fairseq parameter
Currently, when `fairseq_hydra_config["checkpoint"]["restore_file"]` is given as a parameter for the pretraining, resuming jobs becomes much more difficult, since training will always restart from the given file, even though the model might already have been trained for several more epochs. Since changing the `restore_file` parameter to the new checkpoint would again change the job parameters, the job would also not be able to continue training.

I therefore propose to handle the `restore_file` parameter in the job individually instead of leaving it to fairseq. The idea is to save the given `restore_file` as a job attribute and later move it to `output/checkpoints/checkpoint_last.pt` if the latter does not exist yet, which is also the default value of `restore_file` inside fairseq. That way, when training has already run for several epochs, the job can still continue from `checkpoint_last.pt`, and it uses the given checkpoint if training has not run before.
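A rough sketch of the proposed handling (attribute names like `self.start_checkpoint`, and copying instead of moving the file, are assumptions for illustration only, not the actual implementation):

```python
import os
import shutil

# Sketch: seed checkpoint_last.pt from the given restore_file only if the job
# has not written its own checkpoint_last.pt yet; otherwise fairseq's default
# resume behavior (continuing from checkpoint_last.pt) takes over.
checkpoint_last = os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_last.pt")
if self.start_checkpoint is not None and not os.path.exists(checkpoint_last):
    # self.start_checkpoint is assumed to be a plain path string here
    shutil.copy(self.start_checkpoint, checkpoint_last)
```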