Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554

Closed
AndreasPlt wants to merge 8 commits

Conversation

AndreasPlt (Contributor)

Currently, when fairseq_hydra_config["checkpoint"]["restore_file"] is given as a parameter for the pretraining, resuming jobs becomes much more difficult, since training will always restart from the given file, even though the model might already have been trained for several more epochs. Changing the restore_file parameter to the new checkpoint would in turn change the job parameters, so the job would still not be able to continue training.

I therefore propose to handle the "restore_file" parameter in the job itself instead of leaving it to fairseq. The idea is to store the given "restore_file" as a job attribute and later move it to "output/checkpoints/checkpoint_last.pt" if the latter does not exist yet; checkpoint_last.pt is also the default restore location inside fairseq. That way, when training has already run for several epochs, the job can still continue from checkpoint_last.pt, and it uses the given checkpoint only if training has not run before.
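
A minimal sketch of the proposed behavior as a method of the job (assuming the given restore_file is stored as self.start_checkpoint and self.out_checkpoint_dir points to output/checkpoints/; the actual implementation in this PR may differ in details):

import os
import shutil

def _maybe_seed_checkpoint_last(self):
    # Seed checkpoint_last.pt from the given start checkpoint on the first run only.
    checkpoint_last = os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_last.pt")
    if self.start_checkpoint is not None and not os.path.exists(checkpoint_last):
        # First run: place the given checkpoint where fairseq's default
        # restore_file ("checkpoint_last.pt") is looked up.
        shutil.copy(self.start_checkpoint, checkpoint_last)
    # On later runs checkpoint_last.pt already exists, so fairseq resumes from it
    # and the originally given start checkpoint is no longer used.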

fairseq/training.py (resolved review thread, outdated)
Comment on lines 261 to 264
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), os.path.basename(start_checkpoint)),
)
Contributor:

is this needed?

AndreasPlt (Contributor, Author):

This is handy to have in order to make the actual start_checkpoint persist inside the output/checkpoints/ folder, since checkpoint_last.pt will be overwritten in each epoch.

Contributor:

I wonder if this can potentially break things depending on start_checkpoint. If its basename has a special meaning to fairseq (similar to checkpoint_last.pt), this could maybe result in unexpected behavior, right?

AndreasPlt (Contributor, Author):

Good point! Maybe we could fix this by checking against all of fairseq's special filenames (which are, if I remember correctly, checkpoint_last.pt, checkpoint_best.pt and checkpoint_crashed.pt). Or would you suggest just leaving that out completely?

Contributor:

We could maybe just assign a fixed name here, e.g. checkpoint_initial.pt (if that has no special role so far). Since it's a symlink, it's still very easy to see the original filename if that's of interest.
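
A minimal sketch of that suggestion, reusing the variables from the snippet above (the name checkpoint_initial.pt is just the proposed placeholder):

# Link the start checkpoint under a fixed, non-special name so that fairseq
# never interprets it as one of its own checkpoint files.
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_initial.pt"),
)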

fairseq/training.py (resolved review thread)
@vieting (Contributor) left a review comment:

Looks good to me in general, just two nitpicks.

fairseq/training.py (resolved review thread)

@vieting (Contributor) commented Nov 20, 2024:

Btw, please apply black to your code; the test failed. The AppTek test probably only fails because the PR was created from a fork that has no permission to trigger the test.

@michelwi (Contributor):

> Btw, please apply black to your code; the test failed. The AppTek test probably only fails because the PR was created from a fork that has no permission to trigger the test.

Correct, we don't use this job, so we do not care.

@AndreasPlt (Contributor, Author):

I recently found out that there is a fairseq parameter checkpoint.continue_once that essentially does the same as my proposed modification to the job, but inside fairseq itself (see also here). I therefore decided to close this PR with a reference to this fairseq parameter, in case this functionality is required by someone else in the future.
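
For reference, a minimal sketch of how that parameter could be set in the config passed to the job (the checkpoint path is a placeholder and the surrounding config is abbreviated):

fairseq_hydra_config = {
    "checkpoint": {
        # Start from this checkpoint only as long as no checkpoint_last.pt
        # exists yet; on later runs fairseq resumes from checkpoint_last.pt.
        "continue_once": "/path/to/start_checkpoint.pt",  # placeholder path
    },
    # ... remaining task/model/optimization settings
}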

AndreasPlt closed this on Nov 29, 2024.