Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554

Closed
AndreasPlt wants to merge 8 commits

Conversation

AndreasPlt (Contributor)

Currently, when fairseq_hydra_config["checkpoint"]["restore_file"] is given as a parameter for the pretraining, resuming jobs becomes much more difficult, since training will always restart from the given file, even though the model might already have been trained for several more epochs. Changing the restore_file parameter to the new checkpoint would in turn change the job parameters, so the job would still not be able to continue training.

I therefore propose to handle the "restore_file" parameter in the job itself instead of leaving it to fairseq. The idea is to store the given "restore_file" as a job attribute and later move it to "output/checkpoints/checkpoint_last.pt" if the latter does not exist yet; checkpoint_last.pt is also the default restore location inside fairseq. That way, when training has already run for several epochs, the job can still continue from checkpoint_last.pt, and it uses the given checkpoint only if training has not run before.
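
A minimal sketch of the proposed behavior as a method of the job (assuming the given restore_file is stored as self.start_checkpoint and self.out_checkpoint_dir points to output/checkpoints/; the actual implementation in this PR may differ in details):

import os
import shutil

def _maybe_seed_checkpoint_last(self):
    # Seed checkpoint_last.pt from the given start checkpoint on the first run only.
    checkpoint_last = os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_last.pt")
    if self.start_checkpoint is not None and not os.path.exists(checkpoint_last):
        # First run: place the given checkpoint where fairseq's default
        # restore_file ("checkpoint_last.pt") is looked up.
        shutil.copy(self.start_checkpoint, checkpoint_last)
    # On later runs checkpoint_last.pt already exists, so fairseq resumes from it
    # and the originally given start checkpoint is no longer used.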

fairseq/training.py (resolved review thread, outdated)
Comment on lines 261 to 264
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), os.path.basename(start_checkpoint)),
)
Contributor:

is this needed?

AndreasPlt (Contributor, Author):

This is handy to have in order to make the actual start_checkpoint persist inside the output/checkpoints/ folder, since checkpoint_last.pt will be overwritten in each epoch.

Contributor:

I wonder if this can potentially break things depending on start_checkpoint. If its basename has a special meaning to fairseq (similar to checkpoint_last.pt), this could maybe result in unexpected behavior, right?

AndreasPlt (Contributor, Author):

Good point! Maybe we could fix this by checking against all of fairseq's special filenames (which are, if I remember correctly, checkpoint_last.pt, checkpoint_best.pt and checkpoint_crashed.pt). Or would you suggest just leaving that out completely?

Contributor:

We could maybe just assign a fixed name here, e.g. checkpoint_initial.pt (if that has no special role so far). Since it's a symlink, it's still very easy to see the original filename if that's of interest.
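
A minimal sketch of that suggestion, reusing the variables from the snippet above (the name checkpoint_initial.pt is just the proposed placeholder):

# Link the start checkpoint under a fixed, non-special name so that fairseq
# never interprets it as one of its own checkpoint files.
os.symlink(
    start_checkpoint,
    os.path.join(self.out_checkpoint_dir.get_path(), "checkpoint_initial.pt"),
)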

fairseq/training.py (resolved review thread)
@vieting (Contributor) left a review comment:

Looks good to me in general, just two nitpicks.

fairseq/training.py (resolved review thread)

@vieting (Contributor) commented Nov 20, 2024:

Btw, please apply black to your code; the test failed. The AppTek test probably only fails because the PR was created from a fork that has no permission to trigger the test.

@michelwi (Contributor):

> Btw, please apply black to your code; the test failed. The AppTek test probably only fails because the PR was created from a fork that has no permission to trigger the test.

Correct, we don't use this job, so we do not care.

@AndreasPlt (Contributor, Author):

I recently found out that there is a fairseq parameter checkpoint.continue_once that essentially does the same as my proposed modification to the job, but inside fairseq itself (see also here). I therefore decided to close this PR with a reference to this fairseq parameter, in case this functionality is required by someone else in the future.
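
For reference, a minimal sketch of how that parameter could be set in the config passed to the job (the checkpoint path is a placeholder and the surrounding config is abbreviated):

fairseq_hydra_config = {
    "checkpoint": {
        # Start from this checkpoint only as long as no checkpoint_last.pt
        # exists yet; on later runs fairseq resumes from checkpoint_last.pt.
        "continue_once": "/path/to/start_checkpoint.pt",  # placeholder path
    },
    # ... remaining task/model/optimization settings
}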

AndreasPlt closed this on Nov 29, 2024.