Write state file during training #1

Open
jelmervdl opened this issue Jan 10, 2023 · 3 comments

@jelmervdl
Contributor

Right now the trainer only attempts to write the state file when it receives a shutdown signal. Sometimes a machine just dies, and no such signal is received. It would be nice to have at least a near-final state file on disk when that happens.

I'm not 100% sure what the best way is to decide when to write the file. I'd say after feeding M lines and at least N minutes since the last write (in case Marian reads ahead a bit more than M lines). But what should the values for M and N be?
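
To make the "M lines and at least N minutes" idea concrete, here is a minimal sketch, not the trainer's actual code. The class name `PeriodicStateWriter`, its parameters, and the JSON state format are all assumptions made for illustration. The state is written to a temporary file and then renamed, so a machine dying mid-write should not leave a truncated state file behind.

```python
import json
import os
import tempfile
import time


class PeriodicStateWriter:
    """Hypothetical helper: write state once at least `min_lines` lines have
    been fed AND at least `min_seconds` have passed since the last write."""

    def __init__(self, path, min_lines, min_seconds, clock=time.monotonic):
        self.path = path
        self.min_lines = min_lines      # "M" in the comment above
        self.min_seconds = min_seconds  # "N", expressed in seconds
        self.clock = clock
        self.lines_since_write = 0
        self.last_write = clock()

    def fed(self, num_lines, state):
        """Record that `num_lines` were fed; return True if state was written."""
        self.lines_since_write += num_lines
        if self.lines_since_write < self.min_lines:
            return False
        if self.clock() - self.last_write < self.min_seconds:
            return False
        self._write(state)
        self.lines_since_write = 0
        self.last_write = self.clock()
        return True

    def _write(self, state):
        # Assumption: state is JSON-serialisable. Write to a temp file in the
        # same directory, then atomically replace the old state file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as fh:
                json.dump(state, fh)
                fh.flush()
                os.fsync(fh.fileno())
            os.replace(tmp_path, self.path)
        except BaseException:
            os.unlink(tmp_path)
            raise
```

The trainer would call `fed(len(batch), current_state)` after handing each batch to Marian. This does not answer which values of M and N are sensible; it only shows that both thresholds can be checked cheaply on every batch.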

@ZJaume
Contributor

ZJaume commented Jan 13, 2023

The trainer does not have access to the Marian config, but maybe every time Marian runs a validation?

@XapaJIaMnu XapaJIaMnu added this to the . milestone Jan 13, 2023
@jelmervdl
Contributor Author

That would make sense. At some point we'll be tracking Marian's output anyway, so we will know when it has run a validation. The trainer will be slightly ahead by that point (because Marian buffers its input), but ideally not by much.
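
A rough sketch of what validation-triggered writes could look like once Marian's output is being tracked. It assumes Marian's validation log lines contain the marker `[valid]`, which should be checked against real logs. The function `watch_for_validation` and its `on_validation` callback are hypothetical names, not part of the trainer.

```python
import re

# Assumption: Marian marks validation log lines with "[valid]".
VALIDATION_PATTERN = re.compile(r"\[valid\]")


def watch_for_validation(log_lines, on_validation, pattern=VALIDATION_PATTERN):
    """Call `on_validation(line)` for every log line that looks like a
    validation result, and return how many such lines were seen."""
    count = 0
    for line in log_lines:
        if pattern.search(line):
            on_validation(line)
            count += 1
    return count
```

Here `log_lines` could be any iterable of strings, for example the stderr stream of the Marian subprocess, and `on_validation` would trigger the state write. If several validation metrics are configured, Marian may print more than one matching line per validation run, so the callback probably needs to de-duplicate.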

@XapaJIaMnu
Contributor

State resumption currently seems to be broken. @jelmervdl, have you observed this?

(.env) $ ./trainer.py -c train.zhen.old.yml 
Traceback (most recent call last):
  File "empty-trainer/./trainer.py", line 658, in <module>
    for batch in state_tracker.run(trainer):
  File "empty-trainer/./trainer.py", line 590, in run
    self._restore(trainer)
  File "empty-trainer/./trainer.py", line 581, in _restore
    return trainer.restore(self.loader.load(fh))
  File "empty-trainer/./trainer.py", line 491, in restore
    self.stage = self.curriculum.stages[state.stage]
KeyError: ''

This happens when training has finished and you restart it with the same parameters.
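
One plausible reading of the traceback, and it is only an assumption, is that the state file saved after the last stage records an empty stage name, which `restore` then looks up in `curriculum.stages`. A minimal sketch of a guard follows; `resolve_stage` is a hypothetical helper, and `stages` is assumed to be a dict keyed by stage name.

```python
def resolve_stage(stages, stage_name):
    """Return the stage for `stage_name`, or None if the saved state has no
    current stage (assumed here to mean that training already finished)."""
    if not stage_name:
        return None
    try:
        return stages[stage_name]
    except KeyError:
        # A non-empty but unknown name more likely means the config changed.
        raise ValueError(
            f"state file refers to unknown stage {stage_name!r}; "
            f"known stages: {sorted(stages)}"
        ) from None
```

With something like this, `restore` could treat a `None` result as "nothing left to train" and exit cleanly instead of raising `KeyError: ''`.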
