
Step 2 does not write config.json #31

Open

JanFreise opened this issue Oct 2, 2022 · 2 comments

JanFreise commented Oct 2, 2022

I enabled more variations for the grid search, and the last step of training now crashes (I started it twice in a row to confirm).
Before, it worked just fine when I set n_max_config=1, handing over a single model to step 2 of the training.

Error message: DistutilsFileError: could not create './logs/my_model/best_model/config.json': No such file or directory.

searcher = GridSearcher(
    checkpoint_dir=current_checkpoint,
    local_dataset=local_dataset,
    model="PretrainedLM",
    epoch=10,
    epoch_partial=3,
    n_max_config=3,
    batch_size=64,  # are the texts auto-chunked or is the rest of the sequence just being discarded?
    gradient_accumulation_steps=[4, 8],
    crf=[True, False],
    lr=[1e-4, 1e-5],
    weight_decay=[1e-7],
    random_seed=[42],
    lr_warmup_step_ratio=[0.1],
    max_grad_norm=[10]
    # gradient_accumulation_steps=[4],
    # crf=[True],
    # lr=[1e-4, 1e-5],
    # weight_decay=[None],
    # random_seed=[42],
    # lr_warmup_step_ratio=[0.1],
    # max_grad_norm=[None],
    # use_auth_token=True
)
searcher.train()
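
A possible workaround I'm assuming (not verified, and the path is just copied from the error message) would be to create the missing target directory before calling searcher.train():

    import os

    # Untested workaround idea: pre-create the directory the final copy step expects,
    # so that copying config.json no longer fails with "No such file or directory".
    # The path mirrors the one in the error message above.
    os.makedirs('./logs/my_model/best_model', exist_ok=True)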

These are the last log entries:

2022-10-02 19:36:21 INFO tmp metric: 0.7093469910371318
2022-10-02 19:36:21 INFO finish 3rd phase (no improvement)
2022-10-02 19:36:21 INFO 3rd RUN RESULTS: ./logs/my_model/model_cghqta
2022-10-02 19:36:21 INFO epoch 10: 0.6972361809045226
2022-10-02 19:36:21 INFO epoch 11: 0.6990415335463258
2022-10-02 19:36:21 INFO epoch 12: 0.6998706338939199
2022-10-02 19:36:21 INFO epoch 13: 0.7048969072164948
2022-10-02 19:36:21 INFO epoch 14: 0.707613563659629
2022-10-02 19:36:21 INFO epoch 15: 0.708893154190659
2022-10-02 19:36:21 INFO epoch 16: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 17: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 18: 0.7093469910371318

Referring to the documentation:
"The best model in the second stage will continue fine-tuning till the validation metric get decreased."

This raises the question of how training up to epoch "l" is handled if the learning curve shows overfitting before the configured maximum number of epochs for step 2 (e.g. 10). Does it recognize that and stop before epoch 10, or does it just continue and hand over an overfitting model for further training as "best_model"?

@asahi417 (Owner)

That's a good point, and I don't think I explain it in detail in the README. The first step trains all the configurations until epoch_partial, and the n_max_config best configurations are handed to the second step, where those models are trained until epoch. Once the second step has finished, we compute the loss on the validation set with each epoch's checkpoint. Only if the best epoch is epoch itself (i.e. the last one), meaning there's a possibility that the model is still improving, do we continue the fine-tuning.
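
A rough sketch of that three-stage flow, as I understand it (my own pseudocode, not the actual tner implementation; train_and_score is a hypothetical stand-in for real fine-tuning):

    import random

    def train_and_score(cfg, epochs):
        # Hypothetical stand-in for real fine-tuning: returns one validation
        # metric per epoch so that the sketch runs on its own.
        rng = random.Random(str(cfg))
        return [round(rng.uniform(0.60, 0.72), 4) for _ in range(epochs)]

    def grid_search(configs, epoch_partial, epoch, n_max_config):
        # Stage 1: train every configuration for `epoch_partial` epochs and rank them.
        stage1 = [(cfg, train_and_score(cfg, epoch_partial)[-1]) for cfg in configs]
        stage1.sort(key=lambda item: item[1], reverse=True)
        survivors = [cfg for cfg, _ in stage1[:n_max_config]]

        # Stage 2: train the survivors up to `epoch` epochs and keep the
        # validation metric of every per-epoch checkpoint.
        stage2 = [(cfg, train_and_score(cfg, epoch)) for cfg in survivors]

        # The overall best model is the configuration/epoch pair with the highest metric.
        best_cfg, per_epoch = max(stage2, key=lambda item: max(item[1]))
        best_epoch = per_epoch.index(max(per_epoch)) + 1

        # Stage 3 only happens when the best checkpoint is the final epoch, i.e. the
        # learning curve may still be rising; then fine-tuning continues until the
        # validation metric stops improving.
        needs_third_stage = best_epoch == epoch
        return best_cfg, best_epoch, needs_third_stage

    configs = [{"lr": lr, "crf": crf} for lr in (1e-4, 1e-5) for crf in (True, False)]
    print(grid_search(configs, epoch_partial=3, epoch=10, n_max_config=3))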

@asahi417 (Owner)

For example, if you set epoch=10 and the best model is the one from epoch 8, then that's the final model (no third round). If the best model were the one from epoch 10, the third round would start and fine-tuning would continue until the validation metric starts to drop.
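
Put numerically (the scores below are made up, just to illustrate the rule):

    epoch = 10
    # Made-up per-epoch validation metrics, only for illustration.
    scores = [0.62, 0.65, 0.67, 0.68, 0.69, 0.695, 0.700, 0.705, 0.703, 0.701]
    best_epoch = scores.index(max(scores)) + 1  # 8 -> the epoch-8 checkpoint wins
    run_third_round = best_epoch == epoch       # False -> no third round for this run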
