I enabled more variations for the grid search, and now the last step of training crashes (I started this twice in a row to confirm). Before, it worked just fine when I set n_max_config=1, handing over a single model to step 2 of the training.
Error message: DistutilsFileError: could not create './logs/my_model/best_model/config.json': No such file or directory.
searcher = GridSearcher(
    checkpoint_dir=current_checkpoint,
    local_dataset=local_dataset,
    model="PretrainedLM",
    epoch=10,
    epoch_partial=3,
    n_max_config=3,
    batch_size=64,  # are the texts auto-chunked, or is the rest of the sequence just discarded?
    gradient_accumulation_steps=[4, 8],
    crf=[True, False],
    lr=[1e-4, 1e-5],
    weight_decay=[1e-7],
    random_seed=[42],
    lr_warmup_step_ratio=[0.1],
    max_grad_norm=[10]
    # settings from the earlier run that worked:
    # gradient_accumulation_steps=[4],
    # crf=[True],
    # lr=[1e-4, 1e-5],
    # weight_decay=[None],
    # random_seed=[42],
    # lr_warmup_step_ratio=[0.1],
    # max_grad_norm=[None],
    # use_auth_token=True
)
searcher.train()
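If the crash is really just the missing parent directory (my reading of the error above, not a confirmed root cause), pre-creating it before training might work around it:

import os

# Workaround guess, not a confirmed fix: pre-create the directory that the
# final copy of config.json apparently expects to exist.
os.makedirs("./logs/my_model/best_model", exist_ok=True)
searcher.train()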
These are the last log entries:
2022-10-02 19:36:21 INFO tmp metric: 0.7093469910371318
2022-10-02 19:36:21 INFO finish 3rd phase (no improvement)
2022-10-02 19:36:21 INFO 3rd RUN RESULTS: ./logs/my_model/model_cghqta
2022-10-02 19:36:21 INFO epoch 10: 0.6972361809045226
2022-10-02 19:36:21 INFO epoch 11: 0.6990415335463258
2022-10-02 19:36:21 INFO epoch 12: 0.6998706338939199
2022-10-02 19:36:21 INFO epoch 13: 0.7048969072164948
2022-10-02 19:36:21 INFO epoch 14: 0.707613563659629
2022-10-02 19:36:21 INFO epoch 15: 0.708893154190659
2022-10-02 19:36:21 INFO epoch 16: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 17: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 18: 0.7093469910371318
Referring to the documentation:
"The best model in the second stage will continue fine-tuning till the validation metric get decreased."
This raises the question of how training up to epoch "l" is handled. If the learning curve shows overfitting before the configured maximum number of epochs for step 2 (e.g. 10), does it recognize this and stop before epoch 10, or does it just continue, handing over an overfitting model for further training as "best_model"?
That's a good point, and I don't think I explain it in the README in detail. The first step trains every configuration up to epoch_partial, and the n_max_config best configurations are handed to the second step, where those models are trained up to epoch. Once the second step finishes, we compute the loss on the validation set with each epoch's checkpoint. Only if the best epoch is epoch, meaning the model may still be underfitted, do we continue fine-tuning.
For example, if you set epoch=10 and the best model is the one at epoch=8, then that's the final model (no third round). If the best model is the one at epoch=10, the third round starts and fine-tuning continues until the validation metric drops.
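Here is a minimal sketch of that three-phase logic as I described it (not the actual tner implementation; train_one_epoch and evaluate are fake stubs, and details like checkpoint handling are simplified):

import random

def train_one_epoch(model):
    # placeholder: one epoch of fine-tuning
    model["epochs"] += 1

def evaluate(model):
    # placeholder: validation metric, higher is better
    return random.random()

def grid_search(configs, epoch, epoch_partial, n_max_config):
    # Phase 1: train every configuration up to epoch_partial,
    # then keep only the n_max_config best ones.
    models = [{"cfg": c, "epochs": 0} for c in configs]
    for m in models:
        for _ in range(epoch_partial):
            train_one_epoch(m)
    models.sort(key=evaluate, reverse=True)
    finalists = models[:n_max_config]

    # Phase 2: continue the finalists up to epoch, scoring each epoch's checkpoint.
    history = []  # (model, epoch index, validation score)
    for m in finalists:
        for e in range(epoch_partial + 1, epoch + 1):
            train_one_epoch(m)
            history.append((m, e, evaluate(m)))
    best_model, best_epoch, best_score = max(history, key=lambda t: t[2])

    # Phase 3: only if the best checkpoint is the *last* epoch, keep
    # fine-tuning until the validation metric decreases (per the docs quote).
    if best_epoch == epoch:
        while True:
            train_one_epoch(best_model)
            score = evaluate(best_model)
            if score < best_score:  # metric decreased -> stop
                break
            best_score = score
    return best_model

if __name__ == "__main__":
    grid_search([{"lr": 1e-4}, {"lr": 1e-5}], epoch=10, epoch_partial=3, n_max_config=1)

So, to your overfitting question: if a finalist peaks before epoch, that earlier checkpoint is what gets selected, and no third round happens for it.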