[Bug]: Error message: "learning rate too small - quitting training!" #3428

Closed

azkgit opened this issue Mar 18, 2024 · 1 comment

Labels
bug Something isn't working
azkgit commented Mar 18, 2024

Describe the bug

Model training quits after epoch 1 with a "learning rate too small - quitting training!" error message even though the "patience" parameter is set to 10.

To Reproduce

In Google Colab:

!pip install flair -qq

import os
from os import mkdir, listdir
from os.path import join, exists
import re

from torch.optim.adam import Adam
from flair.datasets import CSVClassificationCorpus
from flair.data import Corpus, Sentence
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

for embedding in ["distilbert-base-uncased"]:
  print("Training on", embedding)

  # 1a. define the column format indicating which columns contain the text and labels
  column_name_map = {1: "text", 2: "label"}

  # 1b. load the preprocessed training, development, and test sets
  #     (processed_dir points to the folder containing the preprocessed TSV files)
  corpus: Corpus = CSVClassificationCorpus(processed_dir,
                                          column_name_map,
                                          label_type="label",
                                          skip_header=True,
                                          delimiter='\t')
  # 2. create the label dictionary
  label_dict = corpus.make_label_dictionary(label_type="label")

  # 3. initialize the transformer document embeddings
  document_embeddings = TransformerDocumentEmbeddings(embedding,
                                                      fine_tune=True,
                                                      layers="all")
  #document_embeddings.tokenizer.pad_token = document_embeddings.tokenizer.eos_token

  # 4. create the text classifier
  classifier = TextClassifier(document_embeddings,
                              label_dictionary=label_dict,
                              label_type="label")

  # 5. initialize the trainer
  trainer = ModelTrainer(classifier,
                        corpus)

  # 6. start the training
  trainer.train('model/'+embedding,
              learning_rate=1e-5,
              mini_batch_size=8,
              max_epochs=3,
              patience=10,
              optimizer=Adam,
              train_with_dev=False,
              save_final_model=False
              )

Expected behavior

In this case, the model should be trained for 3 epochs without reducing the learning rate. In prior cases, even when a learning rate of 1e-5 was reduced by an anneal factor of 0.5, I did not receive a "learning rate too small - quitting training!" error message.
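
As it turns out (see the comment at the end), the quit is not related to the patience counter at all. Judging from the "Plugins" line in the log below, the AnnealOnPlateau plugin is configured with min_learning_rate 0.0001, and training stops as soon as the current learning rate falls below that threshold. An illustrative sketch of that comparison (not Flair's actual source; values taken from the log):

learning_rate = 1e-5      # value passed to trainer.train()
min_learning_rate = 1e-4  # AnnealOnPlateau default reported in the "Plugins" log line

# If the starting learning rate is already below the minimum, training stops
# right after the first epoch, regardless of the patience setting.
if learning_rate < min_learning_rate:
    print("learning rate too small - quitting training!")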

Logs and Stack traces

2024-03-18 14:11:51,783 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,786 Model: "TextClassifier(
  (embeddings): TransformerDocumentEmbeddings(
    (model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30523, 768)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0-5): 6 x TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
              (activation): GELUActivation()
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
        )
      )
    )
  )
  (decoder): Linear(in_features=5376, out_features=2, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (weights): None
  (weight_tensor) None
)"
2024-03-18 14:11:51,787 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,789 Corpus: 8800 train + 2200 dev + 2200 test sentences
2024-03-18 14:11:51,793 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,794 Train:  8800 sentences
2024-03-18 14:11:51,795         (train_with_dev=False, train_with_test=False)
2024-03-18 14:11:51,799 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,802 Training Params:
2024-03-18 14:11:51,804  - learning_rate: "1e-05" 
2024-03-18 14:11:51,806  - mini_batch_size: "8"
2024-03-18 14:11:51,807  - max_epochs: "3"
2024-03-18 14:11:51,812  - shuffle: "True"
2024-03-18 14:11:51,813 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,814 Plugins:
2024-03-18 14:11:51,816  - AnnealOnPlateau | patience: '10', anneal_factor: '0.5', min_learning_rate: '0.0001'
2024-03-18 14:11:51,817 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,818 Final evaluation on model from best epoch (best-model.pt)
2024-03-18 14:11:51,820  - metric: "('micro avg', 'f1-score')"
2024-03-18 14:11:51,821 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,823 Computation:
2024-03-18 14:11:51,825  - compute on device: cuda:0
2024-03-18 14:11:51,835  - embedding storage: cpu
2024-03-18 14:11:51,836 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,837 Model training base path: "model/distilbert-base-uncased"
2024-03-18 14:11:51,840 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,846 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:55,845 epoch 1 - iter 110/1100 - loss 0.57600509 - time (sec): 4.00 - samples/sec: 220.19 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:11:58,978 epoch 1 - iter 220/1100 - loss 0.50393908 - time (sec): 7.13 - samples/sec: 246.84 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:01,876 epoch 1 - iter 330/1100 - loss 0.46954644 - time (sec): 10.03 - samples/sec: 263.27 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:05,276 epoch 1 - iter 440/1100 - loss 0.44181235 - time (sec): 13.43 - samples/sec: 262.14 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:08,456 epoch 1 - iter 550/1100 - loss 0.41807515 - time (sec): 16.61 - samples/sec: 264.93 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:11,447 epoch 1 - iter 660/1100 - loss 0.40403758 - time (sec): 19.60 - samples/sec: 269.41 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:14,420 epoch 1 - iter 770/1100 - loss 0.38948912 - time (sec): 22.57 - samples/sec: 272.91 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:17,914 epoch 1 - iter 880/1100 - loss 0.38118810 - time (sec): 26.07 - samples/sec: 270.09 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:21,085 epoch 1 - iter 990/1100 - loss 0.37110791 - time (sec): 29.24 - samples/sec: 270.89 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:24,027 epoch 1 - iter 1100/1100 - loss 0.36139164 - time (sec): 32.18 - samples/sec: 273.47 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:24,030 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:24,032 EPOCH 1 done: loss 0.3614 - lr: 0.000010
2024-03-18 14:12:28,158 DEV : loss 0.28874295949935913 - f1-score (micro avg)  0.9095
2024-03-18 14:12:29,719  - 0 epochs without improvement
2024-03-18 14:12:29,721 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,723 learning rate too small - quitting training!
2024-03-18 14:12:29,725 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,727 Done.
2024-03-18 14:12:29,729 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,733 Testing using last state of model ...
2024-03-18 14:12:33,651 
Results:
- F-score (micro) 0.9132
- F-score (macro) 0.9029
- Accuracy 0.9132

By class:
              precision    recall  f1-score   support

           0     0.9184    0.9511    0.9345      1432
           1     0.9024    0.8424    0.8714       768

    accuracy                         0.9132      2200
   macro avg     0.9104    0.8968    0.9029      2200
weighted avg     0.9128    0.9132    0.9125      2200

2024-03-18 14:12:33,653 ----------------------------------------------------------------------------------------------------

Screenshots

No response

Additional Context

No response

Environment

Versions:

- Flair: 0.13.1
- Pytorch: 2.2.1+cu121
- Transformers: 4.38.2
- GPU: True

@azkgit azkgit added the bug Something isn't working label Mar 18, 2024
@azkgit azkgit closed this as completed Mar 18, 2024

azkgit commented Mar 18, 2024

I figured out what the issue was. It looks like a "min_learning_rate" parameter with a default value of 0.0001 was added since I last used Flair, and that default is greater than my learning rate of 1e-5, so training stops right after the first epoch.
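
For anyone hitting the same message: passing min_learning_rate explicitly with a value below the learning rate should let training run for all epochs. A minimal sketch of the adjusted call (the keyword is assumed from the AnnealOnPlateau defaults shown in the training log; I have only verified the behaviour described above):

  # Workaround sketch: make sure min_learning_rate sits below learning_rate
  trainer.train('model/'+embedding,
              learning_rate=1e-5,
              min_learning_rate=1e-7,   # assumed keyword; must be smaller than learning_rate
              mini_batch_size=8,
              max_epochs=3,
              patience=10,
              optimizer=Adam,
              train_with_dev=False,
              save_final_model=False
              )

Alternatively, trainer.fine_tune(), which uses a linear scheduler instead of AnnealOnPlateau, may be a better fit when fine-tuning transformer embeddings, though I have not re-tested with it here.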
