[Bug]: Error message: "learning rate too small - quitting training!" #3428

Closed

azkgit opened this issue Mar 18, 2024 · 1 comment

Labels
bug Something isn't working
azkgit commented Mar 18, 2024

Describe the bug

Model training quits after epoch 1 with a "learning rate too small - quitting training!" error message even though the "patience" parameter is set to 10.

To Reproduce

In Google Colab:

!pip install flair -qq

import os
from os import mkdir, listdir
from os.path import join, exists
import re

from torch.optim.adam import Adam
from flair.datasets import CSVClassificationCorpus
from flair.data import Corpus, Sentence
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

for embedding in ["distilbert-base-uncased"]:
  print("Training on", embedding)

  # 1a. define the column format indicating which columns contain the text and labels
  column_name_map = {1: "text", 2: "label"}

  # 1b. load the preprocessed training, development, and test sets
  #     (processed_dir points to the folder containing the preprocessed TSV files)
  corpus: Corpus = CSVClassificationCorpus(processed_dir,
                                          column_name_map,
                                          label_type="label",
                                          skip_header=True,
                                          delimiter='\t')
  # 2. create the label dictionary
  label_dict = corpus.make_label_dictionary(label_type="label")

  # 3. initialize the transformer document embeddings
  document_embeddings = TransformerDocumentEmbeddings(embedding,
                                                      fine_tune=True,
                                                      layers="all")
  #document_embeddings.tokenizer.pad_token = document_embeddings.tokenizer.eos_token

  # 4. create the text classifier
  classifier = TextClassifier(document_embeddings,
                              label_dictionary=label_dict,
                              label_type="label")

  # 5. initialize the trainer
  trainer = ModelTrainer(classifier,
                        corpus)

  # 6. start the training
  trainer.train('model/'+embedding,
              learning_rate=1e-5,
              mini_batch_size=8,
              max_epochs=3,
              patience=10,
              optimizer=Adam,
              train_with_dev=False,
              save_final_model=False
              )

Expected behavior

In this case, the model should be trained for 3 epochs without reducing the learning rate. In prior cases, even when a learning rate of 1e-5 was reduced by an anneal factor of 0.5, I did not receive a "learning rate too small - quitting training!" error message.
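
As it turns out (see the comment at the end), the quit is not related to the patience counter at all. Judging from the "Plugins" line in the log below, the AnnealOnPlateau plugin is configured with min_learning_rate 0.0001, and training stops as soon as the current learning rate falls below that threshold. An illustrative sketch of that comparison (not Flair's actual source; values taken from the log):

learning_rate = 1e-5      # value passed to trainer.train()
min_learning_rate = 1e-4  # AnnealOnPlateau default reported in the "Plugins" log line

# If the starting learning rate is already below the minimum, training stops
# right after the first epoch, regardless of the patience setting.
if learning_rate < min_learning_rate:
    print("learning rate too small - quitting training!")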

Logs and Stack traces

2024-03-18 14:11:51,783 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,786 Model: "TextClassifier(
  (embeddings): TransformerDocumentEmbeddings(
    (model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30523, 768)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0-5): 6 x TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
              (activation): GELUActivation()
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
        )
      )
    )
  )
  (decoder): Linear(in_features=5376, out_features=2, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (weights): None
  (weight_tensor) None
)"
2024-03-18 14:11:51,787 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,789 Corpus: 8800 train + 2200 dev + 2200 test sentences
2024-03-18 14:11:51,793 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,794 Train:  8800 sentences
2024-03-18 14:11:51,795         (train_with_dev=False, train_with_test=False)
2024-03-18 14:11:51,799 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,802 Training Params:
2024-03-18 14:11:51,804  - learning_rate: "1e-05" 
2024-03-18 14:11:51,806  - mini_batch_size: "8"
2024-03-18 14:11:51,807  - max_epochs: "3"
2024-03-18 14:11:51,812  - shuffle: "True"
2024-03-18 14:11:51,813 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,814 Plugins:
2024-03-18 14:11:51,816  - AnnealOnPlateau | patience: '10', anneal_factor: '0.5', min_learning_rate: '0.0001'
2024-03-18 14:11:51,817 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,818 Final evaluation on model from best epoch (best-model.pt)
2024-03-18 14:11:51,820  - metric: "('micro avg', 'f1-score')"
2024-03-18 14:11:51,821 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,823 Computation:
2024-03-18 14:11:51,825  - compute on device: cuda:0
2024-03-18 14:11:51,835  - embedding storage: cpu
2024-03-18 14:11:51,836 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,837 Model training base path: "model/distilbert-base-uncased"
2024-03-18 14:11:51,840 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:51,846 ----------------------------------------------------------------------------------------------------
2024-03-18 14:11:55,845 epoch 1 - iter 110/1100 - loss 0.57600509 - time (sec): 4.00 - samples/sec: 220.19 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:11:58,978 epoch 1 - iter 220/1100 - loss 0.50393908 - time (sec): 7.13 - samples/sec: 246.84 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:01,876 epoch 1 - iter 330/1100 - loss 0.46954644 - time (sec): 10.03 - samples/sec: 263.27 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:05,276 epoch 1 - iter 440/1100 - loss 0.44181235 - time (sec): 13.43 - samples/sec: 262.14 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:08,456 epoch 1 - iter 550/1100 - loss 0.41807515 - time (sec): 16.61 - samples/sec: 264.93 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:11,447 epoch 1 - iter 660/1100 - loss 0.40403758 - time (sec): 19.60 - samples/sec: 269.41 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:14,420 epoch 1 - iter 770/1100 - loss 0.38948912 - time (sec): 22.57 - samples/sec: 272.91 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:17,914 epoch 1 - iter 880/1100 - loss 0.38118810 - time (sec): 26.07 - samples/sec: 270.09 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:21,085 epoch 1 - iter 990/1100 - loss 0.37110791 - time (sec): 29.24 - samples/sec: 270.89 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:24,027 epoch 1 - iter 1100/1100 - loss 0.36139164 - time (sec): 32.18 - samples/sec: 273.47 - lr: 0.000010 - momentum: 0.000000
2024-03-18 14:12:24,030 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:24,032 EPOCH 1 done: loss 0.3614 - lr: 0.000010
2024-03-18 14:12:28,158 DEV : loss 0.28874295949935913 - f1-score (micro avg)  0.9095
2024-03-18 14:12:29,719  - 0 epochs without improvement
2024-03-18 14:12:29,721 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,723 learning rate too small - quitting training!
2024-03-18 14:12:29,725 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,727 Done.
2024-03-18 14:12:29,729 ----------------------------------------------------------------------------------------------------
2024-03-18 14:12:29,733 Testing using last state of model ...
2024-03-18 14:12:33,651 
Results:
- F-score (micro) 0.9132
- F-score (macro) 0.9029
- Accuracy 0.9132

By class:
              precision    recall  f1-score   support

           0     0.9184    0.9511    0.9345      1432
           1     0.9024    0.8424    0.8714       768

    accuracy                         0.9132      2200
   macro avg     0.9104    0.8968    0.9029      2200
weighted avg     0.9128    0.9132    0.9125      2200

2024-03-18 14:12:33,653 ----------------------------------------------------------------------------------------------------

Screenshots

No response

Additional Context

No response

Environment

Versions:

- Flair: 0.13.1
- Pytorch: 2.2.1+cu121
- Transformers: 4.38.2
- GPU: True

@azkgit azkgit added the bug Something isn't working label Mar 18, 2024
@azkgit azkgit closed this as completed Mar 18, 2024

azkgit commented Mar 18, 2024

I figured out what the issue was. It looks like a "min_learning_rate" parameter with a default value of 0.0001 was added since I last used Flair, and that default is greater than my learning rate of 1e-5, so training stops right after the first epoch.
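
For anyone hitting the same message: passing min_learning_rate explicitly with a value below the learning rate should let training run for all epochs. A minimal sketch of the adjusted call (the keyword is assumed from the AnnealOnPlateau defaults shown in the training log; I have only verified the behaviour described above):

  # Workaround sketch: make sure min_learning_rate sits below learning_rate
  trainer.train('model/'+embedding,
              learning_rate=1e-5,
              min_learning_rate=1e-7,   # assumed keyword; must be smaller than learning_rate
              mini_batch_size=8,
              max_epochs=3,
              patience=10,
              optimizer=Adam,
              train_with_dev=False,
              save_final_model=False
              )

Alternatively, trainer.fine_tune(), which uses a linear scheduler instead of AnnealOnPlateau, may be a better fit when fine-tuning transformer embeddings, though I have not re-tested with it here.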
