
Training beyond 150 epochs #81

Open
gsethi2409 opened this issue Jun 10, 2024 · 0 comments

Hey! I set up the conda env, the repository, and the pretrained model as required. However, when I try to train the model for one extra epoch (as a sanity check), I see NaN values in the loss and in the update ratios.

I increased max_epochs from 150 to 151 and kept all other parameters the same:

train:
  loss: "xentropy"       # must be either xentropy or iou
  max_epochs: 151
  lr: 0.05              # sgd learning rate
  wup_epochs: 0          # warmup during first XX epochs (can be float)
  momentum: 0.9          # sgd momentum
  lr_decay: 0.99         # learning rate decay per epoch after initial cycle (from min lr)
  w_decay: 0.0001        # weight decay
  batch_size: 1            # batch size
  report_batch: 50        # every x batches, report loss
  report_epoch: 1        # every x epochs, report validation set
  epsilon_w: 0.001       # class weight w = 1 / (content + epsilon_w)
  save_summary: False    # Summary of weight histograms for tensorboard
  save_scans: True       # False doesn't save anything, True saves some
                         # sample images (one per batch of the last calculated batch)
                         # in log folder
  show_scans: False      # show scans during training
  workers: 1
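
For context on the `epsilon_w` comment above, here is a minimal sketch of the class-weight formula `w = 1 / (content + epsilon_w)`. The function name `class_weights` and the example frequencies are hypothetical and only illustrate the arithmetic; the repository may compute the weights differently.

```python
def class_weights(content, epsilon_w=0.001):
    """Apply w = 1 / (content + epsilon_w) to each class frequency."""
    return [1.0 / (c + epsilon_w) for c in content]


# Hypothetical per-class frequencies (fraction of points per class),
# not taken from the actual dataset.
example_content = [0.0, 0.25, 0.75]
weights = class_weights(example_content)

# A class that never occurs gets the largest weight, about 1 / epsilon_w.
for freq, w in zip(example_content, weights):
    print(f"content={freq:.2f} -> weight={w:.3f}")
```

With `epsilon_w: 0.001`, the weight of an absent class is capped near 1000 instead of dividing by zero, so the formula on its own should stay finite.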

Here's the log I see:

Lr: 1.326e-03 | Update: 9.066e-01 mean,2.853e-01 std | Epoch: [150][0/4541] | Time 771.238 (771.238) | Data 0.124 (0.124) | Loss 8.3961 (8.3961) | acc 0.138 (0.138) | IoU 0.016 (0.016) | [40 days, 12:46:27]
../../tasks/semantic/modules/trainer.py:453: RuntimeWarning: invalid value encountered in float_scalars
  update_ratios.append(update / max(w, 1e-10))
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][50/4541] | Time 0.266 (15.390) | Data 0.039 (0.047) | Loss nan (nan) | acc 0.000 (0.004) | IoU 0.000 (0.000) | [19:15:13]
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][100/4541] | Time 0.292 (7.906) | Data 0.044 (0.045) | Loss nan (nan) | acc 0.000 (0.002) | IoU 0.000 (0.000) | [9:48:24]
Lr: 2.235e-03 | Update: nan mean,nan std | Epoch: [150][150/4541] | Time 0.285 (5.380) | Data 0.050 (0.046) | Loss nan (nan) | acc 0.000 (0.001) | IoU 0.000 (0.000) | [6:36:59]
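
The warning points at `update / max(w, 1e-10)`. As a rough sketch (assuming `update` and `w` are plain floats, which may not match the actual types in trainer.py), the `max` guard only protects against a zero denominator: Python's built-in `max(nan, 1e-10)` returns `nan`, so a non-finite weight norm passes straight through. The helper `safe_update_ratio` below is hypothetical, not part of the repository, and would only help locate where the first non-finite value appears rather than fix the cause.

```python
import math


def safe_update_ratio(update, w, eps=1e-10):
    """Return update / max(w, eps), or None if either input is non-finite.

    Hypothetical helper for debugging; not taken from trainer.py.
    """
    if not (math.isfinite(update) and math.isfinite(w)):
        return None  # caller can log the offending parameter and stop
    return update / max(w, eps)


# The existing guard does not filter NaN: max() keeps the first argument
# because the comparison `1e-10 > nan` is False.
guarded = max(float("nan"), 1e-10)
print(math.isnan(guarded))

print(safe_update_ratio(1.0, 2.0))
print(safe_update_ratio(float("nan"), 2.0))
```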

Any ideas?
