Training beyond 150 epochs #81

gsethi2409 · 2024-06-10T06:15:56Z

Hey! I set up the conda env, repository, and pretrained model as required. However, when I try to train the model for one extra epoch (as a sanity check), I see NaN values in the loss and updates.

I increased max_epochs to 151 but kept the other params the same:

train:
  loss: "xentropy"       # must be either xentropy or iou
  max_epochs: 151
  lr: 0.05              # sgd learning rate
  wup_epochs: 0          # warmup during first XX epochs (can be float)
  momentum: 0.9          # sgd momentum
  lr_decay: 0.99         # learning rate decay per epoch after initial cycle (from min lr)
  w_decay: 0.0001        # weight decay
  batch_size: 1            # batch size
  report_batch: 50        # every x batches, report loss
  report_epoch: 1        # every x epochs, report validation set
  epsilon_w: 0.001       # class weight w = 1 / (content + epsilon_w)
  save_summary: False    # Summary of weight histograms for tensorboard
  save_scans: True       # False doesn't save anything, True saves some
                         # sample images (one per batch of the last calculated batch)
                         # in log folder
  show_scans: False      # show scans during training
  workers: 1

Here's the log I see:

Lr: 1.326e-03 | Update: 9.066e-01 mean,2.853e-01 std | Epoch: [150][0/4541] | Time 771.238 (771.238) | Data 0.124 (0.124) | Loss 8.3961 (8.3961) | acc 0.138 (0.138) | IoU 0.016 (0.016) | [40 days, 12:46:27]
../../tasks/semantic/modules/trainer.py:453: RuntimeWarning: invalid value encountered in float_scalars
  update_ratios.append(update / max(w, 1e-10))
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][50/4541] | Time 0.266 (15.390) | Data 0.039 (0.047) | Loss nan (nan) | acc 0.000 (0.004) | IoU 0.000 (0.000) | [19:15:13]
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][100/4541] | Time 0.292 (7.906) | Data 0.044 (0.045) | Loss nan (nan) | acc 0.000 (0.002) | IoU 0.000 (0.000) | [9:48:24]
Lr: 2.235e-03 | Update: nan mean,nan std | Epoch: [150][150/4541] | Time 0.285 (5.380) | Data 0.050 (0.046) | Loss nan (nan) | acc 0.000 (0.001) | IoU 0.000 (0.000) | [6:36:59]
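
To narrow down where the values first go bad, a small guard along these lines could stop at the first non-finite loss instead of logging nan for the rest of the epoch. This is only a minimal sketch, not code from the repository: `update_ratio` and `first_non_finite` are hypothetical helper names, the loss list is made-up illustration data, and only the `update / max(w, 1e-10)` expression is taken from the warning above.

```python
import math


def update_ratio(update, w, eps=1e-10):
    """Same expression as the line flagged in the warning: update / max(w, eps)."""
    return update / max(w, eps)


def first_non_finite(values):
    """Return the index of the first NaN/inf in values, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


# Hypothetical per-batch losses (illustration only): the second has diverged.
losses = [8.3961, float("nan"), float("nan")]
bad = first_non_finite(losses)
if bad is not None:
    print(f"non-finite loss at batch index {bad}; stop and inspect inputs and lr")
```

A NaN in either the update or the weight norm propagates through the ratio, so checking the loss before the optimizer step is usually the earliest place to catch it.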

Any ideas?
