Hey! I set up the conda env, repository, and pretrained model as required. However, when I train the model for one extra epoch (as a sanity check), I see NaN values in the loss and updates.
I increased max_epochs to 151 and kept all other params the same:
train:
  loss: "xentropy" # must be either xentropy or iou
  max_epochs: 151
  lr: 0.05 # sgd learning rate
  wup_epochs: 0 # warmup during first XX epochs (can be float)
  momentum: 0.9 # sgd momentum
  lr_decay: 0.99 # learning rate decay per epoch after initial cycle (from min lr)
  w_decay: 0.0001 # weight decay
  batch_size: 1 # batch size
  report_batch: 50 # every x batches, report loss
  report_epoch: 1 # every x epochs, report validation set
  epsilon_w: 0.001 # class weight w = 1 / (content + epsilon_w)
  save_summary: False # Summary of weight histograms for tensorboard
  save_scans: True # False doesn't save anything, True saves some
                   # sample images (one per batch of the last calculated batch)
                   # in log folder
  show_scans: False # show scans during training
  workers: 1
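For context, the `epsilon_w` comment in the config gives the class weight as w = 1 / (content + epsilon_w). Below is a minimal sketch of that formula, assuming `content` is the per-class fraction of points. The function name `class_weights` and the sample frequencies are hypothetical and not taken from the repository.

```python
def class_weights(content, epsilon_w=0.001):
    """Inverse-frequency class weights: w = 1 / (content + epsilon_w)."""
    return [1.0 / (c + epsilon_w) for c in content]


# Hypothetical class frequencies (not from the dataset). A class with zero
# content gets the largest finite weight, 1 / epsilon_w, instead of a
# division by zero.
weights = class_weights([0.0, 0.25, 0.749])
```

With epsilon_w at 0.001 the weights stay finite even for empty classes, so this setting alone should not produce inf or NaN values.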
Here's the log I see:
Lr: 1.326e-03 | Update: 9.066e-01 mean,2.853e-01 std | Epoch: [150][0/4541] | Time 771.238 (771.238) | Data 0.124 (0.124) | Loss 8.3961 (8.3961) | acc 0.138 (0.138) | IoU 0.016 (0.016) | [40 days, 12:46:27]
../../tasks/semantic/modules/trainer.py:453: RuntimeWarning: invalid value encountered in float_scalars
update_ratios.append(update / max(w, 1e-10))
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][50/4541] | Time 0.266 (15.390) | Data 0.039 (0.047) | Loss nan (nan) | acc 0.000 (0.004) | IoU 0.000 (0.000) | [19:15:13]
Lr: 2.236e-03 | Update: nan mean,nan std | Epoch: [150][100/4541] | Time 0.292 (7.906) | Data 0.044 (0.045) | Loss nan (nan) | acc 0.000 (0.002) | IoU 0.000 (0.000) | [9:48:24]
Lr: 2.235e-03 | Update: nan mean,nan std | Epoch: [150][150/4541] | Time 0.285 (5.380) | Data 0.050 (0.046) | Loss nan (nan) | acc 0.000 (0.001) | IoU 0.000 (0.000) | [6:36:59]
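Since the log only reports every 50 batches, the loss first becomes NaN somewhere between batch 0 and batch 50. A check along these lines could help find the exact batch. This is only a sketch: `first_non_finite` and the sample losses are hypothetical, and nothing here is part of the repository's trainer.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None


# Hypothetical per-batch losses mimicking the log: finite at first, then NaN.
batch_losses = [8.3961, 7.9, float("nan"), float("nan")]
bad_batch = first_non_finite(batch_losses)
```

Recording the loss of every batch (rather than every 50th) and passing the values through such a check would show which input or update first breaks training.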
Any ideas?