icnet error (icnet return tuple but not write that logic) #161

Open
lucasjinreal opened this issue Nov 19, 2018 · 14 comments

@lucasjinreal

Hi, icnet returns a tuple during training, but the loss function calls size() directly on that tuple, which raises this error:

Traceback (most recent call last):
  File "train.py", line 230, in <module>
    train(cfg, writer, logger)
  File "train.py", line 132, in train
    loss = loss_fn(input=outputs, target=labels)
  File "pytorch-semseg/ptsemseg/loss/loss.py", line 10, in cross_entropy2d
    n, c, h, w = input.size()
AttributeError: 'tuple' object has no attribute 'size'
@adam9500370
Contributor

adam9500370 commented Nov 19, 2018

Hi, @jinfagang.
You can set the multi_scale_cross_entropy loss function in the config file.

loss:
    name: 'multi_scale_cross_entropy'

Also change the 'exponent' tensor type to float and move the result to the corresponding device (in ptsemseg/loss/loss.py#L36):

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if target.is_cuda else 'cpu')
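
For readers following along, the weights that this line produces and the tuple handling around it can be sketched without PyTorch. This is only an illustration, not the repository's code: scale_weights, multi_scale_loss, and toy_loss are hypothetical names, and scale=0.4 is an assumed value.

```python
def scale_weights(n_inp, scale=0.4):
    """Return [scale**0, scale**1, ...], one weight per model output.

    Same values as torch.pow(scale * torch.ones(n_inp),
    torch.arange(n_inp).float()). scale=0.4 is an assumption.
    """
    return [scale ** i for i in range(n_inp)]


def multi_scale_loss(outputs, target, base_loss, scale=0.4):
    """Apply base_loss to one output, or a weighted sum over a tuple."""
    if not isinstance(outputs, tuple):
        # Single output (no auxiliary branches): plain loss.
        return base_loss(outputs, target)
    weights = scale_weights(len(outputs), scale)
    return sum(w * base_loss(out, target) for w, out in zip(weights, outputs))


def toy_loss(out, tgt):
    """Stand-in for cross entropy so the sketch runs on plain numbers."""
    return abs(out - tgt)


print(scale_weights(3))
print(multi_scale_loss((1.0, 2.0, 3.0), 0.0, toy_loss))
print(multi_scale_loss(1.0, 0.0, toy_loss))
```

The isinstance check is what lets one loss function serve both a tuple of outputs and a single tensor.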

@lucasjinreal
Author

@adam9500370 Hi, I am finally able to train icnet. However, after 10k more iterations, the mean IoU does not look right at all:

27it [00:02, 16.36it/s]WARN: resizing labels yielded fewer classes
500it [00:26, 29.90it/s]
Overall Acc: 	 0.4196301378853902
Mean Acc : 	 0.15644619030428067
FreqW Acc : 	 0.31476421378091346
Mean IoU : 	 0.09229576247351066
Iter [194050/300000]  Loss: 816839.8750  Time/Image: 0.1126
Iter [194100/300000]  Loss: 548733.3750  Time/Image: 0.1123
Iter [194150/300000]  Loss: 898010.5625  Time/Image: 0.1130
Iter [194200/300000]  Loss: 646011.3125  Time/Image: 0.1125
Iter [194250/300000]  Loss: 968136.6250  Time/Image: 0.1122
Iter [194300/300000]  Loss: 655537.1875  Time/Image: 0.1125
Iter [194350/300000]  Loss: 673936.6250  Time/Image: 0.1127
Iter [194400/300000]  Loss: 556652.3750  Time/Image: 0.1128
Iter [194450/300000]  Loss: 751962.5000  Time/Image: 0.1116
Iter [194500/300000]  Loss: 685939.0625  Time/Image: 0.1128
Iter [194550/300000]  Loss: 653181.4375  Time/Image: 0.1128
Iter [194600/300000]  Loss: 596467.0625  Time/Image: 0.1117
Iter [194650/300000]  Loss: 947831.4375  Time/Image: 0.1131
Iter [194700/300000]  Loss: 603308.4375  Time/Image: 0.1123
Iter [194750/300000]  Loss: 470650.3438  Time/Image: 0.1125
Iter [194800/300000]  Loss: 461287.7500  Time/Image: 0.1140
Iter [194850/300000]  Loss: 803597.2500  Time/Image: 0.1140
Iter [194900/300000]  Loss: 580953.6875  Time/Image: 0.1157
Iter [194950/300000]  Loss: 472815.9375  Time/Image: 0.1151
Iter [195000/300000]  Loss: 620432.0625  Time/Image: 0.1165
26it [00:02, 16.84it/s]WARN: resizing labels yielded fewer classes
500it [00:26, 18.86it/s]
Overall Acc: 	 0.43595608380925455
Mean Acc : 	 0.14414920306903656
FreqW Acc : 	 0.30780209512001516
Mean IoU : 	 0.09285922375025128
Iter [195050/300000]  Loss: 584194.6875  Time/Image: 0.1131
Iter [195100/300000]  Loss: 579036.9375  Time/Image: 0.1129
Iter [195150/300000]  Loss: 761244.0000  Time/Image: 0.1124
Iter [195200/300000]  Loss: 789020.6875  Time/Image: 0.1127
Iter [195250/300000]  Loss: 497891.0312  Time/Image: 0.1132
Iter [195300/300000]  Loss: 814943.5625  Time/Image: 0.1123
Iter [195350/300000]  Loss: 719462.1250  Time/Image: 0.1126
Iter [195400/300000]  Loss: 583933.4375  Time/Image: 0.1119
Iter [195450/300000]  Loss: 510635.5000  Time/Image: 0.1145
Iter [195500/300000]  Loss: 540089.3125  Time/Image: 0.1137
Iter [195550/300000]  Loss: 678339.6875  Time/Image: 0.1141
Iter [195600/300000]  Loss: 1116914.5000  Time/Image: 0.1133
Iter [195650/300000]  Loss: 574083.0625  Time/Image: 0.1158

The loss is far too large, and the mean IoU is clearly wrong. Any idea what causes this?

@adam9500370
Contributor

Could you share your training settings (e.g., # of classes (dataset), optimizer, learning rate, image size, ...)?

@lucasjinreal
Author

@adam9500370 Of course.

model:
    arch: icnet
data:
    dataset: cityscapes
    train_split: train
    val_split: val
    # icnet should be 32*n+1
    img_rows: 513
    img_cols: 1025
    path: /media/jintain/sg/permanent/datasets/Cityscapes
training:
    train_iters: 300000
    batch_size: 1
    val_interval: 1000
    n_workers: 16
    print_interval: 50
    optimizer:
        name: 'sgd'
        lr: 1.0e-10
        weight_decay: 0.0005
        momentum: 0.99
    loss:
        name: 'multi_scale_cross_entropy'
        size_average: False
    lr_schedule:
#    resume: fcn8s_pascal_best_model.pkl
    resume: runs/icnet_cityscapes_best_model.pkl

Nothing else was changed. I am training on Cityscapes with the default Cityscapes dataloader.
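
The config comment above says ICNet input sizes should have the form 32*n+1. A small check of that rule, as a sketch only: the function names are hypothetical, and the stride of 32 is taken from that comment rather than from the model code.

```python
def is_valid_icnet_size(size, stride=32):
    """True if size has the form stride*n + 1 for some integer n >= 1."""
    return size > stride and (size - 1) % stride == 0


def nearest_valid_icnet_size(size, stride=32):
    """Round size to the closest value of the form stride*n + 1 (n >= 1)."""
    n = max(1, round((size - 1) / stride))
    return stride * n + 1


# Sizes from the config above, plus a plain power-of-two size for contrast.
for rows, cols in [(513, 1025), (512, 1024)]:
    ok = is_valid_icnet_size(rows) and is_valid_icnet_size(cols)
    print(rows, cols, ok)

print(nearest_valid_icnet_size(512), nearest_valid_icnet_size(1024))
```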

@adam9500370
Contributor

adam9500370 commented Nov 20, 2018

Because size_average: False is set for the loss calculation, you get a very large loss value (the sum of the cross-entropy loss over all pixels of all images in each batch).
I think you need to set size_average: True to compute the mean loss value instead.
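
As a rough back-of-the-envelope check using the image size, batch size, and one logged loss value from earlier in this thread: dividing the summed loss by the number of pixels per batch gives a per-pixel value of ordinary magnitude. It is an approximation only, since it ignores pixels excluded from the loss and the extra terms from the auxiliary outputs.

```python
# Values taken from the config and the training log earlier in the thread.
img_rows, img_cols, batch_size = 513, 1025, 1
summed_loss = 620432.0625  # logged at Iter 195000 with size_average: False

pixels_per_batch = img_rows * img_cols * batch_size
approx_mean_loss = summed_loss / pixels_per_batch

print(pixels_per_batch)            # 525825
print(round(approx_mean_loss, 2))  # 1.18
```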

In addition, if you train the model from scratch, you may need to try the following:

  • Set arch: icnetBN to include BatchNorm (is_batchnorm: True)
  • Set a larger batch size (e.g., 8) and larger image size (e.g., (1025, 2049))
  • Set a larger learning rate (e.g., 1.0e-2) and choose a LR scheduler (e.g., poly_lr)
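
To illustrate the last point, a polynomial ("poly") learning-rate decay is commonly written as base_lr * (1 - iter / max_iter) ** power. The sketch below assumes that form and power=0.9; the repository's poly_lr scheduler may differ in its details, and the function here is a hypothetical stand-alone helper.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power.

    power=0.9 is an assumed, commonly used value.
    """
    iteration = min(iteration, max_iter)  # clamp so the base never goes negative
    return base_lr * (1.0 - iteration / max_iter) ** power


# Learning rate at a few points of a 300000-iteration run starting at 1.0e-2.
for it in (0, 100000, 200000, 300000):
    print(it, poly_lr(1.0e-2, it, 300000))
```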

You can also download the converted Caffe pretrained Cityscapes models here, and set the img_norm=False and version="pascal" arguments in data_loader (to match the data preprocessing of the original Caffe implementation).

@lucasjinreal
Author

@adam9500370 Hi, I took your advice and retrained from scratch, but the mean IoU still does not look normal. Here is the log:

Iter [2800/300000]  Loss: 1.7671  Time/Image: 0.1351
Iter [2850/300000]  Loss: 1.8565  Time/Image: 0.1378
Iter [2900/300000]  Loss: 1.8952  Time/Image: 0.1374
Iter [2950/300000]  Loss: 1.7559  Time/Image: 0.1380
Iter [3000/300000]  Loss: 1.7315  Time/Image: 0.1363
0it [00:00, ?it/s]WARN: resizing labels yielded fewer classes
63it [00:55,  3.46it/s]
Overall Acc: 	 0.7806173583871298
Mean Acc : 	 0.26045823400686646
FreqW Acc : 	 0.64662924844955
Mean IoU : 	 0.20318397657362453
Iter [3050/300000]  Loss: 1.6093  Time/Image: 0.1298
Iter [3100/300000]  Loss: 1.7549  Time/Image: 0.1368
Iter [3150/300000]  Loss: 1.6235  Time/Image: 0.1380
Iter [3200/300000]  Loss: 1.3351  Time/Image: 0.1375
Iter [3250/300000]  Loss: 1.4034  Time/Image: 0.1393
Iter [3300/300000]  Loss: 1.7972  Time/Image: 0.1369
WARN: resizing labels yielded fewer classes
Iter [3350/300000]  Loss: 1.6406  Time/Image: 0.1366
Iter [3400/300000]  Loss: 1.7513  Time/Image: 0.1395
WARN: resizing labels yielded fewer classes
Iter [3450/300000]  Loss: 1.6573  Time/Image: 0.1381
Iter [3500/300000]  Loss: 2.1634  Time/Image: 0.1379
Iter [3550/300000]  Loss: 1.4725  Time/Image: 0.1357
Iter [3600/300000]  Loss: 1.5244  Time/Image: 0.1386
Iter [3650/300000]  Loss: 1.4610  Time/Image: 0.1374
Iter [3700/300000]  Loss: 1.6305  Time/Image: 0.1372
Iter [3750/300000]  Loss: 1.5950  Time/Image: 0.1387
Iter [3800/300000]  Loss: 1.8183  Time/Image: 0.1326
Iter [3850/300000]  Loss: 1.9768  Time/Image: 0.1387
Iter [3900/300000]  Loss: 1.4756  Time/Image: 0.1380
WARN: resizing labels yielded fewer classes
Iter [3950/300000]  Loss: 1.3690  Time/Image: 0.1374
Iter [4000/300000]  Loss: 1.4399  Time/Image: 0.1379
0it [00:00, ?it/s]WARN: resizing labels yielded fewer classes
63it [00:55,  3.55it/s]
Overall Acc: 	 0.7558650777368152
Mean Acc : 	 0.2424623463158562
FreqW Acc : 	 0.620776533991615
Mean IoU : 	 0.18858147214744353

As you can see, after almost 4000 iterations the mean IoU is still 0.18. Is that normal? I don't see any continued improvement.
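
For context on the gap between overall accuracy and mean IoU in logs like the one above, here is a toy sketch of how these scores follow from a confusion matrix. The matrix is made up and segmentation_scores is a hypothetical helper, not the repository's metric code; it shows that a model predicting only the dominant class can score high overall accuracy and low mean IoU at the same time.

```python
def segmentation_scores(confusion):
    """confusion[t][p] = number of pixels with true class t predicted as p."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    accs, ious = [], []
    for c in range(n):
        true_c = sum(confusion[c])                       # pixels labelled c
        pred_c = sum(confusion[t][c] for t in range(n))  # pixels predicted c
        union = true_c + pred_c - confusion[c][c]
        if true_c:
            accs.append(confusion[c][c] / true_c)
        if union:
            ious.append(confusion[c][c] / union)
    return {
        "overall_acc": correct / total,
        "mean_acc": sum(accs) / len(accs),
        "mean_iou": sum(ious) / len(ious),
    }


# Made-up 3-class case: every pixel is predicted as class 0.
confusion = [[80, 0, 0], [15, 0, 0], [5, 0, 0]]
scores = segmentation_scores(confusion)
print(scores)
```

In this toy case the overall accuracy is 0.8 while the mean IoU is about 0.27, because two of the three classes have an IoU of zero.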

@adam9500370
Contributor

adam9500370 commented Nov 21, 2018

Because road pixels make up a high proportion of the Cityscapes dataset, you may need class balancing, i.e., higher loss weights for the rare classes. (reference: https://github.com/Eromera/erfnet_pytorch/blob/09efaac1dc7829e3719552cbe1e63183368f916d/train/main.py#L88-L131)
In addition, because the Cityscapes dataset has only ~3000 training samples, you may need some data augmentation.
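
One common way to derive such class weights is the logarithmic scheme w_k = 1 / ln(c + f_k) described in the ENet paper, where f_k is the share of pixels belonging to class k. The sketch below assumes that scheme with c = 1.02 and uses made-up pixel counts; it is not taken from the linked script, and log_class_weights is a hypothetical helper.

```python
import math


def log_class_weights(pixel_counts, c=1.02):
    """weight_k = 1 / ln(c + freq_k), where freq_k is class k's pixel share.

    c=1.02 is an assumed constant; it bounds the largest possible weight.
    """
    total = sum(pixel_counts)
    return [1.0 / math.log(c + count / total) for count in pixel_counts]


# Made-up pixel counts: one dominant class and two rarer ones.
counts = [900_000, 90_000, 10_000]
weights = log_class_weights(counts)
for count, weight in zip(counts, weights):
    print(count, round(weight, 2))
```

The resulting list could then be passed as the per-class weight of the cross-entropy loss, so that mistakes on rare classes cost more than mistakes on the dominant class.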

@lfdeep

lfdeep commented Nov 22, 2018

Hi, @jinfagang .
You can set multi_scale_cross_entropy loss function in config file.

loss:
    name: 'multi_scale_cross_entropy'

And change 'exponent' tensor type to float and set the corresponding device (in ptsemseg/loss/loss.py#L36):

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

When I run pspnet with the loss modified to the line above, this error occurs:

AttributeError: 'tuple' object has no attribute 'is_cuda'

I don't know how to solve it.

@adam9500370
Contributor

Replace

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if input.is_cuda else 'cpu')

with

scale_weight = torch.pow(scale * torch.ones(n_inp), torch.arange(n_inp).float()).to('cuda' if target.is_cuda else 'cpu')

to avoid handling different input types in different phases (input can be a tuple, as in the error above, while target is a tensor).

@lfdeep

lfdeep commented Nov 22, 2018

Thank you very much! But my results look unusual:
Iter [450/300000] Loss: 0.5713 Time/Image: 2.4058
Iter [460/300000] Loss: 2.1904 Time/Image: 2.2330
Iter [470/300000] Loss: 3.7478 Time/Image: 2.2353
Iter [480/300000] Loss: 1.8667 Time/Image: 2.2329
Iter [490/300000] Loss: 2.2474 Time/Image: 2.2363
Iter [500/300000] Loss: 1.5397 Time/Image: 2.2435
725it [16:00, 1.31s/it]
Iter 500 Loss on Val: 1.7601
Overall Acc: 0.735417399594
Mean Acc : 0.0471207022447
FreqW Acc : 0.550698099316
Mean IoU : 0.0352395812907
I set batch=2, lr=0.01, size_average: True, and I use the Pascal VOC + SBD datasets.

@adam9500370
Contributor

adam9500370 commented Nov 22, 2018

Because background pixels make up a high proportion of the Pascal VOC dataset, a model trained from scratch may tend to learn only the background class.
Therefore, you may need class balancing (higher loss weights for the rare classes), or you can set ignore_index=0 in F.cross_entropy to ignore the background class until the model has learned the other classes.
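
To make the effect of ignore_index concrete, here is a dependency-free sketch of a mean cross entropy that skips pixels whose target equals ignore_index. It illustrates the idea only and is not PyTorch's implementation; cross_entropy_mean and the toy logits are hypothetical.

```python
import math


def cross_entropy_mean(logits, targets, ignore_index=None):
    """Mean softmax cross entropy over pixels whose target != ignore_index.

    logits: one list of class scores per pixel; targets: one class id per pixel.
    """
    losses = []
    for scores, tgt in zip(logits, targets):
        if tgt == ignore_index:
            continue  # ignored pixels contribute nothing to the loss
        m = max(scores)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_sum_exp - scores[tgt])
    # Returning 0.0 when every pixel is ignored is a choice of this sketch.
    return sum(losses) / len(losses) if losses else 0.0


# Four pixels, two classes; the model favours class 0 (background) everywhere.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
targets = [0, 0, 0, 1]

loss_all = cross_entropy_mean(logits, targets)
loss_fg_only = cross_entropy_mean(logits, targets, ignore_index=0)
print(loss_all, loss_fg_only)
```

With the background pixels ignored, the loss reflects only the misclassified foreground pixel, so a model that always predicts background is no longer rewarded for it. The trade-off is that the model receives no training signal for the background class while it is ignored.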

You can also download the converted Caffe pretrained weights here, and set the img_norm=False and version="pascal" arguments in data_loader (to match the data preprocessing of the original Caffe implementation). Then use a larger batch size and a smaller learning rate to fine-tune the model on these datasets.

@lfdeep

lfdeep commented Nov 22, 2018

Thank you very much!

@zzh8829 mentioned this issue Feb 28, 2019
@erichhhhho

erichhhhho commented Mar 1, 2019

@lfdeep Hi, I ran into a similar problem. I was wondering how you solved it. Thank you.

@HareshKarnan

My network doesn't seem to learn even after 10000 training iterations. The mIoU is still at 0.20.
