train error #65

sdustdk1427 · 2019-05-09T09:31:32Z

When training reached 000035.weights, an error occurred and I don't know why. I have set the image size in the cfg to 416*416. The PyTorch version is 1.0.1. Please help me solve this issue. Thank you very much.

2019-05-09 17:08:44 [035] training with 49.642771 samples/s
2019-05-09 17:08:44 save weights to backup/000035.weights

2019-05-09 17:08:44 [036] processed 133992 samples, lr 1.000000e-03
Traceback (most recent call last):
File "train.py", line 375, in <module>
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 219, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

andy-yun · 2019-05-09T14:02:10Z

@sdustdk1427 This is the same error as #55.
I updated dataset.py and train.py. Please try the new code.
Refer to https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15

sdustdk1427 · 2019-05-10T03:26:36Z

Today I used your new dataset.py and train.py, but after saving 000030.weights I hit this problem again!
I read https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15 but I couldn't understand it. Sorry.
So what should I do? Thank you very much.

2019-05-10 07:59:33 [030] training with 48.296028 samples/s
2019-05-10 07:59:33 save weights to backup2/000030.weights

interim evaluating ...
2019-05-10 08:01:59 [030] correct: 1004, precision: 0.327783, recall: 0.657929, fscore: 0.437564
done evaluation.

2019-05-10 08:01:59 [031] processed 147839 samples, lr 1.000000e-03
Traceback (most recent call last):
File "train.py", line 377, in <module>
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 221, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/RWJ/pytorch-0.4-yolov3-master/dataset.py", line 14, in custom_collate
data = torch.stack([item[0] for item in batch], 0)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 512 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

andy-yun · 2019-05-10T04:40:28Z

@sdustdk1427 In that case, you can check the information as follows:
In dataset.py, expand the line data = torch.stack([item[0] for item in batch],0) into:

try:
    data = torch.stack([item[0] for item in batch], 0)
except RuntimeError:
    import sys
    # item[0] is already a tensor at this point, so print its shape
    # (getbands() exists only on PIL images and would fail here).
    for item in batch:
        print(item[0].size())
    sys.exit(0)

Maybe the images are not all resized to the same size in training mode.
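The check suggested above can also be written as a small helper that reports which items in the batch differ from the rest, instead of printing every shape. This is only a sketch: `describe_batch_shapes` is a hypothetical name, not part of dataset.py, and it assumes each batch item is an `(image, target)` pair whose image exposes a `.shape` attribute, as a PyTorch tensor does.

```python
from collections import Counter


def describe_batch_shapes(batch):
    """Summarize image shapes in a non-empty batch of (image, target) pairs.

    Returns (majority_shape, outliers), where outliers is a list of
    (position_in_batch, shape) for every image whose shape differs
    from the most common one.
    """
    # tuple() accepts torch.Size as well as plain tuples or lists.
    shapes = [tuple(item[0].shape) for item in batch]
    majority_shape, _count = Counter(shapes).most_common(1)[0]
    outliers = [(i, s) for i, s in enumerate(shapes) if s != majority_shape]
    return majority_shape, outliers
```

One possible use is to call `print(describe_batch_shapes(batch))` inside the `except RuntimeError:` branch, so the log shows only the positions of the samples that were resized differently.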

sdustdk1427 · 2019-05-11T07:48:37Z

I'd like to ask what the above code does. When I comment out def custom_collate(batch), training reaches 000050.weights, but I still run into the same problem as before:
258900: Layer(106) nGT 80, nRC 64, nRC75 25, nPP 107, loss: box 2.187, conf 3.256, class 2.181, total 7.624

2019-05-11 13:07:08 [050] training with 29.621098 samples/s
2019-05-11 13:07:08 save weights to backup5/000050.weights

interim evaluating ...
2019-05-11 13:10:04 [050] correct: 919, precision: 0.369373, recall: 0.526346, fscore: 0.434100
done evaluation.

2019-05-11 13:10:04 [051] processed 264078 samples, lr 1.000000e-03
258964: Layer(082) nGT 105, nRC 78, nRC75 31, nPP 114, loss: box 2.332, conf 2.150, class 1.424, total 5.906
258964: Layer(094) nGT 105, nRC 68, nRC75 17, nPP 0, loss: box 2.786, conf 5.809, class 6.787, total 15.382
258964: Layer(106) nGT 105, nRC 80, nRC75 28, nPP 97, loss: box 2.600, conf 4.034, class 3.364, total 9.998
Traceback (most recent call last):
File "train.py", line 377, in <module>
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 221, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 448 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307
What should I do?

andy-yun · 2019-05-11T12:38:25Z

@sdustdk1427 If you comment out "def custom_collate", the default collate_fn is used, so the behavior is exactly the same as in the first case (without a custom collate_fn). The custom_collate function is only there to check for mismatched image sizes or types. I don't know the exact state of your environment; either your experimental setup is inconsistent or there is a bug in my code. If the problem persists, I recommend trying another repo published on GitHub.
Thanks.
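Both tracebacks in this thread end in torch.stack, once from default_collate and once from custom_collate, so switching between the two collate functions does not change the outcome. If training has to continue while the resize problem is being tracked down, one possible stopgap is to drop the mismatched samples before stacking. The sketch below is an assumption-laden illustration, not code from this repo: `keep_majority_shape` is a hypothetical helper, and it assumes each batch item is an `(image, target)` pair whose image has a `.shape` attribute. It hides the root cause and makes some batches smaller, so it is not a fix.

```python
from collections import Counter


def keep_majority_shape(batch):
    """Return only the (image, target) pairs whose image has the most common shape.

    Assumes a non-empty batch. Samples with any other shape are dropped
    and the number dropped is printed so the problem stays visible in the log.
    """
    shapes = [tuple(item[0].shape) for item in batch]
    majority_shape, _count = Counter(shapes).most_common(1)[0]
    kept = [item for item, s in zip(batch, shapes) if s == majority_shape]
    dropped = len(batch) - len(kept)
    if dropped:
        print("dropped %d of %d samples with a non-majority shape"
              % (dropped, len(batch)))
    return kept
```

In a custom collate function this would be applied as `batch = keep_majority_shape(batch)` before the torch.stack call; the printed count gives a rough idea of how often the mismatch occurs.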
