train error #65

Open
sdustdk1427 opened this issue May 9, 2019 · 5 comments

Comments

@sdustdk1427

Right after 000035.weights is saved, an error occurs and I don't know why. I have set the image size in the cfg to 416*416. The PyTorch version is 1.0.1. Please help me solve this issue. Thank you very much.

2019-05-09 17:08:44 [035] training with 49.642771 samples/s
2019-05-09 17:08:44 save weights to backup/000035.weights

2019-05-09 17:08:44 [036] processed 133992 samples, lr 1.000000e-03
Traceback (most recent call last):
File "train.py", line 375, in
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 219, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in next
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

@andy-yun
Owner

andy-yun commented May 9, 2019

@sdustdk1427
Author

Today I used your new dataset.py and train.py, but right after 000030.weights was saved I hit this problem again.
I read https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15, but I couldn't understand it, sorry.
What should I do? Thank you very much.

2019-05-10 07:59:33 [030] training with 48.296028 samples/s
2019-05-10 07:59:33 save weights to backup2/000030.weights

interim evaluating ...
2019-05-10 08:01:59 [030] correct: 1004, precision: 0.327783, recall: 0.657929, fscore: 0.437564
done evaluation.


2019-05-10 08:01:59 [031] processed 147839 samples, lr 1.000000e-03
Traceback (most recent call last):
File "train.py", line 377, in
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 221, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in next
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/RWJ/pytorch-0.4-yolov3-master/dataset.py", line 14, in custom_collate
data = torch.stack([item[0] for item in batch], 0)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 512 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

@andy-yun
Owner

andy-yun commented May 10, 2019

@sdustdk1427 In that case, you can inspect the failing batch as follows.
In dataset.py, expand the line data = torch.stack([item[0] for item in batch],0) into:

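try:
    data = torch.stack([item[0] for item in batch], 0)
except RuntimeError:
    # Print every image tensor's shape to find the sample that was not resized.
    # item[0] is already a tensor here, so PIL's getbands() is not available.
    for item in batch:
        print(tuple(item[0].shape))
    # Re-raise so the failure is still reported by the DataLoader.
    raise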

Perhaps the images are not all resized to the same size in training mode.
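
A self-contained variant of this check could look like the sketch below. It reports which samples in a batch differ from the most common image shape, without relying on torch itself: it only assumes that each sample is an (image, target) pair, that the batch is non-empty, and that the image exposes a .shape attribute, as torch tensors do. The names describe_batch_shapes and FakeImage are hypothetical and are not part of dataset.py.

```python
from collections import Counter


def describe_batch_shapes(batch):
    """Return (expected_shape, mismatches) for a batch of (image, target) samples.

    mismatches is a list of (index, shape) pairs for every sample whose image
    shape differs from the most common image shape in the batch.
    """
    shapes = [tuple(item[0].shape) for item in batch]
    expected, _ = Counter(shapes).most_common(1)[0]
    mismatches = [(i, s) for i, s in enumerate(shapes) if s != expected]
    return expected, mismatches


class FakeImage:
    """Stand-in for a tensor, used only so this sketch runs on its own."""

    def __init__(self, *shape):
        self.shape = shape


# Two images at 416x416 and one at 480x480, like the first traceback.
batch = [
    (FakeImage(3, 416, 416), None),
    (FakeImage(3, 416, 416), None),
    (FakeImage(3, 480, 480), None),
]
expected, mismatches = describe_batch_shapes(batch)
print("expected shape:", expected)
print("mismatched samples:", mismatches)
```

Called from inside the except RuntimeError branch of custom_collate, a helper like this would show the index of the offending sample in the batch, which can help trace it back to a specific image.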

@sdustdk1427
Author

I'd like to ask what the above code does. When I comment out def custom_collate(batch), training gets as far as 000050.weights, but I still run into the same problem as before:
258900: Layer(106) nGT 80, nRC 64, nRC75 25, nPP 107, loss: box 2.187, conf 3.256, class 2.181, total 7.624

2019-05-11 13:07:08 [050] training with 29.621098 samples/s
2019-05-11 13:07:08 save weights to backup5/000050.weights

interim evaluating ...
2019-05-11 13:10:04 [050] correct: 919, precision: 0.369373, recall: 0.526346, fscore: 0.434100
done evaluation.


2019-05-11 13:10:04 [051] processed 264078 samples, lr 1.000000e-03
258964: Layer(082) nGT 105, nRC 78, nRC75 31, nPP 114, loss: box 2.332, conf 2.150, class 1.424, total 5.906
258964: Layer(094) nGT 105, nRC 68, nRC75 17, nPP 0, loss: box 2.786, conf 5.809, class 6.787, total 15.382
258964: Layer(106) nGT 105, nRC 80, nRC75 28, nPP 97, loss: box 2.600, conf 4.034, class 3.364, total 9.998
Traceback (most recent call last):
File "train.py", line 377, in
main()
File "train.py", line 156, in main
nsamples = train(epoch)
File "train.py", line 221, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in next
return self._process_next_batch(batch)
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in
return [default_collate(samples) for samples in transposed]
File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 448 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307
What should I do?

@andy-yun
Owner

@sdustdk1427 If you comment out "def custom_collate", the default collate_fn is used, so the behavior is exactly the same as in your first report (without a custom collate_fn). The custom_collate function is there to check for differing image sizes or types. I don't know the exact state of your environment, so I can't tell whether your experimental setup is broken or there is a bug in my code. If the problem persists, I recommend trying another repo published on GitHub.
Thanks.
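
For readers who hit the same error, one possible workaround (not proposed or confirmed anywhere in this thread) is a collate function that forces every image to a single size before stacking. The sketch below assumes each sample is an (image, target) pair, that images are float tensors of shape (C, H, W), that all targets have the same shape, and that box labels are normalized so they do not change when the image is resized. The name resizing_collate and the target length of 250 are placeholders.

```python
import torch
import torch.nn.functional as F


def resizing_collate(batch, size=(416, 416)):
    """Stack (image, target) samples after resizing every image to `size`."""
    images = []
    for image, _ in batch:
        if tuple(image.shape[-2:]) != tuple(size):
            # interpolate expects a batch dimension: (1, C, H, W)
            image = F.interpolate(
                image.unsqueeze(0), size=size,
                mode="bilinear", align_corners=False,
            ).squeeze(0)
        images.append(image)
    data = torch.stack(images, 0)
    targets = torch.stack([target for _, target in batch], 0)
    return data, targets


# One image at 416x416 and one at 480x480, like the first traceback.
batch = [
    (torch.rand(3, 416, 416), torch.zeros(250)),
    (torch.rand(3, 480, 480), torch.zeros(250)),
]
data, targets = resizing_collate(batch)
print(data.shape, targets.shape)
```

Such a function would be passed to the DataLoader through its collate_fn argument. The sizes reported in the tracebacks (448, 480, 512) are all multiples of 32, which would be consistent with a multi-scale training schedule, though the thread does not confirm that; if so, forcing one size disables multi-scale training for the affected samples.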
