
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. ...: 1333 #55

Open
mrkieumy opened this issue Mar 18, 2019 · 14 comments


@mrkieumy

Hi @andy-yun,
I'm hitting this error (the same as #33):
Traceback (most recent call last):
  File "train.py", line 385, in <module>
    main()
  File "train.py", line 160, in main
    nsamples = train(epoch)
  File "train.py", line 229, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 623, in __next__
    return self._process_next_batch(batch)
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 480 and 416 in dimension 2 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1333
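
For reference, the failure is reproducible outside the dataloader: default_collate ultimately calls torch.stack, which requires every tensor in a batch to have an identical shape. A minimal sketch with hypothetical sizes, not the actual dataset code:

import torch

a = torch.zeros(3, 480, 480)   # image resized for one random scale
b = torch.zeros(3, 416, 416)   # image resized for another scale
torch.stack([a, b], 0)         # RuntimeError: sizes of tensors must match ...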

I think the problem is caused by the get_different_scale() method, because when I turn it off by setting shape = (img.width, img.height), the error goes away.
I set my image width and height to 544 x 480 because the original size is 640x512 and I don't want to scale down too far (to 416x416), so I used 544 x 480 (it is still divisible by 32).
Do you have any recommendation to fix this error?
Thanks & best regards.

@andy-yun
Owner

andy-yun commented Mar 19, 2019

@mrkieumy You can refer to the same issue at https://github.com/marvis/pytorch-yolo2/issues/89

Here's the reason.
https://medium.com/@yvanscher/pytorch-tip-yielding-image-sizes-6a776eb4115b

The solution is to set batch_size=1, or in get_different_scale() change the 64 to self.batch_size (re-download dataset.py).
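
Another possible workaround is a custom collate_fn that forces every image in a batch to one size before stacking, so torch.stack never sees mixed shapes. This is only a sketch, assuming labels are fixed-size tensors (as in this codebase) and a PyTorch recent enough to have F.interpolate (older versions use F.upsample); resize_collate is a hypothetical name, not part of dataset.py:

import torch
import torch.nn.functional as F

def resize_collate(batch):
    target = batch[0][0].shape[-2:]   # (H, W) of the first image in the batch
    imgs = torch.stack([F.interpolate(img.unsqueeze(0), size=target,
                                      mode='bilinear', align_corners=False).squeeze(0)
                        for img, _ in batch], 0)
    labels = torch.stack([lbl for _, lbl in batch], 0)   # assumes fixed-size labels
    return imgs, labels

# usage: DataLoader(dataset, batch_size=8, collate_fn=resize_collate)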

@mrkieumy
Author

mrkieumy commented Mar 25, 2019

Thanks @andy-yun. I changed the 64 to self.batch_size (and re-downloaded the dataset file), but it still errors. If I set batch_size=1, does that mean the dataloader loads one image at a time and the network trains with batch=1? If so, that's not good, because we want to train with the largest possible batch_size.
Any help is appreciated.
Thanks & Best regards.

@andy-yun
Owner

@mickolka Yup, setting batch_size=1 is recommended for the test environment.
How many GPUs do you use? I wonder whether different image sets are being used together.

@mrkieumy
Author

Hi @andy-yun, I have only 1 GPU. For the test step the batch size is always 2 images; when I set it to 1, it errors. But for training we don't want to set batch_size=1, right? We want to train with as large a batch size as possible, and my GPU (GTX 1080) can train V3 with a batch size of 8 at most.
For now I have commented out the get_different_scale() call and train only with the constant shape (544, 480). But the result will be worse compared to training at different scales. How can I use different scales without setting batch_size=1?
Thanks.

@andy-yun
Owner

Hi @mrkieumy Would you change the following 64 to self.batch_size?
Line 57 of dataset.py:
if index % 64 == 0:
-->
if index % (self.batch_size * 10) == 0:

After checking the above code, please report back to me. Thanks.
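
For context, the intent of that line, as a simplified sketch of __getitem__ in dataset.py (load_sample is a hypothetical stand-in for the actual loading code), is roughly:

def __getitem__(self, index):
    # every (batch_size * 10) samples, pick a new random input resolution;
    # subsequent samples are resized to self.shape until the next trigger
    if self.train and index % (self.batch_size * 10) == 0:
        self.shape = self.get_different_scale()
    img, label = self.load_sample(index, self.shape)   # hypothetical helper
    return img, label

Note that index is the dataset index handed out by the sampler, which matters with shuffle=True, as the comments below show.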

@mrkieumy
Author

mrkieumy commented Apr 2, 2019

Hi @andy-yun,
I changed everything exactly as you said, but it still errors. I also tried crop=True with those sizes, but it errors the same way.
Do you know where the problem is? How do you train with different scales without this error?
If I understand correctly, every 10*batch_size samples the shape is re-randomized in the get_different_scale() function (with equal width and height), and the data loader then loads images at that shape. So the shape is supposed to be the same within a batch, yet it raises the mismatched-dimension error within a batch.
How do I make every batch have the same shape?
Thanks.

@andy-yun
Owner

andy-yun commented Apr 2, 2019

@mrkieumy I don't know what the exact problem is. But in my opinion the code works well for other people, so I suspect your dataset and environment. Cheers.

@mrkieumy
Author

mrkieumy commented Apr 3, 2019

@andy-yun,
Thanks for your reply.
After printing the index, I saw that the dataloader loads images shuffled, so the index is not in order. I noticed that self.seen does increase in order, so I changed:
if index % (self.batch_size * 10) == 0:
-->
if self.seen % (self.batch_size * 10) == 0:
It has worked for 20 epochs so far.
I hope that was the last piece needed to solve this problem; I don't know whether it is fully correct. I'll let you know if anything else comes up.
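
To illustrate why the switch helps (assuming self.seen is incremented once per loaded sample, as in this dataset class): with shuffle=True the sampler yields dataset indices in random order, so index % (self.batch_size * 10) == 0 can fire in the middle of a batch and re-randomize self.shape mid-batch, whereas self.seen grows monotonically and crosses a multiple of batch_size * 10 only at a batch boundary:

# shuffled indices arrive like [512, 7, 80, 300, ...]; with batch_size=8,
# index 80 satisfies 80 % (8 * 10) == 0 and could change self.shape as the
# third sample of a batch
if self.seen % (self.batch_size * 10) == 0:   # fires only between full batches
    self.shape = self.get_different_scale()
self.seen = self.seen + 1                     # monotonic, one step per sample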

One other thing: in your repo, you should fix lines 425 and 427 of darknet.py:
save_fc(fc, model) --> save_fc(fp, model)
because fc was never declared; it must be fp (the file handle). Since YOLOv3 has no fully connected layer, nobody has exercised this code, but in my case I added some fully connected layers.
The remaining problem is that I still cannot save the weight file for the fully connected layers, because the save_fc() function in cfg.py complains that the fc module has no bias and weight properties. For now I save the whole model instead.
Lastly, could you help me by explaining #59?
Thanks.

@andy-yun
Owner

andy-yun commented Apr 3, 2019

Thanks @mrkieumy, I updated the code.

@zhangguotai

I modified my code, but the problem still exists.
Traceback (most recent call last):
  File "train.py", line 379, in <module>
    main()
  File "train.py", line 156, in main
    nsamples = train(epoch)
  File "train.py", line 222, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 336, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 357, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 187, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 164, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at /pytorch/aten/src/TH/generic/THTensorMath.cpp:3616

I am training on the VOC dataset; the image size is 416*416, batch_size = 8, and I use 1 GPU.
Do you have any recommendation to fix this error?

@richard0326

I have the same problem, and it seems I have downloaded the updated source code.
Can you help me with this problem?
It happens after epoch 15.
[screenshot of the error]

@sgflower66

sgflower66 commented Aug 29, 2019

I met the same problem after epoch 15. (PyTorch 1.0, Python 3.6.3, my own data, 4 GPUs)
[screenshot of the error]

Through reading the previous problems and solutions, I guess the problem is in dataset.py, line 53:

def get_different_scale(self):
    if self.seen < 4000*self.batch_size:
        wh = 13*32                              # 416
    elif self.seen < 8000*self.batch_size:
        wh = (random.randint(0,3) + 13)*32      # 416, 480
    elif self.seen < 12000*self.batch_size:
        wh = (random.randint(0,5) + 12)*32      # 384, ..., 544
    .....
so maybe we get different shapes in the same batch (dataset.py, line 14):

def custom_collate(batch):
    data = torch.stack([item[0] for item in batch], 0)

e.g. [X, X, 416, X] and [X, X, 317, X]

Although the shape change only happens when self.seen crosses an xx*self.batch_size boundary, maybe the error is due to multi-GPU?
This is just a guess, and I don't know how to solve it. Many people seem to have the same question, so the problem may be important. Looking forward to your reply~
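
One thing worth checking, under the assumption that num_workers > 0 is set on the DataLoader: each worker process holds its own copy of the dataset object, so a counter like self.seen that is mutated inside __getitem__ is never shared between workers, and the workers' scale schedules can drift apart. A tiny self-contained demo of that pitfall (toy dataset, not the repo's code):

import torch

class Counting(torch.utils.data.Dataset):
    def __init__(self):
        self.seen = 0
    def __len__(self):
        return 8
    def __getitem__(self, i):
        self.seen += 1      # each worker increments its own copy
        return self.seen

if __name__ == '__main__':
    loader = torch.utils.data.DataLoader(Counting(), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch)        # tensor([1, 2, 3, 4]) twice: state is per-worker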

@Ginbor

Ginbor commented Nov 2, 2019

In my case, the problem disappeared when I didn't use the savemodel() function; I suppose the problem appears after cur_model.save_weights(). Also, in my case the training set satisfies len(train_dataset) % batch_size == 0.
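
That divisibility can matter: if len(train_dataset) % batch_size != 0, the final batch of each epoch is shorter, which breaks any trigger that assumes full batches. A hedged sketch (the batch_size and num_workers values are placeholders): passing drop_last=True to the DataLoader restores the full-batch invariant.

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=8, shuffle=True,
    num_workers=4, drop_last=True)   # discard the short final batch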
