Replies: 4 comments
-
@JaejinCho hi, do not worry about tagging :) In general I think I used the model with larger batch sizes on smaller GPUs. Looking at the error, it looks like a negative label in one of the losses. Any chance this might be happening only for larger batch sizes because of some property of the data you are training on? If this is not the case, can you share some more info so I can help better?
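Something like the following sketch is what I have in mind for checking the labels (the vocab size and tensor names are placeholders, adapt them to your tokenizer and dataloader); note that ids greater than or equal to the vocabulary size would trip the same kernel assert as negative ones:

```python
import torch

# Hypothetical values -- replace with your own tokenizer / dataloader settings.
vocab_size = 49408   # e.g. the size of the tokenizer vocabulary you train with

def check_caption_labels(labels: torch.Tensor) -> None:
    """Raise if any caption token id could trip the NLL-loss device-side assert."""
    if (labels < 0).any():
        raise ValueError(f"negative label ids: {labels[labels < 0].unique().tolist()}")
    if (labels >= vocab_size).any():
        raise ValueError(
            f"label ids >= vocab_size ({vocab_size}): "
            f"{labels[labels >= vocab_size].unique().tolist()}"
        )

# Example on a fake batch of token ids; with a real dataloader you would loop over
# its batches and check the caption labels the same way.
labels = torch.randint(0, vocab_size, (12, 76))   # (batch_size, seq_len)
check_caption_labels(labels)
```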
-
Thank you @gpucce for your answer! :) I think I did not explain the batch size difference well, so please disregard it for now; I may need to investigate it after the error above is fixed. To check for negative labels, I ran something like `assert torch.sum(labels < 0) == 0` and did not get an assertion error, but the error above keeps being printed. I could check once more by setting up a toy dataset with my own correct labels to see if the same error is raised. But before that, I would like to rule out the PyTorch version, so would you mind sharing which PyTorch version you used for your experiments? I tried 2.0.* (with CUDA 11.7 and 11.8) and a 2.1.* nightly (with CUDA 12.1). Please let me know if you need more information. Thank you!
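For reference, a toy example of the kind of check I mean (the shapes and vocab size here are made up, not from my actual data): a non-negative but out-of-range token id passes the `labels < 0` check yet still triggers the same assert in the NLL loss kernel.

```python
import torch
import torch.nn.functional as F

# WARNING: on a GPU this intentionally triggers the device-side assert and leaves
# the CUDA context unusable, so run it in a throwaway process.
device = "cuda" if torch.cuda.is_available() else "cpu"

vocab_size = 100
logits = torch.randn(4, vocab_size, 8, device=device)        # (batch, classes, seq_len)
labels = torch.randint(0, vocab_size, (4, 8), device=device)  # valid token ids
labels[0, 0] = vocab_size + 5                                 # out of range, but NOT negative

assert torch.sum(labels < 0) == 0        # the negative-label check passes...
loss = F.cross_entropy(logits, labels)   # ...yet this still fails: device-side assert on
                                         # GPU, "Target ... is out of bounds" on CPU
```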
-
I have the same issue, but I also get the error for small batch sizes. Did you figure out whether it is a PyTorch version issue?
-
@JaejinCho @tillaczel I am using PyTorch 1.13.1, but it might not be the issue. Are you trying to fine-tune a pre-trained model or pretrain a new one?
-
Hello,
First of all, thank you for sharing the code with the public!
I have a question regarding CoCa model training. I am trying to fine-tune the model ('coca_ViT-L-14', 'laion2b_s13b_b90k'), but I cannot use a per-GPU batch size higher than 12: with 16 I get an error that I thought was related to CUDA OOM. Below is part of the error log (I've already set CUDA_LAUNCH_BLOCKING=1). BTW, this is on an A100 80GB GPU, and with bs <= 12 the code runs fine without errors.
...
2023-07-26,17:28:31 | INFO | val_num_samples: None
2023-07-26,17:28:31 | INFO | wandb: False
2023-07-26,17:28:31 | INFO | wandb_notes:
2023-07-26,17:28:31 | INFO | wandb_project_name: open-clip
2023-07-26,17:28:31 | INFO | warmup: 7000
2023-07-26,17:28:31 | INFO | wd: 0.0001
2023-07-26,17:28:31 | INFO | workers: 4
2023-07-26,17:28:31 | INFO | world_size: 1
2023-07-26,17:28:31 | INFO | zeroshot_frequency: 1
2023-07-26,17:28:31 | INFO | Start epoch 0
...
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [26,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [27,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [28,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [29,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [30,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [31,0,0] Assertion `input_index >= 0` failed.
Traceback (most recent call last):
File "open_clip/src/training/main.py", line 496, in
main(sys.argv[1:])
File "open_clip/src/training/main.py", line 424, in main
train_one_epoch(model, data, loss, epoch, optimizer, scaler, scheduler, dist_model, args, tb_writer=writer)
File "open_clip/src/training/train.py", line 156, in train_one_epoch
backward(total_loss, scaler)
File "open_clip/src/training/train.py", line 61, in backward
scaler.scale(total_loss).backward()
File "/root/miniconda/envs/dev/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/miniconda/envs/dev/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also, when using a V100 32GB GPU, the maximum batch size I can use is about 100 with my customized CLIP model vs. only 8 with ('coca_ViT-L-14', 'laion2b_s13b_b90k'). Is this batch size difference normal?
@lucidrains @gpucce @iejMac Sorry to bother you with the tags 🙏
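For debugging, one option is to replay a single failing batch on CPU, where an out-of-range target raises a plain Python exception with the offending value instead of an opaque CUDA assert. A rough sketch, with `model`, `loss_fn`, `images`, and `texts` as placeholders for whatever the training script builds (not the actual open_clip objects):

```python
import torch

def debug_batch_on_cpu(model, loss_fn, images, texts):
    """Replay one training batch on CPU so a bad caption token raises a readable
    error (e.g. "Target N is out of bounds.") instead of a device-side assert.

    All arguments are placeholders; mirror how your train.py actually calls the
    model and the loss, since the exact signature depends on the open_clip version.
    """
    model = model.to("cpu").float()   # drop AMP/fp16 so the CPU pass is straightforward
    images = images.to("cpu")
    texts = texts.to("cpu")

    model_out = model(images, texts)
    losses = loss_fn(**model_out) if isinstance(model_out, dict) else loss_fn(*model_out)

    # The loss may come back as a single tensor, a tuple, or a dict of components;
    # the out-of-bounds error (if any) is raised during the loss computation above.
    if isinstance(losses, dict):
        return sum(losses.values())
    if isinstance(losses, (tuple, list)):
        return sum(losses)
    return losses
```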