Replies: 4 comments
-
@JaejinCho hi, do not worry about tagging :) In general I think I used the model with larger batch sizes on smaller GPUs. Looking at the error, it looks like a negative label in one of the losses. Any chance this might be happening only for larger batch sizes because of some property of the data you are training on? If this is not the case, can you share some more info so I can help better?
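Something like the following sketch is what I have in mind for checking the labels (the vocab size and tensor names are placeholders, adapt them to your tokenizer and dataloader); note that ids greater than or equal to the vocabulary size would trip the same kernel assert as negative ones:

```python
import torch

# Hypothetical values -- replace with your own tokenizer / dataloader settings.
vocab_size = 49408   # e.g. the size of the tokenizer vocabulary you train with

def check_caption_labels(labels: torch.Tensor) -> None:
    """Raise if any caption token id could trip the NLL-loss device-side assert."""
    if (labels < 0).any():
        raise ValueError(f"negative label ids: {labels[labels < 0].unique().tolist()}")
    if (labels >= vocab_size).any():
        raise ValueError(
            f"label ids >= vocab_size ({vocab_size}): "
            f"{labels[labels >= vocab_size].unique().tolist()}"
        )

# Example on a fake batch of token ids; with a real dataloader you would loop over
# its batches and check the caption labels the same way.
labels = torch.randint(0, vocab_size, (12, 76))   # (batch_size, seq_len)
check_caption_labels(labels)
```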
-
Thank you @gpucce for your answer! :) I think I did not explain the batch size difference well, so please disregard it for now; I may need to investigate it after the error above is fixed. To check for negative labels, I ran something like `assert torch.sum(labels < 0) == 0` and did not get an assertion error, but the error above keeps being printed. I could check once more by setting up a toy dataset with my own correct labels to see if the same error is raised. But before that, I would like to rule out the PyTorch version, so would you mind sharing which PyTorch version you used for your experiments? I tried 2.0.* (with CUDA 11.7 and 11.8) and a 2.1.* nightly (with CUDA 12.1). Please let me know if you need more information. Thank you!
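For reference, a toy example of the kind of check I mean (the shapes and vocab size here are made up, not from my actual data): a non-negative but out-of-range token id passes the `labels < 0` check yet still triggers the same assert in the NLL loss kernel.

```python
import torch
import torch.nn.functional as F

# WARNING: on a GPU this intentionally triggers the device-side assert and leaves
# the CUDA context unusable, so run it in a throwaway process.
device = "cuda" if torch.cuda.is_available() else "cpu"

vocab_size = 100
logits = torch.randn(4, vocab_size, 8, device=device)        # (batch, classes, seq_len)
labels = torch.randint(0, vocab_size, (4, 8), device=device)  # valid token ids
labels[0, 0] = vocab_size + 5                                 # out of range, but NOT negative

assert torch.sum(labels < 0) == 0        # the negative-label check passes...
loss = F.cross_entropy(logits, labels)   # ...yet this still fails: device-side assert on
                                         # GPU, "Target ... is out of bounds" on CPU
```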
-
I have the same issue, but I also get the error for small batch sizes. Did you figure out whether it is a PyTorch version issue?
-
@JaejinCho @tillaczel I am using PyTorch 1.13.1, but it might not be the issue. Are you trying to fine-tune a pre-trained model or pretrain a new one?
-
Hello,
First of all, thank you for sharing the code with the public!
I have a question regarding CoCa model training. I am trying to fine-tune the model ('coca_ViT-L-14', 'laion2b_s13b_b90k'), but I cannot use a per-GPU batch size higher than 12: with 16 I get an error that I thought was related to CUDA OOM. Below is part of the error log (I've already set CUDA_LAUNCH_BLOCKING=1). BTW, this is on an A100 80GB GPU, and with bs <= 12 the code runs fine without errors.
...
2023-07-26,17:28:31 | INFO | val_num_samples: None
2023-07-26,17:28:31 | INFO | wandb: False
2023-07-26,17:28:31 | INFO | wandb_notes:
2023-07-26,17:28:31 | INFO | wandb_project_name: open-clip
2023-07-26,17:28:31 | INFO | warmup: 7000
2023-07-26,17:28:31 | INFO | wd: 0.0001
2023-07-26,17:28:31 | INFO | workers: 4
2023-07-26,17:28:31 | INFO | world_size: 1
2023-07-26,17:28:31 | INFO | zeroshot_frequency: 1
2023-07-26,17:28:31 | INFO | Start epoch 0
...
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [26,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [27,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [28,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [29,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [30,0,0] Assertion `input_index >= 0` failed.
../aten/src/ATen/native/cuda/NLLLoss2d.cu:106: nll_loss2d_forward_kernel: block: [618,0,0], thread: [31,0,0] Assertion `input_index >= 0` failed.
Traceback (most recent call last):
File "open_clip/src/training/main.py", line 496, in
main(sys.argv[1:])
File "open_clip/src/training/main.py", line 424, in main
train_one_epoch(model, data, loss, epoch, optimizer, scaler, scheduler, dist_model, args, tb_writer=writer)
File "open_clip/src/training/train.py", line 156, in train_one_epoch
backward(total_loss, scaler)
File "open_clip/src/training/train.py", line 61, in backward
scaler.scale(total_loss).backward()
File "/root/miniconda/envs/dev/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/miniconda/envs/dev/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also, when using a V100 32GB GPU, the maximum batch size I can use is about 100 with my customized CLIP model vs. only 8 with ('coca_ViT-L-14', 'laion2b_s13b_b90k'). Is this batch size difference normal?
@lucidrains @gpucce @iejMac Sorry to bother you with the tags 🙏
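For debugging, one option is to replay a single failing batch on CPU, where an out-of-range target raises a plain Python exception with the offending value instead of an opaque CUDA assert. A rough sketch, with `model`, `loss_fn`, `images`, and `texts` as placeholders for whatever the training script builds (not the actual open_clip objects):

```python
import torch

def debug_batch_on_cpu(model, loss_fn, images, texts):
    """Replay one training batch on CPU so a bad caption token raises a readable
    error (e.g. "Target N is out of bounds.") instead of a device-side assert.

    All arguments are placeholders; mirror how your train.py actually calls the
    model and the loss, since the exact signature depends on the open_clip version.
    """
    model = model.to("cpu").float()   # drop AMP/fp16 so the CPU pass is straightforward
    images = images.to("cpu")
    texts = texts.to("cpu")

    model_out = model(images, texts)
    losses = loss_fn(**model_out) if isinstance(model_out, dict) else loss_fn(*model_out)

    # The loss may come back as a single tensor, a tuple, or a dict of components;
    # the out-of-bounds error (if any) is raised during the loss computation above.
    if isinstance(losses, dict):
        return sum(losses.values())
    if isinstance(losses, (tuple, list)):
        return sum(losses)
    return losses
```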