I have had great success finetuning CNN-based image classifier backbone architectures on a variety of applications. I have also finetuned ViTs (i.e., the vanilla B/16 architecture with 16x16 patches) on some applications with success. For CNNs, finetuning seems to converge fine "regardless" of the input size. That is, I have for instance used […] Does anyone have any idea why that might be so?

Right now I'm doing a very naive finetuning scheme where I just lower the learning rate and finetune the entire backbone. I could play around with step-wise approaches; I was just surprised that there was a discrepancy between CNN and ViT finetuning for larger-sized images.

EDIT: It can also be noted that my classifier head is extremely simple: I remove the ViT head and append a […]
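For concreteness, here is roughly the kind of setup I mean. This is only a minimal sketch: the `timm` model name, the number of classes, the learning rate, the loss, and the linear head are illustrative placeholders, not my exact configuration.

```python
import torch
import torch.nn as nn
import timm
from torch.utils.data import DataLoader, TensorDataset

num_classes = 10  # placeholder

# Pretrained ViT-B/16 backbone with the original classifier head removed
# (num_classes=0 makes timm return pooled features instead of logits).
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Extremely simple classifier head appended on top of the pooled features.
model = nn.Sequential(backbone, nn.Linear(backbone.num_features, num_classes))

# "Naive" scheme: finetune the entire backbone with a lowered learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()  # placeholder loss

# Tiny synthetic loader just so the snippet runs end-to-end.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224),
                  torch.randint(0, num_classes, (8,))),
    batch_size=4,
)

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```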
I see that other people experience similar convergence issues with ViTs: […]
Seems like increasing the batch size (which initially was challenging due to OOM issues, so I used gradient accumulation), lowering the learning rate to `1e-4` with the `Adam` optimizer, and using `cross-entropy loss` instead of `focal loss` resolved the issue, at least for `GCViT`.

That a larger batch size is more crucial for ViTs makes sense, as they work fundamentally differently from CNNs, acting more as low-pass filters compared to CNNs acting as high-pass filters.
Will be interesting to see if the same applies to the other ViTs I tried, but at least this resolved my initial concern.