I have had great success finetuning CNN-based image classifier backbone architectures on a variety of applications. I have also finetuned ViTs (i.e., the vanilla B/16 architecture with 16x16 patches) on some applications with success. For CNNs, finetuning seems to converge fine "regardless" of the input size. That is, I have for instance used […] Does anyone have any idea why that might be so?

Right now I'm doing a very naive finetuning scheme where I just lower the learning rate and finetune the entire backbone. I could play around with step-wise approaches; I was just surprised that there was a discrepancy between CNN and ViT finetuning for larger-sized images.

EDIT: It can also be noted that my classifier head is extremely simple: I remove the ViT head and append a […]
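For concreteness, here is roughly the kind of setup I mean. This is only a minimal sketch: the `timm` model name, the number of classes, the learning rate, the loss, and the linear head are illustrative placeholders, not my exact configuration.

```python
import torch
import torch.nn as nn
import timm
from torch.utils.data import DataLoader, TensorDataset

num_classes = 10  # placeholder

# Pretrained ViT-B/16 backbone with the original classifier head removed
# (num_classes=0 makes timm return pooled features instead of logits).
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Extremely simple classifier head appended on top of the pooled features.
model = nn.Sequential(backbone, nn.Linear(backbone.num_features, num_classes))

# "Naive" scheme: finetune the entire backbone with a lowered learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()  # placeholder loss

# Tiny synthetic loader just so the snippet runs end-to-end.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224),
                  torch.randint(0, num_classes, (8,))),
    batch_size=4,
)

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```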
I see that other people experience similar convergence issues with ViTs: […]
Seems like increasing the batch size (which initially was challenging due to OOM issues, so I used gradient accumulation), lowering the learning rate to `1e-4` with the `Adam` optimizer, and using `cross-entropy loss` instead of `focal loss` resolved the issue, at least for `GCViT`.

That a larger batch size is more crucial for ViTs makes sense, as they work fundamentally differently from CNNs, acting more as low-pass filters compared to CNNs acting as high-pass filters.
Will be interesting to see if the same applies to the other ViTs I tried, but at least this resolved my initial concern.