- https://github.com/kentaroy47/vision-transformers-cifar10
- https://github.com/FrancescoSaverioZuppichini/ViT
- https://github.com/lucidrains/vit-pytorch
- https://github.com/facebookresearch/deit
Let's train vision transformers for CIFAR-10!
This is an unofficial and elementary implementation of "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
I use PyTorch for the implementation.
- Added ConvMixer implementation. Really simple! (2021/10)
- Added wandb train log to reproduce results. (2022/3)
python train_cifar10.py --lr 1e-4 --aug --n_epochs 200  # vit-patchsize-4
python train_cifar10.py --patch 2 --lr 1e-4 --aug --n_epochs 200  # vit-patchsize-2
python train_cifar10.py --net vit_timm --lr 1e-4  # train with pretrained vit
python train_cifar10.py --net convmixer --aug --n_epochs 200  # train with convmixer
python train_cifar10.py --net res18  # resnet18
python train_cifar10.py --net res18 --aug --n_epochs 200  # resnet18+randaug
Model | Accuracy | Train Log |
---|---|---|
ViT patch=2 | 80% | |
ViT patch=4 | 80% | Log |
ViT patch=8 | 30% | |
ViT small (timm transfer) | 97.5% | |
ViT base (timm transfer) | 98.5% | |
ConvMixerTiny (no pretrain) | 96.3% | |
resnet18 | 93% | |
resnet18+randaug | 95% | Log |
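
The patch sizes compared in the table determine how many tokens the ViT sees per image. As a rough illustration, the sketch below computes the patch count and flattened patch dimension, assuming images stay at CIFAR-10's native 32x32 resolution with 3 channels and are not resized before patching. The helper `patch_stats` is a hypothetical name used only for this example; it is not part of `train_cifar10.py`.

```python
# Illustrative sketch only; patch_stats is a hypothetical helper,
# not a function from train_cifar10.py.
IMAGE_SIZE = 32  # CIFAR-10 images are 32x32 pixels (assumes no resizing)
CHANNELS = 3     # RGB


def patch_stats(patch_size, image_size=IMAGE_SIZE, channels=CHANNELS):
    """Return (number of patches, flattened patch dimension) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size      # patches along one edge
    num_patches = per_side ** 2              # tokens fed to the transformer
    patch_dim = channels * patch_size ** 2   # values in each flattened patch
    return num_patches, patch_dim


for p in (2, 4, 8):
    n, d = patch_stats(p)
    print(f"patch={p}: {n} patches, each flattened to {d} values")
```

Under these assumptions, halving the patch size quadruples the sequence length, so smaller patches cost noticeably more compute per image.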