This repository contains the implementation for the paper *Training a Vision Transformer from scratch in less than 24 hours with 1 GPU*, published in the HiTY workshop at NeurIPS 2022.
The implementation provides PyTorch training and evaluation code based on DeiT. We also use and adapt some code from LocalViT, timm, and torchvision.
All experiments build on the DeiT-small model, which we aim to train more efficiently in both time (under 24 hours) and hardware (a single GPU). Our modifications include removing warm-up, an improved LocalViT model, and our own multi-size training. The code also optionally supports LayerScale.
Our best results are shown below.
Before using the code, make sure you have Ross Wightman's pytorch-image-models package installed (timm==0.3.2).
First, clone the repository locally:
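For example (the URL below is a placeholder; substitute this repository's actual address):

```bash
git clone <repository-url>
cd <repository-directory>
```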
Then, install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models 0.3.2:
```bash
conda install -c pytorch pytorch torchvision
pip install timm==0.3.2
```
Download and extract ImageNet train and val images from http://image-net.org/.
The directory structure is the standard layout expected by torchvision's `datasets.ImageFolder`, with the training and validation data in the `train/` and `val/` folders respectively:
```
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
```
In all single-GPU experiments we use `--batch-size 64` and `--lr 1e-3`. (If you want to experiment with 4 GPUs, use `--batch-size 128` and `--lr 2e-4`.) We stop the training after 1 day.
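As an illustrative sketch, a single-GPU run could look like the command below. This assumes the DeiT-style `main.py` entry point and its `--data-path`/`--output_dir` flags carry over to this codebase, and that `deit_small_patch16_224` is the model name; check the provided scripts for the exact arguments used here.

```bash
# Hypothetical invocation, assuming DeiT's CLI is preserved in this repo.
python main.py \
    --model deit_small_patch16_224 \
    --batch-size 64 \
    --lr 1e-3 \
    --data-path /path/to/imagenet \
    --output_dir /path/to/checkpoints
```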
To train the network with the best configuration on 1 GPU, run `varsize_1gpu_best.sh` with your own paths.
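For example, after editing the dataset and output paths inside the script to match your environment:

```bash
bash varsize_1gpu_best.sh
```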
To plot the accuracy-over-time results, use `plot_output.py` with your own paths.
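A hypothetical invocation (the script's exact arguments are not documented here, so inspect `plot_output.py` before running):

```bash
# Edit the log/result paths referenced by the script first.
python plot_output.py
```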
Please cite the paper if you use the idea or code:
```bibtex
@misc{irandoust2022training,
  title={{Training a Vision Transformer from scratch in less than 24 hours with 1 GPU}},
  author={Saghar Irandoust and Thibaut Durand and Yunduz Rakhmangulova and Wenjie Zi and Hossein Hajimirsadeghi},
  year={2022},
  eprint={2211.05187},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```