This repository contains code for replicating the Vision Transformer (ViT) model proposed in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al.
- Official implementation of ViT
- ViT paper by Dosovitskiy et al.
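
The "16x16 words" in the paper title refers to splitting an image into fixed-size patches that the Transformer treats as a token sequence. The sketch below illustrates that idea in plain Python. It is not taken from this repository: the function names `patch_sequence_shape` and `split_into_patches` are hypothetical, and the 224x224 RGB input size is an assumed example.

```python
def patch_sequence_shape(height, width, channels=3, patch_size=16):
    """Return (num_patches, patch_dim) for non-overlapping square patches.

    Hypothetical helper for illustration; the defaults (RGB input,
    16x16 patches) are assumptions, not values read from this repository.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


def split_into_patches(image, patch_size):
    """Split a 2D grid (a list of equal-length rows) into flattened patches.

    Patches are emitted in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # Assumed example: a 224x224 RGB image cut into 16x16 patches.
    print(patch_sequence_shape(224, 224))  # (196, 768)
```

Under these assumptions, a 224x224 RGB image becomes a sequence of 196 patches, each flattened to 768 values, before any learned projection is applied. A real implementation would typically perform this step on tensors rather than nested lists.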