transformer_pipeline
- Attention Is All You Need
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- CSWin: CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
- DETR: End-to-End Object Detection with Transformers
- iRPE: Rethinking and Improving Relative Position Encoding for Vision Transformer
- DAT: Vision Transformer with Deformable Attention
- CvT: CvT: Introducing Convolutions to Vision Transformers
- CrossViT: CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
- SwinTrack: SwinTrack: A Simple and Strong Baseline for Transformer Tracking
- Stark: Learning Spatio-Temporal Transformer for Visual Tracking
- [ ] Swin-V2: coming soon

Performance comparisons on ImageNet-1K

| method | input resolution | top-1 accuracy (%) |
|---|---|---|
| ViT-B | 384 | 77.9 |
| SwinV1-B | 384 | 84.2 |
| CSWin-B | 384 | 85.4 |
| iRPE (based on DeiT-B) | 224 | 82.4 |
| DAT-B | 384 | 84.8 |
| CvT-21 | 384 | 84.9 |
| CrossViT-18 | 384 | 83.9 |
| SwinV2-B | 384 | 87.1 |