# Vision-Transformer-papers

This repository contains a (non-exhaustive) overview of follow-up works based on the original [Vision Transformer (ViT)](https://arxiv.org/abs/2010.11929) by Google. Feel free to open a PR to add more papers!
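For orientation, here is a minimal sketch of the patch-embedding front end introduced in the original ViT paper, which most of the works listed below modify or build on. This is an illustration written for this README (PyTorch is assumed; module and parameter names such as `PatchEmbedding` and `hidden_dim` are hypothetical), not code taken from any of the papers.

```python
# Minimal sketch of the ViT front end: split the image into fixed-size patches,
# project each patch to the hidden dimension, prepend a [CLS] token, and add
# learned position embeddings. The resulting token sequence is what a standard
# transformer encoder (and the variants listed below) operates on.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size is equivalent to
        # flattening non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, hidden_dim))

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, hidden_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, hidden_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)    # prepend the [CLS] token
        return x + self.pos_embed         # add learned position embeddings


if __name__ == "__main__":
    tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 197, 768]) -> 14*14 patches + [CLS]
```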
## Distillation:
* DeiT (Data-efficient Image Transformers): https://arxiv.org/abs/2012.12877
* Efficient Vision Transformers via Fine-Grained Manifold Distillation: https://arxiv.org/abs/2107.01378
* NViT (Vision Transformer Compression and Parameter Redistribution): https://arxiv.org/abs/2110.04869
* SiT (Self-slimmed Vision Transformer): https://arxiv.org/abs/2111.12624

## New pre-training objectives:
- self-supervised:
  * DINO (Emerging Properties in Self-Supervised Vision Transformers): https://arxiv.org/abs/2104.14294
  * MoBY (Self-Supervised Learning with Swin Transformers): https://arxiv.org/abs/2105.04553
  * EsViT (Efficient self-supervised Vision Transformers): https://arxiv.org/abs/2106.09785
  * BEiT (BERT Pre-Training of Image Transformers): https://arxiv.org/abs/2106.08254
  * MAE (Masked Autoencoders Are Scalable Vision Learners): https://arxiv.org/abs/2111.06377
  * SiT (Self-supervised vIsion Transformer): https://arxiv.org/abs/2104.03602
  * SimMIM (A Simple Framework for Masked Image Modeling): https://arxiv.org/abs/2111.09886
- supervised:
  * Token Labeling for Training Better Vision Transformers: https://arxiv.org/abs/2104.10858
  * Vision Transformers with Patch Diversification: https://arxiv.org/abs/2104.12753
  * Token Pooling in Vision Transformers: https://arxiv.org/abs/2110.03860

## New pre-training tricks, techniques:
* Scaling Vision Transformers: https://arxiv.org/abs/2106.04560
* Vision Transformers with Patch Diversification: https://arxiv.org/abs/2104.12753
* Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations: https://arxiv.org/abs/2108.05887
* How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
* When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations (SAM optimizer): https://arxiv.org/abs/2106.01548
* V-MoE (Scaling Vision with Sparse Mixture of Experts): https://arxiv.org/abs/2106.05974

## Architectural changes:
- Combining convolution with self-attention:
  * CvT (Introducing Convolutions to Vision Transformers): https://arxiv.org/abs/2103.15808
  * ConViT (Improving Vision Transformers with Soft Convolutional Inductive Biases): https://arxiv.org/abs/2103.10697
  * CMT (Convolutional Neural Networks Meet Vision Transformers): https://arxiv.org/abs/2107.06263
  * LeViT (A Vision Transformer in ConvNet's Clothing for Faster Inference): https://arxiv.org/abs/2104.01136
  * CoaT (Co-Scale Conv-Attentional Image Transformers): https://arxiv.org/abs/2104.06399
  * Visformer (The Vision-friendly Transformer): https://arxiv.org/abs/2104.12533
  * CCT (Escaping the Big Data Paradigm with Compact Transformers): https://arxiv.org/abs/2104.05704
  * Refiner (Refining Self-attention for Vision Transformers): https://arxiv.org/abs/2106.03714
  * LVT (Lite Vision Transformer with Enhanced Self-Attention): https://arxiv.org/abs/2112.10809
- Others:
  * PiT (Rethinking Spatial Dimensions of Vision Transformers): https://arxiv.org/abs/2103.16302
  * XCiT (Cross-Covariance Image Transformers): https://arxiv.org/abs/2106.09681
  * EsViT (Efficient self-supervised Vision Transformers): https://arxiv.org/abs/2106.09785
  * Tokens-to-Token ViT (Training Vision Transformers from Scratch on ImageNet): https://arxiv.org/abs/2101.11986
  * DeepViT (Towards Deeper Vision Transformer): https://arxiv.org/abs/2103.11886
  * PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions): https://arxiv.org/abs/2102.12122
  * PVTv2 (Improved Baselines with Pyramid Vision Transformer): https://arxiv.org/abs/2106.13797
  * Wider Vision Transformer (Go Wider Instead of Deeper): https://arxiv.org/abs/2107.11817
  * CaiT (Going Deeper with Image Transformers): https://arxiv.org/abs/2103.17239
  * CrossViT (Cross-Attention Multi-Scale Vision Transformer for Image Classification): https://arxiv.org/abs/2103.14899
  * Twins (Revisiting the Design of Spatial Attention in Vision Transformers): https://arxiv.org/abs/2104.13840
  * LIT (Less is More: Pay Less Attention in Vision Transformers): https://arxiv.org/abs/2105.14217
  * TnT (Transformer-in-Transformer): https://arxiv.org/abs/2103.00112
  * Dynamic Vision Transformer (Dynamic Vision Transformers with Adaptive Sequence Length): https://arxiv.org/abs/2105.15075
  * Swin Transformer (Hierarchical Vision Transformer using Shifted Windows): https://arxiv.org/abs/2103.14030
  * Shuffle Transformer (Rethinking Spatial Shuffle for Vision Transformer): https://arxiv.org/abs/2106.03650
  * NesT (Aggregating Nested Transformers): https://arxiv.org/abs/2105.12723
  * Long-Short Transformer (Efficient Transformers for Language and Vision): https://t.co/V8qKUkVH1c?amp=1
  * DynamicViT (Efficient Vision Transformers with Dynamic Token Sparsification): https://arxiv.org/abs/2106.02034
  * PS-ViT (Vision Transformer with Progressive Sampling): https://arxiv.org/abs/2108.01684
  * RegionViT (Regional-to-Local Attention for Vision Transformers): https://arxiv.org/abs/2106.02689
  * Focal Transformer (Focal Self-attention for Local-Global Interactions in Vision Transformers): https://arxiv.org/abs/2107.00641
  * KVT (k-NN Attention for Boosting Vision Transformers): https://arxiv.org/abs/2106.00515
  * Robust Vision Transformer: https://arxiv.org/abs/2105.07926
  * Glance-and-Gaze Vision Transformer: https://arxiv.org/abs/2106.02277
  * Feature Fusion Vision Transformer: https://arxiv.org/abs/2107.02341
  * Augmented Shortcuts for Vision Transformers: https://arxiv.org/abs/2106.15941
  * CrossFormer (A Versatile Vision Transformer Based on Cross-scale Attention): https://arxiv.org/abs/2108.00154
  * CSWin Transformer (A General Vision Transformer Backbone with Cross-Shaped Windows): https://arxiv.org/abs/2107.00652
  * Evo-ViT (Slow-Fast Token Evolution for Dynamic Vision Transformer): https://arxiv.org/abs/2108.01390
  * PSViT (Better Vision Transformer via Token Pooling and Attention Sharing): https://t.co/OOnONItfnX?amp=1
  * ImageRPE (relative position encodings for Vision Transformers): https://arxiv.org/abs/2107.14222
  * What makes for Hierarchical Vision Transformer? https://arxiv.org/abs/2107.02174
  * Multi-Scale Vision Longformer: https://arxiv.org/abs/2103.15358
  * MetaFormer is Actually What You Need for Vision: https://arxiv.org/abs/2111.11418
  * Stochastic Layers in Vision Transformers: https://arxiv.org/abs/2112.15111
  * ViR (the Vision Reservoir): https://arxiv.org/abs/2112.13545
  * Blending Anti-Aliasing into Vision Transformer: https://arxiv.org/abs/2110.15156
  * ELSA (Enhanced Local Self-Attention for Vision Transformer): https://arxiv.org/abs/2112.12786
  * Swin Transformer V2 (Scaling Up Capacity and Resolution): https://arxiv.org/abs/2111.09883

## Investigations of the inner workings (cf. BERTology):

* Are Convolutional Neural Networks or Transformers more like human vision? https://arxiv.org/abs/2105.07197
* Do Vision Transformers See Like Convolutional Neural Networks? https://arxiv.org/abs/2108.08810
* What makes for Hierarchical Vision Transformer? (survey on Swin + Shuffle Transformer): https://arxiv.org/abs/2107.02174
* Intriguing Properties of Vision Transformers: https://arxiv.org/abs/2105.10497
* Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers: https://arxiv.org/abs/2106.13122
* Are Transformers More Robust Than CNNs? https://arxiv.org/abs/2111.05464

## Applying ViT to other domains besides image classification:

* YOLOS (object detection): https://arxiv.org/abs/2106.00666
* ViTGAN (GANs): https://arxiv.org/abs/2107.04589
* SegFormer (semantic segmentation): https://arxiv.org/abs/2105.15203
* Feature Fusion Vision Transformer (fine-grained visual categorization): https://arxiv.org/abs/2107.02341
* TrOCR (optical character recognition): https://arxiv.org/abs/2109.10282