├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Yunsung Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome Visual Representation Learning with Transformers [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

Awesome Transformers (self-attention) in Computer Vision

## About transformers
- Attention Is All You Need, NeurIPS 2017
  - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  - [[paper]](https://arxiv.org/abs/1706.03762) [[official code]](https://github.com/tensorflow/tensor2tensor) [[pytorch implementation]](https://github.com/jadore801120/attention-is-all-you-need-pytorch)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
  - Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
  - [[paper]](https://arxiv.org/abs/1810.04805) [[official code]](https://github.com/google-research/bert) [[huggingface/transformers]](https://github.com/huggingface/transformers)
- Efficient Transformers: A Survey, arXiv 2020
  - Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
  - [[paper]](https://arxiv.org/abs/2009.06732)
- A Survey on Visual Transformer, arXiv 2020
  - Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao
  - [[paper]](https://arxiv.org/abs/2012.12556)
- Transformers in Vision: A Survey, arXiv 2021
  - Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
  - [[paper]](https://arxiv.org/abs/2101.01169)
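For orientation, below is a minimal sketch of the scaled dot-product self-attention that everything in this list builds on, written in PyTorch. The function name, weight tensors, and shapes are illustrative assumptions, not any paper's API.

```python
# Minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
# Illustrative only: names and shapes are assumptions, not an official implementation.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, tokens, d_model); w_*: (d_model, d_head) projection weights (assumed)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # pairwise similarities, scaled
    return F.softmax(scores, dim=-1) @ v                   # attention-weighted sum of values

x = torch.randn(2, 16, 64)                                 # 2 sequences of 16 tokens, d_model=64
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                     # (2, 16, 64)
```

Multi-head attention runs several such projections in parallel and concatenates the results; most of the papers below are variations on where the tokens come from.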
## Combining CNN with self-attention
- Attention augmented convolutional networks, ICCV 2019, image classification
  - Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le
  - [[paper]](https://arxiv.org/abs/1904.09925) [[pytorch implementation]](https://github.com/leaderj1001/Attention-Augmented-Conv2d)
- Self-Attention Generative Adversarial Networks, ICML 2019, generative models (GANs)
  - Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
  - [[paper]](https://arxiv.org/abs/1805.08318) [[official code]](https://github.com/heykeetae/Self-Attention-GAN)
- VideoBERT: A joint model for video and language representation learning, ICCV 2019, video processing
  - Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
  - [[paper]](https://arxiv.org/abs/1904.01766)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv 2020, image classification
  - Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Peter Vajda
  - [[paper]](https://arxiv.org/abs/2006.03677)
- Feature Pyramid Transformer, ECCV 2020, detection and segmentation
  - Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
  - [[paper]](http://arxiv.org/abs/2007.09451) [[official code]](https://github.com/ZHANGDONG-NJUST/FPT)
- Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, depth estimation
  - Zhaoshuo Li, Xingtong Liu, Francis X. Creighton, Russell H. Taylor, Mathias Unberath
  - [[paper]](http://arxiv.org/abs/2011.02910) [[official code]](https://github.com/mli0603/stereo-transformer)
- End-to-end Lane Shape Prediction with Transformers, arXiv 2020, lane detection
  - Ruijin Liu, Zejian Yuan, Tie Liu, Zhiliang Xiong
  - [[paper]](http://arxiv.org/abs/2011.04233) [[official code]](https://github.com/liuruijin17/LSTR)
- Taming Transformers for High-Resolution Image Synthesis, arXiv 2020, image synthesis
  - Patrick Esser, Robin Rombach, Bjorn Ommer
  - [[paper]](http://arxiv.org/abs/2012.09841) [[official code]](https://github.com/CompVis/taming-transformers)
- TransPose: Towards Explainable Human Pose Estimation by Transformer, arXiv 2020, pose estimation
  - Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang
  - [[paper]](https://arxiv.org/abs/2012.14214)
- End-to-End Video Instance Segmentation with Transformers, arXiv 2020, video instance segmentation
  - Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia
  - [[paper]](https://arxiv.org/abs/2011.14503)
- TransTrack: Multiple-Object Tracking with Transformer, arXiv 2020, MOT
  - Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, Ping Luo
  - [[paper]](https://arxiv.org/abs/2012.15460) [[official code]](https://github.com/PeizeSun/TransTrack)
- TrackFormer: Multi-Object Tracking with Transformers, arXiv 2021, MOT
  - Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
  - [[paper]](https://arxiv.org/abs/2101.02702)
- Line Segment Detection Using Transformers without Edges, arXiv 2021, line segmentation
  - Yifan Xu, Weijian Xu, David Cheung, Zhuowen Tu
  - [[paper]](https://arxiv.org/abs/2101.01909)
- Segmenting Transparent Object in the Wild with Transformer, arXiv 2021, transparent object segmentation
  - Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo
  - [[paper]](https://arxiv.org/abs/2101.08461) [[official code]](https://github.com/xieenze/Trans2Seg)
- Bottleneck Transformers for Visual Recognition, arXiv 2021, backbone design
  - Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
  - [[paper]](http://arxiv.org/abs/2101.11605)
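The recipe shared across this family is to run a CNN, treat each spatial location of its feature map as a token, and apply global self-attention over the grid. A hedged sketch of that pattern follows; the module name, layer choices, and sizes are illustrative assumptions, not any specific paper's design.

```python
# Hedged sketch of the CNN + self-attention hybrid pattern: flatten a CNN
# feature map into a token sequence, attend globally, fold it back into a map.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3)
        # batch_first requires PyTorch >= 1.9
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, img):
        f = self.conv(img)                               # (B, C, H, W) CNN features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)            # (B, H*W, C): one token per location
        out, _ = self.attn(tokens, tokens, tokens)       # global self-attention over the grid
        return out.transpose(1, 2).reshape(b, c, h, w)   # back to a feature map

feat = ConvSelfAttention()(torch.randn(1, 3, 64, 64))    # (1, 256, 16, 16)
```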
### DETR Family
- End-to-end object detection with transformers, ECCV 2020, object detection
  - Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
  - [[paper]](https://arxiv.org/abs/2005.12872) [[official code]](https://github.com/facebookresearch/detr) [[detectron2 implementation]](https://github.com/poodarchu/DETR.detectron2)
- Deformable DETR: Deformable Transformers for End-to-End Object Detection, ICLR 2021, object detection
  - Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai
  - [[paper]](http://arxiv.org/abs/2010.04159) [[official code]](https://github.com/fundamentalvision/Deformable-DETR)
- End-to-End Object Detection with Adaptive Clustering Transformer, arXiv 2020, object detection
  - Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, Hao Dong
  - [[paper]](http://arxiv.org/abs/2011.09315)
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, arXiv 2020, object detection
  - Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen
  - [[paper]](http://arxiv.org/abs/2011.09094)
- DETR for Pedestrian Detection, arXiv 2020, pedestrian detection
  - Matthieu Lin, Chuming Li, Xingyuan Bu, Ming Sun, Chen Lin, Junjie Yan, Wanli Ouyang, Zhidong Deng
  - [[paper]](http://arxiv.org/abs/2012.06785)
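The DETR line frames detection as direct set prediction: image tokens from a CNN pass through a transformer encoder-decoder, and a fixed set of learned object queries each decode into one box-or-nothing prediction. Below is a heavily condensed sketch of that idea; positional encodings and the Hungarian matching loss are omitted, and the class name and hyperparameters are illustrative assumptions, not the official model.

```python
# Condensed sketch in the spirit of DETR's set prediction (not the official model).
# Positional encodings and bipartite-matching loss are omitted for brevity.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, d=256, queries=100):   # 91: e.g. COCO (assumed)
        super().__init__()
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])  # drop pool + fc
        self.proj = nn.Conv2d(2048, d, 1)                  # 2048-d ResNet features -> d
        self.transformer = nn.Transformer(d, 8, 6, 6, batch_first=True)
        self.query = nn.Parameter(torch.randn(queries, d)) # learned object queries
        self.cls = nn.Linear(d, num_classes + 1)           # classes + "no object"
        self.box = nn.Linear(d, 4)                         # (cx, cy, w, h)

    def forward(self, img):
        f = self.proj(self.backbone(img))                  # (B, d, H, W)
        src = f.flatten(2).transpose(1, 2)                 # (B, H*W, d) image tokens
        tgt = self.query.unsqueeze(0).expand(img.size(0), -1, -1)
        h = self.transformer(src, tgt)                     # (B, queries, d)
        return self.cls(h), self.box(h).sigmoid()          # per-query class + box

logits, boxes = MiniDETR()(torch.randn(1, 3, 256, 256))
```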
## Stand-alone transformers for Computer Vision
### Self-attention only in local neighborhood
- Image Transformer, ICML 2018
  - Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
  - [[paper]](https://arxiv.org/abs/1802.05751) [[official code]](https://github.com/tensorflow/tensor2tensor)
- Stand-alone self-attention in vision models, NeurIPS 2019
  - Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
  - [[paper]](https://arxiv.org/abs/1906.05909) [[official code (under construction)]](https://github.com/google-research/google-research/tree/master/standalone_self_attention_in_vision_models)
- On the relationship between self-attention and convolutional layers, ICLR 2020
  - Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
  - [[paper]](https://arxiv.org/abs/1911.03584) [[official code]](https://github.com/epfml/attention-cnn)
- Exploring self-attention for image recognition, CVPR 2020
  - Hengshuang Zhao, Jiaya Jia, Vladlen Koltun
  - [[paper]](https://arxiv.org/abs/2004.13621) [[official code]](https://github.com/hszhao/SAN)

### Scalable approximations to global self-attention
- Generating long sequences with sparse transformers, arXiv 2019
  - Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
  - [[paper]](https://arxiv.org/abs/1904.10509) [[official code]](https://github.com/openai/sparse_attention)
- Scaling autoregressive video models, ICLR 2020
  - Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
  - [[paper]](https://arxiv.org/abs/1906.02634)
- Axial attention in multidimensional transformers, arXiv 2019
  - Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans
  - [[paper]](https://arxiv.org/abs/1912.12180) [[pytorch implementation]](https://github.com/lucidrains/axial-attention)
- Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation, ECCV 2020
  - Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
  - [[paper]](https://arxiv.org/abs/2003.07853) [[pytorch implementation]](https://github.com/csrhddlam/axial-deeplab)
- MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, arXiv 2020
  - Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
  - [[paper]](http://arxiv.org/abs/2012.00759)

### Global self-attention with image preprocessing
- Generative pretraining from pixels, ICML 2020, iGPT
  - Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
  - [[paper]](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf) [[official code]](https://github.com/openai/image-gpt)
- **An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021, ViT**
  - Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
  - [[paper]](https://arxiv.org/abs/2010.11929) [[pytorch implementation]](https://github.com/lucidrains/vit-pytorch)
- Pre-Trained Image Processing Transformer, arXiv 2020, IPT
  - Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
  - [[paper]](http://arxiv.org/abs/2012.00364)
- Training data-efficient image transformers & distillation through attention, arXiv 2020, DeiT
  - Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Herve Jegou
  - [[paper]](http://arxiv.org/abs/2012.12877) [[official code]](https://github.com/facebookresearch/deit)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, SETR
  - Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
  - [[paper]](http://arxiv.org/abs/2012.15840) [[official code]](https://fudan-zvg.github.io/SETR)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, arXiv 2021, T2T-ViT
  - Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan
  - [[paper]](http://arxiv.org/abs/2101.11986) [[official code]](https://github.com/yitu-opensource/T2T-ViT)
- TransReID: Transformer-based Object Re-Identification, arXiv 2021
  - Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
  - [[paper]](http://arxiv.org/abs/2102.04378)
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, arXiv 2021
  - Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
  - [[paper]](https://arxiv.org/abs/2102.12122) [[official code]](https://github.com/whai362/PVT)
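The entries above all reduce an image to a token sequence before applying a standard transformer. ViT's preprocessing is the simplest: cut the image into fixed-size patches and linearly embed each one. A small sketch of that recipe follows, with sizes chosen for illustration rather than taken from the paper.

```python
# Hedged sketch of the ViT recipe: patchify, linearly embed, prepend a class
# token, encode with a standard transformer. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, d=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                            # number of patch tokens
        self.embed = nn.Conv2d(3, d, patch, stride=patch)  # patchify + linear embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))  # learned position embeddings
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d, classes)

    def forward(self, img):
        x = self.embed(img).flatten(2).transpose(1, 2)     # (B, N, d) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1) + self.pos)
        return self.head(x[:, 0])                          # classify from the class token

logits = TinyViT()(torch.randn(1, 3, 224, 224))            # (1, 1000)
```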
### Global self-attention on 3D point clouds
- Point Transformer, arXiv 2020, point classification + part/semantic segmentation
  - Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
  - [[paper]](http://arxiv.org/abs/2011.00931)

## Unified text-vision tasks
### Focused on VQA
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019
  - Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
  - [[paper]](https://arxiv.org/abs/1908.02265) [[official code]](https://github.com/facebookresearch/vilbert-multi-task)
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019
  - Hao Tan, Mohit Bansal
  - [[paper]](https://arxiv.org/abs/1908.07490) [[official code]](https://github.com/airsplay/lxmert)
- VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019
  - Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
  - [[paper]](https://arxiv.org/abs/1908.03557) [[official code]](https://github.com/uclanlp/visualbert)
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020
  - Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
  - [[paper]](https://arxiv.org/abs/1908.08530) [[official code]](https://github.com/jackroos/VL-BERT)
- UNITER: UNiversal Image-TExt Representation Learning, ECCV 2020
  - Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
  - [[paper]](https://arxiv.org/abs/1909.11740) [[official code]](https://github.com/ChenRocks/UNITER)
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020
  - Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
  - [[paper]](https://arxiv.org/abs/2004.00849)
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021
  - Wonjae Kim, Bokyung Son, Ildoo Kim
  - [[paper]](https://arxiv.org/abs/2102.03334)

### Focused on Image Retrieval
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
  - Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
  - [[paper]](https://arxiv.org/abs/1908.06066) [[official code]](https://github.com/microsoft/Unicoder)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, arXiv 2020
  - Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti
  - [[paper]](https://arxiv.org/abs/2001.07966)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020
  - Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
  - [[paper]](https://arxiv.org/abs/2004.06165) [[official code]](https://github.com/microsoft/Oscar)
- Training Vision Transformers for Image Retrieval, arXiv 2021
  - Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Herve Jegou
  - [[paper]](http://arxiv.org/abs/2102.05644)
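Architecturally, most of the VQA and retrieval models above are single-stream or two-stream BERTs over a joint sequence of text tokens and visual tokens. A hedged single-stream sketch follows; the hidden size, vocabulary, and feature dimensions are assumptions, and position embeddings and pretraining objectives are omitted.

```python
# Hedged sketch of the single-stream vision-language pattern: embed text and
# visual tokens, tag each with a modality-type embedding, and run one encoder.
# All names and sizes below are assumptions for illustration.
import torch
import torch.nn as nn

d, vocab = 256, 30522                               # hidden size, BERT-like vocab (assumed)
text_embed = nn.Embedding(vocab, d)                 # word-piece embeddings
img_proj = nn.Linear(2048, d)                       # project region/patch features to d
type_embed = nn.Embedding(2, d)                     # 0 = text token, 1 = visual token
layer = nn.TransformerEncoderLayer(d, 8, 4 * d, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

text_ids = torch.randint(vocab, (1, 12))            # 12 word pieces
img_feats = torch.randn(1, 36, 2048)                # 36 visual features (e.g. regions)
tokens = torch.cat([
    text_embed(text_ids) + type_embed.weight[0],    # tag text positions
    img_proj(img_feats) + type_embed.weight[1],     # tag visual positions
], dim=1)                                           # (1, 48, d) joint sequence
fused = encoder(tokens)                             # cross-modal contextualized tokens
```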
### Focused on OCR

- LayoutLM: Pre-training of Text and Layout for Document Image Understanding, KDD 2020
  - Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
  - [[paper]](https://arxiv.org/abs/1912.13318) [[official code]](https://github.com/microsoft/unilm/tree/master/layoutlm)

### Focused on Image Captioning

- CPTR: Full Transformer Network for Image Captioning, arXiv 2021
  - Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
  - [[paper]](http://arxiv.org/abs/2101.10804)

### Multi-Task

- 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020
  - Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
  - [[paper]](https://arxiv.org/abs/1912.02315) [[official code]](https://github.com/facebookresearch/vilbert-multi-task)
--------------------------------------------------------------------------------