├── transformers.png
├── VisionOnlyPTMs.md
├── VL-PTMs.md
├── LanguageOnlyPTMs.md
└── README.md

/transformers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AlenUbuntu/Awesome-Vision-and-Language-PreTrain-Papers/HEAD/transformers.png
--------------------------------------------------------------------------------
/VisionOnlyPTMs.md:
--------------------------------------------------------------------------------
## Vision-Only PTMs

### Modeling Techniques

### Transfer Learning

### Others

--------------------------------------------------------------------------------
/VL-PTMs.md:
--------------------------------------------------------------------------------
# VL-PTMs
## Table of Contents
* [Image-Based VL-PTMs](#image-based-vl-ptms)
  * [Representation Learning](#representation-learning)
  * [Task-Specific](#task-specific)
  * [Others](#others)
* [Video-Based VL-PTMs](#video-based-vl-ptms)
  * [Representation Learning](#representation-learning-1)
  * [Task-Specific](#task-specific-1)
* [List of Other Resources](#list-of-other-resources)

## Image-Based VL-PTMs
### Representation Learning
**X-LXMERT**

[X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers](https://www.aclweb.org/anthology/2020.emnlp-main.707.pdf), EMNLP, 2020.

**VL-BERT**

[VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/pdf/1908.08530.pdf), ICLR, 2020.

**Unicoder-VL**

[Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training](https://arxiv.org/pdf/1908.06066.pdf), AAAI, 2020.

**VLP**

[Unified Vision-Language Pre-Training for Image Captioning and VQA](https://arxiv.org/pdf/1909.11059.pdf), AAAI, 2020.

**InterBERT**

[InterBERT: An Effective Multi-Modal Pretraining Approach via Vision-and-Language Interaction](https://arxiv.org/pdf/2003.13198.pdf), arXiv, 2020.

**Oscar**

[Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/pdf/2004.06165.pdf), arXiv, 2020.

**Pixel-BERT**

[Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers](https://arxiv.org/pdf/2004.00849.pdf), arXiv, 2020.

**ERNIE-ViL**

[ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/pdf/2006.16934.pdf), arXiv, 2020.

**DeVLBert**

[DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/pdf/2008.06884.pdf), ACM MM, 2020.

**SemVLP**

[SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels](https://openreview.net/pdf?id=Wg2PSpLZiH), ICLR, 2021.

**ViLBERT**

[ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/pdf/1908.02265.pdf), NeurIPS, 2019.

**LXMERT**

[LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/pdf/1908.07490.pdf), EMNLP, 2019.

**VisualBERT**

[VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557.pdf), arXiv, 2019.

**UNITER**

[UNITER: UNiversal Image-TExt Representation Learning](https://arxiv.org/pdf/1909.11740.pdf), arXiv, 2019.
**Vision-Language Encoder**

[Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks](https://arxiv.org/pdf/1912.03063.pdf), arXiv, 2019.
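Most of the representation-learning models above share one input recipe: detector region features and text tokens are projected into a common space and encoded jointly, either single-stream (VisualBERT, UNITER, VL-BERT, Unicoder-VL) or two-stream (ViLBERT, LXMERT). A minimal single-stream sketch in PyTorch; all names and sizes here are illustrative, not taken from any specific paper, and text position embeddings and pretraining losses are omitted for brevity:

```python
import torch
import torch.nn as nn

class SingleStreamVLEncoder(nn.Module):
    """Illustrative single-stream encoder: text tokens + detector regions -> one transformer."""
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048, layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)   # visual features -> hidden space
        self.box_proj = nn.Linear(4, hidden)               # normalized box coords -> hidden space
        self.type_emb = nn.Embedding(2, hidden)            # segment type: 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, region_feats, region_boxes):
        text = self.tok_emb(token_ids) + self.type_emb(torch.zeros_like(token_ids))
        vis = (self.region_proj(region_feats) + self.box_proj(region_boxes)
               + self.type_emb(torch.ones(region_feats.shape[:2], dtype=torch.long)))
        # Concatenate both modalities and let self-attention mix them.
        return self.encoder(torch.cat([text, vis], dim=1))

tokens = torch.randint(0, 30522, (1, 16))   # fake caption tokens
feats = torch.randn(1, 36, 2048)            # fake Faster R-CNN region features
boxes = torch.rand(1, 36, 4)                # fake normalized boxes
out = SingleStreamVLEncoder()(tokens, feats, boxes)
print(out.shape)                            # torch.Size([1, 52, 768])
```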
### Task-Specific
**Image Captioning**: [Meshed-Memory Transformer for Image Captioning](https://arxiv.org/pdf/1912.08226.pdf), CVPR, 2020.

**Image Captioning**: [XGPT: Cross-modal Generative Pre-Training for Image Captioning](https://arxiv.org/pdf/2003.01473.pdf), arXiv, 2020.

**Image Captioning**: [Entangled Transformer for Image Captioning](https://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf), ICCV, 2019.

**Machine Translation**: [Multimodal Transformer for Multimodal Machine Translation](https://www.aclweb.org/anthology/2020.acl-main.400.pdf), ACL, 2020.

**Classification**: [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf), arXiv, 2019.

**VQA**: [Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf), CVPR, 2020.

**VQA**: [Spatially Aware Multimodal Transformers for TextVQA](https://arxiv.org/pdf/2007.12146.pdf), ECCV, 2020.

**NER**: [Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer](https://www.aclweb.org/anthology/2020.acl-main.306.pdf), ACL, 2020.

**VisDial**: [VD-BERT: A Unified Vision and Dialog Transformer with BERT](https://arxiv.org/pdf/2004.13278.pdf), EMNLP, 2020.

**VisDial**: [Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline](https://arxiv.org/pdf/1912.02379.pdf), ECCV, 2020.

**Vision-and-Language Navigation**: [Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training](https://arxiv.org/pdf/2002.10638.pdf), CVPR, 2020.

**Text-Image Retrieval**: [ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data](https://arxiv.org/pdf/2001.07966.pdf), arXiv, 2020.

**Text-Image Retrieval**: [Cross-Probe BERT for Efficient and Effective Cross-Modal Search](https://openreview.net/pdf?id=bW9SYKHcZiz), ICLR, 2021.

**Visual Commonsense Reasoning**: [Fusion of Detected Objects in Text for Visual Question Answering](https://arxiv.org/pdf/1908.05054.pdf), EMNLP, 2019.


### Others
[Are we pretraining it right? Digging deeper into visio-linguistic pretraining](https://arxiv.org/pdf/2004.08744.pdf), arXiv, 2020.

[Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models](https://arxiv.org/pdf/2005.07310.pdf), arXiv, 2020.

[12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/pdf/1912.02315.pdf), arXiv, 2020.

## Video-Based VL-PTMs
### Representation Learning
**MAG-BERT, MAG-XLNet**

[Integrating Multimodal Information in Large Pretrained Transformers](https://www.aclweb.org/anthology/2020.acl-main.214.pdf), ACL, 2020.

**COOT**

[COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning](https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf), NeurIPS, 2020.

**ActBERT**

[ActBERT: Learning Global-Local Video-Text Representations](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_ActBERT_Learning_Global-Local_Video-Text_Representations_CVPR_2020_paper.pdf), CVPR, 2020.

**HERO**

[HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training](https://arxiv.org/pdf/2005.00200.pdf), EMNLP, 2020.

**UniVL**

[UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation](https://arxiv.org/pdf/2002.06353.pdf), arXiv, 2020.

**DSTC8**

[Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog](https://arxiv.org/pdf/2002.00163.pdf), AAAI 2020 DSTC8 Workshop.

**MulT**

[Multimodal Transformer for Unaligned Multimodal Language Sequences](https://www.aclweb.org/anthology/P19-1656.pdf), ACL, 2019.

**VideoBERT**

[VideoBERT: A Joint Model for Video and Language Representation Learning](https://arxiv.org/pdf/1904.01766.pdf), ICCV, 2019.

**CBT**

[Learning Video Representations using Contrastive Bidirectional Transformer](https://arxiv.org/pdf/1906.05743.pdf), arXiv, 2019.

**YouTube8M**

[BERT for Large-scale Video Segment Classification with Test-time Augmentation](https://arxiv.org/pdf/1912.01127.pdf), ICCV 2019 YouTube-8M Workshop.

### Task-Specific

**Action Segmentation**: [SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation](https://arxiv.org/pdf/2003.14266.pdf), CVPR, 2020.

**Video Retrieval**: [Multi-modal Transformer for Video Retrieval](https://arxiv.org/pdf/2007.10639.pdf), ECCV, 2020.

**VQA**: [Video-Grounded Dialogues with Pretrained Generation Language Models](https://www.aclweb.org/anthology/2020.acl-main.518.pdf), ACL, 2020.

**VQA**: [Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems](https://www.aclweb.org/anthology/P19-1564.pdf), ACL, 2019.

**Video Captioning**: [Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training](https://arxiv.org/pdf/2007.02375.pdf), arXiv, 2020.

## List of Other Resources

[GitHub Repo](https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers)

--------------------------------------------------------------------------------
/LanguageOnlyPTMs.md:
--------------------------------------------------------------------------------
## Language-Only PTMs
### Modeling Techniques
[To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks](https://www.aclweb.org/anthology/2020.acl-main.200.pdf), ACL, 2020.

[Quantifying Attention Flow in Transformers](https://www.aclweb.org/anthology/2020.acl-main.385.pdf), ACL, 2020.

[Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture](https://www.aclweb.org/anthology/2020.acl-main.360.pdf), ACL, 2020.

[DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering](https://www.aclweb.org/anthology/2020.acl-main.411.pdf), ACL, 2020.

[Roles and Utilization of Attention Heads in Transformer-based Neural Language Models](https://www.aclweb.org/anthology/2020.acl-main.311.pdf), ACL, 2020.
[Do Transformers Need Deep Long-Range Memory?](https://www.aclweb.org/anthology/2020.acl-main.672.pdf), ACL, 2020.

[Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change](https://www.aclweb.org/anthology/2020.acl-main.323.pdf), ACL, 2020.

[Lipschitz Constrained Parameter Initialization for Deep Transformers](https://www.aclweb.org/anthology/2020.acl-main.38.pdf), ACL, 2020.

[Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://www.aclweb.org/anthology/2020.findings-emnlp.414.pdf), EMNLP(Findings), 2020.

[Analyzing Redundancy in Pretrained Transformer Models](https://www.aclweb.org/anthology/2020.emnlp-main.398.pdf), EMNLP, 2020.

[How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers?](https://www.aclweb.org/anthology/2020.findings-emnlp.394.pdf), EMNLP(Findings), 2020.

[Pretrained Language Model Embryology: The Birth of ALBERT](https://www.aclweb.org/anthology/2020.emnlp-main.553.pdf), EMNLP, 2020.

[Pre-Training Transformers as Energy-Based Cloze Models](https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf), EMNLP, 2020.

[Calibration of Pre-trained Transformers](https://www.aclweb.org/anthology/2020.emnlp-main.21.pdf), EMNLP, 2020.

[Guiding Attention for Self-Supervised Learning with Transformers](https://www.aclweb.org/anthology/2020.findings-emnlp.419.pdf), EMNLP(Findings), 2020.

[Improve Transformer Models with Better Relative Position Embeddings](https://www.aclweb.org/anthology/2020.findings-emnlp.298.pdf), EMNLP(Findings), 2020.

[Long Document Ranking with Query-Directed Sparse Transformer](https://www.aclweb.org/anthology/2020.findings-emnlp.412.pdf), EMNLP(Findings), 2020.

[Attention is Not Only a Weight: Analyzing Transformers with Vector Norms](https://www.aclweb.org/anthology/2020.emnlp-main.574.pdf), EMNLP, 2020.

[Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior](https://www.aclweb.org/anthology/2020.findings-emnlp.64.pdf), EMNLP(Findings), 2020.

[Understanding the Difficulty of Training Transformers](https://www.aclweb.org/anthology/2020.emnlp-main.463.pdf), EMNLP, 2020.

[AdapterHub: A Framework for Adapting Transformers](https://www.aclweb.org/anthology/2020.emnlp-demos.7.pdf), EMNLP(Demo), 2020.

[Compressing Transformer-Based Semantic Parsing Models using Compositional Code Embeddings](https://www.aclweb.org/anthology/2020.findings-emnlp.423.pdf), EMNLP(Findings), 2020.

[On the Sub-layer Functionalities of Transformer Decoder](https://www.aclweb.org/anthology/2020.findings-emnlp.432.pdf), EMNLP(Findings), 2020.

[Scheduled DropHead: A Regularization Method for Transformer Models](https://www.aclweb.org/anthology/2020.findings-emnlp.178.pdf), EMNLP(Findings), 2020.

[Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling](https://www.aclweb.org/anthology/P19-1439.pdf), ACL, 2019.

[Scheduled Sampling for Transformers](https://www.aclweb.org/anthology/P19-2049.pdf), ACL, 2019.


### Transfer Learning
[Unsupervised Domain Clusters in Pretrained Language Models](https://www.aclweb.org/anthology/2020.acl-main.692.pdf), ACL, 2020.

[Emerging Cross-lingual Structure in Pretrained Language Models](https://www.aclweb.org/anthology/2020.acl-main.536.pdf), ACL, 2020.
[Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://www.aclweb.org/anthology/2020.acl-main.740.pdf), ACL, 2020.

[Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?](https://www.aclweb.org/anthology/2020.acl-main.467.pdf), ACL, 2020.

[SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics](https://www.aclweb.org/anthology/2020.acl-main.341.pdf), ACL, 2020.

[X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.479.pdf), EMNLP, 2020.

[A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?](https://www.aclweb.org/anthology/2020.emnlp-main.592.pdf), EMNLP, 2020.

[Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting](https://www.aclweb.org/anthology/2020.emnlp-main.634.pdf), EMNLP, 2020.

[Investigating Transferability in Pretrained Language Models](https://www.aclweb.org/anthology/2020.findings-emnlp.125/), EMNLP(Findings), 2020.

[Integrating Task Specific Information into Pretrained Language Models for Low Resource Fine Tuning](https://www.aclweb.org/anthology/2020.findings-emnlp.285.pdf), EMNLP(Findings), 2020.

[Masking as an Efficient Alternative to Finetuning for Pretrained Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.174.pdf), EMNLP, 2020.

[Factorized Transformer for Multi-Domain Neural Machine Translation](https://www.aclweb.org/anthology/2020.findings-emnlp.377.pdf), EMNLP(Findings), 2020.

[From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers](https://www.aclweb.org/anthology/2020.emnlp-main.363.pdf), EMNLP, 2020.

[A Bilingual Generative Transformer for Semantic Sentence Embedding](https://www.aclweb.org/anthology/2020.emnlp-main.122.pdf), EMNLP, 2020.

[Transformer Based Multi-Source Domain Adaptation](https://www.aclweb.org/anthology/2020.emnlp-main.639.pdf), EMNLP, 2020.

[KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations](https://www.aclweb.org/anthology/2020.emnlp-main.18.pdf), EMNLP, 2020.

[A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings](https://ojs.aaai.org//index.php/AAAI/article/view/6443), AAAI, 2020.

[Cross-lingual Language Model Pretraining](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), NeurIPS, 2019.

[Fine-tuning Pre-Trained Transformer Language Models to Distantly Supervised Relation Extraction](https://www.aclweb.org/anthology/P19-1134.pdf), ACL, 2019.

### Others
[SPECTER: Document-level Representation Learning using Citation-informed Transformers](https://www.aclweb.org/anthology/2020.acl-main.207.pdf), ACL, 2020.

[Pretrained Transformers Improve Out-of-Distribution Robustness](https://www.aclweb.org/anthology/2020.acl-main.244.pdf), ACL, 2020.

[Regularized Context Gates on Transformer for Machine Translation](https://www.aclweb.org/anthology/2020.acl-main.757.pdf), ACL, 2020.

[Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification](https://www.aclweb.org/anthology/2020.acl-main.588.pdf), ACL, 2020.
[Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media](https://www.aclweb.org/anthology/2020.findings-emnlp.151.pdf), EMNLP(Findings), 2020.

[Multi-pretraining for Large-scale Text Classification](https://www.aclweb.org/anthology/2020.findings-emnlp.185.pdf), EMNLP(Findings), 2020.

[BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA](https://www.aclweb.org/anthology/2020.findings-emnlp.307.pdf), EMNLP(Findings), 2020.

[Document Ranking with a Pretrained Sequence-to-Sequence Model](https://www.aclweb.org/anthology/2020.findings-emnlp.63.pdf), EMNLP(Findings), 2020.

[Probing Pretrained Language Models for Lexical Semantics](https://www.aclweb.org/anthology/2020.emnlp-main.586.pdf), EMNLP, 2020.

[Pretrain-KGE: Learning Knowledge Representation from Pretrained Language Models](https://www.aclweb.org/anthology/2020.findings-emnlp.25.pdf), EMNLP(Findings), 2020.

[Competence-Level Prediction and Resume & Job Description Matching Using Context-Aware Transformer Models](https://www.aclweb.org/anthology/2020.emnlp-main.679.pdf), EMNLP, 2020.

[Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning](https://www.aclweb.org/anthology/2020.findings-emnlp.286.pdf), EMNLP(Findings), 2020.

[A Time-Aware Transformer Based Model for Suicide Ideation Detection on Social Media](https://www.aclweb.org/anthology/2020.emnlp-main.619.pdf), EMNLP, 2020.

[TNT: Text Normalization based Pre-training of Transformers for Content Moderation](https://www.aclweb.org/anthology/2020.emnlp-main.383.pdf), EMNLP, 2020.

[Coupled Hierarchical Transformer for Stance-Aware Rumor Verification in Social Media Conversations](https://www.aclweb.org/anthology/2020.emnlp-main.108.pdf), EMNLP, 2020.

[Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation](https://www.aclweb.org/anthology/2020.emnlp-main.81.pdf), EMNLP, 2020.

[Table Fact Verification with Structure-Aware Transformer](https://www.aclweb.org/anthology/2020.emnlp-main.126.pdf), EMNLP, 2020.

[Taming Pretrained Transformers for Extreme Multi-label Text Classification](https://arxiv.org/pdf/1905.02331.pdf), KDD, 2020.

[Reciptor: An Effective Pretrained Model for Recipe Representation Learning](https://dl.acm.org/doi/pdf/10.1145/3394486.3403223), KDD, 2020.

[Pretraining Methods for Dialog Context Representation Learning](https://www.aclweb.org/anthology/P19-1373.pdf), ACL, 2019.

[Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings](https://www.aclweb.org/anthology/P19-1168.pdf), ACL, 2019.

[HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization](https://www.aclweb.org/anthology/P19-1499.pdf), ACL, 2019.

[Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers](https://www.aclweb.org/anthology/P19-1132.pdf), ACL, 2019.

[COMET: Commonsense Transformers for Automatic Knowledge Graph Construction](https://www.aclweb.org/anthology/P19-1470.pdf), ACL, 2019.

[Incremental Transformer with Deliberation Decoder for Document Grounded Conversations](https://www.aclweb.org/anthology/P19-1002.pdf), ACL, 2019.
[Hierarchical Transformers for Multi-Document Summarization](https://www.aclweb.org/anthology/P19-1500.pdf), ACL, 2019.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome Vision and Language PreTrain Model (PTM) Papers
Maintained by Yang Gao (ustcgaoy01@gmail.com). Last updated on 12/25/2020.

Due to the large amount of research in this field, we mainly focus on Vision-Only PTMs, Language-Only PTMs, Multimodal PTMs, and other related research such as transfer learning.

## Table of Contents
* [Surveys](#survey)
* [Transformers](#transformers)
* [Vision-Only PTMs](#vision-only-ptms)
* [Language-Only PTMs](#language-only-ptms)
* [MultiModal/Vision-Language PTMs](#multimodal-ptms)

## Survey
[Efficient Transformers: A Survey](https://arxiv.org/pdf/2009.06732.pdf), arXiv, 2020.

[Transformers: State-of-the-Art Natural Language Processing](https://www.aclweb.org/anthology/2020.emnlp-demos.6.pdf), EMNLP(Demo), 2020.

[Pre-trained Models for Natural Language Processing: A Survey](https://arxiv.org/pdf/2003.08271.pdf), arXiv, 2020.

[A Survey on Contextual Embeddings](https://arxiv.org/pdf/2003.07278.pdf), arXiv, 2020.

[Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods](https://arxiv.org/pdf/1907.09358.pdf), arXiv, 2020.

[O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers](https://proceedings.neurips.cc/paper/2020/file/9ed27554c893b5bad850a422c3538c15-Paper.pdf), NeurIPS, 2020.

[Are Transformers universal approximators of sequence-to-sequence functions?](https://openreview.net/pdf?id=ByxRM0Ntvr), ICLR, 2020.

[Deep Multimodal Representation Learning: A Survey](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8715409), IEEE Access, 2019.

[Multimodal Machine Learning: A Survey and Taxonomy](https://arxiv.org/pdf/1705.09406.pdf), TPAMI, 2018.

## Transformers
![](https://github.com/AlenUbuntu/Awesome-Vision-and-Language-PreTrain-Models/blob/main/transformers.png)
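The figure above maps the transformer variants collected below. Nearly all of the "Efficiency and Performance" entries modify the same quadratic self-attention core from Attention Is All You Need (listed at the end of that subsection). As a reference point, here is a minimal PyTorch sketch of that baseline; it is illustrative only and not taken from any cited codebase:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., n, n): quadratic in n
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 128, 64)                   # (batch, sequence length n, d_k)
out = scaled_dot_product_attention(x, x, x)   # self-attention: q = k = v
print(out.shape)                              # torch.Size([2, 128, 64])
```

The n×n `scores` matrix is exactly the cost that Linformer, Performer, BigBird, Longformer, and the other long-sequence models below approximate, sparsify, or cache.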
### Efficiency and Performance
**Performer**

[Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers](https://arxiv.org/pdf/2006.03555.pdf), arXiv, 2020.

**Linformer**

[Linformer: Self-Attention with Linear Complexity](https://arxiv.org/pdf/2006.04768.pdf), arXiv, 2020.

**Linear Transformers**

[Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/pdf/2006.16236.pdf), arXiv, 2020.

**BigBird**

[Big Bird: Transformers for Longer Sequences](https://arxiv.org/pdf/2007.14062.pdf), arXiv, 2020.

**Synthesizer**

[Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/pdf/2005.00743.pdf), arXiv, 2020.

**ETC**

[ETC: Encoding Long and Structured Data in Transformers](https://arxiv.org/pdf/2004.08483.pdf), arXiv, 2020.

**Longformer**

[Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf), arXiv, 2020.

**Sinkhorn Transformer**

[Sparse Sinkhorn Attention](https://arxiv.org/pdf/2002.11296.pdf), arXiv, 2020.

**Compressive Transformer**

[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf), ICLR, 2020.

**Reformer**

[Reformer: The Efficient Transformer](https://arxiv.org/pdf/2001.04451.pdf), ICLR, 2020.

**Depth-Adaptive Transformer**

[Depth-Adaptive Transformer](https://openreview.net/pdf?id=SJg7KhVKPH), ICLR, 2020.

**LayerDrop**

[Reducing Transformer Depth on Demand with Structured Dropout](https://openreview.net/pdf?id=SylO2yStDr), ICLR, 2020.

**Lite Transformer**

[Lite Transformer with Long-Short Range Attention](https://openreview.net/pdf?id=ByeMPlHKPH), ICLR, 2020.

**Transformer-XH**

[Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention](https://openreview.net/pdf?id=r1eIiCNYwS), ICLR, 2020.

**Routing Transformer**

[Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/pdf/2003.05997.pdf), arXiv, 2020.

**FPT**

[Feature Pyramid Transformer](https://arxiv.org/pdf/2007.09451.pdf), ECCV, 2020.

**Sandwich Transformer**

[Improving Transformer Models by Reordering their Sublayers](https://www.aclweb.org/anthology/2020.acl-main.270.pdf), ACL, 2020.

**Highway Transformer**

[Highway Transformer: Self-Gating Enhanced Self-Attentive Networks](https://www.aclweb.org/anthology/2020.acl-main.616.pdf), ACL, 2020.

**Cascade Transformer**

[The Cascade Transformer: an Application for Efficient Answer Sentence Selection](https://www.aclweb.org/anthology/2020.acl-main.504.pdf), ACL, 2020.

**Hardware-Aware Transformer**

[HAT: Hardware-Aware Transformers for Efficient Natural Language Processing](https://www.aclweb.org/anthology/2020.acl-main.686.pdf), ACL, 2020.

**Memory-driven Transformer**

[Generating Radiology Reports via Memory-driven Transformer](https://www.aclweb.org/anthology/2020.emnlp-main.112.pdf), EMNLP, 2020.

**Funnel-Transformer**

[Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://proceedings.neurips.cc/paper/2020/file/2cd2915e69546904e4e5d4a2ac9e1652-Paper.pdf), NeurIPS, 2020.

**LL**

[Deep Transformers with Latent Depth](https://proceedings.neurips.cc/paper/2020/file/1325cdae3b6f0f91a1b629307bf2d498-Paper.pdf), NeurIPS, 2020.

**Fast Transformers with Clustered Attention**

[Fast Transformers with Clustered Attention](https://proceedings.neurips.cc/paper/2020/file/f6a8dd1c954c8506aadc764cc32b895e-Paper.pdf), NeurIPS, 2020.

**PLD**

[Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping](https://proceedings.neurips.cc/paper/2020/file/a1140a3d0df1c81e24ae954d935e8926-Paper.pdf), NeurIPS, 2020.

**Axial Transformer**

[Axial Attention in Multidimensional Transformers](https://arxiv.org/pdf/1912.12180.pdf), arXiv, 2019.

**Sparse Transformer**

[Generating Long Sequences with Sparse Transformers](https://arxiv.org/pdf/1904.10509.pdf), arXiv, 2019.

**Transformer-XL**

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf), ACL, 2019.
**Adaptive-Span**

[Adaptive Attention Span in Transformers](https://www.aclweb.org/anthology/P19-1032.pdf), ACL, 2019.

**Set Transformer**

[Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks](https://arxiv.org/pdf/1810.00825.pdf), ICML, 2019.

**Levenshtein Transformer**

[Levenshtein Transformer](https://proceedings.neurips.cc/paper/2019/file/675f9820626f5bc0afb47b57890b466e-Paper.pdf), NeurIPS, 2019.

**Universal Transformers**

[Universal Transformers](https://openreview.net/pdf?id=HyzdRiR9Y7), ICLR, 2019.

**Memory Compressed**

[Generating Wikipedia by Summarizing Long Sequences](https://arxiv.org/pdf/1801.10198.pdf), ICLR, 2018.

**Transformer**

[Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf), NeurIPS, 2017.

### Vision
**DeiT**

[Training data-efficient image transformers & distillation through attention](https://arxiv.org/pdf/2012.12877.pdf), arXiv, 2020.

**Epipolar Transformers**

[Epipolar Transformers](https://arxiv.org/pdf/2005.04551.pdf), CVPR, 2020.

**Texture Transformer**

[Learning Texture Transformer Network for Image Super-Resolution](https://arxiv.org/pdf/2006.04139.pdf), CVPR, 2020.

**SE(3)-Transformers**

[SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks](https://proceedings.neurips.cc/paper/2020/file/15231a7ce4ba789d13b722cc5c955834-Paper.pdf), NeurIPS, 2020.

**Face Identity Transformers**

[Password-conditioned Anonymization and Deanonymization with Face Identity Transformers](https://arxiv.org/pdf/1911.11759.pdf), ECCV, 2020.

**Image Transformer**

[Image Transformer](https://arxiv.org/pdf/1802.05751.pdf), ICML, 2018.

### Transfer Learning
**Style Transformer**

[Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation](https://www.aclweb.org/anthology/P19-1601.pdf), ACL, 2019.

### MultiModal
**Sign Language Transformers**

[Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation](https://arxiv.org/pdf/2003.13830.pdf), CVPR, 2020.

**Multimodal Transformer**

[Multimodal Transformer for Multimodal Machine Translation](https://www.aclweb.org/anthology/2020.acl-main.400.pdf), ACL, 2020.

**Meshed-Memory Transformer**

[Meshed-Memory Transformer for Image Captioning](https://arxiv.org/pdf/1912.08226.pdf), CVPR, 2020.

**MMT**

[Multi-modal Transformer for Video Retrieval](https://arxiv.org/pdf/2007.10639.pdf), ECCV, 2020.

**SA-M4C**

[Spatially Aware Multimodal Transformers for TextVQA](https://arxiv.org/pdf/2007.12146.pdf), ECCV, 2020.

**MART**

[MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning](https://www.aclweb.org/anthology/2020.acl-main.233.pdf), ACL, 2020.

**MulT**

[Multimodal Transformer for Unaligned Multimodal Language Sequences](https://www.aclweb.org/anthology/P19-1656.pdf), ACL, 2019.

### Graph Transformer
**HetGT**

[Heterogeneous Graph Transformer for Graph-to-Sequence Learning](https://www.aclweb.org/anthology/2020.acl-main.640.pdf), ACL, 2020.
**GTN**

[Graph Transformer Networks](https://proceedings.neurips.cc/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf), NeurIPS, 2019.

## Vision-Only PTMs
### Well-Known Pretrain Models

**Vision Transformer**

[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf), ICLR, 2021.

**LambdaNetworks**

[LambdaNetworks: Modeling long-range Interactions without Attention](https://openreview.net/pdf?id=xTJEN-ggl1b), ICLR, 2021.

**IPT**

[Pre-Trained Image Processing Transformer](https://arxiv.org/pdf/2012.00364.pdf), arXiv, 2020.

**DETR**

[End-to-End Object Detection with Transformers](https://arxiv.org/pdf/2005.12872.pdf), arXiv, 2020.

**Deformable DETR**

[Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/pdf/2010.04159.pdf), arXiv, 2020.

**Epipolar Transformers**

[Epipolar Transformers](https://arxiv.org/pdf/2005.04551.pdf), CVPR, 2020.

**Sketchformer**

[Sketchformer: Transformer-based Representation for Sketched Structure](https://arxiv.org/pdf/2002.10381.pdf), CVPR, 2020.

**Texture Transformer Network**

[Learning Texture Transformer Network for Image Super-Resolution](https://arxiv.org/pdf/2006.04139.pdf), CVPR, 2020.

**iGPT**

[Generative Pretraining from Pixels](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf), ICML, 2020.

**FPT**

[Feature Pyramid Transformer](https://arxiv.org/pdf/2007.09451.pdf), ECCV, 2020.

**STAR**

[Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction](https://arxiv.org/pdf/2005.08514.pdf), ECCV, 2020.

**RelationNet++**

[RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder](https://proceedings.neurips.cc/paper/2020/file/9d684c589d67031a627ad33d59db65e5-Paper.pdf), NeurIPS, 2020.

**SE(3)-Transformers**

[SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks](https://proceedings.neurips.cc/paper/2020/file/15231a7ce4ba789d13b722cc5c955834-Paper.pdf), NeurIPS, 2020.
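Several of the entries above, starting with the Vision Transformer, drop convolutions almost entirely by embedding image patches as tokens. A minimal sketch of that patch-embedding step in PyTorch; the 16x16 patch size and 768-d hidden size follow ViT-Base, while everything else is an illustrative simplification:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Slice an image into patches, embed them, prepend a [CLS] token."""
    def __init__(self, img_size=224, patch=16, in_ch=3, hidden=768):
        super().__init__()
        n = (img_size // patch) ** 2
        # A strided conv is the standard trick for "split + flatten + project".
        self.proj = nn.Conv2d(in_ch, hidden, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, hidden))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, hidden))

    def forward(self, imgs):                               # (B, 3, 224, 224)
        x = self.proj(imgs).flatten(2).transpose(1, 2)     # (B, 196, 768)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos       # (B, 197, 768)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -> feed to a standard transformer encoder
```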
### Other Topics
[Modeling Techniques, Transfer Learning and Applications](https://github.com/AlenUbuntu/Awesome-Vision-and-Language-PreTrain-Models/blob/main/VisionOnlyPTMs.md)

## Language-Only PTMs
### Well-Known Pretrain Models
**ELECTRA**

[ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/pdf/2003.10555.pdf), ICLR, 2020.

**ALBERT**

[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/pdf/1909.11942.pdf), ICLR, 2020.

**MiniLM**

[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://proceedings.neurips.cc/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), NeurIPS, 2020.

**Longformer**

[Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf), arXiv, 2020.

**XLM**

[Cross-lingual Language Model Pretraining](https://arxiv.org/pdf/1901.07291.pdf), NeurIPS, 2019.

**DistilBERT**

[DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/pdf/1910.01108.pdf), NeurIPS Workshop, 2019.

**T5**

[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf), JMLR, 2020.

**BART**

[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf), ACL, 2020.

**XLNet**

[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf), NeurIPS, 2019.

**Transformer-XL**

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf), ACL, 2019.

**GPT/GPT-2**

[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), OpenAI Blog, 2019.

**RoBERTa**

[RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf), arXiv, 2019.

**BERT**

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf), NAACL, 2019.

**Ouroboros**

[Ouroboros: On Accelerating Training of Transformer-Based Language Models](https://proceedings.neurips.cc/paper/2019/file/1b79b52d1bf6f71b2b1eb7ca08ed0776-Paper.pdf), NeurIPS, 2019.
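Most checkpoints in this subsection (BERT, RoBERTa, ALBERT, ELECTRA, XLNet, DistilBERT, BART, T5, Longformer) are distributed through the HuggingFace `transformers` library cited in the Survey section. A minimal feature-extraction sketch, assuming `torch` and `transformers` are installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Swapping the checkpoint name switches between most models in this list,
# e.g. "roberta-base", "albert-base-v2", "distilbert-base-uncased".
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Pretrained transformers are strong feature extractors.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for BERT-base
```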
### Other Topics
[Modeling Techniques, Transfer Learning and Applications](https://github.com/AlenUbuntu/Awesome-Vision-and-Language-PreTrain-Models/blob/main/LanguageOnlyPTMs.md)

## Multimodal PTMs
### Well-Known Pretrain Models
**Sign Language Transformers**

[Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation](https://arxiv.org/pdf/2003.13830.pdf), CVPR, 2020.

**Meshed-Memory Transformer**

[Meshed-Memory Transformer for Image Captioning](https://arxiv.org/pdf/1912.08226.pdf), CVPR, 2020.

**SCT**

[SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation](https://arxiv.org/pdf/2003.14266.pdf), CVPR, 2020.

**M4C**

[Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf), CVPR, 2020.

**MMT**

[Multi-modal Transformer for Video Retrieval](https://arxiv.org/pdf/2007.10639.pdf), ECCV, 2020.

**SA-M4C**

[Spatially Aware Multimodal Transformers for TextVQA](https://arxiv.org/pdf/2007.12146.pdf), ECCV, 2020.

**Progressive Transformers**

[Progressive Transformers for End-to-End Sign Language Production](https://arxiv.org/pdf/2004.14874.pdf), ECCV, 2020.

**MAG-BERT, MAG-XLNet**

[Integrating Multimodal Information in Large Pretrained Transformers](https://www.aclweb.org/anthology/2020.acl-main.214.pdf), ACL, 2020.

**MART**

[MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning](https://www.aclweb.org/anthology/2020.acl-main.233.pdf), ACL, 2020.

**TaBERT**

[TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data](https://www.aclweb.org/anthology/2020.acl-main.745.pdf), ACL, 2020.

**AV-ASR**

[Multiresolution and Multimodal Speech Recognition with Transformers](https://www.aclweb.org/anthology/2020.acl-main.216.pdf), ACL, 2020.

**Multimodal Transformer**

[Multimodal Transformer for Multimodal Machine Translation](https://www.aclweb.org/anthology/2020.acl-main.400.pdf), ACL, 2020.

**Unified Multimodal Transformer**

[Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer](https://www.aclweb.org/anthology/2020.acl-main.306.pdf), ACL, 2020.

**VGD-GPT2**

[Video-Grounded Dialogues with Pretrained Generation Language Models](https://www.aclweb.org/anthology/2020.acl-main.518.pdf), ACL, 2020.

**X-LXMERT**

[X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers](https://www.aclweb.org/anthology/2020.emnlp-main.707.pdf), EMNLP, 2020.

**VD-BERT**

[VD-BERT: A Unified Vision and Dialog Transformer with BERT](https://arxiv.org/pdf/2004.13278.pdf), EMNLP, 2020.

**HERO**

[HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training](https://arxiv.org/pdf/2005.00200.pdf), EMNLP, 2020.

**PREVALENT**

[Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training](https://arxiv.org/pdf/2002.10638.pdf), CVPR, 2020.

**ActBERT**

[ActBERT: Learning Global-Local Video-Text Representations](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_ActBERT_Learning_Global-Local_Video-Text_Representations_CVPR_2020_paper.pdf), CVPR, 2020.

**COOT**

[COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning](https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf), NeurIPS, 2020.

**VL-BERT**

[VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/pdf/1908.08530.pdf), ICLR, 2020.

**Unicoder-VL**

[Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training](https://arxiv.org/pdf/1908.06066.pdf), AAAI, 2020.

**VLP**

[Unified Vision-Language Pre-Training for Image Captioning and VQA](https://arxiv.org/pdf/1909.11059.pdf), AAAI, 2020.

**InterBERT**

[InterBERT: An Effective Multi-Modal Pretraining Approach via Vision-and-Language Interaction](https://arxiv.org/pdf/2003.13198.pdf), arXiv, 2020.

**Oscar**

[Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/pdf/2004.06165.pdf), arXiv, 2020.

**Pixel-BERT**

[Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers](https://arxiv.org/pdf/2004.00849.pdf), arXiv, 2020.

**ERNIE-ViL**

[ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/pdf/2006.16934.pdf), arXiv, 2020.
**ImageBERT**

[ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data](https://arxiv.org/pdf/2001.07966.pdf), arXiv, 2020.

**XGPT**

[XGPT: Cross-modal Generative Pre-Training for Image Captioning](https://arxiv.org/pdf/2003.01473.pdf), arXiv, 2020.

**UniVL**

[UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation](https://arxiv.org/pdf/2002.06353.pdf), arXiv, 2020.

**TransED**

[Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training](https://arxiv.org/pdf/2007.02375.pdf), arXiv, 2020.

**DeVLBert**

[DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/pdf/2008.06884.pdf), ACM MM, 2020.

**SemVLP**

[SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels](https://openreview.net/pdf?id=Wg2PSpLZiH), ICLR, 2021.

**Cross-Probe BERT**

[Cross-Probe BERT for Efficient and Effective Cross-Modal Search](https://openreview.net/pdf?id=bW9SYKHcZiz), ICLR, 2021.

**12-in-1**

[12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/pdf/1912.02315.pdf), arXiv, 2020.

**MTN**

[Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems](https://www.aclweb.org/anthology/P19-1564.pdf), ACL, 2019.

**MulT**

[Multimodal Transformer for Unaligned Multimodal Language Sequences](https://www.aclweb.org/anthology/P19-1656.pdf), ACL, 2019.

**ETA-Transformer**

[Entangled Transformer for Image Captioning](https://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf), ICCV, 2019.

**VideoBERT**

[VideoBERT: A Joint Model for Video and Language Representation Learning](https://arxiv.org/pdf/1904.01766.pdf), ICCV, 2019.

**ViLBERT**

[ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/pdf/1908.02265.pdf), NeurIPS, 2019.

**LXMERT**

[LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/pdf/1908.07490.pdf), EMNLP, 2019.

**MMBT**

[Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf), arXiv, 2019.

**VisualBERT**

[VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557.pdf), arXiv, 2019.

**UNITER**

[UNITER: UNiversal Image-TExt Representation Learning](https://arxiv.org/pdf/1909.11740.pdf), arXiv, 2019.

**Vision-Language Encoder**

[Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks](https://arxiv.org/pdf/1912.03063.pdf), arXiv, 2019.

**CBT**

[Learning Video Representations using Contrastive Bidirectional Transformer](https://arxiv.org/pdf/1906.05743.pdf), arXiv, 2019.

**B2T2**

[Fusion of Detected Objects in Text for Visual Question Answering](https://arxiv.org/pdf/1908.05054.pdf), EMNLP, 2019.
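A few of the models above, LXMERT and VisualBERT among them, also ship in the HuggingFace `transformers` library. A minimal LXMERT sketch, assuming the `unc-nlp/lxmert-base-uncased` checkpoint is available; the random tensors stand in for the Faster R-CNN region features and normalized boxes that a real pipeline must extract from the image first:

```python
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("a dog chasing a ball", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)   # 36 detector regions, 2048-d each (placeholder)
visual_pos = torch.rand(1, 36, 4)         # normalized (x1, y1, x2, y2) boxes (placeholder)

out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
# Two-stream output: one contextualized sequence per modality.
print(out.language_output.shape, out.vision_output.shape)
```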

### Special Topic
[Vision-Language-PTMs](https://github.com/AlenUbuntu/Awesome-Vision-and-Language-PreTrain-Models/blob/main/VL-PTMs.md)

--------------------------------------------------------------------------------