└── README.md /README.md: -------------------------------------------------------------------------------- 1 | # BERT-related Papers 2 | This is a list of BERT-related papers. Any feedback is welcome. 3 | 4 | (ChatGPT-related papers are listed at https://github.com/tomohideshibata/ChatGPT-related-papers.) 5 | 6 | ## Table of Contents 7 | - [Survey paper](#survey-paper) 8 | - [Downstream task](#downstream-task) 9 | - [Generation](#generation) 10 | - [Quality evaluator](#quality-evaluator) 11 | - [Modification (multi-task, masking strategy, etc.)](#modification-multi-task-masking-strategy-etc) 12 | - [Sentence embedding](#sentence-embedding) 13 | - [Transformer variants](#transformer-variants) 14 | - [Probe](#probe) 15 | - [Inside BERT](#inside-bert) 16 | - [Multi-lingual](#multi-lingual) 17 | - [Other than English models](#other-than-english-models) 18 | - [Domain specific](#domain-specific) 19 | - [Multi-modal](#multi-modal) 20 | - [Model compression](#model-compression) 21 | - [Large language model](#large-language-model) 22 | - [Reinforcement learning from human feedback](#reinforcement-learning-from-human-feedback) 23 | - [Misc.](#misc) 24 | 25 | ## Survey paper 26 | - [Evolution of transfer learning in natural language processing](https://arxiv.org/abs/1910.07370) 27 | - [Pre-trained Models for Natural Language Processing: A Survey](https://arxiv.org/abs/2003.08271) 28 | - [A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278) 29 | - [A Survey on Transfer Learning in Natural Language Processing](https://arxiv.org/abs/2007.04239) 30 | - [Which \*BERT? A Survey Organizing Contextualized Encoders](https://arxiv.org/abs/2010.00854) (EMNLP2020) 31 | - [The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures](https://arxiv.org/abs/2104.10640) 32 | - [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) 33 | - [A Survey of Transformers](https://arxiv.org/abs/2106.04554) 34 | - [AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing](https://arxiv.org/abs/2108.05542) 35 | - [Paradigm Shift in Natural Language Processing](https://arxiv.org/abs/2109.12575) 36 | - [Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey](https://arxiv.org/abs/2111.01243) 37 | - [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238) 38 | - [A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT](https://arxiv.org/abs/2302.09419) 39 | 40 | ## Downstream task 41 | ### QA, MC, Dialogue 42 | - [Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond](https://arxiv.org/abs/2005.06249) 43 | - [A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets](https://arxiv.org/abs/2006.11880) 44 | - [A BERT Baseline for the Natural Questions](https://arxiv.org/abs/1901.08634) 45 | - [MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension](https://arxiv.org/abs/1905.13453) (ACL2019) 46 | - [BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://arxiv.org/abs/1905.10044) (NAACL2019) [[github](https://github.com/google-research-datasets/boolean-questions)] 47 | - [Natural Perturbation for Robust Question Answering](https://arxiv.org/abs/2004.04849) 48 | - [Unsupervised Domain Adaptation on Reading Comprehension](https://arxiv.org/abs/1911.06137) 49 | - [BERTQA -- Attention on Steroids](https://arxiv.org/abs/1912.10435) 50 
| - [Exploring BERT Parameter Efficiency on the Stanford Question Answering Dataset v2.0](https://arxiv.org/abs/2002.10670) 51 | - [Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension](https://arxiv.org/abs/2004.06076) 52 | - [Logic-Guided Data Augmentation and Regularization for Consistent Question Answering](https://arxiv.org/abs/2004.10157) (ACL2020) 53 | - [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700) 54 | - [How Can We Know When Language Models Know?](https://arxiv.org/abs/2012.00955) 55 | - [A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning](https://arxiv.org/abs/1908.05514) (EMNLP2019) 56 | - [A Simple and Effective Model for Answering Multi-span Questions](https://arxiv.org/abs/1909.13375) [[github](https://github.com/eladsegal/tag-based-multi-span-extraction)] 57 | - [Injecting Numerical Reasoning Skills into Language Models](https://arxiv.org/abs/2004.04487) (ACL2020) 58 | - [Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks](https://arxiv.org/abs/2005.08516) 59 | - [SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering](https://arxiv.org/abs/1812.03593) 60 | - [Multi-hop Question Answering via Reasoning Chains](https://arxiv.org/abs/1910.02610) 61 | - [Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents](https://arxiv.org/abs/1911.00484) 62 | - [Multi-step Entity-centric Information Retrieval for Multi-Hop Question Answering](https://arxiv.org/abs/1909.07598) (EMNLP2019 WS) 63 | - [Fine-tuning Multi-hop Question Answering with Hierarchical Graph Network](https://arxiv.org/abs/2004.13821) 64 | - [Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering](https://www.aclweb.org/anthology/2020.acl-main.414/) (ACL2020) 65 | - [HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data](https://arxiv.org/abs/2004.07347) 66 | - [Unsupervised Multi-hop Question Answering by Question Generation](https://arxiv.org/abs/2010.12623) (NAACL2021) 67 | - [End-to-End Open-Domain Question Answering with BERTserini](https://arxiv.org/abs/1902.01718) (NAALC2019) 68 | - [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/abs/1906.00300) (ACL2019) 69 | - [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) (EMNLP2020) 70 | - [Efficient Passage Retrieval with Hashing for Open-domain Question Answering](https://arxiv.org/abs/2106.00882) (ACL2021) 71 | - [End-to-End Training of Neural Retrievers for Open-Domain Question Answering](https://arxiv.org/abs/2101.00408) 72 | - [Domain-matched Pre-training Tasks for Dense Retrieval](https://arxiv.org/abs/2107.13602) 73 | - [Towards Robust Neural Retrieval Models with Synthetic Pre-Training](https://arxiv.org/abs/2104.07800) 74 | - [Simple Entity-Centric Questions Challenge Dense Retrievers](https://arxiv.org/abs/2109.08535) (EMNLP2021) [[github](https://github.com/princeton-nlp/EntityQuestions)] 75 | - [Phrase Retrieval Learns Passage Retrieval, Too](https://arxiv.org/abs/2109.08133) (EMNLP2021) [[github](https://github.com/princeton-nlp/DensePhrases)] 76 | - [Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering](https://arxiv.org/abs/2007.01282) 77 | - [Progressively Pretrained Dense Corpus Index for Open-Domain Question 
Answering](https://arxiv.org/abs/2005.00038) (EACL2021) 78 | - [Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval](https://arxiv.org/abs/2009.12756) 79 | - [Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval](https://arxiv.org/abs/2104.05883) (NAACL2021) [[github](https://github.com/henryzhao5852/BeamDR)] 80 | - [Retrieve, Read, Rerank, then Iterate: Answering Open-Domain Questions of Varying Reasoning Steps from Text](https://arxiv.org/abs/2010.12527) 81 | - [RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2010.08191) 82 | - [Pre-training Tasks for Embedding-based Large-scale Retrieval](https://arxiv.org/abs/2002.03932) (ICLR2020) 83 | - [Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering](https://arxiv.org/abs/1908.08167) (EMNLP2019) 84 | - [QED: A Framework and Dataset for Explanations in Question Answering](https://arxiv.org/abs/2009.06354) [[github](https://github.com/google-research-datasets/QED)] 85 | - [Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering](https://arxiv.org/abs/1911.10470) (ICLR2020) 86 | - [Relevance-guided Supervision for OpenQA with ColBERT](https://arxiv.org/abs/2007.00814) 87 | - [RECONSIDER: Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering](https://arxiv.org/abs/2010.10757) 88 | - [Joint Passage Ranking for Diverse Multi-Answer Retrieval](https://arxiv.org/abs/2104.08445) 89 | - [SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval](https://arxiv.org/abs/2009.13013) 90 | - [Don't Read Too Much into It: Adaptive Computation for Open-Domain Question Answering](https://arxiv.org/abs/2011.05435) (EMNLP2020 WS) 91 | - [Pruning the Index Contents for Memory Efficient Open-Domain QA](https://arxiv.org/abs/2102.10697) [[github](https://github.com/KNOT-FIT-BUT/R2-D2)] 92 | - [Is Retriever Merely an Approximator of Reader?](https://arxiv.org/abs/2010.10999) 93 | - [Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation](https://arxiv.org/abs/2009.13815) 94 | - [RikiNet: Reading Wikipedia Pages for Natural Question Answering](https://arxiv.org/abs/2004.14560) (ACL2020) 95 | - [BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA](https://arxiv.org/abs/2005.00766) 96 | - [DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding](https://arxiv.org/abs/2002.12591) (SIGIR2020) 97 | - [Learning to Ask Unanswerable Questions for Machine Reading Comprehension](https://arxiv.org/abs/1906.06045) (ACL2019) 98 | - [Unsupervised Question Answering by Cloze Translation](https://arxiv.org/abs/1906.04980) (ACL2019) 99 | - [Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation](https://arxiv.org/abs/1908.04942) (ICLR2020) 100 | - [A Recurrent BERT-based Model for Question Generation](https://www.aclweb.org/anthology/D19-5821/) (EMNLP2019 WS) 101 | - [Unsupervised Question Decomposition for Question Answering](https://arxiv.org/abs/2002.09758) [[github](https://github.com/facebookresearch/UnsupervisedDecomposition)] 102 | - [Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models](https://arxiv.org/abs/2004.01909) 103 | - [Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering](https://arxiv.org/abs/2004.11892) (ACL2020) 104 
| - [What Are People Asking About COVID-19? A Question Classification Dataset](https://arxiv.org/abs/2005.12522) 105 | - [Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds](https://arxiv.org/abs/1911.02365) 106 | - [Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension](https://www.aclweb.org/anthology/papers/P/P19/P19-1226/) (ACL2019) 107 | - [QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering](https://arxiv.org/abs/2104.06378) (NAACL2021) [[github](https://github.com/michiyasunaga/qagnn)] [[blog](http://ai.stanford.edu/blog/qagnn/)] 108 | - [Incorporating Relation Knowledge into Commonsense Reading Comprehension with Multi-task Learning](https://arxiv.org/abs/1908.04530) (CIKM2019) 109 | - [SG-Net: Syntax-Guided Machine Reading Comprehension](https://arxiv.org/abs/1908.05147) 110 | - [MMM: Multi-stage Multi-task Learning for Multi-choice Reading Comprehension](https://arxiv.org/abs/1910.00458) 111 | - [Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning](https://arxiv.org/abs/1909.00277) (EMNLP2019) 112 | - [ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning](https://arxiv.org/abs/2002.04326) (ICLR2020) 113 | - [Robust Reading Comprehension with Linguistic Constraints via Posterior Regularization](https://arxiv.org/abs/1911.06948) 114 | - [BAS: An Answer Selection Method Using BERT Language Model](https://arxiv.org/abs/1911.01528) 115 | - [Utilizing Bidirectional Encoder Representations from Transformers for Answer Selection](https://arxiv.org/abs/2011.07208) (AMMCS2019) 116 | - [TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection](https://arxiv.org/abs/1911.04118) (AAAI2020) 117 | - [The Cascade Transformer: an Application for Efficient Answer Sentence Selection](https://arxiv.org/abs/2005.02534) (ACL2020) 118 | - [Support-BERT: Predicting Quality of Question-Answer Pairs in MSDN using Deep Bidirectional Transformer](https://arxiv.org/abs/2005.08294) 119 | - [Beat the AI: Investigating Adversarial Human Annotations for Reading Comprehension](https://arxiv.org/abs/2002.00293) 120 | - [Benchmarking Robustness of Machine Reading Comprehension Models](https://arxiv.org/abs/2004.14004) 121 | - [Evaluating NLP Models via Contrast Sets](https://arxiv.org/abs/2004.02709) 122 | - [Undersensitivity in Neural Reading Comprehension](https://arxiv.org/abs/2003.04808) 123 | - [Developing a How-to Tip Machine Comprehension Dataset and its Evaluation in Machine Comprehension by BERT](https://www.aclweb.org/anthology/2020.fever-1.4/) (ACL2020 WS) 124 | - [A Simple but Effective Method to Incorporate Multi-turn Context with BERT for Conversational Machine Comprehension](https://arxiv.org/abs/1905.12848) (ACL2019 WS) 125 | - [FlowDelta: Modeling Flow Information Gain in Reasoning for Conversational Machine Comprehension](https://arxiv.org/abs/1908.05117) (ACL2019 WS) 126 | - [BERT with History Answer Embedding for Conversational Question Answering](https://arxiv.org/abs/1905.05412) (SIGIR2019) 127 | - [GraphFlow: Exploiting Conversation Flow with Graph Neural Networks for Conversational Machine Comprehension](https://arxiv.org/abs/1908.00059) (ICML2019 WS) 128 | - [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) (ACL2020) 129 | - [TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data](https://arxiv.org/abs/2005.08314) (ACL2020) 130 | - [Understanding tables 
with intermediate pre-training](https://arxiv.org/abs/2010.00571) (EMNLP2020 Findings) 131 | - [GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing](https://arxiv.org/abs/2009.13845) (ICLR2021) 132 | - [Table Search Using a Deep Contextualized Language Model](https://arxiv.org/abs/2005.09207) (SIGIR2020) 133 | - [Open Domain Question Answering over Tables via Dense Retrieval](https://arxiv.org/abs/2103.12011) (NAACL2021) 134 | - [Capturing Row and Column Semantics in Transformer Based Question Answering over Tables](https://arxiv.org/abs/2104.08303) (NAACL2021) 135 | - [MATE: Multi-view Attention for Table Transformer Efficiency](https://arxiv.org/abs/2109.04312) (EMNLP2021) 136 | - [TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions](https://arxiv.org/abs/2005.00242) (EMNLP2020) 137 | - [Beyond English-only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian](https://arxiv.org/abs/1908.01519) (RANLP2019) 138 | - [XQA: A Cross-lingual Open-domain Question Answering Dataset](https://www.aclweb.org/anthology/P19-1227/) (ACL2019) 139 | - [XOR QA: Cross-lingual Open-Retrieval Question Answering](https://arxiv.org/abs/2010.11856) (NAACL2021) [[website](https://nlp.cs.washington.edu/xorqa/)] 140 | - [Cross-Lingual Machine Reading Comprehension](https://arxiv.org/abs/1909.00361) (EMNLP2019) 141 | - [Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model](https://arxiv.org/abs/1909.09587) 142 | - [Multilingual Question Answering from Formatted Text applied to Conversational Agents](https://arxiv.org/abs/1910.04659) 143 | - [BiPaR: A Bilingual Parallel Dataset for Multilingual and Cross-lingual Reading Comprehension on Novels](https://arxiv.org/abs/1910.05040) (EMNLP2019) 144 | - [MLQA: Evaluating Cross-lingual Extractive Question Answering](https://arxiv.org/abs/1910.07475) 145 | - [Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension](https://arxiv.org/abs/2010.12008) 146 | - [Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering](https://arxiv.org/abs/2010.12643) 147 | - [Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation](https://arxiv.org/abs/2010.14271) (COLING2020) 148 | - [MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering](https://arxiv.org/abs/2007.15207) [[github](https://github.com/apple/ml-mkqa)] 149 | - [Towards More Equitable Question Answering Systems: How Much More Data Do You Need?](https://arxiv.org/abs/2105.14115) (ACL2021) 150 | - [X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural Language Understanding and Question Answering](https://arxiv.org/abs/2104.09696) (NAACL2021) 151 | - [Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension](https://arxiv.org/abs/1904.09679) (TACL) 152 | - [SberQuAD - Russian Reading Comprehension Dataset: Description and Analysis](https://arxiv.org/abs/1912.09723) 153 | - [DuReaderrobust: A Chinese Dataset Towards Evaluating the Robustness of Machine Reading Comprehension Models](https://arxiv.org/abs/2004.11142) 154 | - [Giving BERT a Calculator: Finding Operations and Arguments with Reading Comprehension](https://arxiv.org/abs/1909.00109) (EMNLP2019) 155 | - [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) (ACL2021) 156 | - [DialoGLUE: A Natural Language Understanding Benchmark for 
Task-Oriented Dialogue](https://arxiv.org/abs/2009.13570) [[website](https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview)] 157 | - [A Short Survey of Pre-trained Language Models for Conversational AI-A NewAge in NLP](https://arxiv.org/abs/2104.10810) 158 | - [MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding](https://arxiv.org/abs/2106.01541) (ACL2021) 159 | - [BERT-DST: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer](https://arxiv.org/abs/1907.03040) (Interspeech2019) 160 | - [Dialog State Tracking: A Neural Reading Comprehension Approach](https://arxiv.org/abs/1908.01946) 161 | - [A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems](https://arxiv.org/abs/1910.12995) (ICASSP2020) 162 | - [Fine-Tuning BERT for Schema-Guided Zero-Shot Dialogue State Tracking](https://arxiv.org/abs/2002.00181) 163 | - [Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker](https://arxiv.org/abs/2002.02450) 164 | - [Dialogue State Tracking with Pretrained Encoder for Multi-domain Trask-oriented Dialogue Systems](https://arxiv.org/abs/2004.10663) 165 | - [Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking](https://arxiv.org/abs/2005.00891) (ACL2020) 166 | - [A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset](https://arxiv.org/abs/2008.12335) (KDD2020 WS) 167 | - [Knowledge-Aware Graph-Enhanced GPT-2 for Dialogue State Tracking](https://arxiv.org/abs/2104.04466) 168 | - [Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking](https://arxiv.org/abs/2106.08723) (Interspeech2021) 169 | - [ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues](https://arxiv.org/abs/2004.06871) (EMNLP2020) 170 | - [Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances](https://arxiv.org/abs/2106.02227) (ACL2021) 171 | - [Domain Adaptive Training BERT for Response Selection](https://arxiv.org/abs/1908.04812) 172 | - [Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots](https://arxiv.org/abs/2004.03588) 173 | - [Curriculum Learning Strategies for IR: An Empirical Study on Conversation Response Ranking](https://arxiv.org/abs/1912.08555) (ECIR2020) 174 | - [MuTual: A Dataset for Multi-Turn Dialogue Reasoning](https://arxiv.org/abs/2004.04494) (ACL2020) 175 | - [DialBERT: A Hierarchical Pre-Trained Model for Conversation Disentanglement](https://arxiv.org/abs/2004.03760) 176 | - [Generalized Conditioned Dialogue Generation Based on Pre-trained Language Model](https://arxiv.org/abs/2010.11140) 177 | - [BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data](https://arxiv.org/abs/2106.06169) (ACL2021) 178 | - [Interactive Teaching for Conversational AI](https://arxiv.org/abs/2012.00958) (NeurIPS2020 WS) 179 | - [BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding](https://arxiv.org/abs/1911.00473) 180 | ### Slot filling and Intent Detection 181 | - [A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding](https://www.aclweb.org/anthology/D19-1214/) (EMNLP2019) 182 | - [BERT for Joint Intent Classification and Slot Filling](https://arxiv.org/abs/1902.10909) 183 | - [A Co-Interactive Transformer for Joint Slot Filling and Intent 
Detection](https://arxiv.org/abs/2010.03880) (ICASSP2021) 184 | - [Few-shot Intent Classification and Slot Filling with Retrieved Examples](https://arxiv.org/abs/2104.05763) (NAACL2021) 185 | - [Multi-lingual Intent Detection and Slot Filling in a Joint BERT-based Model](https://arxiv.org/abs/1907.02884) 186 | - [A Comparison of Deep Learning Methods for Language Understanding](https://www.isca-speech.org/archive/Interspeech_2019/abstracts/1262.html) (Interspeech2019) 187 | - [Data Augmentation for Spoken Language Understanding via Pretrained Models](https://arxiv.org/abs/2004.13952) 188 | - [Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning] (EMNLP2021) 189 | - [STIL -- Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++](https://arxiv.org/abs/2010.00760) (AACL-IJCNLP2020) [[github](https://github.com/amazon-research/stil-mbart-multiatis)] 190 | ### Analysis 191 | - [Fine-grained Information Status Classification Using Discourse Context-Aware Self-Attention](https://arxiv.org/abs/1908.04755) 192 | - [Neural Aspect and Opinion Term Extraction with Mined Rules as Weak Supervision](https://arxiv.org/abs/1907.03750) (ACL2019) 193 | - [BERT-based Lexical Substitution](https://www.aclweb.org/anthology/P19-1328) (ACL2019) 194 | - [Assessing BERT’s Syntactic Abilities](https://arxiv.org/abs/1901.05287) 195 | - [Investigating Novel Verb Learning in BERT: Selectional Preference Classes and Alternation-Based Syntactic Generalization](https://arxiv.org/abs/2011.02417) (EMNLP2020 WS) 196 | - [Does BERT agree? Evaluating knowledge of structure dependence through agreement relations](https://arxiv.org/abs/1908.09892) 197 | - [Simple BERT Models for Relation Extraction and Semantic Role Labeling](https://arxiv.org/abs/1904.05255) 198 | - [Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach](https://aclanthology.org/2020.coling-main.120/) (COLING2020) 199 | - [LIMIT-BERT : Linguistic Informed Multi-Task BERT](https://arxiv.org/abs/1910.14296) (EMNLP2020 Findings) 200 | - [Joint Semantic Analysis with Document-Level Cross-Task Coherence Rewards](https://arxiv.org/abs/2010.05567) 201 | - [A Simple BERT-Based Approach for Lexical Simplification](https://arxiv.org/abs/1907.06226) 202 | - [BERT-Based Arabic Social Media Author Profiling](https://arxiv.org/abs/1909.04181) 203 | - [Sentence-Level BERT and Multi-Task Learning of Age and Gender in Social Media](https://arxiv.org/abs/1911.00637) 204 | - [Evaluating the Factual Consistency of Abstractive Text Summarization](https://arxiv.org/abs/1910.12840) 205 | - [Generating Fact Checking Explanations](https://arxiv.org/abs/2004.05773) (ACL2020) 206 | - [NegBERT: A Transfer Learning Approach for Negation Detection and Scope Resolution](https://arxiv.org/abs/1911.04211) 207 | - [xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation](https://arxiv.org/abs/1911.03663) 208 | - [TabFact: A Large-scale Dataset for Table-based Fact Verification](https://arxiv.org/abs/1909.02164) (ICLR2020) 209 | - [Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents](https://arxiv.org/abs/2002.01861) 210 | - [A Focused Study to Compare Arabic Pre-training Models on Newswire IE Tasks](https://arxiv.org/abs/2004.14519) 211 | - [LAMBERT: Layout-Aware (Language) Modeling for information extraction](https://arxiv.org/abs/2002.08087) (ICDAR2021) 212 | - [Keyphrase Extraction from 
Scholarly Articles as Sequence Labeling using Contextualized Embeddings](https://arxiv.org/abs/1910.08840) (ECIR2020) [[github](https://github.com/midas-research/keyphrase-extraction-as-sequence-labeling-data)] 213 | - [Keyphrase Extraction with Span-based Feature Representations](https://arxiv.org/abs/2002.05407) 214 | - [Keyphrase Prediction With Pre-trained Language Model](https://arxiv.org/abs/2004.10462) 215 | - [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling](https://www.preprints.org/manuscript/201908.0073/v1) [[github](https://github.com/MaartenGr/KeyBERT)] 216 | - [Joint Keyphrase Chunking and Salience Ranking with BERT](https://arxiv.org/abs/2004.13639) 217 | - [Generalizing Natural Language Analysis through Span-relation Representations](https://arxiv.org/abs/1911.03822) (ACL2020) [[github](https://github.com/neulab/cmu-multinlp)] 218 | - [What do you mean, BERT? Assessing BERT as a Distributional Semantics Model](https://arxiv.org/abs/1911.05758) 219 | - [tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection](https://www.aclweb.org/anthology/2020.acl-main.630/) (ACL2020) 220 | - [Domain Adaptation with BERT-based Domain Classification and Data Selection](https://www.aclweb.org/anthology/D19-6109/) (EMNLP2019 WS) 221 | - [PERL: Pivot-based Domain Adaptation for Pre-trained Deep Contextualized Embedding Models](https://arxiv.org/abs/2006.09075) (TACL2020) 222 | - [Unsupervised Out-of-Domain Detection via Pre-trained Transformers](https://arxiv.org/abs/2106.00948) (ACL2021) [[github](https://github.com/rivercold/BERT-unsupervised-OOD)] 223 | - [Knowledge Distillation for BERT Unsupervised Domain Adaptation](https://arxiv.org/abs/2010.11478) 224 | - [Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT](https://arxiv.org/abs/2003.03106) (LREC2020) 225 | - [Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?](https://arxiv.org/abs/2104.07762) (NAACL2021) 226 | - [On the Importance of Word and Sentence Representation Learning in Implicit Discourse Relation Classification](https://arxiv.org/abs/2004.12617) (IJCAI2020) 227 | - [Adapting BERT to Implicit Discourse Relation Classification with a Focus on Discourse Connectives](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.144.pdf) (LREC2020) 228 | - [Labeling Explicit Discourse Relations using Pre-trained Language Models](https://arxiv.org/abs/2006.11852) (TSD2020) 229 | - [Causal-BERT : Language models for causality detection between events expressed in text](https://arxiv.org/abs/2012.05453) 230 | - [BERT4SO: Neural Sentence Ordering by Fine-tuning BERT](https://arxiv.org/abs/2103.13584) 231 | - [Document-Level Event Argument Extraction by Conditional Generation](https://arxiv.org/abs/2104.05919) (NAACL2021) 232 | - [Cross-lingual Zero- and Few-shot Hate Speech Detection Utilising Frozen Transformer Language Models and AXEL](https://arxiv.org/abs/2004.13850) 233 | - [Same Side Stance Classification Task: Facilitating Argument Stance Classification by Fine-tuning a BERT Model](https://arxiv.org/abs/2004.11163) 234 | - [Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection](https://arxiv.org/abs/2004.13432) 235 | - [KEIS@JUST at SemEval-2020 Task 12: Identifying Multilingual Offensive Tweets Using Weighted Ensemble and Fine-Tuned BERT](https://arxiv.org/abs/2005.07820) 236 | - [ALBERT-BiLSTM for Sequential Metaphor 
Detection](https://www.aclweb.org/anthology/2020.figlang-1.17/) (ACL2020 WS) 237 | - [MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories](https://arxiv.org/abs/2104.13615) (NAACL2021) 238 | - [A BERT-based Dual Embedding Model for Chinese Idiom Prediction](https://arxiv.org/abs/2011.02378) (COLING2020) 239 | - [Should You Fine-Tune BERT for Automated Essay Scoring?](https://www.aclweb.org/anthology/2020.bea-1.15/) (ACL2020 WS) 240 | - [KILT: a Benchmark for Knowledge Intensive Language Tasks](https://arxiv.org/abs/2009.02252) (NAACL2021) [[github](https://github.com/facebookresearch/KILT)] 241 | - [IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding](https://arxiv.org/abs/2009.05387) (AACL-IJCNLP2020) 242 | - [MedFilter: Improving Extraction of Task-relevant Utterances through Integration of Discourse Structure and Ontological Knowledge](https://arxiv.org/abs/2010.02246) (EMNLP2020) 243 | - [ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces](https://arxiv.org/abs/2012.12350) (AAAI2021) 244 | - [UserBERT: Self-supervised User Representation Learning](https://openreview.net/forum?id=zmgJIjyWSOw) 245 | - [UserBERT: Contrastive User Model Pre-training](https://arxiv.org/abs/2109.01274) 246 | - [Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning](https://arxiv.org/abs/2012.02462) (COLING2020) 247 | - [Automatic punctuation restoration with BERT models](https://arxiv.org/abs/2101.07343) 248 | ### Word segmentation, parsing, NER 249 | - [BERT Meets Chinese Word Segmentation](https://arxiv.org/abs/1909.09292) 250 | - [Unified Multi-Criteria Chinese Word Segmentation with BERT](https://arxiv.org/abs/2004.05808) 251 | - [RethinkCWS: Is Chinese Word Segmentation a Solved Task?](https://arxiv.org/abs/2011.06858) (EMNLP2020) [[github](https://github.com/neulab/InterpretEval)] 252 | - [Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability](https://aclanthology.org/2021.findings-acl.383/) (ACL2021 Findings) 253 | - [Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT](https://arxiv.org/abs/2010.00287) 254 | - [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/abs/1903.04190) 255 | - [Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT](https://arxiv.org/abs/1908.04943) (FLAIRS-33) 256 | - [Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing](https://arxiv.org/abs/1908.07448) 257 | - [fastHan: A BERT-based Joint Many-Task Toolkit for Chinese NLP](https://arxiv.org/abs/2009.08633) 258 | - [Deep Contextualized Word Embeddings in Transition-Based and Graph-Based Dependency Parsing -- A Tale of Two Parsers Revisited](https://arxiv.org/abs/1908.07397) (EMNLP2019) 259 | - [Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?](https://arxiv.org/abs/2003.03204) 260 | - [Parsing as Pretraining](https://arxiv.org/abs/2002.01685) (AAAI2020) 261 | - [Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing](https://arxiv.org/abs/1909.06775) 262 | - [Recursive Non-Autoregressive Graph-to-Graph Transformer for Dependency Parsing with Iterative Refinement](https://arxiv.org/abs/2003.13118) 263 | - [StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language 
Modeling](https://arxiv.org/abs/2012.00857) 264 | - [pyBART: Evidence-based Syntactic Transformations for IE](https://arxiv.org/abs/2005.01306) [[github](https://allenai.github.io/pybart/)] 265 | - [Named Entity Recognition -- Is there a glass ceiling?](https://arxiv.org/abs/1910.02403) (CoNLL2019) 266 | - [A Unified MRC Framework for Named Entity Recognition](https://arxiv.org/abs/1910.11476) 267 | - [Biomedical named entity recognition using BERT in the machine reading comprehension framework](https://arxiv.org/abs/2009.01560) 268 | - [Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models](https://arxiv.org/abs/1910.06294) 269 | - [Robust Named Entity Recognition with Truecasing Pretraining](https://arxiv.org/abs/1912.07095) (AAAI2020) 270 | - [LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition](https://arxiv.org/abs/2001.02524) 271 | - [Named Entity Recognition as Dependency Parsing](https://arxiv.org/abs/2005.07150) (ACL2020) 272 | - [Exploring Cross-sentence Contexts for Named Entity Recognition with BERT](https://arxiv.org/abs/2006.01563) 273 | - [CrossNER: Evaluating Cross-Domain Named Entity Recognition](https://arxiv.org/abs/2012.04373) (AAAI2021) [[github](https://github.com/zliucr/CrossNER)] 274 | - [Embeddings of Label Components for Sequence Labeling: A Case Study of Fine-grained Named Entity Recognition](https://arxiv.org/abs/2006.01372) (ACL2020 SRW) 275 | - [BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision](https://arxiv.org/abs/2006.15509) (KDD2020) [[github](https://github.com/cliang1453/BOND)] 276 | - [Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve](https://arxiv.org/abs/2004.04564) 277 | - [Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language](https://arxiv.org/abs/2004.12440) (ACL2020) 278 | - [To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging](https://arxiv.org/abs/2010.14042) (EMNLP2020) 279 | - [Example-Based Named Entity Recognition](https://arxiv.org/abs/2008.10570) 280 | - [FLERT: Document-Level Features for Named Entity Recognition](https://arxiv.org/abs/2011.06993) 281 | - [Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition](https://arxiv.org/abs/2012.05426) 282 | - [What's in a Name? 
Are BERT Named Entity Representations just as Good for any other Name?](https://arxiv.org/abs/2007.06897) (ACL2020 WS) 283 | - [Interpretable Multi-dataset Evaluation for Named Entity Recognition](https://arxiv.org/abs/2011.06854) (EMNLP2020) [[github](https://github.com/neulab/InterpretEval)] 284 | - [Entity Enhanced BERT Pre-training for Chinese NER](https://aclanthology.org/2020.emnlp-main.518/) (EMNLP2020) 285 | - [Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter](https://arxiv.org/abs/2105.07148) (ACL2021) 286 | - [FLAT: Chinese NER Using Flat-Lattice Transformer](https://aclanthology.org/2020.acl-main.611/) (ACL2020) 287 | - [BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition](https://arxiv.org/abs/2009.09223) 288 | - [MT-BioNER: Multi-task Learning for Biomedical Named Entity Recognition using Deep Bidirectional Transformers](https://arxiv.org/abs/2001.08904) 289 | - [Knowledge Guided Named Entity Recognition for BioMedical Text](https://arxiv.org/abs/1911.03869) 290 | - [Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment](https://arxiv.org/abs/2101.11112) 291 | - [Portuguese Named Entity Recognition using BERT-CRF](https://arxiv.org/abs/1909.10649) 292 | - [Towards Lingua Franca Named Entity Recognition with BERT](https://arxiv.org/abs/1912.01389) 293 | - [Larger-Context Tagging: When and Why Does It Work?](https://arxiv.org/abs/2104.04434) (NAACL2021) 294 | ### Pronoun/coreference resolution 295 | - [A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution](https://arxiv.org/abs/2009.12721) 296 | - [Resolving Gendered Ambiguous Pronouns with BERT](https://arxiv.org/abs/1906.01161) (ACL2019 WS) 297 | - [Anonymized BERT: An Augmentation Approach to the Gendered Pronoun Resolution Challenge](https://arxiv.org/abs/1905.01780) (ACL2019 WS) 298 | - [Gendered Pronoun Resolution using BERT and an extractive question answering formulation](https://arxiv.org/abs/1906.03695) (ACL2019 WS) 299 | - [MSnet: A BERT-based Network for Gendered Pronoun Resolution](https://arxiv.org/abs/1908.00308) (ACL2019 WS) 300 | - [Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation](https://arxiv.org/abs/2006.08881) 301 | - [Fill the GAP: Exploiting BERT for Pronoun Resolution](https://www.aclweb.org/anthology/papers/W/W19/W19-3815/) (ACL2019 WS) 302 | - [On GAP Coreference Resolution Shared Task: Insights from the 3rd Place Solution](https://www.aclweb.org/anthology/W19-3816/) (ACL2019 WS) 303 | - [Look Again at the Syntax: Relational Graph Convolutional Network for Gendered Ambiguous Pronoun Resolution](https://arxiv.org/abs/1905.08868) (ACL2019 WS) 304 | - [Unsupervised Pronoun Resolution via Masked Noun-Phrase Prediction](https://arxiv.org/abs/2105.12392) (ACL2021) 305 | - [BERT Masked Language Modeling for Co-reference Resolution](https://www.aclweb.org/anthology/papers/W/W19/W19-3811/) (ACL2019 WS) 306 | - [Coreference Resolution with Entity Equalization](https://www.aclweb.org/anthology/P19-1066/) (ACL2019) 307 | - [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) (EMNLP2019) [[github](https://github.com/mandarjoshi90/coref)] 308 | - [WikiCREM: A Large Unsupervised Corpus for Coreference Resolution](https://arxiv.org/abs/1908.08025) (EMNLP2019) 309 | - [CD2CR: Co-reference Resolution Across Documents and Domains](https://arxiv.org/abs/2101.12637) (EACL2021) 310 | - [Ellipsis Resolution as 
Question Answering: An Evaluation](https://arxiv.org/abs/1908.11141) (EACL2021) 311 | - [Coreference Resolution as Query-based Span Prediction](https://arxiv.org/abs/1911.01746) 312 | - [Coreferential Reasoning Learning for Language Representation](https://arxiv.org/abs/2004.06870) (EMNLP2020) 313 | - [Revisiting Memory-Efficient Incremental Coreference Resolution](https://arxiv.org/abs/2005.00128) 314 | - [Revealing the Myth of Higher-Order Inference in Coreference Resolution](https://arxiv.org/abs/2009.12013) (EMNLP2020) 315 | - [Coreference Resolution without Span Representations](https://arxiv.org/abs/2101.00434) (ACL2021) 316 | - [Neural Mention Detection](https://arxiv.org/abs/1907.12524) (LREC2020) 317 | - [ZPR2: Joint Zero Pronoun Recovery and Resolution using Multi-Task Learning and BERT](https://www.aclweb.org/anthology/2020.acl-main.482/) (ACL2020) 318 | - [An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution](https://aclanthology.org/2020.coling-main.435/) (COLING2020) 319 | - [BERT-based Cohesion Analysis of Japanese Texts](https://aclanthology.org/2020.coling-main.114/) (COLING2020) 320 | - [Joint Coreference Resolution and Character Linking for Multiparty Conversation](https://arxiv.org/abs/2101.11204) 321 | - [Sequence to Sequence Coreference Resolution](https://www.aclweb.org/anthology/2020.crac-1.5/) (COLING2020 WS) 322 | - [Within-Document Event Coreference with BERT-Based Contextualized Representations](https://arxiv.org/abs/2102.09600) 323 | - [Multi-task Learning Based Neural Bridging Reference Resolution](https://arxiv.org/abs/2003.03666) 324 | - [Bridging Anaphora Resolution as Question Answering](https://arxiv.org/abs/2004.07898) (ACL2020) 325 | - [Fine-grained Information Status Classification Using Discourse Context-Aware BERT](https://arxiv.org/abs/2010.14759) (COLING2020) 326 | ### Word sense disambiguation 327 | - [Language Models and Word Sense Disambiguation: An Overview and Analysis](https://arxiv.org/abs/2008.11608) 328 | - [GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge](https://arxiv.org/abs/1908.07245) (EMNLP2019) 329 | - [Adapting BERT for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences](https://arxiv.org/abs/2009.11795) (EMNLP2020 Findings) 330 | - [Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations](https://arxiv.org/abs/1910.00194) (EMNLP2019) 331 | - [Using BERT for Word Sense Disambiguation](https://arxiv.org/abs/1909.08358) 332 | - [Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation](https://www.aclweb.org/anthology/P19-1569.pdf) (ACL2019) 333 | - [Does BERT Make Any Sense? 
Interpretable Word Sense Disambiguation with Contextualized Embeddings](https://arxiv.org/abs/1909.10430) (KONVENS2019) 334 | - [An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on Bert](https://arxiv.org/abs/2005.01006) 335 | - [PolyLM: Learning about Polysemy through Language Modeling](https://arxiv.org/abs/2101.10448) (EACL2021) 336 | - [CluBERT: A Cluster-Based Approach for Learning Sense Distributions in Multiple Languages](https://www.aclweb.org/anthology/2020.acl-main.369/) (ACL2020) 337 | - [Cross-lingual Word Sense Disambiguation using mBERT Embeddings with Syntactic Dependencies](https://arxiv.org/abs/2012.05300) 338 | - [VCDM: Leveraging Variational Bi-encoding and Deep Contextualized Word Representations for Improved Definition Modeling](https://arxiv.org/abs/2010.03124) (EMNLP2020) 339 | ### Sentiment analysis 340 | - [Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence](https://arxiv.org/abs/1903.09588) (NAACL2019) 341 | - [BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis](https://arxiv.org/abs/1904.02232) (NAACL2019) 342 | - [Exploiting BERT for End-to-End Aspect-based Sentiment Analysis](https://arxiv.org/abs/1910.00883) (EMNLP2019 WS) 343 | - [Improving BERT Performance for Aspect-Based Sentiment Analysis](https://arxiv.org/abs/2010.11731) 344 | - [Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis](https://arxiv.org/abs/2010.07523) 345 | - [Understanding Pre-trained BERT for Aspect-based Sentiment Analysis](https://arxiv.org/abs/2011.00169) (COLING2020) 346 | - [Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa](https://arxiv.org/abs/2104.04986) (NAACL2021) 347 | - [Adapt or Get Left Behind: Domain Adaptation through BERT Language Model Finetuning for Aspect-Target Sentiment Classification](https://arxiv.org/abs/1908.11860) (LREC2020) 348 | - [An Investigation of Transfer Learning-Based Sentiment Analysis in Japanese](https://arxiv.org/abs/1905.09642) (ACL2019) 349 | - ["Mask and Infill" : Applying Masked Language Model to Sentiment Transfer](https://arxiv.org/abs/1908.08039) 350 | - [Adversarial Training for Aspect-Based Sentiment Analysis with BERT](https://arxiv.org/abs/2001.11316) 351 | - [Adversarial and Domain-Aware BERT for Cross-Domain Sentiment Analysis](https://www.aclweb.org/anthology/2020.acl-main.370/) (ACL2020) 352 | - [Utilizing BERT Intermediate Layers for Aspect Based Sentiment Analysis and Natural Language Inference](https://arxiv.org/abs/2002.04815) 353 | - [DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis](https://arxiv.org/abs/2004.13816) 354 | - [YASO: A New Benchmark for Targeted Sentiment Analysis](https://arxiv.org/abs/2012.14541) 355 | - [SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics](https://arxiv.org/abs/2005.04114) (ACL2020) 356 | ### Relation extraction 357 | - [Matching the Blanks: Distributional Similarity for Relation Learning](https://arxiv.org/abs/1906.03158) (ACL2019) 358 | - [BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction](https://arxiv.org/abs/1908.05908) (NLPCC2019) 359 | - [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://arxiv.org/abs/1905.08284) 360 | - [Span-based Joint Entity and Relation Extraction with Transformer Pre-training](https://arxiv.org/abs/1909.07755) 361 | - [Fine-tune Bert for DocRED with Two-step 
Process](https://arxiv.org/abs/1909.11898) 362 | - [Relation Extraction as Two-way Span-Prediction](https://arxiv.org/abs/2010.04829) 363 | - [Entity, Relation, and Event Extraction with Contextualized Span Representations](https://arxiv.org/abs/1909.03546) (EMNLP2019) 364 | - [Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text](https://arxiv.org/abs/1908.07721) 365 | - [Downstream Model Design of Pre-trained Language Model for Relation Extraction Task](https://arxiv.org/abs/2004.03786) 366 | - [Efficient long-distance relation extraction with DG-SpanBERT](https://arxiv.org/abs/2004.03636) 367 | - [Global-to-Local Neural Networks for Document-Level Relation Extraction](https://arxiv.org/abs/2009.10359) (EMNLP2020) 368 | - [DARE: Data Augmented Relation Extraction with GPT-2](https://arxiv.org/abs/2004.13845) 369 | - [Distantly-Supervised Neural Relation Extraction with Side Information using BERT](https://arxiv.org/abs/2004.14443) (IJCNN2020) 370 | - [Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings](https://arxiv.org/abs/2102.01156) 371 | - [An End-to-end Model for Entity-level Relation Extraction using Multi-instance Learning](https://arxiv.org/abs/2102.05980) (EACL2021) 372 | - [ZS-BERT: Towards Zero-Shot Relation Extraction with Attribute Representation Learning](https://arxiv.org/abs/2104.04697) (NAACL2021) [[github](https://github.com/dinobby/ZS-BERT)] 373 | - [AdaPrompt: Adaptive Prompt-based Finetuning for Relation Extraction](https://arxiv.org/abs/2104.07650) 374 | - [Dialogue-Based Relation Extraction](https://arxiv.org/abs/2004.08056) (ACL2020) 375 | - [An Embarrassingly Simple Model for Dialogue Relation Extraction](https://arxiv.org/abs/2012.13873) 376 | - [A Novel Cascade Binary Tagging Framework for Relational Triple Extraction](https://arxiv.org/abs/1909.03227) (ACL2020) [[github](https://github.com/weizhepei/CasRel)] 377 | - [ExpBERT: Representation Engineering with Natural Language Explanations](https://arxiv.org/abs/2005.01932) (ACL2020) [[github](https://github.com/MurtyShikhar/ExpBERT)] 378 | - [AutoRC: Improving BERT Based Relation Classification Models via Architecture Search](https://arxiv.org/abs/2009.10680) 379 | - [Investigation of BERT Model on Biomedical Relation Extraction Based on Revised Fine-tuning Mechanism](https://arxiv.org/abs/2011.00398) 380 | - [Experiments on transfer learning architectures for biomedical relation extraction](https://arxiv.org/abs/2011.12380) 381 | - [Improving BERT Model Using Contrastive Learning for Biomedical Relation Extraction](https://arxiv.org/abs/2104.13913) (BioNLP2021) 382 | - [Cross-Lingual Relation Extraction with Transformers](https://arxiv.org/abs/2010.08652) 383 | - [Improving Scholarly Knowledge Representation: Evaluating BERT-based Models for Scientific Relation Classification](https://arxiv.org/abs/2004.06153) 384 | - [Robustly Pre-trained Neural Model for Direct Temporal Relation Extraction](https://arxiv.org/abs/2004.06216) 385 | - [A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction](https://www.aclweb.org/anthology/2020.bionlp-1.7/) (ACL2020 WS) 386 | - [Exploring Contextualized Neural Language Models for Temporal Dependency Parsing](https://arxiv.org/abs/2004.14577) 387 | - [Temporal Reasoning on Implicit Events from Distant Supervision](https://arxiv.org/abs/2010.12753) 388 | - [IMoJIE: Iterative Memory-Based Joint Open Information Extraction](https://arxiv.org/abs/2005.08178) (ACL2020) 389 | - 
[OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction](https://arxiv.org/abs/2010.03147) (EMNLP2020) [[github](https://github.com/dair-iitd/openie6)] 390 | - [Multi2OIE: Multilingual Open Information Extraction Based on Multi-Head Attention with BERT](https://arxiv.org/abs/2009.08128) (EMNLP2020 Findings) 391 | ### Knowledge base 392 | - [KG-BERT: BERT for Knowledge Graph Completion](https://arxiv.org/abs/1909.03193) 393 | - [How Context Affects Language Models' Factual Predictions](https://openreview.net/forum?id=025X0zPfn) (AKBC2020) 394 | - [Inducing Relational Knowledge from BERT](https://arxiv.org/abs/1911.12753) (AAAI2020) 395 | - [Latent Relation Language Models](https://arxiv.org/abs/1908.07690) (AAAI2020) 396 | - [Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model](https://openreview.net/forum?id=BJlzm64tDH) (ICLR2020) 397 | - [Scalable Zero-shot Entity Linking with Dense Entity Retrieval](https://arxiv.org/abs/1911.03814) (EMNLP2020) [[github](https://github.com/facebookresearch/BLINK)] 398 | - [Zero-shot Entity Linking with Efficient Long Range Sequence Modeling](https://arxiv.org/abs/2010.06065) (EMNLP2020 Findings) 399 | - [Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking](https://www.aclweb.org/anthology/K19-1063/) (CoNLL2019) 400 | - [Improving Entity Linking by Modeling Latent Entity Type Information](https://arxiv.org/abs/2001.01447) (AAAI2020) 401 | - [Global Entity Disambiguation with Pretrained Contextualized Embeddings of Words and Entities](https://arxiv.org/abs/1909.00426) 402 | - [YELM: End-to-End Contextualized Entity Linking](https://arxiv.org/abs/1911.03834) 403 | - [Empirical Evaluation of Pretraining Strategies for Supervised Entity Linking](https://arxiv.org/abs/2005.14253) (AKBC2020) 404 | - [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) (EMNLP2020) [[github](https://github.com/studio-ousia/luke)] 405 | - [Linking Entities to Unseen Knowledge Bases with Arbitrary Schemas](https://arxiv.org/abs/2010.11333) 406 | - [CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata](https://arxiv.org/abs/2101.09969) (EACL2021) 407 | - [PEL-BERT: A Joint Model for Protocol Entity Linking](https://arxiv.org/abs/2002.00744) 408 | - [End-to-end Biomedical Entity Linking with Span-based Dictionary Matching](https://arxiv.org/abs/2104.10493) 409 | - [Efficient One-Pass End-to-End Entity Linking for Questions](https://arxiv.org/abs/2010.02413) (EMNLP2020) [[github](https://github.com/belindal/BLINK/tree/master/elq)] 410 | - [Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking](https://arxiv.org/abs/2010.09828) 411 | - [Entity Linking in 100 Languages](https://arxiv.org/abs/2011.02690) (EMNLP2020) [[github](https://github.com/google-research/google-research/tree/master/dense_representations_for_entity_retrieval/mel)] 412 | - [COMETA: A Corpus for Medical Entity Linking in the Social Media](https://arxiv.org/abs/2010.03295) (EMNLP2020) [[github](https://github.com/cambridgeltl/cometa)] 413 | - [How Can We Know What Language Models Know?](https://arxiv.org/abs/1911.12543) (TACL2020) [[github](https://github.com/jzbjyb/LPAQA)] 414 | - [How to Query Language Models?](https://arxiv.org/abs/2108.01928) 415 | - [Deep Entity Matching with Pre-Trained Language Models](https://arxiv.org/abs/2004.00584) 416 | - [Ultra-Fine Entity Typing with Weak Supervision from a Masked Language 
Model](https://arxiv.org/abs/2106.04098) (ACL2021) 417 | - [Constructing Taxonomies from Pretrained Language Models](https://arxiv.org/abs/2010.12813) (NAACL2021) 418 | - [Language Models are Open Knowledge Graphs](https://arxiv.org/abs/2010.11967) 419 | - [Can Generative Pre-trained Language Models Serve as Knowledge Bases for Closed-book QA?](https://arxiv.org/abs/2106.01561) (ACL2021) 420 | - [DualTKB: A Dual Learning Bridge between Text and Knowledge Base](https://arxiv.org/abs/2010.14660) (EMNLP2020) [[github](https://github.com/IBM/dualtkb)] 421 | - [Zero-shot Slot Filling with DPR and RAG](https://arxiv.org/abs/2104.08610) 422 | - [How to Avoid Being Eaten by a Grue: Structured Exploration Strategies for Textual Worlds](https://arxiv.org/abs/2006.07409) [[github](https://github.com/rajammanabrolu/Q-BERT)] 423 | - [MLMLM: Link Prediction with Mean Likelihood Masked Language Model](https://arxiv.org/abs/2009.07058) 424 | - [Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases](https://arxiv.org/abs/2011.07743) 425 | ### Text classification 426 | - [Deep Learning Based Text Classification: A Comprehensive Review](https://arxiv.org/abs/2004.03705) 427 | - [A Text Classification Survey: From Shallow to Deep Learning](https://arxiv.org/abs/2008.00364) 428 | - [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/abs/1905.05583) 429 | - [X-BERT: eXtreme Multi-label Text Classification with BERT](https://arxiv.org/abs/1905.02331) 430 | - [An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels](https://arxiv.org/abs/2010.01653) (EMNLP2020) 431 | - [Taming Pretrained Transformers for Extreme Multi-label Text Classification](https://arxiv.org/abs/1905.02331) (KDD2020) 432 | - [Layer-wise Guided Training for BERT: Learning Incrementally Refined Document Representations](https://arxiv.org/abs/2010.05763) (EMNLP2020 WS) 433 | - [DocBERT: BERT for Document Classification](https://arxiv.org/abs/1904.08398) 434 | - [Enriching BERT with Knowledge Graph Embeddings for Document Classification](https://arxiv.org/abs/1909.08402) 435 | - [Classification and Clustering of Arguments with Contextualized Word Embeddings](https://arxiv.org/abs/1906.09821) (ACL2019) 436 | - [BERT for Evidence Retrieval and Claim Verification](https://arxiv.org/abs/1910.02655) 437 | - [Stacked DeBERT: All Attention in Incomplete Data for Text Classification](https://arxiv.org/abs/2001.00137) 438 | - [Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data](https://arxiv.org/abs/2003.11563) 439 | - [BAE: BERT-based Adversarial Examples for Text Classification](https://arxiv.org/abs/2004.01970) (EMNLP2020) 440 | - [FireBERT: Hardening BERT-based classifiers against adversarial attack](https://arxiv.org/abs/2008.04203) [[github](https://github.com/FireBERT-author/FireBERT)] 441 | - [GAN-BERT: Generative Adversarial Learning for Robust Text Classification with a Bunch of Labeled Examples](https://www.aclweb.org/anthology/2020.acl-main.191/) (ACL2020) 442 | - [Description Based Text Classification with Reinforcement Learning](https://arxiv.org/abs/2002.03067) 443 | - [VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification](https://arxiv.org/abs/2004.05707) 444 | - [Zero-shot Text Classification via Reinforced Self-training](https://www.aclweb.org/anthology/2020.acl-main.272/) (ACL2020) 445 | - [On Data Augmentation for Extreme Multi-label Classification](https://arxiv.org/abs/2009.10778) 446 | - 
[Noisy Channel Language Model Prompting for Few-Shot Text Classification](https://arxiv.org/abs/2108.04106) 447 | - [Improving Pretrained Models for Zero-shot Multi-label Text Classification through Reinforced Label Hierarchy Reasoning](https://arxiv.org/abs/2104.01666) (NAACL2021) 448 | - [Towards Evaluating the Robustness of Chinese BERT Classifiers](https://arxiv.org/abs/2004.03742) 449 | - [COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter](https://arxiv.org/abs/2005.07503) [[github](https://github.com/digitalepidemiologylab/covid-twitter-bert)] 450 | - [Large Scale Legal Text Classification Using Transformer Models](https://arxiv.org/abs/2010.12871) 451 | - [BBAEG: Towards BERT-based Biomedical Adversarial Example Generation for Text Classification](https://arxiv.org/abs/2104.01782) (NAACL2021) 452 | - [A Comparison of LSTM and BERT for Small Corpus](https://arxiv.org/abs/2009.05451) 453 | ### WSC, WNLI, NLI 454 | - [Exploring Unsupervised Pretraining and Sentence Structure Modelling for Winograd Schema Challenge](https://arxiv.org/abs/1904.09705) 455 | - [A Surprisingly Robust Trick for the Winograd Schema Challenge](https://arxiv.org/abs/1905.06290) 456 | - [WinoGrande: An Adversarial Winograd Schema Challenge at Scale](https://arxiv.org/abs/1907.10641) (AAAI2020) 457 | - [TTTTTackling WinoGrande Schemas](https://arxiv.org/abs/2003.08380) 458 | - [WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge](https://arxiv.org/abs/2005.05763) (ACL2020) 459 | - [The Sensitivity of Language Models and Humans to Winograd Schema Perturbations](https://arxiv.org/abs/2005.01348) (ACL2020) 460 | - [Precise Task Formalization Matters in Winograd Schema Evaluations](https://arxiv.org/abs/2010.04043) (EMNLP2020) 461 | - [Tackling Domain-Specific Winograd Schemas with Knowledge-Based Reasoning and Machine Learning](https://arxiv.org/abs/2011.12081) 462 | - [A Review of Winograd Schema Challenge Datasets and Approaches](https://arxiv.org/abs/2004.13831) 463 | - [Improving Natural Language Inference with a Pretrained Parser](https://arxiv.org/abs/1909.08217) 464 | - [Are Natural Language Inference Models IMPPRESsive? 
Learning IMPlicature and PRESupposition](https://arxiv.org/abs/2004.03066) 465 | - [DocNLI: A Large-scale Dataset for Document-level Natural Language Inference](https://arxiv.org/abs/2106.09449) (ACL2021 Findings) 466 | - [Adversarial NLI: A New Benchmark for Natural Language Understanding](https://arxiv.org/abs/1910.14599) 467 | - [Adversarial Analysis of Natural Language Inference Systems](https://arxiv.org/abs/1912.03441) (ICSC2020) 468 | - [ANLIzing the Adversarial Natural Language Inference Dataset](https://arxiv.org/abs/2010.12729) 469 | - [Syntactic Data Augmentation Increases Robustness to Inference Heuristics](https://arxiv.org/abs/2004.11999) (ACL2020) 470 | - [Linguistically-Informed Transformations (LIT): A Method for Automatically Generating Contrast Sets](https://arxiv.org/abs/2010.08580) (EMNLP2020 WS) [[github](https://github.com/leo-liuzy/LIT_auto-gen-contrast-set)] 471 | - [HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference](https://arxiv.org/abs/2003.02756) (LREC2020) 472 | - [Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages](https://arxiv.org/abs/2004.14963) (EMNLP2020) [[github](https://github.com/boun-tabi/NLI-TR)] 473 | - [FarsTail: A Persian Natural Language Inference Dataset](https://arxiv.org/abs/2009.08820) 474 | - [Evaluating BERT for natural language inference: A case study on the CommitmentBank](https://www.aclweb.org/anthology/D19-1630/) (EMNLP2019) 475 | - [Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language?](https://arxiv.org/abs/2004.14839) (ACL2020) 476 | - [Abductive Commonsense Reasoning](https://arxiv.org/abs/1908.05739) (ICLR2020) 477 | - [Entailment as Few-Shot Learner](https://arxiv.org/abs/2104.14690) 478 | - [Collecting Entailment Data for Pretraining: New Protocols and Negative Results](https://arxiv.org/abs/2004.11997) 479 | - [WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation](https://arxiv.org/abs/2201.05955) (EMNLP2022 Findings) [[github](https://github.com/alisawuffles/wanli)] 480 | - [Mining Knowledge for Natural Language Inference from Wikipedia Categories](https://arxiv.org/abs/2010.01239) (EMNLP2020 Findings) 481 | ### Commonsense 482 | - [CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge](https://arxiv.org/abs/1811.00937) (NAACL2019) 483 | - [Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention](https://arxiv.org/abs/2112.03254) 484 | - [HellaSwag: Can a Machine Really Finish Your Sentence?](https://arxiv.org/abs/1905.07830) (ACL2019) [[website](https://rowanzellers.com/hellaswag/)] 485 | - [A Method for Building a Commonsense Inference Dataset Based on Basic Events](https://www.aclweb.org/anthology/2020.emnlp-main.192/) (EMNLP2020) [[website](http://nlp.ist.i.kyoto-u.ac.jp/EN/?KUCI)] 486 | - [Story Ending Prediction by Transferable BERT](https://arxiv.org/abs/1905.07504) (IJCAI2019) 487 | - [Explain Yourself! 
Leveraging Language Models for Commonsense Reasoning](https://arxiv.org/abs/1906.02361) (ACL2019) 488 | - [Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning](https://arxiv.org/abs/2004.14074) (ACL2020) 489 | - [Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models](https://arxiv.org/abs/1908.06725) 490 | - [Informing Unsupervised Pretraining with External Linguistic Knowledge](https://arxiv.org/abs/1909.02339) 491 | - [Commonsense Knowledge + BERT for Level 2 Reading Comprehension Ability Test](https://arxiv.org/abs/1909.03415) 492 | - [BIG MOOD: Relating Transformers to Explicit Commonsense Knowledge](https://arxiv.org/abs/1910.07713) 493 | - [Commonsense Knowledge Mining from Pretrained Models](https://arxiv.org/abs/1909.00505) (EMNLP2019) 494 | - [KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning](https://arxiv.org/abs/1909.02151) (EMNLP2019) 495 | - [Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep Contextual Representations](https://www.aclweb.org/anthology/D19-6001/) (EMNLP2019 WS) 496 | - [Do Massively Pretrained Language Models Make Better Storytellers?](https://arxiv.org/abs/1909.10705) (CoNLL2019) 497 | - [PIQA: Reasoning about Physical Commonsense in Natural Language](https://arxiv.org/abs/1911.11641) (AAAI2020) 498 | - [Evaluating Commonsense in Pre-trained Language Models](https://arxiv.org/abs/1911.11931) (AAAI2020) 499 | - [Why Do Masked Neural Language Models Still Need Common Sense Knowledge?](https://arxiv.org/abs/1911.03024) 500 | - [Does BERT Solve Commonsense Task via Commonsense Knowledge?](https://arxiv.org/abs/2008.03945) 501 | - [Unsupervised Commonsense Question Answering with Self-Talk](https://arxiv.org/abs/2004.05483) (EMNLP2020) 502 | - [Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering](https://arxiv.org/abs/2011.03863) (AAAI2021) 503 | - [G-DAUG: Generative Data Augmentation for Commonsense Reasoning](https://arxiv.org/abs/2004.11546) 504 | - [Contrastive Self-Supervised Learning for Commonsense Reasoning](https://arxiv.org/abs/2005.00669) (ACL2020) 505 | - [Differentiable Open-Ended Commonsense Reasoning](https://arxiv.org/abs/2010.14439) 506 | - [Adversarial Training for Commonsense Inference](https://arxiv.org/abs/2005.08156) (ACL2020 WS) 507 | - [Do Fine-tuned Commonsense Language Models Really Generalize?](https://arxiv.org/abs/2011.09159) 508 | - [Do Language Models Perform Generalizable Commonsense Inference?](https://arxiv.org/abs/2106.11533) (ACL2021 Findings) 509 | - [Improving Zero Shot Learning Baselines with Commonsense Knowledge](https://arxiv.org/abs/2012.06236) 510 | - [XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning](https://ducdauge.github.io/files/xcopa.pdf) [[github](https://github.com/cambridgeltl/xcopa)] 511 | - [Do Neural Language Representations Learn Physical Commonsense?](https://arxiv.org/abs/1908.02899) (CogSci2019) 512 | ### Extractive summarization 513 | - [HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization](https://arxiv.org/abs/1905.06566) (ACL2019) 514 | - [Deleter: Leveraging BERT to Perform Unsupervised Successive Text Compression](https://arxiv.org/abs/1909.03223) 515 | - [Discourse-Aware Neural Extractive Text Summarization](https://arxiv.org/abs/1910.14142) (ACL2020) [[github](https://github.com/jiacheng-xu/DiscoBERT)] 516 | - [AREDSUM: Adaptive 
Redundancy-Aware Iterative Sentence Ranking for Extractive Document Summarization](https://arxiv.org/abs/2004.06176) 517 | - [Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT](https://arxiv.org/abs/2011.09739) (COLING2020) 518 | - [Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help !](https://arxiv.org/abs/2012.02144) (EMNLP2020 WS) 519 | - [Multi-Document Summarization with Determinantal Point Processes and Contextualized Representations](https://arxiv.org/abs/1910.11411) (EMNLP2019 WS) 520 | - [Continual BERT: Continual Learning for Adaptive Extractive Summarization of COVID-19 Literature](https://arxiv.org/abs/2007.03405) 521 | ### Grammatical error correction 522 | - [Multi-headed Architecture Based on BERT for Grammatical Errors Correction](https://www.aclweb.org/anthology/papers/W/W19/W19-4426/) (ACL2019 WS) 523 | - [Towards Minimal Supervision BERT-based Grammar Error Correction](https://arxiv.org/abs/2001.03521) 524 | - [Learning to combine Grammatical Error Corrections](https://arxiv.org/abs/1906.03897) (EMNLP2019 WS) 525 | - [LM-Critic: Language Models for Unsupervised Grammatical Error Correction](https://arxiv.org/abs/2109.06822) (EMNLP2021) [[github](https://github.com/michiyasunaga/LM-Critic)] 526 | - [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/abs/2005.00987) (ACL2020) 527 | - [Chinese Grammatical Correction Using BERT-based Pre-trained Model](https://arxiv.org/abs/2011.02093) (AACL-IJCNLP2020) 528 | - [Spelling Error Correction with Soft-Masked BERT](https://arxiv.org/abs/2005.07421) (ACL2020) 529 | ### IR 530 | - [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) [[github](https://github.com/UKPLab/beir)] 531 | - [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467) 532 | - [Passage Re-ranking with BERT](https://arxiv.org/abs/1901.04085) 533 | - [Investigating the Successes and Failures of BERT for Passage Re-Ranking](https://arxiv.org/abs/1905.01758) 534 | - [Understanding the Behaviors of BERT in Ranking](https://arxiv.org/abs/1904.07531) 535 | - [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375) 536 | - [Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval](https://arxiv.org/abs/2105.03599) (ACL2021) 537 | - [CEDR: Contextualized Embeddings for Document Ranking](https://arxiv.org/abs/1904.07094) (SIGIR2019) 538 | - [Deeper Text Understanding for IR with Contextual Neural Language Modeling](https://arxiv.org/abs/1905.09217) (SIGIR2019) 539 | - [FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance](https://arxiv.org/abs/1905.02851) (SIGIR2019) 540 | - [An Analysis of BERT FAQ Retrieval Models for COVID-19 Infobot](https://openreview.net/forum?id=dGOeF3y_Weh) 541 | - [COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval](https://arxiv.org/abs/2010.12800) 542 | - [Unsupervised FAQ Retrieval with Question Generation and BERT](https://www.aclweb.org/anthology/2020.acl-main.74/) (ACL2020) 543 | - [Multi-Stage Document Ranking with BERT](https://arxiv.org/abs/1910.14424) 544 | - [Learning-to-Rank with BERT in TF-Ranking](https://arxiv.org/abs/2004.08476) 545 | - [Transformer-Based Language Models for Similar Text Retrieval and Ranking](https://arxiv.org/abs/2005.04588) 546 | - [DeText: A Deep Text Ranking 
Framework with BERT](https://arxiv.org/abs/2008.02460) 547 | - [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832) (SIGIR2020) 548 | - [RepBERT: Contextualized Text Embeddings for First-Stage Retrieval](https://arxiv.org/abs/2006.15498) [[github](https://github.com/jingtaozhan/RepBERT-Index)] 549 | - [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/abs/2007.00808) 550 | - [Multi-Perspective Semantic Information Retrieval](https://arxiv.org/abs/2009.01938) 551 | - [CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos](https://arxiv.org/abs/2204.00716) (SIGIR2022) 552 | - [Expansion via Prediction of Importance with Contextualization](https://arxiv.org/abs/2004.14245) (SIGIR2020) 553 | - [BERT-QE: Contextualized Query Expansion for Document Re-ranking](https://arxiv.org/abs/2009.07258) (EMNLP2020 Findings) 554 | - [Beyond \[CLS\] through Ranking by Generation](https://arxiv.org/abs/2010.03073) (EMNLP2020) 555 | - [Efficient Document Re-Ranking for Transformers by Precomputing Term Representations](https://arxiv.org/abs/2004.14255) (SIGIR2020) 556 | - [Training Curricula for Open Domain Answer Re-Ranking](https://arxiv.org/abs/2004.14269) (SIGIR2020) 557 | - [Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling](https://arxiv.org/abs/2104.06967) (SIGIR2021) 558 | - [Boosted Dense Retriever](https://arxiv.org/abs/2112.07771) 559 | - [ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval](https://arxiv.org/abs/2205.09153) 560 | - [Document Ranking with a Pretrained Sequence-to-Sequence Model](https://arxiv.org/abs/2003.06713) 561 | - [A Neural Corpus Indexer for Document Retrieval](https://arxiv.org/abs/2206.02743) 562 | - [COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List](https://arxiv.org/abs/2104.07186) (NAACL2021) 563 | - [Guided Transformer: Leveraging Multiple External Sources for Representation Learning in Conversational Search](https://arxiv.org/abs/2006.07548) (SIGIR2020) 564 | - [Fine-tune BERT for E-commerce Non-Default Search Ranking](https://arxiv.org/abs/2008.09689) 565 | - [IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles](https://arxiv.org/abs/2007.12603) 566 | - [ProphetNet-Ads: A Looking Ahead Strategy for Generative Retrieval Models in Sponsored Search Engine](https://arxiv.org/abs/2010.10789) (NLPCC2020) 567 | - [Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned](https://openreview.net/forum?id=PlUA_mgGaPq) (ACL2020 WS) 568 | - [SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search](https://arxiv.org/abs/2010.05987) (EMNLP2020) 569 | - [Neural Duplicate Question Detection without Labeled Training Data](https://www.aclweb.org/anthology/D19-1171/) (EMNLP2019) 570 | - [Cross-Domain Generalization Through Memorization: A Study of Nearest Neighbors in Neural Duplicate Question Detection](https://arxiv.org/abs/2011.11090) 571 | - [Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs](https://arxiv.org/abs/2008.13546) 572 | - [Cross-lingual Information Retrieval with BERT](https://arxiv.org/abs/2004.13005) 573 | - [Cross-lingual Retrieval for Iterative Self-Supervised Training](https://arxiv.org/abs/2006.09526) (NeurIPS2020) 574 | - 
[Graph-based Multilingual Product Retrieval in E-Commerce Search](https://aclanthology.org/2021.naacl-industry.19/) (NAACL2021 Industry) 575 | - [Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning](https://arxiv.org/abs/1912.13080) (ECIR2020) 576 | - [PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval](https://arxiv.org/abs/2010.10137) (WSDM2021) 577 | - [B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval](https://arxiv.org/abs/2104.09791) (SIGIR2021) 578 | - [Condenser: a Pre-training Architecture for Dense Retrieval](https://arxiv.org/abs/2104.08253) (EMNLP2021) 579 | - [Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation](https://arxiv.org/abs/2203.07735) (ACL2022) 580 | - [Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval](https://arxiv.org/abs/2108.08787) (EMNLP2021 WS) [[github](https://github.com/castorini/mr.tydi)] 581 | ## Generation 582 | - [Pretrained Language Models for Text Generation: A Survey](https://arxiv.org/abs/2105.10311) (IJCAI2021 Survey Track) 583 | - [A Survey of Pretrained Language Models Based Text Generation](https://arxiv.org/abs/2201.05273) 584 | - [GLGE: A New General Language Generation Evaluation Benchmark](https://arxiv.org/abs/2011.11928) [[github](https://github.com/microsoft/glge)] 585 | - [BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model](https://arxiv.org/abs/1902.04094) (NAACL2019 WS) 586 | - [Pretraining-Based Natural Language Generation for Text Summarization](https://arxiv.org/abs/1902.09243) 587 | - [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) (EMNLP2019) [[github (original)](https://github.com/nlpyang/PreSumm)] [[github (huggingface)](https://github.com/huggingface/transformers/tree/master/examples/summarization)] 588 | - [Multi-stage Pretraining for Abstractive Summarization](https://arxiv.org/abs/1909.10599) 589 | - [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 590 | - [Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models](https://arxiv.org/abs/2003.13028) 591 | - [GSum: A General Framework for Guided Neural Abstractive Summarization](https://arxiv.org/abs/2010.08014) (NAACL2021) [[github](https://github.com/neulab/guided_summarization)] 592 | - [STEP: Sequence-to-Sequence Transformer Pre-training for Document Summarization](https://arxiv.org/abs/2004.01853) 593 | - [TLDR: Extreme Summarization of Scientific Documents](https://arxiv.org/abs/2004.15011) [[github](https://github.com/allenai/scitldr)] 594 | - [Product Title Generation for Conversational Systems using BERT](https://arxiv.org/abs/2007.11768) 595 | - [WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization](https://arxiv.org/abs/2011.01421) (COLING2020) 596 | - [Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation](https://arxiv.org/abs/2010.12723) 597 | - [Abstractive Query Focused Summarization with Query-Free Resources](https://arxiv.org/abs/2012.14774) 598 | - [Abstractive Summarization of Spoken and Written Instructions with BERT](https://arxiv.org/abs/2008.09676) 599 | - [Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization](https://arxiv.org/abs/2105.12544) (ACL2021) 600 | - [Coreference-Aware Dialogue 
Summarization](https://arxiv.org/abs/2106.08556) (SIGDIAL2021) 601 | - [XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages](https://arxiv.org/abs/2106.13822) (ACL2021 Findings) [[github](https://github.com/csebuetnlp/xl-sum)] 602 | - [BERT Fine-tuning For Arabic Text Summarization](https://arxiv.org/abs/2004.14135) (ICLR2020 WS) 603 | - [Automatic Text Summarization of COVID-19 Medical Research Articles using BERT and GPT-2](https://arxiv.org/abs/2006.01997) 604 | - [Mixed-Lingual Pre-training for Cross-lingual Summarization](https://arxiv.org/abs/2010.08892) (AACL-IJCNLP2020) 605 | - [PoinT-5: Pointer Network and T-5 based Financial Narrative Summarisation](https://arxiv.org/abs/2010.04191) (COLING2020 WS) 606 | - [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://arxiv.org/abs/1905.02450) (ICML2019) [[github](https://github.com/microsoft/MASS)] [[github](https://github.com/microsoft/MASS/tree/master/MASS-fairseq)] 607 | - [JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation](https://arxiv.org/abs/2005.03361) (LREC2020) 608 | - [Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/abs/1905.03197) (NeurIPS2019) [[github](https://github.com/microsoft/unilm)] 609 | - [UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training](https://arxiv.org/abs/2002.12804) [[github](https://github.com/microsoft/unilm)] 610 | - [Dual Inference for Improving Language Understanding and Generation](https://arxiv.org/abs/2010.04246) (EMNLP2020 Findings) 611 | - [All NLP Tasks Are Generation Tasks: A General Pretraining Framework](https://arxiv.org/abs/2103.10360) 612 | - [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) (EMNLP2020 Findings) [[github](https://github.com/microsoft/ProphetNet)] 613 | - [ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation](https://arxiv.org/abs/2104.08006) 614 | - [Towards Making the Most of BERT in Neural Machine Translation](https://arxiv.org/abs/1908.05672) 615 | - [Improving Neural Machine Translation with Pre-trained Representation](https://arxiv.org/abs/1908.07688) 616 | - [BERT, mBERT, or BiBERT? 
A Study on Contextualized Embeddings for Neural Machine Translation](https://arxiv.org/abs/2109.04588) (EMNLP2021) 617 | - [On the use of BERT for Neural Machine Translation](https://arxiv.org/abs/1909.12744) (EMNLP2019 WS) 618 | - [Incorporating BERT into Neural Machine Translation](https://openreview.net/forum?id=Hyl7ygStwB) (ICLR2020) 619 | - [Recycling a Pre-trained BERT Encoder for Neural Machine Translation](https://www.aclweb.org/anthology/D19-5603/) 620 | - [Exploring Unsupervised Pretraining Objectives for Machine Translation](https://arxiv.org/abs/2106.05634) (ACL2021 Findings) 621 | - [Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT](https://arxiv.org/abs/2009.07610) (EMNLP2020) 622 | - [Language Models are Good Translators](https://arxiv.org/abs/2106.13627) 623 | - [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 624 | - [Mask-Predict: Parallel Decoding of Conditional Masked Language Models](https://arxiv.org/abs/1904.09324) (EMNLP2019) 625 | - [PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation](https://arxiv.org/abs/2004.07159) (EMNLP2020) 626 | - [ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation](https://arxiv.org/abs/2001.11314) 627 | - [Non-Autoregressive Text Generation with Pre-trained Language Models](https://arxiv.org/abs/2102.08220) (EACL2021) 628 | - [Cross-Lingual Natural Language Generation via Pre-Training](https://arxiv.org/abs/1909.10481) (AAAI2020) [[github](https://github.com/CZWin32768/XNLG)] 629 | - [PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable](https://arxiv.org/abs/1910.07931) (ACL2020) 630 | - [A Tailored Pre-Training Model for Task-Oriented Dialog Generation](https://arxiv.org/abs/2004.13835) 631 | - [Pretrained Language Models for Dialogue Generation with Multiple Input Sources](https://arxiv.org/abs/2010.07576) (EMNLP2020 Findings) 632 | - [Knowledge-Grounded Dialogue Generation with Pre-trained Language Models](https://arxiv.org/abs/2010.08824) (EMNLP2020) 633 | - [Are Pre-trained Language Models Knowledgeable to Ground Open Domain Dialogues?](https://arxiv.org/abs/2011.09708) 634 | - [Open-Domain Dialogue Generation Based on Pre-trained Language Models](https://arxiv.org/abs/2010.12780) 635 | - [LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239) 636 | - [Retrieval-Augmented Transformer-XL for Close-Domain Dialog Generation](https://arxiv.org/abs/2105.09235) 637 | - [Internet-Augmented Dialogue Generation](https://arxiv.org/abs/2107.07566) 638 | - [DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances](https://arxiv.org/abs/2012.01775) (AAAI2021) 639 | - [CG-BERT: Conditional Text Generation with BERT for Generalized Few-shot Intent Detection](https://arxiv.org/abs/2004.01881) 640 | - [QURIOUS: Question Generation Pretraining for Text Generation](https://arxiv.org/abs/2004.11026) 641 | - [Few-Shot NLG with Pre-Trained Language Model](https://arxiv.org/abs/1904.09521) (ACL2020) 642 | - [Text-to-Text Pre-Training for Data-to-Text Tasks](https://arxiv.org/abs/2005.10433) 643 | - [KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation](https://arxiv.org/abs/2010.02307) (EMNLP2020) 644 | - [Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference](https://arxiv.org/abs/2011.10819) (INLG2020) 645 | - [Large Scale Knowledge 
Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training](https://arxiv.org/abs/2010.12688) 646 | - [Structure-Grounded Pretraining for Text-to-SQL](https://arxiv.org/abs/2010.12773) 647 | - [Data Agnostic RoBERTa-based Natural Language to SQL Query Generation](https://arxiv.org/abs/2010.05243) 648 | - [ToTTo: A Controlled Table-To-Text Generation Dataset](https://arxiv.org/abs/2004.14373) (EMNLP2020) [[github](https://github.com/google-research-datasets/ToTTo)] 649 | - [Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning](https://arxiv.org/abs/2012.10033) (AAAI2021 WS) 650 | - [A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation](https://arxiv.org/abs/2001.05139) (TACL2020) [[github](https://github.com/JianGuanTHU/CommonsenseStoryGen)] 651 | - [MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models](https://arxiv.org/abs/2010.00840) (EMNLP2020) 652 | - [Facts2Story: Controlling Text Generation by Key Facts](https://arxiv.org/abs/2012.04332) 653 | - [CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning](https://arxiv.org/abs/1911.03705) [[github](https://github.com/INK-USC/CommonGen)] [[website](https://inklab.usc.edu/CommonGen/)] (EMNLP2020 Findings) 654 | - [An Enhanced Knowledge Injection Model for Commonsense Generation](https://arxiv.org/abs/2012.00366) (COLING2020) 655 | - [Retrieval Enhanced Model for Commonsense Generation](https://arxiv.org/abs/2105.11174) (ACL2021 Findings) 656 | - [Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection](https://arxiv.org/abs/2012.10813) (AAAI2021WS) 657 | - [Pre-training Text-to-Text Transformers for Concept-centric Common Sense](https://arxiv.org/abs/2011.07956) 658 | - [Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph](https://arxiv.org/abs/2009.11692) (EMNLP2020) 659 | - [KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning](https://arxiv.org/abs/2009.12677) 660 | - [Autoregressive Entity Retrieval](https://arxiv.org/abs/2010.00904) (ICLR2021) [[github](https://github.com/facebookresearch/GENRE)] 661 | - [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528) 662 | - [EIGEN: Event Influence GENeration using Pre-trained Language Models](https://arxiv.org/abs/2010.11764) 663 | - [proScript: Partially Ordered Scripts Generation via Pre-trained Language Models](https://arxiv.org/abs/2104.08251) 664 | - [Goal-Oriented Script Construction](https://arxiv.org/abs/2107.13189) (INLG2021) 665 | - [Contrastive Triple Extraction with Generative Transformer](https://arxiv.org/abs/2009.06207) (AAAI2021) 666 | - [GeDi: Generative Discriminator Guided Sequence Generation](https://arxiv.org/abs/2009.06367) 667 | - [Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation](https://arxiv.org/abs/2009.08942) (EMNLP2020) 668 | - [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) (JMLR2020) [[github](https://github.com/google-research/text-to-text-transfer-transformer)] 669 | - [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) (NAACL2021) [[github](https://github.com/google-research/multilingual-t5)] 670 | - [nmT5 -- Is parallel data still relevant for pre-training massively multilingual language 
models?](https://arxiv.org/abs/2106.02171) (ACL2021) 671 | - [mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs](https://arxiv.org/abs/2104.08692) 672 | - [WT5?! Training Text-to-Text Models to Explain their Predictions](https://arxiv.org/abs/2004.14546) 673 | - [NT5?! Training T5 to Perform Numerical Reasoning](https://arxiv.org/abs/2104.07307) [[github](https://github.com/lesterpjy/numeric-t5)] 674 | - [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) (ACL2020) 675 | - [The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics](https://arxiv.org/abs/2102.01672) 676 | - [GEMv2: Multilingual NLG Benchmarking in a Single Line of Code](https://arxiv.org/abs/2206.11249) 677 | - [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652) [[blog](https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html)] 678 | - [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207) 679 | - [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 680 | - [Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data](https://arxiv.org/abs/2011.03877) (COLING2020) 681 | - [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/abs/2101.00190) 682 | - [Unsupervised Pre-training for Natural Language Generation: A Literature Review](https://arxiv.org/abs/1911.06171) 683 | ## Quality evaluator 684 | - [BERTScore: Evaluating Text Generation with BERT](https://arxiv.org/abs/1904.09675) (ICLR2020) 685 | - [BERTTune: Fine-Tuning Neural Machine Translation with BERTScore](https://arxiv.org/abs/2106.02208) (ACL2021) 686 | - [Machine Translation Evaluation with BERT Regressor](https://arxiv.org/abs/1907.12679) 687 | - [TransQuest: Translation Quality Estimation with Cross-lingual Transformers](https://arxiv.org/abs/2011.01536) (COLING2020) 688 | - [SumQE: a BERT-based Summary Quality Estimation Model](https://arxiv.org/abs/1909.00578) (EMNLP2019) 689 | - [MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance](https://arxiv.org/abs/1909.02622) (EMNLP2019) [[github](https://github.com/AIPHES/emnlp19-moverscore)] 690 | - [BERT as a Teacher: Contextual Embeddings for Sequence-Level Reward](https://arxiv.org/abs/2003.02738) 691 | - [Language Model Augmented Relevance Score](https://arxiv.org/abs/2108.08485) (ACL2021) 692 | - [BLEURT: Learning Robust Metrics for Text Generation](https://arxiv.org/abs/2004.04696) (ACL2020) 693 | - [BARTScore: Evaluating Generated Text as Text Generation](https://arxiv.org/abs/2106.11520) [[github](https://github.com/neulab/BARTScore)] 694 | - [Masked Language Model Scoring](https://arxiv.org/abs/1910.14659) (ACL2020) 695 | - [Simple-QE: Better Automatic Quality Estimation for Text Simplification](https://arxiv.org/abs/2012.12382) 696 | ## Modification (multi-task, masking strategy, etc.) 
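Many of the papers below keep BERT's architecture but change the pre-training objective, most often the masking strategy (whole-word masking, span masking, selective or learned masking). As a rough, self-contained illustration only (not code from any listed paper; the tokens and vocabulary in the example are made up), the sketch below applies whole-word masking with BERT's usual 80/10/10 replace/random/keep split:

```python
import random

def whole_word_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy whole-word masking sketch: WordPiece pieces starting with '##'
    are grouped with the preceding piece, so a word is always masked as a
    unit. Returns (masked_tokens, labels), where labels[i] holds the
    original token at masked positions and None elsewhere (HF collators
    use -100 as the ignore index instead of None)."""
    rng = random.Random(seed)

    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        if rng.random() >= mask_prob:
            continue  # leave this word untouched
        for i in word:
            labels[i] = tokens[i]
            r = rng.random()
            if r < 0.8:                  # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                # 10%: replace with a random token
                masked[i] = rng.choice(vocab)
            # else 10%: keep the original token
    return masked, labels

if __name__ == "__main__":
    tokens = ["the", "trans", "##former", "encode", "##r", "is", "pre", "##train", "##ed"]
    vocab = ["the", "a", "model", "language", "bert"]
    masked, labels = whole_word_mask(tokens, vocab, mask_prob=0.3)
    print(masked)
    print(labels)
```

Real pre-training pipelines additionally cap the number of predictions per sequence and may sample contiguous span lengths instead of single words, as in SpanBERT below.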
697 | - [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/abs/1901.11504) (ACL2019) 698 | - [The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/abs/2002.07972) 699 | - [BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning](https://arxiv.org/abs/1902.02671) (ICML2019) 700 | - [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (ICLR2021) [[github](https://github.com/hendrycks/test)] 701 | - [Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks](https://arxiv.org/abs/2106.04489) (ACL2021) 702 | - [Pre-training Text Representations as Meta Learning](https://arxiv.org/abs/2004.05568) 703 | - [Unifying Question Answering and Text Classification via Span Extraction](https://arxiv.org/abs/1904.09286) 704 | - [MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization](https://arxiv.org/abs/2004.12302) (ACL2020) 705 | - [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/abs/1905.07129) (ACL2019) 706 | - [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) 707 | - [ERNIE 2.0: A Continual Pre-training Framework for Language Understanding](https://arxiv.org/abs/1907.12412) (AAAI2020) 708 | - [ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/abs/2107.02137) 709 | - [ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding](https://arxiv.org/abs/2010.12148) 710 | - [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) (NeurIPS2019) [[github](https://github.com/zihangdai/xlnet)] 711 | - [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 712 | - [Pre-Training with Whole Word Masking for Chinese BERT](https://arxiv.org/abs/1906.08101) 713 | - [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529) (TACL2020) [[github](https://github.com/facebookresearch/SpanBERT)] 714 | - [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) 715 | - [Frustratingly Simple Pretraining Alternatives to Masked Language Modeling](https://arxiv.org/abs/2109.01819) (EMNLP2021) [[github](https://github.com/gucci-j/light-transformer-emnlp2021)] 716 | - [TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning](https://arxiv.org/abs/2111.04198) (NAACL2022) 717 | - [ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations](https://aclanthology.org/2020.findings-emnlp.425/) (EMNLP2020 Findings) 718 | - [ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders](https://arxiv.org/abs/2105.01279) 719 | - [MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining](https://arxiv.org/abs/2011.08539) 720 | - [Adversarial Training for Large Neural Language Models](https://arxiv.org/abs/2004.08994) 721 | - [BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks](https://aclanthology.org/2021.acl-long.164/) (ACL2021) 722 | - [Train No Evil: Selective Masking for Task-guided Pre-training](https://arxiv.org/abs/2004.09733) 723 | - [Position Masking for Language 
Models](https://arxiv.org/abs/2006.05676) 724 | - [Masking as an Efficient Alternative to Finetuning for Pretrained Language Models](https://arxiv.org/abs/2004.12406) (EMNLP2020) 725 | - [Variance-reduced Language Pretraining via a Mask Proposal Network](https://arxiv.org/abs/2008.05333) 726 | - [Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation](https://arxiv.org/abs/2010.02705) (EMNLP2020) 727 | - [Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model](https://arxiv.org/abs/2010.06040) 728 | - [Contextual Representation Learning beyond Masked Language Modeling](https://arxiv.org/abs/2204.04163) (ACL2022) 729 | - [Curriculum learning for language modeling](https://arxiv.org/abs/2108.02170) 730 | - [Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training](https://arxiv.org/abs/2108.06084) 731 | - [Focusing More on Conflicts with Mis-Predictions Helps Language Pre-Training](https://arxiv.org/abs/2012.08789) 732 | - [Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference](https://arxiv.org/abs/2001.07676) (EACL2021) [[github](https://github.com/timoschick/pet)] 733 | - [It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners](https://arxiv.org/abs/2009.07118) (NAACL2021) [[github](https://github.com/timoschick/pet)] 734 | - [Making Pre-trained Language Models Better Few-shot Learners](https://arxiv.org/abs/2012.15723) (ACL2021) [[github](https://github.com/princeton-nlp/LM-BFF)] 735 | - [CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP](https://arxiv.org/abs/2104.08835) 736 | - [Lifelong Learning of Few-shot Learners across NLP Tasks](https://arxiv.org/abs/2104.08808) 737 | - [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964) (ACL2020) 738 | - [Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora](https://aclanthology.org/2022.naacl-main.351/) (NAACL2022) 739 | - [Towards Continual Knowledge Learning of Language Models](https://arxiv.org/abs/2110.03215) (ICLR2022) 740 | - [An Empirical Investigation Towards Efficient Multi-Domain Language Model Pre-training](https://arxiv.org/abs/2010.00784) [[github](https://github.com/aws-health-ai/multi_domain_lm)] 741 | - [To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks](https://arxiv.org/abs/2006.08671) (ACL2020) 742 | - [Revisiting Few-sample BERT Fine-tuning](https://arxiv.org/abs/2006.05987) 743 | - [Blank Language Models](https://arxiv.org/abs/2002.03079) 744 | - [Enabling Language Models to Fill in the Blanks](https://arxiv.org/abs/2005.05339) (ACL2020) 745 | - [Efficient Training of BERT by Progressively Stacking](http://proceedings.mlr.press/v97/gong19a.html) (ICML2019) [[github](https://github.com/gonglinyuan/StackingBERT)] 746 | - [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) [[github](https://github.com/pytorch/fairseq/tree/master/examples/roberta)] 747 | - [On Losses for Modern Language Models](https://arxiv.org/abs/2010.01694) (EMNLP2020) [[github](https://github.com/StephAO/olfmlm)] 748 | - [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) (ICLR2020) 749 | - [Rethinking Embedding Coupling in Pre-trained Language Models](https://openreview.net/forum?id=xpFFI_NtgpW) (ICLR2021) 750 | - [ELECTRA: 
Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB) (ICLR2020) [[github](https://github.com/google-research/electra)] [[blog](https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html)] 751 | - [Training ELECTRA Augmented with Multi-word Selection](https://arxiv.org/abs/2106.00139) (ACL2021 Findings) 752 | - [Learning to Sample Replacements for ELECTRA Pre-Training](https://arxiv.org/abs/2106.13715) (ACL2021 Findings) 753 | - [SCRIPT: Self-Critic PreTraining of Transformers](https://aclanthology.org/2021.naacl-main.409/) (NAACL2021) 754 | - [Pre-Training Transformers as Energy-Based Cloze Models](https://www.aclweb.org/anthology/2020.emnlp-main.20/) (EMNLP2020) [[github](https://github.com/google-research/electra)] 755 | - [MC-BERT: Efficient Language Pre-Training via a Meta Controller](https://arxiv.org/abs/2006.05744) 756 | - [FreeLB: Enhanced Adversarial Training for Language Understanding](https://openreview.net/forum?id=BygzbyHFvB) (ICLR2020) 757 | - [KERMIT: Generative Insertion-Based Modeling for Sequences](https://arxiv.org/abs/1906.01604) 758 | - [CALM: Continuous Adaptive Learning for Language Modeling](https://arxiv.org/abs/2004.03794) 759 | - [SegaBERT: Pre-training of Segment-aware BERT for Language Understanding](https://arxiv.org/abs/2004.14996) 760 | - [DisSent: Sentence Representation Learning from Explicit Discourse Relations](https://arxiv.org/abs/1710.04334) (ACL2019) 761 | - [Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models](https://arxiv.org/abs/2005.10389) (ACL2020) 762 | - [CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations](https://arxiv.org/abs/2010.06351) 763 | - [SLM: Learning a Discourse Language Representation with Sentence Unshuffling](https://arxiv.org/abs/2010.16249) (EMNLP2020) 764 | - [CausalBERT: Injecting Causal Knowledge Into Pre-trained Models with Minimal Supervision](https://arxiv.org/abs/2107.09852) 765 | - [StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding](https://arxiv.org/abs/1908.04577) (ICLR2020) 766 | - [Structural Pre-training for Dialogue Comprehension](https://arxiv.org/abs/2105.10956) (ACL2021) 767 | - [Retrofitting Structure-aware Transformer Language Model for End Tasks](https://arxiv.org/abs/2009.07408) (EMNLP2020) 768 | - [Syntax-Enhanced Pre-trained Model](https://arxiv.org/abs/2012.14116) 769 | - [Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding](https://arxiv.org/abs/1911.06156) 770 | - [Do Syntax Trees Help Pre-trained Transformers Extract Information?](https://arxiv.org/abs/2008.09084) 771 | - [SenseBERT: Driving Some Sense into BERT](https://arxiv.org/abs/1908.05646) 772 | - [Semantics-aware BERT for Language Understanding](https://arxiv.org/abs/1909.02209) (AAAI2020) 773 | - [GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method](https://arxiv.org/abs/2010.12532) 774 | - [K-BERT: Enabling Language Representation with Knowledge Graph](https://arxiv.org/abs/1909.07606) 775 | - [Knowledge Enhanced Contextual Word Representations](https://arxiv.org/abs/1909.04164) (EMNLP2019) 776 | - [Knowledge-Aware Language Model Pretraining](https://arxiv.org/abs/2007.00655) 777 | - [K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters](https://arxiv.org/abs/2002.01808) 778 | - [JAKET: Joint Pre-training of Knowledge Graph and Language 
Understanding](https://arxiv.org/abs/2010.00796) 779 | - [E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT](https://arxiv.org/abs/1911.03681) (EMNLP2020) 780 | - [KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation](https://arxiv.org/abs/1911.06136) 781 | - [Entities as Experts: Sparse Memory Access with Entity Supervision](https://arxiv.org/abs/2004.07202) (EMNLP2020) 782 | - [Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning](https://www.aclweb.org/anthology/2020.emnlp-main.722/) (EMNLP2020) 783 | - [Contextualized Representations Using Textual Encyclopedic Knowledge](https://arxiv.org/abs/2004.12006) 784 | - [CoLAKE: Contextualized Language and Knowledge Embedding](https://arxiv.org/abs/2010.00309) (COLING2020) 785 | - [KI-BERT: Infusing Knowledge Context for Better Language and Domain Understanding](https://arxiv.org/abs/2104.08145) 786 | - [K-XLNet: A General Method for Combining Explicit Knowledge with Language Model Pretraining](https://arxiv.org/abs/2104.10649) 787 | - [Combining pre-trained language models and structured knowledge](https://arxiv.org/abs/2101.12294) 788 | - [Coarse-to-Fine Pre-training for Named Entity Recognition](https://arxiv.org/abs/2010.08210) (EMNLP2020) 789 | - [E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks](https://arxiv.org/abs/2011.05431) (COLING2020 WS) 790 | - [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) (ICML2020) [[github](https://github.com/google-research/language/tree/master/language/realm)] 791 | - [Simple and Efficient ways to Improve REALM](https://arxiv.org/abs/2104.08710) 792 | - [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) (NeurIPS2020) 793 | - [Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering](https://arxiv.org/abs/2106.11517) 794 | - [Joint Retrieval and Generation Training for Grounded Text Generation](https://arxiv.org/abs/2105.06597) 795 | - [Retrieval Augmentation Reduces Hallucination in Conversation](https://arxiv.org/abs/2104.07567) 796 | - [On-The-Fly Information Retrieval Augmentation for Language Models](https://arxiv.org/abs/2007.01528) 797 | - [Current Limitations of Language Models: What You Need is Retrieval](https://arxiv.org/abs/2009.06857) 798 | - [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426) [[blog](https://deepmind.com/research/publications/2021/improving-language-models-by-retrieving-from-trillions-of-tokens)] [[blog](http://jalammar.github.io/illustrated-retrieval-transformer/)] 799 | - [Taking Notes on the Fly Helps BERT Pre-training](https://arxiv.org/abs/2008.01466) 800 | - [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) 801 | - [SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis](https://arxiv.org/abs/2005.05635) (ACL2020) 802 | - [Improving Event Duration Prediction via Time-aware Pre-training](https://arxiv.org/abs/2011.02610) (EMNLP2020 Findings) 803 | - [Knowledge-Aware Procedural Text Understanding with Multi-Stage Training](https://arxiv.org/abs/2009.13199) 804 | - [Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring](https://arxiv.org/abs/1905.01969) (ICLR2020) 805 | - [Rethinking Positional Encoding in Language 
Pre-training](https://arxiv.org/abs/2006.15595) 806 | - [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658) (EMNLP2020 Findings) 807 | - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) 808 | - [Position Information in Transformers: An Overview](https://arxiv.org/abs/2102.11090) 809 | - [BoostingBERT:Integrating Multi-Class Boosting into BERT for NLP Tasks](https://arxiv.org/abs/2009.05959) 810 | - [BURT: BERT-inspired Universal Representation from Twin Structure](https://arxiv.org/abs/2004.13947) 811 | - [Universal Text Representation from BERT: An Empirical Study](https://arxiv.org/abs/1910.07973) 812 | - [Symmetric Regularization based BERT for Pair-wise Semantic Reasoning](https://arxiv.org/abs/1909.03405) (SIGIR2020) 813 | - [Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Document Matching](https://arxiv.org/abs/2004.12297) 814 | - [Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling](https://arxiv.org/abs/2106.01040) (ACL2021) 815 | - [Transfer Fine-Tuning: A BERT Case Study](https://arxiv.org/abs/1909.00931) (EMNLP2019) 816 | - [Improving Pre-Trained Multilingual Models with Vocabulary Expansion](https://arxiv.org/abs/1909.12440) (CoNLL2019) 817 | - [BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance](https://arxiv.org/abs/1910.07181) (ACL2020) 818 | - [A Mixture of h−1 Heads is Better than h Heads](https://arxiv.org/abs/2005.06537) (ACL2020) 819 | - [SesameBERT: Attention for Anywhere](https://arxiv.org/abs/1910.03176) 820 | - [Multi-Head Attention: Collaborate Instead of Concatenate](https://arxiv.org/abs/2006.16362) 821 | - [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) [[github](https://github.com/microsoft/DeBERTa)] 822 | - [Deepening Hidden Representations from Pre-trained Language Models](https://arxiv.org/abs/1911.01940) 823 | - [On the Transformer Growth for Progressive BERT Training](https://arxiv.org/abs/2010.12562) 824 | - [Improving BERT with Self-Supervised Attention](https://arxiv.org/abs/2004.03808) 825 | - [Guiding Attention for Self-Supervised Learning with Transformers](https://arxiv.org/abs/2010.02399) (EMNLP2020 Findings) 826 | - [Improving Disfluency Detection by Self-Training a Self-Attentive Model](https://arxiv.org/abs/2004.05323) 827 | - [Self-training Improves Pre-training for Natural Language Understanding](https://arxiv.org/abs/2010.02194) [[github](https://github.com/facebookresearch/SentAugment)] 828 | - [CERT: Contrastive Self-supervised Learning for Language Understanding](https://arxiv.org/abs/2005.12766) 829 | - [Robust Transfer Learning with Pretrained Language Models through Adapters](https://arxiv.org/abs/2108.02340) (ACL2021) 830 | - [ReadOnce Transformers: Reusable Representations of Text for Transformers](https://arxiv.org/abs/2010.12854) (ACL2021) 831 | - [LV-BERT: Exploiting Layer Variety for BERT](https://arxiv.org/abs/2106.11740) (ACL2021 Findings) [[github](https://github.com/yuweihao/LV-BERT)] 832 | - [Large Product Key Memory for Pretrained Language Models](https://arxiv.org/abs/2010.03881) (EMNLP2020 Findings) 833 | - [Enhancing Pre-trained Language Model with Lexical Simplification](https://arxiv.org/abs/2012.15070) 834 | - [Contextual BERT: Conditioning the Language Model Using a Global State](https://arxiv.org/abs/2010.15778) (COLING2020 WS) 835 | - [SMART: Robust 
and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization](https://arxiv.org/abs/1911.03437) (ACL2020) 836 | - [Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning](https://arxiv.org/abs/2109.05687) (EMNLP2021) [[github](https://github.com/RunxinXu/ChildTuning)] 837 | - [Token Dropping for Efficient BERT Pretraining](https://arxiv.org/abs/2203.13240) (ACL2022) [[github](https://github.com/tensorflow/models/tree/master/official/projects/token_dropping)] 838 | - [Pay Attention to MLPs](https://arxiv.org/abs/2105.08050) 839 | - [Are Pre-trained Convolutions Better than Pre-trained Transformers?](https://arxiv.org/abs/2105.03322) (ACL2021) 840 | - [Pre-Training a Language Model Without Human Language](https://arxiv.org/abs/2012.11995) 841 | ### Tokenization 842 | - [Training Multilingual Pre-trained Language Model with Byte-level Subwords](https://arxiv.org/abs/2101.09012) 843 | - [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/abs/2004.03720) (EMNLP2020 Findings) 844 | - [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) (TACL2022) [[github](https://github.com/google-research/language/tree/master/language/canine)] 845 | - [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) (TACL2022) [[github](https://github.com/google-research/byt5)] 846 | - [Multi-view Subword Regularization](https://arxiv.org/abs/2103.08490) (NAACL2021) 847 | - [Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation](https://arxiv.org/abs/2106.06125) (ACL2021) 848 | - [An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks](https://arxiv.org/abs/2010.02534) (AACL-IJCNLP2020) 849 | - [AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization](https://arxiv.org/abs/2008.11869) 850 | - [LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization](https://arxiv.org/abs/2108.00801) (ACL2021 Findings) 851 | - [Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models](https://arxiv.org/abs/2104.07204) (NAACL2021) 852 | - [CharBERT: Character-aware Pre-trained Language Model](https://arxiv.org/abs/2011.01513) (COLING2020) [[github](https://github.com/wtma/CharBERT)] 853 | - [CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters](https://arxiv.org/abs/2010.10392) (COLING2020) 854 | - [Charformer: Fast Character Transformers via Gradient-based Subword Tokenization](https://arxiv.org/abs/2106.12672) [[github](https://github.com/google-research/google-research/tree/master/charformer)] 855 | - [Fast WordPiece Tokenization](https://arxiv.org/abs/2012.15524) (EMNLP2021) 856 | - [MaxMatch-Dropout: Subword Regularization for WordPiece](https://arxiv.org/abs/2209.04126) (COLING2022) 857 | ### Prompt 858 | - [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/abs/2107.13586) 859 | - [AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/abs/2010.15980) (EMNLP2020) [[github](https://github.com/ucinlp/autoprompt)] 860 | - [Calibrate Before Use: Improving Few-Shot Performance of Language Models](https://arxiv.org/abs/2102.09690) 861 | - [Prompt Programming for Large Language Models: Beyond the Few-Shot 
Paradigm](https://arxiv.org/abs/2102.07350) 862 | - [GPT Understands, Too](https://arxiv.org/abs/2103.10385) [[github](https://github.com/THUDM/P-tuning)] 863 | - [How Many Data Points is a Prompt Worth?](https://arxiv.org/abs/2103.08493) (NAACL2021) [[website](https://huggingface.co/blog/how_many_data_points/)] 864 | - [Learning How to Ask: Querying LMs with Mixtures of Soft Prompts](https://arxiv.org/abs/2104.06599) (NAACL2021) 865 | - [Meta-tuning Language Models to Answer Prompts Better](https://arxiv.org/abs/2104.04670) 866 | - [Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity](https://arxiv.org/abs/2104.08786) 867 | - [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) (EMNLP2021) 868 | - [Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models](https://arxiv.org/abs/2106.13353) 869 | - [PPT: Pre-trained Prompt Tuning for Few-shot Learning](https://arxiv.org/abs/2109.04332) 870 | - [True Few-Shot Learning with Language Models](https://arxiv.org/abs/2105.11447) 871 | - [Few-shot Sequence Learning with Transformers](https://arxiv.org/abs/2012.09543) (NeurIPS2020 WS) 872 | - [PTR: Prompt Tuning with Rules for Text Classification](https://arxiv.org/abs/2105.11259) 873 | - [Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification](https://arxiv.org/abs/2108.02035) 874 | - [Discrete and Soft Prompting for Multilingual Models](https://arxiv.org/abs/2109.03630) (EMNLP2021) 875 | - [Reframing Instructional Prompts to GPTk's Language](https://arxiv.org/abs/2109.07830) 876 | - [Multimodal Few-Shot Learning with Frozen Language Models](https://arxiv.org/abs/2106.13884) 877 | - [FLEX: Unifying Evaluation for Few-Shot NLP](https://arxiv.org/abs/2107.07170) 878 | - [Do Prompt-Based Models Really Understand the Meaning of their Prompts?](https://arxiv.org/abs/2109.01247) 879 | - [OpenPrompt: An Open-source Framework for Prompt-learning](https://aclanthology.org/2022.acl-demo.10/) (ACL2022 Demo) 880 | ## Sentence embedding 881 | - [Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks](https://arxiv.org/abs/1811.01088) 882 | - [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP2019) 883 | - [Parameter-free Sentence Embedding via Orthogonal Basis](https://arxiv.org/abs/1810.00438) (EMNLP2019) 884 | - [SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models](https://arxiv.org/abs/2002.06652) 885 | - [On the Sentence Embeddings from Pre-trained Language Models](https://arxiv.org/abs/2011.05864) (EMNLP2020) 886 | - [Semantic Re-tuning with Contrastive Tension](https://openreview.net/forum?id=Ov_sMNau-PF) (ICLR2021) 887 | - [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659) (ACL2021) 888 | - [ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer](https://arxiv.org/abs/2105.11741) (ACL2021) 889 | - [CLEAR: Contrastive Learning for Sentence Representation](https://arxiv.org/abs/2012.15466) 890 | - [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821) (EMNLP2021) [[github](https://github.com/princeton-nlp/simcse)] 891 | - [ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding](https://arxiv.org/abs/2109.04380) 892 | - [Fast, Effective, and Self-Supervised: 
Transforming Masked Language Models into Universal Lexical and Sentence Encoders](https://arxiv.org/abs/2104.08027) (EMNLP2021) [[github](https://github.com/cambridgeltl/mirror-bert)] 893 | - [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (EMNLP2021 Findings) 894 | - [Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations](https://arxiv.org/abs/2109.13059) [[github](https://github.com/amzn/trans-encoder)] 895 | - [Whitening Sentence Representations for Better Semantics and Faster Retrieval](https://arxiv.org/abs/2103.15316) [[github](https://github.com/bojone/BERT-whitening)] 896 | - [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL2021) 897 | - [DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings](https://arxiv.org/abs/2204.10298) (NAACL2022) [[code](https://github.com/voidism/DiffCSE)] 898 | - [Unsupervised Sentence Representation via Contrastive Learning with Mixing Negatives](https://ojs.aaai.org/index.php/AAAI/article/view/21428) (AAAI2022) [[github](https://github.com/BDBC-KG-NLP/MixCSE_AAAI2022)] 899 | - [Sentence Embeddings by Ensemble Distillation](https://arxiv.org/abs/2104.06719) 900 | - [EASE: Entity-Aware Contrastive Learning of Sentence Embedding](https://arxiv.org/abs/2205.04260) (NAACL2022) 901 | - [Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models](https://arxiv.org/abs/2108.08877) 902 | - [Dual-View Distilled BERT for Sentence Embedding](https://arxiv.org/abs/2104.08675) (SIGIR2021) 903 | - [DefSent: Sentence Embeddings using Definition Sentences](https://arxiv.org/abs/2105.04339) (ACL2021) 904 | - [Paraphrastic Representations at Scale](https://arxiv.org/abs/2104.15114) [[github](https://github.com/jwieting/paraphrastic-representations-at-scale)] 905 | - [Learning Dense Representations of Phrases at Scale](https://arxiv.org/abs/2012.12624) (ACL2021) [[github](https://github.com/princeton-nlp/DensePhrases)] 906 | - [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304) (EMNLP2021) 907 | ## Transformer variants 908 | - [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732) 909 | - [Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799) (ACL2019) 910 | - [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) (ACL2019) [[github](https://github.com/kimiyoung/transformer-xl)] 911 | - [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) 912 | - [Do Transformers Need Deep Long-Range Memory](https://arxiv.org/abs/2007.03356) (ACL2020) 913 | - [DA-Transformer: Distance-aware Transformer](https://arxiv.org/abs/2010.06925) (NAACL2021) 914 | - [Adaptively Sparse Transformers](https://arxiv.org/abs/1909.00015) (EMNLP2019) 915 | - [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507) 916 | - [The Evolved Transformer](https://arxiv.org/abs/1901.11117) (ICML2019) 917 | - [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) (ICLR2020) [[github](https://github.com/google/trax/tree/master/trax/models/reformer)] 918 | - [GRET: Global Representation Enhanced Transformer](https://arxiv.org/abs/2002.10101) (AAAI2020) 919 | - [GMAT: Global Memory Augmentation for 
907 | ## Transformer variants 908 | - [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732) 909 | - [Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799) (ACL2019) 910 | - [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) (ACL2019) [[github](https://github.com/kimiyoung/transformer-xl)] 911 | - [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) 912 | - [Do Transformers Need Deep Long-Range Memory?](https://arxiv.org/abs/2007.03356) (ACL2020) 913 | - [DA-Transformer: Distance-aware Transformer](https://arxiv.org/abs/2010.06925) (NAACL2021) 914 | - [Adaptively Sparse Transformers](https://arxiv.org/abs/1909.00015) (EMNLP2019) 915 | - [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507) 916 | - [The Evolved Transformer](https://arxiv.org/abs/1901.11117) (ICML2019) 917 | - [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) (ICLR2020) [[github](https://github.com/google/trax/tree/master/trax/models/reformer)] 918 | - [GRET: Global Representation Enhanced Transformer](https://arxiv.org/abs/2002.10101) (AAAI2020) 919 | - [GMAT: Global Memory Augmentation for Transformers](https://arxiv.org/abs/2006.03274) 920 | - [Memory Transformer](https://arxiv.org/abs/2006.11527) 921 | - [Transformer on a Diet](https://arxiv.org/abs/2002.06170) [[github](https://github.com/cgraywang/transformer-on-diet)] 922 | - [A Tensorized Transformer for Language Modeling](https://arxiv.org/abs/1906.09777) (NeurIPS2019) 923 | - [DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling](https://arxiv.org/abs/1911.12385) (ICLR2020) [[github](https://github.com/sacmehta/delight)] 924 | - [DeLighT: Very Deep and Light-weight Transformer](https://arxiv.org/abs/2008.00623) [[github](https://github.com/sacmehta/delight)] 925 | - [Lite Transformer with Long-Short Range Attention](https://arxiv.org/abs/2004.11886) [[github](https://github.com/mit-han-lab/lite-transformer)] (ICLR2020) 926 | - [Efficient Content-Based Sparse Attention with Routing Transformers](https://openreview.net/forum?id=B1gjs6EtDr) 927 | - [BP-Transformer: Modelling Long-Range Context via Binary Partitioning](https://arxiv.org/abs/1911.04070) 928 | - [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) [[github](https://github.com/allenai/longformer)] 929 | - [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) 930 | - [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) (AAAI2021) 931 | - [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) (AAAI2021) [[github](https://github.com/mlpen/Nystromformer)] 932 | - [Improving Transformer Models by Reordering their Sublayers](https://arxiv.org/abs/1911.03864) (ACL2020) 933 | - [Highway Transformer: Self-Gating Enhanced Self-Attentive Networks](https://arxiv.org/abs/2004.08178) 934 | - [Mask Attention Networks: Rethinking and Strengthen Transformer](https://arxiv.org/abs/2103.13597) (NAACL2021) 935 | - [Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/abs/2005.00743) 936 | - [Query-Key Normalization for Transformers](https://arxiv.org/abs/2010.04245) (EMNLP2020 Findings) 937 | - [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794) (ICLR2021) 938 | - [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135) 939 | - [Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change](https://arxiv.org/abs/2005.02008) 940 | - [HAT: Hardware-Aware Transformers for Efficient Natural Language Processing](https://arxiv.org/abs/2005.14187) (ACL2020) [[github](https://github.com/mit-han-lab/hardware-aware-transformers)] 941 | - [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768) 942 | - [What's Hidden in a One-layer Randomly Weighted Transformer?](https://arxiv.org/abs/2109.03939) (EMNLP2021) 943 | - [Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236) 944 | - [Understanding the Difficulty of Training Transformers](https://arxiv.org/abs/2004.08249) (EMNLP2020) 945 | - [Towards Fully 8-bit Integer Inference for the Transformer Model](https://arxiv.org/abs/2009.08034) (IJCAI2020) 946 | - [Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation](https://arxiv.org/abs/2009.07453) 947 | - [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
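Several of the long-sequence variants above (Longformer, Big Bird, BP-Transformer, and others) replace full self-attention with local, windowed attention so that cost grows roughly linearly with sequence length. The toy sketch below illustrates the windowing idea only; it is not code from any of those papers, and for readability it still materializes the full score matrix, whereas real implementations compute only the in-window band. The window size is arbitrary.

```python
# Toy sliding-window (local) self-attention: each position attends only to
# neighbours within +/- `window` positions. Illustrative sketch, not an
# implementation from any paper listed above.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """q, k, v: (seq_len, d) tensors for a single head."""
    seq_len, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    idx = torch.arange(seq_len)
    out_of_window = (idx[:, None] - idx[None, :]).abs() > window
    scores = scores.masked_fill(out_of_window, float("-inf"))  # drop distant positions
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 8)
print(sliding_window_attention(q, k, v, window=2).shape)  # torch.Size([16, 8])
```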
948 | ## Probe 949 | - [A Structural Probe for Finding Syntax in Word Representations](https://aclweb.org/anthology/papers/N/N19/N19-1419/) (NAACL2019) 950 | - [When Bert Forgets How To POS: Amnesic Probing of Linguistic Properties and MLM Predictions](https://arxiv.org/abs/2006.00995) 951 | - [Finding Universal Grammatical Relations in Multilingual BERT](https://arxiv.org/abs/2005.04511) (ACL2020) 952 | - [Probing Multilingual BERT for Genetic and Typological Signals](https://arxiv.org/abs/2011.02070) (COLING2020) 953 | - [Linguistic Knowledge and Transferability of Contextual Representations](https://arxiv.org/abs/1903.08855) (NAACL2019) [[github](https://github.com/nelson-liu/contextual-repr-analysis)] 954 | - [Probing What Different NLP Tasks Teach Machines about Function Word Comprehension](https://arxiv.org/abs/1904.11544) (*SEM2019) 955 | - [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/abs/1905.05950) (ACL2019) 956 | - [A Closer Look at How Fine-tuning Changes BERT](https://arxiv.org/abs/2106.14282) (ACL2022) 957 | - [Mediators in Determining what Processing BERT Performs First](https://arxiv.org/abs/2104.06400) (NAACL2021) 958 | - [Probing Neural Network Comprehension of Natural Language Arguments](https://arxiv.org/abs/1907.07355) (ACL2019) 959 | - [Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep Contextual Representations](https://arxiv.org/abs/1910.01157) (EMNLP2019 WS) 960 | - [What do you mean, BERT? Assessing BERT as a Distributional Semantics Model](https://arxiv.org/abs/1911.05758) 961 | - [Quantity doesn't buy quality syntax with neural language models](https://arxiv.org/abs/1909.00111) (EMNLP2019) 962 | - [Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction](https://openreview.net/forum?id=H1xPR3NtPB) (ICLR2020) 963 | - [Discourse Probing of Pretrained Language Models](https://arxiv.org/abs/2104.05882) (NAACL2021) 964 | - [oLMpics -- On what Language Model Pre-training Captures](https://arxiv.org/abs/1912.13283) 965 | - [Do Neural Language Models Show Preferences for Syntactic Formalisms?](https://arxiv.org/abs/2004.14096) (ACL2020) 966 | - [Probing for Predicate Argument Structures in Pretrained Language Models](https://aclanthology.org/2022.acl-long.316/) (ACL2022) 967 | - [Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT](https://arxiv.org/abs/2004.14786) (ACL2020) 968 | - [Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?](https://arxiv.org/abs/2005.00628) (ACL2020) 969 | - [Probing Linguistic Systematicity](https://arxiv.org/abs/2005.04315) (ACL2020) 970 | - [A Matter of Framing: The Impact of Linguistic Formalism on Probing Results](https://arxiv.org/abs/2004.14999) 971 | - [A Cross-Task Analysis of Text Span Representations](https://www.aclweb.org/anthology/2020.repl4nlp-1.20/) (ACL2020 WS) 972 | - [When Do You Need Billions of Words of Pretraining Data?](https://arxiv.org/abs/2011.04946) [[github](https://github.com/nyu-mll/pretraining-learning-curves)] 973 | - [Picking BERT's Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis](https://arxiv.org/abs/2011.12073) 974 | - [Language Models as Knowledge Bases?](https://arxiv.org/abs/1909.01066) (EMNLP2019) [[github](https://github.com/facebookresearch/LAMA)] 975 | - [BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. 
Name-Based Reasoning in Unsupervised QA](https://arxiv.org/abs/1911.03681) 976 | - [How Much Knowledge Can You Pack Into the Parameters of a Language Model?](https://arxiv.org/abs/2002.08910) (EMNLP2020) 977 | - [Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries](https://arxiv.org/abs/2008.09036) (EACL2021) 978 | - [Factual Probing Is [MASK]: Learning vs. Learning to Recall](https://arxiv.org/abs/2104.05240) (NAACL2021) [[github](https://github.com/princeton-nlp/OptiPrompt)] 979 | - [Knowledge Neurons in Pretrained Transformers](https://arxiv.org/abs/2104.08696) 980 | - [DirectProbe: Studying Representations without Classifiers](https://www.aclweb.org/anthology/2021.naacl-main.401/) (NAACL2021) 981 | - [The Language Model Understood the Prompt was Ambiguous: Probing Syntactic Uncertainty Through Generation](https://arxiv.org/abs/2109.07848) (EMNLP2021 WS) 982 | - [X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models](https://arxiv.org/abs/2010.06189) (EMNLP2020) 983 | - [Probing BERT in Hyperbolic Spaces](https://arxiv.org/abs/2104.03869) (ICLR2021) 984 | - [Probing Across Time: What Does RoBERTa Know and When?](https://arxiv.org/abs/2104.07885) 985 | - [Do NLP Models Know Numbers? Probing Numeracy in Embeddings](https://arxiv.org/abs/1909.07940) (EMNLP2019) 986 | - [Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models](https://arxiv.org/abs/2005.00683) [[github](https://github.com/INK-USC/NumerSense)] [[website](https://inklab.usc.edu/NumerSense/)] 987 | - [Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly](https://www.aclweb.org/anthology/2020.acl-main.698/) (ACL2020) 988 | - [How is BERT surprised? Layerwise detection of linguistic anomalies](https://arxiv.org/abs/2105.07452) (ACL2021) 989 | - [Exploring the Role of BERT Token Representations to Explain Sentence Probing Results](https://arxiv.org/abs/2104.01477) 990 | - [What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge](https://arxiv.org/abs/1912.13337) 991 | - [A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension](https://arxiv.org/abs/2006.01346) 992 | - [Can BERT Reason? Logically Equivalent Probes for Evaluating the Inference Capabilities of Language Models](https://arxiv.org/abs/2005.00782) 993 | - [Probing Task-Oriented Dialogue Representation from Language Models](https://arxiv.org/abs/2010.13912) (EMNLP2020) 994 | - [Probing for Bridging Inference in Transformer Language Models](https://arxiv.org/abs/2104.09400) 995 | - [BERTering RAMS: What and How Much does BERT Already Know About Event Arguments? -- A Study on the RAMS Dataset](https://arxiv.org/abs/2010.04098) (EMNLP2020 WS) 996 | - [CxGBERT: BERT meets Construction Grammar](https://arxiv.org/abs/2011.04134) (COLING2020) [[github](https://github.com/H-TayyarMadabushi/CxGBERT-BERT-meets-Construction-Grammar)] 997 | - [BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?](https://arxiv.org/abs/2105.04949) (ACL2021) 998 | ## Inside BERT 999 | - [What does BERT learn about the structure of language?](https://hal.inria.fr/hal-02131630/document) (ACL2019) 1000 | - [Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned](https://arxiv.org/abs/1905.09418) (ACL2019) [[github](https://github.com/lena-voita/the-story-of-heads)] 1001 | - [Multi-head or Single-head? 
An Empirical Comparison for Transformer Training](https://arxiv.org/abs/2106.09650) 1002 | - [Open Sesame: Getting Inside BERT's Linguistic Knowledge](https://arxiv.org/abs/1906.01698) (ACL2019 WS) 1003 | - [Analyzing the Structure of Attention in a Transformer Language Model](https://arxiv.org/abs/1906.04284) (ACL2019 WS) 1004 | - [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341) (ACL2019 WS) 1005 | - [Do Attention Heads in BERT Track Syntactic Dependencies?](https://arxiv.org/abs/1911.12246) 1006 | - [Blackbox meets blackbox: Representational Similarity and Stability Analysis of Neural Language Models and Brains](https://arxiv.org/abs/1906.01539) (ACL2019 WS) 1007 | - [Inducing Syntactic Trees from BERT Representations](https://arxiv.org/abs/1906.11511) (ACL2019 WS) 1008 | - [A Multiscale Visualization of Attention in the Transformer Model](https://arxiv.org/abs/1906.05714) (ACL2019 Demo) 1009 | - [Visualizing and Measuring the Geometry of BERT](https://arxiv.org/abs/1906.02715) 1010 | - [How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings](https://arxiv.org/abs/1909.00512) (EMNLP2019) 1011 | - [Are Sixteen Heads Really Better than One?](https://arxiv.org/abs/1905.10650) (NeurIPS2019) 1012 | - [On the Validity of Self-Attention as Explanation in Transformer Models](https://arxiv.org/abs/1908.04211) 1013 | - [Visualizing and Understanding the Effectiveness of BERT](https://arxiv.org/abs/1908.05620) (EMNLP2019) 1014 | - [Attention Interpretability Across NLP Tasks](https://arxiv.org/abs/1909.11218) 1015 | - [Revealing the Dark Secrets of BERT](https://arxiv.org/abs/1908.08593) (EMNLP2019) 1016 | - [Analyzing Redundancy in Pretrained Transformer Models](https://arxiv.org/abs/2004.04010) (EMNLP2020) 1017 | - [What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models](https://arxiv.org/abs/2004.06499) 1018 | - [Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms](https://arxiv.org/abs/2004.10102) (ACL2020 SRW) 1019 | - [Incorporating Residual and Normalization Layers into Analysis of Masked Language Models](https://arxiv.org/abs/2109.07152) (EMNLP2021) 1020 | - [Quantifying Attention Flow in Transformers](https://arxiv.org/abs/2005.00928) 1021 | - [Telling BERT's full story: from Local Attention to Global Aggregation](https://arxiv.org/abs/2004.05916) (EACL2021) 1022 | - [How Far Does BERT Look At:Distance-based Clustering and Analysis of BERT′s Attention](https://arxiv.org/abs/2011.00943) 1023 | - [Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks](https://arxiv.org/abs/2108.08375) (ACL2021) 1024 | - [What Do Position Embeddings Learn? 
An Empirical Study of Pre-Trained Language Model Positional Encoding](https://arxiv.org/abs/2010.04903) (EMNLP2020) 1025 | - [Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs](https://arxiv.org/abs/1909.02597) (EMNLP2019) 1026 | - [Are Pretrained Language Models Symbolic Reasoners Over Knowledge?](https://arxiv.org/abs/2006.10413) (CoNLL2020) 1027 | - [Rethinking the Value of Transformer Components](https://arxiv.org/abs/2011.03803) (COLING2020) 1028 | - [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913) 1029 | - [Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space](https://arxiv.org/abs/2203.14680) 1030 | - [Investigating Transferability in Pretrained Language Models](https://arxiv.org/abs/2004.14975) 1031 | - [What Happens To BERT Embeddings During Fine-tuning?](https://arxiv.org/abs/2004.14448) 1032 | - [Analyzing Individual Neurons in Pre-trained Language Models](https://arxiv.org/abs/2010.02695) (EMNLP2020) 1033 | - [How fine can fine-tuning be? Learning efficient language models](https://arxiv.org/abs/2004.14129) (AISTATS2020) 1034 | - [The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives](https://arxiv.org/abs/1909.01380) (EMNLP2019) 1035 | - [A Primer in BERTology: What we know about how BERT works](https://arxiv.org/abs/2002.12327) (TACL2020) 1036 | - [Pretrained Language Model Embryology: The Birth of ALBERT](https://arxiv.org/abs/2010.02480) (EMNLP2020) [[github](https://github.com/d223302/albert-embryology)] 1037 | - [Evaluating Saliency Methods for Neural Language Models](https://arxiv.org/abs/2104.05824) (NAACL2021) 1038 | - [Investigating Gender Bias in BERT](https://arxiv.org/abs/2009.05021) 1039 | - [Measuring and Reducing Gendered Correlations in Pre-trained Models](https://arxiv.org/abs/2010.06032) [[website](https://ai.googleblog.com/2020/10/measuring-gendered-correlations-in-pre.html)] 1040 | - [Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias](https://arxiv.org/abs/2010.14534) (COLING2020 WS) 1041 | - [Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models](https://arxiv.org/abs/2101.09688) (EACL2021) 1042 | - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://arxiv.org/abs/2010.00133) (EMNLP2020) 1043 | - [Unmasking the Mask -- Evaluating Social Biases in Masked Language Models](https://arxiv.org/abs/2104.07496) 1044 | - [BERT Knows Punta Cana is not just beautiful, it's gorgeous: Ranking Scalar Adjectives with Contextualised Representations](https://arxiv.org/abs/2010.02686) (EMNLP2020) 1045 | - [Does Chinese BERT Encode Word Structure?](https://arxiv.org/abs/2010.07711) (COLING2020) [[github](https://github.com/ylwangy/BERT_zh_Analysis)] 1046 | - [How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations](https://arxiv.org/abs/1909.04925) (CIKM2019) 1047 | - [Whatcha lookin' at? 
DeepLIFTing BERT's Attention in Question Answering](https://arxiv.org/abs/1910.06431) 1048 | - [What does BERT Learn from Multiple-Choice Reading Comprehension Datasets?](https://arxiv.org/abs/1910.12391) 1049 | - [What do Models Learn from Question Answering Datasets?](https://arxiv.org/abs/2004.03490) 1050 | - [Towards Interpreting BERT for Reading Comprehension Based QA](https://arxiv.org/abs/2010.08983) (EMNLP2020) 1051 | - [Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA](https://arxiv.org/abs/2009.08257) (EMNLP2020) 1052 | - [How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope](https://www.aclweb.org/anthology/2020.acl-main.429/) (ACL2020) 1053 | - [Calibration of Pre-trained Transformers](https://arxiv.org/abs/2003.07892) 1054 | - [When BERT Plays the Lottery, All Tickets Are Winning](https://arxiv.org/abs/2005.00561) (EMNLP2020) 1055 | - [The Lottery Ticket Hypothesis for Pre-trained BERT Networks](https://arxiv.org/abs/2007.12223) 1056 | - [What Context Features Can Transformer Language Models Use?](https://arxiv.org/abs/2106.08367) (ACL2021) 1057 | - [exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models](https://arxiv.org/abs/1910.05276) [[github](https://github.com/bhoov/exbert)] 1058 | - [The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models](https://arxiv.org/abs/2008.05122) [[github](https://github.com/pair-code/lit)] 1059 | - [What Does BERT with Vision Look At?](https://www.aclweb.org/anthology/2020.acl-main.469/) (ACL2020) 1060 | - [Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models]() (ECCV2020) 1061 | - [Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers](https://arxiv.org/pdf/2102.00529.pdf) (TACL2021) 1062 | - [What Vision-Language Models ‘See’ when they See Scenes](https://arxiv.org/pdf/2109.07301.pdf) 1063 | ## Multi-lingual 1064 | - [A Primer on Pretrained Multilingual Language Models](https://arxiv.org/abs/2107.00676) 1065 | - [Multilingual Constituency Parsing with Self-Attention and Pre-Training](https://arxiv.org/abs/1812.11760) (ACL2019) 1066 | - [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) (NeurIPS2019) [[github](https://github.com/facebookresearch/XLM)] 1067 | - [XLM-E: Cross-lingual Language Model Pre-training via ELECTRA](https://arxiv.org/abs/2106.16138) 1068 | - [XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge](https://arxiv.org/abs/2109.12573) 1069 | - [75 Languages, 1 Model: Parsing Universal Dependencies Universally](https://arxiv.org/abs/1904.02099) (EMNLP2019) [[github](https://github.com/hyperparticle/udify)] 1070 | - [Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations](https://arxiv.org/abs/1910.05479) (EMNLP2019 WS) 1071 | - [Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank](https://arxiv.org/abs/2009.14124) (EMNLP2020 Findings) 1072 | - [Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT](https://arxiv.org/abs/1904.09077) (EMNLP2019) 1073 | - [How multilingual is Multilingual BERT?](https://arxiv.org/abs/1906.01502) (ACL2019) 1074 | - [How Language-Neutral is Multilingual BERT?](https://arxiv.org/abs/1911.03310) 1075 | - [How to Adapt Your Pretrained Multilingual Model to 1600 Languages](https://arxiv.org/abs/2106.02124) (ACL2021) 1076 | - [Load What You Need: 
Smaller Versions of Multilingual BERT](https://arxiv.org/abs/2010.05609) (EMNLP2020) [[github](https://github.com/Geotrend-research/smaller-transformers)] 1077 | - [Is Multilingual BERT Fluent in Language Generation?](https://arxiv.org/abs/1910.03806) 1078 | - [ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation](https://arxiv.org/abs/2106.01597) (ACL2021 Findings) 1079 | - [Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks](https://www.aclweb.org/anthology/D19-1252/) (EMNLP2019) 1080 | - [BERT is Not an Interlingua and the Bias of Tokenization](https://www.aclweb.org/anthology/D19-6106/) (EMNLP2019 WS) 1081 | - [Cross-Lingual Ability of Multilingual BERT: An Empirical Study](https://openreview.net/forum?id=HJeT3yrtDr) (ICLR2020) 1082 | - [Multilingual Alignment of Contextual Word Representations](https://arxiv.org/abs/2002.03518) (ICLR2020) 1083 | - [Emerging Cross-lingual Structure in Pretrained Language Models](https://arxiv.org/abs/1911.01464) (ACL2020) 1084 | - [On the Cross-lingual Transferability of Monolingual Representations](https://arxiv.org/abs/1910.11856) 1085 | - [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) (ACL2020) 1086 | - [FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding](https://arxiv.org/abs/2009.05166) 1087 | - [Cross-lingual Alignment Methods for Multilingual BERT: A Comparative Study](https://arxiv.org/abs/2009.14304) (EMNLP2020 Findings) 1088 | - [Emerging Cross-lingual Structure in Pretrained Language Models](https://arxiv.org/abs/1911.01464) 1089 | - [Can Monolingual Pretrained Models Help Cross-Lingual Classification?](https://arxiv.org/abs/1911.03913) 1090 | - [A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT](https://arxiv.org/abs/2004.09205) 1091 | - [Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data](https://www.aclweb.org/anthology/K19-1020/) (CoNLL2019) 1092 | - [What the \[MASK\]? 
Making Sense of Language-Specific BERT Models](https://arxiv.org/abs/2003.02912) 1093 | - [XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization](https://arxiv.org/abs/2003.11080) (ICML2020) 1094 | - [XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation](https://arxiv.org/abs/2104.07412) (EMNLP2021) 1095 | - [XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation](https://arxiv.org/abs/2004.01401) 1096 | - [A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages](https://arxiv.org/abs/2004.03032) 1097 | - [Extending Multilingual BERT to Low-Resource Languages](https://arxiv.org/abs/2004.13640) 1098 | - [Learning Better Universal Representations from Pre-trained Contextualized Language Models](https://arxiv.org/abs/2004.13947) 1099 | - [Universal Dependencies according to BERT: both more specific and more general](https://arxiv.org/abs/2004.14620) 1100 | - [A Call for More Rigor in Unsupervised Cross-lingual Learning](https://arxiv.org/abs/2004.14958) (ACL2020) 1101 | - [Identifying Necessary Elements for BERT's Multilinguality](https://arxiv.org/abs/2005.00396) (EMNLP2020) 1102 | - [MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer](https://arxiv.org/abs/2005.00052) 1103 | - [From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers](https://arxiv.org/abs/2005.00633) 1104 | - [Language Models are Few-shot Multilingual Learners](https://arxiv.org/abs/2109.07684) 1105 | - [First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT](https://arxiv.org/abs/2101.11109) (EACL2021) 1106 | - [Multilingual BERT Post-Pretraining Alignment](https://arxiv.org/abs/2010.12547) (NAACL2021) 1107 | - [XeroAlign: Zero-Shot Cross-lingual Transformer Alignment](https://arxiv.org/abs/2105.02472) (ACL2021 Findings) 1108 | - [Syntax-augmented Multilingual BERT for Cross-lingual Transfer](https://arxiv.org/abs/2106.02134) (ACL2021) 1109 | - [Language Representation in Multilingual BERT and its applications to improve Cross-lingual Generalization](https://arxiv.org/abs/2010.10041) 1110 | - [VECO: Variable Encoder-decoder Pre-training for Cross-lingual Understanding and Generation](https://openreview.net/forum?id=YjNv-hzM8BE) 1111 | - [On the Language Neutrality of Pre-trained Multilingual Representations](https://arxiv.org/abs/2004.05160) 1112 | - [Are All Languages Created Equal in Multilingual BERT?](https://arxiv.org/abs/2005.09093) (ACL2020 WS) 1113 | - [When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models](https://arxiv.org/abs/2010.12858) 1114 | - [Adapting Monolingual Models: Data can be Scarce when Language Similarity is High](https://arxiv.org/abs/2105.02855) (ACL2021 Findings) 1115 | - [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) 1116 | - [Universal Sentence Representation Learning with Conditional Masked Language Model](https://arxiv.org/abs/2012.14388) 1117 | - [WikiBERT models: deep transfer learning for many languages](https://arxiv.org/abs/2006.01538) 1118 | - [Inducing Language-Agnostic Multilingual Representations](https://arxiv.org/abs/2008.09112) 1119 | - [To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding?](https://arxiv.org/abs/2011.05007) (COLING2020) 1120 | - [It's not Greek to mBERT: Inducing Word-Level Translations from 
Multilingual BERT](https://arxiv.org/abs/2010.08275) (EMNLP2020 WS) 1121 | - [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250) 1122 | - [A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios](https://arxiv.org/abs/2010.12309) 1123 | - [Translation Artifacts in Cross-lingual Transfer Learning](https://arxiv.org/abs/2004.04721) (EMNLP2020) 1124 | - [Identifying Cultural Differences through Multi-Lingual Wikipedia](https://arxiv.org/abs/2004.04938) 1125 | - [A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT](https://arxiv.org/abs/2004.14516) (EMNLP2020) 1126 | - [BERT for Monolingual and Cross-Lingual Reverse Dictionary](https://arxiv.org/abs/2009.14790) (EMNLP2020 Findings) 1127 | - [Bilingual Text Extraction as Reading Comprehension](https://arxiv.org/abs/2004.14517) 1128 | - [Evaluating Multilingual BERT for Estonian](https://arxiv.org/abs/2010.00454) 1129 | - [How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models](https://arxiv.org/abs/2012.15613) (ACL2021) [[github](https://github.com/Adapter-Hub/hgiyt)] 1130 | - [Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training](https://arxiv.org/abs/2109.07306) (EMNLP2021) 1131 | - [BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT?](https://www.aclweb.org/anthology/2021.adaptnlp-1.12/) (EACL2021 WS) 1132 | ## Other than English models 1133 | - [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) (ACL2020) 1134 | - [On the importance of pre-training data volume for compact language models](https://arxiv.org/abs/2010.03813) (EMNLP2020) 1135 | - [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) (LREC2020) 1136 | - [Multilingual is not enough: BERT for Finnish](https://arxiv.org/abs/1912.07076) 1137 | - [BERTje: A Dutch BERT Model](https://arxiv.org/abs/1912.09582) 1138 | - [RobBERT: a Dutch RoBERTa-based Language Model](https://arxiv.org/abs/2001.06286) 1139 | - [Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language](https://arxiv.org/abs/1905.07213) 1140 | - [RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark](https://arxiv.org/abs/2010.15925) (EMNLP2020) 1141 | - [AraBERT: Transformer-based Model for Arabic Language Understanding](https://arxiv.org/abs/2003.00104) 1142 | - [ALUE: Arabic Language Understanding Evaluation](https://aclanthology.org/2021.wanlp-1.18/) (EACL2021 WS) [[website](https://www.alue.org/home)] 1143 | - [ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic](https://aclanthology.org/2021.acl-long.551/) (ACL2021) [[github](https://github.com/UBC-NLP/marbert)] 1144 | - [Pre-Training BERT on Arabic Tweets: Practical Considerations](https://arxiv.org/abs/2102.10684) 1145 | - [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) 1146 | - [Give your Text Representation Models some Love: the Case for Basque](https://arxiv.org/abs/2004.00033) (LREC2020) 1147 | - [ParsBERT: Transformer-based Model for Persian Language Understanding](https://arxiv.org/abs/2005.12515) 1148 | - [Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization](https://arxiv.org/abs/2012.11204) (CSICC2021) 1149 | - [Pre-training Polish Transformer-based Language Models at Scale](https://arxiv.org/abs/2006.04229) 1150 | - [Playing with Words at the National Library of Sweden 
-- Making a Swedish BERT](https://arxiv.org/abs/2007.01658) 1151 | - [KR-BERT: A Small-Scale Korean-Specific Language Model](https://arxiv.org/abs/2008.03979) 1152 | - [KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding](https://arxiv.org/abs/2101.11363) (ICPR2020) 1153 | - [What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers](https://arxiv.org/abs/2109.04650) (EMNLP2021) 1154 | - [KLUE: Korean Language Understanding Evaluation](https://arxiv.org/abs/2105.09680) 1155 | - [WangchanBERTa: Pretraining transformer-based Thai Language Models](https://arxiv.org/abs/2101.09635) 1156 | - [FinEst BERT and CroSloEngual BERT: less is more in multilingual models](https://arxiv.org/abs/2006.07890) (TSD2020) 1157 | - [GREEK-BERT: The Greeks visiting Sesame Street](https://arxiv.org/abs/2008.12014) (SETN2020) 1158 | - [The birth of Romanian BERT](https://arxiv.org/abs/2009.08712) (EMNLP2020 Findings) 1159 | - [German's Next Language Model](https://arxiv.org/abs/2010.10906) (COLING2020 Industry Track) 1160 | - [GottBERT: a pure German Language Model](https://arxiv.org/abs/2012.02110) 1161 | - [EstBERT: A Pretrained Language-Specific BERT for Estonian](https://arxiv.org/abs/2011.04784) 1162 | - [Czert -- Czech BERT-like Model for Language Representation](https://arxiv.org/abs/2103.13031) 1163 | - [RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model](https://arxiv.org/abs/2105.11314) (TSD2021) 1164 | - [Bertinho: Galician BERT Representations](https://arxiv.org/abs/2103.13799) 1165 | - [Pretraining and Fine-Tuning Strategies for Sentiment Analysis of Latvian Tweets](https://arxiv.org/abs/2010.12401) 1166 | - [PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data](https://arxiv.org/abs/2008.09144) 1167 | - [IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages](https://aclanthology.org/2020.findings-emnlp.445/) (EMNLP2020 Findings) 1168 | - [Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages](https://arxiv.org/abs/2011.02323) (NeurIPS2020 WS) 1169 | - [IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP](https://www.aclweb.org/anthology/2020.coling-main.66/) (COLING2020) 1170 | - [IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization](https://arxiv.org/abs/2109.04607) (EMNLP2021) 1171 | - [IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation](https://arxiv.org/abs/2104.08200) (EMNLP2021) 1172 | - [AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages](https://arxiv.org/abs/2109.04715) (EMNLP2021) 1173 | - [KinyaBERT: a Morphology-aware Kinyarwanda Language Model](https://arxiv.org/abs/2203.08459) (ACL2022) 1174 | - [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 1175 | - [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 1176 | - [Revisiting Pre-Trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922) (EMNLP2020 Findings) 1177 | - [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/abs/2106.16038) (ACL2021) [[github](https://github.com/ShannonAI/ChineseBert)] 1178 | - 
[Intrinsic Knowledge Evaluation on Chinese Language Models](https://arxiv.org/abs/2011.14277) 1179 | - [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) [[github](https://github.com/TsinghuaAI/CPM-Generate)] 1180 | - [PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation](https://arxiv.org/abs/2104.12369) 1181 | - [CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model](https://arxiv.org/abs/2003.01355) 1182 | - [CLUE: A Chinese Language Understanding Evaluation Benchmark](https://arxiv.org/abs/2004.05986) 1183 | - [CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark](https://arxiv.org/abs/2112.13610) 1184 | - [FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark](https://arxiv.org/abs/2107.07498) 1185 | - [AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation](https://arxiv.org/abs/2009.11473) 1186 | - [UER: An Open-Source Toolkit for Pre-training Models](https://arxiv.org/abs/1909.05658) (EMNLP2019 Demo) [[github](https://github.com/dbiir/UER-py)]
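Most of the language-specific checkpoints above are drop-in replacements for the original English BERT in standard tooling. A minimal sketch, assuming the Hugging Face `transformers` library and the publicly released `camembert-base` checkpoint (any other model name from this list should work the same way, modulo its own mask token):

```python
# Filling a masked token with a language-specific checkpoint: an illustrative
# sketch, not tied to any single paper above. Assumes `transformers` is installed;
# CamemBERT follows the RoBERTa convention, so its mask token is "<mask>".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="camembert-base")

for prediction in fill_mask("Le camembert est un fromage <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```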
1187 | ## Domain specific 1188 | - [AMMU -- A Survey of Transformer-based Biomedical Pretrained Language Models](https://arxiv.org/abs/2105.00827) 1189 | - [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746) 1190 | - [Self-Alignment Pretraining for Biomedical Entity Representations](https://aclanthology.org/2021.naacl-main.334/) (NAACL2021) [[github](https://github.com/cambridgeltl/sapbert)] 1191 | - [Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking](https://aclanthology.org/2021.acl-short.72/) (ACL2021) [[github](https://github.com/cambridgeltl/sapbert)] 1192 | - [Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets](https://arxiv.org/abs/1906.05474) (ACL2019 WS) 1193 | - [BERT-based Ranking for Biomedical Entity Normalization](https://arxiv.org/abs/1908.03548) 1194 | - [PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/abs/1909.06146) (EMNLP2019) 1195 | - [Pre-trained Language Model for Biomedical Question Answering](https://arxiv.org/abs/1909.08229) 1196 | - [How to Pre-Train Your Model? Comparison of Different Pre-Training Models for Biomedical Question Answering](https://arxiv.org/abs/1911.00712) 1197 | - [On Adversarial Examples for Biomedical NLP Tasks](https://arxiv.org/abs/2004.11157) 1198 | - [An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining](https://arxiv.org/abs/2005.02799) (ACL2020 WS) 1199 | - [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779) [[github](https://microsoft.github.io/BLURB/)] 1200 | - [Improving Biomedical Pretrained Language Models with Knowledge](https://arxiv.org/abs/2104.10344) (BioNLP2021) 1201 | - [BioMegatron: Larger Biomedical Domain Language Model](https://arxiv.org/abs/2010.06060) (EMNLP2020) [[website](https://developer.nvidia.com/blog/building-state-of-the-art-biomedical-and-clinical-nlp-models-with-bio-megatron/)] 1202 | - [Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art](https://www.aclweb.org/anthology/2020.clinicalnlp-1.17/) (EMNLP2020 WS) 1203 | - [A pre-training technique to localize medical BERT and enhance BioBERT](https://arxiv.org/abs/2005.07202) [[github](https://github.com/sy-wada/blue_benchmark_with_transformers)] 1204 | - [exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources](https://aclanthology.org/2020.findings-emnlp.129/) [[github](https://github.com/cgmhaicenter/exBERT)] (EMNLP2020 Findings) 1205 | - [BERTology Meets Biology: Interpreting Attention in Protein Language Models](https://arxiv.org/abs/2006.15222) (ICLR2021) 1206 | - [ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission](https://arxiv.org/abs/1904.05342) 1207 | - [Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks](https://arxiv.org/abs/2007.07562) (AIME2020) 1208 | - [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) (NAACL2019 WS) 1209 | - [UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus](https://arxiv.org/abs/2010.10391) (NAACL2021) 1210 | - [MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning](https://arxiv.org/abs/2004.10220) 1211 | - [A clinical specific BERT developed with huge size of Japanese clinical narrative](https://www.medrxiv.org/content/10.1101/2020.07.07.20148585v1) 1212 | - [Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset](https://arxiv.org/abs/2005.00574) (ACL2020) [[github](https://github.com/xiangyue9607/CliniRC)] 1213 | - [Knowledge-Empowered Representation Learning for Chinese Medical Reading Comprehension: Task, Model and Resources](https://arxiv.org/abs/2008.10327) 1214 | - [Classifying Long Clinical Documents with Pre-trained Transformers](https://arxiv.org/abs/2105.06752) 1215 | - [Detecting Adverse Drug Reactions from Twitter through Domain-Specific Preprocessing and BERT Ensembling](https://arxiv.org/abs/2005.06634) 1216 | - [Progress Notes Classification and Keyword Extraction using Attention-based Deep Learning Models with BERT](https://arxiv.org/abs/1910.05786) 1217 | - [BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining](https://arxiv.org/abs/2006.03685) 1218 | - [Prediction of ICD Codes with Clinical BERT Embeddings and Text Augmentation with Label Balancing using MIMIC-III](https://arxiv.org/abs/2008.10492) 1219 | - [Infusing Disease Knowledge into BERT for 
Health Question Answering, Medical Inference and Disease Name Recognition](https://arxiv.org/abs/2010.03746) (EMNLP2020) 1220 | - [CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT](https://arxiv.org/abs/2004.09167) (EMNLP2020) 1221 | - [Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage](https://arxiv.org/abs/2006.11991) (MLHC2020) 1222 | - [Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction](https://arxiv.org/abs/2005.12833) [[github](https://github.com/ZhiGroup/Med-BERT)] 1223 | - [SciBERT: Pretrained Contextualized Embeddings for Scientific Text](https://aclanthology.org/D19-1371/) (EMNLP2019) [[github](https://github.com/allenai/scibert)] 1224 | - [SPECTER: Document-level Representation Learning using Citation-informed Transformers](https://aclanthology.org/2020.acl-main.207/) (ACL2020) [[github](https://github.com/allenai/specter)] 1225 | - [OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Models](https://arxiv.org/abs/2103.02410) [[github](https://github.com/thudm/oag-bert)] 1226 | - [PatentBERT: Patent Classification with Fine-Tuning a pre-trained BERT Model](https://arxiv.org/abs/1906.02124) 1227 | - [FinBERT: A Pretrained Language Model for Financial Communications](https://arxiv.org/abs/2006.08097) 1228 | - [LEGAL-BERT: The Muppets straight out of Law School](https://arxiv.org/abs/2010.02559) (EMNLP2020 Findings) 1229 | - [Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents](https://arxiv.org/abs/2105.03887) 1230 | - [E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce](https://arxiv.org/abs/2009.02835) 1231 | - [BERT Goes Shopping: Comparing Distributional Models for Product Representations](https://arxiv.org/abs/2012.09807) 1232 | - [NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application](https://arxiv.org/abs/2102.04887) 1233 | - [Code and Named Entity Recognition in StackOverflow](https://arxiv.org/abs/2005.01634) (ACL2020) [[github](https://github.com/lanwuwei/BERTOverflow)] 1234 | - [BERTweet: A pre-trained language model for English Tweets](https://arxiv.org/abs/2005.10200) (EMNLP2020 Demo) 1235 | - [TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis](https://arxiv.org/abs/2010.11091) 1236 | - [A Million Tweets Are Worth a Few Points: Tuning Transformers for Customer Service Tasks](https://arxiv.org/abs/2104.07944) 1237 | - [Analyzing COVID-19 Tweets with Transformer-based Language Models](https://arxiv.org/abs/2104.10259) 1238 | - [Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media](https://arxiv.org/abs/2010.01150) (EMNLP2020 Findings) 1239 | ## Multi-modal 1240 | - [A Survey on Visual Transformer](https://arxiv.org/abs/2012.12556) 1241 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) 1242 | - [Vision-Language Pre-training: Basics, Recent Advances, and Future Trends](https://arxiv.org/abs/2210.09263) 1243 | - [VideoBERT: A Joint Model for Video and Language Representation Learning](https://arxiv.org/abs/1904.01766) (ICCV2019) 1244 | - [ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/abs/1908.02265) (NeurIPS2019) 1245 | - [VisualBERT: A Simple and Performant Baseline for Vision and 
Language](https://arxiv.org/abs/1908.03557) 1246 | - [Selfie: Self-supervised Pretraining for Image Embedding](https://arxiv.org/abs/1906.02940) 1247 | - [ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data](https://arxiv.org/abs/2001.07966) 1248 | - [SimVLM: Simple Visual Language Model Pretraining with Weak Supervision](https://arxiv.org/abs/2108.10904) (ICLR2022) 1249 | - [Align before Fuse: Vision and Language Representation Learning with Momentum Distillation](https://arxiv.org/abs/2107.07651) (NeurIPS2021) [[github](https://github.com/salesforce/ALBEF)] 1250 | - [Contrastive Bidirectional Transformer for Temporal Representation Learning](https://arxiv.org/abs/1906.05743) 1251 | - [M-BERT: Injecting Multimodal Information in the BERT Structure](https://arxiv.org/abs/1908.05787) 1252 | - [Integrating Multimodal Information in Large Pretrained Transformers](https://arxiv.org/abs/1908.05787) 1253 | - [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) (EMNLP2019) 1254 | - [Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions](https://arxiv.org/abs/2010.12831) (NAACL2021) 1255 | - [X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers](https://arxiv.org/abs/2009.11278) (EMNLP2020) 1256 | - [Adaptive Transformers for Learning Multimodal Representations](https://arxiv.org/abs/2005.07486) (ACL2020SRW) [[github](https://github.com/prajjwal1/adaptive_transformer)] 1257 | - [GEM: A General Evaluation Benchmark for Multimodal Tasks](https://arxiv.org/abs/2106.09889) (ACL2021 Findings) [[github](https://github.com/microsoft/GEM)] 1258 | - [Fusion of Detected Objects in Text for Visual Question Answering](https://arxiv.org/abs/1908.05054) (EMNLP2019) 1259 | - [VisualMRC: Machine Reading Comprehension on Document Images](https://arxiv.org/abs/2101.11272) (AAAI2021) 1260 | - [LambdaNetworks: Modeling long-range Interactions without Attention](https://openreview.net/forum?id=xTJEN-ggl1b) [[github](https://github.com/gsarti/lambda-bert)] 1261 | - [BERT representations for Video Question Answering](http://openaccess.thecvf.com/content_WACV_2020/html/Yang_BERT_representations_for_Video_Question_Answering_WACV_2020_paper.html) (WACV2020) 1262 | - [Self-supervised pre-training and contrastive representation learning for multiple-choice video QA](https://arxiv.org/abs/2009.08043) (AAAI2021) 1263 | - [UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning](https://arxiv.org/abs/2012.15409) (ACL2021) 1264 | - [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) [[github](https://github.com/salesforce/BLIP)] 1265 | - [Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training](https://arxiv.org/abs/2201.04026) 1266 | - [Contrastive Visual-Linguistic Pretraining](https://arxiv.org/abs/2007.13135) 1267 | - [What is More Likely to Happen Next? 
Video-and-Language Future Event Prediction](https://arxiv.org/abs/2010.07999) (EMNLP2020) 1268 | - [VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining](https://arxiv.org/abs/2102.10407) 1269 | - [XGPT: Cross-modal Generative Pre-Training for Image Captioning](https://arxiv.org/abs/2003.01473) 1270 | - [Scaling Up Vision-Language Pre-training for Image Captioning](https://arxiv.org/abs/2111.12233) 1271 | - [Injecting Semantic Concepts into End-to-End Image Captioning](https://arxiv.org/abs/2112.05230) (CVPR2022) 1272 | - [Unified Vision-Language Pre-Training for Image Captioning and VQA](https://arxiv.org/abs/1909.11059) (AAAI2020) [[github](https://github.com/LuoweiZhou/VLP)] 1273 | - [TAP: Text-Aware Pre-training for Text-VQA and Text-Caption](https://arxiv.org/abs/2012.04638) 1274 | - [An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA](https://arxiv.org/abs/2109.05014) (AAAI2022) 1275 | - [Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer](https://arxiv.org/abs/2102.10772) 1276 | - [VisualCOMET: Reasoning about the Dynamic Context of a Still Image](https://arxiv.org/abs/2004.10796) (ECCV2020) [[website](http://visualcomet.xyz)] 1277 | - [Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline](https://arxiv.org/abs/1912.02379) 1278 | - [VD-BERT: A Unified Vision and Dialog Transformer with BERT](https://arxiv.org/abs/2004.13278) (EMNLP2020) 1279 | - [VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/abs/1908.08530) (ICLR2020) 1280 | - [Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training](https://arxiv.org/abs/1908.06066) 1281 | - [UNITER: Learning UNiversal Image-TExt Representations](https://arxiv.org/abs/1909.11740) 1282 | - [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) 1283 | - [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/abs/1909.02950) 1284 | - [InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining](https://arxiv.org/abs/2003.13198) 1285 | - [Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs](https://arxiv.org/abs/2011.15124) (TACL2021) 1286 | - [SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels](https://arxiv.org/abs/2103.07829) 1287 | - [LiT : Zero-Shot Transfer with Locked-image Text Tuning](https://arxiv.org/abs/2111.07991) (CVPR2022) 1288 | - [WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training](https://arxiv.org/abs/2103.06561) 1289 | - [Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training](https://arxiv.org/abs/2106.13488) (NeurIPS2021) 1290 | - [E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning](https://arxiv.org/abs/2106.01804) (ACL2021) 1291 | - [UNIMO-2: End-to-End Unified Vision-Language Grounded Learning](https://arxiv.org/abs/2203.09067) (ACL2022) 1292 | - [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857) [[github](https://github.com/microsoft/GLIP)] 1293 | - [VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts](https://arxiv.org/abs/2111.02358) [[github](https://github.com/microsoft/unilm/tree/master/vlmo)] 1294 | - [VinVL: Revisiting Visual Representations in Vision-Language Models](https://arxiv.org/abs/2101.00529) 1295 | - [An 
Empirical Study of Training End-to-End Vision-and-Language Transformers](https://arxiv.org/abs/2111.02387) (CVPR2022) [[github](https://github.com/zdou0830/METER)] 1296 | - [Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling](https://arxiv.org/abs/2111.12085) 1297 | - [UFO: A UniFied TransfOrmer for Vision-Language Representation Learning](https://arxiv.org/abs/2111.10023) 1298 | - [Florence: A New Foundation Model for Computer Vision](https://arxiv.org/abs/2111.11432) 1299 | - [Large-Scale Adversarial Training for Vision-and-Language Representation Learning](https://arxiv.org/abs/2006.06195) (NeurIPS2020) 1300 | - [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198) 1301 | - [OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models](https://arxiv.org/abs/2308.01390) [[github](https://github.com/mlfoundations/open_flamingo)] 1302 | - [Do DALL-E and Flamingo Understand Each Other?](https://arxiv.org/abs/2212.12249) 1303 | - [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981) 1304 | - [Unifying Vision-and-Language Tasks via Text Generation](https://arxiv.org/abs/2102.02779) 1305 | - [Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network](https://arxiv.org/abs/2101.11562) (AAAI2021) 1306 | - [ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/abs/2006.16934) 1307 | - [KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning](https://arxiv.org/abs/2012.07000) 1308 | - [A Closer Look at the Robustness of Vision-and-Language Pre-trained Models](https://arxiv.org/abs/2012.08673) 1309 | - [Self-Supervised learning with cross-modal transformers for emotion recognition](https://arxiv.org/abs/2011.10652) (SLT2020) 1310 | - [Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision](https://arxiv.org/abs/2010.06775) (EMNLP2020) 1311 | - [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315) 1312 | - [Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models](https://arxiv.org/abs/2103.08849) (NAACL2021) 1313 | - [M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training](https://arxiv.org/abs/2006.02635) (CVPR2021) 1314 | - [UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training](https://arxiv.org/abs/2104.00332) 1315 | - [CM3: A Causal Masked Multimodal Model of the Internet](https://arxiv.org/abs/2201.07520) 1316 | - [Retrieval-Augmented Multimodal Language Modeling](https://arxiv.org/abs/2211.12561) 1317 | - [Cycle Text-To-Image GAN with BERT](https://arxiv.org/abs/2003.12137) 1318 | - [Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks](https://arxiv.org/abs/1912.03063) 1319 | - [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165) 1320 | - [VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning](https://arxiv.org/abs/2009.13682) 1321 | - [DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/abs/2008.06884) (ACMMM2020) 1322 | - [A Recurrent Vision-and-Language BERT for Navigation](https://arxiv.org/abs/2011.13922) 1323 | - [BERT Can See Out of the Box: On the Cross-modal 
Transferability of Text Representations](https://arxiv.org/abs/2002.10832)
- [Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning](https://arxiv.org/abs/2104.03135) (CVPR2021)
- [Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers](https://arxiv.org/abs/2109.04448) (EMNLP2021)
- [Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers](https://arxiv.org/abs/2004.00849)
- [IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages](https://arxiv.org/abs/2201.11732)
- [Understanding Advertisements with BERT](https://www.aclweb.org/anthology/2020.acl-main.674/) (ACL2020)
- [BERTERS: Multimodal Representation Learning for Expert Recommendation System with Transformer](https://arxiv.org/abs/2007.07229)
- [FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval](https://arxiv.org/abs/2005.09801) (SIGIR2020)
- [Kaleido-BERT: Vision-Language Pre-training on Fashion Domain](https://arxiv.org/abs/2103.16110) (CVPR2021)
- [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) (KDD2020) [[github](https://github.com/microsoft/unilm/tree/master/layoutlm)]
- [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) (ACL2021)
- [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836)
- [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623)
- [LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding](https://arxiv.org/abs/2104.08405)
- [BROS: A Pre-trained Language Model for Understanding Texts in Document](https://openreview.net/forum?id=punMXQEsPr0)
- [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282)
- [LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/abs/2108.11591) (EMNLP2021)
- [BERT for Large-scale Video Segment Classification with Test-time Augmentation](https://arxiv.org/abs/1912.01127) (ICCV2019WS)
- [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095)
- [lamBERT: Language and Action Learning Using Multimodal BERT](https://arxiv.org/abs/2004.07093)
- [Generative Pretraining from Pixels](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf) [[github](https://github.com/openai/image-gpt)] [[website](https://openai.com/blog/image-gpt/)]
- [Visual Transformers: Token-based Image Representation and Processing for Computer Vision](https://arxiv.org/abs/2006.03677)
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://openreview.net/forum?id=YicbFdNTTy) (ICLR2021)
- [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
- [Zero-Shot Text-to-Image Generation](https://arxiv.org/abs/2102.12092) [[github](https://github.com/openai/DALL-E)] [[website](https://openai.com/blog/dall-e/)]
- [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) [[website](https://openai.com/dall-e-2/)]
- [Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/abs/2205.11487) [[website](https://imagen.research.google/)]
- [Scaling Autoregressive Models for Content-Rich Text-to-Image Generation](https://arxiv.org/abs/2206.10789)
- [Learning Transferable Visual Models From Natural Language Supervision](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf) [[github](https://github.com/openai/CLIP)] [[website](https://openai.com/blog/clip/)]
- [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://arxiv.org/abs/2107.06383)
- [EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling](https://arxiv.org/abs/2109.04699)
- [e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce](https://arxiv.org/abs/2207.00208)
- [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)
- [Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation](https://arxiv.org/abs/2203.06386) (ACL2022)
- [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/abs/2103.17249)
- [Training Vision Transformers for Image Retrieval](https://arxiv.org/abs/2102.05644)
- [LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval](https://arxiv.org/abs/2103.08784) (NAACL2021)
- [Colorization Transformer](https://arxiv.org/abs/2102.04432) (ICLR2021) [[github](https://github.com/google-research/google-research/tree/master/coltran)]
- [A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer](https://arxiv.org/abs/2005.08271) [[website](https://v-iashin.github.io/bmt)]
- [Multimodal Pretraining for Dense Video Captioning](https://arxiv.org/abs/2011.11760) (AACL-IJCNLP2020)
- [Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](https://arxiv.org/abs/2102.06183) (CVPR2021) [[github](https://github.com/jayleicn/ClipBERT)]
- [VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding](https://arxiv.org/abs/2105.09996) (ACL2021 Findings)
- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/abs/2109.14084) (EMNLP2021)
- [BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling](https://arxiv.org/abs/2012.02128)
- [A Generalist Agent](https://arxiv.org/abs/2205.06175) [[website](https://www.deepmind.com/publications/a-generalist-agent)]
- [SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering](https://arxiv.org/abs/1910.11559)
- [An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering](https://arxiv.org/abs/2005.12142)
- [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations](https://arxiv.org/abs/1910.05453)
- [Effectiveness of self-supervised pre-training for speech recognition](https://arxiv.org/abs/1911.03912)
- [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
- [Applying wav2vec2.0 to Speech Recognition in various low-resource languages](https://arxiv.org/abs/2012.12121)
- [Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition](https://arxiv.org/abs/2010.10504)
- [Speech Recognition by Simply Fine-tuning BERT](https://arxiv.org/abs/2102.00291) (ICASSP2021)
- [Understanding Semantics from Speech Through Pre-training](https://arxiv.org/abs/1909.10924)
- [Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks](https://arxiv.org/abs/1910.10387)
- [Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision](https://arxiv.org/abs/2007.04134) (ICML2020 WS)
- [Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining](https://arxiv.org/abs/2010.13826)
- [ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding](https://arxiv.org/abs/2010.12283)
- [End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features](https://arxiv.org/abs/2011.08238)
- [Speech-language Pre-training for End-to-end Spoken Language Understanding](https://arxiv.org/abs/2102.06283)
- [Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding](https://arxiv.org/abs/2005.11640) (Interspeech2020)
- [AudioCLIP: Extending CLIP to Image, Text and Audio](https://arxiv.org/abs/2106.13043)
- [Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation](https://arxiv.org/abs/2005.08575)
- [Unsupervised Cross-lingual Representation Learning for Speech Recognition](https://arxiv.org/abs/2006.13979)
- [Curriculum Pre-training for End-to-End Speech Translation](https://arxiv.org/abs/2004.10093) (ACL2020)
- [MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation](https://arxiv.org/abs/2010.11445)
- [Multilingual Speech Translation with Efficient Finetuning of Pretrained Models](https://arxiv.org/abs/2010.12829) (ACL2021)
- [Multilingual Byte2Speech Text-To-Speech Models Are Few-shot Spoken Language Learners](https://arxiv.org/abs/2103.03541)
- [Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models](https://arxiv.org/abs/1906.07307)
- [To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection](https://arxiv.org/abs/2008.01551) (Interspeech2020)
- [BERT for Joint Multichannel Speech Dereverberation with Spatial-aware Tasks](https://arxiv.org/abs/2010.10892)
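
Many of the image-, video-, and audio-language entries above (CLIP, VideoCLIP, AudioCLIP, and related work) share the same basic training signal: a symmetric contrastive loss over paired embeddings produced by two modality encoders. The sketch below is a minimal, generic PyTorch rendition of that objective, not the implementation of any specific paper listed here; the function name and default temperature are illustrative choices.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative only).
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of two separate encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Random embeddings stand in for real encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```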

## Model compression
- [Compression of Deep Learning Models for Text: A Survey](https://arxiv.org/abs/2008.05221)
- [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)
- [Patient Knowledge Distillation for BERT Model Compression](https://arxiv.org/abs/1908.09355) (EMNLP2019)
- [Small and Practical BERT Models for Sequence Labeling](https://arxiv.org/abs/1909.00100) (EMNLP2019)
- [TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351) [[github](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT)]
- [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) (NeurIPS2019 WS) [[github](https://github.com/huggingface/transformers/tree/master/examples/distillation)]
- [Contrastive Distillation on Intermediate Representations for Language Model Compression](https://arxiv.org/abs/2009.14167) (EMNLP2020)
- [Knowledge Distillation from Internal Representations](https://arxiv.org/abs/1910.03723) (AAAI2020)
- [Reinforced Multi-Teacher Selection for Knowledge Distillation](https://arxiv.org/abs/2012.06048) (AAAI2021)
- [ALP-KD: Attention-Based Layer Projection for Knowledge Distillation](https://arxiv.org/abs/2012.14022) (AAAI2021)
- [Dynamic Knowledge Distillation for Pre-trained Language Models](https://arxiv.org/abs/2109.11295) (EMNLP2021)
- [Distilling Linguistic Context for Language Model Compression](https://arxiv.org/abs/2109.08359) (EMNLP2021)
- [Improving Task-Agnostic BERT Distillation with Layer Mapping Search](https://arxiv.org/abs/2012.06153)
- [PoWER-BERT: Accelerating BERT inference for Classification Tasks](https://arxiv.org/abs/2001.08950)
- [WaLDORf: Wasteless Language-model Distillation On Reading-comprehension](https://arxiv.org/abs/1912.06638)
- [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) (EACL2021)
- [BERT-of-Theseus: Compressing BERT by Progressive Module Replacing](https://arxiv.org/abs/2002.02925) (EMNLP2020)
- [Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning](https://arxiv.org/abs/2002.08307) (ACL2020 SRW)
- [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)
- [Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation](https://arxiv.org/abs/2104.11928)
- [Compressing Large-Scale Transformer-Based Models: A Case Study on BERT](https://arxiv.org/abs/2002.11985)
- [Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers](https://arxiv.org/abs/2002.11794)
- [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962)
- [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) (ACL2020)
- [Distilling Knowledge from Pre-trained Language Models via Text Smoothing](https://arxiv.org/abs/2005.03848)
- [DynaBERT: Dynamic BERT with Adaptive Width and Depth](https://arxiv.org/abs/2004.04037)
- [Reducing Transformer Depth on Demand with Structured Dropout](https://arxiv.org/abs/1909.11556)
- [DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference](https://www.aclweb.org/anthology/2020.acl-main.204/) (ACL2020)
- [BERT Loses Patience: Fast and Robust Inference with Early Exit](https://arxiv.org/abs/2006.04152) [[github](https://github.com/JetRunner/PABEE)] [[github](https://github.com/huggingface/transformers/tree/master/examples/bert-loses-patience)]
- [Accelerating BERT Inference for Sequence Labeling via Early-Exit](https://arxiv.org/abs/2105.13878) (ACL2021)
- [Elbert: Fast Albert with Confidence-Window Based Early Exit](https://arxiv.org/abs/2107.00175)
- [RomeBERT: Robust Training of Multi-Exit BERT](https://arxiv.org/abs/2101.09755)
- [TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference](https://arxiv.org/abs/2105.11618) (NAACL2021)
- [FastBERT: a Self-distilling BERT with Adaptive Inference Time](https://www.aclweb.org/anthology/2020.acl-main.537/) (ACL2020)
- [Distilling Large Language Models into Tiny and Effective Students using pQRNN](https://arxiv.org/abs/2101.08890)
- [Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation](https://arxiv.org/abs/2004.03097)
- [LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression](https://arxiv.org/abs/2004.04124) (COLING2020)
- [Poor Man's BERT: Smaller and Faster Transformer Models](https://arxiv.org/abs/2004.03844)
- [schuBERT: Optimizing Elements of BERT](https://arxiv.org/abs/2005.06628) (ACL2020)
- [BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance](https://arxiv.org/abs/2010.06133) (EMNLP2020) [[github](https://github.com/lxk00/BERT-EMD)]
- [One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers](https://arxiv.org/abs/2106.01023) (ACL2021 Findings)
- [From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression](https://arxiv.org/abs/2112.07198) (AAAI2022)
- [TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER](https://arxiv.org/abs/2004.05686) (ACL2020)
- [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202/) (ACL2020)
- [Robustly Optimized and Distilled Training for Natural Language Understanding](https://arxiv.org/abs/2103.08809)
- [Structured Pruning of Large Language Models](https://arxiv.org/abs/1910.04732)
- [Movement Pruning: Adaptive Sparsity by Fine-Tuning](https://arxiv.org/abs/2005.07683) [[github](https://github.com/huggingface/transformers/tree/master/examples/movement-pruning)]
- [Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning](https://arxiv.org/abs/2009.08065) (EMNLP2020 Findings)
- [Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior](https://www.aclweb.org/anthology/2020.findings-emnlp.64/) (EMNLP2020 Findings)
- [Parameter-Efficient Transfer Learning with Diff Pruning](https://arxiv.org/abs/2012.07463)
- [FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382) (EMNLP2020 WS) [[github](https://github.com/microsoft/fastformers)]
- [AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models](https://arxiv.org/abs/2107.13686) (ACL2021) [[github](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/AutoTinyBERT)]
- [Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains](https://arxiv.org/abs/2106.13474) (ACL2021 Findings)
- [Distilling BERT into Simple Neural Networks with Unlabeled Transfer Data](https://arxiv.org/abs/1910.01769)
- [AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search](https://arxiv.org/abs/2001.04246)
- [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316)
- [Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models](https://arxiv.org/abs/2010.03688)
- [An Approximation Algorithm for Optimal Subarchitecture Extraction](https://arxiv.org/abs/2010.08512) [[github](https://github.com/alexa/bort/)]
- [Structured Pruning of a BERT-based Question Answering Model](https://arxiv.org/abs/1910.06360)
- [DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering](https://arxiv.org/abs/2005.00697) (ACL2020)
- [Distilling Knowledge Learned in BERT for Text Generation](https://arxiv.org/abs/1911.03829) (ACL2020)
- [Distilling the Knowledge of BERT for Sequence-to-Sequence ASR](https://arxiv.org/abs/2008.03822) (Interspeech2020)
- [Pre-trained Summarization Distillation](https://arxiv.org/abs/2010.13002)
- [Understanding BERT Rankers Under Distillation](https://arxiv.org/abs/2007.11088) (ICTIR2020)
- [Simplified TinyBERT: Knowledge Distillation for Document Retrieval](https://arxiv.org/abs/2009.07531)
- [Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT](https://www.aclweb.org/anthology/2020.repl4nlp-1.10/) (ACL2020 WS)
- [TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing](https://arxiv.org/abs/2002.12620) (ACL2020 Demo)
- [TopicBERT for Energy Efficient Document Classification](https://arxiv.org/abs/2010.16407) (EMNLP2020 Findings)
- [MiniVLM: A Smaller and Faster Vision-Language Model](https://arxiv.org/abs/2012.06946)
- [Compressing Visual-linguistic Model via Knowledge Distillation](https://arxiv.org/abs/2104.02096)
- [Playing Lottery Tickets with Vision and Language](https://arxiv.org/abs/2104.11832)
- [Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT](https://arxiv.org/abs/1909.05840)
- [Q8BERT: Quantized 8Bit BERT](https://arxiv.org/abs/1910.06188) (NeurIPS2019 WS)
- [Training with Quantization Noise for Extreme Model Compression](https://arxiv.org/abs/2004.07320) (ICLR2021)
- [Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing](https://arxiv.org/abs/2103.02800)
- [BinaryBERT: Pushing the Limit of BERT Quantization](https://arxiv.org/abs/2012.15701) (ACL2021)
- [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
- [ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques](https://arxiv.org/abs/2103.11367) (AAAI2021)
- [TernaryBERT: Distillation-aware Ultra-low Bit BERT](https://arxiv.org/abs/2009.12812) (EMNLP2020)
- [EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP](https://arxiv.org/abs/2011.14203)
- [Optimizing Inference Performance of Transformers on CPUs](https://arxiv.org/abs/2102.06621)
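
A large fraction of the compression papers above (DistilBERT, TinyBERT, Patient Knowledge Distillation, and others) build on the same soft-target distillation idea: the small student is trained against a temperature-softened teacher distribution in addition to the gold labels, often with extra layer- or attention-matching terms that differ per paper. The PyTorch sketch below shows only this common core; the function name and default hyperparameters are illustrative assumptions, not values taken from any single paper.

```python
# Minimal sketch of soft-target knowledge distillation (illustrative only).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

loss = distillation_loss(torch.randn(4, 3), torch.randn(4, 3),
                         torch.tensor([0, 2, 1, 0]))
```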

## Large language model
- [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) [[github](https://github.com/openai/gpt-2)]
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (NeurIPS2020) [[github](https://github.com/openai/gpt-3)]
- [Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems](https://arxiv.org/abs/2008.06239)
- [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) [[website](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/)]
- [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)
- [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905) [[blog](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)]
- [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556)
- [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311) [[blog](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html)]
- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373) [[github](https://github.com/EleutherAI/pythia)]
- [PolyLM: An Open Source Polyglot Large Language Model](https://arxiv.org/abs/2307.06018)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](https://arxiv.org/abs/2201.11990) [[blog](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)]
- [DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale](https://arxiv.org/abs/2207.00032) [[github](https://github.com/microsoft/DeepSpeed)]
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO++: Extremely Efficient Collective Communication for Giant Model Training](https://arxiv.org/abs/2306.10209) [[blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/)]
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
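
For quick back-of-envelope reasoning about the scaling entries above, two widely cited approximations are the training-FLOPs estimate C ≈ 6·N·D (N parameters, D tokens) and the roughly 20-tokens-per-parameter compute-optimal ratio reported in "Training Compute-Optimal Large Language Models". The helper below applies those rules of thumb; the constants are coarse approximations assumed for illustration, not exact figures from the papers.

```python
# Back-of-envelope compute-optimal estimate (rule-of-thumb constants only).

def chinchilla_estimate(params: float, tokens_per_param: float = 20.0):
    tokens = params * tokens_per_param   # approx. compute-optimal training tokens
    flops = 6.0 * params * tokens        # approx. training FLOPs (C ~ 6*N*D)
    return tokens, flops

tokens, flops = chinchilla_estimate(70e9)   # e.g. a 70B-parameter model
print(f"~{tokens / 1e12:.1f}T tokens, ~{flops:.2e} training FLOPs")
```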

## Reinforcement learning from human feedback
- [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593) [[github](https://github.com/openai/lm-human-preferences)] [[blog](https://openai.com/blog/fine-tuning-gpt-2/)]
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) [[github](https://github.com/openai/following-instructions-human-feedback)] [[blog](https://openai.com/blog/instruction-following/)]
- [WebGPT: Browser-assisted question-answering with human feedback](https://arxiv.org/abs/2112.09332) [[blog](https://openai.com/blog/webgpt/)]
- [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/abs/2209.14375)
- [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862)
- [Training Language Models with Language Feedback](https://arxiv.org/abs/2204.14146) (ACL2022 WS)
- [Self-Instruct: Aligning Language Model with Self Generated Instructions](https://arxiv.org/abs/2212.10560) [[github](https://github.com/yizhongw/self-instruct)]
- [Is ChatGPT a General-Purpose Natural Language Processing Task Solver?](https://arxiv.org/abs/2302.06476)
- [ChatGPT: A Meta-Analysis after 2.5 Months](https://arxiv.org/abs/2302.13795)
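
The reward models used in several of the RLHF papers above (e.g. the instruction-following and helpful/harmless assistant work) are commonly trained with a pairwise preference objective: the scalar reward of the human-preferred response should exceed that of the rejected one. The sketch below shows that loss in isolation; it omits the reward-model backbone and the subsequent policy-optimization stage, and the function name is an illustrative assumption.

```python
# Minimal sketch of a pairwise reward-model (preference) loss (illustrative only).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Both tensors hold scalar rewards for a batch of comparison pairs, shape (batch,)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

loss = preference_loss(torch.randn(4), torch.randn(4))
```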

## Misc.
- [Extracting Training Data from Large Language Models](https://arxiv.org/abs/2012.07805)
- [Generative Language Modeling for Automated Theorem Proving](https://arxiv.org/abs/2009.03393)
- [Do you have the right scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods](https://www.aclweb.org/anthology/2020.acl-main.314/) (ACL2020)
- [jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models](https://arxiv.org/abs/2003.02249) [[github](https://github.com/nyu-mll/jiant/)]
- [Cloze-driven Pretraining of Self-attention Networks](https://arxiv.org/abs/1903.07785)
- [Learning and Evaluating General Linguistic Intelligence](https://arxiv.org/abs/1901.11373)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://arxiv.org/abs/1903.05987) (ACL2019 WS)
- [Learning to Speak and Act in a Fantasy Text Adventure Game](https://www.aclweb.org/anthology/D19-1062/) (EMNLP2019)
- [A Two-Stage Masked LM Method for Term Set Expansion](https://arxiv.org/abs/2005.01063) (ACL2020)
- [Cold-start Active Learning through Self-supervised Language Modeling](https://arxiv.org/abs/2010.09535) (EMNLP2020)
- [Conditional BERT Contextual Augmentation](https://arxiv.org/abs/1812.06705)
- [Data Augmentation using Pre-trained Transformer Models](https://arxiv.org/abs/2003.02245) (AACL-IJCNLP2020) [[github](https://github.com/varinf/TransformersDataAugmentation)]
- [Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks](https://arxiv.org/abs/2010.02394) (COLING2020)
- [GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation](https://arxiv.org/abs/2104.08826)
- [Unsupervised Text Style Transfer with Padded Masked Language Models](https://arxiv.org/abs/2010.01054) (EMNLP2020)
- [Assessing Discourse Relations in Language Generation from Pre-trained Language Models](https://arxiv.org/abs/2004.12506)
- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962) (ICLR2020)
- [Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes](https://arxiv.org/abs/2006.13484)
- [IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization](https://arxiv.org/abs/2005.02178) (AAAI2021)
- [Multi-node Bert-pretraining: Cost-efficient Approach](https://arxiv.org/abs/2008.00177)
- [How to Train BERT with an Academic Budget](https://arxiv.org/abs/2104.07705)
- [Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training](https://arxiv.org/abs/2111.05972)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818) [[github](https://github.com/Tencent/PatrickStar)]
- [1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed](https://arxiv.org/abs/2102.02888)
- [TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models](https://arxiv.org/abs/2102.07988)
- [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Scaling Laws for Autoregressive Generative Modeling](https://arxiv.org/abs/2010.14701)
- [Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)
- [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027) [[website](https://pile.eleuther.ai/)]
- [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499)
- [Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models](https://openreview.net/forum?id=HkgaETNtDB) (ICLR2020)
- [A Mutual Information Maximization Perspective of Language Representation Learning](https://openreview.net/forum?id=Syx79eBKwr) (ICLR2020)
- [Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment](https://arxiv.org/abs/1907.11932) (AAAI2020)
- [Weight Poisoning Attacks on Pre-trained Models](https://arxiv.org/abs/2004.06660) (ACL2020)
- [BERT-ATTACK: Adversarial Attack Against BERT Using BERT](https://arxiv.org/abs/2004.09984) (EMNLP2020)
- [BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks](https://arxiv.org/abs/2106.01452) (ACL2021 Findings)
- [Model Extraction and Adversarial Transferability, Your BERT is Vulnerable!](https://arxiv.org/abs/2103.10013) (NAACL2021)
- [Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT](https://arxiv.org/abs/2003.04985)
- [Robust Encodings: A Framework for Combating Adversarial Typos](https://www.aclweb.org/anthology/2020.acl-main.245/) (ACL2020)
- [On the Robustness of Language Encoders against Grammatical Errors](https://arxiv.org/abs/2005.05683) (ACL2020)
- [Evaluating the Robustness of Neural Language Models to Input Perturbations](https://arxiv.org/abs/2108.12237) (EMNLP2021)
- [Pretrained Transformers Improve Out-of-Distribution Robustness](https://arxiv.org/abs/2004.06100) (ACL2020) [[github](https://github.com/camelop/NLP-Robustness)]
- ["You are grounded!": Latent Name Artifacts in Pre-trained Language Models](https://arxiv.org/abs/2004.03012) (EMNLP2020)
- [The Right Tool for the Job: Matching Model and Instance Complexities](https://arxiv.org/abs/2004.07453) (ACL2020) [[github](https://github.com/allenai/sledgehammer)]
- [Unsupervised Domain Clusters in Pretrained Language Models](https://arxiv.org/abs/2004.02105) (ACL2020)
- [Thieves on Sesame Street! Model Extraction of BERT-based APIs](https://arxiv.org/abs/1910.12366) (ICLR2020)
- [Graph-Bert: Only Attention is Needed for Learning Graph Representations](https://arxiv.org/abs/2001.05140)
- [Graph-Aware Transformer: Is Attention All Graphs Need?](https://arxiv.org/abs/2006.05213)
- [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155) (EMNLP2020 Findings)
- [Unsupervised Translation of Programming Languages](https://arxiv.org/abs/2006.03511)
- [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) (NAACL2021)
- [MathBERT: A Pre-Trained Model for Mathematical Formula Understanding](https://arxiv.org/abs/2105.00377)
- [Investigating Math Word Problems using Pretrained Multilingual Language Models](https://arxiv.org/abs/2105.08928)
- [Measuring and Improving BERT's Mathematical Abilities by Predicting the Order of Reasoning](https://arxiv.org/abs/2106.03921) (ACL2021)
- [Pre-train or Annotate? Domain Adaptation with a Constrained Budget](https://arxiv.org/abs/2109.04711) (EMNLP2021)
- [Item-based Collaborative Filtering with BERT](https://www.aclweb.org/anthology/2020.ecnlp-1.8/) (ACL2020 WS)
- [RecoBERT: A Catalog Language Model for Text-Based Recommendations](https://arxiv.org/abs/2009.13292)
- [Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping](https://arxiv.org/abs/2002.06305)
- [Extending Machine Language Models toward Human-Level Language Understanding](https://arxiv.org/abs/1912.05877)
- [Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data](https://openreview.net/forum?id=GKTvAcb12b) (ACL2020)
- [Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level](https://arxiv.org/abs/2105.06020) (ACL2021 Findings) [[github](https://github.com/ruiqi-zhong/acl2021-instance-level)]
- [Glyce: Glyph-vectors for Chinese Character Representations](https://arxiv.org/abs/1901.10125)
- [Back to the Future -- Sequential Alignment of Text Representations](https://arxiv.org/abs/1909.03464)
- [Improving Cuneiform Language Identification with BERT](https://www.aclweb.org/anthology/papers/W/W19/W19-1402/) (NAACL2019 WS)
- [Generating Derivational Morphology with BERT](https://arxiv.org/abs/2005.00672)
- [BERT has a Moral Compass: Improvements of ethical and moral values of machines](https://arxiv.org/abs/1912.05238)
- [MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training](https://arxiv.org/abs/2106.05630) (ACL2021 Findings)
- [SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction](https://dl.acm.org/citation.cfm?id=3342186) (ACM-BCB2019)
- [ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction](https://arxiv.org/abs/2010.09885)
- [BERT Learns (and Teaches) Chemistry](https://arxiv.org/abs/2007.16012)
- [Prediction of RNA-protein interactions using a nucleotide language model](https://www.biorxiv.org/content/10.1101/2021.04.27.441365v1)
- [Sketch-BERT: Learning Sketch Bidirectional Encoder Representation from Transformers by Self-supervised Learning of Sketch Gestalt](https://arxiv.org/abs/2005.09159) (CVPR2020)
- [The Chess Transformer: Mastering Play using Generative Language Models](https://arxiv.org/abs/2008.04057)
- [The Go Transformer: Natural Language Modeling for Game Play](https://arxiv.org/abs/2007.03500)
- [On the comparability of Pre-trained Language Models](https://arxiv.org/abs/2001.00781)
- [Transformers: State-of-the-art Natural Language Processing](https://arxiv.org/abs/1910.03771)
- [The Cost of Training NLP Models: A Concise Overview](https://arxiv.org/abs/2004.08900)