# Awesome Captioning [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

A curated list of **Visual Captioning** and related areas.

## Table of Contents
* [Survey Papers](#survey-papers)
* [Research Papers](#research-papers)
  * [2022](#2022)
    - [arXiv 2022](#arxiv-2022)
    - [IJCAI 2022](#ijcai-2022)
    - [CVPR 2022](#cvpr-2022)
    - [AAAI 2022](#aaai-2022)
  * [2021](#2021)
    - [NIPS 2021](#nips-2021)
    - [EMNLP 2021](#emnlp-2021)
    - [ICCV 2021](#iccv-2021)
    - [ACM MM 2021](#acmmm-2021)
    - [Interspeech 2021](#interspeech-2021)
    - [ACL 2021](#acl-2021)
    - [IJCAI 2021](#ijcai-2021)
    - [NAACL 2021](#naacl-2021)
    - [CVPR 2021](#cvpr-2021)
    - [ICASSP 2021](#icassp-2021)
    - [AAAI 2021](#aaai-2021)
    - [TPAMI 2021](#tpami-2021)
  * [2020](#2020)
    - [EMNLP 2020](#emnlp-2020)
    - [NIPS 2020](#nips-2020)
    - [ACM MM 2020](#acmmm-2020)
    - [ECCV 2020](#eccv-2020)
    - [IJCAI 2020](#ijcai-2020)
    - [ACL 2020](#acl-2020)
    - [CVPR 2020](#cvpr-2020)
    - [AAAI 2020](#aaai-2020)
  * [2019](#2019)
    - [NIPS 2019](#nips-2019)
    - [ICCV 2019](#iccv-2019)
    - [ACL 2019](#acl-2019)
    - [CVPR 2019](#cvpr-2019)
    - [AAAI 2019](#aaai-2019)
  * [2018](#2018)
    - [NIPS 2018](#nips-2018)
    - [ECCV 2018](#eccv-2018)
    - [ACL 2018](#acl-2018)
    - [CVPR 2018](#cvpr-2018)
  * [2017](#2017)
    - [ICCV 2017](#iccv-2017)
    - [CVPR 2017](#cvpr-2017)
    - [TPAMI 2017](#tpami-2017)
  * [2016](#2016)
    - [CVPR 2016](#cvpr-2016)
    - [TPAMI 2016](#tpami-2016)
  * [2015](#2015)
    - [ICML 2015](#icml-2015)
    - [ICCV 2015](#iccv-2015)
    - [CVPR 2015](#cvpr-2015)
    - [ICLR 2015](#iclr-2015)
* [Dataset](#dataset)
* [Popular Codebase](#popular-codebase)
* [Reference and Acknowledgement](#reference-and-acknowledgement)

## Survey Papers
### 2021
- From Show to Tell: A Survey on Image Captioning.
[[paper]](https://arxiv.org/pdf/2107.06912.pdf)

## Research Papers
### 2022
#### arXiv 2022
##### Image Captioning
- Compact Bidirectional Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2201.01984.pdf) [[code]](https://github.com/YuanEZhou/CBTrans)
- ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer. [[paper]](https://arxiv.org/pdf/2202.07305.pdf)
- I-Tuning: Tuning Language Models with Image for Caption Generation. [[paper]](https://arxiv.org/pdf/2202.06574.pdf)
- CaMEL: Mean Teacher Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2202.10492.pdf) [[code]](https://github.com/aimagelab/camel)
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition. [[paper]](https://arxiv.org/pdf/2203.03195.pdf)
##### Video Captioning
- Discourse Analysis for Evaluating Coherence in Video Paragraph Captions. [[paper]](https://arxiv.org/pdf/2201.06207.pdf)
- Cross-modal Contrastive Distillation for Instructional Activity Anticipation. [[paper]](https://arxiv.org/pdf/2201.06734.pdf)
- End-to-end Generative Pretraining for Multimodal Video Captioning. [[paper]](https://arxiv.org/pdf/2201.08264.pdf)
- Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation. [[paper]](https://arxiv.org/pdf/2202.05728.pdf) [[code]](https://sites.google.com/view/soccercaptioning)
- Dual-Level Decoupled Transformer for Video Captioning. [[paper]](https://arxiv.org/pdf/2205.03039.pdf)
- Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information. [[paper]](https://arxiv.org/pdf/2205.03534.pdf) [[code]](https://e-mmad.github.io/e-mmad.net/index.html)
#### IJCAI 2022
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. [[paper]](https://arxiv.org/pdf/2204.10688.pdf)
#### CVPR 2022
##### Image Captioning
- X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning. [[paper]](https://arxiv.org/pdf/2203.00843.pdf)
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. [[paper]](https://arxiv.org/pdf/2205.04363.pdf)
##### Video Captioning
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics. [[paper]](https://arxiv.org/pdf/2205.06253.pdf) (Workshop)
#### AAAI 2022
##### Image Captioning
- Image Difference Captioning with Pre-training and Contrastive Learning. [[paper]](https://arxiv.org/pdf/2202.04298.pdf)

### 2021
#### NIPS 2021
##### Image Captioning
- Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation. [[paper]](https://openreview.net/pdf?id=nIL7Q-p7-Sh)
- FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark. [[paper]](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/35f4a8d465e6e1edc05f3d8ab658c551-Paper-round2.pdf) [[code]](https://github.com/mlii0117/FFA-IR)
##### Video Captioning
- Multi-modal Dependency Tree for Video Captioning. [[paper]](https://openreview.net/pdf?id=sW40wkwfsZp)

#### EMNLP 2021
##### Image Captioning
- Visual News: Benchmark and Challenges in News Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.542.pdf) [[code]](https://github.com/FuxiaoLiu/VisualNews-Repository)
- R3Net: Relation-embedded Representation Reconstruction Network for Change Captioning. [[paper]](https://arxiv.org/pdf/2110.10328.pdf) [[code]](https://github.com/tuyunbin/R3Net)
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.595.pdf)
- Journalistic Guidelines Aware News Image Captioning. [[paper]](https://arxiv.org/pdf/2109.02865.pdf)
- Understanding Guided Image Captioning Performance across Domains. [[paper]](https://aclanthology.org/2021.conll-1.14.pdf) [[code]](https://github.com/google-research-datasets/T2-Guiding) (CoNLL)
- Language Resource Efficient Learning for Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.162.pdf) (Findings)
- Retrieval, Analogy, and Composition: A Framework for Compositional Generalization in Image Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.171.pdf) (Findings)
- QACE: Asking Questions to Evaluate an Image Caption. [[paper]](https://arxiv.org/pdf/2108.12560.pdf) (Findings)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions. [[paper]](https://arxiv.org/pdf/2109.05281.pdf) (Findings)

#### ICCV 2021
##### Image Captioning
- Auto-Parsing Network for Image Captioning and Visual Question Answering. [[paper]](https://arxiv.org/pdf/2108.10568.pdf)
- Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning. [[paper]](https://arxiv.org/pdf/2108.11912.pdf)
- Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. [[paper]](https://arxiv.org/pdf/2109.05743.pdf)
- Partial Off-Policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Shi_Partial_Off-Policy_Learning_Balance_Accuracy_and_Diversity_for_Human-Oriented_Image_ICCV_2021_paper.pdf)
- Topic Scene Graph Generation by Attention Distillation from Caption. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_Topic_Scene_Graph_Generation_by_Attention_Distillation_From_Caption_ICCV_2021_paper.pdf)
- Understanding and Evaluating Racial Biases in Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/princetonvisualai/imagecaptioning-bias)
- In Defense of Scene Graphs for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Nguyen_In_Defense_of_Scene_Graphs_for_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/Kien085/SG2Caps)
- Viewpoint-Agnostic Change Captioning with Cycle Consistency. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Viewpoint-Agnostic_Change_Captioning_With_Cycle_Consistency_ICCV_2021_paper.pdf)
- Visual-Textual Attentive Semantic Consistency for Medical Report Generation. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_Visual-Textual_Attentive_Semantic_Consistency_for_Medical_Report_Generation_ICCV_2021_paper.pdf)
- Semi-Autoregressive Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021W/CLVL/papers/Zhou_Semi-Autoregressive_Transformer_for_Image_Captioning_ICCVW_2021_paper.pdf) (Workshop)
##### Video Captioning
- End-to-End Dense Video Captioning with Parallel Decoding. [[paper]](https://arxiv.org/pdf/2108.07781.pdf) [[code]](https://github.com/ttengwang/PDVC)
- Motion Guided Region Message Passing for Video Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Motion_Guided_Region_Message_Passing_for_Video_Captioning_ICCV_2021_paper.pdf)

#### ACMMM 2021
##### Image Captioning
- Distributed Attention for Grounded Image Captioning. [[paper]](https://arxiv.org/pdf/2108.01056.pdf)
- Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02366.pdf) [[code]](https://github.com/Unbear430/DGCN-for-image-captioning)
- Group-based Distinctive Image Captioning with Memory Attention. [[paper]](https://arxiv.org/pdf/2108.09151.pdf)
- Direction Relation Transformer for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3474085.3475607)
##### Text Captioning
- Question-controlled Text-aware Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02059.pdf)
##### Video Captioning
- Hybrid Reasoning Network for Video-based Commonsense Captioning. [[paper]](https://arxiv.org/ftp/arxiv/papers/2108/2108.02365.pdf)
- Discriminative Latent Semantic Graph for Video Captioning. [[paper]](https://arxiv.org/pdf/2108.03662.pdf) [[code]](https://github.com/baiyang4/D-LSG-Video-Caption)
- Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention. [[paper]](https://arxiv.org/pdf/2109.02955.pdf)
- CLIP4Caption: CLIP for Video Caption. [[paper]](https://arxiv.org/pdf/2110.06615.pdf)

#### Interspeech 2021
##### Video Captioning
- Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. [[paper]](https://arxiv.org/pdf/2108.02147.pdf)

#### ACL 2021
##### Image Captioning
- Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. [[paper]](https://arxiv.org/pdf/2106.06471.pdf)
- Competence-based Multimodal Curriculum Learning for Medical Report Generation.
- Control Image Captioning Spatially and Temporally. [[paper]](https://aclanthology.org/2021.acl-long.157.pdf)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis. [[paper]](https://arxiv.org/pdf/2106.01444.pdf)
- Enhancing Descriptive Image Captioning with Natural Language Inference.
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning.
[[paper]](https://arxiv.org/pdf/2106.14019.pdf) [[code]](https://github.com/hwanheelee1993/UMIC)
- Cross-modal Memory Networks for Radiology Report Generation.

[comment]: <> (- IgSEG: Image-guided Story Ending Generation.)
[comment]: <> (- Semantic Relation-aware Difference Representation Learning for Change Captioning.)
[comment]: <> (- Contrastive Attention for Automatic Chest X-ray Report Generation. [[paper]](https://arxiv.org/pdf/2106.06965.pdf))
##### Video Captioning
- Hierarchical Context-aware Network for Dense Video Event Captioning.
- Video Paragraph Captioning as a Text Summarization Task.

[comment]: <> (- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning. [[paper]](https://arxiv.org/pdf/2108.02359.pdf))

#### IJCAI 2021
##### Image Captioning
- TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [[paper]](https://arxiv.org/pdf/2106.10936.pdf)

#### NAACL 2021
##### Image Captioning
- Quality Estimation for Image Captions Based on Large-scale Human Evaluations. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.253.pdf)
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.416.pdf)

[comment]: <> (- Validity-Based Sampling and Smoothing Methods for Multiple Reference Image Captioning. [[paper]](https://www.aclweb.org/anthology/2021.maiworkshop-1.6.pdf))
[comment]: <> (- Leveraging Partial Dependency Trees to Control Image Captions. [[paper]](https://www.aclweb.org/anthology/2021.alvr-1.3.pdf))
[comment]: <> (- Coherent and Concise Radiology Report Generation via Context Specific Image Representations and Orthogonal Sentence States. [[paper]](https://www.aclweb.org/anthology/2021.naacl-industry.31.pdf))
[comment]: <> (- Multi-Modal Image Captioning for the Visually Impaired. [[paper]](https://www.aclweb.org/anthology/2021.naacl-srw.8.pdf))
##### Video Captioning
- DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.193.pdf)

#### CVPR 2021
##### Image Captioning
- Connecting What to Say With Where to Look by Modeling Human Attention Traces. [[paper]](https://arxiv.org/pdf/2105.05964.pdf) [[code]](https://github.com/facebookresearch/connect-caption-and-trace)
- Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. [[paper]](https://arxiv.org/pdf/2103.05121.pdf)
- Image Change Captioning by Learning From an Auxiliary Task. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. [[paper]](https://arxiv.org/pdf/2012.02206.pdf) [[code]](https://github.com/daveredrum/Scan2Cap)
- FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_FAIEr_Fidelity_and_Adequacy_Ensured_Image_Caption_Evaluation_CVPR_2021_paper.pdf)
- RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.pdf)
- Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles. [[paper]](https://arxiv.org/pdf/2103.12204.pdf)
##### Text Captioning
- Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Improving_OCR-Based_Image_Captioning_by_Incorporating_Geometrical_Relationship_CVPR_2021_paper.pdf)
- TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. [[paper]](https://arxiv.org/pdf/2012.04638.pdf)
- Towards Accurate Text-Based Image Captioning With Content Diversity Exploration. [[paper]](https://arxiv.org/pdf/2105.03236.pdf)
##### Video Captioning
- Open-Book Video Captioning With Retrieve-Copy-Generate Network. [[paper]](https://arxiv.org/pdf/2103.05284.pdf)
- Towards Diverse Paragraph Captioning for Untrimmed Videos. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Song_Towards_Diverse_Paragraph_Captioning_for_Untrimmed_Videos_CVPR_2021_paper.pdf)
- Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Towards_Bridging_Event_Captioner_and_Sentence_Localizer_for_Weakly_Supervised_CVPR_2021_paper.pdf)

#### ICASSP 2021
##### Image Captioning
- Cascade Attention Fusion for Fine-grained Image Captioning based on Multi-layer LSTM. [[paper]](https://ieeexplore.ieee.org/document/9413691)
- Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning. [[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9414335)

[comment]: <> (##### Audio Captioning)

[comment]: <> (- Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning. [[paper]](https://arxiv.org/pdf/2102.11457.pdf))

#### AAAI 2021
##### Image Captioning
- Partially Non-Autoregressive Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16219) [[code]](https://github.com/feizc/PNAIC/tree/master)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. [[paper]](https://arxiv.org/pdf/2012.07061.pdf)
- Object Relation Attention for Image Paragraph Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16423)
- Dual-Level Collaborative Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2101.06462.pdf) [[code]](https://github.com/luo3300612/image-captioning-DLCT)
- Memory-Augmented Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16220)
- Image Captioning with Context-Aware Auxiliary Guidance. [[paper]](https://arxiv.org/pdf/2012.05545.pdf)
- Consensus Graph Representation Learning for Better Grounded Image Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-3680.ZhangW.pdf)
- FixMyPose: Pose Correctional Captioning and Retrieval. [[paper]](https://arxiv.org/pdf/2104.01703.pdf) [[code]](https://github.com/hyounghk/FixMyPose)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. [[paper]](https://arxiv.org/pdf/2009.13682)
##### Video Captioning
- Non-Autoregressive Coarse-to-Fine Video Captioning. [[paper]](https://arxiv.org/pdf/1911.12018.pdf) [[code]](https://github.com/yangbang18/Non-Autoregressive-Video-Captioning)
- Semantic Grouping Network for Video Captioning. [[paper]](https://arxiv.org/pdf/2102.00831.pdf) [[code]](https://github.com/hobincar/SGN)
- Augmented Partial Mutual Learning with Frame Masking for Video Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-9714.LinK.pdf)

#### TPAMI 2021
##### Video Captioning
- Saying the Unseen: Video Descriptions via Dialog Agents.
[[paper]](https://arxiv.org/pdf/2106.14069.pdf)

### 2020
#### EMNLP 2020
##### Image Captioning
- CapWAP: Captioning with a Purpose. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.705/) [[code]](https://github.com/google-research/language/tree/master/language/capwap)
- Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.443/) [[code]](https://github.com/googleresearch-datasets/widget-caption)
- Visually Grounded Continual Learning of Compositional Phrases. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.158/)
- Pragmatic Issue-Sensitive Image Captioning. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.173/)
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.411/)
- Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.377/)
- ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.50/)
##### Video Captioning
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. [[paper]](https://arxiv.org/abs/2003.05162)

#### NIPS 2020
##### Image Captioning
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/c2964caac096f26db222cb325aa267cb-Paper.pdf)
- Diverse Image Captioning with Context-Object Split Latent Spaces. [[paper]](https://papers.nips.cc/paper/2020/file/24bea84d52e6a1f8025e313c2ffff50a-Paper.pdf)
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/13fe9d84310e77f13a6d184dbf1232f3-Paper.pdf)

#### ACMMM 2020
##### Image Captioning
- Structural Semantic Adversarial Active Learning for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413885)
- Iterative Back Modification for Faster Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413901)
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414004)
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413859)
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413877)
- ICECAP: Information Concentrated Entity-aware Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413576)
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414009)
##### Text Captioning
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413753)
##### Video Captioning
- Controllable Video Captioning with an Exemplar Sentence. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413908)
- Poet: Product-oriented Video Captioner for E-commerce. [[paper]](https://arxiv.org/abs/2008.06880) [[code]](https://github.com/shengyuzhang/Poet)
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [[paper]](https://dl.acm.org/doi/10.1145/3394171.3413498)
- Relational Graph Learning for Grounded Video Description Generation. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413746)

#### ECCV 2020
##### Image Captioning
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets. [[paper]](https://arxiv.org/pdf/2007.06877.pdf)
- Towards Unique and Informative Captioning of Images. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123520613.pdf)
- Learning Visual Representations with Caption Annotations. [[paper]](https://arxiv.org/pdf/2008.01392.pdf)
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. [[paper]](https://arxiv.org/pdf/2008.02693.pdf) [[code]](https://github.com/xuewyang/Fashion_Captioning)
- Length Controllable Image Captioning. [[paper]](https://arxiv.org/pdf/2007.09580.pdf) [[code]](https://github.com/bearcatt/LaBERT)
- Comprehensive Image Captioning via Scene Graph Decomposition. [[paper]](https://arxiv.org/pdf/2007.11731.pdf)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590562.pdf)
- Captioning Images Taken by People Who Are Blind. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620409.pdf)
- Learning to Generate Grounded Visual Captions without Localization Supervision. [[paper]](https://arxiv.org/pdf/1906.00283.pdf) [[code]](https://github.com/chihyaoma/cyclical-visual-captioning)
- Describing Textures using Natural Language. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/html/384_ECCV_2020_paper.php)
- Connecting Vision and Language with Localized Narratives.
[[paper]](https://arxiv.org/pdf/1912.03098.pdf) [[code]](https://github.com/google/localized-narratives) 287 | ##### Text Captioning 288 | - TextCaps: a Dataset for Image Captioning with Reading Comprehension. [[paper]](https://arxiv.org/pdf/2003.12462.pdf) [[code]](https://github.com/facebookresearch/mmf/tree/master/projects/m4c_captioner) 289 | 290 | ##### Video Captioning 291 | - Character Grounding and Re-Identification in Story of Videos and Text Descriptions. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123500528.pdf) [[code]](https://github.com/yj-yu/CiSIN/) 292 | - SODA: Story Oriented Dense Video Captioning Evaluation Framework. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123510511.pdf) [[code]](https://github.com/fujiso/SODA) 293 | - In-Home Daily-Life Captioning Using Radio Signals. [[paper]](https://arxiv.org/pdf/2008.10966.pdf) 294 | - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. [[paper]](https://arxiv.org/pdf/2001.09099.pdf) [[code]](https://github.com/jayleicn/TVCaption) 295 | - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123490324.pdf) 296 | - Identity-Aware Multi-Sentence Video Description. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660358.pdf) 297 | 298 | #### IJCAI 2020 299 | ##### Image Captioning 300 | - Human Consensus-Oriented Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0092.pdf) 301 | - Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [[paper]](https://www.ijcai.org/Proceedings/2020/0107.pdf) 302 | - Recurrent Relational Memory Network for Unsupervised Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0128.pdf) 303 | ##### Video Captioning 304 | - Learning to Discretely Compose Reasoning Module Networks for Video Captioning. 
[[paper]](https://arxiv.org/abs/2007.09049) [[code]](https://github.com/tgc1997/RMN) 305 | - SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [[paper]](https://www.ijcai.org/Proceedings/2020/88) 306 | - Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [[paper]](https://www.ijcai.org/Proceedings/2020/131) 307 | 308 | #### ACL 2020 309 | ##### Image Captioning 310 | - Clue: Cross-modal Coherence Modeling for Caption Generation. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.583.pdf) 311 | - Improving Image Captioning Evaluation by Considering Inter References Variance. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.93.pdf) [[code]](https://github.com/ck0123/improved-bertscore-for-image-captioning-evaluation) 312 | - Improving Image Captioning with Better Use of Caption. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) [[code]](https://github.com/Gitsamshi/WeakVRD-Captioning) 313 | ##### Video Captioning 314 | - MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.233.pdf) [[code]](https://github.com/jayleicn/recurrent-transformer) 315 | 316 | [comment]: <> (#### ICME 2020) 317 | 318 | [comment]: <> (##### Image Captioning) 319 | 320 | [comment]: <> (- [Fooled by Imagination: Adversarial Attack to Image Captioning Via Perturbation in Complex Domain](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102842) - Shaofeng Zhang et al, **ICME 2020**. ) 321 | 322 | [comment]: <> (- [Modeling Local and Global Contexts for Image Captioning](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102935) - Peng Yao et al, **ICME 2020**. 
) 323 | 324 | [comment]: <> (##### Video Captioning) 325 | 326 | [comment]: <> (- [Video Captioning With Temporal And Region Graph Convolution Network](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102967) - Xinlong Xiao et al, **ICME 2020**. ) 327 | 328 | #### CVPR 2020 329 | ##### Image Captioning 330 | - Context-Aware Group Captioning via Self-Attention and Contrastive Features. [[paper]](https://arxiv.org/abs/2004.03708) [[code]](https://lizw14.github.io/project/groupcap) 331 | - Show, Edit and Tell: A Framework for Editing Image Captions. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Sammani_Show_Edit_and_Tell_A_Framework_for_Editing_Image_Captions_CVPR_2020_paper.pdf) [[code]](https://github.com/fawazsammani/show-edit-tell) 332 | - Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_Say_As_You_Wish_Fine-Grained_Control_of_Image_Caption_Generation_CVPR_2020_paper.pdf) [[code]](https://github.com/cshizhe/asg2cap) 333 | - Normalized and Geometry-Aware Self-Attention Network for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Guo_Normalized_and_Geometry-Aware_Self-Attention_Network_for_Image_Captioning_CVPR_2020_paper.pdf) 334 | - Meshed-Memory Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Cornia_Meshed-Memory_Transformer_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/aimagelab/meshed-memory-transformer) 335 | - X-Linear Attention Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_X-Linear_Attention_Networks_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/JDAI-CV/image-captioning) 336 | - Transform and Tell: Entity-Aware News Image Captioning. 
[[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Tran_Transform_and_Tell_Entity-Aware_News_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/alasdairtran/transform-and-tell) 337 | - More Grounded Image Captioning by Distilling Image-Text Matching Model. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhou_More_Grounded_Image_Captioning_by_Distilling_Image-Text_Matching_Model_CVPR_2020_paper.pdf) [[code]](https://github.com/YuanEZhou/Grounded-Image-Captioning) 338 | - Better Captioning With Sequence-Level Exploration. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Better_Captioning_With_Sequence-Level_Exploration_CVPR_2020_paper.html) 339 | 340 | [comment]: <> (- Alleviating Noisy Data in Image Captioning with Cooperative Distillation. [[paper]](https://arxiv.org/pdf/2012.11691.pdf) - Pierre Dognin et al, **CVPRW 2020**. ) 341 | 342 | ##### Video Captioning 343 | - Object Relational Graph With Teacher-Recommended Learning for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Object_Relational_Graph_With_Teacher-Recommended_Learning_for_Video_Captioning_CVPR_2020_paper.pdf) 344 | - Spatio-Temporal Graph for Video Captioning With Knowledge Distillation. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_Spatio-Temporal_Graph_for_Video_Captioning_With_Knowledge_Distillation_CVPR_2020_paper.pdf) [[code]](https://github.com/StanfordVL/STGraph) 345 | - Better Captioning With Sequence-Level Exploration. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Better_Captioning_With_Sequence-Level_Exploration_CVPR_2020_paper.html) 346 | - Syntax-Aware Action Targeting for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Zheng_Syntax-Aware_Action_Targeting_for_Video_Captioning_CVPR_2020_paper.html) [[code]](https://github.com/SydCaption/SAAT) 347 | - Screencast Tutorial Video Understanding. 
[[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Screencast_Tutorial_Video_Understanding_CVPR_2020_paper.pdf)

#### AAAI 2020
##### Image Captioning
- Unified Vision-Language Pre-Training for Image Captioning and VQA. [[paper]](https://arxiv.org/abs/1909.11059) [[code]](https://github.com/LuoweiZhou/VLP)
- Reinforcing an Image Caption Generator using Off-line Human Feedback. [[paper]](https://arxiv.org/abs/1911.09753)
- Memorizing Style Knowledge for Image Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6998)
- Joint Commonsense and Relation Reasoning for Image and Video Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6731)
- Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption. [[paper]](https://weizhangltt.github.io/paper/zhang-aaai20.pdf)
- Show, Recall, and Tell: Image Captioning with Recall Mechanism. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/6898)
- Interactive Dual Generative Adversarial Networks for Image Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6826)
- Feature Deformation Meta-Networks in Image Captioning of Novel Objects. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6620)
##### Video Captioning
- An Efficient Framework for Dense Video Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6881)

### 2019

#### NIPS 2019
##### Image Captioning
- Adaptively Aligned Image Captioning via Adaptive Attention Time. [[paper]](http://papers.nips.cc/paper/by-source-2019-4799) [[code]](https://github.com/husthuaan/AAT)
- Image Captioning: Transforming Objects into Words.
[[paper]](http://papers.nips.cc/paper/by-source-2019-5963) [[code]](https://github.com/yahoo/object_relation_transformer)
- Variational Structured Semantic Inference for Diverse Image Captioning. [[paper]](http://papers.nips.cc/paper/by-source-2019-1113)

#### ICCV 2019
##### Image Captioning
- Robust Change Captioning. [[paper]](https://arxiv.org/abs/1901.02527)
- Attention on Attention for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Huang_Attention_on_Attention_for_Image_Captioning_ICCV_2019_paper.pdf)
- Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ge_Exploring_Overall_Contextual_Information_for_Image_Captioning_in_Human-Like_Cognitive_ICCV_2019_paper.pdf)
- Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Datta_Align2Ground_Weakly_Supervised_Phrase_Grounding_Guided_by_Image-Caption_Alignment_ICCV_2019_paper.pdf)
- Hierarchy Parsing for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yao_Hierarchy_Parsing_for_Image_Captioning_ICCV_2019_paper.pdf)
- Generating Diverse and Descriptive Image Captions Using Visual Paraphrases. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_Generating_Diverse_and_Descriptive_Image_Captions_Using_Visual_Paraphrases_ICCV_2019_paper.pdf)
- Learning to Collocate Neural Modules for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yang_Learning_to_Collocate_Neural_Modules_for_Image_Captioning_ICCV_2019_paper.pdf)
- Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Aneja_Sequential_Latent_Spaces_for_Modeling_the_Intention_During_Diverse_Image_ICCV_2019_paper.pdf)
- Towards Unsupervised Image Captioning With Shared Multimodal Embeddings. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Laina_Towards_Unsupervised_Image_Captioning_With_Shared_Multimodal_Embeddings_ICCV_2019_paper.pdf)
- Human Attention in Image Captioning: Dataset and Analysis. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/He_Human_Attention_in_Image_Captioning_Dataset_and_Analysis_ICCV_2019_paper.pdf)
- Reflective Decoding Network for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ke_Reflective_Decoding_Network_for_Image_Captioning_ICCV_2019_paper.pdf)
- Joint Optimization for Cooperative Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Vered_Joint_Optimization_for_Cooperative_Image_Captioning_ICCV_2019_paper.pdf)
- Entangled Transformer for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf)
- nocaps: novel object captioning at scale. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Agrawal_nocaps_novel_object_captioning_at_scale_ICCV_2019_paper.pdf)
- Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ye_Cap2Det_Learning_to_Amplify_Weak_Caption_Supervision_for_Object_Detection_ICCV_2019_paper.pdf)
- Unpaired Image Captioning via Scene Graph Alignments. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Gu_Unpaired_Image_Captioning_via_Scene_Graph_Alignments_ICCV_2019_paper.pdf)
- Learning to Caption Images Through a Lifetime by Asking Questions.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Shen_Learning_to_Caption_Images_Through_a_Lifetime_by_Asking_Questions_ICCV_2019_paper.pdf)

##### Video Captioning
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf)
- Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Controllable_Video_Captioning_With_POS_Sequence_Guidance_Based_on_Gated_ICCV_2019_paper.pdf)
- Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Hou_Joint_Syntax_Representation_Learning_and_Visual_Cue_Translation_for_Video_ICCV_2019_paper.pdf)
- Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Rahman_Watch_Listen_and_Tell_Multi-Modal_Weakly_Supervised_Dense_Event_Captioning_ICCV_2019_paper.pdf)

#### ACL 2019
##### Image Captioning
- Informative Image Captioning with External Sources of Information. [[paper]](https://www.aclweb.org/anthology/P19-1650.pdf)
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning. [[paper]](https://www.aclweb.org/anthology/P19-1652.pdf)
- Generating Question Relevant Captions to Aid Visual Question Answering. [[paper]](https://www.aclweb.org/anthology/P19-1348.pdf)
##### Video Captioning
- Dense Procedure Captioning in Narrated Instructional Videos. [[paper]](https://www.aclweb.org/anthology/P19-1641.pdf)

#### CVPR 2019
##### Image Captioning
- Auto-Encoding Scene Graphs for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
- Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Deshpande_Fast_Diverse_and_Accurate_Image_Captioning_Guided_by_Part-Of-Speech_CVPR_2019_paper.pdf)
- Unsupervised Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Feng_Unsupervised_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
- Describing like Humans: On Diversity in Image Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Describing_Like_Humans_On_Diversity_in_Image_Captioning_CVPR_2019_paper.pdf)
- MSCap: Multi-Style Image Captioning With Unpaired Stylized Text. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_MSCap_Multi-Style_Image_Captioning_With_Unpaired_Stylized_Text_CVPR_2019_paper.pdf)
- CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_CapSal_Leveraging_Captioning_to_Boost_Semantics_for_Salient_Object_Detection_CVPR_2019_paper.pdf) [[code]](https://github.com/zhangludl/code-and-dataset-for-CapSal)
- Context and Attribute Grounded Dense Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yin_Context_and_Attribute_Grounded_Dense_Captioning_CVPR_2019_paper.pdf)
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-Based_Captioning_CVPR_2019_paper.pdf)
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Cornia_Show_Control_and_Tell_A_Framework_for_Generating_Controllable_and_CVPR_2019_paper.pdf)
- Self-Critical N-step Training for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Self-Critical_N-Step_Training_for_Image_Captioning_CVPR_2019_paper.pdf)
- Look Back and Predict Forward in Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.pdf)
- Intention Oriented Image Captions with Guiding Objects.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Intention_Oriented_Image_Captions_With_Guiding_Objects_CVPR_2019_paper.pdf)
- Adversarial Semantic Alignment for Improved Image Captions. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Dognin_Adversarial_Semantic_Alignment_for_Improved_Image_Captions_CVPR_2019_paper.pdf)
- Good News, Everyone! Context driven entity-aware captioning for news images. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.pdf) [[code]](https://github.com/furkanbiten/GoodNews)
- Pointing Novel Objects in Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Pointing_Novel_Objects_in_Image_Captioning_CVPR_2019_paper.pdf)
- Engaging Image Captioning via Personality. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Shuster_Engaging_Image_Captioning_via_Personality_CVPR_2019_paper.pdf)
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Xu_Exact_Adversarial_Attack_to_Image_Captioning_via_Structured_Output_Learning_CVPR_2019_paper.pdf)
##### Video Captioning
- Streamlined Dense Video Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Mun_Streamlined_Dense_Video_Captioning_CVPR_2019_paper.pdf)
- Grounded Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhou_Grounded_Video_Description_CVPR_2019_paper.pdf)
- Adversarial Inference for Multi-Sentence Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Adversarial_Inference_for_Multi-Sentence_Video_Description_CVPR_2019_paper.pdf)
- Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Object-Aware_Aggregation_With_Bidirectional_Temporal_Graph_for_Video_Captioning_CVPR_2019_paper.pdf)
- Memory-Attended Recurrent Network for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Pei_Memory-Attended_Recurrent_Network_for_Video_Captioning_CVPR_2019_paper.pdf)
- Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf)

#### AAAI 2019
##### Image Captioning
- Improving Image Captioning with Conditional Generative Adversarial Nets. [[paper]](https://arxiv.org/pdf/1805.07112.pdf)
- Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4916)
- Meta Learning for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4883)
- Deliberate Residual based Attention Network for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4845/4718)
- Hierarchical Attention Network for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4924)
- Learning Object Context for Dense Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4886)
##### Video Captioning
- Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning. [[paper]](https://arxiv.org/pdf/1811.02765.pdf) [[code]](https://github.com/eric-xw/Zero-Shot-Video-Captioning)
- Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning. [[paper]](https://arxiv.org/pdf/1905.01077v1.pdf)
- Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/4839)
- Motion Guided Spatial Attention for Video Captioning. [[paper]](http://yugangjiang.info/publication/19AAAI-vidcaptioning.pdf)

### 2018

#### NIPS 2018
##### Image Captioning
- A Neural Compositional Paradigm for Image Captioning.
[[paper]](https://arxiv.org/pdf/1810.09630.pdf) [[code]](https://github.com/doubledaibo/compcaption_neurips2018)
##### Video Captioning
- Weakly Supervised Dense Event Captioning in Videos. [[paper]](https://papers.nips.cc/paper/2018/file/49af6c4e558a7569d80eee2e035e2bd7-Paper.pdf) [[code]](https://github.com/XgDuan/WSDEC)

#### ECCV 2018
##### Image Captioning
- Unpaired Image Captioning by Language Pivoting. [[paper]](https://arxiv.org/pdf/1803.05526.pdf) [[code]](https://github.com/gujiuxiang/unpaired_image_captioning)
- Exploring Visual Relationship for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ting_Yao_Exploring_Visual_Relationship_ECCV_2018_paper.pdf)
- Recurrent Fusion Network for Image Captioning. [[paper]](https://arxiv.org/pdf/1807.09986.pdf) [[code]](https://github.com/cswhjiang/Recurrent_Fusion_Network)
- Boosted Attention: Leveraging Human Attention for Image Captioning. [[paper]](https://arxiv.org/pdf/1904.00767.pdf)
- Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data. [[paper]](https://arxiv.org/pdf/1803.08314.pdf)
- "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. [[paper]](https://arxiv.org/pdf/1807.03871.pdf)
#### ACL 2018
##### Image Captioning
- Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. [[paper]](https://arxiv.org/pdf/1712.02051.pdf)
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. [[paper]](https://aclanthology.org/P18-1238.pdf) [[code]](https://github.com/google-research-datasets/conceptual-captions)

#### IJCAI 2018
##### Image Captioning
- Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning.
[[paper]](https://www.ijcai.org/proceedings/2018/0592.pdf)
#### CVPR 2018
##### Image Captioning
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [[paper]](http://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html) [[code]](https://github.com/peteanderson80/bottom-up-attention)
- Neural Baby Talk. [[paper]](https://arxiv.org/pdf/1803.09845.pdf)
- GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints. [[paper]](https://openaccess.thecvf.com/content_cvpr_2018/papers/Chen_GroupCap_Group-Based_Image_CVPR_2018_paper.pdf)
##### Video Captioning
- Reconstruction Network for Video Captioning. [[paper]](https://arxiv.org/pdf/1803.11438.pdf) [[code]](https://github.com/hobincar/reconstruction-network-for-video-captioning)

### 2017
#### ICCV 2017
##### Image Captioning
- Boosting Image Captioning with Attributes. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Yao_Boosting_Image_Captioning_ICCV_2017_paper.pdf)
- Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Chen_Show_Adapt_and_ICCV_2017_paper.pdf) [[code]](https://github.com/tsenghungchen/show-adapt-and-tell)
#### CVPR 2017
##### Image Captioning
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.pdf) [[code]](https://github.com/zjuchenlong/sca-cnn.cvpr17)
- When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [[paper]](https://arxiv.org/pdf/1612.01887.pdf) [[code]](https://github.com/jiasenlu/AdaptiveAttention)
- Self-critical Sequence Training for Image Captioning.
[[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.pdf)
- Semantic Compositional Networks for Visual Captioning. [[paper]](https://arxiv.org/pdf/1611.08002.pdf) [[code]](https://github.com/zhegan27/Semantic_Compositional_Nets)
- StyleNet: Generating Attractive Visual Captions with Styles. [[paper]](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/Generating-Attractive-Visual-Captions-with-Styles.pdf) [[code]](https://github.com/kacky24/stylenet)
#### TPAMI 2017
##### Image Captioning
- BreakingNews: Article Annotation by Image and Text Processing. [[paper]](https://arxiv.org/pdf/1603.07141.pdf)

### 2016
#### ECCV 2016
- SPICE: Semantic Propositional Image Caption Evaluation. [[paper]](https://arxiv.org/pdf/1607.08822.pdf) [[code]](https://github.com/peteanderson80/SPICE)
- Generating Visual Explanations. [[paper]](https://arxiv.org/pdf/1603.08507.pdf) [[code]](https://github.com/LisaAnne/ECCV2016)
#### CVPR 2016
##### Image Captioning
- Image Captioning with Semantic Attention. [[paper]](https://openaccess.thecvf.com/content_cvpr_2016/papers/You_Image_Captioning_With_CVPR_2016_paper.pdf) [[code]](https://github.com/chapternewscu/image-captioning-with-semantic-attention)
- Learning Deep Representations of Fine-grained Visual Descriptions. [[paper]](https://arxiv.org/pdf/1605.05395.pdf) [[code]](https://github.com/reedscot/cvpr2016)
- Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. [[paper]](https://arxiv.org/pdf/1511.05284.pdf) [[code]](https://github.com/LisaAnne/DCC)
#### TPAMI 2016
##### Image Captioning
- Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts.
[[paper]](https://ieeexplore.ieee.org/abstract/document/7792748) [[code]](https://github.com/fukun07/neural-image-captioning)

### 2015
#### ICML 2015
##### Image Captioning
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [[paper]](https://arxiv.org/pdf/1502.03044.pdf)

#### ICCV 2015
##### Image Captioning
- Guiding Long-Short Term Memory for Image Caption Generation. [[paper]](https://arxiv.org/pdf/1509.04942.pdf)

#### CVPR 2015
##### Image Captioning
- Show and Tell: A Neural Image Caption Generator. [[paper]](https://arxiv.org/pdf/1411.4555.pdf)
- Deep Visual-Semantic Alignments for Generating Image Descriptions. [[paper]](https://arxiv.org/pdf/1412.2306.pdf) [[code]](https://github.com/karpathy/neuraltalk2)
- CIDEr: Consensus-based Image Description Evaluation. [[paper]](https://arxiv.org/pdf/1411.5726.pdf) [[code]](https://github.com/vrama91/cider)
#### ICLR 2015
##### Image Captioning
- Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). [[paper]](https://arxiv.org/pdf/1412.6632.pdf)

## Dataset
- MSCOCO
- Flickr30K
- Flickr8K
- VizWiz

## Popular Codebase
- [ruotianluo/ImageCaptioning.pytorch](https://github.com/ruotianluo/ImageCaptioning.pytorch)

## Reference and Acknowledgement
- [awesome-image-captioning](https://github.com/zhjohnchan/awesome-image-captioning) from [Zhihong Chen](https://github.com/zhjohnchan)

We really appreciate their contributions to this area.