└── README.md
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome Captioning
2 |
3 |
4 |
5 |
6 |
7 | A curated list of papers and resources on **Visual Captioning** and related areas.
8 |
9 |
10 |
11 | ## Table of Contents
12 | * [Survey Papers](#survey-papers)
13 | * [Research Papers](#research-papers)
14 | * [2022](#2022)
15 | - [arxiv 2022](#arxiv-2022)
16 | - [AAAI 2022](#AAAI-2022)
17 | * [2021](#2021)
18 | - [NIPS 2021](#NIPS-2021)
19 | - [EMNLP 2021](#EMNLP-2021)
20 | - [ICCV 2021](#ICCV-2021)
21 | - [ACM MM 2021](#ACMMM-2021)
22 | - [Interspeech 2021](#Interspeech-2021)
23 | - [ACL 2021](#ACL-2021)
24 | - [IJCAI 2021](#IJCAI-2021)
25 | - [NAACL 2021](#NAACL-2021)
26 | - [CVPR 2021](#CVPR-2021)
27 | - [ICASSP 2021](#ICASSP-2021)
28 | - [AAAI 2021](#AAAI-2021)
29 | * [2020](#2020)
30 | - [EMNLP 2020](#EMNLP-2020)
31 | - [NIPS 2020](#NIPS-2020)
32 | - [ACM MM 2020](#ACMMM-2020)
33 | - [ECCV 2020](#ECCV-2020)
34 | - [IJCAI 2020](#IJCAI-2020)
35 | - [ACL 2020](#ACL-2020)
36 | - [CVPR 2020](#CVPR-2020)
37 | - [AAAI 2020](#AAAI-2020)
38 | * [2019](#2019)
39 | * [NIPS 2019](#NIPS-2019)
40 | * [ICCV 2019](#ICCV-2019)
41 | * [ACL 2019](#ACL-2019)
42 | * [CVPR 2019](#CVPR-2019)
43 | * [AAAI 2019](#AAAI-2019)
44 | * [2018](#2018)
45 | * [NIPS 2018](#NIPS-2018)
46 | * [ECCV 2018](#ECCV-2018)
47 | * [ACL 2018](#ACL-2018)
48 | * [CVPR 2018](#CVPR-2018)
49 | * [2017](#2017)
50 | * [ICCV 2017](#ICCV-2017)
51 | * [CVPR 2017](#CVPR-2017)
52 | * [TPAMI 2017](#TPAMI-2017)
53 | * [2016](#2016)
54 | * [CVPR 2016](#CVPR-2016)
55 | * [TPAMI 2016](#TPAMI-2016)
56 | * [2015](#2015)
57 | * [ICML 2015](#ICML-2015)
58 | * [ICCV 2015](#ICCV-2015)
59 | * [CVPR 2015](#CVPR-2015)
60 | * [ICLR 2015](#ICLR-2015)
61 | * [Dataset](#dataset)
62 | * [Popular Codebase](#popular-codebase)
63 | * [Reference and Acknowledgement](#reference-and-acknowledgement)
64 | ## Survey Papers
65 | ### 2021
66 | - From Show to Tell: A Survey on Image Captioning. [[paper]](https://arxiv.org/pdf/2107.06912.pdf)
67 |
68 | ## Research Papers
69 | ### 2022
70 | #### arxiv 2022
71 | ##### Image Captioning
72 | - Compact Bidirectional Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2201.01984.pdf) [[code]](https://github.com/YuanEZhou/CBTrans)
73 | - ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer. [[paper]](https://arxiv.org/pdf/2202.07305.pdf)
74 | - I-Tuning: Tuning Language Models with Image for Caption Generation. [[paper]](https://arxiv.org/pdf/2202.06574.pdf)
75 | - CaMEL: Mean Teacher Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2202.10492.pdf) [[code]](https://github.com/aimagelab/camel)
76 | - Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition. [[paper]](https://arxiv.org/pdf/2203.03195.pdf)
77 | ##### Video Captioning
78 | - Discourse Analysis for Evaluating Coherence in Video Paragraph Captions. [[paper]](https://arxiv.org/pdf/2201.06207.pdf)
79 | - Cross-modal Contrastive Distillation for Instructional Activity Anticipation. [[paper]](https://arxiv.org/pdf/2201.06734.pdf)
80 | - End-to-end Generative Pretraining for Multimodal Video Captioning. [[paper]](https://arxiv.org/pdf/2201.08264.pdf)
81 | - Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation. [[paper]](https://arxiv.org/pdf/2202.05728.pdf) [[code]](https://sites.google.com/view/soccercaptioning)
82 | - Dual-Level Decoupled Transformer for Video Captioning. [[paper]](https://arxiv.org/pdf/2205.03039.pdf)
83 | - Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information. [[paper]](https://arxiv.org/pdf/2205.03534.pdf) [[code]](https://e-mmad.github.io/e-mmad.net/index.html)
84 | #### IJCAI 2022
85 | - Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. [[paper]](https://arxiv.org/pdf/2204.10688.pdf)
86 | #### CVPR 2022
87 | - X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning. [[paper]](https://arxiv.org/pdf/2203.00843.pdf)
88 | - Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. [[paper]](https://arxiv.org/pdf/2205.04363.pdf)
89 | ##### Video Captioning
90 | - What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics. [[paper]](https://arxiv.org/pdf/2205.06253.pdf) (Workshop)
91 | #### AAAI 2022
92 | ##### Image Captioning
93 | - Image Difference Captioning with Pre-training and Contrastive Learning. [[paper]](https://arxiv.org/pdf/2202.04298.pdf)
94 | ### 2021
95 | #### NIPS 2021
96 | ##### Image Captioning
97 | - Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation. [[paper]](https://openreview.net/pdf?id=nIL7Q-p7-Sh)
98 | - FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark. [[paper]](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/35f4a8d465e6e1edc05f3d8ab658c551-Paper-round2.pdf) [[code]](https://github.com/mlii0117/FFA-IR)
99 | ##### Video Captioning
100 | - Multi-modal Dependency Tree for Video Captioning. [[paper]](https://openreview.net/pdf?id=sW40wkwfsZp)
101 |
102 |
103 | #### EMNLP 2021
104 | ##### Image Captioning
105 | - Visual News: Benchmark and Challenges in News Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.542.pdf) [[code]](https://github.com/FuxiaoLiu/VisualNews-Repository)
106 | - R3Net: Relation-embedded Representation Reconstruction Network for Change Captioning. [[paper]](https://arxiv.org/pdf/2110.10328.pdf) [[code]](https://github.com/tuyunbin/R3Net)
107 | - CLIPScore: A Reference-free Evaluation Metric for Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.595.pdf)
108 | - Journalistic Guidelines Aware News Image Captioning. [[paper]](https://arxiv.org/pdf/2109.02865.pdf)
109 | - Understanding Guided Image Captioning Performance across Domains. [[paper]](https://aclanthology.org/2021.conll-1.14.pdf) [[code]](https://github.com/google-research-datasets/T2-Guiding) (CoNLL)
110 | - Language Resource Efficient Learning for Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.162.pdf) (Findings)
111 | - Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.171.pdf) (Findings)
112 | - QACE: Asking Questions to Evaluate an Image Caption. [[paper]](https://arxiv.org/pdf/2108.12560.pdf) (Findings)
113 | - COSMic: A Coherence-Aware Generation Metric for Image Descriptions. [[paper]](https://arxiv.org/pdf/2109.05281.pdf) (Findings)
114 |
115 | #### ICCV 2021
116 | ##### Image Captioning
117 | - Auto-Parsing Network for Image Captioning and Visual Question Answering. [[paper]](https://arxiv.org/pdf/2108.10568.pdf)
118 | - Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning. [[paper]](https://arxiv.org/pdf/2108.11912.pdf)
119 | - Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. [[paper]](https://arxiv.org/pdf/2109.05743.pdf)
120 | - Partial Off-Policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Shi_Partial_Off-Policy_Learning_Balance_Accuracy_and_Diversity_for_Human-Oriented_Image_ICCV_2021_paper.pdf)
121 | - Topic Scene Graph Generation by Attention Distillation from Caption. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_Topic_Scene_Graph_Generation_by_Attention_Distillation_From_Caption_ICCV_2021_paper.pdf)
122 | - Understanding and Evaluating Racial Biases in Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/princetonvisualai/imagecaptioning-bias)
123 | - In Defense of Scene Graphs for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Nguyen_In_Defense_of_Scene_Graphs_for_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/Kien085/SG2Caps)
124 | - Viewpoint-Agnostic Change Captioning with Cycle Consistency. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Viewpoint-Agnostic_Change_Captioning_With_Cycle_Consistency_ICCV_2021_paper.pdf)
125 | - Visual-Textual Attentive Semantic Consistency for Medical Report Generation. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_Visual-Textual_Attentive_Semantic_Consistency_for_Medical_Report_Generation_ICCV_2021_paper.pdf)
126 | - Semi-Autoregressive Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021W/CLVL/papers/Zhou_Semi-Autoregressive_Transformer_for_Image_Captioning_ICCVW_2021_paper.pdf) (Workshop)
127 | ##### Video Captioning
128 | - End-to-End Dense Video Captioning with Parallel Decoding. [[paper]](https://arxiv.org/pdf/2108.07781.pdf) [[code]](https://github.com/ttengwang/PDVC)
129 | - Motion Guided Region Message Passing for Video Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Motion_Guided_Region_Message_Passing_for_Video_Captioning_ICCV_2021_paper.pdf)
130 |
131 | #### ACMMM 2021
132 | ##### Image Captioning
133 | - Distributed Attention for Grounded Image Captioning. [[paper]](https://arxiv.org/pdf/2108.01056.pdf)
134 | - Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02366.pdf) [[code]](https://github.com/Unbear430/DGCN-for-image-captioning)
135 | - Group-based Distinctive Image Captioning with Memory Attention. [[paper]](https://arxiv.org/pdf/2108.09151.pdf)
136 | - Direction Relation Transformer for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3474085.3475607)
137 | ##### Text Captioning
138 | - Question-controlled Text-aware Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02059.pdf)
139 | ##### Video Captioning
140 | - Hybrid Reasoning Network for Video-based Commonsense Captioning. [[paper]](https://arxiv.org/ftp/arxiv/papers/2108/2108.02365.pdf)
141 | - Discriminative Latent Semantic Graph for Video Captioning. [[paper]](https://arxiv.org/pdf/2108.03662.pdf) [[code]](https://github.com/baiyang4/D-LSG-Video-Caption)
142 | - Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention. [[paper]](https://arxiv.org/pdf/2109.02955.pdf)
143 | - CLIP4Caption: CLIP for Video Caption. [[paper]](https://arxiv.org/pdf/2110.06615.pdf)
144 | #### Interspeech 2021
145 | ##### Video Captioning
146 | - Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. [[paper]](https://arxiv.org/pdf/2108.02147.pdf)
147 |
148 | #### ACL 2021
149 | ##### Image Captioning
150 | - Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. [[paper]](https://arxiv.org/pdf/2106.06471.pdf)
151 | - Competence-based Multimodal Curriculum Learning for Medical Report Generation.
152 | - Control Image Captioning Spatially and Temporally. [[paper]](https://aclanthology.org/2021.acl-long.157.pdf)
153 | - SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis. [[paper]](https://arxiv.org/pdf/2106.01444.pdf)
154 | - Enhancing Descriptive Image Captioning with Natural Language Inference.
155 | - UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. [[paper]](https://arxiv.org/pdf/2106.14019.pdf) [[code]](https://github.com/hwanheelee1993/UMIC)
156 | - Cross-modal Memory Networks for Radiology Report Generation.
157 |
158 | [comment]: <> (- IgSEG: Image-guided Story Ending Generation.)
159 | [comment]: <> (- Semantic Relation-aware Difference Representation Learning for Change Captioning.)
160 | [comment]: <> (- Contrastive Attention for Automatic Chest X-ray Report Generation. [[paper]](https://arxiv.org/pdf/2106.06965.pdf))
161 | ##### Video Captioning
162 | - Hierarchical Context-aware Network for Dense Video Event Captioning.
163 | - Video Paragraph Captioning as a Text Summarization Task.
164 |
165 | [comment]: <> (- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning. [[paper]](https://arxiv.org/pdf/2108.02359.pdf))
166 |
167 | #### IJCAI 2021
168 | ##### Image Captioning
169 | - TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [[paper]](https://arxiv.org/pdf/2106.10936.pdf)
170 |
171 | #### NAACL 2021
172 | ##### Image Captioning
173 | - Quality Estimation for Image Captions Based on Large-scale Human Evaluations. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.253.pdf)
174 | - Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.416.pdf)
175 |
176 | [comment]: <> (- Validity-Based Sampling and Smoothing Methods for Multiple Reference Image Captioning. [[paper]](https://www.aclweb.org/anthology/2021.maiworkshop-1.6.pdf))
177 | [comment]: <> (- Leveraging Partial Dependency Trees to Control Image Captions. [[paper]](https://www.aclweb.org/anthology/2021.alvr-1.3.pdf))
178 | [comment]: <> (- Coherent and Concise Radiology Report Generation via Context Specific Image Representations and Orthogonal Sentence States. [[paper]](https://www.aclweb.org/anthology/2021.naacl-industry.31.pdf))
179 | [comment]: <> (- Multi-Modal Image Captioning for the Visually Impaired. [[paper]](https://www.aclweb.org/anthology/2021.naacl-srw.8.pdf))
180 | ##### Video Captioning
181 | - DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.193.pdf)
182 |
183 |
184 | #### CVPR 2021
185 | ##### Image Captioning
186 | - Connecting What to Say With Where to Look by Modeling Human Attention Traces. [[paper]](https://arxiv.org/pdf/2105.05964.pdf) [[code]](https://github.com/facebookresearch/connect-caption-and-trace)
187 | - Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. [[paper]](https://arxiv.org/pdf/2103.05121.pdf)
188 | - Image Change Captioning by Learning From an Auxiliary Task. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf)
189 | - Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. [[paper]](https://arxiv.org/pdf/2012.02206.pdf) [[code]](https://github.com/daveredrum/Scan2Cap)
190 | - FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_FAIEr_Fidelity_and_Adequacy_Ensured_Image_Caption_Evaluation_CVPR_2021_paper.pdf)
191 | - RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.pdf)
192 | - Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles. [[paper]](https://arxiv.org/pdf/2103.12204.pdf)
193 | ##### Text Captioning
194 | - Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Improving_OCR-Based_Image_Captioning_by_Incorporating_Geometrical_Relationship_CVPR_2021_paper.pdf)
195 | - TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. [[paper]](https://arxiv.org/pdf/2012.04638.pdf)
196 | - Towards Accurate Text-Based Image Captioning With Content Diversity Exploration. [[paper]](https://arxiv.org/pdf/2105.03236.pdf)
197 |
198 | ##### Video Captioning
199 | - Open-Book Video Captioning With Retrieve-Copy-Generate Network. [[paper]](https://arxiv.org/pdf/2103.05284.pdf)
200 | - Towards Diverse Paragraph Captioning for Untrimmed Videos. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Song_Towards_Diverse_Paragraph_Captioning_for_Untrimmed_Videos_CVPR_2021_paper.pdf)
201 | - Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Towards_Bridging_Event_Captioner_and_Sentence_Localizer_for_Weakly_Supervised_CVPR_2021_paper.pdf)
202 |
203 | #### ICASSP 2021
204 | ##### Image Captioning
205 | - Cascade Attention Fusion for Fine-grained Image Captioning based on Multi-layer LSTM. [[paper]](https://ieeexplore.ieee.org/document/9413691)
206 | - Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning. [[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9414335)
207 |
208 | [comment]: <> (##### Audio Captioning)
209 |
210 | [comment]: <> (- Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning. [[paper]](https://arxiv.org/pdf/2102.11457.pdf))
211 |
212 | #### AAAI 2021
213 | ##### Image Captioning
214 | - Partially Non-Autoregressive Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16219) [[code]](https://github.com/feizc/PNAIC/tree/master)
215 | - Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. [[paper]](https://arxiv.org/pdf/2012.07061.pdf)
216 | - Object Relation Attention for Image Paragraph Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16423)
217 | - Dual-Level Collaborative Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2101.06462.pdf) [[code]](https://github.com/luo3300612/image-captioning-DLCT)
218 | - Memory-Augmented Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16220)
219 | - Image Captioning with Context-Aware Auxiliary Guidance. [[paper]](https://arxiv.org/pdf/2012.05545.pdf)
220 | - Consensus Graph Representation Learning for Better Grounded Image Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-3680.ZhangW.pdf)
221 | - FixMyPose: Pose Correctional Captioning and Retrieval. [[paper]](https://arxiv.org/pdf/2104.01703.pdf) [[code]](https://github.com/hyounghk/FixMyPose)
222 | - VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. [[paper]](https://arxiv.org/pdf/2009.13682)
223 |
224 | ##### Video Captioning
225 | - Non-Autoregressive Coarse-to-Fine Video Captioning. [[paper]](https://arxiv.org/pdf/1911.12018.pdf) [[code]](https://github.com/yangbang18/Non-Autoregressive-Video-Captioning)
226 | - Semantic Grouping Network for Video Captioning. [[paper]](https://arxiv.org/pdf/2102.00831.pdf) [[code]](https://github.com/hobincar/SGN)
227 | - Augmented Partial Mutual Learning with Frame Masking for Video Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-9714.LinK.pdf)
228 |
229 | #### TPAMI 2021
230 | ##### Video Captioning
231 | - Saying the Unseen: Video Descriptions via Dialog Agents. [[paper]](https://arxiv.org/pdf/2106.14069.pdf)
232 |
233 | ### 2020
234 | #### EMNLP 2020
235 | ##### Image Captioning
236 | - CapWAP: Captioning with a Purpose. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.705/) [[code]](https://github.com/google-research/language/tree/master/language/capwap)
237 | - Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.443/) [[code]](https://github.com/google-research-datasets/widget-caption)
238 | - Visually Grounded Continual Learning of Compositional Phrases. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.158/)
239 | - Pragmatic Issue-Sensitive Image Captioning. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.173/)
240 | - Structural and Functional Decomposition for Personality Image Captioning in a Communication Game. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.411/)
241 | - Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.377/)
242 | - ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.50/)
243 |
244 | ##### Video Captioning
245 | - Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. [[paper]](https://arxiv.org/abs/2003.05162)
246 |
247 |
248 | #### NIPS 2020
249 | ##### Image Captioning
250 | - RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/c2964caac096f26db222cb325aa267cb-Paper.pdf)
251 | - Diverse Image Captioning with Context-Object Split Latent Spaces. [[paper]](https://papers.nips.cc/paper/2020/file/24bea84d52e6a1f8025e313c2ffff50a-Paper.pdf)
252 | - Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/13fe9d84310e77f13a6d184dbf1232f3-Paper.pdf)
253 |
254 | #### ACMMM 2020
255 | ##### Image Captioning
256 | - Structural Semantic Adversarial Active Learning for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413885)
257 | - Iterative Back Modification for Faster Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413901)
258 | - Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414004)
259 | - Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413859)
260 | - Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413877)
261 | - ICECAP: Information Concentrated Entity-aware Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413576)
262 | - Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414009)
263 |
264 | ##### Text Captioning
265 | - Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413753)
266 |
267 | ##### Video Captioning
268 | - Controllable Video Captioning with an Exemplar Sentence. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413908)
269 | - Poet: Product-oriented Video Captioner for E-commerce. [[paper]](https://arxiv.org/abs/2008.06880) [[code]](https://github.com/shengyuzhang/Poet)
270 | - Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [[paper]](https://dl.acm.org/doi/10.1145/3394171.3413498)
271 | - Relational Graph Learning for Grounded Video Description Generation. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413746)
272 |
273 | #### ECCV 2020
274 | ##### Image Captioning
275 | - Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets. [[paper]](https://arxiv.org/pdf/2007.06877.pdf)
276 | - Towards Unique and Informative Captioning of Images. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123520613.pdf)
277 | - Learning Visual Representations with Caption Annotations. [[paper]](https://arxiv.org/pdf/2008.01392.pdf)
278 | - Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. [[paper]](https://arxiv.org/pdf/2008.02693.pdf) [[code]](https://github.com/xuewyang/Fashion_Captioning)
279 | - Length Controllable Image Captioning. [[paper]](https://arxiv.org/pdf/2007.09580.pdf) [[code]](https://github.com/bearcatt/LaBERT)
280 | - Comprehensive Image Captioning via Scene Graph Decomposition. [[paper]](https://arxiv.org/pdf/2007.11731.pdf)
281 | - Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590562.pdf)
282 | - Captioning Images Taken by People Who Are Blind. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620409.pdf)
283 | - Learning to Generate Grounded Visual Captions without Localization Supervision. [[paper]](https://arxiv.org/pdf/1906.00283.pdf) [[code]](https://github.com/chihyaoma/cyclical-visual-captioning)
285 | - Describing Textures using Natural Language. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/html/384_ECCV_2020_paper.php)
286 | - Connecting Vision and Language with Localized Narratives. [[paper]](https://arxiv.org/pdf/1912.03098.pdf) [[code]](https://github.com/google/localized-narratives)
287 | ##### Text Captioning
288 | - TextCaps: a Dataset for Image Captioning with Reading Comprehension. [[paper]](https://arxiv.org/pdf/2003.12462.pdf) [[code]](https://github.com/facebookresearch/mmf/tree/master/projects/m4c_captioner)
289 |
290 | ##### Video Captioning
291 | - Character Grounding and Re-Identification in Story of Videos and Text Descriptions. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123500528.pdf) [[code]](https://github.com/yj-yu/CiSIN/)
292 | - SODA: Story Oriented Dense Video Captioning Evaluation Framework. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123510511.pdf) [[code]](https://github.com/fujiso/SODA)
293 | - In-Home Daily-Life Captioning Using Radio Signals. [[paper]](https://arxiv.org/pdf/2008.10966.pdf)
294 | - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. [[paper]](https://arxiv.org/pdf/2001.09099.pdf) [[code]](https://github.com/jayleicn/TVCaption)
295 | - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123490324.pdf)
296 | - Identity-Aware Multi-Sentence Video Description. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660358.pdf)
297 |
298 | #### IJCAI 2020
299 | ##### Image Captioning
300 | - Human Consensus-Oriented Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0092.pdf)
301 | - Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [[paper]](https://www.ijcai.org/Proceedings/2020/0107.pdf)
302 | - Recurrent Relational Memory Network for Unsupervised Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0128.pdf)
303 | ##### Video Captioning
304 | - Learning to Discretely Compose Reasoning Module Networks for Video Captioning. [[paper]](https://arxiv.org/abs/2007.09049) [[code]](https://github.com/tgc1997/RMN)
305 | - SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [[paper]](https://www.ijcai.org/Proceedings/2020/88)
306 | - Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [[paper]](https://www.ijcai.org/Proceedings/2020/131)
307 |
308 | #### ACL 2020
309 | ##### Image Captioning
310 | - Clue: Cross-modal Coherence Modeling for Caption Generation. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.583.pdf)
311 | - Improving Image Captioning Evaluation by Considering Inter References Variance. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.93.pdf) [[code]](https://github.com/ck0123/improved-bertscore-for-image-captioning-evaluation)
312 | - Improving Image Captioning with Better Use of Caption. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) [[code]](https://github.com/Gitsamshi/WeakVRD-Captioning)
313 | ##### Video Captioning
314 | - MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.233.pdf) [[code]](https://github.com/jayleicn/recurrent-transformer)
315 |
316 | [comment]: <> (#### ICME 2020)
317 |
318 | [comment]: <> (##### Image Captioning)
319 |
320 | [comment]: <> (- [Fooled by Imagination: Adversarial Attack to Image Captioning Via Perturbation in Complex Domain](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102842) - Shaofeng Zhang et al, **ICME 2020**. )
321 |
322 | [comment]: <> (- [Modeling Local and Global Contexts for Image Captioning](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102935) - Peng Yao et al, **ICME 2020**. )
323 |
324 | [comment]: <> (##### Video Captioning)
325 |
326 | [comment]: <> (- [Video Captioning With Temporal And Region Graph Convolution Network](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102967) - Xinlong Xiao et al, **ICME 2020**. )
327 |
328 | #### CVPR 2020
329 | ##### Image Captioning
330 | - Context-Aware Group Captioning via Self-Attention and Contrastive Features. [[paper]](https://arxiv.org/abs/2004.03708) [[code]](https://lizw14.github.io/project/groupcap)
331 | - Show, Edit and Tell: A Framework for Editing Image Captions. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Sammani_Show_Edit_and_Tell_A_Framework_for_Editing_Image_Captions_CVPR_2020_paper.pdf) [[code]](https://github.com/fawazsammani/show-edit-tell)
332 | - Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_Say_As_You_Wish_Fine-Grained_Control_of_Image_Caption_Generation_CVPR_2020_paper.pdf) [[code]](https://github.com/cshizhe/asg2cap)
333 | - Normalized and Geometry-Aware Self-Attention Network for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Guo_Normalized_and_Geometry-Aware_Self-Attention_Network_for_Image_Captioning_CVPR_2020_paper.pdf)
334 | - Meshed-Memory Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Cornia_Meshed-Memory_Transformer_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/aimagelab/meshed-memory-transformer)
335 | - X-Linear Attention Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_X-Linear_Attention_Networks_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/JDAI-CV/image-captioning)
336 | - Transform and Tell: Entity-Aware News Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Tran_Transform_and_Tell_Entity-Aware_News_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/alasdairtran/transform-and-tell)
337 | - More Grounded Image Captioning by Distilling Image-Text Matching Model. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhou_More_Grounded_Image_Captioning_by_Distilling_Image-Text_Matching_Model_CVPR_2020_paper.pdf) [[code]](https://github.com/YuanEZhou/Grounded-Image-Captioning)
338 | - Better Captioning With Sequence-Level Exploration. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Better_Captioning_With_Sequence-Level_Exploration_CVPR_2020_paper.html)
339 |
340 | [comment]: <> (- Alleviating Noisy Data in Image Captioning with Cooperative Distillation. [[paper]](https://arxiv.org/pdf/2012.11691.pdf) - Pierre Dognin et al, **CVPRW 2020**. )
341 |
342 | ##### Video Captioning
343 | - Object Relational Graph With Teacher-Recommended Learning for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Object_Relational_Graph_With_Teacher-Recommended_Learning_for_Video_Captioning_CVPR_2020_paper.pdf)
344 | - Spatio-Temporal Graph for Video Captioning With Knowledge Distillation. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_Spatio-Temporal_Graph_for_Video_Captioning_With_Knowledge_Distillation_CVPR_2020_paper.pdf) [[code]](https://github.com/StanfordVL/STGraph)
346 | - Syntax-Aware Action Targeting for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Zheng_Syntax-Aware_Action_Targeting_for_Video_Captioning_CVPR_2020_paper.html) [[code]](https://github.com/SydCaption/SAAT)
347 | - Screencast Tutorial Video Understanding. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Screencast_Tutorial_Video_Understanding_CVPR_2020_paper.pdf)
348 |
349 |
350 | #### AAAI 2020
351 | ##### Image Captioning
352 | - Unified Vision-Language Pre-Training for Image Captioning and VQA. [[paper]](https://arxiv.org/abs/1909.11059) [[code]](https://github.com/LuoweiZhou/VLP)
353 | - Reinforcing an Image Caption Generator using Off-line Human Feedback. [[paper]](https://arxiv.org/abs/1911.09753)
354 | - Memorizing Style Knowledge for Image Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6998)
355 | - Joint Commonsense and Relation Reasoning for Image and Video Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6731)
356 | - Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption. [[paper]](https://weizhangltt.github.io/paper/zhang-aaai20.pdf)
357 | - Show, Recall, and Tell: Image Captioning with Recall Mechanism. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/6898)
358 | - Interactive Dual Generative Adversarial Networks for Image Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6826)
359 | - Feature Deformation Meta-Networks in Image Captioning of Novel Objects. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6620)
360 | ##### Video Captioning
361 | - An Efficient Framework for Dense Video Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6881)
362 |
363 |
364 | ### 2019
365 |
366 | #### NIPS 2019 ####
367 | ##### Image Captioning
368 | - Adaptively Aligned Image Captioning via Adaptive Attention Time. [[paper]](http://papers.nips.cc/paper/by-source-2019-4799) [[code]](https://github.com/husthuaan/AAT)
369 | - Image Captioning: Transforming Objects into Words. [[paper]](http://papers.nips.cc/paper/by-source-2019-5963) [[code]](https://github.com/yahoo/object_relation_transformer)
370 | - Variational Structured Semantic Inference for Diverse Image Captioning. [[paper]](http://papers.nips.cc/paper/by-source-2019-1113)
371 |
372 |
373 | #### ICCV 2019
374 | ##### Image Captioning
375 |
376 | - Robust Change Captioning. [[paper]](https://arxiv.org/abs/1901.02527)
377 | - Attention on Attention for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Huang_Attention_on_Attention_for_Image_Captioning_ICCV_2019_paper.pdf)
378 | - Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ge_Exploring_Overall_Contextual_Information_for_Image_Captioning_in_Human-Like_Cognitive_ICCV_2019_paper.pdf)
379 | - Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Datta_Align2Ground_Weakly_Supervised_Phrase_Grounding_Guided_by_Image-Caption_Alignment_ICCV_2019_paper.pdf)
380 | - Hierarchy Parsing for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yao_Hierarchy_Parsing_for_Image_Captioning_ICCV_2019_paper.pdf)
381 | - Generating Diverse and Descriptive Image Captions Using Visual Paraphrases. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_Generating_Diverse_and_Descriptive_Image_Captions_Using_Visual_Paraphrases_ICCV_2019_paper.pdf)
382 | - Learning to Collocate Neural Modules for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yang_Learning_to_Collocate_Neural_Modules_for_Image_Captioning_ICCV_2019_paper.pdf)
383 | - Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Aneja_Sequential_Latent_Spaces_for_Modeling_the_Intention_During_Diverse_Image_ICCV_2019_paper.pdf)
384 | - Towards Unsupervised Image Captioning With Shared Multimodal Embeddings. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Laina_Towards_Unsupervised_Image_Captioning_With_Shared_Multimodal_Embeddings_ICCV_2019_paper.pdf)
385 | - Human Attention in Image Captioning: Dataset and Analysis. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/He_Human_Attention_in_Image_Captioning_Dataset_and_Analysis_ICCV_2019_paper.pdf)
386 | - Reflective Decoding Network for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ke_Reflective_Decoding_Network_for_Image_Captioning_ICCV_2019_paper.pdf)
387 | - Joint Optimization for Cooperative Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Vered_Joint_Optimization_for_Cooperative_Image_Captioning_ICCV_2019_paper.pdf)
388 | - Entangled Transformer for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf)
389 | - nocaps: novel object captioning at scale. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Agrawal_nocaps_novel_object_captioning_at_scale_ICCV_2019_paper.pdf)
390 | - Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ye_Cap2Det_Learning_to_Amplify_Weak_Caption_Supervision_for_Object_Detection_ICCV_2019_paper.pdf)
391 | - Unpaired Image Captioning via Scene Graph Alignments. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Gu_Unpaired_Image_Captioning_via_Scene_Graph_Alignments_ICCV_2019_paper.pdf)
392 | - Learning to Caption Images Through a Lifetime by Asking Questions. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Shen_Learning_to_Caption_Images_Through_a_Lifetime_by_Asking_Questions_ICCV_2019_paper.pdf)
393 |
394 | ##### Video Captioning
395 | - VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf)
396 | - Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Controllable_Video_Captioning_With_POS_Sequence_Guidance_Based_on_Gated_ICCV_2019_paper.pdf)
397 | - Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Hou_Joint_Syntax_Representation_Learning_and_Visual_Cue_Translation_for_Video_ICCV_2019_paper.pdf)
398 | - Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Rahman_Watch_Listen_and_Tell_Multi-Modal_Weakly_Supervised_Dense_Event_Captioning_ICCV_2019_paper.pdf)
399 |
400 | #### ACL 2019 ####
401 | ##### Image Captioning
402 | - Informative Image Captioning with External Sources of Information [[paper]](https://www.aclweb.org/anthology/P19-1650.pdf)
403 |
404 | - Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning [[paper]](https://www.aclweb.org/anthology/P19-1652.pdf)
405 |
406 | - Generating Question Relevant Captions to Aid Visual Question Answering [[paper]](https://www.aclweb.org/anthology/P19-1348.pdf)
407 | ##### Video Captioning
408 | - Dense Procedure Captioning in Narrated Instructional Videos [[paper]](https://www.aclweb.org/anthology/P19-1641.pdf)
409 |
410 | #### CVPR 2019 ####
411 | ##### Image Captioning
412 | - Auto-Encoding Scene Graphs for Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.pdf)
413 |
414 | - Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Deshpande_Fast_Diverse_and_Accurate_Image_Captioning_Guided_by_Part-Of-Speech_CVPR_2019_paper.pdf)
415 |
416 | - Unsupervised Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Feng_Unsupervised_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
417 |
418 | - Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Xu_Exact_Adversarial_Attack_to_Image_Captioning_via_Structured_Output_Learning_CVPR_2019_paper.pdf)
419 |
420 | - Describing like Humans: On Diversity in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Describing_Like_Humans_On_Diversity_in_Image_Captioning_CVPR_2019_paper.pdf)
421 |
422 | - MSCap: Multi-Style Image Captioning With Unpaired Stylized Text [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_MSCap_Multi-Style_Image_Captioning_With_Unpaired_Stylized_Text_CVPR_2019_paper.pdf)
423 | - CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_CapSal_Leveraging_Captioning_to_Boost_Semantics_for_Salient_Object_Detection_CVPR_2019_paper.pdf) [[code]](https://github.com/zhangludl/code-and-dataset-for-CapSal)
424 |
425 | - Context and Attribute Grounded Dense Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yin_Context_and_Attribute_Grounded_Dense_Captioning_CVPR_2019_paper.pdf)
426 |
427 | - Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-Based_Captioning_CVPR_2019_paper.pdf)
428 |
429 | - Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Cornia_Show_Control_and_Tell_A_Framework_for_Generating_Controllable_and_CVPR_2019_paper.pdf)
430 |
431 | - Self-Critical N-step Training for Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Self-Critical_N-Step_Training_for_Image_Captioning_CVPR_2019_paper.pdf)
432 |
433 | - Look Back and Predict Forward in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.pdf)
434 |
435 | - Intention Oriented Image Captions with Guiding Objects [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Intention_Oriented_Image_Captions_With_Guiding_Objects_CVPR_2019_paper.pdf)
436 |
437 | - Adversarial Semantic Alignment for Improved Image Captions [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Dognin_Adversarial_Semantic_Alignment_for_Improved_Image_Captions_CVPR_2019_paper.pdf)
438 |
439 | - Good News, Everyone! Context driven entity-aware captioning for news images. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.pdf) [[code]](https://github.com/furkanbiten/GoodNews)
440 | - Pointing Novel Objects in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Pointing_Novel_Objects_in_Image_Captioning_CVPR_2019_paper.pdf)
441 |
442 | - Engaging Image Captioning via Personality [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Shuster_Engaging_Image_Captioning_via_Personality_CVPR_2019_paper.pdf)
443 |
449 | ##### Video Captioning
450 | - Streamlined Dense Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Mun_Streamlined_Dense_Video_Captioning_CVPR_2019_paper.pdf)
451 | - Grounded Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhou_Grounded_Video_Description_CVPR_2019_paper.pdf)
452 | - Adversarial Inference for Multi-Sentence Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Adversarial_Inference_for_Multi-Sentence_Video_Description_CVPR_2019_paper.pdf)
453 | - Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Object-Aware_Aggregation_With_Bidirectional_Temporal_Graph_for_Video_Captioning_CVPR_2019_paper.pdf)
454 | - Memory-Attended Recurrent Network for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Pei_Memory-Attended_Recurrent_Network_for_Video_Captioning_CVPR_2019_paper.pdf)
455 | - Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf)
456 |
457 |
458 | #### AAAI 2019
459 | ##### Image Captioning
460 | - Improving Image Captioning with Conditional Generative Adversarial Nets [[paper]](https://arxiv.org/pdf/1805.07112.pdf)
461 |
462 | - Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4916)
463 |
464 | - Meta Learning for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4883)
465 | - Deliberate Residual based Attention Network for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4845/4718)
466 |
467 | - Hierarchical Attention Network for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4924)
468 |
469 | - Learning Object Context for Dense Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4886)
470 |
471 | ##### Video Captioning
472 | - Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning [[paper]](https://arxiv.org/pdf/1811.02765.pdf) [[code]](https://github.com/eric-xw/Zero-Shot-Video-Captioning)
473 |
474 | - Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning [[paper]](https://arxiv.org/pdf/1905.01077v1.pdf)
475 |
476 | - Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/4839)
477 |
478 | - Motion Guided Spatial Attention for Video Captioning [[paper]](http://yugangjiang.info/publication/19AAAI-vidcaptioning.pdf)
479 |
480 | ### 2018
481 |
482 | #### NIPS 2018 ####
483 | ##### Image Captioning
484 | - A Neural Compositional Paradigm for Image Captioning. [[paper]](https://arxiv.org/pdf/1810.09630.pdf) [[code]](https://github.com/doubledaibo/compcaption_neurips2018)
485 | ##### Video Captioning
486 | - Weakly Supervised Dense Event Captioning in Videos. [[paper]](https://papers.nips.cc/paper/2018/file/49af6c4e558a7569d80eee2e035e2bd7-Paper.pdf) [[code]](https://github.com/XgDuan/WSDEC)
487 |
488 |
489 |
490 | #### ECCV 2018 ####
491 | ##### Image Captioning
492 | - Unpaired Image Captioning by Language Pivoting. [[paper]](https://arxiv.org/pdf/1803.05526.pdf) [[code]](https://github.com/gujiuxiang/unpaired_image_captioning)
493 | - Exploring Visual Relationship for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ting_Yao_Exploring_Visual_Relationship_ECCV_2018_paper.pdf)
494 | - Recurrent Fusion Network for Image Captioning. [[paper]](https://arxiv.org/pdf/1807.09986.pdf) [[code]](https://github.com/cswhjiang/Recurrent_Fusion_Network)
495 | - Boosted Attention: Leveraging Human Attention for Image Captioning. [[paper]](https://arxiv.org/pdf/1904.00767.pdf)
496 | - Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data. [[paper]](https://arxiv.org/pdf/1803.08314.pdf)
497 | - "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. [[paper]](https://arxiv.org/pdf/1807.03871.pdf)
498 | #### ACL 2018
499 | ##### Image Captioning
500 | - Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. [[paper]](https://arxiv.org/pdf/1712.02051.pdf)
501 | - Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. [[paper]](https://aclanthology.org/P18-1238.pdf) [[code]](https://github.com/google-research-datasets/conceptual-captions)
502 |
503 | #### IJCAI 2018
504 | ##### Image Captioning
505 | - Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning. [[paper]](https://www.ijcai.org/proceedings/2018/0592.pdf)
506 | #### CVPR 2018 ####
507 | ##### Image Captioning
508 | - Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [[paper]](http://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html) [[code]](https://github.com/peteanderson80/bottom-up-attention)
509 | - Neural Baby Talk. [[paper]](https://arxiv.org/pdf/1803.09845.pdf)
510 | - GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints. [[paper]](https://openaccess.thecvf.com/content_cvpr_2018/papers/Chen_GroupCap_Group-Based_Image_CVPR_2018_paper.pdf)
511 | ##### Video Captioning
512 | - Reconstruction Network for Video Captioning. [[paper]](https://arxiv.org/pdf/1803.11438.pdf) [[code]](https://github.com/hobincar/reconstruction-network-for-video-captioning)
513 | ### 2017
514 | #### ICCV 2017
515 | ##### Image Captioning
516 | - Boosting Image Captioning with Attributes. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Yao_Boosting_Image_Captioning_ICCV_2017_paper.pdf)
517 | - Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Chen_Show_Adapt_and_ICCV_2017_paper.pdf) [[code]](https://github.com/tsenghungchen/show-adapt-and-tell)
518 | #### CVPR 2017
519 | ##### Image Captioning
520 | - SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.pdf) [[code]](https://github.com/zjuchenlong/sca-cnn.cvpr17)
521 | - When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [[paper]](https://arxiv.org/pdf/1612.01887.pdf) [[code]](https://github.com/jiasenlu/AdaptiveAttention)
522 | - Self-critical Sequence Training for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.pdf)
523 | - Semantic Compositional Networks for Visual Captioning. [[paper]](https://arxiv.org/pdf/1611.08002.pdf) [[code]](https://github.com/zhegan27/Semantic_Compositional_Nets)
524 | - StyleNet: Generating Attractive Visual Captions with Styles. [[paper]](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/Generating-Attractive-Visual-Captions-with-Styles.pdf) [[code]](https://github.com/kacky24/stylenet)
525 | #### TPAMI 2017
526 | ##### Image Captioning
527 | - BreakingNews: Article Annotation by Image and Text Processing. [[paper]](https://arxiv.org/pdf/1603.07141.pdf)
528 | ### 2016
529 |
530 | #### ECCV 2016
531 | - SPICE: Semantic Propositional Image Caption Evaluation. [[paper]](https://arxiv.org/pdf/1607.08822.pdf) [[code]](https://github.com/peteanderson80/SPICE)
532 | - Generating Visual Explanations. [[paper]](https://arxiv.org/pdf/1603.08507.pdf) [[code]](https://github.com/LisaAnne/ECCV2016)
533 | #### CVPR 2016
534 | ##### Image Captioning
535 | - Image Captioning with Semantic Attention. [[paper]](https://openaccess.thecvf.com/content_cvpr_2016/papers/You_Image_Captioning_With_CVPR_2016_paper.pdf) [[code]](https://github.com/chapternewscu/image-captioning-with-semantic-attention)
536 | - Learning Deep Representations of Fine-grained Visual Descriptions. [[paper]](https://arxiv.org/pdf/1605.05395.pdf) [[code]](https://github.com/reedscot/cvpr2016)
537 | - Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. [[paper]](https://arxiv.org/pdf/1511.05284.pdf) [[code]](https://github.com/LisaAnne/DCC)
538 | #### TPAMI 2016
539 | ##### Image Captioning
540 | - Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts. [[paper]](https://ieeexplore.ieee.org/abstract/document/7792748) [[code]](https://github.com/fukun07/neural-image-captioning)
541 |
542 | ### 2015
543 | #### ICML 2015
544 | ##### Image Captioning
545 | - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [[paper]](https://arxiv.org/pdf/1502.03044.pdf)
546 |
547 | #### ICCV 2015
548 | ##### Image Captioning
549 | - Guiding Long-Short Term Memory for Image Caption Generation. [[paper]](https://arxiv.org/pdf/1509.04942.pdf)
550 |
551 | #### CVPR 2015
552 | ##### Image Captioning
553 | - Show and Tell: A Neural Image Caption Generator. [[paper]](https://arxiv.org/pdf/1411.4555.pdf)
554 | - Deep Visual-Semantic Alignments for Generating Image Descriptions. [[paper]](https://arxiv.org/pdf/1412.2306.pdf) [[code]](https://github.com/karpathy/neuraltalk2)
555 | - CIDEr: Consensus-based Image Description Evaluation. [[paper]](https://arxiv.org/pdf/1411.5726.pdf) [[code]](https://github.com/vrama91/cider)
556 | #### ICLR 2015
557 | ##### Image Captioning
558 | - Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). [[paper]](https://arxiv.org/pdf/1412.6632.pdf)
559 |
560 | ## Dataset
561 | - MSCOCO
562 | - Flickr30K
563 | - Flickr8K
564 | - VizWiz
565 |
566 | ## Popular Codebase
567 | - [ruotianluo/ImageCaptioning.pytorch](https://github.com/ruotianluo/ImageCaptioning.pytorch)
568 |
569 | ## Reference and Acknowledgement
570 | - [awesome-image-captioning](https://github.com/zhjohnchan/awesome-image-captioning) from [Zhihong Chen](https://github.com/zhjohnchan)
571 |
572 | We really appreciate their contributions to this area.
573 |
--------------------------------------------------------------------------------