# Awesome Captioning [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

A curated list of **Visual Captioning** and related areas.

## Table of Contents
* [Survey Papers](#survey-papers)
* [Research Papers](#research-papers)
  * [2022](#2022)
    - [arXiv 2022](#arxiv-2022)
    - [IJCAI 2022](#ijcai-2022)
    - [CVPR 2022](#cvpr-2022)
    - [AAAI 2022](#aaai-2022)
  * [2021](#2021)
    - [NIPS 2021](#nips-2021)
    - [EMNLP 2021](#emnlp-2021)
    - [ICCV 2021](#iccv-2021)
    - [ACM MM 2021](#acmmm-2021)
    - [Interspeech 2021](#interspeech-2021)
    - [ACL 2021](#acl-2021)
    - [IJCAI 2021](#ijcai-2021)
    - [NAACL 2021](#naacl-2021)
    - [CVPR 2021](#cvpr-2021)
    - [ICASSP 2021](#icassp-2021)
    - [AAAI 2021](#aaai-2021)
    - [TPAMI 2021](#tpami-2021)
  * [2020](#2020)
    - [EMNLP 2020](#emnlp-2020)
    - [NIPS 2020](#nips-2020)
    - [ACM MM 2020](#acmmm-2020)
    - [ECCV 2020](#eccv-2020)
    - [IJCAI 2020](#ijcai-2020)
    - [ACL 2020](#acl-2020)
    - [CVPR 2020](#cvpr-2020)
    - [AAAI 2020](#aaai-2020)
  * [2019](#2019)
    - [NIPS 2019](#nips-2019)
    - [ICCV 2019](#iccv-2019)
    - [ACL 2019](#acl-2019)
    - [CVPR 2019](#cvpr-2019)
    - [AAAI 2019](#aaai-2019)
  * [2018](#2018)
    - [NIPS 2018](#nips-2018)
    - [ECCV 2018](#eccv-2018)
    - [ACL 2018](#acl-2018)
    - [CVPR 2018](#cvpr-2018)
  * [2017](#2017)
    - [ICCV 2017](#iccv-2017)
    - [CVPR 2017](#cvpr-2017)
    - [TPAMI 2017](#tpami-2017)
  * [2016](#2016)
    - [CVPR 2016](#cvpr-2016)
    - [TPAMI 2016](#tpami-2016)
  * [2015](#2015)
    - [ICML 2015](#icml-2015)
    - [ICCV 2015](#iccv-2015)
    - [CVPR 2015](#cvpr-2015)
    - [ICLR 2015](#iclr-2015)
* [Dataset](#dataset)
* [Popular Codebase](#popular-codebase)
* [Reference and Acknowledgement](#reference-and-acknowledgement)

## Survey Papers
### 2021
- From Show to Tell: A Survey on Image Captioning.
[[paper]](https://arxiv.org/pdf/2107.06912.pdf)

## Research Papers
### 2022
#### arXiv 2022
##### Image Captioning
- Compact Bidirectional Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2201.01984.pdf) [[code]](https://github.com/YuanEZhou/CBTrans)
- ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer. [[paper]](https://arxiv.org/pdf/2202.07305.pdf)
- I-Tuning: Tuning Language Models with Image for Caption Generation. [[paper]](https://arxiv.org/pdf/2202.06574.pdf)
- CaMEL: Mean Teacher Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2202.10492.pdf) [[code]](https://github.com/aimagelab/camel)
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition. [[paper]](https://arxiv.org/pdf/2203.03195.pdf)
##### Video Captioning
- Discourse Analysis for Evaluating Coherence in Video Paragraph Captions. [[paper]](https://arxiv.org/pdf/2201.06207.pdf)
- Cross-modal Contrastive Distillation for Instructional Activity Anticipation. [[paper]](https://arxiv.org/pdf/2201.06734.pdf)
- End-to-end Generative Pretraining for Multimodal Video Captioning. [[paper]](https://arxiv.org/pdf/2201.08264.pdf)
- Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation. [[paper]](https://arxiv.org/pdf/2202.05728.pdf) [[code]](https://sites.google.com/view/soccercaptioning)
- Dual-Level Decoupled Transformer for Video Captioning. [[paper]](https://arxiv.org/pdf/2205.03039.pdf)
- Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information. [[paper]](https://arxiv.org/pdf/2205.03534.pdf) [[code]](https://e-mmad.github.io/e-mmad.net/index.html)
#### IJCAI 2022
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. [[paper]](https://arxiv.org/pdf/2204.10688.pdf)
#### CVPR 2022
##### Image Captioning
- X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning. [[paper]](https://arxiv.org/pdf/2203.00843.pdf)
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. [[paper]](https://arxiv.org/pdf/2205.04363.pdf)
##### Video Captioning
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics. [[paper]](https://arxiv.org/pdf/2205.06253.pdf) (Workshop)
#### AAAI 2022
##### Image Captioning
- Image Difference Captioning with Pre-training and Contrastive Learning. [[paper]](https://arxiv.org/pdf/2202.04298.pdf)

### 2021
#### NIPS 2021
##### Image Captioning
- Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation. [[paper]](https://openreview.net/pdf?id=nIL7Q-p7-Sh)
- FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark. [[paper]](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/35f4a8d465e6e1edc05f3d8ab658c551-Paper-round2.pdf) [[code]](https://github.com/mlii0117/FFA-IR)
##### Video Captioning
- Multi-modal Dependency Tree for Video Captioning. [[paper]](https://openreview.net/pdf?id=sW40wkwfsZp)

#### EMNLP 2021
##### Image Captioning
- Visual News: Benchmark and Challenges in News Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.542.pdf) [[code]](https://github.com/FuxiaoLiu/VisualNews-Repository)
- R3Net: Relation-embedded Representation Reconstruction Network for Change Captioning. [[paper]](https://arxiv.org/pdf/2110.10328.pdf) [[code]](https://github.com/tuyunbin/R3Net)
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. [[paper]](https://aclanthology.org/2021.emnlp-main.595.pdf)
- Journalistic Guidelines Aware News Image Captioning. [[paper]](https://arxiv.org/pdf/2109.02865.pdf)
- Understanding Guided Image Captioning Performance across Domains. [[paper]](https://aclanthology.org/2021.conll-1.14.pdf) [[code]](https://github.com/google-research-datasets/T2-Guiding) (CoNLL)
- Language Resource Efficient Learning for Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.162.pdf) (Findings)
- Retrieval, Analogy, and Composition: A Framework for Compositional Generalization in Image Captioning. [[paper]](https://aclanthology.org/2021.findings-emnlp.171.pdf) (Findings)
- QACE: Asking Questions to Evaluate an Image Caption. [[paper]](https://arxiv.org/pdf/2108.12560.pdf) (Findings)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions. [[paper]](https://arxiv.org/pdf/2109.05281.pdf) (Findings)

#### ICCV 2021
##### Image Captioning
- Auto-Parsing Network for Image Captioning and Visual Question Answering. [[paper]](https://arxiv.org/pdf/2108.10568.pdf)
- Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning. [[paper]](https://arxiv.org/pdf/2108.11912.pdf)
- Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. [[paper]](https://arxiv.org/pdf/2109.05743.pdf)
- Partial Off-Policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Shi_Partial_Off-Policy_Learning_Balance_Accuracy_and_Diversity_for_Human-Oriented_Image_ICCV_2021_paper.pdf)
- Topic Scene Graph Generation by Attention Distillation from Caption. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_Topic_Scene_Graph_Generation_by_Attention_Distillation_From_Caption_ICCV_2021_paper.pdf)
- Understanding and Evaluating Racial Biases in Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/princetonvisualai/imagecaptioning-bias)
- In Defense of Scene Graphs for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Nguyen_In_Defense_of_Scene_Graphs_for_Image_Captioning_ICCV_2021_paper.pdf) [[code]](https://github.com/Kien085/SG2Caps)
- Viewpoint-Agnostic Change Captioning with Cycle Consistency. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Viewpoint-Agnostic_Change_Captioning_With_Cycle_Consistency_ICCV_2021_paper.pdf)
- Visual-Textual Attentive Semantic Consistency for Medical Report Generation. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_Visual-Textual_Attentive_Semantic_Consistency_for_Medical_Report_Generation_ICCV_2021_paper.pdf)
- Semi-Autoregressive Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021W/CLVL/papers/Zhou_Semi-Autoregressive_Transformer_for_Image_Captioning_ICCVW_2021_paper.pdf) (Workshop)
##### Video Captioning
- End-to-End Dense Video Captioning with Parallel Decoding. [[paper]](https://arxiv.org/pdf/2108.07781.pdf) [[code]](https://github.com/ttengwang/PDVC)
- Motion Guided Region Message Passing for Video Captioning. [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Motion_Guided_Region_Message_Passing_for_Video_Captioning_ICCV_2021_paper.pdf)

#### ACMMM 2021
##### Image Captioning
- Distributed Attention for Grounded Image Captioning. [[paper]](https://arxiv.org/pdf/2108.01056.pdf)
- Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02366.pdf) [[code]](https://github.com/Unbear430/DGCN-for-image-captioning)
- Group-based Distinctive Image Captioning with Memory Attention. [[paper]](https://arxiv.org/pdf/2108.09151.pdf)
- Direction Relation Transformer for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3474085.3475607)
##### Text Captioning
- Question-controlled Text-aware Image Captioning. [[paper]](https://arxiv.org/pdf/2108.02059.pdf)
##### Video Captioning
- Hybrid Reasoning Network for Video-based Commonsense Captioning. [[paper]](https://arxiv.org/ftp/arxiv/papers/2108/2108.02365.pdf)
- Discriminative Latent Semantic Graph for Video Captioning. [[paper]](https://arxiv.org/pdf/2108.03662.pdf) [[code]](https://github.com/baiyang4/D-LSG-Video-Caption)
- Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention. [[paper]](https://arxiv.org/pdf/2109.02955.pdf)
- CLIP4Caption: CLIP for Video Caption. [[paper]](https://arxiv.org/pdf/2110.06615.pdf)

#### Interspeech 2021
##### Video Captioning
- Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. [[paper]](https://arxiv.org/pdf/2108.02147.pdf)

#### ACL 2021
##### Image Captioning
- Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. [[paper]](https://arxiv.org/pdf/2106.06471.pdf)
- Competence-based Multimodal Curriculum Learning for Medical Report Generation.
- Control Image Captioning Spatially and Temporally. [[paper]](https://aclanthology.org/2021.acl-long.157.pdf)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis. [[paper]](https://arxiv.org/pdf/2106.01444.pdf)
- Enhancing Descriptive Image Captioning with Natural Language Inference.
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning.
[[paper]](https://arxiv.org/pdf/2106.14019.pdf) [[code]](https://github.com/hwanheelee1993/UMIC)
- Cross-modal Memory Networks for Radiology Report Generation.

[comment]: <> (- IgSEG: Image-guided Story Ending Generation.)
[comment]: <> (- Semantic Relation-aware Difference Representation Learning for Change Captioning.)
[comment]: <> (- Contrastive Attention for Automatic Chest X-ray Report Generation. [[paper]](https://arxiv.org/pdf/2106.06965.pdf))
##### Video Captioning
- Hierarchical Context-aware Network for Dense Video Event Captioning.
- Video Paragraph Captioning as a Text Summarization Task.

[comment]: <> (- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning. [[paper]](https://arxiv.org/pdf/2108.02359.pdf))

#### IJCAI 2021
##### Image Captioning
- TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [[paper]](https://arxiv.org/pdf/2106.10936.pdf)

#### NAACL 2021
##### Image Captioning
- Quality Estimation for Image Captions Based on Large-scale Human Evaluations. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.253.pdf)
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.416.pdf)

[comment]: <> (- Validity-Based Sampling and Smoothing Methods for Multiple Reference Image Captioning. [[paper]](https://www.aclweb.org/anthology/2021.maiworkshop-1.6.pdf))
[comment]: <> (- Leveraging Partial Dependency Trees to Control Image Captions. [[paper]](https://www.aclweb.org/anthology/2021.alvr-1.3.pdf))
[comment]: <> (- Coherent and Concise Radiology Report Generation via Context Specific Image Representations and Orthogonal Sentence States. [[paper]](https://www.aclweb.org/anthology/2021.naacl-industry.31.pdf))
[comment]: <> (- Multi-Modal Image Captioning for the Visually Impaired. [[paper]](https://www.aclweb.org/anthology/2021.naacl-srw.8.pdf))
##### Video Captioning
- DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [[paper]](https://www.aclweb.org/anthology/2021.naacl-main.193.pdf)

#### CVPR 2021
##### Image Captioning
- Connecting What to Say With Where to Look by Modeling Human Attention Traces. [[paper]](https://arxiv.org/pdf/2105.05964.pdf) [[code]](https://github.com/facebookresearch/connect-caption-and-trace)
- Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. [[paper]](https://arxiv.org/pdf/2103.05121.pdf)
- Image Change Captioning by Learning From an Auxiliary Task. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. [[paper]](https://arxiv.org/pdf/2012.02206.pdf) [[code]](https://github.com/daveredrum/Scan2Cap)
- FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_FAIEr_Fidelity_and_Adequacy_Ensured_Image_Caption_Evaluation_CVPR_2021_paper.pdf)
- RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.pdf)
- Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles. [[paper]](https://arxiv.org/pdf/2103.12204.pdf)
##### Text Captioning
- Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Improving_OCR-Based_Image_Captioning_by_Incorporating_Geometrical_Relationship_CVPR_2021_paper.pdf)
- TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. [[paper]](https://arxiv.org/pdf/2012.04638.pdf)
- Towards Accurate Text-Based Image Captioning With Content Diversity Exploration. [[paper]](https://arxiv.org/pdf/2105.03236.pdf)
##### Video Captioning
- Open-Book Video Captioning With Retrieve-Copy-Generate Network. [[paper]](https://arxiv.org/pdf/2103.05284.pdf)
- Towards Diverse Paragraph Captioning for Untrimmed Videos. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Song_Towards_Diverse_Paragraph_Captioning_for_Untrimmed_Videos_CVPR_2021_paper.pdf)
- Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Towards_Bridging_Event_Captioner_and_Sentence_Localizer_for_Weakly_Supervised_CVPR_2021_paper.pdf)

#### ICASSP 2021
##### Image Captioning
- Cascade Attention Fusion for Fine-grained Image Captioning based on Multi-layer LSTM. [[paper]](https://ieeexplore.ieee.org/document/9413691)
- Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning. [[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9414335)

[comment]: <> (##### Audio Captioning)

[comment]: <> (- Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning. [[paper]](https://arxiv.org/pdf/2102.11457.pdf))

#### AAAI 2021
##### Image Captioning
- Partially Non-Autoregressive Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16219) [[code]](https://github.com/feizc/PNAIC/tree/master)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. [[paper]](https://arxiv.org/pdf/2012.07061.pdf)
- Object Relation Attention for Image Paragraph Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16423)
- Dual-Level Collaborative Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2101.06462.pdf) [[code]](https://github.com/luo3300612/image-captioning-DLCT)
- Memory-Augmented Image Captioning. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16220)
- Image Captioning with Context-Aware Auxiliary Guidance. [[paper]](https://arxiv.org/pdf/2012.05545.pdf)
- Consensus Graph Representation Learning for Better Grounded Image Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-3680.ZhangW.pdf)
- FixMyPose: Pose Correctional Captioning and Retrieval. [[paper]](https://arxiv.org/pdf/2104.01703.pdf) [[code]](https://github.com/hyounghk/FixMyPose)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. [[paper]](https://arxiv.org/pdf/2009.13682)
##### Video Captioning
- Non-Autoregressive Coarse-to-Fine Video Captioning. [[paper]](https://arxiv.org/pdf/1911.12018.pdf) [[code]](https://github.com/yangbang18/Non-Autoregressive-Video-Captioning)
- Semantic Grouping Network for Video Captioning. [[paper]](https://arxiv.org/pdf/2102.00831.pdf) [[code]](https://github.com/hobincar/SGN)
- Augmented Partial Mutual Learning with Frame Masking for Video Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-9714.LinK.pdf)

#### TPAMI 2021
##### Video Captioning
- Saying the Unseen: Video Descriptions via Dialog Agents.
[[paper]](https://arxiv.org/pdf/2106.14069.pdf)

### 2020
#### EMNLP 2020
##### Image Captioning
- CapWAP: Captioning with a Purpose. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.705/) [[code]](https://github.com/google-research/language/tree/master/language/capwap)
- Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.443/) [[code]](https://github.com/googleresearch-datasets/widget-caption)
- Visually Grounded Continual Learning of Compositional Phrases. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.158/)
- Pragmatic Issue-Sensitive Image Captioning. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.173/)
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.411/)
- Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze. [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.377/)
- ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization. [[paper]](https://www.aclweb.org/anthology/2020.findings-emnlp.50/)
##### Video Captioning
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. [[paper]](https://arxiv.org/abs/2003.05162)

#### NIPS 2020
##### Image Captioning
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/c2964caac096f26db222cb325aa267cb-Paper.pdf)
- Diverse Image Captioning with Context-Object Split Latent Spaces. [[paper]](https://papers.nips.cc/paper/2020/file/24bea84d52e6a1f8025e313c2ffff50a-Paper.pdf)
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/13fe9d84310e77f13a6d184dbf1232f3-Paper.pdf)

#### ACMMM 2020
##### Image Captioning
- Structural Semantic Adversarial Active Learning for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413885)
- Iterative Back Modification for Faster Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413901)
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414004)
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413859)
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413877)
- ICECAP: Information Concentrated Entity-aware Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413576)
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414009)
##### Text Captioning
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413753)
##### Video Captioning
- Controllable Video Captioning with an Exemplar Sentence. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413908)
- Poet: Product-oriented Video Captioner for E-commerce. [[paper]](https://arxiv.org/abs/2008.06880) [[code]](https://github.com/shengyuzhang/Poet)
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [[paper]](https://dl.acm.org/doi/10.1145/3394171.3413498)
- Relational Graph Learning for Grounded Video Description Generation. [[paper]](https://dl.acm.org/doi/abs/10.1145/3394171.3413746)

#### ECCV 2020
##### Image Captioning
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets. [[paper]](https://arxiv.org/pdf/2007.06877.pdf)
- Towards Unique and Informative Captioning of Images. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123520613.pdf)
- Learning Visual Representations with Caption Annotations. [[paper]](https://arxiv.org/pdf/2008.01392.pdf)
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. [[paper]](https://arxiv.org/pdf/2008.02693.pdf) [[code]](https://github.com/xuewyang/Fashion_Captioning)
- Length Controllable Image Captioning. [[paper]](https://arxiv.org/pdf/2007.09580.pdf) [[code]](https://github.com/bearcatt/LaBERT)
- Comprehensive Image Captioning via Scene Graph Decomposition. [[paper]](https://arxiv.org/pdf/2007.11731.pdf)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590562.pdf)
- Captioning Images Taken by People Who Are Blind. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620409.pdf)
- Learning to Generate Grounded Visual Captions without Localization Supervision. [[paper]](https://arxiv.org/pdf/1906.00283.pdf) [[code]](https://github.com/chihyaoma/cyclical-visual-captioning)
- Describing Textures using Natural Language. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/html/384_ECCV_2020_paper.php)
- Connecting Vision and Language with Localized Narratives.
[[paper]](https://arxiv.org/pdf/1912.03098.pdf) [[code]](https://github.com/google/localized-narratives) 287 | ##### Text Captioning 288 | - TextCaps: a Dataset for Image Captioning with Reading Comprehension. [[paper]](https://arxiv.org/pdf/2003.12462.pdf) [[code]](https://github.com/facebookresearch/mmf/tree/master/projects/m4c_captioner) 289 | 290 | ##### Video Captioning 291 | - Character Grounding and Re-Identification in Story of Videos and Text Descriptions. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123500528.pdf) [[code]](https://github.com/yj-yu/CiSIN/) 292 | - SODA: Story Oriented Dense Video Captioning Evaluation Framework. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123510511.pdf) [[code]](https://github.com/fujiso/SODA) 293 | - In-Home Daily-Life Captioning Using Radio Signals. [[paper]](https://arxiv.org/pdf/2008.10966.pdf) 294 | - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. [[paper]](https://arxiv.org/pdf/2001.09099.pdf) [[code]](https://github.com/jayleicn/TVCaption) 295 | - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos. [[paper]](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123490324.pdf) 296 | - Identity-Aware Multi-Sentence Video Description. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660358.pdf) 297 | 298 | #### IJCAI 2020 299 | ##### Image Captioning 300 | - Human Consensus-Oriented Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0092.pdf) 301 | - Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [[paper]](https://www.ijcai.org/Proceedings/2020/0107.pdf) 302 | - Recurrent Relational Memory Network for Unsupervised Image Captioning. [[paper]](https://www.ijcai.org/Proceedings/2020/0128.pdf) 303 | ##### Video Captioning 304 | - Learning to Discretely Compose Reasoning Module Networks for Video Captioning. 
[[paper]](https://arxiv.org/abs/2007.09049) [[code]](https://github.com/tgc1997/RMN) 305 | - SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [[paper]](https://www.ijcai.org/Proceedings/2020/88) 306 | - Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [[paper]](https://www.ijcai.org/Proceedings/2020/131) 307 | 308 | #### ACL 2020 309 | ##### Image Captioning 310 | - Clue: Cross-modal Coherence Modeling for Caption Generation. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.583.pdf) 311 | - Improving Image Captioning Evaluation by Considering Inter References Variance. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.93.pdf) [[code]](https://github.com/ck0123/improved-bertscore-for-image-captioning-evaluation) 312 | - Improving Image Captioning with Better Use of Caption. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) [[code]](https://github.com/Gitsamshi/WeakVRD-Captioning) 313 | ##### Video Captioning 314 | - MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [[paper]](https://www.aclweb.org/anthology/2020.acl-main.233.pdf) [[code]](https://github.com/jayleicn/recurrent-transformer) 315 | 316 | [comment]: <> (#### ICME 2020) 317 | 318 | [comment]: <> (##### Image Captioning) 319 | 320 | [comment]: <> (- [Fooled by Imagination: Adversarial Attack to Image Captioning Via Perturbation in Complex Domain](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102842) - Shaofeng Zhang et al, **ICME 2020**. ) 321 | 322 | [comment]: <> (- [Modeling Local and Global Contexts for Image Captioning](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102935) - Peng Yao et al, **ICME 2020**. 
) 323 | 324 | [comment]: <> (##### Video Captioning) 325 | 326 | [comment]: <> (- [Video Captioning With Temporal And Region Graph Convolution Network](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9102967) - Xinlong Xiao et al, **ICME 2020**. ) 327 | 328 | #### CVPR 2020 329 | ##### Image Captioning 330 | - Context-Aware Group Captioning via Self-Attention and Contrastive Features. [[paper]](https://arxiv.org/abs/2004.03708) [[code]](https://lizw14.github.io/project/groupcap) 331 | - Show, Edit and Tell: A Framework for Editing Image Captions. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Sammani_Show_Edit_and_Tell_A_Framework_for_Editing_Image_Captions_CVPR_2020_paper.pdf) [[code]](https://github.com/fawazsammani/show-edit-tell) 332 | - Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_Say_As_You_Wish_Fine-Grained_Control_of_Image_Caption_Generation_CVPR_2020_paper.pdf) [[code]](https://github.com/cshizhe/asg2cap) 333 | - Normalized and Geometry-Aware Self-Attention Network for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Guo_Normalized_and_Geometry-Aware_Self-Attention_Network_for_Image_Captioning_CVPR_2020_paper.pdf) 334 | - Meshed-Memory Transformer for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Cornia_Meshed-Memory_Transformer_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/aimagelab/meshed-memory-transformer) 335 | - X-Linear Attention Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_X-Linear_Attention_Networks_for_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/JDAI-CV/image-captioning) 336 | - Transform and Tell: Entity-Aware News Image Captioning. 
[[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Tran_Transform_and_Tell_Entity-Aware_News_Image_Captioning_CVPR_2020_paper.pdf) [[code]](https://github.com/alasdairtran/transform-and-tell) 337 | - More Grounded Image Captioning by Distilling Image-Text Matching Model. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhou_More_Grounded_Image_Captioning_by_Distilling_Image-Text_Matching_Model_CVPR_2020_paper.pdf) [[code]](https://github.com/YuanEZhou/Grounded-Image-Captioning) 338 | - Better Captioning With Sequence-Level Exploration. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Better_Captioning_With_Sequence-Level_Exploration_CVPR_2020_paper.html) 339 | 340 | [comment]: <> (- Alleviating Noisy Data in Image Captioning with Cooperative Distillation. [[paper]](https://arxiv.org/pdf/2012.11691.pdf) - Pierre Dognin et al, **CVPRW 2020**. ) 341 | 342 | ##### Video Captioning 343 | - Object Relational Graph With Teacher-Recommended Learning for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Object_Relational_Graph_With_Teacher-Recommended_Learning_for_Video_Captioning_CVPR_2020_paper.pdf) 344 | - Spatio-Temporal Graph for Video Captioning With Knowledge Distillation. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_Spatio-Temporal_Graph_for_Video_Captioning_With_Knowledge_Distillation_CVPR_2020_paper.pdf) [[code]](https://github.com/StanfordVL/STGraph) 345 | - Better Captioning With Sequence-Level Exploration. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Better_Captioning_With_Sequence-Level_Exploration_CVPR_2020_paper.html) 346 | - Syntax-Aware Action Targeting for Video Captioning. [[paper]](https://openaccess.thecvf.com/content_CVPR_2020/html/Zheng_Syntax-Aware_Action_Targeting_for_Video_Captioning_CVPR_2020_paper.html) [[code]](https://github.com/SydCaption/SAAT) 347 | - Screencast Tutorial Video Understanding. 
[[paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Screencast_Tutorial_Video_Understanding_CVPR_2020_paper.pdf)

#### AAAI 2020
##### Image Captioning
- Unified Vision-Language Pre-Training for Image Captioning and VQA. [[paper]](https://arxiv.org/abs/1909.11059) [[code]](https://github.com/LuoweiZhou/VLP)
- Reinforcing an Image Caption Generator using Off-line Human Feedback. [[paper]](https://arxiv.org/abs/1911.09753)
- Memorizing Style Knowledge for Image Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6998)
- Joint Commonsense and Relation Reasoning for Image and Video Captioning. [[paper]](https://www.aiide.org/ojs/index.php/AAAI/article/view/6731)
- Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption. [[paper]](https://weizhangltt.github.io/paper/zhang-aaai20.pdf)
- Show, Recall, and Tell: Image Captioning with Recall Mechanism. [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/6898)
- Interactive Dual Generative Adversarial Networks for Image Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6826)
- Feature Deformation Meta-Networks in Image Captioning of Novel Objects. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6620)
##### Video Captioning
- An Efficient Framework for Dense Video Captioning. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/6881)

### 2019

#### NIPS 2019
##### Image Captioning
- Adaptively Aligned Image Captioning via Adaptive Attention Time. [[paper]](http://papers.nips.cc/paper/by-source-2019-4799) [[code]](https://github.com/husthuaan/AAT)
- Image Captioning: Transforming Objects into Words.
[[paper]](http://papers.nips.cc/paper/by-source-2019-5963) [[code]](https://github.com/yahoo/object_relation_transformer)
- Variational Structured Semantic Inference for Diverse Image Captioning. [[paper]](http://papers.nips.cc/paper/by-source-2019-1113)

#### ICCV 2019
##### Image Captioning
- Robust Change Captioning. [[paper]](https://arxiv.org/abs/1901.02527)
- Attention on Attention for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Huang_Attention_on_Attention_for_Image_Captioning_ICCV_2019_paper.pdf)
- Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ge_Exploring_Overall_Contextual_Information_for_Image_Captioning_in_Human-Like_Cognitive_ICCV_2019_paper.pdf)
- Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Datta_Align2Ground_Weakly_Supervised_Phrase_Grounding_Guided_by_Image-Caption_Alignment_ICCV_2019_paper.pdf)
- Hierarchy Parsing for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yao_Hierarchy_Parsing_for_Image_Captioning_ICCV_2019_paper.pdf)
- Generating Diverse and Descriptive Image Captions Using Visual Paraphrases. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_Generating_Diverse_and_Descriptive_Image_Captions_Using_Visual_Paraphrases_ICCV_2019_paper.pdf)
- Learning to Collocate Neural Modules for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yang_Learning_to_Collocate_Neural_Modules_for_Image_Captioning_ICCV_2019_paper.pdf)
- Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Aneja_Sequential_Latent_Spaces_for_Modeling_the_Intention_During_Diverse_Image_ICCV_2019_paper.pdf)
- Towards Unsupervised Image Captioning With Shared Multimodal Embeddings. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Laina_Towards_Unsupervised_Image_Captioning_With_Shared_Multimodal_Embeddings_ICCV_2019_paper.pdf)
- Human Attention in Image Captioning: Dataset and Analysis. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/He_Human_Attention_in_Image_Captioning_Dataset_and_Analysis_ICCV_2019_paper.pdf)
- Reflective Decoding Network for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ke_Reflective_Decoding_Network_for_Image_Captioning_ICCV_2019_paper.pdf)
- Joint Optimization for Cooperative Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Vered_Joint_Optimization_for_Cooperative_Image_Captioning_ICCV_2019_paper.pdf)
- Entangled Transformer for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf)
- nocaps: novel object captioning at scale. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Agrawal_nocaps_novel_object_captioning_at_scale_ICCV_2019_paper.pdf)
- Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ye_Cap2Det_Learning_to_Amplify_Weak_Caption_Supervision_for_Object_Detection_ICCV_2019_paper.pdf)
- Unpaired Image Captioning via Scene Graph Alignments. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Gu_Unpaired_Image_Captioning_via_Scene_Graph_Alignments_ICCV_2019_paper.pdf)
- Learning to Caption Images Through a Lifetime by Asking Questions.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Shen_Learning_to_Caption_Images_Through_a_Lifetime_by_Asking_Questions_ICCV_2019_paper.pdf)

##### Video Captioning
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf)
- Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Controllable_Video_Captioning_With_POS_Sequence_Guidance_Based_on_Gated_ICCV_2019_paper.pdf)
- Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Hou_Joint_Syntax_Representation_Learning_and_Visual_Cue_Translation_for_Video_ICCV_2019_paper.pdf)
- Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning.
[[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Rahman_Watch_Listen_and_Tell_Multi-Modal_Weakly_Supervised_Dense_Event_Captioning_ICCV_2019_paper.pdf)

#### ACL 2019
##### Image Captioning
- Informative Image Captioning with External Sources of Information. [[paper]](https://www.aclweb.org/anthology/P19-1650.pdf)
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning. [[paper]](https://www.aclweb.org/anthology/P19-1652.pdf)
- Generating Question Relevant Captions to Aid Visual Question Answering. [[paper]](https://www.aclweb.org/anthology/P19-1348.pdf)
##### Video Captioning
- Dense Procedure Captioning in Narrated Instructional Videos. [[paper]](https://www.aclweb.org/anthology/P19-1641.pdf)

#### CVPR 2019
##### Image Captioning
- Auto-Encoding Scene Graphs for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
- Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Deshpande_Fast_Diverse_and_Accurate_Image_Captioning_Guided_by_Part-Of-Speech_CVPR_2019_paper.pdf)
- Unsupervised Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Feng_Unsupervised_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
- Describing like Humans: On Diversity in Image Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Describing_Like_Humans_On_Diversity_in_Image_Captioning_CVPR_2019_paper.pdf)
- MSCap: Multi-Style Image Captioning With Unpaired Stylized Text. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_MSCap_Multi-Style_Image_Captioning_With_Unpaired_Stylized_Text_CVPR_2019_paper.pdf)
- CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_CapSal_Leveraging_Captioning_to_Boost_Semantics_for_Salient_Object_Detection_CVPR_2019_paper.pdf) [[code]](https://github.com/zhangludl/code-and-dataset-for-CapSal)
- Context and Attribute Grounded Dense Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yin_Context_and_Attribute_Grounded_Dense_Captioning_CVPR_2019_paper.pdf)
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-Based_Captioning_CVPR_2019_paper.pdf)
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Cornia_Show_Control_and_Tell_A_Framework_for_Generating_Controllable_and_CVPR_2019_paper.pdf)
- Self-Critical N-step Training for Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Self-Critical_N-Step_Training_for_Image_Captioning_CVPR_2019_paper.pdf)
- Look Back and Predict Forward in Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.pdf)
- Intention Oriented Image Captions with Guiding Objects.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Intention_Oriented_Image_Captions_With_Guiding_Objects_CVPR_2019_paper.pdf)
- Adversarial Semantic Alignment for Improved Image Captions. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Dognin_Adversarial_Semantic_Alignment_for_Improved_Image_Captions_CVPR_2019_paper.pdf)
- Good News, Everyone! Context driven entity-aware captioning for news images. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.pdf) [[code]](https://github.com/furkanbiten/GoodNews)
- Pointing Novel Objects in Image Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Pointing_Novel_Objects_in_Image_Captioning_CVPR_2019_paper.pdf)
- Engaging Image Captioning via Personality. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Shuster_Engaging_Image_Captioning_via_Personality_CVPR_2019_paper.pdf)
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Xu_Exact_Adversarial_Attack_to_Image_Captioning_via_Structured_Output_Learning_CVPR_2019_paper.pdf)
##### Video Captioning
- Streamlined Dense Video Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Mun_Streamlined_Dense_Video_Captioning_CVPR_2019_paper.pdf)
- Grounded Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhou_Grounded_Video_Description_CVPR_2019_paper.pdf)
- Adversarial Inference for Multi-Sentence Video Description. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Adversarial_Inference_for_Multi-Sentence_Video_Description_CVPR_2019_paper.pdf)
- Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Object-Aware_Aggregation_With_Bidirectional_Temporal_Graph_for_Video_Captioning_CVPR_2019_paper.pdf)
- Memory-Attended Recurrent Network for Video Captioning. [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Pei_Memory-Attended_Recurrent_Network_for_Video_Captioning_CVPR_2019_paper.pdf)
- Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning.
[[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf)

#### AAAI 2019
##### Image Captioning
- Improving Image Captioning with Conditional Generative Adversarial Nets. [[paper]](https://arxiv.org/pdf/1805.07112.pdf)
- Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4916)
- Meta Learning for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4883)
- Deliberate Residual based Attention Network for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4845/4718)
- Hierarchical Attention Network for Image Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4924)
- Learning Object Context for Dense Captioning. [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4886)
##### Video Captioning
- Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning. [[paper]](https://arxiv.org/pdf/1811.02765.pdf) [[code]](https://github.com/eric-xw/Zero-Shot-Video-Captioning)
- Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning. [[paper]](https://arxiv.org/pdf/1905.01077v1.pdf)
- Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention. [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/4839)
- Motion Guided Spatial Attention for Video Captioning. [[paper]](http://yugangjiang.info/publication/19AAAI-vidcaptioning.pdf)

### 2018

#### NIPS 2018
##### Image Captioning
- A Neural Compositional Paradigm for Image Captioning.
[[paper]](https://arxiv.org/pdf/1810.09630.pdf) [[code]](https://github.com/doubledaibo/compcaption_neurips2018)
##### Video Captioning
- Weakly Supervised Dense Event Captioning in Videos. [[paper]](https://papers.nips.cc/paper/2018/file/49af6c4e558a7569d80eee2e035e2bd7-Paper.pdf) [[code]](https://github.com/XgDuan/WSDEC)

#### ECCV 2018
##### Image Captioning
- Unpaired Image Captioning by Language Pivoting. [[paper]](https://arxiv.org/pdf/1803.05526.pdf) [[code]](https://github.com/gujiuxiang/unpaired_image_captioning)
- Exploring Visual Relationship for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ting_Yao_Exploring_Visual_Relationship_ECCV_2018_paper.pdf)
- Recurrent Fusion Network for Image Captioning. [[paper]](https://arxiv.org/pdf/1807.09986.pdf) [[code]](https://github.com/cswhjiang/Recurrent_Fusion_Network)
- Boosted Attention: Leveraging Human Attention for Image Captioning. [[paper]](https://arxiv.org/pdf/1904.00767.pdf)
- Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data. [[paper]](https://arxiv.org/pdf/1803.08314.pdf)
- "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. [[paper]](https://arxiv.org/pdf/1807.03871.pdf)
#### ACL 2018
##### Image Captioning
- Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. [[paper]](https://arxiv.org/pdf/1712.02051.pdf)
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. [[paper]](https://aclanthology.org/P18-1238.pdf) [[code]](https://github.com/google-research-datasets/conceptual-captions)

#### IJCAI 2018
##### Image Captioning
- Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning.
[[paper]](https://www.ijcai.org/proceedings/2018/0592.pdf)
#### CVPR 2018
##### Image Captioning
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [[paper]](http://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html) [[code]](https://github.com/peteanderson80/bottom-up-attention)
- Neural Baby Talk. [[paper]](https://arxiv.org/pdf/1803.09845.pdf)
- GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints. [[paper]](https://openaccess.thecvf.com/content_cvpr_2018/papers/Chen_GroupCap_Group-Based_Image_CVPR_2018_paper.pdf)
##### Video Captioning
- Reconstruction Network for Video Captioning. [[paper]](https://arxiv.org/pdf/1803.11438.pdf) [[code]](https://github.com/hobincar/reconstruction-network-for-video-captioning)

### 2017
#### ICCV 2017
##### Image Captioning
- Boosting Image Captioning with Attributes. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Yao_Boosting_Image_Captioning_ICCV_2017_paper.pdf)
- Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner. [[paper]](https://openaccess.thecvf.com/content_ICCV_2017/papers/Chen_Show_Adapt_and_ICCV_2017_paper.pdf) [[code]](https://github.com/tsenghungchen/show-adapt-and-tell)
#### CVPR 2017
##### Image Captioning
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. [[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.pdf) [[code]](https://github.com/zjuchenlong/sca-cnn.cvpr17)
- When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [[paper]](https://arxiv.org/pdf/1612.01887.pdf) [[code]](https://github.com/jiasenlu/AdaptiveAttention)
- Self-critical Sequence Training for Image Captioning.
[[paper]](https://openaccess.thecvf.com/content_cvpr_2017/papers/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.pdf)
- Semantic Compositional Networks for Visual Captioning. [[paper]](https://arxiv.org/pdf/1611.08002.pdf) [[code]](https://github.com/zhegan27/Semantic_Compositional_Nets)
- StyleNet: Generating Attractive Visual Captions with Styles. [[paper]](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/Generating-Attractive-Visual-Captions-with-Styles.pdf) [[code]](https://github.com/kacky24/stylenet)
#### TPAMI 2017
##### Image Captioning
- BreakingNews: Article Annotation by Image and Text Processing. [[paper]](https://arxiv.org/pdf/1603.07141.pdf)

### 2016
#### ECCV 2016
- SPICE: Semantic Propositional Image Caption Evaluation. [[paper]](https://arxiv.org/pdf/1607.08822.pdf) [[code]](https://github.com/peteanderson80/SPICE)
- Generating Visual Explanations. [[paper]](https://arxiv.org/pdf/1603.08507.pdf) [[code]](https://github.com/LisaAnne/ECCV2016)
#### CVPR 2016
##### Image Captioning
- Image Captioning with Semantic Attention. [[paper]](https://openaccess.thecvf.com/content_cvpr_2016/papers/You_Image_Captioning_With_CVPR_2016_paper.pdf) [[code]](https://github.com/chapternewscu/image-captioning-with-semantic-attention)
- Learning Deep Representations of Fine-grained Visual Descriptions. [[paper]](https://arxiv.org/pdf/1605.05395.pdf) [[code]](https://github.com/reedscot/cvpr2016)
- Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. [[paper]](https://arxiv.org/pdf/1511.05284.pdf) [[code]](https://github.com/LisaAnne/DCC)
#### TPAMI 2016
##### Image Captioning
- Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts.
[[paper]](https://ieeexplore.ieee.org/abstract/document/7792748) [[code]](https://github.com/fukun07/neural-image-captioning)

### 2015
#### ICML 2015
##### Image Captioning
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [[paper]](https://arxiv.org/pdf/1502.03044.pdf)

#### ICCV 2015
##### Image Captioning
- Guiding Long-Short Term Memory for Image Caption Generation. [[paper]](https://arxiv.org/pdf/1509.04942.pdf)

#### CVPR 2015
##### Image Captioning
- Show and Tell: A Neural Image Caption Generator. [[paper]](https://arxiv.org/pdf/1411.4555.pdf)
- Deep Visual-Semantic Alignments for Generating Image Descriptions. [[paper]](https://arxiv.org/pdf/1412.2306.pdf) [[code]](https://github.com/karpathy/neuraltalk2)
- CIDEr: Consensus-based Image Description Evaluation. [[paper]](https://arxiv.org/pdf/1411.5726.pdf) [[code]](https://github.com/vrama91/cider)
#### ICLR 2015
##### Image Captioning
- Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). [[paper]](https://arxiv.org/pdf/1412.6632.pdf)

## Dataset
- MSCOCO
- Flickr30K
- Flickr8K
- VizWiz

## Popular Codebase
- [ruotianluo/ImageCaptioning.pytorch](https://github.com/ruotianluo/ImageCaptioning.pytorch)

## Reference and Acknowledgement
- [awesome-image-captioning](https://github.com/zhjohnchan/awesome-image-captioning) from [Zhihong Chen](https://github.com/zhjohnchan)

We really appreciate their contributions to this area.