├── Fig_1_MML.png └── README.md /Fig_1_MML.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marslanm/Multimodality-Representation-Learning/c02e70f6e736804639255816bf04f157f1c81148/Fig_1_MML.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications 2 | ![a](Fig_1_MML.png) 3 | ## Multimodal Deep Learning based Research 4 | 5 | - [Survey Papers](#survey) 6 | - [Task-specific Methods](#Task-specific-Methods) 7 | - [Pretraining Approaches](#Pretraining-Approaches) 8 | - [Multimodal Applications](#Multimodal-Applications) ([Understanding](#Understanding), [Classification](#Classification), [Generation](#Generation), [Retrieval](#Retrieval), [Translation](#Translation)) 9 | - [Multimodal Datasets](#Multimodal-Datasets) 10 | 11 | # Survey 12 | 13 | [**Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications.**](https://dl.acm.org/doi/abs/10.1145/3617833)
14 | *[Muhammad Arslan Manzoor](https://scholar.google.com/citations?hl=en&user=ZvXClnUAAAAJ), [Sarah Albarri](), [Ziting Xian](https://scholar.google.com/citations?hl=zh-CN&user=G7VId5YAAAAJ&view_op=list_works&gmla=AJsN-F6TTZmzbi9CmBIRLNRpAhcgmzH-nOUd8hM5UTjfT5A_mYW2ABzjSrdX7ki9GFgGaId2dLlMtXBkfq7X_qzYOwF_OuvCCthMiuVNUuUiac-aGoSwsKQ), [Zaiqiao Meng](https://scholar.google.com/citations?user=5jJKFVcAAAAJ&hl=en), [Preslav Nakov](https://scholar.google.com/citations?user=DfXsKZ4AAAAJ&hl=en), and [Shangsong Liang](https://scholar.google.com/citations?user=4uggVcIAAAAJ&h).*
15 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3617833)] 16 | 17 | **Vision-Language Pre-training: Basics, Recent Advances, and Future Trends.**[17th Oct, 2022]
18 | *Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao.*
19 | [[PDF](https://arxiv.org/pdf/2210.09263.pdf)] 20 | 21 | **VLP: A survey on vision-language pre-training.**[18th Feb, 2022]
22 | *Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, and Bo Xu.*
23 | [[PDF](https://link.springer.com/article/10.1007/s11633-022-1369-5)] 24 | 25 | 26 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022]
27 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao.*
28 | [[PDF](https://arxiv.org/abs/2202.10936)] 29 | 30 | 31 | **Vision-and-Language Pretrained Models: A Survey.**[15th Apr, 2022]
32 | *Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang.*
33 | [[PDF](https://arxiv.org/abs/2204.07356)] 34 | 35 | **Comprehensive reading list for Multimodal Literature**
36 | [[Github](https://github.com/pliang279/awesome-multimodal-ml#survey-papers)] 37 | 38 | **Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.**[28th Jul, 2021]
39 | *Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig*
40 | [[PDF](https://arxiv.org/abs/2107.13586)] 41 | 42 | **Recent Advances and Trends in Multimodal Deep Learning: A Review.**[24th May, 2021]
43 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul.*
44 | [[PDF](https://arxiv.org/pdf/2105.11087)] 45 | 46 | 47 | 48 | # Task-specific-Methods 49 | 50 | **Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network.**[9th Feb, 2021] 51 | 52 | *Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji* 53 | 54 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16258)] 55 | 56 | **Cascaded Recurrent Neural Networks for Hyperspectral Image Classification.**[Aug, 2019] 57 | 58 | *Renlong Hang, Qingshan Liu, Danfeng Hong, Pedram Ghamisi* 59 | 60 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8662780)] 61 | 62 | **Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.**[2015 NIPS] 63 | 64 | *Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun* 65 | 66 | [[PDF](https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html)] 67 | 68 | **Microsoft coco: Common objects in context.**[2014 ECCV] 69 | 70 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 71 | 72 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf)] 73 | 74 | **Multimodal Deep Learning.**[2011 ICML] 75 | 76 | *Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng* 77 | 78 | [[PDF](http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf)] 79 | 80 | **Extracting and composing robust features with denoising autoencoders.**[5th July, 2008] 81 | 82 | *Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol* 83 | 84 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/1390156.1390294)] 85 | 86 | **Multi-Gate Attention Network for Image Captioning.**[13th Mar, 2021] 87 | 88 | *Weitao Jiang, Xiying Li, Haifeng Hu, Qiang Lu, Bohong Liu* 89 | 90 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9382255)] 91 | 92 | **AMC: Attention guided Multi-modal Correlation Learning for Image Search.**[2017 CVPR] 93 | 94 | *Kan Chen, Trung Bui, Chen Fang, Zhaowen Wang, Ram Nevatia* 95 | 96 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_AMC_Attention_guided_CVPR_2017_paper.pdf)] 97 | 98 | **Video Captioning via Hierarchical Reinforcement Learning.**[2018 CVPR] 99 | 100 | *Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, William Yang Wang* 101 | 102 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf)] 103 | 104 | **Gaussian Process with Graph Convolutional Kernel for Relational Learning.**[14th Aug, 2021] 105 | 106 | *Jinyuan Fang, Shangsong Liang, Zaiqiao Meng, Qiang Zhang* 107 | 108 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/3447548.3467327)] 109 | 110 | **Multi-Relational Graph Representation Learning with Bayesian Gaussian Process Network.**[28th June, 2022] 111 | 112 | *Guanzheng Chen, Jinyuan Fang, Zaiqiao Meng, Qiang Zhang, Shangsong Liang* 113 | 114 | [[PDF](https://doi.org/10.1609/aaai.v36i5.20492)] 115 | 116 | # Pretraining-Approaches 117 | 118 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022] 119 | 120 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed* 121 | 122 | [[PDF](https://arxiv.org/abs/2201.02184)] 123 | 124 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022] 125 | 126 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao* 127 | 128 | 
[[PDF](https://arxiv.org/abs/2202.10936)] 129 | 130 | **Attention is All you Need.**[2017 NIPS] 131 | 132 | *Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin* 133 | 134 | [[PDF](https://proceedings.neurips.cc/paper/7181-attention-is-all)] 135 | 136 | **VinVL: Revisiting Visual Representations in Vision-Language Models.**[2021 CVPR] 137 | 138 | *Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao* 139 | 140 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.html)] 141 | 142 | **M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining.**[Aug, 2021] 143 | 144 | *Junyang Lin, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, Hongxia Yang* 145 | 146 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3447548.3467206)] 147 | 148 | **AMMU: A survey of transformer-based biomedical pretrained language models.**[23rd Mar, 2021] 149 | 150 | *Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, Sivanesan Sangeetha* 151 | 152 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1532046421003117)] 153 | 154 | **ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators** 155 | 156 | *Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning* 157 | 158 | [[PDF](https://arxiv.org/abs/2003.10555)] 159 | 160 | **RoBERTa: A Robustly Optimized BERT Pretraining Approach.**[26th Jul, 2019] 161 | 162 | *Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov* 163 | 164 | [[PDF](https://arxiv.org/abs/1907.11692)] 165 | 166 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 167 | 168 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 169 | 170 | [[PDF](https://arxiv.org/abs/1810.04805)] 171 | 172 | **BioBERT: a pre-trained biomedical language representation model for biomedical text mining.**[10th Sep, 2019] 173 | 174 | *Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang* 175 | 176 | [[PDF](https://academic.oup.com/bioinformatics/article-abstract/36/4/1234/5566506)] 177 | 178 | **HateBERT: Retraining BERT for Abusive Language Detection in English.**[23rd Oct, 2020] 179 | 180 | *Tommaso Caselli, Valerio Basile, Jelena Mitrovic, Michael Granitzer* 181 | 182 | [[PDF](https://arxiv.org/abs/2010.12472)] 183 | 184 | **InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training.**[15th Jul, 2020] 185 | 186 | *Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou* 187 | 188 | [[PDF](https://arxiv.org/abs/2007.07834)] 189 | 190 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020] 191 | 192 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi Matsumura* 193 | 194 | [[PDF](https://arxiv.org/abs/2005.07202)] 195 | 196 | **Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.**[23rd Apr, 2020] 197 | 198 | *Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. 
Smith* 199 | 200 | [[PDF](https://arxiv.org/abs/2004.10964)] 201 | 202 | **Knowledge Inheritance for Pre-trained Language Models.**[28th May, 2021] 203 | 204 | *Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou* 205 | 206 | [[PDF](https://arxiv.org/abs/2105.13880)] 207 | 208 | **Improving Language Understanding by Generative Pre-Training.**[2018] 209 | 210 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever* 211 | 212 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)] 213 | 214 | **Shuffled-token Detection for Refining Pre-trained RoBERTa** 215 | 216 | *Subhadarshi Panda, Anjali Agrawal, Jeewon Ha, Benjamin Bloch* 217 | 218 | [[PDF](https://aclanthology.org/2021.naacl-srw.12/)] 219 | 220 | **ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.**[26th Sep, 2019] 221 | 222 | *Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut* 223 | 224 | [[PDF](https://arxiv.org/abs/1909.11942)] 225 | 226 | **Exploring the limits of transfer learning with a unified text-to-text transformer.**[1st Jan, 2020] 227 | 228 | *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* 229 | 230 | [[PDF](https://dl.acm.org/doi/abs/10.5555/3455716.3455856)] 231 | 232 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020] 233 | 234 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko* 235 | 236 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)] 237 | 238 | **Deformable DETR: Deformable Transformers for End-to-End Object Detection.**[8th Oct, 2020] 239 | 240 | *Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai* 241 | 242 | [[PDF](https://arxiv.org/abs/2010.04159)] 243 | 244 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI] 245 | 246 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao* 247 | 248 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)] 249 | 250 | **VirTex: Learning Visual Representations From Textual Annotations.**[2021 CVPR] 251 | 252 | *Karan Desai, Justin Johnson* 253 | 254 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.html)] 255 | 256 | **Ernie-vil: Knowledge enhanced vision-language representations through scene graphs.**[2021 AAAI] 257 | 258 | *Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang* 259 | 260 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16431)] 261 | 262 | **OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks.**[24th Sep, 2020] 263 | 264 | *Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao* 265 | 266 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_8)] 267 | 268 | **Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision.**[14th Oct, 2020] 269 | 270 | *Hao Tan, Mohit Bansal* 271 | 272 | [[PDF](https://arxiv.org/abs/2010.06775)] 273 | 274 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV] 275 | 276 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. 
Caicedo, Julia Hockenmaier, Svetlana Lazebnik* 277 | 278 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)] 279 | 280 | **Distributed representations of words and phrases and their compositionality.**[2013 NIPS] 281 | 282 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean* 283 | 284 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)] 285 | 286 | **AllenNLP: A Deep Semantic Natural Language Processing Platform.**[20 Mar, 2018] 287 | 288 | *Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer* 289 | 290 | [[PDF](https://arxiv.org/abs/1803.07640)] 291 | 292 | **Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.**[Jul, 2020] 293 | 294 | *Emily M. Bender, Alexander Koller* 295 | 296 | [[PDF](https://aclanthology.org/2020.acl-main.463/)] 297 | 298 | **Experience Grounds Language.**[21st Apr, 2020] 299 | 300 | *Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian* 301 | 302 | [[PDF](https://arxiv.org/abs/2004.10151)] 303 | 304 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?** 305 | 306 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed* 307 | 308 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)] 309 | 310 | ## Unifying Architectures 311 | 312 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 313 | 314 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 315 | 316 | [[PDF](https://arxiv.org/abs/1810.04805)] 317 | 318 | **Improving Language Understanding by Generative Pre-Training.**[2018] 319 | 320 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever* 321 | 322 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)] 323 | 324 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020] 325 | 326 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko* 327 | 328 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)] 329 | 330 | **UNITER: UNiversal Image-TExt Representation Learning.**[24th Sep, 2020] 331 | 332 | *Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu* 333 | 334 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_7)] 335 | 336 | **UniT: Multimodal Multitask Learning with a Unified Transformer.**[2021 ICCV] 337 | 338 | *Ronghang Hu, Amanpreet Singh* 339 | 340 | [[PDF](https://openaccess.thecvf.com/content/ICCV2021/html/Hu_UniT_Multimodal_Multitask_Learning_With_a_Unified_Transformer_ICCV_2021_paper.html)] 341 | 342 | **VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.**[2021 NIPS] 343 | 344 | *Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong* 345 | 346 | [[PDF](https://proceedings.neurips.cc/paper/2021/hash/cb3213ada48302953cb0f166464ab356-Abstract.html)] 347 | 348 | **OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.**[2022 ICML] 349 | 350 | *Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, 
Jingren Zhou, Hongxia Yang* 351 | 352 | [[PDF](https://proceedings.mlr.press/v162/wang22al.html)] 353 | 354 | **BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.**[29th Oct, 2019] 355 | 356 | *Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer* 357 | 358 | [[PDF](https://arxiv.org/abs/1910.13461)] 359 | 360 | # Multimodal-Applications 361 | 362 | ## Understanding 363 | 364 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022] 365 | 366 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed* 367 | 368 | [[PDF](https://arxiv.org/abs/2201.02184)] 369 | 370 | **Self-Supervised Multimodal Opinion Summarization.**[27th May, 2021] 371 | 372 | *Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung* 373 | 374 | [[PDF](https://arxiv.org/abs/2105.13135)] 375 | 376 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?** 377 | 378 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed* 379 | 380 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)] 381 | 382 | **LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding.**[29th Dec, 2020] 383 | 384 | *Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou* 385 | 386 | [[PDF](https://arxiv.org/abs/2012.14740)] 387 | 388 | **StrucTexT: Structured text understanding with multi-modal transformers.**[17th Oct, 2021] 389 | 390 | *Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding* 391 | 392 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3474085.3475345)] 393 | 394 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019] 395 | 396 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. 
Jawahar* 397 | 398 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)] 399 | 400 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019] 401 | 402 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran* 403 | 404 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)] 405 | 406 | **XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding.**[2022 CVPR] 407 | 408 | *Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, Liqing Zhang* 409 | 410 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Gu_XYLayoutLM_Towards_Layout-Aware_Multimodal_Networks_for_Visually-Rich_Document_Understanding_CVPR_2022_paper.html)] 411 | 412 | **Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos.**[2020 EMNLP] 413 | 414 | *Nayu Liu, Xian Sun, Hongfeng Yu, Wenkai Zhang, Guangluan Xu* 415 | 416 | [[PDF](https://aclanthology.org/2020.emnlp-main.144/)] 417 | 418 | **Multimodal Abstractive Summarization for How2 Videos.**[19th Jun, 2019] 419 | 420 | *Shruti Palaskar, Jindrich Libovicky, Spandana Gella, Florian Metze* 421 | 422 | [[PDF](https://arxiv.org/abs/1906.07901)] 423 | 424 | **Vision guided generative pre-trained language models for multimodal abstractive summarization.**[6th Sep, 2021] 425 | 426 | *Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung* 427 | 428 | [[PDF](https://arxiv.org/abs/2109.02401)] 429 | 430 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018] 431 | 432 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze* 433 | 434 | [[PDF](https://arxiv.org/abs/1811.00347)] 435 | 436 | **wav2vec 2.0: A framework for self-supervised learning of speech representations.**[2020 NIPS] 437 | 438 | *Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli* 439 | 440 | [[PDF](https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html)] 441 | 442 | **DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization.**[11th Dec, 2020] 443 | 444 | *Shaoshi Ling, Yuzong Liu* 445 | 446 | [[PDF](https://arxiv.org/abs/2012.06659)] 447 | 448 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018] 449 | 450 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 451 | 452 | [[PDF](https://arxiv.org/abs/1809.00496)] 453 | 454 | **Recurrent Neural Network Transducer for Audio-Visual Speech Recognition.**[Dec 2019] 455 | 456 | *Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan* 457 | 458 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9004036/)] 459 | 460 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR] 461 | 462 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 463 | 464 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)] 465 | 466 | **On the importance of super-Gaussian speech priors for machine-learning based speech enhancement.**[28th Nov, 2017] 467 | 468 | *Robert Rehr, Timo Gerkmann* 469 | 470 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8121999/)] 471 | 472 | **Active appearance models.**[1998 ECCV] 473 | 474 | *T. F. Cootes, G. J. Edwards, C. J. 
Taylor* 475 | 476 | [[PDF](https://link.springer.com/chapter/10.1007/BFb0054760)] 477 | 478 | **Leveraging category information for single-frame visual sound source separation.**[20th Jul, 2021] 479 | 480 | *Lingyu Zhu, Esa Rahtu* 481 | 482 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9484036/)] 483 | 484 | **The Sound of Pixels.**[2018 ECCV] 485 | 486 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba* 487 | 488 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)] 489 | 490 | ## Classification 491 | 492 | **VQA: Visual Question Answering.**[2015 ICCV] 493 | 494 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh* 495 | 496 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 497 | 498 | **Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news.**[1st Jul, 2016] 499 | 500 | *Erin Hea-Jin Kim, Yoo Kyung Jeong, Yuyong Kim, Keun Young Kang, Min Song* 501 | 502 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0165551515608733)] 503 | 504 | **On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis.**[6th Jul, 2017] 505 | 506 | *Jose Camacho-Collados, Mohammad Taher Pilehvar* 507 | 508 | [[PDF](https://arxiv.org/abs/1707.01780)] 509 | 510 | **Market strategies used by processed food manufacturers to increase and consolidate their power: a systematic review and document analysis.**[26th Jan, 2021] 511 | 512 | *Benjamin Wood, Owain Williams, Vijaya Nagarajan, Gary Sacks* 513 | 514 | [[PDF](https://globalizationandhealth.biomedcentral.com/articles/10.1186/s12992-021-00667-7)] 515 | 516 | **SWAFN: Sentimental words aware fusion network for multimodal sentiment analysis.**[2020 COLING] 517 | 518 | *Minping Chen, Xia Li* 519 | 520 | [[PDF](https://aclanthology.org/2020.coling-main.93/)] 521 | 522 | **Adaptive online event detection in news streams.**[15th Dec, 2017] 523 | 524 | *Linmei Hu, Bin Zhang, Lei Hou, Juanzi Li* 525 | 526 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705117304550)] 527 | 528 | **Multi-source multimodal data and deep learning for disaster response: A systematic review.**[27th Nov, 2021] 529 | 530 | *Nilani Algiriyage, Raj Prasanna, Kristin Stock, Emma E. H. Doyle, David Johnston* 531 | 532 | [[PDF](https://link.springer.com/article/10.1007/s42979-021-00971-4)] 533 | 534 | **A Survey of Data Representation for Multi-Modality Event Detection and Evolution.**[2nd Nov, 2021] 535 | 536 | *Kejing Xiao, Zhaopeng Qian, Biao Qin* 
537 | 538 | [[PDF](https://www.mdpi.com/2076-3417/12/4/2204)] 539 | 540 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018] 541 | 542 | *Firoj Alam, Ferda Ofli, Muhammad Imran* 543 | 544 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)] 545 | 546 | **Multi-modal generative adversarial networks for traffic event detection in smart cities.**[1st Sep, 2021] 547 | 548 | *Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen* 549 | 550 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0957417421003808)] 551 | 552 | **Proppy: Organizing the news based on their propagandistic content.**[5th Sep, 2019] 553 | 554 | *Alberto Barron-Cedeno, Israa Jaradat, Giovanni Da San Martino, Preslav Nakov* 555 | 556 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0306457318306058)] 557 | 558 | **Fine-grained analysis of propaganda in news articles.**[Nov 2019] 559 | 560 | *Giovanni Da San Martino, Seunghak Yu, Alberto Barron-Cedeno, Rostislav Petrov, Preslav Nakov* 561 | 562 | [[PDF](https://aclanthology.org/D19-1565/)] 563 | 564 | **Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs.**[Oct, 2017] 565 | 566 | *Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, Jiebo Luo* 567 | 568 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123454)] 569 | 570 | **SAFE: Similarity-Aware Multi-modal Fake News Detection.**[6th May, 2020] 571 | 572 | *Xinyi Zhou, Jindi Wu, Reza Zafarani* 573 | 574 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-47436-2_27)] 575 | 576 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR] 577 | 578 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi* 579 | 580 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)] 581 | 582 | **KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning.**[27th Oct, 2021] 583 | 584 | *Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, Lejian Liao* 585 | 586 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705121006705)] 587 | 588 | **LXMERT: Learning Cross-Modality Encoder Representations from Transformers.**[20th Aug, 2019] 589 | 590 | *Hao Tan, Mohit Bansal* 591 | 592 | [[PDF](https://arxiv.org/abs/1908.07490)] 593 | 594 | **Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers.**[2 Apr, 2020] 595 | 596 | *Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu* 597 | 598 | [[PDF](https://arxiv.org/abs/2004.00849)] 599 | 600 | **Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks.**[2020 CVPR] 601 | 602 | *Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang* 603 | 604 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Zhu_Vision-Language_Navigation_With_Self-Supervised_Auxiliary_Reasoning_Tasks_CVPR_2020_paper.html)] 605 | 606 | ## Generation 607 | 608 | **Recent advances and trends in multimodal deep learning: A review.**[24th May, 2021] 609 | 610 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul* 611 | 612 | [[PDF](https://arxiv.org/abs/2105.11087)] 613 | 614 | **VQA: Visual Question Answering.**[2015 ICCV] 615 | 616 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. 
Lawrence Zitnick, Devi Parikh* 617 | 618 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 619 | 620 | **Microsoft coco: Common objects in context.**[2014 ECCV] 621 | 622 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 623 | 624 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf%090.949.pdf)] 625 | 626 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 627 | 628 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 629 | 630 | [[PDF](https://arxiv.org/abs/1810.04805)] 631 | 632 | **Distributed Representations of Words and Phrases and their Compositionality.**[2013 NIPS] 633 | 634 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean* 635 | 636 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)] 637 | 638 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018] 639 | 640 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 641 | 642 | [[PDF](https://arxiv.org/abs/1809.00496)] 643 | 644 | **A lip sync expert is all you need for speech to lip generation in the wild.**[12th Oct, 2019] 645 | 646 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 647 | 648 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3394171.3413532)] 649 | 650 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI] 651 | 652 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao* 653 | 654 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)] 655 | 656 | **Show and Tell: A Neural Image Caption Generator.**[2015 CVPR] 657 | 658 | *Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan* 659 | 660 | [[PDF](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html)] 661 | 662 | **SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning.**[2017 CVPR] 663 | 664 | *Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua* 665 | 666 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.html)] 667 | 668 | **Self-Critical Sequence Training for Image Captioning.**[2017 CVPR] 669 | 670 | *Steven J. 
Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel* 671 | 672 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.html)] 673 | 674 | **Visual question answering: A survey of methods and datasets.**[Oct, 2017] 675 | 676 | *Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel* 677 | 678 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1077314217300772)] 679 | 680 | **How to find a good image-text embedding for remote sensing visual question answering?**[24th Sep, 2021] 681 | 682 | *Christel Chappuis, Sylvain Lobry, Benjamin Kellenberger, Bertrand Le Saux, Devis Tuia* 683 | 684 | [[PDF](https://arxiv.org/abs/2109.11848)] 685 | 686 | **An Improved Attention for Visual Question Answering.**[2021 CVPR] 687 | 688 | *Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini* 689 | 690 | [[PDF](https://openaccess.thecvf.com/content/CVPR2021W/MULA/html/Rahman_An_Improved_Attention_for_Visual_Question_Answering_CVPRW_2021_paper.html)] 691 | 692 | **Analyzing Compositionality of Visual Question Answering.**[2019 NIPS] 693 | 694 | *Sanjay Subramanian, Sameer Singh, Matt Gardner* 695 | 696 | [[PDF](https://vigilworkshop.github.io/static/papers-2019/43.pdf)] 697 | 698 | **OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge.**[2019 CVPR] 699 | 700 | *Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi* 701 | 702 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.html)] 703 | 704 | **MultiBench: Multiscale Benchmarks for Multimodal Representation Learning.**[15th Jul, 2021] 705 | 706 | *Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, 707 | Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, 708 | Ruslan Salakhutdinov, Louis-Philippe Morency* 709 | 710 | [[PDF](https://arxiv.org/abs/2107.07502)] 711 | 712 | **Benchmarking Multimodal AutoML for Tabular Data with Text Fields.**[4th Nov, 2021] 713 | 714 | *Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. 
Smola* 715 | 716 | [[PDF](https://arxiv.org/abs/2111.02705)] 717 | 718 | **Multimodal Explanations: Justifying Decisions and Pointing to the Evidence.**[2018 CVPR] 719 | 720 | *Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach* 721 | 722 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Park_Multimodal_Explanations_Justifying_CVPR_2018_paper.html)] 723 | 724 | **Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering.**[2018 CVPR] 725 | 726 | *Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi* 727 | 728 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Agrawal_Dont_Just_Assume_CVPR_2018_paper.html)] 729 | 730 | **Generative Adversarial Text to Image Synthesis.**[2016 ICML] 731 | 732 | *Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee* 733 | 734 | [[PDF](http://proceedings.mlr.press/v48/reed16.html)] 735 | 736 | **The Caltech-UCSD Birds-200-2011 Dataset.** 737 | 738 | [[PDF](https://authors.library.caltech.edu/27452/)] 739 | 740 | **AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks.**[2018 CVPR] 741 | 742 | *Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He* 743 | 744 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Xu_AttnGAN_Fine-Grained_Text_CVPR_2018_paper.html)] 745 | 746 | **LipSound: Neural Mel-spectrogram Reconstruction for Lip Reading.**[15 Sep, 2019] 747 | 748 | *Leyuan Qu, Cornelius Weber, Stefan Wermter* 749 | 750 | [[PDF](https://www.isca-speech.org/archive_v0/Interspeech_2019/pdfs/1393.pdf)] 751 | 752 | **The Conversation: Deep Audio-Visual Speech Enhancement.**[11th Apr, 2018] 753 | 754 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 755 | 756 | [[PDF](https://arxiv.org/abs/1804.04121)] 757 | 758 | **TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech.**[5th May, 2015] 759 | 760 | *Naomi Harte, Eoin Gillen* 761 | 762 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7050271/)] 763 | 764 | **Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.**[20 Oct, 2017] 765 | 766 | *Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arık, Ajay Kannan, Sharan Narang* 767 | 768 | [[PDF](https://arxiv.org/abs/1710.07654)] 769 | 770 | **Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions.**[15th Apr, 2018] 771 | 772 | *Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. 
Saurous, Yannis Agiomyrgiannakis, Yonghui Wu* 773 | 774 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461368/)] 775 | 776 | **Vid2speech: Speech reconstruction from silent video.**[5th Mar, 2017] 777 | 778 | *Ariel Ephrat, Shmuel Peleg* 779 | 780 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7953127/)] 781 | 782 | **Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video.**[15th Apr, 2018] 783 | 784 | *Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani* 785 | 786 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461856/)] 787 | 788 | **Video-Driven Speech Reconstruction using Generative Adversarial Networks.**[14th Jun, 2019] 789 | 790 | *Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic* 791 | 792 | [[PDF](https://arxiv.org/abs/1906.06301)] 793 | 794 | ## Retrieval 795 | 796 | **ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.**[2019 NIPS] 797 | 798 | *Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee* 799 | 800 | [[PDF](https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html)] 801 | 802 | **Learning Robust Patient Representations from Multi-modal Electronic Health Records: A Supervised Deep Learning Approach.**[2021] 803 | 804 | *Leman Akoglu, Evimaria Terzi, Xianli Zhang, Buyue Qian, Yang Liu, Xi Chen, Chong Guan, Chen Li* 805 | 806 | [[PDF](https://epubs.siam.org/doi/abs/10.1137/1.9781611976700.66)] 807 | 808 | **Referring Expression Comprehension: A Survey of Methods and Datasets.**[7th Dec, 2020] 809 | 810 | *Yanyuan Qiao, Chaorui Deng, Qi Wu* 811 | 812 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9285213/)] 813 | 814 | **VL-BERT: Pre-training of Generic Visual-Linguistic Representations.**[22nd Aug, 2019] 815 | 816 | *Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai* 817 | 818 | [[PDF](https://arxiv.org/abs/1908.08530)] 819 | 820 | **Clinically Accurate Chest X-Ray Report Generation.**[2019 MLHC] 821 | 822 | *Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, Marzyeh Ghassemi* 823 | 824 | [[PDF](http://proceedings.mlr.press/v106/liu19a.html)] 825 | 826 | ## Translation 827 | 828 | **Deep Residual Learning for Image Recognition.**[2016 CVPR] 829 | 830 | *Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun* 831 | 832 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)] 833 | 834 | **Probing the Need for Visual Context in Multimodal Machine Translation.**[20th Mar, 2019] 835 | 836 | *Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loic Barrault* 837 | 838 | [[PDF](https://arxiv.org/abs/1903.08678)] 839 | 840 | **Neural Machine Translation by Jointly Learning to Align and Translate.**[1st Sep, 2014] 841 | 842 | *Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio* 843 | 844 | [[PDF](https://arxiv.org/abs/1409.0473)] 845 | 846 | **Multi-modal neural machine translation with deep semantic interactions.**[Apr, 2021] 847 | 848 | *Jinsong Su, Jinchang Chen, Hui Jiang, Chulun Zhou, Huan Lin, Yubin Ge, Qingqiang Wu, Yongxuan Lai* 849 | 850 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0020025520311105)] 851 | 852 | # Multimodal-Datasets 853 | 854 | **VQA: Visual Question Answering.**[2015 ICCV] 855 | 856 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. 
Lawrence Zitnick, Devi Parikh* 857 | 858 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 859 | 860 | **Microsoft coco: Common objects in context.**[2014 ECCV] 861 | 862 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 863 | 864 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf%090.949.pdf)] 865 | 866 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020] 867 | 868 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi, Matsumura* 869 | 870 | [[PDF](https://arxiv.org/abs/2005.07202)] 871 | 872 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV] 873 | 874 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik* 875 | 876 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)] 877 | 878 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019] 879 | 880 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar* 881 | 882 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)] 883 | 884 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019] 885 | 886 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran* 887 | 888 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)] 889 | 890 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018] 891 | 892 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze* 893 | 894 | [[PDF](https://arxiv.org/abs/1811.00347)] 895 | 896 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR] 897 | 898 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 899 | 900 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)] 901 | 902 | **The Sound of Pixels.**[2018 ECCV] 903 | 904 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba* 905 | 906 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)] 907 | 908 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018] 909 | 910 | *Firoj Alam, Ferda Ofli, Muhammad Imran* 911 | 912 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)] 913 | 914 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR] 915 | 916 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi* 917 | 918 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)] 919 | 920 | **The Caltech-UCSD Birds-200-2011 Dataset.** 921 | 922 | [[PDF](https://authors.library.caltech.edu/27452/)] 923 | 924 | **Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics.**[30th Aug, 2013] 925 | 926 | *M. Hodosh, P. Young, J. 
Hockenmaier* 927 | 928 | [[PDF](https://www.jair.org/index.php/jair/article/view/10833)] 929 | 930 | **Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph.**[Jul, 2018] 931 | 932 | *AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-Philippe Morency* 933 | 934 | [[PDF](https://aclanthology.org/P18-1208/)] 935 | 936 | **MIMIC-III, a freely accessible critical care database.**[24th May, 2016] 937 | 938 | *Alistair E.W. Johnson, Tom J. Pollard, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Leo Anthony Celi & Roger G. Mark* 939 | 940 | [[PDF](https://www.nature.com/articles/sdata201635)] 941 | 942 | **Fashion 200K Benchmark** 943 | 944 | [[Github](https://github.com/xthan/fashion-200k)] 945 | 946 | **Indoor scene segmentation using a structured light sensor.**[Nov 2011] 947 | 948 | *Nathan Silberman, Rob Fergus* 949 | 950 | [[PDF](https://ieeexplore.ieee.org/abstract/document/6130298/)] 951 | 952 | **Indoor Segmentation and Support Inference from RGBD Images.**[2012 ECCV] 953 | 954 | *Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus* 955 | 956 | [[PDF](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/shkf_eccv2012.pdf)] 957 | 958 | **Good News, Everyone! Context Driven Entity-Aware Captioning for News Images.**[2019 CVPR] 959 | 960 | *Ali Furkan Biten, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas* 961 | 962 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.html)] 963 | 964 | **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language.**[2016 CVPR] 965 | 966 | *Jun Xu, Tao Mei, Ting Yao, Yong Rui* 967 | 968 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/Xu_MSR-VTT_A_Large_CVPR_2016_paper.html)] 969 | 970 | **Video Question Answering via Gradually Refined Attention over Appearance and Motion.**[Oct, 2017] 971 | 972 | *Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, Yueting Zhuang* 973 | 974 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123427)] 975 | 976 | **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.**[2017 CVPR] 977 | 978 | *Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim* 979 | 980 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Jang_TGIF-QA_Toward_Spatio-Temporal_CVPR_2017_paper.html)] 981 | 982 | **Multi-Target Embodied Question Answering.**[2019 CVPR] 983 | 984 | *Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra* 985 | 986 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Yu_Multi-Target_Embodied_Question_Answering_CVPR_2019_paper.html)] 987 | 988 | **VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering.**[14th Aug, 2019] 989 | 990 | *Catalina Cangea, Eugene Belilovsky, Pietro Lio, Aaron Courville* 991 | 992 | [[PDF](https://arxiv.org/abs/1908.04950)] 993 | 994 | **An Analysis of Visual Question Answering Algorithms.**[2017 ICCV] 995 | 996 | *Kushal Kafle, Christopher Kanan* 997 | 998 | [[PDF](http://openaccess.thecvf.com/content_iccv_2017/html/Kafle_An_Analysis_of_ICCV_2017_paper.html)] 999 | 1000 | **nuScenes: A Multimodal Dataset for Autonomous Driving.**[2020 CVPR] 1001 | 1002 | *Holger Caesar, Varun Bankiti, Alex H. 
Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom* 1003 | 1004 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Caesar_nuScenes_A_Multimodal_Dataset_for_Autonomous_Driving_CVPR_2020_paper.html)] 1005 | 1006 | **Automated Flower Classification over a Large Number of Classes.**[20th Jan, 2009] 1007 | 1008 | *Maria-Elena Nilsback, Andrew Zisserman* 1009 | 1010 | [[PDF](https://ieeexplore.ieee.org/abstract/document/4756141/)] 1011 | 1012 | **MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos.**[20th Jun, 2016] 1013 | 1014 | *Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency* 1015 | 1016 | [[PDF](https://arxiv.org/abs/1606.06259)] 1017 | 1018 | **Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition.**[18th Jan, 2020] 1019 | 1020 | *Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen* 1021 | 1022 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9320240/)] 1023 | 1024 | **The MIT Stata Center dataset.** [2013] 1025 | 1026 | *Maurice Fallon, Hordur Johannsson, Michael Kaess and John J Leonard* 1027 | 1028 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0278364913509035)] 1029 | 1030 | **data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.**[2022 ICML] 1031 | 1032 | *Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli* 1033 | 1034 | [[PDF](https://proceedings.mlr.press/v162/baevski22a.html)] 1035 | 1036 | **FLAVA: A Foundational Language and Vision Alignment Model.**[2022 CVPR] 1037 | 1038 | *Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela* 1039 | 1040 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Singh_FLAVA_A_Foundational_Language_and_Vision_Alignment_Model_CVPR_2022_paper.html)] 1041 | 1042 | **UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.**[2021 CVPR] 1043 | 1044 | *Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu* 1045 | 1046 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.html)] 1047 | 1048 | # Citation 1049 | If you find the listing and survey useful for your work, please cite the paper: 1050 | ``` 1051 | @article{manzoor2023multimodality, 1052 | title={Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications}, 1053 | author={Manzoor, Muhammad Arslan and Albarri, Sarah and Xian, Ziting and Meng, Zaiqiao and Nakov, Preslav and Liang, Shangsong}, 1054 | journal={arXiv preprint arXiv:2302.00389}, 1055 | year={2023} 1056 | } 1057 | ``` 1058 | --------------------------------------------------------------------------------