├── Fig_1_MML.png └── README.md /Fig_1_MML.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marslanm/Multimodality-Representation-Learning/c02e70f6e736804639255816bf04f157f1c81148/Fig_1_MML.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications 2 | ![a](Fig_1_MML.png) 3 | ## Multimodal Deep Learning based Research 4 | 5 | - [Survey Papers](#survey) 6 | - [Task-specific Methods](#Task-specific-Methods) 7 | - [Pretraining Approaches](#Pretraining-Approaches) 8 | - [Multimodal Applications](#Multimodal-Applications) ([Understanding](#Understanding), [Classification](#Classification), [Generation](#Generation), [Retrieval](#Retrieval), [Translation](#Translation)) 9 | - [Multimodal Datasets](#Multimodal-Datasets) 10 | 11 | # Survey 12 | 13 | [**Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications.**](https://dl.acm.org/doi/abs/10.1145/3617833)
14 | *[Muhammad Arslan Manzoor](https://scholar.google.com/citations?hl=en&user=ZvXClnUAAAAJ), [Sarah Albarri](), [Ziting Xian](https://scholar.google.com/citations?hl=zh-CN&user=G7VId5YAAAAJ&view_op=list_works&gmla=AJsN-F6TTZmzbi9CmBIRLNRpAhcgmzH-nOUd8hM5UTjfT5A_mYW2ABzjSrdX7ki9GFgGaId2dLlMtXBkfq7X_qzYOwF_OuvCCthMiuVNUuUiac-aGoSwsKQ), [Zaiqiao Meng](https://scholar.google.com/citations?user=5jJKFVcAAAAJ&hl=en), [Preslav Nakov](https://scholar.google.com/citations?user=DfXsKZ4AAAAJ&hl=en), and [Shangsong Liang](https://scholar.google.com/citations?user=4uggVcIAAAAJ&h).*
15 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3617833)] 16 | 17 | **Vision-Language Pre-training: Basics, Recent Advances, and Future Trends.**[17th Oct, 2022]
18 | *Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao.*
19 | [[PDF](https://arxiv.org/pdf/2210.09263.pdf)] 20 | 21 | **VLP: A survey on vision-language pre-training.**[18th Feb, 2022]
22 | *Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, and Bo Xu.*
23 | [[PDF](https://link.springer.com/article/10.1007/s11633-022-1369-5)] 24 | 25 | 26 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022]
27 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao.*
28 | [[PDF](https://arxiv.org/abs/2202.10936)] 29 | 30 | 31 | **Vision-and-Language Pretrained Models: A Survey.**[15th Apr, 2022]
32 | *Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang.*
33 | [[PDF](https://arxiv.org/abs/2204.07356)] 34 | 35 | **Comprehensive reading list for Multimodal Literature**
36 | [[Github](https://github.com/pliang279/awesome-multimodal-ml#survey-papers)] 37 | 38 | **Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.**[28th Jul, 2021]
39 | *Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig*
40 | [[PDF](https://arxiv.org/abs/2107.13586)] 41 | 42 | **Recent Advances and Trends in Multimodal Deep Learning: A Review.**[24th May, 2021]
43 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul.*
44 | [[PDF](https://arxiv.org/pdf/2105.11087)] 45 | 46 | 47 | 48 | # Task-specific-Methods 49 | 50 | **Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network.**[9th Feb, 2021] 51 | 52 | *Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji* 53 | 54 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16258)] 55 | 56 | **Cascaded Recurrent Neural Networks for Hyperspectral Image Classification.**[Aug, 2019] 57 | 58 | *Renlong Hang, Qingshan Liu, Danfeng Hong, Pedram Ghamisi* 59 | 60 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8662780)] 61 | 62 | **Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.**[2015 NIPS] 63 | 64 | *Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun* 65 | 66 | [[PDF](https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html)] 67 | 68 | **Microsoft coco: Common objects in context.**[2014 ECCV] 69 | 70 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 71 | 72 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf)] 73 | 74 | **Multimodal Deep Learning.**[2011 ICML] 75 | 76 | *Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng* 77 | 78 | [[PDF](http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf)] 79 | 80 | **Extracting and composing robust features with denoising autoencoders.**[5th July, 2008] 81 | 82 | *Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol* 83 | 84 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/1390156.1390294)] 85 | 86 | **Multi-Gate Attention Network for Image Captioning.**[13th Mar, 2021] 87 | 88 | *Weitao Jiang, Xiying Li, Haifeng Hu, Qiang Lu, Bohong Liu* 89 | 90 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9382255)] 91 | 92 | **AMC: Attention guided Multi-modal Correlation Learning for Image Search.**[2017 CVPR] 93 | 94 | *Kan Chen, Trung Bui, Chen Fang, Zhaowen Wang, Ram Nevatia* 95 | 96 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_AMC_Attention_guided_CVPR_2017_paper.pdf)] 97 | 98 | **Video Captioning via Hierarchical Reinforcement Learning.**[2018 CVPR] 99 | 100 | *Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, William Yang Wang* 101 | 102 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf)] 103 | 104 | **Gaussian Process with Graph Convolutional Kernel for Relational Learning.**[14th Aug, 2021] 105 | 106 | *Jinyuan Fang, Shangsong Liang, Zaiqiao Meng, Qiang Zhang* 107 | 108 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/3447548.3467327)] 109 | 110 | **Multi-Relational Graph Representation Learning with Bayesian Gaussian Process Network.**[28th June, 2022] 111 | 112 | *Guanzheng Chen, Jinyuan Fang, Zaiqiao Meng, Qiang Zhang, Shangsong Liang* 113 | 114 | [[PDF](https://doi.org/10.1609/aaai.v36i5.20492)] 115 | 116 | # Pretraining-Approaches 117 | 118 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022] 119 | 120 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed* 121 | 122 | [[PDF](https://arxiv.org/abs/2201.02184)] 123 | 124 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022] 125 | 126 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao* 127 | 128 | 
[[PDF](https://arxiv.org/abs/2202.10936)] 129 | 130 | **Attention is All you Need.**[2017 NIPS] 131 | 132 | *Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin* 133 | 134 | [[PDF](https://proceedings.neurips.cc/paper/7181-attention-is-all)] 135 | 136 | **VinVL: Revisiting Visual Representations in Vision-Language Models.**[2021 CVPR] 137 | 138 | *Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao* 139 | 140 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.html)] 141 | 142 | **M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining.**[Aug, 2021] 143 | 144 | *Junyang Lin, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, Hongxia Yang* 145 | 146 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3447548.3467206)] 147 | 148 | **AMMU: A survey of transformer-based biomedical pretrained language models.**[23rd Mar, 2021] 149 | 150 | *Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, Sivanesan Sangeetha* 151 | 152 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1532046421003117)] 153 | 154 | **ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators** 155 | 156 | *Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning* 157 | 158 | [[PDF](https://arxiv.org/abs/2003.10555)] 159 | 160 | **RoBERTa: A Robustly Optimized BERT Pretraining Approach.**[26th Jul, 2019] 161 | 162 | *Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov* 163 | 164 | [[PDF](https://arxiv.org/abs/1907.11692)] 165 | 166 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 167 | 168 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 169 | 170 | [[PDF](https://arxiv.org/abs/1810.04805)] 171 | 172 | **BioBERT: a pre-trained biomedical language representation model for biomedical text mining.**[10th Sep, 2019] 173 | 174 | *Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang* 175 | 176 | [[PDF](https://academic.oup.com/bioinformatics/article-abstract/36/4/1234/5566506)] 177 | 178 | **HateBERT: Retraining BERT for Abusive Language Detection in English.**[23rd Oct, 2020] 179 | 180 | *Tommaso Caselli, Valerio Basile, Jelena Mitrovic, Michael Granitzer* 181 | 182 | [[PDF](https://arxiv.org/abs/2010.12472)] 183 | 184 | **InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training.**[15th Jul, 2020] 185 | 186 | *Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou* 187 | 188 | [[PDF](https://arxiv.org/abs/2007.07834)] 189 | 190 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020] 191 | 192 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi Matsumura* 193 | 194 | [[PDF](https://arxiv.org/abs/2005.07202)] 195 | 196 | **Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.**[23rd Apr, 2020] 197 | 198 | *Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. 
Smith* 199 | 200 | [[PDF](https://arxiv.org/abs/2004.10964)] 201 | 202 | **Knowledge Inheritance for Pre-trained Language Models.**[28th May, 2021] 203 | 204 | *Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou* 205 | 206 | [[PDF](https://arxiv.org/abs/2105.13880)] 207 | 208 | **Improving Language Understanding by Generative Pre-Training.**[2018] 209 | 210 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever* 211 | 212 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)] 213 | 214 | **Shuffled-token Detection for Refining Pre-trained RoBERTa** 215 | 216 | *Subhadarshi Panda, Anjali Agrawal, Jeewon Ha, Benjamin Bloch* 217 | 218 | [[PDF](https://aclanthology.org/2021.naacl-srw.12/)] 219 | 220 | **ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.**[26th Sep, 2019] 221 | 222 | *Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut* 223 | 224 | [[PDF](https://arxiv.org/abs/1909.11942)] 225 | 226 | **Exploring the limits of transfer learning with a unified text-to-text transformer.**[1st Jan, 2020] 227 | 228 | *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* 229 | 230 | [[PDF](https://dl.acm.org/doi/abs/10.5555/3455716.3455856)] 231 | 232 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020] 233 | 234 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko* 235 | 236 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)] 237 | 238 | **Deformable DETR: Deformable Transformers for End-to-End Object Detection.**[8th Oct, 2020] 239 | 240 | *Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai* 241 | 242 | [[PDF](https://arxiv.org/abs/2010.04159)] 243 | 244 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI] 245 | 246 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao* 247 | 248 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)] 249 | 250 | **VirTex: Learning Visual Representations From Textual Annotations.**[2021 CVPR] 251 | 252 | *Karan Desai, Justin Johnson* 253 | 254 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.html)] 255 | 256 | **Ernie-vil: Knowledge enhanced vision-language representations through scene graphs.**[2021 AAAI] 257 | 258 | *Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang* 259 | 260 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16431)] 261 | 262 | **OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks.**[24th Sep, 2020] 263 | 264 | *Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao* 265 | 266 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_8)] 267 | 268 | **Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision.**[14th Oct, 2020] 269 | 270 | *Hao Tan, Mohit Bansal* 271 | 272 | [[PDF](https://arxiv.org/abs/2010.06775)] 273 | 274 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV] 275 | 276 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. 
Caicedo, Julia Hockenmaier, Svetlana Lazebnik* 277 | 278 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)] 279 | 280 | **Distributed representations of words and phrases and their compositionality.**[2013 NIPS] 281 | 282 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean* 283 | 284 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)] 285 | 286 | **AllenNLP: A Deep Semantic Natural Language Processing Platform.**[20 Mar, 2018] 287 | 288 | *Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer* 289 | 290 | [[PDF](https://arxiv.org/abs/1803.07640)] 291 | 292 | **Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.**[Jul, 2020] 293 | 294 | *Emily M. Bender, Alexander Koller* 295 | 296 | [[PDF](https://aclanthology.org/2020.acl-main.463/)] 297 | 298 | **Experience Grounds Language.**[21st Apr, 2020] 299 | 300 | *Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian* 301 | 302 | [[PDF](https://arxiv.org/abs/2004.10151)] 303 | 304 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?** 305 | 306 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed* 307 | 308 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)] 309 | 310 | ## Unifying Architectures 311 | 312 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 313 | 314 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 315 | 316 | [[PDF](https://arxiv.org/abs/1810.04805)] 317 | 318 | **Improving Language Understanding by Generative Pre-Training.**[2018] 319 | 320 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever* 321 | 322 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)] 323 | 324 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020] 325 | 326 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko* 327 | 328 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)] 329 | 330 | **UNITER: UNiversal Image-TExt Representation Learning.**[24th Sep, 2020] 331 | 332 | *Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu* 333 | 334 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_7)] 335 | 336 | **UniT: Multimodal Multitask Learning with a Unified Transformer.**[2021 ICCV] 337 | 338 | *Ronghang Hu, Amanpreet Singh* 339 | 340 | [[PDF](https://openaccess.thecvf.com/content/ICCV2021/html/Hu_UniT_Multimodal_Multitask_Learning_With_a_Unified_Transformer_ICCV_2021_paper.html)] 341 | 342 | **VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.**[2021 NIPS] 343 | 344 | *Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong* 345 | 346 | [[PDF](https://proceedings.neurips.cc/paper/2021/hash/cb3213ada48302953cb0f166464ab356-Abstract.html)] 347 | 348 | **OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.**[2022 ICML] 349 | 350 | *Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, 
Jingren Zhou, Hongxia Yang* 351 | 352 | [[PDF](https://proceedings.mlr.press/v162/wang22al.html)] 353 | 354 | **BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.**[29th Oct, 2019] 355 | 356 | *Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer* 357 | 358 | [[PDF](https://arxiv.org/abs/1910.13461)] 359 | 360 | # Multimodal-Applications 361 | 362 | ## Understanding 363 | 364 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022] 365 | 366 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed* 367 | 368 | [[PDF](https://arxiv.org/abs/2201.02184)] 369 | 370 | **Self-Supervised Multimodal Opinion Summarization.**[27th May, 2021] 371 | 372 | *Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung* 373 | 374 | [[PDF](https://arxiv.org/abs/2105.13135)] 375 | 376 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?** 377 | 378 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed* 379 | 380 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)] 381 | 382 | **LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding.**[29th Dec, 2020] 383 | 384 | *Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou* 385 | 386 | [[PDF](https://arxiv.org/abs/2012.14740)] 387 | 388 | **StrucTexT: Structured text understanding with multi-modal transformers.**[17th Oct, 2021] 389 | 390 | *Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding* 391 | 392 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3474085.3475345)] 393 | 394 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019] 395 | 396 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. 
Jawahar* 397 | 398 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)] 399 | 400 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019] 401 | 402 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran* 403 | 404 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)] 405 | 406 | **XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding.**[2022 CVPR] 407 | 408 | *Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, Liqing Zhang* 409 | 410 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Gu_XYLayoutLM_Towards_Layout-Aware_Multimodal_Networks_for_Visually-Rich_Document_Understanding_CVPR_2022_paper.html)] 411 | 412 | **Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos.**[2020 EMNLP] 413 | 414 | *Nayu Liu, Xian Sun, Hongfeng Yu, Wenkai Zhang, Guangluan Xu* 415 | 416 | [[PDF](https://aclanthology.org/2020.emnlp-main.144/)] 417 | 418 | **Multimodal Abstractive Summarization for How2 Videos.**[19th Jun, 2019] 419 | 420 | *Shruti Palaskar, Jindrich Libovicky, Spandana Gella, Florian Metze* 421 | 422 | [[PDF](https://arxiv.org/abs/1906.07901)] 423 | 424 | **Vision guided generative pre-trained language models for multimodal abstractive summarization.**[6th Sep, 2021] 425 | 426 | *Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung* 427 | 428 | [[PDF](https://arxiv.org/abs/2109.02401)] 429 | 430 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018] 431 | 432 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze* 433 | 434 | [[PDF](https://arxiv.org/abs/1811.00347)] 435 | 436 | **wav2vec 2.0: A framework for self-supervised learning of speech representations.**[2020 NIPS] 437 | 438 | *Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli* 439 | 440 | [[PDF](https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html)] 441 | 442 | **DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization.**[11th Dec, 2020] 443 | 444 | *Shaoshi Ling, Yuzong Liu* 445 | 446 | [[PDF](https://arxiv.org/abs/2012.06659)] 447 | 448 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018] 449 | 450 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 451 | 452 | [[PDF](https://arxiv.org/abs/1809.00496)] 453 | 454 | **Recurrent Neural Network Transducer for Audio-Visual Speech Recognition.**[Dec 2019] 455 | 456 | *Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan* 457 | 458 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9004036/)] 459 | 460 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR] 461 | 462 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 463 | 464 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)] 465 | 466 | **On the importance of super-Gaussian speech priors for machine-learning based speech enhancement.**[28th Nov, 2017] 467 | 468 | *Robert Rehr, Timo Gerkmann* 469 | 470 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8121999/)] 471 | 472 | **Active appearance models.**[1998 ECCV] 473 | 474 | *T. F. Cootes, G. J. Edwards, C. J. 
Taylor* 475 | 476 | [[PDF](https://link.springer.com/chapter/10.1007/BFb0054760)] 477 | 478 | **Leveraging category information for single-frame visual sound source separation.**[20th Jul, 2021] 479 | 480 | *Lingyu Zhu, Esa Rahtu* 481 | 482 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9484036/)] 483 | 484 | **The Sound of Pixels.**[2018 ECCV] 485 | 486 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba* 487 | 488 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)] 489 | 490 | ## Classification 491 | 492 | **VQA: Visual Question Answering.**[2015 ICCV] 493 | 494 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh* 495 | 496 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 497 | 498 | **Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news.**[1st Jul, 2016] 499 | 500 | *Erin Hea-Jin Kim, Yoo Kyung Jeong, Yuyong Kim, Keun Young Kang, Min Song* 501 | 502 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0165551515608733)] 503 | 504 | **On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis.**[6th Jul, 2017] 505 | 506 | *Jose Camacho-Collados, Mohammad Taher Pilehvar* 507 | 508 | [[PDF](https://arxiv.org/abs/1707.01780)] 509 | 510 | **Market strategies used by processed food manufacturers to increase and consolidate their power: a systematic review and document analysis.**[26th Jan, 2021] 511 | 512 | *Benjamin Wood, Owain Williams, Vijaya Nagarajan, Gary Sacks* 513 | 514 | [[PDF](https://globalizationandhealth.biomedcentral.com/articles/10.1186/s12992-021-00667-7)] 515 | 516 | **SWAFN: Sentimental words aware fusion network for multimodal sentiment analysis.**[2020 COLING] 517 | 518 | *Minping Chen, Xia Li* 519 | 520 | [[PDF](https://aclanthology.org/2020.coling-main.93/)] 521 | 522 | **Adaptive online event detection in news streams.**[15th Dec, 2017] 523 | 524 | *Linmei Hu, Bin Zhang, Lei Hou, Juanzi Li* 525 | 526 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705117304550)] 527 | 528 | **Multi-source multimodal data and deep learning for disaster response: A systematic review.**[27th Nov, 2021] 529 | 530 | *Nilani Algiriyage, Raj Prasanna, Kristin Stock, Emma E. H. Doyle, David Johnston* 531 | 532 | [[PDF](https://link.springer.com/article/10.1007/s42979-021-00971-4)] 533 | 534 | **A Survey of Data Representation for Multi-Modality Event Detection and Evolution.**[2nd Nov, 2021] 535 | 536 | *Kejing Xiao, Zhaopeng Qian, Biao Qin* 
537 | 538 | [[PDF](https://www.mdpi.com/2076-3417/12/4/2204)] 539 | 540 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018] 541 | 542 | *Firoj Alam, Ferda Ofli, Muhammad Imran* 543 | 544 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)] 545 | 546 | **Multi-modal generative adversarial networks for traffic event detection in smart cities.**[1st Sep, 2021] 547 | 548 | *Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen* 549 | 550 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0957417421003808)] 551 | 552 | **Proppy: Organizing the news based on their propagandistic content.**[5th Sep, 2019] 553 | 554 | *Alberto Barron-Cedeno, Israa Jaradat, Giovanni Da San Martino, Preslav Nakov* 555 | 556 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0306457318306058)] 557 | 558 | **Fine-grained analysis of propaganda in news articles.**[Nov 2019] 559 | 560 | *Giovanni Da San Martino, Seunghak Yu, Alberto Barron-Cedeno, Rostislav Petrov, Preslav Nakov* 561 | 562 | [[PDF](https://aclanthology.org/D19-1565/)] 563 | 564 | **Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs.**[Oct, 2017] 565 | 566 | *Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, Jiebo Luo* 567 | 568 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123454)] 569 | 570 | **SAFE: Similarity-Aware Multi-modal Fake News Detection.**[6th May, 2020] 571 | 572 | *Xinyi Zhou, Jindi Wu, Reza Zafarani* 573 | 574 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-47436-2_27)] 575 | 576 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR] 577 | 578 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi* 579 | 580 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)] 581 | 582 | **KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning.**[27th Oct, 2021] 583 | 584 | *Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, Lejian Liao* 585 | 586 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705121006705)] 587 | 588 | **LXMERT: Learning Cross-Modality Encoder Representations from Transformers.**[20th Aug, 2019] 589 | 590 | *Hao Tan, Mohit Bansal* 591 | 592 | [[PDF](https://arxiv.org/abs/1908.07490)] 593 | 594 | **Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers.**[2 Apr, 2020] 595 | 596 | *Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu* 597 | 598 | [[PDF](https://arxiv.org/abs/2004.00849)] 599 | 600 | **Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks.**[2020 CVPR] 601 | 602 | *Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang* 603 | 604 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Zhu_Vision-Language_Navigation_With_Self-Supervised_Auxiliary_Reasoning_Tasks_CVPR_2020_paper.html)] 605 | 606 | ## Generation 607 | 608 | **Recent advances and trends in multimodal deep learning: A review.**[24th May, 2021] 609 | 610 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul* 611 | 612 | [[PDF](https://arxiv.org/abs/2105.11087)] 613 | 614 | **VQA: Visual Question Answering.**[2015 ICCV] 615 | 616 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. 
Lawrence Zitnick, Devi Parikh* 617 | 618 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 619 | 620 | **Microsoft coco: Common objects in context.**[2014 ECCV] 621 | 622 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 623 | 624 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf%090.949.pdf)] 625 | 626 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018] 627 | 628 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova* 629 | 630 | [[PDF](https://arxiv.org/abs/1810.04805)] 631 | 632 | **Distributed Representations of Words and Phrases and their Compositionality.**[2013 NIPS] 633 | 634 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean* 635 | 636 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)] 637 | 638 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018] 639 | 640 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 641 | 642 | [[PDF](https://arxiv.org/abs/1809.00496)] 643 | 644 | **A lip sync expert is all you need for speech to lip generation in the wild.**[12th Oct, 2019] 645 | 646 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 647 | 648 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3394171.3413532)] 649 | 650 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI] 651 | 652 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao* 653 | 654 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)] 655 | 656 | **Show and Tell: A Neural Image Caption Generator.**[2015 CVPR] 657 | 658 | *Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan* 659 | 660 | [[PDF](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html)] 661 | 662 | **SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning.**[2017 CVPR] 663 | 664 | *Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua* 665 | 666 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.html)] 667 | 668 | **Self-Critical Sequence Training for Image Captioning.**[2017 CVPR] 669 | 670 | *Steven J. 
Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel* 671 | 672 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.html)] 673 | 674 | **Visual question answering: A survey of methods and datasets.**[Oct, 2017] 675 | 676 | *Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel* 677 | 678 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1077314217300772)] 679 | 680 | **How to find a good image-text embedding for remote sensing visual question answering?**[24th Sep, 2021] 681 | 682 | *Christel Chappuis, Sylvain Lobry, Benjamin Kellenberger, Bertrand Le Saux, Devis Tuia* 683 | 684 | [[PDF](https://arxiv.org/abs/2109.11848)] 685 | 686 | **An Improved Attention for Visual Question Answering.**[2021 CVPR] 687 | 688 | *Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini* 689 | 690 | [[PDF](https://openaccess.thecvf.com/content/CVPR2021W/MULA/html/Rahman_An_Improved_Attention_for_Visual_Question_Answering_CVPRW_2021_paper.html)] 691 | 692 | **Analyzing Compositionality of Visual Question Answering.**[2019 NIPS] 693 | 694 | *Sanjay Subramanian, Sameer Singh, Matt Gardner* 695 | 696 | [[PDF](https://vigilworkshop.github.io/static/papers-2019/43.pdf)] 697 | 698 | **OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge.**[2019 CVPR] 699 | 700 | *Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi* 701 | 702 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.html)] 703 | 704 | **MultiBench: Multiscale Benchmarks for Multimodal Representation Learning.**[15th Jul, 2021] 705 | 706 | *Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, 707 | Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, 708 | Ruslan Salakhutdinov, Louis-Philippe Morency* 709 | 710 | [[PDF](https://arxiv.org/abs/2107.07502)] 711 | 712 | **Benchmarking Multimodal AutoML for Tabular Data with Text Fields.**[4th Nov, 2021] 713 | 714 | *Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. 
Smola* 715 | 716 | [[PDF](https://arxiv.org/abs/2111.02705)] 717 | 718 | **Multimodal Explanations: Justifying Decisions and Pointing to the Evidence.**[2018 CVPR] 719 | 720 | *Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach* 721 | 722 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Park_Multimodal_Explanations_Justifying_CVPR_2018_paper.html)] 723 | 724 | **Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering.**[2018 CVPR] 725 | 726 | *Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi* 727 | 728 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Agrawal_Dont_Just_Assume_CVPR_2018_paper.html)] 729 | 730 | **Generative Adversarial Text to Image Synthesis.**[2016 ICML] 731 | 732 | *Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee* 733 | 734 | [[PDF](http://proceedings.mlr.press/v48/reed16.html)] 735 | 736 | **The Caltech-UCSD Birds-200-2011 Dataset.** 737 | 738 | [[PDF](https://authors.library.caltech.edu/27452/)] 739 | 740 | **AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks.**[2018 CVPR] 741 | 742 | *Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He* 743 | 744 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Xu_AttnGAN_Fine-Grained_Text_CVPR_2018_paper.html)] 745 | 746 | **LipSound: Neural Mel-spectrogram Reconstruction for Lip Reading.**[15 Sep, 2019] 747 | 748 | *Leyuan Qu, Cornelius Weber, Stefan Wermter* 749 | 750 | [[PDF](https://www.isca-speech.org/archive_v0/Interspeech_2019/pdfs/1393.pdf)] 751 | 752 | **The Conversation: Deep Audio-Visual Speech Enhancement.**[11th Apr, 2018] 753 | 754 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman* 755 | 756 | [[PDF](https://arxiv.org/abs/1804.04121)] 757 | 758 | **TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech.**[5th May, 2015] 759 | 760 | *Naomi Harte, Eoin Gillen* 761 | 762 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7050271/)] 763 | 764 | **Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.**[20 Oct, 2017] 765 | 766 | *Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arık, Ajay Kannan, Sharan Narang* 767 | 768 | [[PDF](https://arxiv.org/abs/1710.07654)] 769 | 770 | **Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions.**[15th Apr, 2018] 771 | 772 | *Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. 
Saurous, Yannis Agiomyrgiannakis, Yonghui Wu* 773 | 774 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461368/)] 775 | 776 | **Vid2speech: Speech reconstruction from silent video.**[5th Mar, 2017] 777 | 778 | *Ariel Ephrat, Shmuel Peleg* 779 | 780 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7953127/)] 781 | 782 | **Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video.**[15th Apr, 2018] 783 | 784 | *Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani* 785 | 786 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461856/)] 787 | 788 | **Video-Driven Speech Reconstruction using Generative Adversarial Networks.**[14th Jun, 2019] 789 | 790 | *Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic* 791 | 792 | [[PDF](https://arxiv.org/abs/1906.06301)] 793 | 794 | ## Retrieval 795 | 796 | **ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.**[2019 NIPS] 797 | 798 | *Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee* 799 | 800 | [[PDF](https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html)] 801 | 802 | **Learning Robust Patient Representations from Multi-modal Electronic Health Records: A Supervised Deep Learning Approach.**[2021] 803 | 804 | *Leman Akoglu, Evimaria Terzi, Xianli Zhang, Buyue Qian, Yang Liu, Xi Chen, Chong Guan, Chen Li* 805 | 806 | [[PDF](https://epubs.siam.org/doi/abs/10.1137/1.9781611976700.66)] 807 | 808 | **Referring Expression Comprehension: A Survey of Methods and Datasets.**[7th Dec, 2020] 809 | 810 | *Yanyuan Qiao, Chaorui Deng, Qi Wu* 811 | 812 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9285213/)] 813 | 814 | **VL-BERT: Pre-training of Generic Visual-Linguistic Representations.**[22nd Aug, 2019] 815 | 816 | *Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai* 817 | 818 | [[PDF](https://arxiv.org/abs/1908.08530)] 819 | 820 | **Clinically Accurate Chest X-Ray Report Generation.**[2019 MLHC] 821 | 822 | *Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, Marzyeh Ghassemi* 823 | 824 | [[PDF](http://proceedings.mlr.press/v106/liu19a.html)] 825 | 826 | ## Translation 827 | 828 | **Deep Residual Learning for Image Recognition.**[2016 CVPR] 829 | 830 | *Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun* 831 | 832 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)] 833 | 834 | **Probing the Need for Visual Context in Multimodal Machine Translation.**[20th Mar, 2019] 835 | 836 | *Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loic Barrault* 837 | 838 | [[PDF](https://arxiv.org/abs/1903.08678)] 839 | 840 | **Neural Machine Translation by Jointly Learning to Align and Translate.**[1st Sep, 2014] 841 | 842 | *Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio* 843 | 844 | [[PDF](https://arxiv.org/abs/1409.0473)] 845 | 846 | **Multi-modal neural machine translation with deep semantic interactions.**[Apr, 2021] 847 | 848 | *Jinsong Su, Jinchang Chen, Hui Jiang, Chulun Zhou, Huan Lin, Yubin Ge, Qingqiang Wu, Yongxuan Lai* 849 | 850 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0020025520311105)] 851 | 852 | # Multimodal-Datasets 853 | 854 | **VQA: Visual Question Answering.**[2015 ICCV] 855 | 856 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. 
Lawrence Zitnick, Devi Parikh* 857 | 858 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)] 859 | 860 | **Microsoft coco: Common objects in context.**[2014 ECCV] 861 | 862 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick* 863 | 864 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf%090.949.pdf)] 865 | 866 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020] 867 | 868 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi, Matsumura* 869 | 870 | [[PDF](https://arxiv.org/abs/2005.07202)] 871 | 872 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV] 873 | 874 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik* 875 | 876 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)] 877 | 878 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019] 879 | 880 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar* 881 | 882 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)] 883 | 884 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019] 885 | 886 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran* 887 | 888 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)] 889 | 890 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018] 891 | 892 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze* 893 | 894 | [[PDF](https://arxiv.org/abs/1811.00347)] 895 | 896 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR] 897 | 898 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar* 899 | 900 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)] 901 | 902 | **The Sound of Pixels.**[2018 ECCV] 903 | 904 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba* 905 | 906 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)] 907 | 908 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018] 909 | 910 | *Firoj Alam, Ferda Ofli, Muhammad Imran* 911 | 912 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)] 913 | 914 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR] 915 | 916 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi* 917 | 918 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)] 919 | 920 | **The Caltech-UCSD Birds-200-2011 Dataset.** 921 | 922 | [[PDF](https://authors.library.caltech.edu/27452/)] 923 | 924 | **Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics.**[30th Aug, 2013] 925 | 926 | *M. Hodosh, P. Young, J. 
Hockenmaier* 927 | 928 | [[PDF](https://www.jair.org/index.php/jair/article/view/10833)] 929 | 930 | **Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph.**[Jul, 2018] 931 | 932 | *AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-Philippe Morency* 933 | 934 | [[PDF](https://aclanthology.org/P18-1208/)] 935 | 936 | **MIMIC-III, a freely accessible critical care database.**[24th May, 2016] 937 | 938 | *Alistair E.W. Johnson, Tom J. Pollard, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Leo Anthony Celi & Roger G. Mark* 939 | 940 | [[PDF](https://www.nature.com/articles/sdata201635)] 941 | 942 | **Fashion 200K Benchmark** 943 | 944 | [[Github](https://github.com/xthan/fashion-200k)] 945 | 946 | **Indoor scene segmentation using a structured light sensor.**[Nov 2011] 947 | 948 | *Nathan Silberman, Rob Fergus* 949 | 950 | [[PDF](https://ieeexplore.ieee.org/abstract/document/6130298/)] 951 | 952 | **Indoor Segmentation and Support Inference from RGBD Images.**[2012 ECCV] 953 | 954 | *Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus* 955 | 956 | [[PDF](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/shkf_eccv2012.pdf)] 957 | 958 | **Good News, Everyone! Context Driven Entity-Aware Captioning for News Images.**[2019 CVPR] 959 | 960 | *Ali Furkan Biten, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas* 961 | 962 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.html)] 963 | 964 | **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language.**[2016 CVPR] 965 | 966 | *Jun Xu, Tao Mei, Ting Yao, Yong Rui* 967 | 968 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/Xu_MSR-VTT_A_Large_CVPR_2016_paper.html)] 969 | 970 | **Video Question Answering via Gradually Refined Attention over Appearance and Motion.**[Oct, 2017] 971 | 972 | *Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, Yueting Zhuang* 973 | 974 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123427)] 975 | 976 | **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.**[2017 CVPR] 977 | 978 | *Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim* 979 | 980 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Jang_TGIF-QA_Toward_Spatio-Temporal_CVPR_2017_paper.html)] 981 | 982 | **Multi-Target Embodied Question Answering.**[2019 CVPR] 983 | 984 | *Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra* 985 | 986 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Yu_Multi-Target_Embodied_Question_Answering_CVPR_2019_paper.html)] 987 | 988 | **VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering.**[14th Aug, 2019] 989 | 990 | *Catalina Cangea, Eugene Belilovsky, Pietro Lio, Aaron Courville* 991 | 992 | [[PDF](https://arxiv.org/abs/1908.04950)] 993 | 994 | **An Analysis of Visual Question Answering Algorithms.**[2017 ICCV] 995 | 996 | *Kushal Kafle, Christopher Kanan* 997 | 998 | [[PDF](http://openaccess.thecvf.com/content_iccv_2017/html/Kafle_An_Analysis_of_ICCV_2017_paper.html)] 999 | 1000 | **nuScenes: A Multimodal Dataset for Autonomous Driving.**[2020 CVPR] 1001 | 1002 | *Holger Caesar, Varun Bankiti, Alex H. 
Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom* 1003 | 1004 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Caesar_nuScenes_A_Multimodal_Dataset_for_Autonomous_Driving_CVPR_2020_paper.html)] 1005 | 1006 | **Automated Flower Classification over a Large Number of Classes.**[20th Jan, 2009] 1007 | 1008 | *Maria-Elena Nilsback, Andrew Zisserman* 1009 | 1010 | [[PDF](https://ieeexplore.ieee.org/abstract/document/4756141/)] 1011 | 1012 | **MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos.**[20th Jun, 2016] 1013 | 1014 | *Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency* 1015 | 1016 | [[PDF](https://arxiv.org/abs/1606.06259)] 1017 | 1018 | **Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition.**[18th Jan, 2020] 1019 | 1020 | *Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen* 1021 | 1022 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9320240/)] 1023 | 1024 | **The MIT Stata Center dataset.** [2013] 1025 | 1026 | *Maurice Fallon, Hordur Johannsson, Michael Kaess and John J Leonard* 1027 | 1028 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0278364913509035)] 1029 | 1030 | **data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.**[2022 ICML] 1031 | 1032 | *Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli* 1033 | 1034 | [[PDF](https://proceedings.mlr.press/v162/baevski22a.html)] 1035 | 1036 | **FLAVA: A Foundational Language and Vision Alignment Model.**[2022 CVPR] 1037 | 1038 | *Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela* 1039 | 1040 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Singh_FLAVA_A_Foundational_Language_and_Vision_Alignment_Model_CVPR_2022_paper.html)] 1041 | 1042 | **UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.**[2021 CVPR] 1043 | 1044 | *Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu* 1045 | 1046 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.html)] 1047 | 1048 | # Citation 1049 | If you find the listing and survey useful for your work, please cite the paper: 1050 | ``` 1051 | @article{manzoor2023multimodality, 1052 | title={Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications}, 1053 | author={Manzoor, Muhammad Arslan and Albarri, Sarah and Xian, Ziting and Meng, Zaiqiao and Nakov, Preslav and Liang, Shangsong}, 1054 | journal={arXiv preprint arXiv:2302.00389}, 1055 | year={2023} 1056 | } 1057 | ``` 1058 | --------------------------------------------------------------------------------