├── Fig_1_MML.png
└── README.md
/Fig_1_MML.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marslanm/Multimodality-Representation-Learning/c02e70f6e736804639255816bf04f157f1c81148/Fig_1_MML.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
2 | 
3 | ## Multimodal Deep Learning-based Research
4 |
5 | - [Survey Papers](#survey)
6 | - [Task-specific Methods](#Task-specific-Methods)
7 | - [Pretraining Approaches](#Pretraining-Approaches)
8 | - [Multimodal Applications](#Multimodal-Applications) ([Understanding](#Understanding), [Classification](#Classification), [Generation](#Generation), [Retrieval](#Retrieval), [Translation](#Translation))
9 | - [Multimodal Datasets](#Multimodal-Datasets)
10 |
11 | # Survey
12 |
13 | [**Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications.**](https://dl.acm.org/doi/abs/10.1145/3617833)
14 | *[Muhammad Arslan Manzoor](https://scholar.google.com/citations?hl=en&user=ZvXClnUAAAAJ), [Sarah Albarri](), [Ziting Xian](https://scholar.google.com/citations?hl=zh-CN&user=G7VId5YAAAAJ&view_op=list_works&gmla=AJsN-F6TTZmzbi9CmBIRLNRpAhcgmzH-nOUd8hM5UTjfT5A_mYW2ABzjSrdX7ki9GFgGaId2dLlMtXBkfq7X_qzYOwF_OuvCCthMiuVNUuUiac-aGoSwsKQ), [Zaiqiao Meng](https://scholar.google.com/citations?user=5jJKFVcAAAAJ&hl=en), [Preslav Nakov](https://scholar.google.com/citations?user=DfXsKZ4AAAAJ&hl=en), and [Shangsong Liang](https://scholar.google.com/citations?user=4uggVcIAAAAJ&h).*
15 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3617833)]
16 |
17 | **Vision-Language Pre-training: Basics, Recent Advances, and Future Trends.**[17th Oct, 2022]
18 | *Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao.*
19 | [[PDF](https://arxiv.org/pdf/2210.09263.pdf)]
20 |
21 | **VLP: A survey on vision-language pre-training.**[18th Feb, 2022]
22 | *Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, and Bo Xu.*
23 | [[PDF](https://link.springer.com/article/10.1007/s11633-022-1369-5)]
24 |
25 |
26 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022]
27 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao.*
28 | [[PDF](https://arxiv.org/abs/2202.10936)]
29 |
30 |
31 | **Vision-and-Language Pretrained Models: A Survey.**[15th Apr, 2022]
32 | *Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang.*
33 | [[PDF](https://arxiv.org/abs/2204.07356)]
34 |
35 | **Comprehensive reading list for Multimodal Literature**
36 | [[Github](https://github.com/pliang279/awesome-multimodal-ml#survey-papers)]
37 |
38 | **Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.**[28th Jul, 2021]
39 | *Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig*
40 | [[PDF](https://arxiv.org/abs/2107.13586)]
41 |
42 | **Recent Advances and Trends in Multimodal Deep Learning: A Review.**[24th May, 2021]
43 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul.*
44 | [[PDF](https://arxiv.org/pdf/2105.11087)]
45 |
46 |
47 |
48 | # Task-specific-Methods
49 |
50 | **Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network.**[9th Feb, 2021]
51 |
52 | *Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji*
53 |
54 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16258)]
55 |
56 | **Cascaded Recurrent Neural Networks for Hyperspectral Image Classification.**[Aug, 2019]
57 |
58 | *Renlong Hang, Qingshan Liu, Danfeng Hong, Pedram Ghamisi*
59 |
60 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8662780)]
61 |
62 | **Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.**[2015 NIPS]
63 |
64 | *Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun*
65 |
66 | [[PDF](https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html)]
67 |
68 | **Microsoft coco: Common objects in context.**[2014 ECCV]
69 |
70 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick*
71 |
72 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf)]
73 |
74 | **Multimodal Deep Learning.**[2011 ICML]
75 |
76 | *Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng*
77 |
78 | [[PDF](http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf)]
79 |
80 | **Extracting and composing robust features with denoising autoencoders.**[5th July, 2008]
81 |
82 | *Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol*
83 |
84 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/1390156.1390294)]
85 |
86 | **Multi-Gate Attention Network for Image Captioning.**[13th Mar, 2021]
87 |
88 | *Weitao Jiang, Xiying Li, Haifeng Hu, Qiang Lu, Bohong Liu*
89 |
90 | [[PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9382255)]
91 |
92 | **AMC: Attention guided Multi-modal Correlation Learning for Image Search.**[2017 CVPR]
93 |
94 | *Kan Chen, Trung Bui, Chen Fang, Zhaowen Wang, Ram Nevatia*
95 |
96 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_AMC_Attention_guided_CVPR_2017_paper.pdf)]
97 |
98 | **Video Captioning via Hierarchical Reinforcement Learning.**[2018 CVPR]
99 |
100 | *Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, William Yang Wang*
101 |
102 | [[PDF](https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf)]
103 |
104 | **Gaussian Process with Graph Convolutional Kernel for Relational Learning.**[14th Aug, 2021]
105 |
106 | *Jinyuan Fang, Shangsong Liang, Zaiqiao Meng, Qiang Zhang*
107 |
108 | [[PDF](https://dl.acm.org/doi/pdf/10.1145/3447548.3467327)]
109 |
110 | **Multi-Relational Graph Representation Learning with Bayesian Gaussian Process Network.**[28th June, 2022]
111 |
112 | *Guanzheng Chen, Jinyuan Fang, Zaiqiao Meng, Qiang Zhang, Shangsong Liang*
113 |
114 | [[PDF](https://doi.org/10.1609/aaai.v36i5.20492)]
115 |
116 | # Pretraining-Approaches
117 |
118 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022]
119 |
120 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed*
121 |
122 | [[PDF](https://arxiv.org/abs/2201.02184)]
123 |
124 | **A Survey of Vision-Language Pre-Trained Models.**[18th Feb, 2022]
125 |
126 | *Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao*
127 |
128 | [[PDF](https://arxiv.org/abs/2202.10936)]
129 |
130 | **Attention is All you Need.**[2017 NIPS]
131 |
132 | *Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin*
133 |
134 | [[PDF](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)]
135 |
136 | **VinVL: Revisiting Visual Representations in Vision-Language Models.**[2021 CVPR]
137 |
138 | *Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao*
139 |
140 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.html)]
141 |
142 | **M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining.**[Aug, 2021]
143 |
144 | *Junyang Lin, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, Hongxia Yang*
145 |
146 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3447548.3467206)]
147 |
148 | **AMMU: A survey of transformer-based biomedical pretrained language models.**[23rd Mar, 2020]
149 |
150 | *Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, Sivanesan Sangeetha*
151 |
152 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1532046421003117)]
153 |
154 | **ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators**
155 |
156 | *Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning*
157 |
158 | [[PDF](https://arxiv.org/abs/2003.10555)]
159 |
160 | **RoBERTa: A Robustly Optimized BERT Pretraining Approach.**[26th Jul, 2019]
161 |
162 | *Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov*
163 |
164 | [[PDF](https://arxiv.org/abs/1907.11692)]
165 |
166 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018]
167 |
168 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova*
169 |
170 | [[PDF](https://arxiv.org/abs/1810.04805)]
171 |
172 | **BioBERT: a pre-trained biomedical language representation model for biomedical text mining.**[10th Sep, 2019]
173 |
174 | *Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang*
175 |
176 | [[PDF](https://academic.oup.com/bioinformatics/article-abstract/36/4/1234/5566506)]
177 |
178 | **HateBERT: Retraining BERT for Abusive Language Detection in English.**[23rd Oct, 2020]
179 |
180 | *Tommaso Caselli, Valerio Basile, Jelena Mitrovic, Michael Granitzer*
181 |
182 | [[PDF](https://arxiv.org/abs/2010.12472)]
183 |
184 | **InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training.**[15th Jul, 2020]
185 |
186 | *Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou*
187 |
188 | [[PDF](https://arxiv.org/abs/2007.07834)]
189 |
190 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020]
191 |
192 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi Matsumura*
193 |
194 | [[PDF](https://arxiv.org/abs/2005.07202)]
195 |
196 | **Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.**[23rd Apr, 2020]
197 |
198 | *Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith*
199 |
200 | [[PDF](https://arxiv.org/abs/2004.10964)]
201 |
202 | **Knowledge Inheritance for Pre-trained Language Models.**[28th May, 2021]
203 |
204 | *Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou*
205 |
206 | [[PDF](https://arxiv.org/abs/2105.13880)]
207 |
208 | **Improving Language Understanding by Generative Pre-Training.**[2018]
209 |
210 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever*
211 |
212 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)]
213 |
214 | **Shuffled-token Detection for Refining Pre-trained RoBERTa**
215 |
216 | *Subhadarshi Panda, Anjali Agrawal, Jeewon Ha, Benjamin Bloch*
217 |
218 | [[PDF](https://aclanthology.org/2021.naacl-srw.12/)]
219 |
220 | **ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.**[26th Sep, 2019]
221 |
222 | *Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut*
223 |
224 | [[PDF](https://arxiv.org/abs/1909.11942)]
225 |
226 | **Exploring the limits of transfer learning with a unified text-to-text transformer.**[1st Jan, 2020]
227 |
228 | *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*
229 |
230 | [[PDF](https://dl.acm.org/doi/abs/10.5555/3455716.3455856)]
231 |
232 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020]
233 |
234 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko*
235 |
236 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)]
237 |
238 | **Deformable DETR: Deformable Transformers for End-to-End Object Detection.**[8th Oct, 2020]
239 |
240 | *Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai*
241 |
242 | [[PDF](https://arxiv.org/abs/2010.04159)]
243 |
244 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI]
245 |
246 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao*
247 |
248 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)]
249 |
250 | **VirTex: Learning Visual Representations From Textual Annotations.**[2021 CVPR]
251 |
252 | *Karan Desai, Justin Johnson*
253 |
254 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.html)]
255 |
256 | **Ernie-vil: Knowledge enhanced vision-language representations through scene graphs.**[2021 AAAI]
257 |
258 | *Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang*
259 |
260 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/16431)]
261 |
262 | **OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks.**[24th Sep, 2020]
263 |
264 | *Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao*
265 |
266 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_8)]
267 |
268 | **Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision.**[14th Oct, 2020]
269 |
270 | *Hao Tan, Mohit Bansal*
271 |
272 | [[PDF](https://arxiv.org/abs/2010.06775)]
273 |
274 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV]
275 |
276 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik*
277 |
278 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)]
279 |
280 | **Distributed representations of words and phrases and their compositionality.**[2013 NIPS]
281 |
282 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean*
283 |
284 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)]
285 |
286 | **AllenNLP: A Deep Semantic Natural Language Processing Platform.**[20 Mar, 2018]
287 |
288 | *Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer*
289 |
290 | [[PDF](https://arxiv.org/abs/1803.07640)]
291 |
292 | **Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.**[Jul, 2020]
293 |
294 | *Emily M. Bender, Alexander Koller*
295 |
296 | [[PDF](https://aclanthology.org/2020.acl-main.463/)]
297 |
298 | **Experience Grounds Language.**[21st Apr, 2020]
299 |
300 | *Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian*
301 |
302 | [[PDF](https://arxiv.org/abs/2004.10151)]
303 |
304 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?**
305 |
306 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed*
307 |
308 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)]
309 |
310 | ## Unifying Architectures
311 |
312 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018]
313 |
314 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova*
315 |
316 | [[PDF](https://arxiv.org/abs/1810.04805)]
317 |
318 | **Improving Language Understanding by Generative Pre-Training.**[2018]
319 |
320 | *Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever*
321 |
322 | [[PDF](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)]
323 |
324 | **End-to-End Object Detection with Transformers.**[3rd Nov, 2020]
325 |
326 | *Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko*
327 |
328 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13)]
329 |
330 | **UNITER: UNiversal Image-TExt Representation Learning.**[24th Sep, 2020]
331 |
332 | *Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu*
333 |
334 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-58577-8_7)]
335 |
336 | **UniT: Multimodal Multitask Learning with a Unified Transformer.**[2021 ICCV]
337 |
338 | *Ronghang Hu, Amanpreet Singh*
339 |
340 | [[PDF](https://openaccess.thecvf.com/content/ICCV2021/html/Hu_UniT_Multimodal_Multitask_Learning_With_a_Unified_Transformer_ICCV_2021_paper.html)]
341 |
342 | **VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.**[2021 NIPS]
343 |
344 | *Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong*
345 |
346 | [[PDF](https://proceedings.neurips.cc/paper/2021/hash/cb3213ada48302953cb0f166464ab356-Abstract.html)]
347 |
348 | **OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.**[2022 ICML]
349 |
350 | *Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang*
351 |
352 | [[PDF](https://proceedings.mlr.press/v162/wang22al.html)]
353 |
354 | **BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.**[29th Oct, 2019]
355 |
356 | *Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer*
357 |
358 | [[PDF](https://arxiv.org/abs/1910.13461)]
359 |
360 | # Multimodal-Applications
361 |
362 | ## Understanding
363 |
364 | **Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.**[5th Jan, 2022]
365 |
366 | *Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed*
367 |
368 | [[PDF](https://arxiv.org/abs/2201.02184)]
369 |
370 | **Self-Supervised Multimodal Opinion Summarization.**[27th May, 2021]
371 |
372 | *Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung*
373 |
374 | [[PDF](https://arxiv.org/abs/2105.13135)]
375 |
376 | **Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?**
377 |
378 | *Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed*
379 |
380 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9414460/)]
381 |
382 | **LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding.**[29th Dec, 2020]
383 |
384 | *Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou*
385 |
386 | [[PDF](https://arxiv.org/abs/2012.14740)]
387 |
388 | **Structext: Structured text understanding with multi-modal transformers.**[17th Oct, 2021]
389 |
390 | *Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding*
391 |
392 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3474085.3475345)]
393 |
394 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019]
395 |
396 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar*
397 |
398 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)]
399 |
400 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019]
401 |
402 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran*
403 |
404 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)]
405 |
406 | **XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding.**[2022 CVPR]
407 |
408 | *Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, Liqing Zhang*
409 |
410 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Gu_XYLayoutLM_Towards_Layout-Aware_Multimodal_Networks_for_Visually-Rich_Document_Understanding_CVPR_2022_paper.html)]
411 |
412 | **Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos.**[2020 EMNLP]
413 |
414 | *Nayu Liu, Xian Sun, Hongfeng Yu, Wenkai Zhang, Guangluan Xu*
415 |
416 | [[PDF](https://aclanthology.org/2020.emnlp-main.144/)]
417 |
418 | **Multimodal Abstractive Summarization for How2 Videos.**[19th Jun, 2019]
419 |
420 | *Shruti Palaskar, Jindrich Libovicky, Spandana Gella, Florian Metze*
421 |
422 | [[PDF](https://arxiv.org/abs/1906.07901)]
423 |
424 | **Vision guided generative pre-trained language models for multimodal abstractive summarization.**[6th Sep, 2021]
425 |
426 | *Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung*
427 |
428 | [[PDF](https://arxiv.org/abs/2109.02401)]
429 |
430 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018]
431 |
432 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze*
433 |
434 | [[PDF](https://arxiv.org/abs/1811.00347)]
435 |
436 | **wav2vec 2.0: A framework for self-supervised learning of speech representations.**[2020 NIPS]
437 |
438 | *Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli*
439 |
440 | [[PDF](https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html)]
441 |
442 | **DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization.**[11th Dec, 2020]
443 |
444 | *Shaoshi Ling, Yuzong Liu*
445 |
446 | [[PDF](https://arxiv.org/abs/2012.06659)]
447 |
448 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018]
449 |
450 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman*
451 |
452 | [[PDF](https://arxiv.org/abs/1809.00496)]
453 |
454 | **Recurrent Neural Network Transducer for Audio-Visual Speech Recognition.**[Dec 2019]
455 |
456 | *Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan*
457 |
458 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9004036/)]
459 |
460 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR]
461 |
462 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar*
463 |
464 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)]
465 |
466 | **On the importance of super-Gaussian speech priors for machine-learning based speech enhancement.**[28th Nov, 2017]
467 |
468 | *Robert Rehr, Timo Gerkmann*
469 |
470 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8121999/)]
471 |
472 | **Active appearance models.**[1998 ECCV]
473 |
474 | *T. F. Cootes, G. J. Edwards, C. J. Taylor*
475 |
476 | [[PDF](https://link.springer.com/chapter/10.1007/BFb0054760)]
477 |
478 | **Leveraging category information for single-frame visual sound source separation.**[20th Jul, 2021]
479 |
480 | *Lingyu Zhu, Esa Rahtu*
481 |
482 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9484036/)]
483 |
484 | **The Sound of Pixels.**[2018 ECCV]
485 |
486 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba*
487 |
488 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)]
489 |
490 | ## Classification
491 |
492 | **Vqa: Visual question answering.**[2015 ICCV]
493 |
494 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh*
495 |
496 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)]
497 |
498 | **Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news.**[1st Jul, 2016]
499 |
500 | *Erin Hea-Jin Kim, Yoo Kyung Jeong, Yuyong Kim, Keun Young Kang, Min Song*
501 |
502 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0165551515608733)]
503 |
504 | **On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis.**[6th Jul, 2017]
505 |
506 | *Jose Camacho-Collados, Mohammad Taher Pilehvar*
507 |
508 | [[PDF](https://arxiv.org/abs/1707.01780)]
509 |
510 | **Market strategies used by processed food manufacturers to increase and consolidate their power: a systematic review and document analysis.**[26th Jan, 2021]
511 |
512 | *Benjamin Wood, Owain Williams, Vijaya Nagarajan, Gary Sacks*
513 |
514 | [[PDF](https://globalizationandhealth.biomedcentral.com/articles/10.1186/s12992-021-00667-7)]
515 |
516 | **Swafn: Sentimental words aware fusion network for multimodal sentiment analysis.**[2020 COLING]
517 |
518 | *Minping Chen, Xia Li*
519 |
520 | [[PDF](https://aclanthology.org/2020.coling-main.93/)]
521 |
522 | **Adaptive online event detection in news streams.**[15th Dec, 2017]
523 |
524 | *Linmei Hu, Bin Zhang, Lei Hou, Juanzi Li*
525 |
526 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705117304550)]
527 |
528 | **Multi-source multimodal data and deep learning for disaster response: A systematic review.**[27th Nov, 2021]
529 |
530 | *Nilani Algiriyage, Raj Prasanna, Kristin Stock, Emma E. H. Doyle, David Johnston*
531 |
532 | [[PDF](https://link.springer.com/article/10.1007/s42979-021-00971-4)]
533 |
534 | **A Survey of Data Representation for Multi-Modality Event Detection and Evolution.**[2nd Nov, 2021]
535 |
536 | *Kejing Xiao, Zhaopeng Qian, Biao Qin*
537 |
538 | [[PDF](https://www.mdpi.com/2076-3417/12/4/2204)]
539 |
540 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018]
541 |
542 | *Firoj Alam, Ferda Ofli, Muhammad Imran*
543 |
544 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)]
545 |
546 | **Multi-modal generative adversarial networks for traffic event detection in smart cities.**[1st Sep, 2021]
547 |
548 | *Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen*
549 |
550 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0957417421003808)]
551 |
552 | **Proppy: Organizing the news based on their propagandistic content.**[5th Sep, 2019]
553 |
554 | *Alberto Barron-Cedeno, Israa Jaradat, Giovanni Da San Martino, Preslav Nakov*
555 |
556 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0306457318306058)]
557 |
558 | **Fine-grained analysis of propaganda in news article.**[Nov 2019]
559 |
560 | *Giovanni Da San Martino, Seunghak Yu, Alberto Barron-Cedeno, Rostislav Petrov, Preslav Nakov*
561 |
562 | [[PDF](https://aclanthology.org/D19-1565/)]
563 |
564 | **Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs.**[Oct, 2017]
565 |
566 | *Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, Jiebo Luo*
567 |
568 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123454)]
569 |
570 | **SAFE: Similarity-Aware Multi-modal Fake News Detection.**[6th May, 2020]
571 |
572 | *Xinyi Zhou, Jindi Wu, Reza Zafarani*
573 |
574 | [[PDF](https://link.springer.com/chapter/10.1007/978-3-030-47436-2_27)]
575 |
576 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR]
577 |
578 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi*
579 |
580 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)]
581 |
582 | **KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning.**[27th Oct, 2021]
583 |
584 | *Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, Lejian Liao*
585 |
586 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0950705121006705)]
587 |
588 | **LXMERT: Learning Cross-Modality Encoder Representations from Transformers.**[20th Aug, 2019]
589 |
590 | *Hao Tan, Mohit Bansal*
591 |
592 | [[PDF](https://arxiv.org/abs/1908.07490)]
593 |
594 | **Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers.**[2 Apr, 2020]
595 |
596 | *Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu*
597 |
598 | [[PDF](https://arxiv.org/abs/2004.00849)]
599 |
600 | **Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks.**[2020 CVPR]
601 |
602 | *Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang*
603 |
604 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Zhu_Vision-Language_Navigation_With_Self-Supervised_Auxiliary_Reasoning_Tasks_CVPR_2020_paper.html)]
605 |
606 | ## Generation
607 |
608 | **Recent advances and trends in multimodal deep learning: A review.**[24th May, 2021]
609 |
610 | *Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul*
611 |
612 | [[PDF](https://arxiv.org/abs/2105.11087)]
613 |
614 | **Vqa: Visual question answering.**[2015 ICCV]
615 |
616 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh*
617 |
618 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)]
619 |
620 | **Microsoft coco: Common objects in context.**[2014 ECCV]
621 |
622 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick*
623 |
624 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf)]
625 |
626 | **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.**[11th Oct, 2018]
627 |
628 | *Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova*
629 |
630 | [[PDF](https://arxiv.org/abs/1810.04805)]
631 |
632 | **Distributed Representations of Words and Phrases and their Compositionality.**[2013 NIPS]
633 |
634 | *Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean*
635 |
636 | [[PDF](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)]
637 |
638 | **LRS3-TED: a large-scale dataset for visual speech recognition.**[3rd Sep, 2018]
639 |
640 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman*
641 |
642 | [[PDF](https://arxiv.org/abs/1809.00496)]
643 |
644 | **A lip sync expert is all you need for speech to lip generation in the wild.**[12th Oct, 2020]
645 |
646 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar*
647 |
648 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3394171.3413532)]
649 |
650 | **Unified Vision-Language Pre-Training for Image Captioning and VQA.**[2020 AAAI]
651 |
652 | *Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao*
653 |
654 | [[PDF](https://ojs.aaai.org/index.php/AAAI/article/view/7005)]
655 |
656 | **Show and Tell: A Neural Image Caption Generator.**[2015 CVPR]
657 |
658 | *Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan*
659 |
660 | [[PDF](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html)]
661 |
662 | **SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning.**[2017 CVPR]
663 |
664 | *Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua*
665 |
666 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.html)]
667 |
668 | **Self-Critical Sequence Training for Image Captioning.**[2017 CVPR]
669 |
670 | *Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel*
671 |
672 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.html)]
673 |
674 | **Visual question answering: A survey of methods and datasets.**[Oct, 2017]
675 |
676 | *Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel*
677 |
678 | [[PDF](https://www.sciencedirect.com/science/article/pii/S1077314217300772)]
679 |
680 | **How to find a good image-text embedding for remote sensing visual question answering?.**[24th Sep, 2021]
681 |
682 | *Christel Chappuis, Sylvain Lobry, Benjamin Kellenberger, Bertrand Le Saux, Devis Tuia*
683 |
684 | [[PDF](https://arxiv.org/abs/2109.11848)]
685 |
686 | **An Improved Attention for Visual Question Answering.**[2021 CVPR]
687 |
688 | *Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini*
689 |
690 | [[PDF](https://openaccess.thecvf.com/content/CVPR2021W/MULA/html/Rahman_An_Improved_Attention_for_Visual_Question_Answering_CVPRW_2021_paper.html)]
691 |
692 | **Analyzing Compositionality of Visual Question Answering.**[2019 NIPS]
693 |
694 | *Sanjay Subramanian, Sameer Singh, Matt Gardner*
695 |
696 | [[PDF](https://vigilworkshop.github.io/static/papers-2019/43.pdf)]
697 |
698 | **OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge.**[2019 CVPR]
699 |
700 | *Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi*
701 |
702 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.html)]
703 |
704 | **MultiBench: Multiscale Benchmarks for Multimodal Representation Learning.**[15th Jul, 2021]
705 |
706 | *Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, Louis-Philippe Morency*
709 |
710 | [[PDF](https://arxiv.org/abs/2107.07502)]
711 |
712 | **Benchmarking Multimodal AutoML for Tabular Data with Text Fields.**[4th Nov, 2021]
713 |
714 | *Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola*
715 |
716 | [[PDF](https://arxiv.org/abs/2111.02705)]
717 |
718 | **Multimodal Explanations: Justifying Decisions and Pointing to the Evidence.**[2018 CVPR]
719 |
720 | *Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach*
721 |
722 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Park_Multimodal_Explanations_Justifying_CVPR_2018_paper.html)]
723 |
724 | **Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering.**[2018 CVPR]
725 |
726 | *Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi*
727 |
728 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Agrawal_Dont_Just_Assume_CVPR_2018_paper.html)]
729 |
730 | **Generative Adversarial Text to Image Synthesis.**[2016 ICML]
731 |
732 | *Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee*
733 |
734 | [[PDF](http://proceedings.mlr.press/v48/reed16.html)]
735 |
736 | **The Caltech-UCSD Birds-200-2011 Dataset.**
737 |
738 | [[PDF](https://authors.library.caltech.edu/27452/)]
739 |
740 | **AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks.**[2018 CVPR]
741 |
742 | *Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He*
743 |
744 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2018/html/Xu_AttnGAN_Fine-Grained_Text_CVPR_2018_paper.html)]
745 |
746 | **LipSound: Neural Mel-spectrogram Reconstruction for Lip Reading.**[15 Sep, 2019]
747 |
748 | *Leyuan Qu, Cornelius Weber, Stefan Wermter*
749 |
750 | [[PDF](https://www.isca-speech.org/archive_v0/Interspeech_2019/pdfs/1393.pdf)]
751 |
752 | **The Conversation: Deep Audio-Visual Speech Enhancement.**[11th Apr, 2018]
753 |
754 | *Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman*
755 |
756 | [[PDF](https://arxiv.org/abs/1804.04121)]
757 |
758 | **TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech.**[5th May, 2015]
759 |
760 | *Naomi Harte, Eoin Gillen*
761 |
762 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7050271/)]
763 |
764 | **Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.**[20 Oct, 2017]
765 |
766 | *Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arık, Ajay Kannan, Sharan Narang*
767 |
768 | [[PDF](https://arxiv.org/abs/1710.07654)]
769 |
770 | **Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions.**[15th Apr, 2018]
771 |
772 | *Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu*
773 |
774 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461368/)]
775 |
776 | **Vid2speech: Speech reconstruction from silent video.**[5th Mar, 2017]
777 |
778 | *Ariel Ephrat, Shmuel Peleg*
779 |
780 | [[PDF](https://ieeexplore.ieee.org/abstract/document/7953127/)]
781 |
782 | **Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video.**[15th Apr, 2018]
783 |
784 | *Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani*
785 |
786 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8461856/)]
787 |
788 | **Video-Driven Speech Reconstruction using Generative Adversarial Networks.**[14th Jun, 2019]
789 |
790 | *Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic*
791 |
792 | [[PDF](https://arxiv.org/abs/1906.06301)]
793 |
794 | ## Retrieval
795 |
796 | **ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.**[2019 NIPS]
797 |
798 | *Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee*
799 |
800 | [[PDF](https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html)]
801 |
802 | **Learning Robust Patient Representations from Multi-modal Electronic Health Records: A Supervised Deep Learning Approach.**[2021]
803 |
804 | *Xianli Zhang, Buyue Qian, Yang Liu, Xi Chen, Chong Guan, Chen Li*
805 |
806 | [[PDF](https://epubs.siam.org/doi/abs/10.1137/1.9781611976700.66)]
807 |
808 | **Referring Expression Comprehension: A Survey of Methods and Datasets.**[7th Dec, 2020]
809 |
810 | *Yanyuan Qiao, Chaorui Deng, Qi Wu*
811 |
812 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9285213/)]
813 |
814 | **VL-BERT: Pre-training of Generic Visual-Linguistic Representations.**[22nd Aug, 2019]
815 |
816 | *Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai*
817 |
818 | [[PDF](https://arxiv.org/abs/1908.08530)]
819 |
820 | **Clinically Accurate Chest X-Ray Report Generation.**[2019 MLHC]
821 |
822 | *Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, Marzyeh Ghassemi*
823 |
824 | [[PDF](http://proceedings.mlr.press/v106/liu19a.html)]
825 |
826 | ## Translation
827 |
828 | **Deep Residual Learning for Image Recognition.**[2016 CVPR]
829 |
830 | *Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun*
831 |
832 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)]
833 |
834 | **Probing the Need for Visual Context in Multimodal Machine Translation.**[20th Mar, 2019]
835 |
836 | *Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loic Barrault*
837 |
838 | [[PDF](https://arxiv.org/abs/1903.08678)]
839 |
840 | **Neural Machine Translation by Jointly Learning to Align and Translate.**[1st Sep, 2014]
841 |
842 | *Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio*
843 |
844 | [[PDF](https://arxiv.org/abs/1409.0473)]
845 |
846 | **Multi-modal neural machine translation with deep semantic interactions.**[Apr, 2021]
847 |
848 | *Jinsong Su, Jinchang Chen, Hui Jiang, Chulun Zhou, Huan Lin, Yubin Ge, Qingqiang Wu, Yongxuan Lai*
849 |
850 | [[PDF](https://www.sciencedirect.com/science/article/pii/S0020025520311105)]
851 |
852 | # Multimodal-Datasets
853 |
854 | **Vqa: Visual question answering.**[2015 ICCV]
855 |
856 | *Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh*
857 |
858 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html)]
859 |
860 | **Microsoft coco: Common objects in context.**[2014 ECCV]
861 |
862 | *Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, C. Lawrence Zitnick*
863 |
864 | [[PDF](https://arxiv.org/pdf/1405.0312.pdf)]
865 |
866 | **Pre-training technique to localize medical BERT and enhance biomedical BERT.**[14th May, 2020]
867 |
868 | *Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, Yasushi Matsumura*
869 |
870 | [[PDF](https://arxiv.org/abs/2005.07202)]
871 |
872 | **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.**[2015 ICCV]
873 |
874 | *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik*
875 |
876 | [[PDF](http://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html)]
877 |
878 | **ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.**[20th Sep, 2019]
879 |
880 | *Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar*
881 |
882 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8977955/)]
883 |
884 | **FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.**[20th Sep, 2019]
885 |
886 | *Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran*
887 |
888 | [[PDF](https://ieeexplore.ieee.org/abstract/document/8892998/)]
889 |
890 | **How2: A Large-scale Dataset for Multimodal Language Understanding.**[1st Nov, 2018]
891 |
892 | *Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze*
893 |
894 | [[PDF](https://arxiv.org/abs/1811.00347)]
895 |
896 | **Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis.**[2020 CVPR]
897 |
898 | *K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar*
899 |
900 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.html)]
901 |
902 | **The Sound of Pixels.**[2018 ECCV]
903 |
904 | *Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba*
905 |
906 | [[PDF](http://openaccess.thecvf.com/content_ECCV_2018/html/Hang_Zhao_The_Sound_of_ECCV_2018_paper.html)]
907 |
908 | **Crisismmd: Multimodal twitter datasets from natural disasters.**[15th Jun, 2018]
909 |
910 | *Firoj Alam, Ferda Ofli, Muhammad Imran*
911 |
912 | [[PDF](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/viewPaper/17816)]
913 |
914 | **From Recognition to Cognition: Visual Commonsense Reasoning.**[2019 CVPR]
915 |
916 | *Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi*
917 |
918 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html)]
919 |
920 | **The Caltech-UCSD Birds-200-2011 Dataset.**
921 |
922 | [[PDF](https://authors.library.caltech.edu/27452/)]
923 |
924 | **Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics.**[30th Aug, 2013]
925 |
926 | *M. Hodosh, P. Young, J. Hockenmaier*
927 |
928 | [[PDF](https://www.jair.org/index.php/jair/article/view/10833)]
929 |
930 | **Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph.**[Jul, 2018]
931 |
932 | *AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-Philippe Morency*
933 |
934 | [[PDF](https://aclanthology.org/P18-1208/)]
935 |
936 | **MIMIC-III, a freely accessible critical care database.**[24th May, 2016]
937 |
938 | *Alistair E.W. Johnson, Tom J. Pollard, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Leo Anthony Celi & Roger G. Mark*
939 |
940 | [[PDF](https://www.nature.com/articles/sdata201635)]
941 |
942 | **Fashion 200K Benchmark**
943 |
944 | [[Github](https://github.com/xthan/fashion-200k)]
945 |
946 | **Indoor scene segmentation using a structured light sensor.**[Nov 2011]
947 |
948 | *Nathan Silberman, Rob Fergus*
949 |
950 | [[PDF](https://ieeexplore.ieee.org/abstract/document/6130298/)]
951 |
952 | **Indoor Segmentation and Support Inference from RGBD Images.**[2012 ECCV]
953 |
954 | *Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus*
955 |
956 | [[PDF](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/shkf_eccv2012.pdf)]
957 |
958 | **Good News, Everyone! Context Driven Entity-Aware Captioning for News Images.**[2019 CVPR]
959 |
960 | *Ali Furkan Biten, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas*
961 |
962 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.html)]
963 |
964 | **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language.**[2016 CVPR]
965 |
966 | *Jun Xu, Tao Mei, Ting Yao, Yong Rui*
967 |
968 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2016/html/Xu_MSR-VTT_A_Large_CVPR_2016_paper.html)]
969 |
970 | **Video Question Answering via Gradually Refined Attention over Appearance and Motion.**[Oct, 2017]
971 |
972 | *Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, Yueting Zhuang*
973 |
974 | [[PDF](https://dl.acm.org/doi/abs/10.1145/3123266.3123427)]
975 |
976 | **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.**[2017 CVPR]
977 |
978 | *Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim*
979 |
980 | [[PDF](http://openaccess.thecvf.com/content_cvpr_2017/html/Jang_TGIF-QA_Toward_Spatio-Temporal_CVPR_2017_paper.html)]
981 |
982 | **Multi-Target Embodied Question Answering.**[2019 CVPR]
983 |
984 | *Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra*
985 |
986 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2019/html/Yu_Multi-Target_Embodied_Question_Answering_CVPR_2019_paper.html)]
987 |
988 | **VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering.**[14th Aug, 2019]
989 |
990 | *Catalina Cangea, Eugene Belilovsky, Pietro Lio, Aaron Courville*
991 |
992 | [[PDF](https://arxiv.org/abs/1908.04950)]
993 |
994 | **An Analysis of Visual Question Answering Algorithms.**[2017 ICCV]
995 |
996 | *Kushal Kafle, Christopher Kanan*
997 |
998 | [[PDF](http://openaccess.thecvf.com/content_iccv_2017/html/Kafle_An_Analysis_of_ICCV_2017_paper.html)]
999 |
1000 | **nuScenes: A Multimodal Dataset for Autonomous Driving.**[2020 CVPR]
1001 |
1002 | *Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom*
1003 |
1004 | [[PDF](http://openaccess.thecvf.com/content_CVPR_2020/html/Caesar_nuScenes_A_Multimodal_Dataset_for_Autonomous_Driving_CVPR_2020_paper.html)]
1005 |
1006 | **Automated Flower Classification over a Large Number of Classes.**[20th Jan, 2009]
1007 |
1008 | *Maria-Elena Nilsback, Andrew Zisserman*
1009 |
1010 | [[PDF](https://ieeexplore.ieee.org/abstract/document/4756141/)]
1011 |
1012 | **MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos.**[20th Jun, 2016]
1013 |
1014 | *Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency*
1015 |
1016 | [[PDF](https://arxiv.org/abs/1606.06259)]
1017 |
1018 | **Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition.**[18th Jan, 2020]
1019 |
1020 | *Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen*
1021 |
1022 | [[PDF](https://ieeexplore.ieee.org/abstract/document/9320240/)]
1023 |
1024 | **The MIT Stata Center dataset.** [2013]
1025 |
1026 | *Maurice Fallon, Hordur Johannsson, Michael Kaess and John J Leonard*
1027 |
1028 | [[PDF](https://journals.sagepub.com/doi/pdf/10.1177/0278364913509035)]
1029 |
1030 | **data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.**[2022 ICML]
1031 |
1032 | *Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli*
1033 |
1034 | [[PDF](https://proceedings.mlr.press/v162/baevski22a.html)]
1035 |
1036 | **FLAVA: A Foundational Language and Vision Alignment Model.**[2022 CVPR]
1037 |
1038 | *Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela*
1039 |
1040 | [[PDF](http://openaccess.thecvf.com/content/CVPR2022/html/Singh_FLAVA_A_Foundational_Language_and_Vision_Alignment_Model_CVPR_2022_paper.html)]
1041 |
1042 | **UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training.**[2021 CVPR]
1043 |
1044 | *Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu*
1045 |
1046 | [[PDF](http://openaccess.thecvf.com/content/CVPR2021/html/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.html)]
1047 |
1048 | # Citation
1049 | If you find the listing and survey useful for your work, please cite the paper:
1050 | ```
1051 | @article{manzoor2023multimodality,
1052 |   title={Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications},
1053 |   author={Manzoor, Muhammad Arslan and Albarri, Sarah and Xian, Ziting and Meng, Zaiqiao and Nakov, Preslav and Liang, Shangsong},
1054 |   journal={arXiv preprint arXiv:2302.00389},
1055 |   year={2023}
1056 | }
1057 | ```
1058 |
--------------------------------------------------------------------------------