├── README.md ├── paper-notes.md └── summary.md /README.md: -------------------------------------------------------------------------------- 1 | # Multi-Modal-Dialgoue-System-Paperlist 2 | 3 | This is a paper list for the multimodal dialogue systems topic. 4 | 5 | **Keyword**: Multi-modal, Dialogue system, visual, conversation 6 | 7 | # Paperlist 8 | 9 | ## Dataset & Challenges 10 | 11 | ### Images 12 | 13 | (1) [**Visual QA**](https://visualqa.org/workshop.html) VQA datasets in CVPR2021,2020,2019,..., containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. 14 | - [VQA](https://visualqa.org/challenge) datasets [1.0](http://arxiv.org/abs/1505.00468) [2.0](https://arxiv.org/abs/1612.00837) 15 | - [TextVQA](https://textvqa.org/paper) TextVQA requires models to read and reason about text in an image to answer questions based on them. In order to perform well on this task, models need to first detect and read text in the images. Models then need to reason about this to answer the question. 16 | - [TextCap](https://arxiv.org/abs/2003.12462) TextCaps requires models to read and reason about text in images to generate captions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it and visual content in the image to generate image descriptions. 17 | - Issues: 18 | - visual-explainable: the model should rely on the right visual regions when making decisions 19 | - question-sensitive: the model should be sensitive to linguistic variations in the question 20 | - reduce language biases: the model should not take the language shortcut and answer the question without looking at the image 21 | - Further Papers (too many) 22 | - cross-modal interaction / fusion 24 | - [Bottom-up and top-down attention for image captioning and visual question answering](https://arxiv.org/abs/1707.07998) in CVPR2018, winner of the 2017 Visual Question Answering challenge 25 | - [Multimodal Neural Graph Memory Networks for Visual Question Answering](https://www.aclweb.org/anthology/2020.acl-main.643.pdf) ACL2020, visual features + encoded region-grounded captions (of object attributes and their relationships) = two graph nets which compute a question-guided contextualized representation for each; the updated representations are then written to an external spatial memory (the paper does not make the role of this memory very clear to me). 
26 | - [Cross-Modality Relevance for Reasoning on Language and Vision](https://www.aclweb.org/anthology/2020.acl-main.683.pdf) in ACL2020 27 | - [Hypergraph Attention Networks for Multimodal Learning](https://bi.snu.ac.kr/~btzhang/selected_papers/CVPR2020_ESKimKOHZ.pdf) CVPR2020 28 | - [Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?](https://www.aclweb.org/anthology/D16-1092/) EMNLP2016 29 | - [Multi-level Attention Networks for Visual Question Answering](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/Multi-level-Attention-Networks-for-Visual-Question-Answering.pdf) CVPR2017 30 | - [Hierarchical Question-Image Co-Attention for Visual Question Answering](https://arxiv.org/abs/1606.00061) CVPR2016 31 | - vision-language pretraining / representation learning 32 | - [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557.pdf) arXiv2019, grounds elements of language to image regions with self-attention 33 | - [ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/abs/1908.02265) NeurIPS2019 34 | - [VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/abs/1908.08530) [[Code](https://github.com/jackroos/VL-BERT)] ICLR2020 35 | - [VinVL: Making Visual Representations Matter in Vision-Language Models](https://arxiv.org/abs/2101.00529) [[Code](https://github.com/microsoft/Oscar)] CVPR2021 36 | - [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) [[Code](https://github.com/dandelin/vilt)] ICML 2021 37 | - [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315) [[Code](https://github.com/facebookresearch/vilbert-multi-task)] CVPR2020 38 | - [Unified Vision-Language Pre-Training for Image Captioning and VQA](https://arxiv.org/abs/1909.11059) [[Code](https://github.com/LuoweiZhou/VLP)] AAAI2020 39 | - [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://www.aclweb.org/anthology/D19-1514.pdf) [[Code](https://github.com/airsplay/lxmert)] EMNLP2019 40 | - [Adaptive Transformers for Learning Multimodal Representations](https://www.aclweb.org/anthology/2020.acl-srw.1.pdf) [[Code](https://github.com/prajjwal1/adaptive_transformer)] SRW ACL2020 41 | - [Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer](https://www.aclweb.org/anthology/2020.acl-main.306.pdf) [[Data Code](https://github.com/jefferyYu/UMT)] ACL2020 42 | - Language prior issue 43 | - [AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss](https://arxiv.org/pdf/2105.01993.pdf) approaches the problem from a feature-space learning perspective (rather than as a classification task) 44 | - [Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering](https://arxiv.org/abs/1612.00837) CVPR2017; VQA 2.0 was also built to balance language priors against the images 45 | - [Self-Critical Reasoning for Robust Visual Question Answering](https://arxiv.org/pdf/1905.09998.pdf) NeurIPS2019 46 | - [Overcoming Language Priors in Visual Question Answering with Adversarial Regularization](https://arxiv.org/pdf/1810.03649.pdf) NeurIPS2018, question-only model 47 | - [RUBi: Reducing Unimodal Biases in Visual Question Answering](https://arxiv.org/pdf/1906.10169.pdf) NeurIPS2019, also a question-only model 48 | - [Don't Just Assume; Look and Answer: 
Overcoming Priors for Visual Question Answering](https://arxiv.org/abs/1712.00377) [[Code](https://github.com/AishwaryaAgrawal/GVQA)] CVPR2018 49 | - [Counterfactual VQA: A Cause-Effect Look at Language Bias](https://arxiv.org/abs/2006.04315) [[Code](https://github.com/yuleiniu/cfvqa)] CVPR2021 50 | - [Counterfactual Vision and Language Learning](https://openaccess.thecvf.com/content_CVPR_2020/papers/Abbasnejad_Counterfactual_Vision_and_Language_Learning_CVPR_2020_paper.pdf) CVPR2020 51 | - Visual-explainable issue 52 | - [Counterfactual Samples Synthesizing for Robust Visual Question Answering](http://arxiv.org/pdf/2003.06576) CVPR2020 53 | - [Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering](https://www.aclweb.org/anthology/2020.emnlp-main.265.pdf) EMNLP2020 54 | - [Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision](https://arxiv.org/pdf/2004.09034.pdf) ECCV2020, leveraging an overlooked supervisory signal found in existing datasets to improve generalization capabilities 55 | - [Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention](https://arxiv.org/abs/1902.05715) arXiv2019 56 | - [Towards Transparent AI Systems: Interpreting Visual Question Answering Models](https://arxiv.org/abs/1608.08974) 2016 57 | - object relation reasoning / visual understanding / cross-modal / Graphs 58 | - [MUREL: Multimodal Relational Reasoning for Visual Question Answering](http://arxiv.org/pdf/1902.09487) CVPR2019, [[Code](github.com/Cadene/murel.bootstrap.pytorch)], represents and refines interactions between question words and image regions, finer-grained than attention maps 59 | - [CRA-Net: Composed Relation Attention Network for Visual Question Answering](https://dl.acm.org/doi/10.1145/3343031.3350925) ACM2019, object relation reasoning: attention should look at both visual (features, spatial) and linguistic (in-question) features (the full text seems to be paywalled?) 
60 | - [Hierarchical Graph Attention Network for Visual Relationship Detection](https://openaccess.thecvf.com/content_CVPR_2020/papers/Mi_Hierarchical_Graph_Attention_Network_for_Visual_Relationship_Detection_CVPR_2020_paper.pdf) CVPR2020, object-level graph: (1) woman (sit on) bench, (2) woman (in front of) water; triplet-level graph: relation between triplet (1) and triplet (2) 61 | - [Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations](https://ieeexplore.ieee.org/ielx7/6287639/6514899/09387302.pdf) IEEE2021, relational visual-linguistic BERT 62 | - [Relation-Aware Graph Attention Network for Visual Question Answering](http://arxiv.org/pdf/1903.12314) ICCV2019, explicit relations of geometric positions and semantic interactions between objects, implicit relations of hidden dynamics between image regions 63 | - [Fusion of Detected Objects in Text for Visual Question Answering](https://arxiv.org/abs/1908.05054) EMNLP2019 64 | - [GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering](https://arxiv.org/abs/2104.10283) arXiv2021 65 | - [A Simple Baseline for Visual Commonsense Reasoning](https://vigilworkshop.github.io/static/papers/34.pdf) ViGil@NeurIPS2019 66 | - [Learning Conditioned Graph Structures for Interpretable Visual Question Answering](https://arxiv.org/abs/1806.07243) [[Code](https://github.com/aimbrain/vqa-project)] NeurIPS2018 67 | - [Graph-Structured Representations for Visual Question Answering](https://arxiv.org/abs/1609.05600) CVPR2017 68 | - [R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering](https://arxiv.org/abs/1805.09701) [[Code](https://github.com/lupantech/rvqa)] ACM KDD2018 69 | - Knowledge / cross-modal fusion / Graphs 70 | - [Towards Knowledge-Augmented Visual Question Answering](https://www.aclweb.org/anthology/2020.coling-main.169.pdf) Coling2020, captures the interactions between objects in a visual scene and entities in an external knowledge source, with many, many graphs ... 71 | - [ConceptBert: Concept-Aware Representation for Visual Question Answering](https://www.aclweb.org/anthology/2020.findings-emnlp.44.pdf) EMNLP2020, learns a joint Concept-Vision-Language embedding (maybe similar to [[this paper](https://openreview.net/references/pdf?id=Uhl6chXANP)] in the way it adds an "entity embedding"?) 72 | - [Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks](https://arxiv.org/abs/1712.00733) 2017 73 | - text in the image (TextCap & TextVQA) 74 | - [Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text](http://arxiv.org/pdf/2003.13962) [[Code](https://github.com/ricolike/mmgnn_textvqa)] CVPR2020, the printed text on a bottle is the brand of the drink ==> the graph representation of the image should have sub-graphs and respective aggregators to pass messages among graphs (I am not sure I am describing this correctly) 
75 | - [Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval](https://arxiv.org/pdf/2009.09809.pdf) arXiv2020, common semantic space between salient objects and text found in an image 76 | - [Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps](https://arxiv.org/pdf/2012.05153.pdf) arXiv2020, a simple attention mechanism is good enough 77 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://tanmingkui.github.io/files/publications/Cascade.pdf) ACM2020, 1) which information is useful, 2) the question relates to text but also to visual concepts, so how to capture cross-modal relationships, 3) what if OCR fails 78 | - [TAP: Text-Aware Pre-training for Text-VQA and Text-Caption](https://arxiv.org/pdf/2012.04638.pdf) arXiv2020, incorporates OCR-generated text in pre-training 79 | - [Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/abs/1911.06258) arXiv2020 80 | - multi-task 81 | - [Answer Them All! Toward Universal Visual Question Answering Models](https://arxiv.org/abs/1903.00366) CVPR2019 82 | - [A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks](https://www.aclweb.org/anthology/2020.acl-main.440.pdf) ACL2020 83 | - [Visual Question Answering as a Multi-Task Problem](https://arxiv.org/pdf/2007.01780.pdf) arXiv2020 84 | 85 | 86 | (2) [**Visual Dialog**](https://visualdialog.org/) CVPR 2017, open-domain dialogs: given an image, a dialog history, and a follow-up question about the image, the task is to answer the question. 87 | - [VisDial v1.0 dataset](https://visualdialog.org/data) [[Paper](https://arxiv.org/abs/1611.08669)] [[Source Code to collect chat data](https://github.com/batra-mlp-lab/visdial-amt-chat)] 88 | - Further papers 89 | - reasoning 90 | - [KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue](http://arxiv.org/pdf/2008.04858) ACM2020, here knowledge = text knowledge & vision knowledge; encoding (T2V graph & V2T graph), then bridging (update graph nodes), then storing, then retrieving (via an adaptive information selection mode) 91 | - [Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog](https://www.aclweb.org/anthology/P19-1648.pdf) ACL2019, iteratively refines the question's representation based on the image and dialog history 92 | - [Recursive visual attention in visual dialog](https://arxiv.org/abs/1812.02664) CVPR2019 [[Code](https://github.com/yuleiniu/rva)] 93 | - [DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog](https://aaai.org/ojs/index.php/AAAI/article/view/6248) AAAI2020 94 | - [Visual Reasoning with Multi-hop Feature Modulation](https://arxiv.org/abs/1808.04446) [[Code](https://github.com/ethanjperez/film)] ECCV2018 95 | - [VisualCOMET: Reasoning About the Dynamic Context of a Still Image](https://arxiv.org/abs/2004.10796) [[Code](https://github.com/jamespark3922/visual-comet)] ECCV2020 96 | - understanding 97 | - [DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue](https://arxiv.org/pdf/1911.07251.pdf) [[Code](https://github.com/JXZe/DualVD)] AAAI2020 98 | - [Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9247486) [[Code](https://github.com/JXZe/Learning_DualVD)] IEEE2021 99 | - [Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision](https://www.aclweb.org/anthology/2020.emnlp-main.162) 
[[Code](https://github.com/airsplay/vokenization)] EMNLP2020 100 | - coreference 101 | - [Modeling Coreference Relations in Visual Dialog](https://www.aclweb.org/anthology/2021.eacl-main.290) [[Code](https://github.com/facebookresearch/corefnmn)] EACL2021 102 | - [What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues](https://www.aclweb.org/anthology/D19-1516) [[Data Code](https://github.com/HKUST-KnowComp/Visual_PCR)] EMNLP2019 103 | - reference 104 | - [Dual Attention Networks for Visual Reference Resolution in Visual Dialog](https://www.aclweb.org/anthology/D19-1209.pdf) [[Code](https://github.com/gicheonkang/DAN-VisDial)] EMNLP2019 105 | - [Visual Reference Resolution using Attention Memory for Visual Dialog](http://papers.neurips.cc/paper/6962-visual-reference-resolution-using-attention-memory-for-visual-dialog.pdf) NIPS2017 106 | - [Referring Expression Generation via Visual Dialogue](https://github.com/llxuan/ReferWhat) NLPCC2020 107 | - cross-modal / fusion / joint / dual ... 108 | - [Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog](https://arxiv.org/abs/1911.11390) ECCV2019 109 | - [Image-Question-Answer Synergistic Network for Visual Dialog](https://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_Image-Question-Answer_Synergistic_Network_for_Visual_Dialog_CVPR_2019_paper.pdf) CVPR2019 110 | - [DialGraph: Sparse Graph Learning Networks for Visual Dialog](https://arxiv.org/abs/2004.06698) arXiv 111 | - [All-in-One Image-Grounded Conversational Agents](https://arxiv.org/abs/1912.12394) arXiv2019 112 | - [Visual-Textual Alignment for Graph Inference in Visual Dialog](https://www.aclweb.org/anthology/2020.coling-main.170.pdf) Coling2020 113 | - [Connecting Language and Vision to Actions](https://www.aclweb.org/anthology/P18-5004) ACL2018 114 | - [Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries](https://arxiv.org/abs/1711.06370) [[Code](https://github.com/bohanzhuang/Parallel-Attention-A-Unified-Framework-for-Visual-Object-Discovery-through-Dialogs-and-Queries)] CVPR2018 115 | - [Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems](https://dl.acm.org/doi/fullHtml/10.1145/3308558.3313598) WWW2019 116 | - [Reactive Multi-Stage Feature Fusion for Multimodal Dialogue Modeling](https://arxiv.org/abs/1908.05067) 2019 117 | - [Two Causal Principles for Improving Visual Dialog](https://openaccess.thecvf.com/content_CVPR_2020/papers/Qi_Two_Causal_Principles_for_Improving_Visual_Dialog_CVPR_2020_paper.pdf) [[Code](https://github.com/simpleshinobu/visdial-principles)] CVPR2020 118 | - [Learning Cross-modal Context Graph for Visual Grounding](https://arxiv.org/abs/1911.09042) [[Code](https://github.com/youngfly11/LCMCG-PyTorch)] AAAI2020 119 | - [Multi-View Attention Networks for Visual Dialog](https://arxiv.org/abs/2004.14025) [[Code](https://github.com/taesunwhang/MVAN-VisDial)] arXiv2020 120 | - [Efficient Attention Mechanism for Visual Dialog that Can Handle All the Interactions Between Multiple Inputs](https://arxiv.org/abs/1911.11390) [[Code](https://github.com/davidnvq/visdial)] ECCV2020 121 | - [Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features](https://arxiv.org/abs/1808.00171) [[Code in TF](https://github.com/yangxuntu/vrd)] ECCV2018 122 | - use dialog history / user guided 123 | - [Making History Matter: History-Advantage Sequence Training for Visual 
Dialog](https://arxiv.org/abs/1902.09326) ICCV2019 124 | - [User Attention-guided Multimodal Dialog Systems](https://xuemengsong.github.io/sigir2019_umd.pdf) [[Code](https://github.com/ChenTsuei/UMD)] SIGIR2019 125 | - [History for Visual Dialog: Do we really need it?](https://www.aclweb.org/anthology/2020.acl-main.728) ACL2020 126 | - [Integrating Historical States and Co-attention Mechanism for Visual Dialog](https://ieeexplore.ieee.org/document/9412629/) ICPR2021 127 | - knowledge 128 | - [The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents](https://www.aclweb.org/anthology/2020.acl-main.222) ACL2020 129 | - [Knowledge-aware Multimodal Dialogue Systems](http://staff.ustc.edu.cn/~hexn/papers/mm18-multimodal-dialog.pdf) ACM2018 130 | - [A Knowledge-Grounded Multimodal Search-Based Conversational Agent](https://www.aclweb.org/anthology/W18-5709) [[Code](https://github.com/shubhamagarwal92/mmd), at last a released code for a "knowledge" / "graph" approach] SCAI@EMNLP2018 131 | - modality bias 132 | - [Modality-Balanced Models for Visual Dialogue](https://www.aaai.org/Papers/AAAI/2020GB/AAAI-KimH.8168.pdf) AAAI2020 133 | - [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) [[Code](https://github.com/facebookresearch/deit)] arXiv2020 134 | - [Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning](https://www.aclweb.org/anthology/2020.emnlp-main.444.pdf) EMNLP2020 135 | - [Visual Dialogue without Vision or Dialogue](https://arxiv.org/abs/1812.06417) 2018 136 | - [Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision](https://www.aclweb.org/anthology/2020.findings-emnlp.248) EMNLP Findings 2020 137 | - [Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models](https://arxiv.org/abs/2104.08666) 2021 138 | - pretraining / representation learning / bertology 139 | - [VD-BERT: A Unified Vision and Dialog Transformer with BERT](https://www.aclweb.org/anthology/2020.emnlp-main.269) [[Code](https://github.com/salesforce/VD-BERT)] EMNLP2020 140 | - [Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123630324.pdf) [[Code](https://github.com/vmurahari3/visdial-bert)] ECCV2020 141 | - [Kaleido-BERT: Vision-Language Pre-training on Fashion Domain](https://arxiv.org/abs/2103.16110) [[Code](https://github.com/mczhuge/Kaleido-BERT)] arXiv2021 142 | - [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750120.pdf) [[Code](https://github.com/microsoft/Oscar)] ECCV2020 143 | - [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315) [[Code](https://github.com/facebookresearch/vilbert-multi-task)] CVPR2020 144 | - [Large-Scale Adversarial Training for Vision-and-Language Representation Learning](https://arxiv.org/abs/2006.06195) [[Code](https://github.com/zhegan27/VILLA)] NeurIPS 2020 145 | - [Integrating Multimodal Information in Large Pretrained Transformers](https://www.aclweb.org/anthology/2020.acl-main.214.pdf) [[Code](https://github.com/WasifurRahman/BERT_multimodal_transformer)] ACL2020 146 | - Generative dialogue / diverse 147 | - [Improving Generative Visual Dialog by Answering Diverse Questions](https://arxiv.org/pdf/1909.10470.pdf) EMNLP 2019, [[Code](https://github.com/vmurahari3/visdial-diversity)] 148 | - [Visual 
Dialogue State Tracking for Question Generation](https://arxiv.org/abs/1911.07928) [[Code is in the series of guesswhat guesswhich visdial](https://github.com/xubuvd/guesswhat)] AAAI2020 149 | - [MultiDM-GCN: Aspect-Guided Response Generation in Multi-Domain Multi-Modal Dialogue System using Graph Convolution Network](https://www.aclweb.org/anthology/2020.findings-emnlp.210) EMNLP2020 150 | - [Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model](https://arxiv.org/abs/1706.01554) [[Code](https://github.com/jiasenlu/visDial.pytorch)] NIPS2017 151 | - [FLIPDIAL: A Generative Model for Two-Way Visual Dialogue](https://arxiv.org/abs/1802.03803) CVPR2018 152 | - [DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue](https://www.ijcai.org/Proceedings/2020/96) IJCAI2020 [[Code soon](https://github.com/JXZe/DAM)] 153 | - [More to diverse: Generating diversified responses in a task oriented multimodal dialog system](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0241271) 2020 154 | - [Multimodal Dialog System: Generating Responses via Adaptive Decoders](https://liqiangnie.github.io/paper/fp349-nieAemb.pdf) ACM2019 155 | - [Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation](https://www.aclweb.org/anthology/I17-1047) IJCNLP2017 156 | - [Multimodal Differential Network for Visual Question Generation](https://www.aclweb.org/anthology/D18-1434.pdf) [[Code](https://github.com/badripatro/MDN-VQG)] EMNLP2018 157 | - [Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation](https://www.ijcai.org/Proceedings/2019/0144.pdf) 2019 158 | - [Aspect-Aware Response Generation for Multimodal Dialogue System](https://www.aclweb.org/anthology/2020.findings-emnlp.210.pdf) ACM 2021 159 | - [An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games](https://arxiv.org/abs/2102.00424) EACL2021 160 | - Adversarial training 161 | - [The World in My Mind: Visual Dialog with Adversarial Multi-modal Feature Encoding](https://www.aclweb.org/anthology/N19-1266) NAACL2019 162 | - [Mind Your Language: Learning Visually Grounded Dialog in a Multi-Agent Setting](https://agakshat.github.io/assets/documents/ALA18_CameraReady.pdf) 2018 163 | - [GADE: A Generative Adversarial Approach to Density Estimation and its Applications](https://openaccess.thecvf.com/content_CVPR_2019/papers/Abbasnejad_A_Generative_Adversarial_Density_Estimator_CVPR_2019_paper.pdf) IJCV2020 164 | - RL 165 | - [Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning](https://arxiv.org/abs/1703.06585) ICCV2017 oral, [[Code](https://github.com/batra-mlp-lab/visdial-rl)] 166 | - [Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog](https://arxiv.org/abs/1805.03257) SIGDIAL 2018 167 | - [Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space](https://arxiv.org/abs/1901.09854) ICMR2019 168 | - [Recurrent Attention Network with Reinforced Generator for Visual Dialog](https://dl.acm.org/doi/10.1145/3390891) ACM 2020 169 | - linguistic / probabilistic 170 | - [A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions](https://www.aclweb.org/anthology/2020.findings-emnlp.67) EMNLP finds 2020 171 | - [Probabilistic framework for solving Visual 
Dialog](https://arxiv.org/abs/1909.04800) PR 2020 172 | - [Learning Goal-Oriented Visual Dialog Agents: Imitating and Surpassing Analytic Experts](https://arxiv.org/abs/1907.10500) IEEE2019 173 | 174 | 175 | 176 | 177 | (3) [CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog](https://www.aclweb.org/anthology/N19-1058) NAACL2019, [[code](https://github.com/satwikkottur/clevr-dialog)] 178 | - Further paper 179 | - [VQA With No Questions-Answers Training](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9157617) CVPR2020 180 | - [Domain-robust VQA with diverse datasets and methods but no target labels](https://arxiv.org/pdf/2103.15974.pdf) arXiv2021 181 | - [Scene Graph based Image Retrieval - A case study on the CLEVR Dataset](https://arxiv.org/abs/1911.00850) 2019 182 | 183 | (4) Open-domain: 184 | - [OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts](https://arxiv.org/abs/2012.15015) 185 | - [The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue](https://www.aclweb.org/anthology/P19-1184) ACL2019 186 | - [A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking](https://www.aclweb.org/anthology/2020.lrec-1.518) LREC2020 187 | - Papers 188 | - [Multi-Modal Open-Domain Dialogue](https://arxiv.org/abs/2010.01082) 2020 189 | - [Open Domain Dialogue Generation with Latent Images](https://arxiv.org/abs/2004.01981) 2020 190 | - [Image-Chat: Engaging Grounded Conversations] ACL2020 191 | - [The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents](https://www.aclweb.org/anthology/2020.acl-main.222.pdf) ACL2020 192 | 193 | (?) sentiment 194 | - [MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations](https://www.aclweb.org/anthology/2020.coling-main.393) Coling2020 195 | - [Bridging Dialogue Generation and Facial Expression Synthesis](https://arxiv.org/abs/1905.11240) 2019 196 | 197 | (5) Task/Goal-oriented: 198 | - [CRWIZ: A Framework for Crowdsourcing Real-Time Wizard-of-Oz Dialogues](https://www.aclweb.org/anthology/2020.lrec-1.36/) LREC2020 199 | - [A Corpus for Reasoning About Natural Language Grounded in Photographs](https://www.aclweb.org/anthology/P19-1644) ACL2019 200 | - [CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication](https://www.aclweb.org/anthology/P19-1651) [[Data](https://github.com/facebookresearch/CoDraw)] ACL2019 201 | - [AirDialogue: An Environment for Goal-Oriented Dialogue Research](https://www.aclweb.org/anthology/D18-1419/) EMNLP2018 202 | - [ReferIt](http://tamaraberg.com/referitgame/) [[paper](http://tamaraberg.com/papers/referit.pdf)] in EMNLP2014, 2-players game of refer & label 203 | - Papers 204 | - [Answerer in Questioner's Mind for Goal-Oriented Visual Dialogue](http://papers.neurips.cc/paper/7524-answerer-in-questioners-mind-information-theoretic-approach-to-goal-oriented-visual-dialog.pdf) [[Code](https://github.com/naver/aqm-plus)] NeurIPS 2018 205 | - [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) IJCAI2017 206 | - [Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient and Code](https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient) IEEE2018 207 | - [Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue](https://arxiv.org/abs/2010.00361) [[Code](https://github.com/zipengxuc/ADVSE-GuessWhat)] ACM MM 2020 208 | - 
[Building Task-Oriented Visual Dialog Systems Through Alternative Optimization Between Dialog Policy and Language Generation](https://www.aclweb.org/anthology/D19-1014) EMNLP2019 209 | - [Storyboarding of Recipes: Grounded Contextual Generation](https://www.aclweb.org/anthology/P19-1606) [[Script Data](https://github.com/khyathiraghavi/storyboarding_data)] DGS@ICLR2019 210 | - [Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Language Reasoning](https://arxiv.org/abs/1812.06398) CVPR2020 211 | 212 | (6) evaluation 213 | - [A Revised Generative Evaluation of Visual Dialogue](https://arxiv.org/abs/2004.09272) [[Code](https://github.com/danielamassiceti/geneval_visdial)] arXiv2020 214 | - [Evaluating Visual Conversational Agents via Cooperative Human-AI Games](https://arxiv.org/abs/1708.05122) [[Code for GuessWhich](https://github.com/GT-Vision-Lab/GuessWhich)] 2017 215 | - [The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues](https://arxiv.org/abs/2103.11151) EACL2021 216 | 217 | 218 | (7) classification 219 | - [**GuessWhat?!**](https://openaccess.thecvf.com/content_cvpr_2017/html/de_Vries_GuessWhat_Visual_Object_CVPR_2017_paper.html) Visual Object Discovery Through Multi-Modal Dialogue in CVPR2017, a two-player guessing game (1 oracle & 1 questioner). 220 | - [[Code]](https://github.com/GuessWhatGame/guesswhat) 221 | - Further papers 222 | - [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) Reinforcement Learning applied to GuessWhat?! 223 | - [Guessing State Tracking for Visual Dialogue](https://github.com/GT-Vision-Lab/GuessWhich) ECCV2020 224 | - [Language-Conditioned Feature Pyramids for Visual Selection Tasks] EMNLP2020 [[Code](https://github.com/Alab-NII/lcfp)] 225 | - [Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat](https://arxiv.org/abs/1809.03408) NAACL2019 226 | - [Interactive Classification by Asking Informative Questions](https://www.aclweb.org/anthology/2020.acl-main.237) [[Code](https://github.com/asappresearch/interactive-classification)] ACL2020 227 | 228 | 229 | (?) 
Others 230 | - [Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts](https://www.aclweb.org/anthology/2020.acl-main.425.pdf) ACL2020 231 | - [How2: A Large-scale Dataset for Multimodal Language Understanding](https://arxiv.org/abs/1811.00347) [[Data](https://srvk.github.io/how2-dataset/)] NIPS2018 232 | 233 | 234 | (8) [Image caption] generating natural language descriptions of an image 235 | - [MS COCO dataset 2014](https://arxiv.org/pdf/1405.0312.pdf) Images + captions (each image is paired with several human-written caption sentences) 236 | - Further papers 237 | - Featurize images as a whole / as regions (early approaches): 238 | - [Deep visual-semantic alignments for generating image descriptions](https://cs.stanford.edu/people/karpathy/deepimagesent/devisagen.pdf) CVPR2015 239 | - [Densecap: Fully convolutional localization networks for dense captioning](https://cs.stanford.edu/people/karpathy/densecap/) CVPR2016 240 | - Attention based approaches: 241 | - [Bottom-up and top-down attention for image captioning and visual question answering](https://arxiv.org/abs/1707.07998) in CVPR2018, winner of the 2017 Visual Question Answering challenge 242 | - [Show, attend and tell: Neural image caption generation with visual attention](https://arxiv.org/abs/1502.03044) in ICML2015 243 | - [Review networks for caption generation](https://arxiv.org/abs/1605.07912) NIPS2016 244 | - [Image captioning with semantic attention](https://arxiv.org/abs/1603.03925) CVPR2016 245 | - Graph structured approaches: 246 | - [Exploring Visual Relationship for Image Captioning](https://arxiv.org/abs/1809.07041) ECCV2018 247 | - [Auto-encoding scene graphs for image captioning](https://arxiv.org/abs/1812.02378) CVPR2019 248 | - Reinforcement learning: 249 | - [Context-aware visual policy network for sequence-level image captioning](https://arxiv.org/abs/1808.05864) ACM2018 250 | - [Self-critical sequence training for image captioning](https://arxiv.org/abs/1612.00563) 251 | - Transformer based: 252 | - [Image captioning: transform objects into words](https://papers.nips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf) in NIPS2019, using Transformers focusing on objects and their spatial relationships 253 | - [Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning](https://www.aclweb.org/anthology/P18-1238.pdf) in ACL2018, also a dataset 254 | - [Improving Image Captioning with Better Use of Caption](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) ACL2020 255 | - [Improving Image Captioning Evaluation by Considering Inter References Variance](https://www.aclweb.org/anthology/2020.acl-main.93.pdf) ACL2020 256 | 257 | 258 | (9) Navigation task 259 | - [Talk the walk: Navigating new york city through grounded dialogue](https://arxiv.org/pdf/1807.03367.pdf) 260 | - [A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses] EMNLP2020 261 | - navigating 262 | - [Improving Vision-and-Language Navigation with Image-Text Pairs from the Web] ECCV2020 263 | - [Diagnosing Vision-and-Language Navigation: What Really Matters] arXiv2021 264 | - [Vision-Dialog Navigation by Exploring Cross-Modal Memory] CVPR2020 265 | - [Vision-and-Dialog Navigation] CoRL 2019 266 | - [Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments] CVPR2018 267 | - [Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters] 2020 268 | - [Stay on the Path: Instruction Fidelity in 
Vision-and-Language Navigation] ACL2019 269 | - [Active Visual Information Gathering for Vision-Language Navigation] ECCV2020 270 | - [Environment-agnostic Multitask Learning for Natural Language Grounded Navigation] ECCV2020 271 | - [Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation] 2019 272 | - [Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation] CVPR2019 273 | - [Engaging Image Chat: Modeling Personality in Grounded Dialogue] 2018 274 | - [TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments] CVPR2019 275 | - [Multi-modal Discriminative Model for Vision-and-Language Navigation] 2019 276 | - [REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments] CVPR 2020 277 | - [Learning To Follow Directions in Street View] AAAI2020 278 | - [Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning] ViGil@NeurIPS2019 279 | - representation learning 280 | - [A Recurrent Vision-and-Language BERT for Navigation](https://arxiv.org/abs/2011.13922) [[Code](https://github.com/YicongHong/Recurrent-VLN-BERT)] CVPR2021 281 | - [Transferable Representation Learning in Vision-and-Language Navigation](https://arxiv.org/abs/1908.03409) ICCV2019 282 | - Grounding 283 | - [Words Aren't Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions](https://www.aclweb.org/anthology/2020.acl-main.586.pdf) ACL2020 284 | - [Grounding Conversations with Improvised Dialogues](https://www.aclweb.org/anthology/2020.acl-main.218.pdf) ACL2020 285 | - [A negative case analysis of visual grounding methods for VQA](https://www.aclweb.org/anthology/2020.acl-main.727.pdf) ACL2020 286 | - [Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms](https://www.aclweb.org/anthology/2020.acl-main.584.pdf) ACL2020 287 | - [Where Are You? 
Localization from Embodied Dialog](https://www.aclweb.org/anthology/2020.emnlp-main.59.pdf) [[Code](https://github.com/batra-mlp-lab/WAY)] EMNLP2020 288 | - [Visual Referring Expression Recognition: What Do Systems Actually Learn?](https://www.aclweb.org/anthology/N18-2123) NAACL2018 289 | - [Ask No More: Deciding when to guess in referential visual dialogue](https://www.aclweb.org/anthology/C18-1104) coling2018 290 | - [Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts](https://www.aclweb.org/anthology/2020.emnlp-main.353) EMNLP2020 291 | - [Achieving Common Ground in Multi-modal Dialogue](https://www.aclweb.org/anthology/2020.acl-tutorials.3) ACL2020 292 | 293 | (10) retrieval task 294 | - image retrieval / visual retrieval 295 | - [Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval](https://openaccess.thecvf.com/content_CVPRW_2020/papers/w56/Parcalabescu_Exploring_Phrase_Grounding_Without_Training_Contextualisation_and_Extension_to_Text-Based_CVPRW_2020_paper.pdf) CVPRW2020 296 | - [Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment](https://www.aclweb.org/anthology/2020.alvr-1.2) [[Code](https://github.com/videoturingtest/alvr-ESA), at last a released code for a graph-based approach] ALVR2020 297 | - [Dialog-based Interactive Image Retrieval](https://arxiv.org/abs/1805.00145) [[Code Fashion retrieval](https://github.com/XiaoxiaoGuo/fashion-retrieval)] NeurIPS2018 298 | - [I Want This Product but Different: Multimodal Retrieval with Synthetic Query Expansion](https://arxiv.org/abs/2102.08871) 2021 299 | - [Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers](https://arxiv.org/abs/2103.16553) 2021 300 | 301 | 302 | (11) image editing / text-to-image 303 | - [Sequential Attention GAN for Interactive Image Editing] ACM2020 304 | - [Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction] ICCV2019 305 | - [ChatPainter: Improving Text to Image Generation using Dialogue] ICLR2018 306 | - [Adversarial Text-to-Image Synthesis: A Review] 2021 307 | - [A Multimodal Dialogue System for Conversational Image Editing] 2020 308 | 309 | (12) Fashion 🌟🌟🌟 ----F-a-s-h-i-o-n---- 310 | - [**SIMMC**](https://github.com/facebookresearch/simmc) - Domains include furniture and fashion 🌟🌟🌟; it can be seen as a variant of [multiWOZ](https://github.com/budzianowski/multiwoz) or of the [schema guided dialogue dataset](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue#scheme-representation%5D) 311 | - [Situated and Interactive Multimodal Conversations](https://www.aclweb.org/anthology/2020.coling-main.96) Coling2020 [[SIMMC 1.0](https://arxiv.org/abs/2006.01460)], [[SIMMC 2.0](https://arxiv.org/pdf/2104.08667.pdf)], track in [DSTC9](https://dstc9.dstc.community/home) and [DSTC10](https://sites.google.com/dstc.community/dstc10/tracks) 312 | - [[Code](https://github.com/facebookresearch/simmc)] 313 | - Further papers 314 | - [A Response Retrieval Approach for Dialogue Using a Multi-Attentive Transformer](https://arxiv.org/abs/2012.08148) second winner of DSTC9 SIMMC fashion, [[code](https://github.com/D2KLab/dstc9-SIMMC)] 315 | - [Overview of the Ninth Dialog System Technology Challenge: DSTC9](https://arxiv.org/pdf/2011.06486.pdf) to better see the winners' models 316 | - [[Code winner1 TNU](https://github.com/billkunghappy/DSTC_TRACK4_ENTER)] (the code is a bit messy), [[Code winner2 SU](https://github.com/inkoon/simmc)], 
[[Code other](https://github.com/facebookresearch/simmc/blob/master/DSTC9_SIMMC_RESULTS.md)] 317 | 318 | - [**Fashion IQ**](https://sites.google.com/view/cvcreative2020/fashion-iq) in CVPR2020 workshop, [[paper](https://arxiv.org/pdf/1905.12794.pdf)] [[dataset & starter kit](https://github.com/XiaoxiaoGuo/fashion-iq)] 319 | 320 | - [**MMD** Towards Building Large Scale Multimodal Domain-Aware Conversation Systems](https://arxiv.org/abs/1704.00200), arXiv 2017, [[code](https://amritasaha1812.github.io/MMD/)], [Multimodal Dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations] 2017 321 | 322 | 323 | ### Video 324 | 325 | (13) video 326 | - [Audio Visual Scene-Aware Dialog Track in DSTC8](http://workshop.colips.org/dstc7/dstc8/Audiovisual_Scene_Aware_Dialog.pdf) [[Paper](https://ieeexplore.ieee.org/document/8953254)] [[site](https://video-dialog.com/)] 327 | - [CMU Sinbad’s Submission for the DSTC7 AVSD Challenge] 328 | - [DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator] 2020 329 | - [A Simple Baseline for Audio-Visual Scene-Aware Dialog] CVPR2019 330 | - [[TVQA](https://arxiv.org/abs/1809.01696)] [[MovieQA](http://movieqa.cs.toronto.edu/)] [[TGif-QA](https://arxiv.org/abs/1704.04497)] 331 | - [TVQA+: Spatio-Temporal Grounding for Video Question Answering](https://www.aclweb.org/anthology/2020.acl-main.730.pdf) ACL2020 332 | - [MultiSubs: A Large-scale Multimodal and Multilingual Dataset] 2021 333 | - [Adversarial Multimodal Network for Movie Question Answering] 2019 334 | - [What Makes Training Multi-Modal Classification Networks Hard?] CVPR2020 335 | - [DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue] 2021 336 | - Minecraft 337 | - [Learning to execute instructions in a Minecraft dialogue](https://www.aclweb.org/anthology/2020.acl-main.232) [[Code](https://github.com/prashant-jayan21/minecraft-bap-models)] ACL2020 338 | - [Collaborative Dialogue in Minecraft](https://www.aclweb.org/anthology/P19-1537) [[Code](https://github.com/prashant-jayan21/minecraft-dialogue)] ACL2019 339 | - video & QA/Dialog papers 340 | - representation learning 341 | - VideoBERT: A Joint Model for Video and Language Representation Learning 342 | - Learning Question-Guided Video Representation for Multi-Turn Video Question Answering ViGil@NeurIPS2019 343 | - Video Dialog via Progressive Inference and Cross-Transformer EMNLP2019 344 | - [Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems] ACL2019 345 | - [Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog] 2020 346 | - [Video-Grounded Dialogues with Pretrained Generation Language Models](https://www.aclweb.org/anthology/2020.acl-main.518.pdf) ACL2020 347 | - Graph 348 | - Location-Aware Graph Convolutional Networks for Video Question Answering 349 | - Object Relational Graph With Teacher-Recommended Learning for Video Captioning 350 | - Fusion 351 | - End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features IEEE2019 352 | - [See the Sound, Hear the Pixels] IEEE2020 353 | - [Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks] SIGIR2019 354 | - [Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks] IEEE2020 355 | - [Game-Based Video-Context Dialogue] EMNLP2018 356 | - [Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks] IEEE2019 357 | - [End-to-End Multimodal Dialog Systems 
with Hierarchical Multimodal Attention on Video Features] 2018 358 | 359 | 360 | ### Charts / figures 361 | (14) [LEAF-QA: Locate, Encode & Attend for Figure Question Answering](https://openaccess.thecvf.com/content_WACV_2020/papers/Chaudhry_LEAF-QA_Locate_Encode__Attend_for_Figure_Question_Answering_WACV_2020_paper.pdf) 362 | 363 | ### Meme 364 | (15) [MOD Meme incorporated Open Dialogue](https://anonymous.4open.science/r/e7eaef6a-b6d5-47c6-896f-93265a0af4b1/README.md) WeChat conversations with memes / stickers in the Chinese language. 365 | - A Multimodal Memes Classification: A Survey and Open Research Issues 366 | - [Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog] WWW2020 367 | - [Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog] 2020 368 | - [The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes] NeurIPS2020 369 | 370 | # Survey 371 | - [Multimodal Research in Vision and Language: A Review of Current and Emerging Trends](https://arxiv.org/pdf/2010.09522v2.pdf) arXiv2020 372 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) arXiv2021 373 | - [Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods](https://arxiv.org/abs/1907.09358) JAIR2020 374 | 375 | # Other github paperlists 376 | - [Multimodal-Dialogue-PaperList](https://github.com/silverriver/Multimodal-Dialogue-PaperList) 377 | - [Awesome-Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer) 378 | - [Transformer-in-Vision](https://github.com/DirtyHarryLYL/Transformer-in-Vision) 379 | - [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml) 380 | - [awesome-visual-question-answering](https://github.com/jokieleung/awesome-visual-question-answering) 381 | - [awesome-vqa-latest](https://github.com/Taaccoo/awesome-vqa-latest) 382 | - [awesome-visual-dialog](https://github.com/badripatro/awesome-visual-dialog) 383 | - [Awesome-Scene-Graphs](https://github.com/huoxingmeishi/Awesome-Scene-Graphs) 384 | - [awesome-vln](https://github.com/daqingliu/awesome-vln) 385 | 386 | # In general 387 | - Tasks 388 | - Visual Question Answering 389 | - Visual Dialog 390 | - Visual Commonsense Reasoning 391 | - Image-Text Retrieval 392 | - Referring Expression Comprehension 393 | - Visual Entailment 394 | - NL+V representation ==> multimodal pretraining 395 | - Issues / topics: 396 | - text and image bias 397 | - VL or LV bertology 398 | - visual understanding / reasoning / object relation 399 | - cross-modal text-image relation (attention on interaction; a minimal code sketch of this pattern is given at the end of this README) 400 | - incorporate knowledge / common sense (attention on knowledge) 401 | - Often used model elements: 402 | - [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497) 2015 403 | - LSTM 404 | - GANs 405 | - Transformers 406 | - Graphs: attention graph, GCN, memory graph, ... 407 | - often mentioned approaches: 408 | - adversarial training 409 | - reinforcement learning 410 | - graph neural network 411 | - joint learning / parallel / Dual encoder / Dual attention 412 | - my questions 413 | - what does "adaptive" mean exactly? why does everyone like this specific word? 414 | - "ground", a mysterious word too... 415 | - why is it often so hard to find released code for papers with "graph" or "reinforcement learning" in the title? 
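To make the "cross-modal text-image relation (attention on interaction)" topic above more concrete, here is a generic sketch of question-guided attention over object-detector region features, built from the often-used model elements listed above (a Faster R-CNN-style detector for regions, an LSTM for the question). It is not taken from any specific paper in this list, and all dimensions, layer names and the toy inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Toy cross-modal interaction: an LSTM question encoding attends over image regions."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project detector region features
        self.att = nn.Linear(hidden_dim, 1)                # question-conditioned attention scores
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)  # fused multimodal representation

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, T) word ids; region_feats: (B, K, region_dim) from e.g. Faster R-CNN
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                          # (B, hidden_dim) question vector
        v = torch.tanh(self.proj_v(region_feats))          # (B, K, hidden_dim) region vectors
        alpha = F.softmax(self.att(v * q.unsqueeze(1)), dim=1)  # attention over the K regions
        v_att = (alpha * v).sum(dim=1)                     # attended visual summary
        return torch.tanh(self.fuse(torch.cat([q, v_att], dim=-1)))

# toy usage with random tensors standing in for real detector / tokenizer outputs
model = QuestionGuidedAttention()
tokens = torch.randint(0, 10000, (2, 8))    # 2 questions of 8 tokens each
regions = torch.randn(2, 36, 2048)          # 36 region features per image
print(model(tokens, regions).shape)         # torch.Size([2, 512])
```

Most attention-based models in the list elaborate on this template with co-attention, multi-head attention, or graph structure over the regions.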
416 | -------------------------------------------------------------------------------- /paper-notes.md: -------------------------------------------------------------------------------- 1 | # Multi-Modal-Dialgoue-System-Paperlist 2 | 3 | This is a paper list for the multimodal dialogue systems topic. 4 | 5 | **Keyword**: Multi-modal, Dialogue system, visual, conversation 6 | 7 | # Paperlist 8 | 9 | ## Dataset & Challenges 10 | 11 | ### Images 12 | 13 | [GuessWhat?!](https://openaccess.thecvf.com/content_cvpr_2017/html/de_Vries_GuessWhat_Visual_Object_CVPR_2017_paper.html) Visual Object Discovery Through Multi-Modal Dialogue in CVPR2017, a two-player guessing game (1 oracle & 1 questioner). [Code](https://github.com/GuessWhatGame/guesswhat) 14 | - **Data and task**: 15 | - **Data**: images are from **MS COCO dataset**, questioners & oracles are from **Amazon Mechanical Turk**. Players are asked to shorten their dialogues to speed up the game (and therefore maximize their gains). The goal of the game is to **locate an unknown object** in a **rich image scene** (meaning that there're several objects in an image / photo) by asking a sequence of questions, eg. After a sequence of n questions (of yes / no / NA), it becomes possible to locate the object (highlighted by a green mask). Once the questioner has gathered enough evidence to locate the object, they notify the oracle that they are ready to guess the object. We then reveal the list of objects, and if the questioner picks the right object, we consider the game successful (*recall@k ??*). 16 | - **Task**: The **oracle task** requires to produce a yes-no answer for any object within a picture given a natural language question. The **questioner task** is divided into two different sub-tasks that are trained independently: The **Guesser** must predict the correct object 17 | O_correct from the set of all objects O given an image I and a sequence of questions and answers D_J . The **Question Generator** must produce a new question q_T+1 Given an image I and a sequence of T questions and answers D_≤T . 18 | - **Problematic**: How to create models that understand natural language descriptions and ground them in the visual world. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. 19 | - **Baseline model**: 20 | - **Oracle baseline**: a classification problem (yes/no/NA). Embedding = Image (VGG16) + Question (LSTM) + Crop (VGG16) + Spatial information (bbox) + Object Category taxonomy; MLP ; cross-entropy loss. **Reflection** In general, we expect the object crop to contain additional information, such as color information, beside the object class. However, we find that the object category outperforms the object crop embedding. This might be partly due to the imperfect feature extraction from the crops. 21 | - **Guesser**: a classification problem (among a list of objects). Embedding1 = Question(LSTM/HRED) + Image(VGG16), Embedding2 = Objects (MLP(Spatial+Category)), then dot product; **Reflection** In general, we find that including VGG features does not improve the performance of the LSTM and HRED models. We hypothesize that the VGG features are a too coarse representation of the image scene, and that most of the visual information is already encoded in the question and the object features (???) 22 | - **Question Generator**: encoder-decoder structure: P(q_2|q_1,a_1,Image(vgg)), We use our questioner model to first generate a question which is then answered by the oracle model. 
We repeat this procedure 5 times to obtain a dialogue. We then use the best performing guesser model to predict the object and report its error as the metric for the QGEN model. **Reflection** using the Oracle’s answers while training the Question Generator introduces more errors than using ground-truth answers. (A minimal code sketch of the Guesser baseline is given after the Visual QA notes below.) 23 | - **Further papers & models**: 24 | - **Paper**: [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) Reinforcement Learning applied to GuessWhat?! based on **policy gradient algorithms**. **Problematic**: drawbacks of the encoder-decoder structure: 1. vast action space vs. unseen scenarios, inconsistent dialogues; 2. supervised learning in dialog systems does not account for the intrinsic planning problem, especially in task-oriented dialogues; 3. difficulty in naturally integrating external contexts (common ground); 4. difficulty in dialogue evaluation. **?quote?** In addition, successful applications of the RL framework to dialogue often rely on a predefined structure of the task, such as slot-filling tasks [Williams and Young, 2007] where the task can be casted as filling in a form. **Proposed model**: I could not follow the details of the proposed model. 25 | 26 | 27 | [ReferIt](http://tamaraberg.com/referitgame/) [paper](http://tamaraberg.com/papers/referit.pdf) in EMNLP2014 28 | - **Task**: a 2-player game; one player gives referring expressions for an object in the image and the other has to label it in the image. 29 | - **Data processing**: To avoid hand annotations, use attributes for referring expressions and a template-based parser. 30 | - **Model**: generates referring expressions (a set of attributes, not natural language); the paper is very mathematics-heavy and I did not follow the details. 31 | 32 | [Image Captioning] generating natural language descriptions of images. 33 | - **Data and task**: [MS Coco](https://arxiv.org/pdf/1405.0312.pdf) Images + captions (each image paired with several human-written caption sentences) 34 | - **Further papers & models**: 35 | - [Object relation transformer](https://papers.nips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf) in NIPS2019 **Model**: again an encoder-decoder structure, but this time a Transformer (!); given an input image, an object detection model extracts appearance and geometry features (i.e., a bounding box and the features of the object inside it), which are taken as inputs of the Transformer 36 | 37 | [Visual QA](https://visualqa.org/workshop.html) VQA datasets in CVPR2021, 2020, 2019, etc. 38 | - **Data and task**: 39 | - [VQA](https://visualqa.org/challenge) datasets [1.0](http://arxiv.org/abs/1505.00468) [2.0](https://arxiv.org/abs/1612.00837) 40 | - [TextVQA](https://textvqa.org/paper) TextVQA requires models to read and reason about text in an image to answer questions based on them. In order to perform well on this task, models need to first detect and read text in the images. Models then need to reason about this to answer the question. 41 | - [TextCap](https://arxiv.org/abs/2003.12462) TextCaps requires models to read and reason about text in images to generate captions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it and visual content in the image to generate image descriptions. 42 | - **Baseline models** 43 | - **Proposed papers & models**: 44 | 45 | 46 | 
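As referenced in the Question Generator note above, here is a minimal sketch of the GuessWhat?! Guesser baseline described in these notes: an LSTM encodes the dialogue, an MLP embeds each candidate object from its spatial features and category, and a dot product followed by a softmax scores the objects. The exact sizes (including the 8-d spatial encoding) are assumptions for illustration, not the authors' actual hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Guesser(nn.Module):
    """Sketch of the Guesser: dialogue LSTM state vs. MLP(spatial + category) object embeddings."""
    def __init__(self, vocab_size=5000, num_categories=90, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.dialogue_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.cat_emb = nn.Embedding(num_categories, emb_dim)
        self.obj_mlp = nn.Sequential(              # MLP over [spatial ; category] features
            nn.Linear(8 + emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, dialogue_tokens, obj_spatial, obj_category):
        # dialogue_tokens: (B, T); obj_spatial: (B, N, 8); obj_category: (B, N)
        _, (h, _) = self.dialogue_lstm(self.word_emb(dialogue_tokens))
        d = h[-1]                                             # (B, hidden_dim) dialogue state
        obj = self.obj_mlp(torch.cat([obj_spatial, self.cat_emb(obj_category)], dim=-1))
        scores = torch.bmm(obj, d.unsqueeze(-1)).squeeze(-1)  # dot product per object: (B, N)
        return scores                                         # softmax / cross-entropy over objects

guesser = Guesser()
scores = guesser(torch.randint(0, 5000, (2, 30)),             # dialogues of 30 tokens
                 torch.randn(2, 10, 8),                       # 10 candidate objects, 8-d boxes
                 torch.randint(0, 90, (2, 10)))               # object categories
loss = F.cross_entropy(scores, torch.tensor([3, 7]))          # target object indices
print(scores.shape, loss.item())
```

The Oracle baseline is analogous, except that the concatenated embeddings feed an MLP with a 3-way (yes / no / NA) output trained with cross-entropy.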
47 | [Visual Dialog](https://visualdialog.org/#:~:text=Visual%20Dialog%20is%20a%20novel,has%20to%20answer%20the%20question.) Open-domain dialogs: given an image, a dialog history, and a follow-up question about the image, the task is to answer the question. [[VisDial v1.0 dataset](https://visualdialog.org/data)] [[Paper](https://arxiv.org/abs/1611.08669)] [[Code diverse questions](https://github.com/vmurahari3/visdial-diversity)] [[Code rl](https://github.com/batra-mlp-lab/visdial-rl)] [[Code collect chat](https://github.com/batra-mlp-lab/visdial-amt-chat)] 🌟🌟 49 | 50 | [SIMMC](https://github.com/facebookresearch/simmc) Situated and Interactive Multimodal Conversations, track in [DSTC9](https://dstc9.dstc.community/home) and [DSTC10](https://sites.google.com/dstc.community/dstc10/tracks) by Facebook [[Paper](https://arxiv.org/abs/2006.01460)] Domains include furniture and fashion 🌟🌟🌟 51 | 52 | [Fashion IQ](https://sites.google.com/view/cvcreative2020/fashion-iq) in CVPR2020 workshop, [[paper](https://arxiv.org/pdf/1905.12794.pdf)] [[dataset & starter kit](https://github.com/XiaoxiaoGuo/fashion-iq)] 53 | 54 | ### Video 55 | 56 | [AVSD - Audio Visual Scene-Aware Dataset](https://video-dialog.com/) was used in [DSTC7](http://workshop.colips.org/dstc7/) and [DSTC8](https://sites.google.com/dstc.community/dstc8/tracks). The task is to build a system that generates response sentences in a dialog about an input VIDEO. The data collection paradigm is similar to VisDial. 57 | 58 | Related: [[TVQA](https://arxiv.org/abs/1809.01696)] [[MovieQA](http://movieqa.cs.toronto.edu/)] [[TGif-QA](https://arxiv.org/abs/1704.04497)] 59 | 60 | ### Meme 61 | 62 | [MOD Meme incorporated Open Dialogue](https://anonymous.4open.science/r/e7eaef6a-b6d5-47c6-896f-93265a0af4b1/README.md) WeChat conversations with memes / stickers in the Chinese language. 63 | 64 | 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /summary.md: -------------------------------------------------------------------------------- 1 | Multi-modal dialogue systems are dialogue systems that deal with multi-modal inputs and outputs: besides the textual modality, 2 | audio, visual or audio-visual features are taken into consideration as additional modalities. In this very simple and brief survey, we concentrate 3 | on visual dialogue systems first. This summary draws heavily on [this survey](https://arxiv.org/pdf/2010.09522.pdf). 4 | 5 | Visual dialogue can be seen as a variant subtask of Visual Question Answering (VQA), which, together with Visual Captioning, 6 | remains the most popular task across the various visual-linguistic tasks in the research community. It comprises VQA in dialogue (VQADi -- Visual Dialogue v1) and VQG in dialogue (VQGDi -- the GuessWhich task), 7 | wherein the main goal is to automate machine conversations with humans about images. 8 | 9 | The referenced survey defines the output formats in order to distinguish "generation tasks" from "classification tasks": 10 | 11 | > The output of the learnt mapping *f* could either belong to a set 12 | of possible answers in which case we refer this task format 13 | as MCQ, or could be arbitrary in nature depending on the 14 | question in which we can refer to as free-form. We regard the 15 | more generalized free-form VQA as a generation task, while 16 | MCQ VQA as a classification task where the model predicts 17 | the most suitable answer from a pool of choices. 18 | 19 | Generally, classical VQA datasets/tasks like VQA 1.0 & 2.0 use the MCQ output format, while visual dialogue tasks and VQG tasks on VQA 2.0 have a *free-form* output format. 
Specifically, the survey treats Visual Commonsense Reasoning (VCR) separately, as a popular task in parallel to the VQA tasks, which aims to develop higher-order cognition in vision systems and commonsense reasoning about the world so that they can provide justifications for their answers. 20 | 21 | I have to mention VQA because it is a group of classical and still-active datasets and tasks that are unavoidable before talking about Visual Dialogue, and many of its most popular issues, topics and methods also carry over to visual dialogue, like reasoning, deep understanding, modality bias, etc. In particular, the works on VCR show a special enthusiasm for bertology methods like ViLBERT, VisualBERT, VL-BERT, 22 | KVL-BERT, etc. (There is a paper on [The Exploration of the Reasoning Capability of BERT in Relation Extraction](https://ieeexplore.ieee.org/document/9202183), but it was only published in 2020, so why bertology suddenly became so popular in vision-and-language research remains a mystery to me...) Of course, visual dialogue tasks have bertology methods too, like VD-BERT. 23 | 24 | | Article | Dataset | Visual Encoder | Language Model | Encoder | Decoder | 25 | | ------------- |:-------------:|:---------------:|:------------:|:----------:|:---------:| 26 | | [Visual dialog](https://arxiv.org/pdf/1611.08669.pdf) | VisDial v1.0 | VGG16 | 2 diff LSTMs; dialog-RNN + Attention + LSTM; LSTM | Late fusion; HRE; MN | LSTM, Softmax | 27 | | [IQA Synergistic](https://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_Image-Question-Answer_Synergistic_Network_for_Visual_Dialog_CVPR_2019_paper.pdf) | VisDial v1.0 | Faster-RCNN, CNN | 2 diff LSTMs | MFB; (discriminative model: in the primary stage, answers are also encoded by an LSTM) | softmax; (generative model: answer decoded by LSTM) | 28 | | [LTMI](https://arxiv.org/pdf/1911.11390.pdf) | VisDial v1.0 | | | | | 29 | | [LF] | | | | fusion by concat | | 30 | --------------------------------------------------------------------------------
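To illustrate the "fusion by concat" entry in the table above, here is a rough sketch of a Late Fusion (LF) style encoder: the image feature, question encoding and dialogue-history encoding are simply concatenated and projected into a joint embedding. The 4096-d VGG16 feature and 512-d LSTMs are assumed sizes, and the decoder on top (generative LSTM + softmax, or discriminative answer ranking, as in the first table row) is omitted.

```python
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    """Sketch of a late-fusion visual dialogue encoder: concatenate image, question and history."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.q_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)  # question encoder
        self.h_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)  # dialogue-history encoder
        self.fusion = nn.Linear(img_dim + 2 * hidden_dim, hidden_dim)  # "fusion by concat"

    def forward(self, img_feat, question, history):
        # img_feat: (B, img_dim) precomputed image feature; question/history: (B, T) token ids
        _, (hq, _) = self.q_lstm(self.embed(question))
        _, (hh, _) = self.h_lstm(self.embed(history))
        joint = torch.cat([img_feat, hq[-1], hh[-1]], dim=-1)
        return torch.tanh(self.fusion(joint))      # joint dialogue-state embedding

enc = LateFusionEncoder()
state = enc(torch.randn(2, 4096),                  # stand-in for VGG16 fc features
            torch.randint(0, 10000, (2, 12)),      # current question tokens
            torch.randint(0, 10000, (2, 60)))      # flattened dialogue-history tokens
print(state.shape)                                 # torch.Size([2, 512])
```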