├── README.md ├── paper-notes.md └── summary.md /README.md: -------------------------------------------------------------------------------- 1 | # Multi-Modal-Dialgoue-System-Paperlist 2 | 3 | This is a paper list for the multimodal dialogue systems topic. 4 | 5 | **Keyword**: Multi-modal, Dialogue system, visual, conversation 6 | 7 | # Paperlist 8 | 9 | ## Dataset & Challenges 10 | 11 | ### Images 12 | 13 | (1) [**Visual QA**](https://visualqa.org/workshop.html) VQA datasets in CVPR2021,2020,2019,..., containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. 14 | - [VQA](https://visualqa.org/challenge) datasets [1.0](http://arxiv.org/abs/1505.00468) [2.0](https://arxiv.org/abs/1612.00837) 15 | - [TextVQA](https://textvqa.org/paper) TextVQA requires models to read and reason about text in an image to answer questions based on them. In order to perform well on this task, models need to first detect and read text in the images. Models then need to reason about this to answer the question. 16 | - [TextCap](https://arxiv.org/abs/2003.12462) TextCaps requires models to read and reason about text in images to generate captions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it and visual content in the image to generate image descriptions. 17 | - Issues: 18 | - visual-explainable: the model should rely on the right visual regions when making decisions 19 | - question-sensitive: the model should be sensitive to linguistic variations in the question 20 | - reduce language biases: the model should not take the language shortcut and answer the question without looking at the image 21 | - Further Papers (too many) 22 | - cross-modal interaction / fusion 24 | - [Bottom-up and top-down attention for image captioning and visual question answering](https://arxiv.org/abs/1707.07998) in CVPR2018, winner of the 2017 Visual Question Answering challenge 25 | - [Multimodal Neural Graph Memory Networks for Visual Question Answering](https://www.aclweb.org/anthology/2020.acl-main.643.pdf) ACL2020, visual features + encoded region-grounded captions (of object attributes and their relationships) = two graph nets which compute a question-guided contextualized representation for each; the updated representations are then written to an external spatial memory (the paper does not make the role of this memory very clear to me). 
26 | - [Cross-Modality Relevance for Reasoning on Language and Vision](https://www.aclweb.org/anthology/2020.acl-main.683.pdf) in ACL2020 27 | - [Hypergraph Attention Networks for Multimodal Learning](https://bi.snu.ac.kr/~btzhang/selected_papers/CVPR2020_ESKimKOHZ.pdf) CVPR2020 28 | - [Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?](https://www.aclweb.org/anthology/D16-1092/) EMNLP2016 29 | - [Multi-level Attention Networks for Visual Question Answering](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/Multi-level-Attention-Networks-for-Visual-Question-Answering.pdf) CVPR2017 30 | - [Hierarchical Question-Image Co-Attention for Visual Question Answering](https://arxiv.org/abs/1606.00061) CVPR2016 31 | - vision-language pretraining / representation learning 32 | - [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557.pdf) arXiv2019, grounds elements of language to image regions with self-attention 33 | - [ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/abs/1908.02265) NeurIPS2019 34 | - [VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/abs/1908.08530) [[Code](https://github.com/jackroos/VL-BERT)] ICLR2020 35 | - [VinVL: Making Visual Representations Matter in Vision-Language Models](https://arxiv.org/abs/2101.00529) [[Code](https://github.com/microsoft/Oscar)] CVPR2021 36 | - [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) [[Code](https://github.com/dandelin/vilt)] ICML 2021 37 | - [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315) [[Code](https://github.com/facebookresearch/vilbert-multi-task)] CVPR2020 38 | - [Unified Vision-Language Pre-Training for Image Captioning and VQA](https://arxiv.org/abs/1909.11059) [[Code](https://github.com/LuoweiZhou/VLP)] AAAI2020 39 | - [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://www.aclweb.org/anthology/D19-1514.pdf) [[Code](https://github.com/airsplay/lxmert)] EMNLP2019 40 | - [Adaptive Transformers for Learning Multimodal Representations](https://www.aclweb.org/anthology/2020.acl-srw.1.pdf) [[Code](https://github.com/prajjwal1/adaptive_transformer)] SRW ACL2020 41 | - [Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer](https://www.aclweb.org/anthology/2020.acl-main.306.pdf) [[Data Code](https://github.com/jefferyYu/UMT)] ACL2020 42 | - Language prior issue 43 | - [AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss](https://arxiv.org/pdf/2105.01993.pdf) approaches the problem from a feature-space learning perspective (rather than as a classification task) 44 | - [Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering](https://arxiv.org/abs/1612.00837) CVPR2017; VQA 2.0 was also built to balance language priors against the images 45 | - [Self-Critical Reasoning for Robust Visual Question Answering](https://arxiv.org/pdf/1905.09998.pdf) NeurIPS2019 46 | - [Overcoming Language Priors in Visual Question Answering with Adversarial Regularization](https://arxiv.org/pdf/1810.03649.pdf) NeurIPS2018, question-only model 47 | - [RUBi: Reducing Unimodal Biases in Visual Question Answering](https://arxiv.org/pdf/1906.10169.pdf) NeurIPS2019, also a question-only model 48 | - [Don't Just Assume; Look and Answer: 
Overcoming Priors for Visual Question Answering](https://arxiv.org/abs/1712.00377) [[Code](https://github.com/AishwaryaAgrawal/GVQA)] CVPR2018 49 | - [Counterfactual VQA: A Cause-Effect Look at Language Bias](https://arxiv.org/abs/2006.04315) [[Code](https://github.com/yuleiniu/cfvqa)] CVPR2021 50 | - [Counterfactual Vision and Language Learning](https://openaccess.thecvf.com/content_CVPR_2020/papers/Abbasnejad_Counterfactual_Vision_and_Language_Learning_CVPR_2020_paper.pdf) CVPR2020 51 | - Visual-explainable issue 52 | - [Counterfactual Samples Synthesizing for Robust Visual Question Answering](http://arxiv.org/pdf/2003.06576) CVPR2020 53 | - [Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering](https://www.aclweb.org/anthology/2020.emnlp-main.265.pdf) EMNLP2020 54 | - [Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision](https://arxiv.org/pdf/2004.09034.pdf) ECCV2020, leveraging an overlooked supervisory signal found in existing datasets to improve generalization capabilities 55 | - [Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention](https://arxiv.org/abs/1902.05715) arXiv2019 56 | - [Towards Transparent AI Systems: Interpreting Visual Question Answering Models](https://arxiv.org/abs/1608.08974) 2016 57 | - object relation reasoning / visual understanding / cross-modal / Graphs 58 | - [MUREL: Multimodal Relational Reasoning for Visual Question Answering](http://arxiv.org/pdf/1902.09487) CVPR2019, [[Code](github.com/Cadene/murel.bootstrap.pytorch)], represents and refines interactions between question words and image regions, finer-grained than attention maps 59 | - [CRA-Net: Composed Relation Attention Network for Visual Question Answering](https://dl.acm.org/doi/10.1145/3343031.3350925) ACM2019, object relation reasoning: attention should look at both visual (features, spatial) and linguistic (in-question) features (the full text seems to be paywalled?) 
60 | - [Hierarchical Graph Attention Network for Visual Relationship Detection](https://openaccess.thecvf.com/content_CVPR_2020/papers/Mi_Hierarchical_Graph_Attention_Network_for_Visual_Relationship_Detection_CVPR_2020_paper.pdf) CVPR2020, object-level graph: (1) woman (sit on) bench, (2) woman (in front of) water; triplet-level graph: relation between triplet (1) and triplet (2) 61 | - [Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations](https://ieeexplore.ieee.org/ielx7/6287639/6514899/09387302.pdf) IEEE2021, relational visual-linguistic BERT 62 | - [Relation-Aware Graph Attention Network for Visual Question Answering](http://arxiv.org/pdf/1903.12314) ICCV2019, explicit relations of geometric positions and semantic interactions between objects, implicit relations of hidden dynamics between image regions 63 | - [Fusion of Detected Objects in Text for Visual Question Answering](https://arxiv.org/abs/1908.05054) EMNLP2019 64 | - [GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering](https://arxiv.org/abs/2104.10283) arXiv2021 65 | - [A Simple Baseline for Visual Commonsense Reasoning](https://vigilworkshop.github.io/static/papers/34.pdf) ViGil@NeurIPS2019 66 | - [Learning Conditioned Graph Structures for Interpretable Visual Question Answering](https://arxiv.org/abs/1806.07243) [[Code](https://github.com/aimbrain/vqa-project)] NeurIPS2018 67 | - [Graph-Structured Representations for Visual Question Answering](https://arxiv.org/abs/1609.05600) CVPR2017 68 | - [R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering](https://arxiv.org/abs/1805.09701) [[Code](https://github.com/lupantech/rvqa)] ACM KDD2018 69 | - Knowledge / cross-modal fusion / Graphs 70 | - [Towards Knowledge-Augmented Visual Question Answering](https://www.aclweb.org/anthology/2020.coling-main.169.pdf) Coling2020, captures the interactions between objects in a visual scene and entities in an external knowledge source, with many, many graphs ... 71 | - [ConceptBert: Concept-Aware Representation for Visual Question Answering](https://www.aclweb.org/anthology/2020.findings-emnlp.44.pdf) EMNLP2020, learns a joint Concept-Vision-Language embedding (maybe similar to [[this paper](https://openreview.net/references/pdf?id=Uhl6chXANP)] in the way it adds an "entity embedding"?) 72 | - [Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks](https://arxiv.org/abs/1712.00733) 2017 73 | - text in the image (TextCap & TextVQA) 74 | - [Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text](http://arxiv.org/pdf/2003.13962) [[Code](https://github.com/ricolike/mmgnn_textvqa)] CVPR2020, the printed text on a bottle is the brand of the drink ==> the graph representation of the image should have sub-graphs and respective aggregators to pass messages among graphs (I am not sure I am describing this correctly) 
75 | - [Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval](https://arxiv.org/pdf/2009.09809.pdf) arXiv2020, common semantic space between salient objects and text found in an image 76 | - [Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps](https://arxiv.org/pdf/2012.05153.pdf) arXiv2020, a simple attention mechanism is good enough 77 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://tanmingkui.github.io/files/publications/Cascade.pdf) ACM2020, 1) which information is useful, 2) the question relates to text but also to visual concepts, so how to capture cross-modal relationships, 3) what if OCR fails 78 | - [TAP: Text-Aware Pre-training for Text-VQA and Text-Caption](https://arxiv.org/pdf/2012.04638.pdf) arXiv2020, incorporates OCR-generated text in pre-training 79 | - [Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/abs/1911.06258) arXiv2020 80 | - multi-task 81 | - [Answer Them All! Toward Universal Visual Question Answering Models](https://arxiv.org/abs/1903.00366) CVPR2019 82 | - [A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks](https://www.aclweb.org/anthology/2020.acl-main.440.pdf) ACL2020 83 | - [Visual Question Answering as a Multi-Task Problem](https://arxiv.org/pdf/2007.01780.pdf) arXiv2020 84 | 85 | 86 | (2) [**Visual Dialog**](https://visualdialog.org/) CVPR 2017, open-domain dialogs: given an image, a dialog history, and a follow-up question about the image, the task is to answer the question. 87 | - [VisDial v1.0 dataset](https://visualdialog.org/data) [[Paper](https://arxiv.org/abs/1611.08669)] [[Source Code to collect chat data](https://github.com/batra-mlp-lab/visdial-amt-chat)] 88 | - Further papers 89 | - reasoning 90 | - [KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue](http://arxiv.org/pdf/2008.04858) ACM2020, here knowledge = text knowledge & vision knowledge; encoding (T2V graph & V2T graph), then bridging (update graph nodes), then storing, then retrieving (via an adaptive information selection mode) 91 | - [Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog](https://www.aclweb.org/anthology/P19-1648.pdf) ACL2019, iteratively refines the question's representation based on the image and dialog history 92 | - [Recursive visual attention in visual dialog](https://arxiv.org/abs/1812.02664) CVPR2019 [[Code](https://github.com/yuleiniu/rva)] 93 | - [DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog](https://aaai.org/ojs/index.php/AAAI/article/view/6248) AAAI2020 94 | - [Visual Reasoning with Multi-hop Feature Modulation](https://arxiv.org/abs/1808.04446) [[Code](https://github.com/ethanjperez/film)] ECCV2018 95 | - [VisualCOMET: Reasoning About the Dynamic Context of a Still Image](https://arxiv.org/abs/2004.10796) [[Code](https://github.com/jamespark3922/visual-comet)] ECCV2020 96 | - understanding 97 | - [DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue](https://arxiv.org/pdf/1911.07251.pdf) [[Code](https://github.com/JXZe/DualVD)] AAAI2020 98 | - [Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9247486) [[Code](https://github.com/JXZe/Learning_DualVD)] IEEE2021 99 | - [Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision](https://www.aclweb.org/anthology/2020.emnlp-main.162) 
[[Code](https://github.com/airsplay/vokenization)] EMNLP2020 100 | - coreference 101 | - [Modeling Coreference Relations in Visual Dialog](https://www.aclweb.org/anthology/2021.eacl-main.290) [[Code](https://github.com/facebookresearch/corefnmn)] EACL2021 102 | - [What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues](https://www.aclweb.org/anthology/D19-1516) [[Data Code](https://github.com/HKUST-KnowComp/Visual_PCR)] EMNLP2019 103 | - reference 104 | - [Dual Attention Networks for Visual Reference Resolution in Visual Dialog](https://www.aclweb.org/anthology/D19-1209.pdf) [[Code](https://github.com/gicheonkang/DAN-VisDial)] EMNLP2019 105 | - [Visual Reference Resolution using Attention Memory for Visual Dialog](http://papers.neurips.cc/paper/6962-visual-reference-resolution-using-attention-memory-for-visual-dialog.pdf) NIPS2017 106 | - [Referring Expression Generation via Visual Dialogue](https://github.com/llxuan/ReferWhat) NLPCC2020 107 | - cross-modal / fusion / joint / dual ... 108 | - [Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog](https://arxiv.org/abs/1911.11390) ECCV2019 109 | - [Image-Question-Answer Synergistic Network for Visual Dialog](https://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_Image-Question-Answer_Synergistic_Network_for_Visual_Dialog_CVPR_2019_paper.pdf) CVPR2019 110 | - [DialGraph: Sparse Graph Learning Networks for Visual Dialog](https://arxiv.org/abs/2004.06698) arXiv 111 | - [All-in-One Image-Grounded Conversational Agents](https://arxiv.org/abs/1912.12394) arXiv2019 112 | - [Visual-Textual Alignment for Graph Inference in Visual Dialog](https://www.aclweb.org/anthology/2020.coling-main.170.pdf) Coling2020 113 | - [Connecting Language and Vision to Actions](https://www.aclweb.org/anthology/P18-5004) ACL2018 114 | - [Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries](https://arxiv.org/abs/1711.06370) [[Code](https://github.com/bohanzhuang/Parallel-Attention-A-Unified-Framework-for-Visual-Object-Discovery-through-Dialogs-and-Queries)] CVPR2018 115 | - [Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems](https://dl.acm.org/doi/fullHtml/10.1145/3308558.3313598) WWW2019 116 | - [Reactive Multi-Stage Feature Fusion for Multimodal Dialogue Modeling](https://arxiv.org/abs/1908.05067) 2019 117 | - [Two Causal Principles for Improving Visual Dialog](https://openaccess.thecvf.com/content_CVPR_2020/papers/Qi_Two_Causal_Principles_for_Improving_Visual_Dialog_CVPR_2020_paper.pdf) [[Code](https://github.com/simpleshinobu/visdial-principles)] CVPR2020 118 | - [Learning Cross-modal Context Graph for Visual Grounding](https://arxiv.org/abs/1911.09042) [[Code](https://github.com/youngfly11/LCMCG-PyTorch)] AAAI2020 119 | - [Multi-View Attention Networks for Visual Dialog](https://arxiv.org/abs/2004.14025) [[Code](https://github.com/taesunwhang/MVAN-VisDial)] arXiv2020 120 | - [Efficient Attention Mechanism for Visual Dialog that Can Handle All the Interactions Between Multiple Inputs](https://arxiv.org/abs/1911.11390) [[Code](https://github.com/davidnvq/visdial)] ECCV2020 121 | - [Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features](https://arxiv.org/abs/1808.00171) [[Code in TF](https://github.com/yangxuntu/vrd)] ECCV2018 122 | - use dialog history / user guided 123 | - [Making History Matter: History-Advantage Sequence Training for Visual 
Dialog](https://arxiv.org/abs/1902.09326) ICCV2019 124 | - [User Attention-guided Multimodal Dialog Systems](https://xuemengsong.github.io/sigir2019_umd.pdf) [[Code](https://github.com/ChenTsuei/UMD)] SIGIR2019 125 | - [History for Visual Dialog: Do we really need it?](https://www.aclweb.org/anthology/2020.acl-main.728) ACL2020 126 | - [Integrating Historical States and Co-attention Mechanism for Visual Dialog](https://ieeexplore.ieee.org/document/9412629/) ICPR2021 127 | - knowledge 128 | - [The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents](https://www.aclweb.org/anthology/2020.acl-main.222) ACL2020 129 | - [Knowledge-aware Multimodal Dialogue Systems](http://staff.ustc.edu.cn/~hexn/papers/mm18-multimodal-dialog.pdf) ACM2018 130 | - [A Knowledge-Grounded Multimodal Search-Based Conversational Agent](https://www.aclweb.org/anthology/W18-5709) [[Code](https://github.com/shubhamagarwal92/mmd), at last a released code for a "knowledge" / "graph" approach] SCAI@EMNLP2018 131 | - modality bias 132 | - [Modality-Balanced Models for Visual Dialogue](https://www.aaai.org/Papers/AAAI/2020GB/AAAI-KimH.8168.pdf) AAAI2020 133 | - [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) [[Code](https://github.com/facebookresearch/deit)] arXiv2020 134 | - [Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning](https://www.aclweb.org/anthology/2020.emnlp-main.444.pdf) EMNLP2020 135 | - [Visual Dialogue without Vision or Dialogue](https://arxiv.org/abs/1812.06417) 2018 136 | - [Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision](https://www.aclweb.org/anthology/2020.findings-emnlp.248) EMNLP Findings 2020 137 | - [Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models](https://arxiv.org/abs/2104.08666) 2021 138 | - pretraining / representation learning / bertology 139 | - [VD-BERT: A Unified Vision and Dialog Transformer with BERT](https://www.aclweb.org/anthology/2020.emnlp-main.269) [[Code](https://github.com/salesforce/VD-BERT)] EMNLP2020 140 | - [Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123630324.pdf) [[Code](https://github.com/vmurahari3/visdial-bert)] ECCV2020 141 | - [Kaleido-BERT: Vision-Language Pre-training on Fashion Domain](https://arxiv.org/abs/2103.16110) [[Code](https://github.com/mczhuge/Kaleido-BERT)] arXiv2021 142 | - [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750120.pdf) [[Code](https://github.com/microsoft/Oscar)] ECCV2020 143 | - [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315) [[Code](https://github.com/facebookresearch/vilbert-multi-task)] CVPR2020 144 | - [Large-Scale Adversarial Training for Vision-and-Language Representation Learning](https://arxiv.org/abs/2006.06195) [[Code](https://github.com/zhegan27/VILLA)] NeurIPS 2020 145 | - [Integrating Multimodal Information in Large Pretrained Transformers](https://www.aclweb.org/anthology/2020.acl-main.214.pdf) [[Code](https://github.com/WasifurRahman/BERT_multimodal_transformer)] ACL2020 146 | - Generative dialogue / diverse 147 | - [Improving Generative Visual Dialog by Answering Diverse Questions](https://arxiv.org/pdf/1909.10470.pdf) EMNLP 2019, [[Code](https://github.com/vmurahari3/visdial-diversity)] 148 | - [Visual 
Dialogue State Tracking for Question Generation](https://arxiv.org/abs/1911.07928) [[Code is in the series of guesswhat guesswhich visdial](https://github.com/xubuvd/guesswhat)] AAAI2020 149 | - [MultiDM-GCN: Aspect-Guided Response Generation in Multi-Domain Multi-Modal Dialogue System using Graph Convolution Network](https://www.aclweb.org/anthology/2020.findings-emnlp.210) EMNLP2020 150 | - [Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model](https://arxiv.org/abs/1706.01554) [[Code](https://github.com/jiasenlu/visDial.pytorch)] NIPS2017 151 | - [FLIPDIAL: A Generative Model for Two-Way Visual Dialogue](https://arxiv.org/abs/1802.03803) CVPR2018 152 | - [DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue](https://www.ijcai.org/Proceedings/2020/96) IJCAI2020 [[Code soon](https://github.com/JXZe/DAM)] 153 | - [More to diverse: Generating diversified responses in a task oriented multimodal dialog system](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0241271) 2020 154 | - [Multimodal Dialog System: Generating Responses via Adaptive Decoders](https://liqiangnie.github.io/paper/fp349-nieAemb.pdf) ACM2019 155 | - [Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation](https://www.aclweb.org/anthology/I17-1047) IJCNLP2017 156 | - [Multimodal Differential Network for Visual Question Generation](https://www.aclweb.org/anthology/D18-1434.pdf) [[Code](https://github.com/badripatro/MDN-VQG)] EMNLP2018 157 | - [Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation](https://www.ijcai.org/Proceedings/2019/0144.pdf) 2019 158 | - [Aspect-Aware Response Generation for Multimodal Dialogue System](https://www.aclweb.org/anthology/2020.findings-emnlp.210.pdf) ACM 2021 159 | - [An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games](https://arxiv.org/abs/2102.00424) EACL2021 160 | - Adversarial training 161 | - [The World in My Mind: Visual Dialog with Adversarial Multi-modal Feature Encoding](https://www.aclweb.org/anthology/N19-1266) NAACL2019 162 | - [Mind Your Language: Learning Visually Grounded Dialog in a Multi-Agent Setting](https://agakshat.github.io/assets/documents/ALA18_CameraReady.pdf) 2018 163 | - [GADE: A Generative Adversarial Approach to Density Estimation and its Applications](https://openaccess.thecvf.com/content_CVPR_2019/papers/Abbasnejad_A_Generative_Adversarial_Density_Estimator_CVPR_2019_paper.pdf) IJCV2020 164 | - RL 165 | - [Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning](https://arxiv.org/abs/1703.06585) ICCV2017 oral, [[Code](https://github.com/batra-mlp-lab/visdial-rl)] 166 | - [Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog](https://arxiv.org/abs/1805.03257) SIGDIAL 2018 167 | - [Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space](https://arxiv.org/abs/1901.09854) ICMR2019 168 | - [Recurrent Attention Network with Reinforced Generator for Visual Dialog](https://dl.acm.org/doi/10.1145/3390891) ACM 2020 169 | - linguistic / probabilistic 170 | - [A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions](https://www.aclweb.org/anthology/2020.findings-emnlp.67) EMNLP finds 2020 171 | - [Probabilistic framework for solving Visual 
Dialog](https://arxiv.org/abs/1909.04800) PR 2020 172 | - [Learning Goal-Oriented Visual Dialog Agents: Imitating and Surpassing Analytic Experts](https://arxiv.org/abs/1907.10500) IEEE2019 173 | 174 | 175 | 176 | 177 | (3) [CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog](https://www.aclweb.org/anthology/N19-1058) NAACL2019, [[code](https://github.com/satwikkottur/clevr-dialog)] 178 | - Further paper 179 | - [VQA With No Questions-Answers Training](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9157617) CVPR2020 180 | - [Domain-robust VQA with diverse datasets and methods but no target labels](https://arxiv.org/pdf/2103.15974.pdf) arXiv2021 181 | - [Scene Graph based Image Retrieval - A case study on the CLEVR Dataset](https://arxiv.org/abs/1911.00850) 2019 182 | 183 | (4) Open-domain: 184 | - [OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts](https://arxiv.org/abs/2012.15015) 185 | - [The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue](https://www.aclweb.org/anthology/P19-1184) ACL2019 186 | - [A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking](https://www.aclweb.org/anthology/2020.lrec-1.518) LREC2020 187 | - Papers 188 | - [Multi-Modal Open-Domain Dialogue](https://arxiv.org/abs/2010.01082) 2020 189 | - [Open Domain Dialogue Generation with Latent Images](https://arxiv.org/abs/2004.01981) 2020 190 | - [Image-Chat: Engaging Grounded Conversations] ACL2020 191 | - [The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents](https://www.aclweb.org/anthology/2020.acl-main.222.pdf) ACL2020 192 | 193 | (?) sentiment 194 | - [MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations](https://www.aclweb.org/anthology/2020.coling-main.393) Coling2020 195 | - [Bridging Dialogue Generation and Facial Expression Synthesis](https://arxiv.org/abs/1905.11240) 2019 196 | 197 | (5) Task/Goal-oriented: 198 | - [CRWIZ: A Framework for Crowdsourcing Real-Time Wizard-of-Oz Dialogues](https://www.aclweb.org/anthology/2020.lrec-1.36/) LREC2020 199 | - [A Corpus for Reasoning About Natural Language Grounded in Photographs](https://www.aclweb.org/anthology/P19-1644) ACL2019 200 | - [CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication](https://www.aclweb.org/anthology/P19-1651) [[Data](https://github.com/facebookresearch/CoDraw)] ACL2019 201 | - [AirDialogue: An Environment for Goal-Oriented Dialogue Research](https://www.aclweb.org/anthology/D18-1419/) EMNLP2018 202 | - [ReferIt](http://tamaraberg.com/referitgame/) [[paper](http://tamaraberg.com/papers/referit.pdf)] in EMNLP2014, 2-players game of refer & label 203 | - Papers 204 | - [Answerer in Questioner's Mind for Goal-Oriented Visual Dialogue](http://papers.neurips.cc/paper/7524-answerer-in-questioners-mind-information-theoretic-approach-to-goal-oriented-visual-dialog.pdf) [[Code](https://github.com/naver/aqm-plus)] NeurIPS 2018 205 | - [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) IJCAI2017 206 | - [Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient and Code](https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient) IEEE2018 207 | - [Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue](https://arxiv.org/abs/2010.00361) [[Code](https://github.com/zipengxuc/ADVSE-GuessWhat)] ACM MM 2020 208 | - 
[Building Task-Oriented Visual Dialog Systems Through Alternative Optimization Between Dialog Policy and Language Generation](https://www.aclweb.org/anthology/D19-1014) EMNLP2019 209 | - [Storyboarding of Recipes: Grounded Contextual Generation](https://www.aclweb.org/anthology/P19-1606) [[Script Data](https://github.com/khyathiraghavi/storyboarding_data)] DGS@ICLR2019 210 | - [Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Language Reasoning](https://arxiv.org/abs/1812.06398) CVPR2020 211 | 212 | (6) evaluation 213 | - [A Revised Generative Evaluation of Visual Dialogue](https://arxiv.org/abs/2004.09272) [[Code](https://github.com/danielamassiceti/geneval_visdial)] arXiv2020 214 | - [Evaluating Visual Conversational Agents via Cooperative Human-AI Games](https://arxiv.org/abs/1708.05122) [[Code for GuessWhich](https://github.com/GT-Vision-Lab/GuessWhich)] 2017 215 | - [The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues](https://arxiv.org/abs/2103.11151) EACL2021 216 | 217 | 218 | (7) classification 219 | - [**GuessWhat?!**](https://openaccess.thecvf.com/content_cvpr_2017/html/de_Vries_GuessWhat_Visual_Object_CVPR_2017_paper.html) Visual Object Discovery Through Multi-Modal Dialogue in CVPR2017, a two-player guessing game (1 oracle & 1 questioner). 220 | - [[Code]](https://github.com/GuessWhatGame/guesswhat) 221 | - Further papers 222 | - [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) Reinforcement Learning applied to GuessWhat?! 223 | - [Guessing State Tracking for Visual Dialogue](https://github.com/GT-Vision-Lab/GuessWhich) ECCV2020 224 | - [Language-Conditioned Feature Pyramids for Visual Selection Tasks] EMNLP2020 [[Code](https://github.com/Alab-NII/lcfp)] 225 | - [Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat](https://arxiv.org/abs/1809.03408) NAACL2019 226 | - [Interactive Classification by Asking Informative Questions](https://www.aclweb.org/anthology/2020.acl-main.237) [[Code](https://github.com/asappresearch/interactive-classification)] ACL2020 227 | 228 | 229 | (?) 
Others 230 | - [Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts](https://www.aclweb.org/anthology/2020.acl-main.425.pdf) ACL2020 231 | - [How2: A Large-scale Dataset for Multimodal Language Understanding](https://arxiv.org/abs/1811.00347) [[Data](https://srvk.github.io/how2-dataset/)] NIPS2018 232 | 233 | 234 | (8) [Image caption] generating natural language descriptions of an image 235 | - [MS COCO dataset 2014](https://arxiv.org/pdf/1405.0312.pdf) Images + captions (each image is paired with several human-written caption sentences) 236 | - Further papers 237 | - Featurize images as a whole / as regions (early approaches): 238 | - [Deep visual-semantic alignments for generating image descriptions](https://cs.stanford.edu/people/karpathy/deepimagesent/devisagen.pdf) CVPR2015 239 | - [Densecap: Fully convolutional localization networks for dense captioning](https://cs.stanford.edu/people/karpathy/densecap/) CVPR2016 240 | - Attention based approaches: 241 | - [Bottom-up and top-down attention for image captioning and visual question answering](https://arxiv.org/abs/1707.07998) in CVPR2018, winner of the 2017 Visual Question Answering challenge 242 | - [Show, attend and tell: Neural image caption generation with visual attention](https://arxiv.org/abs/1502.03044) in ICML2015 243 | - [Review networks for caption generation](https://arxiv.org/abs/1605.07912) NIPS2016 244 | - [Image captioning with semantic attention](https://arxiv.org/abs/1603.03925) CVPR2016 245 | - Graph structured approaches: 246 | - [Exploring Visual Relationship for Image Captioning](https://arxiv.org/abs/1809.07041) ECCV2018 247 | - [Auto-encoding scene graphs for image captioning](https://arxiv.org/abs/1812.02378) CVPR2019 248 | - Reinforcement learning: 249 | - [Context-aware visual policy network for sequence-level image captioning](https://arxiv.org/abs/1808.05864) ACM2018 250 | - [Self-critical sequence training for image captioning](https://arxiv.org/abs/1612.00563) 251 | - Transformer based: 252 | - [Image captioning: transform objects into words](https://papers.nips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf) in NIPS2019, using Transformers focusing on objects and their spatial relationships 253 | - [Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning](https://www.aclweb.org/anthology/P18-1238.pdf) in ACL2018, also a dataset 254 | - [Improving Image Captioning with Better Use of Caption](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) ACL2020 255 | - [Improving Image Captioning Evaluation by Considering Inter References Variance](https://www.aclweb.org/anthology/2020.acl-main.93.pdf) ACL2020 256 | 257 | 258 | (9) Navigation task 259 | - [Talk the walk: Navigating new york city through grounded dialogue](https://arxiv.org/pdf/1807.03367.pdf) 260 | - [A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses] EMNLP2020 261 | - navigating 262 | - [Improving Vision-and-Language Navigation with Image-Text Pairs from the Web] ECCV2020 263 | - [Diagnosing Vision-and-Language Navigation: What Really Matters] arXiv2021 264 | - [Vision-Dialog Navigation by Exploring Cross-Modal Memory] CVPR2020 265 | - [Vision-and-Dialog Navigation] CoRL 2019 266 | - [Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments] CVPR2018 267 | - [Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters] 2020 268 | - [Stay on the Path: Instruction Fidelity in 
Vision-and-Language Navigation] ACL2019 269 | - [Active Visual Information Gathering for Vision-Language Navigation] ECCV2020 270 | - [Environment-agnostic Multitask Learning for Natural Language Grounded Navigation] ECCV2020 271 | - [Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation] 2019 272 | - [Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation] CVPR2019 273 | - [Engaging Image Chat: Modeling Personality in Grounded Dialogue] 2018 274 | - [TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments] CVPR2019 275 | - [Multi-modal Discriminative Model for Vision-and-Language Navigation] 2019 276 | - [REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments] CVPR 2020 277 | - [Learning To Follow Directions in Street View] AAAI2020 278 | - [Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning] ViGil@NeurIPS2019 279 | - representation learning 280 | - [A Recurrent Vision-and-Language BERT for Navigation](https://arxiv.org/abs/2011.13922) [[Code](https://github.com/YicongHong/Recurrent-VLN-BERT)] CVPR2021 281 | - [Transferable Representation Learning in Vision-and-Language Navigation](https://arxiv.org/abs/1908.03409) ICCV2019 282 | - Grounding 283 | - [Words Aren't Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions](https://www.aclweb.org/anthology/2020.acl-main.586.pdf) ACL2020 284 | - [Grounding Conversations with Improvised Dialogues](https://www.aclweb.org/anthology/2020.acl-main.218.pdf) ACL2020 285 | - [A negative case analysis of visual grounding methods for VQA](https://www.aclweb.org/anthology/2020.acl-main.727.pdf) ACL2020 286 | - [Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms](https://www.aclweb.org/anthology/2020.acl-main.584.pdf) ACL2020 287 | - [Where Are You? 
Localization from Embodied Dialog](https://www.aclweb.org/anthology/2020.emnlp-main.59.pdf) [[Code](https://github.com/batra-mlp-lab/WAY)] EMNLP2020 288 | - [Visual Referring Expression Recognition: What Do Systems Actually Learn?](https://www.aclweb.org/anthology/N18-2123) NAACL2018 289 | - [Ask No More: Deciding when to guess in referential visual dialogue](https://www.aclweb.org/anthology/C18-1104) coling2018 290 | - [Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts](https://www.aclweb.org/anthology/2020.emnlp-main.353) EMNLP2020 291 | - [Achieving Common Ground in Multi-modal Dialogue](https://www.aclweb.org/anthology/2020.acl-tutorials.3) ACL2020 292 | 293 | (10) retrieval task 294 | - image retrieval / visual retrieval 295 | - [Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval](https://openaccess.thecvf.com/content_CVPRW_2020/papers/w56/Parcalabescu_Exploring_Phrase_Grounding_Without_Training_Contextualisation_and_Extension_to_Text-Based_CVPRW_2020_paper.pdf) CVPRW2020 296 | - [Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment](https://www.aclweb.org/anthology/2020.alvr-1.2) [[Code](https://github.com/videoturingtest/alvr-ESA), at last a released code for a graph-based approach] ALVR2020 297 | - [Dialog-based Interactive Image Retrieval](https://arxiv.org/abs/1805.00145) [[Code Fashion retrieval](https://github.com/XiaoxiaoGuo/fashion-retrieval)] NeurIPS2018 298 | - [I Want This Product but Different: Multimodal Retrieval with Synthetic Query Expansion](https://arxiv.org/abs/2102.08871) 2021 299 | - [Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers](https://arxiv.org/abs/2103.16553) 2021 300 | 301 | 302 | (11) image editing / text-to-image 303 | - [Sequential Attention GAN for Interactive Image Editing] ACM2020 304 | - [Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction] ICCV2019 305 | - [ChatPainter: Improving Text to Image Generation using Dialogue] ICLR2018 306 | - [Adversarial Text-to-Image Synthesis: A Review] 2021 307 | - [A Multimodal Dialogue System for Conversational Image Editing] 2020 308 | 309 | (12) Fashion 🌟🌟🌟 ----F-a-s-h-i-o-n---- 310 | - [**SIMMC**](https://github.com/facebookresearch/simmc) - Domains include furniture and fashion 🌟🌟🌟; it can be seen as a variant of [multiWOZ](https://github.com/budzianowski/multiwoz) or of the [schema guided dialogue dataset](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue#scheme-representation%5D) 311 | - [Situated and Interactive Multimodal Conversations](https://www.aclweb.org/anthology/2020.coling-main.96) Coling2020 [[SIMMC 1.0](https://arxiv.org/abs/2006.01460)], [[SIMMC 2.0](https://arxiv.org/pdf/2104.08667.pdf)], track in [DSTC9](https://dstc9.dstc.community/home) and [DSTC10](https://sites.google.com/dstc.community/dstc10/tracks) 312 | - [[Code](https://github.com/facebookresearch/simmc)] 313 | - Further papers 314 | - [A Response Retrieval Approach for Dialogue Using a Multi-Attentive Transformer](https://arxiv.org/abs/2012.08148) second winner of DSTC9 SIMMC fashion, [[code](https://github.com/D2KLab/dstc9-SIMMC)] 315 | - [Overview of the Ninth Dialog System Technology Challenge: DSTC9](https://arxiv.org/pdf/2011.06486.pdf) to better see the winners' models 316 | - [[Code winner1 TNU](https://github.com/billkunghappy/DSTC_TRACK4_ENTER)] (the code is a bit messy), [[Code winner2 SU](https://github.com/inkoon/simmc)], 
[[Code other](https://github.com/facebookresearch/simmc/blob/master/DSTC9_SIMMC_RESULTS.md)] 317 | 318 | - [**Fashion IQ**](https://sites.google.com/view/cvcreative2020/fashion-iq) in CVPR2020 workshop, [[paper](https://arxiv.org/pdf/1905.12794.pdf)] [[dataset & starter kit](https://github.com/XiaoxiaoGuo/fashion-iq)] 319 | 320 | - [**MMD** Towards Building Large Scale Multimodal Domain-Aware Conversation Systems](https://arxiv.org/abs/1704.00200), arXiv 2017, [[code](https://amritasaha1812.github.io/MMD/)], [Multimodal Dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations] 2017 321 | 322 | 323 | ### Video 324 | 325 | (13) video 326 | - [Audio Visual Scene-Aware Dialog Track in DSTC8](http://workshop.colips.org/dstc7/dstc8/Audiovisual_Scene_Aware_Dialog.pdf) [[Paper](https://ieeexplore.ieee.org/document/8953254)] [[site](https://video-dialog.com/)] 327 | - [CMU Sinbad’s Submission for the DSTC7 AVSD Challenge] 328 | - [DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator] 2020 329 | - [A Simple Baseline for Audio-Visual Scene-Aware Dialog] CVPR2019 330 | - [[TVQA](https://arxiv.org/abs/1809.01696)] [[MovieQA](http://movieqa.cs.toronto.edu/)] [[TGif-QA](https://arxiv.org/abs/1704.04497)] 331 | - [TVQA+: Spatio-Temporal Grounding for Video Question Answering](https://www.aclweb.org/anthology/2020.acl-main.730.pdf) ACL2020 332 | - [MultiSubs: A Large-scale Multimodal and Multilingual Dataset] 2021 333 | - [Adversarial Multimodal Network for Movie Question Answering] 2019 334 | - [What Makes Training Multi-Modal Classification Networks Hard?] CVPR2020 335 | - [DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue] 2021 336 | - Minecraft 337 | - [Learning to execute instructions in a Minecraft dialogue](https://www.aclweb.org/anthology/2020.acl-main.232) [[Code](https://github.com/prashant-jayan21/minecraft-bap-models)] ACL2020 338 | - [Collaborative Dialogue in Minecraft](https://www.aclweb.org/anthology/P19-1537) [[Code](https://github.com/prashant-jayan21/minecraft-dialogue)] ACL2019 339 | - video & QA/Dialog papers 340 | - representation learning 341 | - VideoBERT: A Joint Model for Video and Language Representation Learning 342 | - Learning Question-Guided Video Representation for Multi-Turn Video Question Answering ViGil@NeurIPS2019 343 | - Video Dialog via Progressive Inference and Cross-Transformer EMNLP2019 344 | - [Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems] ACL2019 345 | - [Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog] 2020 346 | - [Video-Grounded Dialogues with Pretrained Generation Language Models](https://www.aclweb.org/anthology/2020.acl-main.518.pdf) ACL2020 347 | - Graph 348 | - Location-Aware Graph Convolutional Networks for Video Question Answering 349 | - Object Relational Graph With Teacher-Recommended Learning for Video Captioning 350 | - Fusion 351 | - End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features IEEE2019 352 | - [See the Sound, Hear the Pixels] IEEE2020 353 | - [Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks] SIGIR2019 354 | - [Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks] IEEE2020 355 | - [Game-Based Video-Context Dialogue] EMNLP2018 356 | - [Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks] IEEE2019 357 | - [End-to-End Multimodal Dialog Systems 
with Hierarchical Multimodal Attention on Video Features] 2018 358 | 359 | 360 | ### Charts / figures 361 | (14) [LEAF-QA: Locate, Encode & Attend for Figure Question Answering](https://openaccess.thecvf.com/content_WACV_2020/papers/Chaudhry_LEAF-QA_Locate_Encode__Attend_for_Figure_Question_Answering_WACV_2020_paper.pdf) 362 | 363 | ### Meme 364 | (15) [MOD Meme incorporated Open Dialogue](https://anonymous.4open.science/r/e7eaef6a-b6d5-47c6-896f-93265a0af4b1/README.md) WeChat conversations with memes / stickers in the Chinese language. 365 | - A Multimodal Memes Classification: A Survey and Open Research Issues 366 | - [Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog] WWW2020 367 | - [Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog] 2020 368 | - [The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes] NeurIPS2020 369 | 370 | # Survey 371 | - [Multimodal Research in Vision and Language: A Review of Current and Emerging Trends](https://arxiv.org/pdf/2010.09522v2.pdf) arXiv2020 372 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) arXiv2021 373 | - [Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods](https://arxiv.org/abs/1907.09358) JAIR2020 374 | 375 | # Other github paperlists 376 | - [Multimodal-Dialogue-PaperList](https://github.com/silverriver/Multimodal-Dialogue-PaperList) 377 | - [Awesome-Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer) 378 | - [Transformer-in-Vision](https://github.com/DirtyHarryLYL/Transformer-in-Vision) 379 | - [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml) 380 | - [awesome-visual-question-answering](https://github.com/jokieleung/awesome-visual-question-answering) 381 | - [awesome-vqa-latest](https://github.com/Taaccoo/awesome-vqa-latest) 382 | - [awesome-visual-dialog](https://github.com/badripatro/awesome-visual-dialog) 383 | - [Awesome-Scene-Graphs](https://github.com/huoxingmeishi/Awesome-Scene-Graphs) 384 | - [awesome-vln](https://github.com/daqingliu/awesome-vln) 385 | 386 | # In general 387 | - Tasks 388 | - Visual Question Answering 389 | - Visual Dialog 390 | - Visual Commonsense Reasoning 391 | - Image-Text Retrieval 392 | - Referring Expression Comprehension 393 | - Visual Entailment 394 | - NL+V representation ==> multimodal pretraining 395 | - Issues / topics: 396 | - text and image bias 397 | - VL or LV bertology 398 | - visual understanding / reasoning / object relation 399 | - cross-modal text-image relation (attention on interaction; a minimal code sketch of this pattern is given at the end of this README) 400 | - incorporate knowledge / common sense (attention on knowledge) 401 | - Often used model elements: 402 | - [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497) 2015 403 | - LSTM 404 | - GANs 405 | - Transformers 406 | - Graphs: attention graph, GCN, memory graph, ... 407 | - often mentioned approaches: 408 | - adversarial training 409 | - reinforcement learning 410 | - graph neural network 411 | - joint learning / parallel / Dual encoder / Dual attention 412 | - my questions 413 | - what does "adaptive" mean exactly? why does everyone like this specific word? 414 | - "ground", a mysterious word too... 415 | - why is it often so hard to find released code for papers with "graph" or "reinforcement learning" in the title? 
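To make the "cross-modal text-image relation (attention on interaction)" topic above more concrete, here is a generic sketch of question-guided attention over object-detector region features, built from the often-used model elements listed above (a Faster R-CNN-style detector for regions, an LSTM for the question). It is not taken from any specific paper in this list, and all dimensions, layer names and the toy inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Toy cross-modal interaction: an LSTM question encoding attends over image regions."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project detector region features
        self.att = nn.Linear(hidden_dim, 1)                # question-conditioned attention scores
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)  # fused multimodal representation

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, T) word ids; region_feats: (B, K, region_dim) from e.g. Faster R-CNN
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                          # (B, hidden_dim) question vector
        v = torch.tanh(self.proj_v(region_feats))          # (B, K, hidden_dim) region vectors
        alpha = F.softmax(self.att(v * q.unsqueeze(1)), dim=1)  # attention over the K regions
        v_att = (alpha * v).sum(dim=1)                     # attended visual summary
        return torch.tanh(self.fuse(torch.cat([q, v_att], dim=-1)))

# toy usage with random tensors standing in for real detector / tokenizer outputs
model = QuestionGuidedAttention()
tokens = torch.randint(0, 10000, (2, 8))    # 2 questions of 8 tokens each
regions = torch.randn(2, 36, 2048)          # 36 region features per image
print(model(tokens, regions).shape)         # torch.Size([2, 512])
```

Most attention-based models in the list elaborate on this template with co-attention, multi-head attention, or graph structure over the regions.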
416 | -------------------------------------------------------------------------------- /paper-notes.md: -------------------------------------------------------------------------------- 1 | # Multi-Modal-Dialgoue-System-Paperlist 2 | 3 | This is a paper list for the multimodal dialogue systems topic. 4 | 5 | **Keyword**: Multi-modal, Dialogue system, visual, conversation 6 | 7 | # Paperlist 8 | 9 | ## Dataset & Challenges 10 | 11 | ### Images 12 | 13 | [GuessWhat?!](https://openaccess.thecvf.com/content_cvpr_2017/html/de_Vries_GuessWhat_Visual_Object_CVPR_2017_paper.html) Visual Object Discovery Through Multi-Modal Dialogue in CVPR2017, a two-player guessing game (1 oracle & 1 questioner). [Code](https://github.com/GuessWhatGame/guesswhat) 14 | - **Data and task**: 15 | - **Data**: images are from **MS COCO dataset**, questioners & oracles are from **Amazon Mechanical Turk**. Players are asked to shorten their dialogues to speed up the game (and therefore maximize their gains). The goal of the game is to **locate an unknown object** in a **rich image scene** (meaning that there're several objects in an image / photo) by asking a sequence of questions, eg. After a sequence of n questions (of yes / no / NA), it becomes possible to locate the object (highlighted by a green mask). Once the questioner has gathered enough evidence to locate the object, they notify the oracle that they are ready to guess the object. We then reveal the list of objects, and if the questioner picks the right object, we consider the game successful (*recall@k ??*). 16 | - **Task**: The **oracle task** requires to produce a yes-no answer for any object within a picture given a natural language question. The **questioner task** is divided into two different sub-tasks that are trained independently: The **Guesser** must predict the correct object 17 | O_correct from the set of all objects O given an image I and a sequence of questions and answers D_J . The **Question Generator** must produce a new question q_T+1 Given an image I and a sequence of T questions and answers D_≤T . 18 | - **Problematic**: How to create models that understand natural language descriptions and ground them in the visual world. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. 19 | - **Baseline model**: 20 | - **Oracle baseline**: a classification problem (yes/no/NA). Embedding = Image (VGG16) + Question (LSTM) + Crop (VGG16) + Spatial information (bbox) + Object Category taxonomy; MLP ; cross-entropy loss. **Reflection** In general, we expect the object crop to contain additional information, such as color information, beside the object class. However, we find that the object category outperforms the object crop embedding. This might be partly due to the imperfect feature extraction from the crops. 21 | - **Guesser**: a classification problem (among a list of objects). Embedding1 = Question(LSTM/HRED) + Image(VGG16), Embedding2 = Objects (MLP(Spatial+Category)), then dot product; **Reflection** In general, we find that including VGG features does not improve the performance of the LSTM and HRED models. We hypothesize that the VGG features are a too coarse representation of the image scene, and that most of the visual information is already encoded in the question and the object features (???) 22 | - **Question Generator**: encoder-decoder structure: P(q_2|q_1,a_1,Image(vgg)), We use our questioner model to first generate a question which is then answered by the oracle model. 
We repeat this procedure 5 times to obtain a dialogue. We then use the best performing guesser model to predict the object and report its error as the metric for the QGEN model. **Reflection** using the Oracle’s answers while training the Question Generator introduces more errors than using ground-truth answers. (A minimal code sketch of the Guesser baseline is given after the Visual QA notes below.) 23 | - **Further papers & models**: 24 | - **Paper**: [End-to-end optimization of goal-driven and visually grounded dialogue systems](https://arxiv.org/abs/1703.05423) Reinforcement Learning applied to GuessWhat?! based on **policy gradient algorithms**. **Problematic**: drawbacks of the encoder-decoder structure: 1. vast action space vs. unseen scenarios, inconsistent dialogues; 2. supervised learning in dialog systems does not account for the intrinsic planning problem, especially in task-oriented dialogues; 3. difficulty in naturally integrating external contexts (common ground); 4. difficulty in dialogue evaluation. **?quote?** In addition, successful applications of the RL framework to dialogue often rely on a predefined structure of the task, such as slot-filling tasks [Williams and Young, 2007] where the task can be casted as filling in a form. **Proposed model**: I could not follow the details of the proposed model. 25 | 26 | 27 | [ReferIt](http://tamaraberg.com/referitgame/) [paper](http://tamaraberg.com/papers/referit.pdf) in EMNLP2014 28 | - **Task**: a 2-player game; one player gives referring expressions for an object in the image and the other has to label it in the image. 29 | - **Data processing**: To avoid hand annotations, use attributes for referring expressions and a template-based parser. 30 | - **Model**: generates referring expressions (a set of attributes, not natural language); the paper is very mathematics-heavy and I did not follow the details. 31 | 32 | [Image Captioning] generating natural language descriptions of images. 33 | - **Data and task**: [MS Coco](https://arxiv.org/pdf/1405.0312.pdf) Images + captions (each image paired with several human-written caption sentences) 34 | - **Further papers & models**: 35 | - [Object relation transformer](https://papers.nips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf) in NIPS2019 **Model**: again an encoder-decoder structure, but this time a Transformer (!); given an input image, an object detection model extracts appearance and geometry features (i.e., a bounding box and the features of the object inside it), which are taken as inputs of the Transformer 36 | 37 | [Visual QA](https://visualqa.org/workshop.html) VQA datasets in CVPR2021, 2020, 2019, etc. 38 | - **Data and task**: 39 | - [VQA](https://visualqa.org/challenge) datasets [1.0](http://arxiv.org/abs/1505.00468) [2.0](https://arxiv.org/abs/1612.00837) 40 | - [TextVQA](https://textvqa.org/paper) TextVQA requires models to read and reason about text in an image to answer questions based on them. In order to perform well on this task, models need to first detect and read text in the images. Models then need to reason about this to answer the question. 41 | - [TextCap](https://arxiv.org/abs/2003.12462) TextCaps requires models to read and reason about text in images to generate captions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it and visual content in the image to generate image descriptions. 42 | - **Baseline models** 43 | - **Proposed papers & models**: 44 | 45 | 46 | 
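As referenced in the Question Generator note above, here is a minimal sketch of the GuessWhat?! Guesser baseline described in these notes: an LSTM encodes the dialogue, an MLP embeds each candidate object from its spatial features and category, and a dot product followed by a softmax scores the objects. The exact sizes (including the 8-d spatial encoding) are assumptions for illustration, not the authors' actual hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Guesser(nn.Module):
    """Sketch of the Guesser: dialogue LSTM state vs. MLP(spatial + category) object embeddings."""
    def __init__(self, vocab_size=5000, num_categories=90, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.dialogue_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.cat_emb = nn.Embedding(num_categories, emb_dim)
        self.obj_mlp = nn.Sequential(              # MLP over [spatial ; category] features
            nn.Linear(8 + emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, dialogue_tokens, obj_spatial, obj_category):
        # dialogue_tokens: (B, T); obj_spatial: (B, N, 8); obj_category: (B, N)
        _, (h, _) = self.dialogue_lstm(self.word_emb(dialogue_tokens))
        d = h[-1]                                             # (B, hidden_dim) dialogue state
        obj = self.obj_mlp(torch.cat([obj_spatial, self.cat_emb(obj_category)], dim=-1))
        scores = torch.bmm(obj, d.unsqueeze(-1)).squeeze(-1)  # dot product per object: (B, N)
        return scores                                         # softmax / cross-entropy over objects

guesser = Guesser()
scores = guesser(torch.randint(0, 5000, (2, 30)),             # dialogues of 30 tokens
                 torch.randn(2, 10, 8),                       # 10 candidate objects, 8-d boxes
                 torch.randint(0, 90, (2, 10)))               # object categories
loss = F.cross_entropy(scores, torch.tensor([3, 7]))          # target object indices
print(scores.shape, loss.item())
```

The Oracle baseline is analogous, except that the concatenated embeddings feed an MLP with a 3-way (yes / no / NA) output trained with cross-entropy.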
47 | [Visual Dialog](https://visualdialog.org/#:~:text=Visual%20Dialog%20is%20a%20novel,has%20to%20answer%20the%20question.) Open-domain dialogs: given an image, a dialog history, and a follow-up question about the image, the task is to answer the question. [[VisDial v1.0 dataset](https://visualdialog.org/data)] [[Paper](https://arxiv.org/abs/1611.08669)] [[Code diverse questions](https://github.com/vmurahari3/visdial-diversity)] [[Code rl](https://github.com/batra-mlp-lab/visdial-rl)] [[Code collect chat](https://github.com/batra-mlp-lab/visdial-amt-chat)] 🌟🌟 49 | 50 | [SIMMC](https://github.com/facebookresearch/simmc) Situated and Interactive Multimodal Conversations, track in [DSTC9](https://dstc9.dstc.community/home) and [DSTC10](https://sites.google.com/dstc.community/dstc10/tracks) by Facebook [[Paper](https://arxiv.org/abs/2006.01460)] Domains include furniture and fashion 🌟🌟🌟 51 | 52 | [Fashion IQ](https://sites.google.com/view/cvcreative2020/fashion-iq) in CVPR2020 workshop, [[paper](https://arxiv.org/pdf/1905.12794.pdf)] [[dataset & starter kit](https://github.com/XiaoxiaoGuo/fashion-iq)] 53 | 54 | ### Video 55 | 56 | [AVSD - Audio Visual Scene-Aware Dataset](https://video-dialog.com/) was used in [DSTC7](http://workshop.colips.org/dstc7/) and [DSTC8](https://sites.google.com/dstc.community/dstc8/tracks). The task is to build a system that generates response sentences in a dialog about an input VIDEO. The data collection paradigm is similar to VisDial. 57 | 58 | Related: [[TVQA](https://arxiv.org/abs/1809.01696)] [[MovieQA](http://movieqa.cs.toronto.edu/)] [[TGif-QA](https://arxiv.org/abs/1704.04497)] 59 | 60 | ### Meme 61 | 62 | [MOD Meme incorporated Open Dialogue](https://anonymous.4open.science/r/e7eaef6a-b6d5-47c6-896f-93265a0af4b1/README.md) WeChat conversations with memes / stickers in the Chinese language. 63 | 64 | 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /summary.md: -------------------------------------------------------------------------------- 1 | Multi-modal dialogue systems are dialogue systems that deal with multi-modal inputs and outputs: besides the textual modality, 2 | audio, visual or audio-visual features are taken into consideration as additional modalities. In this very simple and brief survey, we concentrate 3 | on visual dialogue systems first. This summary draws heavily on [this survey](https://arxiv.org/pdf/2010.09522.pdf). 4 | 5 | Visual dialogue can be seen as a variant subtask of Visual Question Answering (VQA), which, together with Visual Captioning, 6 | remains the most popular task across the various visual-linguistic tasks in the research community. It comprises VQA in dialogue (VQADi -- Visual Dialogue v1) and VQG in dialogue (VQGDi -- the GuessWhich task), 7 | wherein the main goal is to automate machine conversations with humans about images. 8 | 9 | The referenced survey defines the output formats in order to distinguish "generation tasks" from "classification tasks": 10 | 11 | > The output of the learnt mapping *f* could either belong to a set 12 | of possible answers in which case we refer this task format 13 | as MCQ, or could be arbitrary in nature depending on the 14 | question in which we can refer to as free-form. We regard the 15 | more generalized free-form VQA as a generation task, while 16 | MCQ VQA as a classification task where the model predicts 17 | the most suitable answer from a pool of choices. 18 | 19 | Generally, classical VQA datasets/tasks like VQA 1.0 & 2.0 use the MCQ output format, while visual dialogue tasks and VQG tasks on VQA 2.0 have a *free-form* output format. 
Specifically, the survey treats Visual Commonsense Reasoning (VCR) separately, as a popular task in parallel to the VQA tasks, which aims to develop higher-order cognition in vision systems and commonsense reasoning about the world so that they can provide justifications for their answers. 20 | 21 | I have to mention VQA because it is a group of classical and still-active datasets and tasks that are unavoidable before talking about Visual Dialogue, and many of its most popular issues, topics and methods also carry over to visual dialogue, like reasoning, deep understanding, modality bias, etc. In particular, the works on VCR show a special enthusiasm for bertology methods like ViLBERT, VisualBERT, VL-BERT, 22 | KVL-BERT, etc. (There is a paper on [The Exploration of the Reasoning Capability of BERT in Relation Extraction](https://ieeexplore.ieee.org/document/9202183), but it was only published in 2020, so why bertology suddenly became so popular in vision-and-language research remains a mystery to me...) Of course, visual dialogue tasks have bertology methods too, like VD-BERT. 23 | 24 | | Article | Dataset | Visual Encoder | Language Model | Encoder | Decoder | 25 | | ------------- |:-------------:|:---------------:|:------------:|:----------:|:---------:| 26 | | [Visual dialog](https://arxiv.org/pdf/1611.08669.pdf) | VisDial v1.0 | VGG16 | 2 diff LSTMs; dialog-RNN + Attention + LSTM; LSTM | Late fusion; HRE; MN | LSTM, Softmax | 27 | | [IQA Synergistic](https://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_Image-Question-Answer_Synergistic_Network_for_Visual_Dialog_CVPR_2019_paper.pdf) | VisDial v1.0 | Faster-RCNN, CNN | 2 diff LSTMs | MFB; (discriminative model: in the primary stage, answers are also encoded by an LSTM) | softmax; (generative model: answer decoded by LSTM) | 28 | | [LTMI](https://arxiv.org/pdf/1911.11390.pdf) | VisDial v1.0 | | | | | 29 | | [LF] | | | | fusion by concat | | 30 | --------------------------------------------------------------------------------
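To illustrate the "fusion by concat" entry in the table above, here is a rough sketch of a Late Fusion (LF) style encoder: the image feature, question encoding and dialogue-history encoding are simply concatenated and projected into a joint embedding. The 4096-d VGG16 feature and 512-d LSTMs are assumed sizes, and the decoder on top (generative LSTM + softmax, or discriminative answer ranking, as in the first table row) is omitted.

```python
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    """Sketch of a late-fusion visual dialogue encoder: concatenate image, question and history."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.q_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)  # question encoder
        self.h_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)  # dialogue-history encoder
        self.fusion = nn.Linear(img_dim + 2 * hidden_dim, hidden_dim)  # "fusion by concat"

    def forward(self, img_feat, question, history):
        # img_feat: (B, img_dim) precomputed image feature; question/history: (B, T) token ids
        _, (hq, _) = self.q_lstm(self.embed(question))
        _, (hh, _) = self.h_lstm(self.embed(history))
        joint = torch.cat([img_feat, hq[-1], hh[-1]], dim=-1)
        return torch.tanh(self.fusion(joint))      # joint dialogue-state embedding

enc = LateFusionEncoder()
state = enc(torch.randn(2, 4096),                  # stand-in for VGG16 fc features
            torch.randint(0, 10000, (2, 12)),      # current question tokens
            torch.randint(0, 10000, (2, 60)))      # flattened dialogue-history tokens
print(state.shape)                                 # torch.Size([2, 512])
```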