67 |
68 | - [**TRECQA**](https://trec.nist.gov/data/qa.html) was created by [Wang et al.](https://www.aclweb.org/anthology/D07-1003) from TREC QA track 8-13 data, with candidate answers automatically selected from each question’s document pool using a combination of overlapping non-stopword counts and pattern matching. This dataset is one of the most widely used benchmarks for answer sentence selection.
69 |
70 | - [**WikiQA**](https://www.microsoft.com/en-us/download/details.aspx?id=52419) is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering by Microsoft Research.
71 |
72 | - [**InsuranceQA**](https://github.com/shuzi/insuranceQA) is a non-factoid QA dataset from the insurance domain. Questions may have multiple correct answers, and the questions are normally much shorter than the answers: the average lengths of questions and answers are 7 and 95 tokens, respectively. For each question in the development and test sets, there is a set of 500 candidate answers.
73 |
74 | - [**FiQA**](https://sites.google.com/view/fiqa) is a non-factoid QA dataset from the financial domain, released for the WWW 2018 challenges. The dataset was built by crawling StackExchange, Reddit, and StockTwits; some of the questions are opinionated, targeting mined opinions and their respective entities, aspects, sentiment polarity, and opinion holders.
75 |
76 | - [**Yahoo! Answers**](https://webscope.sandbox.yahoo.com) is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The dataset is the Yahoo! Answers corpus as of 10/25/2007 and is a benchmark for community-based question answering. Its answers are relatively longer than those in TrecQA and WikiQA.
77 |
78 | - [**SemEval-2015 Task 3**](http://alt.qcri.org/semeval2015/task3/) consists of two sub-tasks. In Subtask A, given a question (short title + extended description) and several community answers, classify each answer as definitely relevant (good), potentially useful (potential), or bad/irrelevant (bad, dialogue, non-English, other). In Subtask B, given a YES/NO question (short title + extended description) and a list of community answers, decide whether the global answer to the question should be yes, no, or unsure.
79 |
80 | - [**SemEval-2016 Task 3**](http://alt.qcri.org/semeval2016/task3/) consists of two sub-tasks, namely _Question-Comment Similarity_ and _Question-Question Similarity_. In the _Question-Comment Similarity_ task, given a question from a question-comment thread, rank the comments according to their relevance to the question. In the _Question-Question Similarity_ task, given a new question, re-rank all similar questions retrieved by a search engine.
81 |
82 | - [**SemEval-2017 Task 3**](http://alt.qcri.org/semeval2017/task3/) contains two sub-tasks, namely _Question Similarity_ and _Relevance Classification_. Given a new question and a set of related questions from the collection, the _Question Similarity_ task is to rank the related questions according to their similarity to the original question, while the _Relevance Classification_ task is to rank the answer posts in a question-answer thread according to their relevance to the question.
83 |
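TRECQA's candidate pre-selection by overlapping non-stopword counts can be sketched as follows. This is a simplified illustration, not the exact pipeline of Wang et al.; the tokenizer and stopword list here are assumptions.

```python
# Rank candidate sentences from a question's document pool by the number of
# overlapping non-stopwords. Stopword list and whitespace tokenization are
# simplified stand-ins for the original pipeline.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was",
             "what", "who", "when", "where", "how"}

def content_words(text: str) -> set:
    """Lowercased tokens with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def overlap_score(question: str, candidate: str) -> int:
    """Count of non-stopwords shared between question and candidate."""
    return len(content_words(question) & content_words(candidate))

def rank_candidates(question: str, pool: list) -> list:
    """Sort the document pool by descending overlap with the question."""
    return sorted(pool, key=lambda s: overlap_score(question, s), reverse=True)
```

In the actual dataset construction this heuristic is combined with answer-pattern matching; the overlap count alone is only the first filter.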
84 | ## Performance
85 |
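The tables below report Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR), computed over the per-question ranked candidate lists. A minimal sketch of both metrics:

```python
def average_precision(labels):
    """labels: 0/1 relevance of candidates in ranked order for one question."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(labels):
    """1/rank of the first relevant candidate, 0 if none is relevant."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def map_mrr(ranked_lists):
    """Average both metrics over all questions."""
    n = len(ranked_lists)
    return (sum(average_precision(l) for l in ranked_lists) / n,
            sum(reciprocal_rank(l) for l in ranked_lists) / n)
```

For example, two questions with ranked relevance lists `[1, 0, 1, 0]` and `[0, 1]` yield MAP = 2/3 and MRR = 0.75.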
86 | ### TREC QA (Raw Version)
87 |
88 | | Model | Code | MAP | MRR | Paper |
89 | | :---------------------------------- | :----------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------- |
90 | | Punyakanok (2004) | — | 0.419 | 0.494 | [Mapping dependencies trees: An application to question answering, ISAIM 2004](http://cogcomp.cs.illinois.edu/papers/PunyakanokRoYi04a.pdf) |
91 | | Cui (2005) | — | 0.427 | 0.526 | [Question Answering Passage Retrieval Using Dependency Relations, SIGIR 2005](https://www.comp.nus.edu.sg/~kanmy/papers/f66-cui.pdf) |
92 | | Wang (2007) | — | 0.603 | 0.685 | [What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, EMNLP 2007](http://www.aclweb.org/anthology/D/D07/D07-1003.pdf) |
93 | | H&S (2010) | — | 0.609 | 0.692 | [Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions, NAACL 2010](http://www.aclweb.org/anthology/N10-1145) |
94 | | W&M (2010)                          |                              —                               |   0.595   |   0.695   | [Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering, COLING 2010](http://aclweb.org/anthology//C/C10/C10-1131.pdf) |
95 | | Yao (2013) | — | 0.631 | 0.748 | [Answer Extraction as Sequence Tagging with Tree Edit Distance, NAACL 2013](http://www.aclweb.org/anthology/N13-1106.pdf) |
96 | | S&M (2013) | — | 0.678 | 0.736 | [Automatic Feature Engineering for Answer Selection and Extraction, EMNLP 2013](http://www.aclweb.org/anthology/D13-1044.pdf) |
97 | | Backward (Shnarch et al., 2013) | — | 0.686 | 0.754 | [Probabilistic Models for Lexical Inference, Ph.D. thesis 2013](http://u.cs.biu.ac.il/~nlp/wp-content/uploads/eyal-thesis-library-ready.pdf) |
98 | | LCLR (Yih et al., 2013) | — | 0.709 | 0.770 | [Question Answering Using Enhanced Lexical Semantic Models, ACL 2013](http://research.microsoft.com/pubs/192357/QA-SentSel-Updated-PostACL.pdf) |
99 | | bigram+count (Yu et al., 2014) | — | 0.711 | 0.785 | [Deep Learning for Answer Sentence Selection, NIPS 2014](http://arxiv.org/pdf/1412.1632v1.pdf) |
100 | | BLSTM (W&N et al., 2015) | — | 0.713 | 0.791 | [A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering, ACL 2015](http://www.aclweb.org/anthology/P15-2116) |
101 | | Architecture-II (Feng et al., 2015) | — | 0.711 | 0.800 | [Applying deep learning to answer selection: A study and an open task, ASRU 2015](http://arxiv.org/abs/1508.01585) |
102 | | PairCNN (Severyn et al., 2015) | [](https://github.com/zhangzibin/PairCNN-Ranking) | 0.746 | 0.808 | [Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks, SIGIR 2015](http://disi.unitn.eu/moschitti/since2013/2015_SIGIR_Severyn_LearningRankShort.pdf) |
103 | | aNMM (Yang et al., 2016) | [](https://github.com/yangliuy/aNMM-CIKM16) [](https://github.com/NTMC-Community/MatchZoo-py/tree/master/matchzoo/models/anmm.py) | 0.750 | 0.811 | [aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model, CIKM 2016](http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1240) |
104 | | HDLA (Tay et al., 2017) | [](https://github.com/vanzytay/YahooQA_Splits) | 0.750 | 0.815 | [Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture, SIGIR 2017](https://arxiv.org/abs/1707.06372) |
105 | | PWIM (He et al., 2016)              |       [](https://github.com/castorini/VDPWI-NN-Torch)       |   0.758   |   0.822   | [Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement, NAACL 2016](https://cs.uwaterloo.ca/~jimmylin/publications/He_etal_NAACL-HTL2016.pdf) |
106 | | MP-CNN (He et al., 2015)            |        [](https://github.com/castorini/MP-CNN-Torch)        |   0.762   |   0.830   | [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015](http://aclweb.org/anthology/D/D15/D15-1181.pdf) |
107 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.770 | 0.825 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
108 | | MP-CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.780 | 0.834 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
109 | | HCAN (Rao et al., 2019) | — | 0.774 | 0.843 | [Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling, EMNLP 2019](https://jinfengr.github.io/publications/Rao_etal_EMNLP2019.pdf) |
110 | | MP-CNN (Tayyar et al., 2018) | — | 0.836 | 0.863 | [Integrating Question Classification and Deep Learning for improved Answer Selection, COLING 2018](https://aclanthology.coli.uni-saarland.de/papers/C18-1278/c18-1278) |
111 | | Pre-Attention (Kamath et al., 2019) | — | 0.852 | 0.891 | [Predicting and Integrating Expected Answer Types into a Simple Recurrent Neural Network Model for Answer Sentence Selection, CICLING 2019](https://hal.archives-ouvertes.fr/hal-02104488/) |
112 | | CETE (Laskar et al., 2020)          |                              —                               | **0.950** | **0.980** | [Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf) |
113 |
114 | ### TREC QA (Clean Version)
115 |
116 | | Model | Code | MAP | MRR | Paper |
117 | | :------------------------------- | :----------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------- |
118 | | W&I (2015) | — | 0.746 | 0.820 | [FAQ-based Question Answering via Word Alignment, arXiv 2015](http://arxiv.org/abs/1507.02628) |
119 | | LSTM (Tan et al., 2015) | [](https://github.com/Alan-Lee123/answer-selection) | 0.728 | 0.832 | [LSTM-Based Deep Learning Models for Nonfactoid Answer Selection, arXiv 2015](http://arxiv.org/abs/1511.04108) |
120 | | AP-CNN (dos Santos et al., 2016) |                              —                               |   0.753   |   0.851   | [Attentive Pooling Networks, arXiv 2016](http://arxiv.org/abs/1602.03609) |
121 | | L.D.C Model (Wang et al., 2016) | — | 0.771 | 0.845 | [Sentence Similarity Learning by Lexical Decomposition and Composition, COLING 2016](http://arxiv.org/pdf/1602.07019v1.pdf) |
122 | | MP-CNN (He et al., 2015)         |        [](https://github.com/castorini/MP-CNN-Torch)        |   0.777   |   0.836   | [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015](http://aclweb.org/anthology/D/D15/D15-1181.pdf) |
123 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.784 | 0.865 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
124 | | MP-CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.801 | 0.877 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
125 | | BiMPM (Wang et al., 2017)        | [](https://github.com/zhiguowang/BiMPM) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/bimpm.py) |   0.802   |   0.875   | [Bilateral Multi-Perspective Matching for Natural Language Sentences, IJCAI 2017](https://arxiv.org/pdf/1702.03814.pdf) |
126 | | CA (Bian et al., 2017) | [](https://github.com/wjbianjason/Dynamic-Clip-Attention) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/dynamic_clip.py) | 0.821 | 0.899 | [A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017](https://dl.acm.org/citation.cfm?id=3133089&CFID=791659397&CFTOKEN=43388059) |
127 | | IWAN (Shen et al., 2017) | — | 0.822 | 0.889 | [Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017](https://aclanthology.info/pdf/D/D17/D17-1122.pdf) |
128 | | sCARNN (Tran et al., 2018) | — | 0.829 | 0.875 | [The Context-dependent Additive Recurrent Neural Net, NAACL 2018](http://www.aclweb.org/anthology/N18-1115) |
129 | | MCAN (Tay et al., 2018) | — | 0.838 | 0.904 | [Multi-Cast Attention Networks, KDD 2018](https://arxiv.org/abs/1806.00778) |
130 | | MP-CNN (Tayyar et al., 2018) | — | 0.865 | 0.904 | [Integrating Question Classification and Deep Learning for improved Answer Selection, COLING 2018](https://aclanthology.coli.uni-saarland.de/papers/C18-1278/c18-1278) |
131 | | CA + LM + LC (Yoon et al., 2019) | — | 0.868 | 0.928 | [A Compare-Aggregate Model with Latent Clustering for Answer Selection, CIKM 2019](https://arxiv.org/abs/1905.12897) |
132 | | GSAMN (Lai et al., 2019) | [](https://github.com/laituan245/StackExchangeQA) | 0.914 | 0.957 | [A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019](https://arxiv.org/pdf/1909.09696.pdf) |
133 | | TANDA (Garg et al., 2019) | [](https://github.com/alexa/wqa_tanda) | **0.943** | 0.974 | [TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020](https://arxiv.org/abs/1911.04118) |
134 | | CETE (Laskar et al., 2020) | — | 0.936 | **0.978** | [Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf) |
135 |
136 | ### WikiQA
137 |
138 | | Model | Code | MAP | MRR | Paper |
139 | | ---------------------------------------- | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | --------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
140 | | ABCNN (Yin et al., 2016)                 |                                                                                     [](https://github.com/galsang/ABCNN)                                                                                      | 0.6921    | 0.7108    | [ABCNN: Attention-based convolutional neural network for modeling sentence pairs, TACL 2016](https://doi.org/10.1162/tacl_a_00097) |
141 | | Multi-Perspective CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.701 | 0.718 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
142 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.705 | 0.720 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
143 | | KVMN (Miller et al., 2016)               |                                                                              [](https://github.com/siyuanzhao/key-value-memory-networks)                                                                               | 0.7069    | 0.7265    | [Key-Value Memory Networks for Directly Reading Documents, EMNLP 2016](https://doi.org/10.18653/v1/D16-1147) |
144 | | BiMPM (Wang et al., 2017) | [](https://github.com/zhiguowang/BiMPM) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/bimpm.py) | 0.718 | 0.731 | [Bilateral Multi-Perspective Matching for Natural Language Sentences, IJCAI 2017](https://arxiv.org/pdf/1702.03814.pdf) |
145 | | IWAN (Shen et al., 2017) | — | 0.733 | 0.750 | [Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017](https://aclanthology.info/pdf/D/D17/D17-1122.pdf) |
146 | | CA (Wang and Jiang, 2017) | [](https://github.com/pcgreat/SeqMatchSeq) | 0.7433 | 0.7545 | [A Compare-Aggregate Model for Matching Text Sequences, ICLR 2017](https://arxiv.org/abs/1611.01747) |
147 | | HCRN (Tay et al., 2018c) | [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/hcrn.py) | 0.7430 | 0.7560 | [Hermitian co-attention networks for text matching in asymmetrical domains, IJCAI 2018](https://www.ijcai.org/proceedings/2018/615) |
148 | | Compare-Aggregate (Bian et al., 2017) | [](https://github.com/wjbianjason/Dynamic-Clip-Attention) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/dynamic_clip.py) | 0.748 | 0.758 | [A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017](https://dl.acm.org/citation.cfm?id=3133089&CFID=791659397&CFTOKEN=43388059) |
149 | | RE2 (Yang et al., 2019) | [](https://github.com/alibaba-edu/simple-effective-text-matching) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/re2.py) | 0.7452 | 0.7618 | [Simple and Effective Text Matching with Richer Alignment Features, ACL 2019](https://www.aclweb.org/anthology/P19-1465.pdf) |
150 | | GSAMN (Lai et al., 2019) | [](https://github.com/laituan245/StackExchangeQA) | 0.857 | 0.872 | [A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019](https://arxiv.org/pdf/1909.09696.pdf) |
151 | | TANDA (Garg et al., 2019) | [](https://github.com/alexa/wqa_tanda) | **0.920** | **0.933** | [TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020](https://arxiv.org/abs/1911.04118) |
152 |
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 | gem 'github-pages', group: :jekyll_plugins
--------------------------------------------------------------------------------
/LFQA/LFQA.md:
--------------------------------------------------------------------------------
1 | # Community Question Answering
2 |
3 | **Long-Form Question Answering** is the task of automatically searching for relevant answers among the many responses provided for a given question (Answer Selection), and of searching for relevant questions in order to reuse their existing answers (Question Retrieval).
4 |
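As an illustration of the Answer Selection setting, here is a minimal bag-of-words cosine baseline. The helper names are hypothetical, and real systems use the neural matching models listed in the performance tables; this sketch only makes the task concrete.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words term counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_answer(question: str, candidates: list) -> str:
    """Return the candidate response most similar to the question."""
    q = bow(question)
    return max(candidates, key=lambda c: cosine(q, bow(c)))
```

Question Retrieval works the same way with candidate questions in place of candidate answers; the models below replace the cosine score with learned matching functions.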
5 | ## Classic Datasets
6 |
7 |
67 |
68 | - [**TRECQA**](https://trec.nist.gov/data/qa.html) was created by [Wang et al.](https://www.aclweb.org/anthology/D07-1003) from TREC QA track 8-13 data, with candidate answers automatically selected from each question’s document pool using a combination of overlapping non-stopword counts and pattern matching. This dataset is one of the most widely used benchmarks for answer sentence selection.
69 |
70 | - [**WikiQA**](https://www.microsoft.com/en-us/download/details.aspx?id=52419) is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering by Microsoft Research.
71 |
72 | - [**InsuranceQA**](https://github.com/shuzi/insuranceQA) is a non-factoid QA dataset from the insurance domain. Questions may have multiple correct answers, and the questions are normally much shorter than the answers: the average lengths of questions and answers are 7 and 95 tokens, respectively. For each question in the development and test sets, there is a set of 500 candidate answers.
73 |
74 | - [**FiQA**](https://sites.google.com/view/fiqa) is a non-factoid QA dataset from the financial domain, released for the WWW 2018 challenges. The dataset was built by crawling StackExchange, Reddit, and StockTwits; some of the questions are opinionated, targeting mined opinions and their respective entities, aspects, sentiment polarity, and opinion holders.
75 |
76 | - [**Yahoo! Answers**](https://webscope.sandbox.yahoo.com) is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The dataset is the Yahoo! Answers corpus as of 10/25/2007 and is a benchmark for community-based question answering. Its answers are relatively longer than those in TrecQA and WikiQA.
77 |
78 | - [**SemEval-2015 Task 3**](http://alt.qcri.org/semeval2015/task3/) consists of two sub-tasks. In Subtask A, given a question (short title + extended description) and several community answers, classify each answer as definitely relevant (good), potentially useful (potential), or bad/irrelevant (bad, dialogue, non-English, other). In Subtask B, given a YES/NO question (short title + extended description) and a list of community answers, decide whether the global answer to the question should be yes, no, or unsure.
79 |
80 | - [**SemEval-2016 Task 3**](http://alt.qcri.org/semeval2016/task3/) consists of two sub-tasks, namely _Question-Comment Similarity_ and _Question-Question Similarity_. In the _Question-Comment Similarity_ task, given a question from a question-comment thread, rank the comments according to their relevance to the question. In the _Question-Question Similarity_ task, given a new question, re-rank all similar questions retrieved by a search engine.
81 |
82 | - [**SemEval-2017 Task 3**](http://alt.qcri.org/semeval2017/task3/) contains two sub-tasks, namely _Question Similarity_ and _Relevance Classification_. Given a new question and a set of related questions from the collection, the _Question Similarity_ task is to rank the related questions according to their similarity to the original question, while the _Relevance Classification_ task is to rank the answer posts in a question-answer thread according to their relevance to the question.
83 |
84 | ## Performance
85 |
86 | ### TREC QA (Raw Version)
87 |
88 | | Model | Code | MAP | MRR | Paper |
89 | | :---------------------------------- | :----------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------- |
90 | | Punyakanok (2004) | — | 0.419 | 0.494 | [Mapping dependencies trees: An application to question answering, ISAIM 2004](http://cogcomp.cs.illinois.edu/papers/PunyakanokRoYi04a.pdf) |
91 | | Cui (2005) | — | 0.427 | 0.526 | [Question Answering Passage Retrieval Using Dependency Relations, SIGIR 2005](https://www.comp.nus.edu.sg/~kanmy/papers/f66-cui.pdf) |
92 | | Wang (2007) | — | 0.603 | 0.685 | [What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, EMNLP 2007](http://www.aclweb.org/anthology/D/D07/D07-1003.pdf) |
93 | | H&S (2010) | — | 0.609 | 0.692 | [Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions, NAACL 2010](http://www.aclweb.org/anthology/N10-1145) |
94 | | W&M (2010)                          |                              —                               |   0.595   |   0.695   | [Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering, COLING 2010](http://aclweb.org/anthology//C/C10/C10-1131.pdf) |
95 | | Yao (2013) | — | 0.631 | 0.748 | [Answer Extraction as Sequence Tagging with Tree Edit Distance, NAACL 2013](http://www.aclweb.org/anthology/N13-1106.pdf) |
96 | | S&M (2013) | — | 0.678 | 0.736 | [Automatic Feature Engineering for Answer Selection and Extraction, EMNLP 2013](http://www.aclweb.org/anthology/D13-1044.pdf) |
97 | | Backward (Shnarch et al., 2013) | — | 0.686 | 0.754 | [Probabilistic Models for Lexical Inference, Ph.D. thesis 2013](http://u.cs.biu.ac.il/~nlp/wp-content/uploads/eyal-thesis-library-ready.pdf) |
98 | | LCLR (Yih et al., 2013) | — | 0.709 | 0.770 | [Question Answering Using Enhanced Lexical Semantic Models, ACL 2013](http://research.microsoft.com/pubs/192357/QA-SentSel-Updated-PostACL.pdf) |
99 | | bigram+count (Yu et al., 2014) | — | 0.711 | 0.785 | [Deep Learning for Answer Sentence Selection, NIPS 2014](http://arxiv.org/pdf/1412.1632v1.pdf) |
100 | | BLSTM (W&N et al., 2015) | — | 0.713 | 0.791 | [A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering, ACL 2015](http://www.aclweb.org/anthology/P15-2116) |
101 | | Architecture-II (Feng et al., 2015) | — | 0.711 | 0.800 | [Applying deep learning to answer selection: A study and an open task, ASRU 2015](http://arxiv.org/abs/1508.01585) |
102 | | PairCNN (Severyn et al., 2015) | [](https://github.com/zhangzibin/PairCNN-Ranking) | 0.746 | 0.808 | [Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks, SIGIR 2015](http://disi.unitn.eu/moschitti/since2013/2015_SIGIR_Severyn_LearningRankShort.pdf) |
103 | | aNMM (Yang et al., 2016) | [](https://github.com/yangliuy/aNMM-CIKM16) [](https://github.com/NTMC-Community/MatchZoo-py/tree/master/matchzoo/models/anmm.py) | 0.750 | 0.811 | [aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model, CIKM 2016](http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1240) |
104 | | HDLA (Tay et al., 2017) | [](https://github.com/vanzytay/YahooQA_Splits) | 0.750 | 0.815 | [Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture, SIGIR 2017](https://arxiv.org/abs/1707.06372) |
105 | | PWIM (He et al., 2016)              |       [](https://github.com/castorini/VDPWI-NN-Torch)       |   0.758   |   0.822   | [Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement, NAACL 2016](https://cs.uwaterloo.ca/~jimmylin/publications/He_etal_NAACL-HTL2016.pdf) |
106 | | MP-CNN (He et al., 2015)            |        [](https://github.com/castorini/MP-CNN-Torch)        |   0.762   |   0.830   | [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015](http://aclweb.org/anthology/D/D15/D15-1181.pdf) |
107 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.770 | 0.825 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
108 | | MP-CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.780 | 0.834 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
109 | | HCAN (Rao et al., 2019) | — | 0.774 | 0.843 | [Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling, EMNLP 2019](https://jinfengr.github.io/publications/Rao_etal_EMNLP2019.pdf) |
110 | | MP-CNN (Tayyar et al., 2018) | — | 0.836 | 0.863 | [Integrating Question Classification and Deep Learning for improved Answer Selection, COLING 2018](https://aclanthology.coli.uni-saarland.de/papers/C18-1278/c18-1278) |
111 | | Pre-Attention (Kamath et al., 2019) | — | 0.852 | 0.891 | [Predicting and Integrating Expected Answer Types into a Simple Recurrent Neural Network Model for Answer Sentence Selection, CICLING 2019](https://hal.archives-ouvertes.fr/hal-02104488/) |
112 | | CETE (Laskar et al., 2020)          |                              —                               | **0.950** | **0.980** | [Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf) |
113 |
114 | ### TREC QA (Clean Version)
115 |
116 | | Model | Code | MAP | MRR | Paper |
117 | | :------------------------------- | :----------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------- |
118 | | W&I (2015) | — | 0.746 | 0.820 | [FAQ-based Question Answering via Word Alignment, arXiv 2015](http://arxiv.org/abs/1507.02628) |
119 | | LSTM (Tan et al., 2015) | [](https://github.com/Alan-Lee123/answer-selection) | 0.728 | 0.832 | [LSTM-Based Deep Learning Models for Nonfactoid Answer Selection, arXiv 2015](http://arxiv.org/abs/1511.04108) |
120 | | AP-CNN (dos Santos et al., 2016) |                              —                               |   0.753   |   0.851   | [Attentive Pooling Networks, arXiv 2016](http://arxiv.org/abs/1602.03609) |
121 | | L.D.C Model (Wang et al., 2016) | — | 0.771 | 0.845 | [Sentence Similarity Learning by Lexical Decomposition and Composition, COLING 2016](http://arxiv.org/pdf/1602.07019v1.pdf) |
122 | | MP-CNN (He et al., 2015)         |        [](https://github.com/castorini/MP-CNN-Torch)        |   0.777   |   0.836   | [Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015](http://aclweb.org/anthology/D/D15/D15-1181.pdf) |
123 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.784 | 0.865 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
124 | | MP-CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.801 | 0.877 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
125 | | BiMPM (Wang et al., 2017)        | [](https://github.com/zhiguowang/BiMPM) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/bimpm.py) |   0.802   |   0.875   | [Bilateral Multi-Perspective Matching for Natural Language Sentences, IJCAI 2017](https://arxiv.org/pdf/1702.03814.pdf) |
126 | | CA (Bian et al., 2017) | [](https://github.com/wjbianjason/Dynamic-Clip-Attention) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/dynamic_clip.py) | 0.821 | 0.899 | [A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017](https://dl.acm.org/citation.cfm?id=3133089&CFID=791659397&CFTOKEN=43388059) |
127 | | IWAN (Shen et al., 2017) | — | 0.822 | 0.889 | [Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017](https://aclanthology.info/pdf/D/D17/D17-1122.pdf) |
128 | | sCARNN (Tran et al., 2018) | — | 0.829 | 0.875 | [The Context-dependent Additive Recurrent Neural Net, NAACL 2018](http://www.aclweb.org/anthology/N18-1115) |
129 | | MCAN (Tay et al., 2018) | — | 0.838 | 0.904 | [Multi-Cast Attention Networks, KDD 2018](https://arxiv.org/abs/1806.00778) |
130 | | MP-CNN (Tayyar et al., 2018) | — | 0.865 | 0.904 | [Integrating Question Classification and Deep Learning for improved Answer Selection, COLING 2018](https://aclanthology.coli.uni-saarland.de/papers/C18-1278/c18-1278) |
131 | | CA + LM + LC (Yoon et al., 2019) | — | 0.868 | 0.928 | [A Compare-Aggregate Model with Latent Clustering for Answer Selection, CIKM 2019](https://arxiv.org/abs/1905.12897) |
132 | | GSAMN (Lai et al., 2019) | [](https://github.com/laituan245/StackExchangeQA) | 0.914 | 0.957 | [A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019](https://arxiv.org/pdf/1909.09696.pdf) |
133 | | TANDA (Garg et al., 2019) | [](https://github.com/alexa/wqa_tanda) | **0.943** | 0.974 | [TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020](https://arxiv.org/abs/1911.04118) |
134 | | CETE (Laskar et al., 2020) | — | 0.936 | **0.978** | [Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf) |
135 |
136 | ### WikiQA
137 |
138 | | Model | Code | MAP | MRR | Paper |
139 | | ---------------------------------------- | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | --------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
140 | | ABCNN (Yin et al., 2016)                 |                                                                                     [](https://github.com/galsang/ABCNN)                                                                                      | 0.6921    | 0.7108    | [ABCNN: Attention-based convolutional neural network for modeling sentence pairs, TACL 2016](https://doi.org/10.1162/tacl_a_00097) |
141 | | Multi-Perspective CNN (Rao et al., 2016) | [](https://github.com/castorini/NCE-CNN-Torch) | 0.701 | 0.718 | [Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016](https://dl.acm.org/authorize.cfm?key=N27026) |
142 | | HyperQA (Tay et al., 2017) | [](https://github.com/vanzytay/WSDM2018_HyperQA) | 0.705 | 0.720 | [Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018](https://arxiv.org/pdf/1707.07847) |
143 | | KVMN (Miller et al., 2016)               |                                                                              [](https://github.com/siyuanzhao/key-value-memory-networks)                                                                               | 0.7069    | 0.7265    | [Key-Value Memory Networks for Directly Reading Documents, EMNLP 2016](https://doi.org/10.18653/v1/D16-1147) |
144 | | BiMPM (Wang et al., 2017) | [](https://github.com/zhiguowang/BiMPM) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/bimpm.py) | 0.718 | 0.731 | [Bilateral Multi-Perspective Matching for Natural Language Sentences, IJCAI 2017](https://arxiv.org/pdf/1702.03814.pdf) |
145 | | IWAN (Shen et al., 2017) | — | 0.733 | 0.750 | [Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017](https://aclanthology.info/pdf/D/D17/D17-1122.pdf) |
146 | | CA (Wang and Jiang, 2017) | [](https://github.com/pcgreat/SeqMatchSeq) | 0.7433 | 0.7545 | [A Compare-Aggregate Model for Matching Text Sequences, ICLR 2017](https://arxiv.org/abs/1611.01747) |
147 | | HCRN (Tay et al., 2018c) | [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/hcrn.py) | 0.7430 | 0.7560 | [Hermitian co-attention networks for text matching in asymmetrical domains, IJCAI 2018](https://www.ijcai.org/proceedings/2018/615) |
148 | | Compare-Aggregate (Bian et al., 2017) | [](https://github.com/wjbianjason/Dynamic-Clip-Attention) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/dynamic_clip.py) | 0.748 | 0.758 | [A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017](https://dl.acm.org/citation.cfm?id=3133089&CFID=791659397&CFTOKEN=43388059) |
149 | | RE2 (Yang et al., 2019) | [](https://github.com/alibaba-edu/simple-effective-text-matching) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/re2.py) | 0.7452 | 0.7618 | [Simple and Effective Text Matching with Richer Alignment Features, ACL 2019](https://www.aclweb.org/anthology/P19-1465.pdf) |
150 | | GSAMN (Lai et al., 2019) | [](https://github.com/laituan245/StackExchangeQA) | 0.857 | 0.872 | [A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019](https://arxiv.org/pdf/1909.09696.pdf) |
151 | | TANDA (Garg et al., 2019) | [](https://github.com/alexa/wqa_tanda) | **0.920** | **0.933** | [TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020](https://arxiv.org/abs/1911.04118) |
152 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Neural Text Similarity Community
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Natural-Language-Inference/Natural-Language-Inference.md:
--------------------------------------------------------------------------------
1 | # Natural Language Inference
2 |
3 | **Natural Language Inference** is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
4 |
5 | ## Classic Datasets
6 |
7 |
29 |
30 | - [**SNLI**](https://arxiv.org/abs/1508.05326) is short for Stanford Natural Language Inference, a corpus of 570k human-annotated sentence pairs. The premise data is drawn from the captions of the Flickr30k corpus, and the hypothesis data is manually composed.
31 | - [**MultiNLI**](https://arxiv.org/abs/1704.05426) is short for Multi-Genre NLI, which has 433k sentence pairs; its collection process and task details are modeled closely on SNLI. The premise data is collected from a maximally broad range of genres of American English, such as non-fiction genres (SLATE, OUP, GOVERNMENT, VERBATIM, TRAVEL), spoken genres (TELEPHONE, FACE-TO-FACE), less formal written genres (FICTION, LETTERS), and a specialized one for 9/11.
32 | - [**SciTail**](http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf) is an entailment dataset consisting of 27k sentence pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises.
33 |
34 | ## Performance
35 |
36 | ### SNLI
37 |
38 | | Model | Code| Accuracy | Paper |
39 | | ------------------------------------------------------------ | :-----------------------------------: | ------------------------------------------------------------ | ------------------------------------------------------------ |
40 | | Match-LSTM (Wang et al. ,2016) | [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/matchlstm.py) | 86.1 | [Learning Natural Language Inference with LSTM](https://www.aclweb.org/anthology/N16-1170.pdf) |
41 | | Decomposable (Parikh et al., 2016) |—|86.3/86.8(Intra-sentence attention) | [A Decomposable Attention Model for Natural Language Inference](https://arxiv.org/pdf/1606.01933.pdf) |
42 | | BiMPM (Wang et al., 2017) | [](https://zhiguowang.github.io/) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/bimpm.py)| 86.9 | [Bilateral Multi-Perspective Matching for Natural Language Sentences](https://arxiv.org/pdf/1702.03814.pdf) |
43 | | Shortcut-Stacked BiLSTM (Nie et al., 2017) | [](https://github.com/easonnie/multiNLI_encoder) | 86.1 | [Shortcut-Stacked Sentence Encoders for Multi-Domain Inference](https://arxiv.org/pdf/1708.02312.pdf) |
44 | | ESIM (Chen et al., 2017) | [](https://github.com/lukecq1231/nli) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/esim.py) |88.0/88.6(Tree-LSTM) | [Enhanced LSTM for Natural Language Inference](https://arxiv.org/pdf/1609.06038.pdf) |
45 | | DIIN (Gong et al., 2018) | [](https://github.com/YichenGong/Densely-Interactive-Inference-Network) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/diin.py) | 88.0 | [Natural Language Inference over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf) |
46 | | SAN (Liu et al., 2018) |—| 88.7 | [Stochastic Answer Networks for Natural Language Inference](https://arxiv.org/pdf/1804.07888.pdf) |
47 | | AF-DMN (Duan et al., 2018) |—| 88.6 | [Attention-Fused Deep Matching Network for Natural Language Inference](https://www.ijcai.org/Proceedings/2018/0561.pdf) |
48 | | MwAN (Tan et al., 2018) |—| 88.3 | [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/Proceedings/2018/0613.pdf) |
49 | | HBMP (Talman et al., 2018) | [](https://github.com/Helsinki-NLP/HBMP) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/hbmp.py) | 86.6 | [Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture](https://arxiv.org/pdf/1808.08762v1.pdf) |
50 | | CAFE (Tay et al., 2018) |—| 88.5 | [Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference](https://arxiv.org/pdf/1801.00102v2.pdf) |
51 | | DSA (Yoon et al., 2018) |—| 86.8 | [Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding](https://arxiv.org/pdf/1808.07383.pdf) |
52 | | Enhancing Sentence Embedding with Generalized Pooling (Chen et al., 2018) | [](https://github.com/lukecq1231/generalized-pooling) | 86.6 | [Enhancing Sentence Embedding with Generalized Pooling](https://arxiv.org/pdf/1806.09828.pdf?source=post_page---------------------------) |
53 | | ReSAN (Shen et al., 2018) | [](https://github.com/taoshen58/DiSAN/tree/master/ReSAN) | 86.3 | [Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling](https://arxiv.org/pdf/1801.10296.pdf) |
54 | | DMAN (Pan et al., 2018) |—| 88.8 | [Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference](https://www.aclweb.org/anthology/P18-1091.pdf) |
55 | | DRCN (Kim et al., 2018) |—| 90.1 | [Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information](https://www.aaai.org/ojs/index.php/AAAI/article/download/4627/4505) |
56 | | RE2 (Yang et al., 2019) | [](https://github.com/hitvoice/RE2) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/re2.py) | 88.9 | [Simple and Effective Text Matching with Richer Alignment Features](https://arxiv.org/pdf/1908.00300.pdf) |
57 | | MT-DNN (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn) | **91.1(base)/91.6(large)** | [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/pdf/1901.11504.pdf) |
58 |
59 | ### MNLI
60 |
61 | | Model |Code| Matched Accuracy | Mismatched Accuracy | Paper |
62 | | ------------------------------------------------------------ | :-----------------------: | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
63 | | ESIM (Chen et al., 2017) | [](https://github.com/lukecq1231/nli) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/esim.py) | 76.8 | 75.8 | [Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference](https://arxiv.org/pdf/1708.01353.pdf) |
64 | | Shortcut-Stacked BiLSTM (Nie et al., 2017) | [](https://github.com/easonnie/multiNLI_encoder) | 74.6 | 73.6 | [Shortcut-Stacked Sentence Encoders for Multi-Domain Inference](https://arxiv.org/pdf/1708.02312.pdf) |
65 | | HBMP (Talman et al., 2018) | [](https://github.com/Helsinki-NLP/HBMP) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/hbmp.py) | 73.7 | 73.0 | [Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture](https://arxiv.org/pdf/1808.08762v1.pdf) |
66 | | Generalized Pooling (Chen et al., 2018) | [](https://github.com/lukecq1231/generalized-pooling) | 73.8 | 74.0 | [Enhancing Sentence Embedding with Generalized Pooling](https://arxiv.org/pdf/1806.09828.pdf?source=post_page---------------------------) |
67 | | AF-DMN (Duan et al., 2018) |—| 76.9 | 76.3 | [Attention-Fused Deep Matching Network for Natural Language Inference](https://www.ijcai.org/Proceedings/2018/0561.pdf) |
68 | | DIIN (Gong et al., 2018) | [](https://github.com/YichenGong/Densely-Interactive-Inference-Network) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/diin.py) | 78.8 | 77.8 | [Natural Language Inference over Interaction Space](https://github.com/YichenGong/Densely-Interactive-Inference-Network) |
69 | | SAN (Liu et al., 2018) |—| **79.3** | **78.7** | [Stochastic Answer Networks for Natural Language Inference](https://arxiv.org/pdf/1804.07888.pdf) |
70 | | MwAN (Tan et al., 2018) |—| 78.5 | 77.7 | [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/Proceedings/2018/0613.pdf) |
71 | | CAFE (Tay et al., 2018) |—| 78.7 | 77.9 | [Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference](https://arxiv.org/pdf/1801.00102v2.pdf) |
72 | | DRCN (Kim et al., 2018) |—| 79.1 | 78.4 | [Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information](https://www.aaai.org/ojs/index.php/AAAI/article/download/4627/4505) |
73 | | DMAN (Pan et al., 2018) |—| 78.9 | 78.2 | [Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference](https://www.aclweb.org/anthology/P18-1091.pdf) |
74 |
75 |
76 | ### SciTail
77 |
78 | | Model | Code | Accuracy | Paper |
79 | | -------------------------- | :----------------------: | ------------------------------------------------------------ | ------------------------------------------------------------ |
80 | | SAN (Liu et al., 2018) |—| 88.4 | [Stochastic Answer Networks for Natural Language Inference](https://arxiv.org/pdf/1804.07888.pdf) |
81 | | HCRN (Tay et al., 2018) |—| 80.0 | [Hermitian Co-Attention Networks for Text Matching in Asymmetrical Domains](https://www.ijcai.org/Proceedings/2018/0615.pdf) |
82 | | HBMP (Talman et al., 2018) | [](https://github.com/Helsinki-NLP/HBMP) [](https://github.com/NTMC-Community/MatchZoo-py/blob/master/matchzoo/models/hbmp.py) | 86.0 | [Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture](https://arxiv.org/pdf/1808.08762v1.pdf) |
83 | | CAFE (Tay et al., 2018) |—| 83.3 | [Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference](https://arxiv.org/pdf/1801.00102v2.pdf) |
84 | | RE2 (Yang et al., 2019) | [](https://github.com/hitvoice/RE2) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/re2.py) | 86.0 | [Simple and Effective Text Matching with Richer Alignment Features](https://arxiv.org/pdf/1908.00300.pdf) |
85 | | MT-DNN (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn) | **94.1(base)/95.0(large)** | [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/pdf/1901.11504.pdf) |
86 |
--------------------------------------------------------------------------------
/Paraphrase-Identification/Paraphrase-Identification.md:
--------------------------------------------------------------------------------
1 | # Paraphrase Identification
2 |
3 | ---
4 |
5 | **Paraphrase Identification** is the task of determining whether two sentences have the same meaning, a problem considered a touchstone of natural language understanding.
6 |
7 | Take an instance from the MRPC dataset as an example; the following is a pair of sentences with the same meaning:
8 |
9 | sentence1: Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.
10 |
11 | sentence2: Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
12 |
13 |
14 | Some benchmark datasets are listed below.
15 |
16 | ## Classic Datasets
17 |
18 |
48 |
49 | - [**MRPC**](https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2Fdefault.aspx) is short for Microsoft Research Paraphrase Corpus. It contains 5,800 pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
50 | - [**SentEval**](https://arxiv.org/pdf/1803.05449.pdf) encompasses semantic relatedness datasets including SICK and the STS Benchmark. The SICK dataset includes two subtasks, SICK-R and SICK-E. For STS and SICK-R, the goal is to predict relatedness scores between a pair of sentences; SICK-E uses the same sentence pairs as SICK-R but is treated as a three-class classification problem (classes are 'entailment', 'contradiction', and 'neutral').
53 |
54 | - [**Quora Question Pairs**](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a task released by Quora which aims to identify duplicate questions. It consists of over 400,000 pairs of questions from Quora, and each question pair is annotated with a binary value indicating whether the two questions are paraphrases of each other.
55 |
56 | A list of neural matching models for paraphrase identification is given below.
57 |
58 | ## Performance
59 |
60 | ### MRPC
61 |
62 | | Model | Code | accuracy | f1 | Paper|
63 | | :-----:| :----: | :----: |:----: |:----: |
64 | | XLNet-Large (ensemble) (Yang et al., 2019) | [](https://github.com/zihangdai/xlnet/) | 93.0 | 90.7 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
65 | | MT-DNN-ensemble (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn/) |92.7 | 90.3 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) |
66 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | [](https://github.com/HazyResearch/metal) | 91.5 | 88.5 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) |
67 | | GenSen (Subramanian et al., 2018) | [](https://github.com/Maluuba/gensen) | 78.6 | 84.4 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) |
68 | | InferSent (Conneau et al., 2017) |[](https://github.com/facebookresearch/InferSent) | 76.2 | 83.1 | [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) |
69 | | TF-KLD (Ji and Eisenstein, 2013) | — | 80.4 | 85.9 | [Discriminative Improvements to Distributional Sentence Similarity](http://www.aclweb.org/anthology/D/D13/D13-1090.pdf) |
70 | | SpanBert (Joshi et al., 2019) | [](https://github.com/facebookresearch/SpanBERT) | 90.9 | 87.9 | [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529.pdf) |
71 | | MT-DNN (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn) | 91.1 | 88.2 | [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/pdf/1901.11504.pdf) |
72 | | AugDeepParaphrase (Agarwal et al., 2017) | — | 77.7 | 84.5 | [A Deep Network Model for Paraphrase Detection in Short Text Messages](https://arxiv.org/pdf/1712.02820.pdf) |
73 | | ERNIE (Zhang et al. 2019) | []( https://github.com/thunlp/ERNIE) | 88.2 | — | [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/pdf/1905.07129.pdf) |
74 | | This work (Lan and Xu, 2018) | — | 84.0 | — | [Character-based Neural Networks for Sentence Pair Modeling](https://www.aclweb.org/anthology/N18-2025.pdf) |
75 | | ABCNN (Yin et al.2018) | [](https://github.com/yinwenpeng/Answer_Selection) | 78.9 | 84.8 | [ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs](https://arxiv.org/pdf/1512.05193.pdf) |
76 | | Attentive Tree-LSTMs (Zhou et al.2016) | [](https://github.com/yoosan/sentpair) | 75.8 |83.7 | [Modelling Sentence Pairs with Tree-structured Attentive Encoder](https://arxiv.org/pdf/1610.02806.pdf) |
77 | | Bi-CNN-MI (Yin and Schutze, 2015) | — | 78.1 | 84.4 | [Convolutional Neural Network for Paraphrase Identification](https://www.aclweb.org/anthology/N15-1091.pdf) |
78 |
79 |
80 | ### SentEval
81 | The evaluation metric for STS and SICK-R is Pearson correlation.
82 |
83 | The evaluation metric for SICK-E is classification accuracy.
84 |
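Pearson correlation can be computed directly; the snippet below is a minimal pure-Python sketch (function and variable names are illustrative; `scipy.stats.pearsonr` gives the same result in practice):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between predicted and gold relatedness scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# perfectly correlated predictions give r = 1.0
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```
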
85 | | Model | Code | SICK-R | SICK-E | STS | Paper|
86 | | :-----:| :----: | :----: |:----: |:----: |:----: |
87 | | XLNet-Large (ensemble) (Yang et al., 2019) | [](https://github.com/zihangdai/xlnet/) | — | — | 91.6 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
88 | | MT-DNN-ensemble (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn/) | — | — | 91.1 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) |
89 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | [](https://github.com/HazyResearch/metal) | — | — | 90.1 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) |
90 | | GenSen (Subramanian et al., 2018) | [](https://github.com/Maluuba/gensen) | 88.8 | 87.8 | 78.9 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) |
91 | | InferSent (Conneau et al., 2017) |[](https://github.com/facebookresearch/InferSent) | 88.4 | 86.3 | 75.8 | [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) |
92 | | SpanBert (Joshi et al., 2019) | [](https://github.com/facebookresearch/SpanBERT) | — | — | 89.9 | [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529.pdf) |
93 | | MT-DNN (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn) | — | — | 89.5 | [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/pdf/1901.11504.pdf) |
94 | | ERNIE (Zhang et al. 2019) | []( https://github.com/thunlp/ERNIE) | — | — | 83.2 | [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/pdf/1905.07129.pdf) |
95 | | PWIM (He and Lin, 2016) |[](https://github.com/lukecq1231/nli) | — | — | 76.7 | [Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement](https://www.aclweb.org/anthology/N16-1108.pdf)|
96 |
97 |
98 | ### Quora Question Pair
99 |
100 | | Model | Code | F1 | Accuracy | Paper|
101 | | :-----:| :----: | :----: |:----: |:----: |
102 | | XLNet-Large (ensemble) (Yang et al., 2019) | [](https://github.com/zihangdai/xlnet/) | 74.2 | 90.3 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
103 | | MT-DNN-ensemble (Liu et al., 2019) | [](https://github.com/namisan/mt-dnn/) | 73.7 | 89.9 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) |
104 | |Snorkel MeTaL(ensemble) (Ratner et al., 2018) | [](https://github.com/HazyResearch/metal) | 73.1 | 89.9 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) |
105 | |MwAN (Tan et al., 2018) | — | — | 89.12| [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf) |
106 | | DIIN (Gong et al., 2018) | [](https://github.com/YichenGong/Densely-Interactive-Inference-Network) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/diin.py) | — | 89.06 | [Natural Language Inference over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf) |
107 | | pt-DecAtt (Char) (Tomar et al., 2017) | — | — | 88.40 | [Neural Paraphrase Identification of Questions with Noisy Pretraining](https://arxiv.org/abs/1704.04565) |
108 | | BiMPM (Wang et al., 2017) | [](https://github.com/zhiguowang/BiMPM) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/bimpm.py) | — | 88.17 | [Bilateral Multi-Perspective Matching for Natural Language Sentences](https://arxiv.org/abs/1702.03814) |
109 | | GenSen (Subramanian et al., 2018) | [](https://github.com/Maluuba/gensen) | — | 87.01 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) |
110 | | This work (Wang et al.2016) | — | 78.4 | 84.7 | [Sentence Similarity Learning by Lexical Decomposition and Composition](https://www.aclweb.org/anthology/C16-1127/) |
111 | | RE2 (Yang et al., 2019) | [](https://github.com/hitvoice/RE2) [](https://github.com/NTMC-Community/MatchZoo-py/blob/dev/matchzoo/models/re2.py) | — | 89.2 | [Simple and Effective Text Matching with Richer Alignment Features](https://www.aclweb.org/anthology/P19-1465.pdf) |
112 | | MSEM (Wang et al.2016) | — | — | 88.86 | [Multi-task Sentence Encoding Model for Semantic Retrieval in Question Answering Systems](https://arxiv.org/ftp/arxiv/papers/1911/1911.07405.pdf) |
113 | | Bi-CAS-LSTM (Choi et al.2019) | — | — | 88.6 | [Cell-aware Stacked LSTMs for Modeling Sentences](https://arxiv.org/pdf/1809.02279.pdf)|
114 | |DecAtt (Parikh et al., 2016)| — | — | 86.5 | [A Decomposable Attention Model for Natural Language Inference](https://arxiv.org/pdf/1606.01933.pdf)|
115 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Awesome Neural Models for Semantic Match
3 |
4 | A collection of papers maintained by the MatchZoo team.
5 |
6 | Check out our open-source toolkit MatchZoo for more information!
7 |
14 | Text matching is a core component in many natural language processing tasks, where many tasks can be viewed as matching between two input texts.
15 |
16 |
17 |
18 | match(s, t) = g(f(psi(s), phi(t)))
19 |
20 | Where **s** and **t** are the source and target text inputs, respectively; **psi** and **phi** are the representation functions for **s** and **t**, respectively; **f** is the interaction function; and **g** is the aggregation function. A more detailed explanation of this formula can be found in [A Deep Look into Neural Ranking Models for Information Retrieval](https://arxiv.org/abs/1903.06902). The representative matching tasks are as follows:
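
The formula above can be sketched in code; this is a toy illustration where the representation, interaction, and aggregation functions are hypothetical bag-of-words stand-ins for what real neural models learn:

```python
from collections import Counter

def psi(s):
    """Representation function for the source text (toy bag-of-words)."""
    return Counter(s.lower().split())

phi = psi  # the target text shares the same toy representation here

def f(rep_s, rep_t):
    """Interaction function: per-term overlap between the two representations."""
    return [min(rep_s[w], rep_t[w]) for w in rep_s.keys() | rep_t.keys()]

def g(interactions):
    """Aggregation function: collapse the interactions into one match score."""
    return sum(interactions)

def match(s, t):
    return g(f(psi(s), phi(t)))

score = match("how to install ubuntu", "steps to install ubuntu on a laptop")
```

Neural models replace each of these pieces with learned components (e.g., encoders for psi/phi, attention for f, pooling or an MLP for g).
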
21 |
22 | | **Tasks** | **Source Text** | **Target Text** |
23 | | :------------------------------------------------------------------------------------------: | :-------------: | :----------------------: |
24 | | [Ad-hoc Information Retrieval](Ad-hoc-Information-Retrieval/Ad-hoc-Information-Retrieval.md) | query | document (title/content) |
25 | | [Community Question Answering](Community-Question-Answering/Community-Question-Answering.md) | question | question/answer |
26 | | [Paraphrase Identification](Paraphrase-Identification/Paraphrase-Identification.md) | string1 | string2 |
27 | | [Natural Language Inference](Natural-Language-Inference/Natural-Language-Inference.md) | premise | hypothesis |
28 | | [Response Retrieval](Response-Retrieval/Response-Retrieval.md) | context/utterances | response |
29 | | [Long Form Question Answering](LFQA/LFQA.md) | question+document | answer |
30 |
31 | ### Healthcheck
32 |
33 | ```shell
34 | pip3 install -r requirements.txt
35 | python3 healthcheck.py
36 | ```
37 |
--------------------------------------------------------------------------------
/Response-Retrieval/Response-Retrieval.md:
--------------------------------------------------------------------------------
1 | ## Response retrieval
2 |
3 | **Response retrieval**/selection aims to rank/select a proper response from a dialog repository.
4 | Automatic conversation (AC) aims to create an automatic human-computer dialog process for the purposes of question answering, task completion, and social chat (i.e., chit-chat). In general, AC can be formulated either as an IR problem that ranks/selects a proper response from a dialog repository, or as a generation problem that generates an appropriate response with respect to the input utterance. Here, we refer to response retrieval as the IR-based way of doing AC.
5 | Example:
6 | 
7 |
8 | ## Classic Datasets
9 |
10 | |Dataset |Partition | #Context Response pair | #Candidate per Context | Positive:Negative |Avg #turns per context|
11 | |---- | ---- | ---- |---- | ---- | ---- |
12 | |UDC| train/validation/test| 1M/500k/500k| 2/10/10| 1:1/1:9/1:9| 10.13/10.11/10.11|
13 | |[**Douban**](https://github.com/MarkWuNLP/MultiTurnResponseSelection)| train/validation/test| 1M/50k/10k |2/2/10|1:1/1:1/1.18:8.82|6.69/6.75/6.45|
14 | |[**MSDialog**](https://ciir.cs.umass.edu/downloads/msdialog/)| train/validation/test| 173k/37k/35k | 10/10/10|1:9/1:9/1:9|5.0/4.9/4.4|
15 | |[**EDC**](https://github.com/cooelf/DeepUtteranceAggregation)| train/validation/test| 1M/10k/10k| 2/2/10| 1:1/1:1/1:9| 5.51/5.48/5.64|
16 | |Persona-Chat| train/validation/test| 8939/1000/968 | 20/20/20 | 1:19/1:19/1:19 | 7.35/7.80/7.76 |
17 | |CMUDoG| train/validation/test| 2881/196/537 | 20/20/20 | 1:19/1:19/1:19 | 12.55/12.37/12.36 |
18 |
19 | - **Ubuntu Dialog Corpus (UDC)** contains multi-turn dialogues collected from chat logs of the Ubuntu Forum. The dataset consists of 1 million context-response pairs for training, 0.5 million for validation, and 0.5 million for testing. Positive responses are true responses from humans, and negative ones are randomly sampled. The ratio of positive to negative examples is 1:1 in training and 1:9 in validation and testing.
20 | - [**Douban Conversation Corpus**](https://github.com/MarkWuNLP/MultiTurnResponseSelection) is an open domain dataset constructed from Douban group (a popular social networking service in China). The data set consists of 1 million context-response pairs for training, 50k pairs for validation, and 10k pairs for testing, corresponding to 2, 2, and 10 response candidates per context respectively. Response candidates on the test set, retrieved from Sina Weibo (the largest microblogging service in China), are labeled by human judges.
21 | - [**MSDialog**](https://ciir.cs.umass.edu/downloads/msdialog/) is a labeled dialog dataset of question answering (QA) interactions between information seekers and answer providers from an online forum on Microsoft products (Microsoft Community). The dataset contains more than 2,000 multi-turn information-seeking conversations with 10,000 utterances that are annotated with user intent on the utterance level.
22 | - [**E-commerce Dialogue Corpus**](https://github.com/cooelf/DeepUtteranceAggregation) contains over 5 types of conversations (e.g., commodity consultation, logistics express, recommendation, negotiation, and chitchat) based on over 20 commodities. The ratio of positive to negative examples is 1:1 in training and validation, and 1:9 in testing.
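
The candidate sets in these datasets can be sketched as pairing each context's true response with randomly sampled negatives (a hypothetical helper for illustration; the 1:9 ratio matches the UDC/EDC test splits described above):

```python
import random

def build_candidates(true_response, response_pool, n_neg=9, seed=0):
    """Pair one true response with n_neg randomly sampled negative responses."""
    rng = random.Random(seed)
    negatives = rng.sample([r for r in response_pool if r != true_response], n_neg)
    candidates = [(true_response, 1)] + [(r, 0) for r in negatives]
    rng.shuffle(candidates)  # labels: 1 = positive, 0 = negative
    return candidates

pool = [f"response {i}" for i in range(50)]
candidates = build_candidates("response 0", pool, n_neg=9)
```
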
24 |
25 | $R_n@k$: recall at position $k$ in $n$ candidates.
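
Given a model's scores for one context's $n$ candidates, $R_n@k$ can be computed as follows (a minimal sketch; names are illustrative):

```python
def recall_at_k(scored, k):
    """scored: list of (model_score, is_positive) for one context's candidates."""
    ranked = sorted(scored, key=lambda x: x[0], reverse=True)
    positives = sum(1 for _, pos in scored if pos)
    hits = sum(1 for _, pos in ranked[:k] if pos)
    return hits / positives

# one positive among 10 candidates, ranked second by the model:
# R_10@1 = 0.0, R_10@2 = 1.0
scored = [(0.9, False), (0.8, True)] + [(0.05 * i, False) for i in range(8)]
```

Reported $R_n@k$ values average this quantity over all contexts in the test set.
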
26 |
27 | ## Performance
28 |
29 | ### Ubuntu Corpus
30 |
31 | | Model | Code|MAP|$R_2@1$|$R_{10}@1$|$R_{10}@2$|$R_{10}@5$|Paper| type |
32 | | ---- | ---- | ---- | --- | --- | --- | --- | ---- | ---- |
33 | | Multi-View (Zhou et al. 2016)| N/A | — | 0.908 | 0.662 | 0.801 | 0.951 | [Multi-view Response Selection for Human-Computer Conversation, ACL 2016](https://www.aclweb.org/anthology/D16-1036.pdf) | multi-turn |
34 | | DL2R (Yan, Song and Wu 2016)| N/A | — | 0.899 | 0.626 | 0.783 | 0.944|[Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System, SIGIR 2016](http://www.ruiyan.me/pubs/SIGIR2016.pdf) | multi-turn|
35 | | SMN (Wu et al. 2017) | [](https://github.com/MarkWuNLP/MultiTurnResponseSelection) | 0.7327 | 0.927 | 0.726 |0.847| 0.962 |[Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots, ACL 2017](https://www.aclweb.org/anthology/P17-1046.pdf)| multi-turn|
36 | |DAM(Zhou et al. 2018) | [](https://github.com/baidu/Dialogue/tree/master/DAM) | — | 0.938 | 0.767 | 0.874 | 0.969 |[Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network, ACL 2018](https://www.aclweb.org/anthology/P18-1103.pdf) | multi-turn|
37 | |DUA (Zhang et al. 2018)|[](https://github.com/cooelf/DeepUtteranceAggregation)| — | — | 0.752 | 0.868 | 0.962 |[Modeling Multi-turn Conversation with Deep Utterance Aggregation, arXiv 2018](https://arxiv.org/pdf/1806.09102.pdf)|multi-turn|
38 | | DMN (Yang et al. 2018)| [](https://github.com/yangliuy/NeuralResponseRanking) | 0.7719 | — | — | — | — |[Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems, arXiv 2018](https://arxiv.org/pdf/1805.00188.pdf) |multi-turn |
39 | |U2U-IMN(Gu et al. 2019 a)|[](https://github.com/JasonForJoy/U2U-IMN) | **0.866** | 0.945 | 0.790 | 0.886 | 0.973 |[Utterance-to-Utterance Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1911.06940v1.pdf)|multi-turn|
40 | |TripleNet(Ma et al. 2019)|[](https://github.com/wtma/TripleNet)|— | 0.943 | 0.79 | 0.885 | 0.97 |[TripleNet: Triple Attention Network for Multi-Turn Response Selection in Retrieval-based Chatbots, arXiv 2019](https://arxiv.org/pdf/1909.10666v2.pdf)|multi-turn|
41 | |IMN(Gu et al. 2019 b)|[](https://github.com/JasonForJoy/IMN)| — | 0.946 | 0.794 | 0.889 | 0.974|[Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1901.01824v2.pdf)|multi-turn|
42 | |IOI-local(Tao et al. 2019)|[](https://github.com/chongyangtao/IOI)| — | 0.947 | 0.796 | 0.894 | 0.974 | [One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues, ACL 2019](https://www.aclweb.org/anthology/P19-1001.pdf) |multi-turn|
43 | |MSN(Yuan et al. 2019)|[](https://github.com/chunyuanY/Dialogue)|— |— | 0.8 | 0.899 | 0.978 | [Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots, ACL 2019](https://www.aclweb.org/anthology/D19-1011.pdf) |multi-turn|
44 | |SA-BERT (Gu et al. 2020)|[](https://github.com/JasonForJoy/SA-BERT)| — | **0.965** | **0.855** | **0.928** | **0.983** |[Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2020](https://arxiv.org/pdf/2004.03588v1.pdf)|multi-turn|
45 | |RoBERTaBASE-SS-DA (Lu et al. 2020)|[](https://github.com/CSLujunyu/Improving-Contextual-Language-Modelsfor-Response-Retrieval-in-Multi-Turn-Conversation) | - |0.955 |0.826 |0.909 |0.978 | [Improving Contextual Language Models for Response Retrieval in Multi-Turn Conversation, SIGIR 2020](https://dl.acm.org/doi/pdf/10.1145/3397271.3401255) | multi-turn|
46 | |SMN + ECMo (Tao et al. 2020)| N/A | - |0.934 |0.756 |0.867 |0.966 |[Improving Matching Models with Hierarchical Contextualized Representations for Multi-turn Response Selection, SIGIR 2020](https://dl.acm.org/doi/pdf/10.1145/3397271.3401290) | multi-turn|
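The $R_n@k$ columns in these tables all share one reading: the model scores $n$ candidate responses for a context, and the metric counts a hit when the ground-truth response lands in the top $k$. A minimal sketch of that computation (the function name and the toy scores are illustrative, not taken from any of the papers above):

```python
def recall_at_k(scores, label_index, k):
    """R_n@k for one test instance: 1.0 if the ground-truth candidate
    (at `label_index`) is among the k highest-scoring of the n candidates."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if label_index in ranked[:k] else 0.0

# n = 10 candidates, ground truth at index 0 with the highest score,
# so it counts toward R_10@1 (and hence R_10@2 and R_10@5 as well).
scores = [0.91, 0.10, 0.35, 0.20, 0.05, 0.44, 0.12, 0.08, 0.30, 0.27]
print(recall_at_k(scores, 0, 1))  # 1.0
```

The reported figure is the mean of this 0/1 outcome over all test instances.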
### Douban Conversation Corpus

| Model | Code | MAP | MRR | P@1 | $R_{10}@1$ | $R_{10}@2$ | $R_{10}@5$ | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Multi-View (Zhou et al. 2016) | N/A | 0.505 | 0.543 | 0.342 | 0.202 | 0.350 | 0.729 | [Multi-view Response Selection for Human-Computer Conversation, EMNLP 2016](https://www.aclweb.org/anthology/D16-1036.pdf) | multi-turn |
| DL2R (Yan, Song and Wu 2016) | N/A | 0.488 | 0.527 | 0.330 | 0.193 | 0.342 | 0.705 | [Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System, SIGIR 2016](http://www.ruiyan.me/pubs/SIGIR2016.pdf) | multi-turn |
| SMN (Wu et al. 2017) | [](https://github.com/MarkWuNLP/MultiTurnResponseSelection) | 0.529 | 0.572 | 0.397 | 0.236 | 0.396 | 0.734 | [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots, ACL 2017](https://www.aclweb.org/anthology/P17-1046.pdf) | multi-turn |
| DAM (Zhou et al. 2018) | [](https://github.com/baidu/Dialogue/tree/master/DAM) | 0.550 | 0.601 | 0.427 | 0.254 | 0.410 | 0.757 | [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network, ACL 2018](https://www.aclweb.org/anthology/P18-1103.pdf) | multi-turn |
| DUA (Zhang et al. 2018) | [](https://github.com/cooelf/DeepUtteranceAggregation) | 0.551 | 0.599 | 0.421 | 0.243 | 0.421 | 0.780 | [Modeling Multi-turn Conversation with Deep Utterance Aggregation, arXiv 2018](https://arxiv.org/pdf/1806.09102.pdf) | multi-turn |
| U2U-IMN (Gu et al. 2019a) | [](https://github.com/JasonForJoy/U2U-IMN) | 0.564 | 0.611 | 0.429 | 0.259 | 0.430 | 0.791 | [Utterance-to-Utterance Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1911.06940v1.pdf) | multi-turn |
| TripleNet (Ma et al. 2019) | [](https://github.com/wtma/TripleNet) | 0.564 | 0.618 | 0.447 | 0.268 | 0.426 | 0.778 | [TripleNet: Triple Attention Network for Multi-Turn Response Selection in Retrieval-based Chatbots, arXiv 2019](https://arxiv.org/pdf/1909.10666v2.pdf) | multi-turn |
| IMN (Gu et al. 2019b) | [](https://github.com/JasonForJoy/IMN) | 0.570 | 0.615 | 0.433 | 0.262 | 0.452 | 0.789 | [Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1901.01824v2.pdf) | multi-turn |
| IOI-local (Tao et al. 2019) | [](https://github.com/chongyangtao/IOI) | 0.573 | 0.621 | 0.444 | 0.269 | 0.451 | 0.786 | [One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues, ACL 2019](https://www.aclweb.org/anthology/P19-1001.pdf) | multi-turn |
| MSN (Yuan et al. 2019) | [](https://github.com/chunyuanY/Dialogue) | 0.587 | 0.632 | 0.470 | 0.295 | 0.452 | 0.788 | [Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots, EMNLP-IJCNLP 2019](https://www.aclweb.org/anthology/D19-1011.pdf) | multi-turn |
| SA-BERT (Gu et al. 2020) | [](https://github.com/JasonForJoy/SA-BERT) | **0.619** | **0.659** | **0.496** | **0.313** | **0.481** | **0.847** | [Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2020](https://arxiv.org/pdf/2004.03588v1.pdf) | multi-turn |
| RoBERTaBASE-SS-DA (Lu et al. 2020) | [](https://github.com/CSLujunyu/Improving-Contextual-Language-Modelsfor-Response-Retrieval-in-Multi-Turn-Conversation) | 0.602 | 0.646 | 0.460 | 0.280 | 0.495 | 0.847 | [Improving Contextual Language Models for Response Retrieval in Multi-Turn Conversation, SIGIR 2020](https://dl.acm.org/doi/pdf/10.1145/3397271.3401255) | multi-turn |
| SMN + ECMo (Tao et al. 2020) | N/A | 0.549 | 0.593 | 0.409 | 0.247 | 0.416 | 0.774 | [Improving Matching Models with Hierarchical Contextualized Representations for Multi-turn Response Selection, SIGIR 2020](https://dl.acm.org/doi/pdf/10.1145/3397271.3401290) | multi-turn |
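Unlike Ubuntu, a Douban context can have several correct responses, which is why MAP and MRR are reported alongside $R_{10}@k$. A small sketch of both measures, assuming candidates have already been sorted by model score (the toy relevance lists are illustrative, not data from any paper above):

```python
def average_precision(relevance):
    """AP for one query: `relevance` holds 0/1 labels of the candidates
    in ranked order (highest-scoring candidate first)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(relevance):
    """RR for one query: 1 / rank of the first relevant candidate."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Two toy queries; the reported MAP and MRR are means over all queries.
ranked = [[1, 0, 1, 0], [0, 1, 0, 0]]
MAP = sum(average_precision(r) for r in ranked) / len(ranked)
MRR = sum(reciprocal_rank(r) for r in ranked) / len(ranked)
# For these two rankings: MAP = 2/3, MRR = 0.75.
```

P@1 is simply the first element of each relevance list averaged over queries, which is why it tracks $R_{10}@1$ so closely on Douban.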
### MSDialog

| Model | Code | MAP | Recall@1 | Recall@2 | Recall@5 | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| DMN (Yang et al. 2018) | [](https://github.com/yangliuy/NeuralResponseRanking) | 0.6792 | 0.5021 | 0.7122 | 0.9356 | [Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems, arXiv 2018](https://arxiv.org/pdf/1805.00188.pdf) | multi-turn |
### E-commerce Corpus

| Model | Code | MAP | $R_{10}@1$ | $R_{10}@2$ | $R_{10}@5$ | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Multi-View (Zhou et al. 2016) | N/A | — | 0.421 | 0.601 | 0.861 | [Multi-view Response Selection for Human-Computer Conversation, EMNLP 2016](https://www.aclweb.org/anthology/D16-1036.pdf) | multi-turn |
| DL2R (Yan, Song and Wu 2016) | N/A | — | 0.399 | 0.571 | 0.842 | [Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System, SIGIR 2016](http://www.ruiyan.me/pubs/SIGIR2016.pdf) | multi-turn |
| SMN (Wu et al. 2017) | [](https://github.com/MarkWuNLP/MultiTurnResponseSelection) | — | 0.453 | 0.654 | 0.886 | [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots, ACL 2017](https://www.aclweb.org/anthology/P17-1046.pdf) | multi-turn |
| DAM (Zhou et al. 2018) | [](https://github.com/baidu/Dialogue/tree/master/DAM) | — | 0.526 | 0.727 | 0.933 | [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network, ACL 2018](https://www.aclweb.org/anthology/P18-1103.pdf) | multi-turn |
| DUA (Zhang et al. 2018) | [](https://github.com/cooelf/DeepUtteranceAggregation) | — | 0.501 | 0.700 | 0.921 | [Modeling Multi-turn Conversation with Deep Utterance Aggregation, arXiv 2018](https://arxiv.org/pdf/1806.09102.pdf) | multi-turn |
| U2U-IMN (Gu et al. 2019a) | [](https://github.com/JasonForJoy/U2U-IMN) | **0.759** | 0.616 | 0.806 | 0.966 | [Utterance-to-Utterance Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1911.06940v1.pdf) | multi-turn |
| IMN (Gu et al. 2019b) | [](https://github.com/JasonForJoy/IMN) | — | 0.621 | 0.797 | 0.964 | [Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2019](https://arxiv.org/pdf/1901.01824v2.pdf) | multi-turn |
| IOI-local (Tao et al. 2019) | [](https://github.com/chongyangtao/IOI) | — | 0.563 | 0.768 | 0.950 | [One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues, ACL 2019](https://www.aclweb.org/anthology/P19-1001.pdf) | multi-turn |
| MSN (Yuan et al. 2019) | [](https://github.com/chunyuanY/Dialogue) | — | 0.606 | 0.770 | 0.937 | [Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots, EMNLP-IJCNLP 2019](https://www.aclweb.org/anthology/D19-1011.pdf) | multi-turn |
| SA-BERT (Gu et al. 2020) | [](https://github.com/JasonForJoy/SA-BERT) | — | 0.704 | 0.879 | **0.985** | [Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots, arXiv 2020](https://arxiv.org/pdf/2004.03588v1.pdf) | multi-turn |
| RoBERTaBASE-SS-DA (Lu et al. 2020) | [](https://github.com/CSLujunyu/Improving-Contextual-Language-Modelsfor-Response-Retrieval-in-Multi-Turn-Conversation) | — | **0.800** | **0.910** | 0.972 | [Improving Contextual Language Models for Response Retrieval in Multi-Turn Conversation, SIGIR 2020](https://dl.acm.org/doi/pdf/10.1145/3397271.3401255) | multi-turn |
### Persona-Chat dataset

Original Persona

| Model | Code | $R_{20}@1$ | $R_{20}@2$ | $R_{20}@5$ | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| RSM-DCK (Hua et al. 2020) | N/A | 0.7965 | 0.9021 | 0.9747 | [Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems, CIKM 2020](https://dl.acm.org/doi/pdf/10.1145/3340531.3411967) | multi-turn |

Revised Persona

| Model | Code | $R_{20}@1$ | $R_{20}@2$ | $R_{20}@5$ | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| RSM-DCK (Hua et al. 2020) | N/A | 0.7185 | 0.8494 | 0.9550 | [Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems, CIKM 2020](https://dl.acm.org/doi/pdf/10.1145/3340531.3411967) | multi-turn |

### CMUDoG dataset

| Model | Code | $R_{20}@1$ | $R_{20}@2$ | $R_{20}@5$ | Paper | type |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| RSM-DCK (Hua et al. 2020) | N/A | 0.7925 | 0.8884 | 0.9666 | [Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems, CIKM 2020](https://dl.acm.org/doi/pdf/10.1145/3340531.3411967) | multi-turn |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate
--------------------------------------------------------------------------------
/_includes/chart.html:
--------------------------------------------------------------------------------
1 |
20 |
21 |
22 | {% for result in include.results %}
23 | {% assign score = result[include.score] %}
24 |