├── README.md
├── datasets
│   ├── 0.template.md
│   ├── Chinese_Dialogue_Dataset_with_Sentence_Function.md
│   ├── Cornell_Movie_Dialogs_Corpus.md
│   ├── Douban_Conversation_Corpus.md
│   ├── E-commerce_Dialogue_Corpus.md
│   ├── JD_Customer_Service_Corpus.md
│   ├── Leiden_Weibo_Corpus.md
│   ├── NTCIR14-STC3-CECG.md
│   ├── Noah_NRM_Data.md
│   ├── OpenSubtitles.md
│   ├── Personality_Assignment_Dataset.md
│   ├── STC_Data.md
│   ├── Twitter.md
│   └── Ubuntu_Dialogue_Corpus_v2.md
└── models
    └── models_list.md

/README.md:
--------------------------------------------------------------------------------
# Chinese and English Corpora for Dialogue Systems

This project collects the publicly released corpora used in current papers to train Chinese (and English) dialogue systems, together with open-source dialogue models.

[Some open-source models](models/models_list.md) (to be organized…)

## Commonly Used

### Chinese

[Douban Conversation Corpus](datasets/Douban_Conversation_Corpus.md)

[Noah NRM Data](datasets/Noah_NRM_Data.md)

[STC Data](datasets/STC_Data.md)

### English

[Ubuntu Dialogue Corpus v2](datasets/Ubuntu_Dialogue_Corpus_v2.md)

[OpenSubtitles](datasets/OpenSubtitles.md)

[Cornell Movie Dialogs Corpus](datasets/Cornell_Movie_Dialogs_Corpus.md)

[Twitter](datasets/Twitter.md)

## Weibo

[Noah NRM Data](datasets/Noah_NRM_Data.md)

[STC Data](datasets/STC_Data.md)

[NTCIR14 STC3 CECG](datasets/NTCIR14-STC3-CECG.md)

[Personality Assignment Dataset](datasets/Personality_Assignment_Dataset.md)

[Chinese Dialogue Dataset with Sentence Function](datasets/Chinese_Dialogue_Dataset_with_Sentence_Function.md)

## Twitter

[Twitter](datasets/Twitter.md)

## Douban

[Douban Conversation Corpus](datasets/Douban_Conversation_Corpus.md)

## E-commerce

[JD Customer Service Corpus](datasets/JD_Customer_Service_Corpus.md)

[E-commerce Dialogue Corpus](datasets/E-commerce_Dialogue_Corpus.md)

--------------------------------------------------------------------------------
/datasets/0.template.md:
--------------------------------------------------------------------------------
# Dataset Name

**Type:**

**Source:**

**Size:**

**Download:**

**Construction:**

**Paper releasing the data:**

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Chinese_Dialogue_Dataset_with_Sentence_Function.md:
--------------------------------------------------------------------------------
# Chinese Dialogue Dataset with Sentence Function

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 1,963,382 pairs

**Download:**

http://coai.cs.tsinghua.edu.cn/file/DialogwithSenFun.tar.gz

**Construction:**

1. Crawl 10 million posts from Weibo.
2. Sample 2,000 of them for manual annotation with sentence-function labels (interrogative, declarative, imperative, and exclamatory); split them train:valid:test = 6:1:1 and train a sentence-function classifier.
3. Label the Weibo text with that classifier, sampling about 6,000,000 items per category; a labeling sketch follows this list.
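The sketch below only illustrates the shape of this pipeline (train a classifier on a small annotated sample, then label and bucket a large unlabeled pool). It uses scikit-learn with toy data standing in for the 2,000 annotated sentences; it is not the authors' code, and the model choice is an assumption.

```python
# Illustrative sketch of the labeling pipeline (not the authors' code).
from collections import defaultdict
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 2,000 manually annotated sentences.
annotated = [
    ("你吃饭了吗?", "interrogative"),
    ("今天天气很好。", "declarative"),
    ("快把门关上!", "imperative"),
    ("这也太好看了吧!", "exclamatory"),
] * 50  # repeated so the toy classifier has enough rows to fit

texts, labels = zip(*annotated)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # char n-grams suit Chinese
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Label the (much larger) unlabeled pool and bucket posts by predicted class.
unlabeled = ["明天会下雨吗?", "我在看电影。", "别说话!"]
buckets = defaultdict(list)
for post, label in zip(unlabeled, clf.predict(unlabeled)):
    buckets[label].append(post)

# Sample up to N posts per sentence-function category.
N = 2
sampled = {label: random.sample(posts, min(N, len(posts)))
           for label, posts in buckets.items()}
print(sampled)
```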
"Generating informative responses with controlled sentence function."[Link](http://coai.cs.tsinghua.edu.cn/hml/media/files/acl_senfun.pdf) 32 | 33 | 34 | 35 | **使用数据的论文:** 36 | 37 | -------------------------------------------------------------------------------- /datasets/Cornell_Movie_Dialogs_Corpus.md: -------------------------------------------------------------------------------- 1 | # Cornell Movie Dialogs Corpus 2 | 3 | **类型:** 英文、多轮、电影字幕 4 | 5 | **来源:** Internet Movie Database 6 | 7 | **规模:** 304,713 utterances 8 | 9 | **下载链接:** 10 | 11 | [http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 18 | 19 | **公开数据的论文:** 20 | 21 | Danescu-Niculescu-Mizil, Cristian, and Lillian Lee. "Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs." [Link](https://arxiv.org/pdf/1106.3077.pdf) 22 | 23 | 24 | 25 | **使用数据的论文:** 26 | 27 | -------------------------------------------------------------------------------- /datasets/Douban_Conversation_Corpus.md: -------------------------------------------------------------------------------- 1 | # Douban Conversation Corpus 2 | 3 | **类型:** 多轮、中文、开放领域 4 | 5 | **来源:** 豆瓣 6 | 7 | **规模:** 8 | 9 | session-response pairs 1,000,000 pairs 10 | 11 | 平均6.69轮每session 12 | 13 | 14 | 15 | **下载链接:** 16 | 17 | https://github.com/MarkWuNLP/MultiTurnResponseSelection 18 | 19 | 20 | 21 | **构造方法:** 22 | 23 | 1. 抓取原始数据,豆瓣小组中轮数大于2轮的110万。 24 | 25 | 2. sample50万做训练集,25万做验证集,从110万中其他句子中sample负例。 26 | 27 | 3. 最终100万训练集,50万验证集。 28 | 29 | 4. 检索模型的数据由抓取的1500万微博对话构成。 30 | 31 | 32 | 33 | **公开数据的论文:** 34 | 35 | Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots." [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf) 36 | 37 | 38 | 39 | **使用数据的论文:** 40 | 41 | 1. Zhou, Xiangyang, et al. "Multi-turn response selection for chatbots with deep attention matching network." 42 | 2. Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." 43 | 3. Wu, Yu, et al. "Learning matching models with weak supervision for response selection in retrieval-based chatbots." -------------------------------------------------------------------------------- /datasets/E-commerce_Dialogue_Corpus.md: -------------------------------------------------------------------------------- 1 | # E-commerce Dialogue Corpus 2 | 3 | **类型:** 多轮、中文、电商领域 4 | 5 | **来源:** 淘宝 6 | 7 | **规模:** 1,000,000 Utterance 8 | 9 | **下载链接:** 10 | 11 | https://github.com/cooelf/DeepUtteranceAggregation 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | conversations between customers and customer service staff from our E-commerce partners in Taobao 18 | 19 | 20 | 21 | **公开数据的论文:** 22 | 23 | Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." 
**Paper releasing the data:**

Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots." [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf)

**Papers using the data:**

1. Zhou, Xiangyang, et al. "Multi-turn response selection for chatbots with deep attention matching network."
2. Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation."
3. Wu, Yu, et al. "Learning matching models with weak supervision for response selection in retrieval-based chatbots."

--------------------------------------------------------------------------------
/datasets/E-commerce_Dialogue_Corpus.md:
--------------------------------------------------------------------------------
# E-commerce Dialogue Corpus

**Type:** multi-turn, Chinese, e-commerce domain

**Source:** Taobao

**Size:** 1,000,000 utterances

**Download:**

https://github.com/cooelf/DeepUtteranceAggregation

**Construction:**

Conversations between customers and customer-service staff, collected from e-commerce partners on Taobao.

**Paper releasing the data:**

Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." [Link](https://arxiv.org/pdf/1806.09102.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/JD_Customer_Service_Corpus.md:
--------------------------------------------------------------------------------
# JD Customer Service Corpus

**Type:** multi-turn, Chinese, e-commerce domain

**Source:** JD.com

**Size:** 420,000 utterances

**Download:**

https://github.com/chenhongshen/HVMN

**Construction:**

Each conversation is between a customer and a customer-service staff member.

**Paper releasing the data:**

Chen, Hongshen, et al. "Hierarchical variational memory network for dialogue generation." [Link](http://delivery.acm.org/10.1145/3190000/3186077/p1653-chen.pdf?ip=106.120.213.65&id=3186077&acc=OPEN&key=BF85BBA5741FDC6E.68C876273B0CA8EC.4D4702B0C3E38B35.6D218144511F3437&__acm__=1567786076_1b2e1077297f250b888837a9566218dd)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Leiden_Weibo_Corpus.md:
--------------------------------------------------------------------------------
# Leiden Weibo Corpus

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 5,103,566 posts

**Download:**

http://lwc.daanvanesch.nl/openaccess.php

**Construction:**

5,103,566 posts crawled via the Weibo API from 2012-01-08 to 2012-01-30. The corpus appears to contain no responses, so it may not be usable for dialogue modeling.

http://lwc.daanvanesch.nl/help.php#methodology

**Paper releasing the data:**

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/NTCIR14-STC3-CECG.md:
--------------------------------------------------------------------------------
# NTCIR14-STC3-CECG

**Type:** single-turn, Chinese

**Source:** Weibo

**Size:** 1,710,000 pairs

**Download:**

http://coai.cs.tsinghua.edu.cn/hml/challenge/dataset_description/

**Construction:**

Built on the Noah NRM Data: a Bi-LSTM emotion classifier labels each pair with an emotion category, and the categories are balanced.

**Additional labels:**

Emotion category of the response to be generated, assigned by the Bi-LSTM classifier.

**Paper releasing the data:**

Zhou, Hao, et al. "Emotional chatting machine: Emotional conversation generation with internal and external memory." [Link](http://coai.cs.tsinghua.edu.cn/hml/media/files/aaai2018-ecm.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Noah_NRM_Data.md:
--------------------------------------------------------------------------------
# Noah NRM Data

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:**

4,435,959 pairs

219,905 posts / 4,308,211 responses / about 20 responses per post on average

**Download:**

http://61.93.89.94/Noah_NRM_Data/ (dead link)

http://www.noahlab.com.hk/topics/ShortTextConversation (dead link)

**Construction:**

1. Remove pairs whose post has fewer than 10 Chinese characters or whose response has fewer than 5 (see the filtering sketch below).
2. Keep only the first 30 responses that are relevant to the post's topic.
3. Filter out advertisements and meaningless replies such as "哈哈".
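A minimal sketch of the length filter in step 1, assuming that characters in the CJK Unified Ideographs range are what counts as "Chinese characters"; the helper names are illustrative (the STC Data below applies the same rule).

```python
# Illustrative length filter for (post, response) pairs (not the original scripts):
# drop a pair if the post has fewer than 10 Chinese characters or the response
# has fewer than 5.
def count_chinese_chars(text):
    """Count CJK unified ideographs, ignoring punctuation, digits, and Latin."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")

def keep_pair(post, response, min_post=10, min_response=5):
    return (count_chinese_chars(post) >= min_post
            and count_chinese_chars(response) >= min_response)

pairs = [
    ("今天天气真不错我们出去走走吧", "好主意,去哪里?"),  # kept
    ("哈哈", "笑什么"),                                  # dropped: post too short
]
print([(p, r) for p, r in pairs if keep_pair(p, r)])
```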
"Incorporating copying mechanism in sequence-to-sequence learning." 42 | 2. Zhou, Hao, et al. "Emotional chatting machine: Emotional conversation generation with internal and external memory." -------------------------------------------------------------------------------- /datasets/OpenSubtitles.md: -------------------------------------------------------------------------------- 1 | # OpenSubtitles 2 | 3 | **类型:** 多轮、多语言、电影字幕 4 | 5 | **来源:** Internet Movie Database 6 | 7 | **规模:** 8 | 9 | 2.6 billion sentences (17.2 billion tokens) distributed over 60 languages 10 | 11 | **下载链接:** 12 | 13 | http://opus.nlpl.eu/OpenSubtitles.php 14 | 15 | 16 | 17 | **构造方法:** 18 | 19 | 20 | 21 | **公开数据的论文:** 22 | 23 | 1. Lison, Pierre, and Jörg Tiedemann. "Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles." [Link](https://pdfs.semanticscholar.org/10df/593bdf6d0e4a5f0daa8c224a8bdddf9e3167.pdf) 24 | 2. Lison, Pierre, Jörg Tiedemann, and Milen Kouylekov. "Opensubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora." [Link](http://www.lrec-conf.org/proceedings/lrec2018/pdf/294.pdf) 25 | 26 | **使用数据的论文:** 27 | 28 | 1. Li, Jiwei, et al. "A diversity-promoting objective function for neural conversation models." 29 | 2. Li, Jiwei, et al. "Adversarial learning for neural dialogue generation." -------------------------------------------------------------------------------- /datasets/Personality_Assignment_Dataset.md: -------------------------------------------------------------------------------- 1 | # Personality Assignment Dataset 2 | 3 | **类型:** 单轮、中文、开放领域 4 | 5 | **来源:** 微博 6 | 7 | **规模: ** 9, 697, 651 pairs 8 | 9 | **下载链接:** 10 | 11 | [http://coai.cs.tsinghua.edu.cn/file/ijcai_data.zip](http://coai.cs.tsinghua.edu.cn/file/ijcai_data.zip) 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 1. Weibo Dataset(WD) - 微博抓取数据,9, 697, 651 pairs 18 | 19 | 2. Profile Binary Subset(PB) - 微博抓取数据中,通过表达式匹配profile的相关句子{姓名,性别,年龄,城市,体重,星座},76, 930 pairs,13个人标注为某个设定下的正负例 20 | 21 | 3. Profile Related Subset - PB当中跟profile相关的正例,42, 193 pairs 22 | 23 | 4. Manual Dataset - 人工写的与profile相关的正负例句子 24 | 25 | 26 | 27 | **公开数据的论文:** 28 | 29 | Qian, Qiao, et al. "Assigning personality/identity to a chatting machine for coherent conversation generation." [Link](https://arxiv.org/pdf/1706.02861.pdf) 30 | 31 | **使用数据的论文:** 32 | 33 | -------------------------------------------------------------------------------- /datasets/STC_Data.md: -------------------------------------------------------------------------------- 1 | # STC Data 2 | 3 | **类型:** 单轮、中文、开放领域 4 | 5 | **来源:** 微博 6 | 7 | **规模:** posts 38,016 / responses 618,104 8 | 9 | **下载链接:** 10 | 11 | http://data.noahlab.com.hk/conversation/ 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 1. 删除post小于10个中文字符或response小于5个中文字符的pairs 18 | 19 | 2. 只保留post的前100条response 20 | 21 | 3. 过滤广告 22 | 23 | 24 | 25 | **公开数据的论文:** 26 | 27 | Wang, Hao, et al. "A dataset for research on short-text conversations." 
**Paper releasing the data:**

Qian, Qiao, et al. "Assigning personality/identity to a chatting machine for coherent conversation generation." [Link](https://arxiv.org/pdf/1706.02861.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/STC_Data.md:
--------------------------------------------------------------------------------
# STC Data

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 38,016 posts / 618,104 responses

**Download:**

http://data.noahlab.com.hk/conversation/

**Construction:**

1. Remove pairs whose post has fewer than 10 Chinese characters or whose response has fewer than 5 (the same length filter sketched under Noah NRM Data above).
2. Keep only the first 100 responses for each post.
3. Filter out advertisements.

**Paper releasing the data:**

Wang, Hao, et al. "A dataset for research on short-text conversations." [Link](http://w.hangli-hl.com/uploads/3/1/6/8/3168008/emnlp_2013.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Twitter.md:
--------------------------------------------------------------------------------
# Twitter

**Type:** single-turn, English, open domain

**Source:** Twitter, two months of data from the summer of 2009

**Size:**

1.3 million pairs

69% of posts have only one response; the most-replied post has 242

**Download:**

http://www.cs.washington.edu/homes/aritter/twitter_chat/ (dead link)

Later derivative: http://research.microsoft.com/convo/

The relationships among these datasets have not been fully sorted out. As far as we can tell, many papers that use a Twitter dataset trace back to this [dataset](https://www.aclweb.org/anthology/N10-1020) or to a [derived version](https://arxiv.org/pdf/1506.06714.pdf) of it; the exact lineage will be documented later.

**Known issues:**

1. Spelling variation. Clustering the vocabulary with the Jcluster word-clustering algorithm shows that the same word appears in a large number of spelling variants.

**Paper releasing the data:**

Ritter, Alan, Colin Cherry, and Bill Dolan. "Unsupervised modeling of twitter conversations." [Link](https://www.aclweb.org/anthology/N10-1020)

**Related papers:**

1. Ritter, Alan, Colin Cherry, and William B. Dolan. "Data-driven response generation in social media."
2. Sordoni, Alessandro, et al. "A neural network approach to context-sensitive generation of conversational responses."
3. Other related papers: Microsoft Data-Driven Conversation Dataset [Link](https://www.microsoft.com/en-us/research/project/data-driven-conversation/?from=http%3A%2F%2Fresearch.microsoft.com%2Fconvo%2F#!publications)

--------------------------------------------------------------------------------
/datasets/Ubuntu_Dialogue_Corpus_v2.md:
--------------------------------------------------------------------------------
# Ubuntu Dialogue Corpus v2.0

**Type:** multi-turn, English, Ubuntu-domain chat-room logs

**Source:** Ubuntu-related chat rooms on the Freenode Internet Relay Chat (IRC) network.

**Size:**

930,000 dialogues

7,100,000 utterances

7.71 turns per dialogue

**Download:**

https://github.com/rkadlec/ubuntu-ranking-dataset-creator

**Paper releasing the data:**

Lowe, Ryan, et al. "The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems." [Link](https://arxiv.org/pdf/1506.08909.pdf)

**Papers using the data:**

1. Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots."
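Response selection on this corpus (and on the Douban Conversation Corpus above) is typically evaluated with R_n@k: among n candidates containing exactly one true response, count a hit when the true response is ranked in the top k. A minimal sketch, assuming the model's scores for each test case place the true candidate at index 0:

```python
# Illustrative R_n@k (recall at position k) for response selection; a common
# evaluation convention, not code shipped with the dataset. Each test case is
# the model's scores for n candidates, ground truth at index 0.
def recall_at_k(score_lists, k):
    hits = 0
    for scores in score_lists:
        # rank of the true candidate = how many candidates scored higher
        rank = sum(1 for s in scores[1:] if s > scores[0])
        if rank < k:
            hits += 1
    return hits / len(score_lists)

# Two test cases with n=10 candidates each.
cases = [
    [0.9, 0.1, 0.2, 0.3, 0.1, 0.0, 0.2, 0.1, 0.4, 0.3],  # true response ranked 1st
    [0.4, 0.8, 0.6, 0.1, 0.2, 0.0, 0.1, 0.3, 0.2, 0.1],  # true response ranked 3rd
]
print(recall_at_k(cases, 1))  # 0.5 -> R10@1
print(recall_at_k(cases, 5))  # 1.0 -> R10@5
```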
--------------------------------------------------------------------------------
/models/models_list.md:
--------------------------------------------------------------------------------
# Open-Source Models

- Seq2Seq
  - Single-turn dialogue; works with Weibo and similar data
  - Links
    - https://github.com/tensorflow/nmt
    - https://github.com/google/seq2seq
  - The plainest seq2seq; many related implementations exist, including open-source Transformer-based versions worth consulting
- SeqGAN
  - Single-turn dialogue; works with Weibo and similar data
  - Open-source link
    - https://github.com/LantaoYu/SeqGAN
    - Quite a few other implementations exist and are also worth a look
  - Paper [Link](https://arxiv.org/pdf/1609.05473.pdf)
- SMN (Sequential Matching Network)
  - Multi-turn dialogue
  - Dataset: [Douban_Conversation_Corpus](../datasets/Douban_Conversation_Corpus.md)
  - Open-source link: https://github.com/MarkWuNLP/MultiTurnResponseSelection
  - Paper [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf)
- DAM (Deep Attention Matching Network)
  - Multi-turn dialogue
  - Datasets: [Ubuntu Dataset V1](../datasets/Ubuntu_Dialogue_Corpus_v2.md) / [Douban_Conversation_Corpus](../datasets/Douban_Conversation_Corpus.md)
  - Open-source link: https://github.com/baidu/Dialogue/tree/master/DAM
  - Paper [Link](https://www.aclweb.org/anthology/P18-1103/)
- ESIM (Enhanced Sequential Inference Model)
  - Open-sourced by Alibaba; multi-turn dialogue
  - Datasets: [Ubuntu Dataset](../datasets/Ubuntu_Dialogue_Corpus_v2.md) / [E-commerce_Dataset](../datasets/E-commerce_Dialogue_Corpus.md)
  - Open-source link: https://github.com/alibaba/esim-response-selection
  - Paper [Link](https://arxiv.org/pdf/1901.02609.pdf)
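SMN, DAM, and ESIM differ in architecture but share one task interface: score each candidate response against a multi-turn context, then rank the candidates. The sketch below shows that interface with a toy bag-of-words scorer standing in for the neural matcher; all names and example strings are illustrative.

```python
# Illustrative task interface shared by SMN/DAM/ESIM-style models: given a
# multi-turn context and candidate responses, score each candidate and rank.
# The bag-of-words cosine scorer is a toy stand-in for the neural matcher.
from collections import Counter
import math

def bow_score(context_turns, candidate):
    """Toy matcher: cosine similarity between context and candidate word bags."""
    ctx = Counter(w for turn in context_turns for w in turn.split())
    cand = Counter(candidate.split())
    dot = sum(ctx[w] * cand[w] for w in cand)
    norm = (math.sqrt(sum(v * v for v in ctx.values()))
            * math.sqrt(sum(v * v for v in cand.values())))
    return dot / norm if norm else 0.0

def rank_candidates(context_turns, candidates, scorer=bow_score):
    return sorted(((scorer(context_turns, c), c) for c in candidates), reverse=True)

context = ["how do I install java on ubuntu",
           "which version do you need"]
candidates = ["sudo apt-get install openjdk-8-jdk should install java",
              "try rebooting the machine",
              "I like turtles"]
for score, cand in rank_candidates(context, candidates):
    print(f"{score:.3f}  {cand}")
```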