├── README.md
├── datasets
│   ├── 0.template.md
│   ├── Chinese_Dialogue_Dataset_with_Sentence_Function.md
│   ├── Cornell_Movie_Dialogs_Corpus.md
│   ├── Douban_Conversation_Corpus.md
│   ├── E-commerce_Dialogue_Corpus.md
│   ├── JD_Customer_Service_Corpus.md
│   ├── Leiden_Weibo_Corpus.md
│   ├── NTCIR14-STC3-CECG.md
│   ├── Noah_NRM_Data.md
│   ├── OpenSubtitles.md
│   ├── Personality_Assignment_Dataset.md
│   ├── STC_Data.md
│   ├── Twitter.md
│   └── Ubuntu_Dialogue_Corpus_v2.md
└── models
    └── models_list.md

/README.md:
--------------------------------------------------------------------------------
# Chinese and English Corpora for Dialogue Systems

This project collects the publicly released corpora used in current papers to train Chinese (and English) dialogue systems, together with open-source dialogue models.

[Some open-source models](models/models_list.md) (to be organized…)

## Commonly Used

### Chinese

[Douban Conversation Corpus](datasets/Douban_Conversation_Corpus.md)

[Noah NRM Data](datasets/Noah_NRM_Data.md)

[STC Data](datasets/STC_Data.md)

### English

[Ubuntu Dialogue Corpus v2](datasets/Ubuntu_Dialogue_Corpus_v2.md)

[OpenSubtitles](datasets/OpenSubtitles.md)

[Cornell Movie Dialogs Corpus](datasets/Cornell_Movie_Dialogs_Corpus.md)

[Twitter](datasets/Twitter.md)

## Weibo

[Noah NRM Data](datasets/Noah_NRM_Data.md)

[STC Data](datasets/STC_Data.md)

[NTCIR14 STC3 CECG](datasets/NTCIR14-STC3-CECG.md)

[Personality Assignment Dataset](datasets/Personality_Assignment_Dataset.md)

[Chinese Dialogue Dataset with Sentence Function](datasets/Chinese_Dialogue_Dataset_with_Sentence_Function.md)

## Twitter

[Twitter](datasets/Twitter.md)

## Douban

[Douban Conversation Corpus](datasets/Douban_Conversation_Corpus.md)

## E-commerce

[JD Customer Service Corpus](datasets/JD_Customer_Service_Corpus.md)

[E-commerce Dialogue Corpus](datasets/E-commerce_Dialogue_Corpus.md)

--------------------------------------------------------------------------------
/datasets/0.template.md:
--------------------------------------------------------------------------------
# Dataset Name

**Type:**

**Source:**

**Size:**

**Download:**

**Construction:**

**Paper releasing the data:**

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Chinese_Dialogue_Dataset_with_Sentence_Function.md:
--------------------------------------------------------------------------------
# Chinese Dialogue Dataset with Sentence Function

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 1,963,382 pairs

**Download:**

http://coai.cs.tsinghua.edu.cn/file/DialogwithSenFun.tar.gz

**Construction:**

1. Crawl 10 million posts from Weibo.
2. Sample 2,000 of them for manual annotation with sentence-function labels (interrogative, declarative, imperative, and exclamatory); split them train:valid:test = 6:1:1 and train a sentence-function classifier.
3. Label the Weibo text with that classifier, sampling about 6,000,000 items per category; a labeling sketch follows this list.
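The sketch below only illustrates the shape of this pipeline (train a classifier on a small annotated sample, then label and bucket a large unlabeled pool). It uses scikit-learn with toy data standing in for the 2,000 annotated sentences; it is not the authors' code, and the model choice is an assumption.

```python
# Illustrative sketch of the labeling pipeline (not the authors' code).
from collections import defaultdict
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 2,000 manually annotated sentences.
annotated = [
    ("你吃饭了吗?", "interrogative"),
    ("今天天气很好。", "declarative"),
    ("快把门关上!", "imperative"),
    ("这也太好看了吧!", "exclamatory"),
] * 50  # repeated so the toy classifier has enough rows to fit

texts, labels = zip(*annotated)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # char n-grams suit Chinese
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Label the (much larger) unlabeled pool and bucket posts by predicted class.
unlabeled = ["明天会下雨吗?", "我在看电影。", "别说话!"]
buckets = defaultdict(list)
for post, label in zip(unlabeled, clf.predict(unlabeled)):
    buckets[label].append(post)

# Sample up to N posts per sentence-function category.
N = 2
sampled = {label: random.sample(posts, min(N, len(posts)))
           for label, posts in buckets.items()}
print(sampled)
```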
"Generating informative responses with controlled sentence function."[Link](http://coai.cs.tsinghua.edu.cn/hml/media/files/acl_senfun.pdf) 32 | 33 | 34 | 35 | **使用数据的论文:** 36 | 37 | -------------------------------------------------------------------------------- /datasets/Cornell_Movie_Dialogs_Corpus.md: -------------------------------------------------------------------------------- 1 | # Cornell Movie Dialogs Corpus 2 | 3 | **类型:** 英文、多轮、电影字幕 4 | 5 | **来源:** Internet Movie Database 6 | 7 | **规模:** 304,713 utterances 8 | 9 | **下载链接:** 10 | 11 | [http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 18 | 19 | **公开数据的论文:** 20 | 21 | Danescu-Niculescu-Mizil, Cristian, and Lillian Lee. "Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs." [Link](https://arxiv.org/pdf/1106.3077.pdf) 22 | 23 | 24 | 25 | **使用数据的论文:** 26 | 27 | -------------------------------------------------------------------------------- /datasets/Douban_Conversation_Corpus.md: -------------------------------------------------------------------------------- 1 | # Douban Conversation Corpus 2 | 3 | **类型:** 多轮、中文、开放领域 4 | 5 | **来源:** 豆瓣 6 | 7 | **规模:** 8 | 9 | session-response pairs 1,000,000 pairs 10 | 11 | 平均6.69轮每session 12 | 13 | 14 | 15 | **下载链接:** 16 | 17 | https://github.com/MarkWuNLP/MultiTurnResponseSelection 18 | 19 | 20 | 21 | **构造方法:** 22 | 23 | 1. 抓取原始数据,豆瓣小组中轮数大于2轮的110万。 24 | 25 | 2. sample50万做训练集,25万做验证集,从110万中其他句子中sample负例。 26 | 27 | 3. 最终100万训练集,50万验证集。 28 | 29 | 4. 检索模型的数据由抓取的1500万微博对话构成。 30 | 31 | 32 | 33 | **公开数据的论文:** 34 | 35 | Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots." [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf) 36 | 37 | 38 | 39 | **使用数据的论文:** 40 | 41 | 1. Zhou, Xiangyang, et al. "Multi-turn response selection for chatbots with deep attention matching network." 42 | 2. Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." 43 | 3. Wu, Yu, et al. "Learning matching models with weak supervision for response selection in retrieval-based chatbots." -------------------------------------------------------------------------------- /datasets/E-commerce_Dialogue_Corpus.md: -------------------------------------------------------------------------------- 1 | # E-commerce Dialogue Corpus 2 | 3 | **类型:** 多轮、中文、电商领域 4 | 5 | **来源:** 淘宝 6 | 7 | **规模:** 1,000,000 Utterance 8 | 9 | **下载链接:** 10 | 11 | https://github.com/cooelf/DeepUtteranceAggregation 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | conversations between customers and customer service staff from our E-commerce partners in Taobao 18 | 19 | 20 | 21 | **公开数据的论文:** 22 | 23 | Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." 
**Paper releasing the data:**

Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots." [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf)

**Papers using the data:**

1. Zhou, Xiangyang, et al. "Multi-turn response selection for chatbots with deep attention matching network."
2. Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation."
3. Wu, Yu, et al. "Learning matching models with weak supervision for response selection in retrieval-based chatbots."

--------------------------------------------------------------------------------
/datasets/E-commerce_Dialogue_Corpus.md:
--------------------------------------------------------------------------------
# E-commerce Dialogue Corpus

**Type:** multi-turn, Chinese, e-commerce domain

**Source:** Taobao

**Size:** 1,000,000 utterances

**Download:**

https://github.com/cooelf/DeepUtteranceAggregation

**Construction:**

Conversations between customers and customer-service staff, collected from e-commerce partners on Taobao.

**Paper releasing the data:**

Zhang, Zhuosheng, et al. "Modeling multi-turn conversation with deep utterance aggregation." [Link](https://arxiv.org/pdf/1806.09102.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/JD_Customer_Service_Corpus.md:
--------------------------------------------------------------------------------
# JD Customer Service Corpus

**Type:** multi-turn, Chinese, e-commerce domain

**Source:** JD.com

**Size:** 420,000 utterances

**Download:**

https://github.com/chenhongshen/HVMN

**Construction:**

Each conversation is between a customer and a customer-service staff member.

**Paper releasing the data:**

Chen, Hongshen, et al. "Hierarchical variational memory network for dialogue generation." [Link](http://delivery.acm.org/10.1145/3190000/3186077/p1653-chen.pdf?ip=106.120.213.65&id=3186077&acc=OPEN&key=BF85BBA5741FDC6E.68C876273B0CA8EC.4D4702B0C3E38B35.6D218144511F3437&__acm__=1567786076_1b2e1077297f250b888837a9566218dd)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Leiden_Weibo_Corpus.md:
--------------------------------------------------------------------------------
# Leiden Weibo Corpus

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 5,103,566 posts

**Download:**

http://lwc.daanvanesch.nl/openaccess.php

**Construction:**

5,103,566 posts crawled via the Weibo API from 2012-01-08 to 2012-01-30. The corpus appears to contain no responses, so it may not be usable for dialogue modeling.

http://lwc.daanvanesch.nl/help.php#methodology

**Paper releasing the data:**

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/NTCIR14-STC3-CECG.md:
--------------------------------------------------------------------------------
# NTCIR14-STC3-CECG

**Type:** single-turn, Chinese

**Source:** Weibo

**Size:** 1,710,000 pairs

**Download:**

http://coai.cs.tsinghua.edu.cn/hml/challenge/dataset_description/

**Construction:**

Built on the Noah NRM Data: a Bi-LSTM emotion classifier labels each pair with an emotion category, and the categories are balanced.

**Additional labels:**

Emotion category of the response to be generated, assigned by the Bi-LSTM classifier.

**Paper releasing the data:**

Zhou, Hao, et al. "Emotional chatting machine: Emotional conversation generation with internal and external memory." [Link](http://coai.cs.tsinghua.edu.cn/hml/media/files/aaai2018-ecm.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Noah_NRM_Data.md:
--------------------------------------------------------------------------------
# Noah NRM Data

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:**

4,435,959 pairs

219,905 posts / 4,308,211 responses / about 20 responses per post on average

**Download:**

http://61.93.89.94/Noah_NRM_Data/ (dead link)

http://www.noahlab.com.hk/topics/ShortTextConversation (dead link)

**Construction:**

1. Remove pairs whose post has fewer than 10 Chinese characters or whose response has fewer than 5 (see the filtering sketch below).
2. Keep only the first 30 responses that are relevant to the post's topic.
3. Filter out advertisements and meaningless replies such as "哈哈".
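A minimal sketch of the length filter in step 1, assuming that characters in the CJK Unified Ideographs range are what counts as "Chinese characters"; the helper names are illustrative (the STC Data below applies the same rule).

```python
# Illustrative length filter for (post, response) pairs (not the original scripts):
# drop a pair if the post has fewer than 10 Chinese characters or the response
# has fewer than 5.
def count_chinese_chars(text):
    """Count CJK unified ideographs, ignoring punctuation, digits, and Latin."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")

def keep_pair(post, response, min_post=10, min_response=5):
    return (count_chinese_chars(post) >= min_post
            and count_chinese_chars(response) >= min_response)

pairs = [
    ("今天天气真不错我们出去走走吧", "好主意,去哪里?"),  # kept
    ("哈哈", "笑什么"),                                  # dropped: post too short
]
print([(p, r) for p, r in pairs if keep_pair(p, r)])
```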
"Incorporating copying mechanism in sequence-to-sequence learning." 42 | 2. Zhou, Hao, et al. "Emotional chatting machine: Emotional conversation generation with internal and external memory." -------------------------------------------------------------------------------- /datasets/OpenSubtitles.md: -------------------------------------------------------------------------------- 1 | # OpenSubtitles 2 | 3 | **类型:** 多轮、多语言、电影字幕 4 | 5 | **来源:** Internet Movie Database 6 | 7 | **规模:** 8 | 9 | 2.6 billion sentences (17.2 billion tokens) distributed over 60 languages 10 | 11 | **下载链接:** 12 | 13 | http://opus.nlpl.eu/OpenSubtitles.php 14 | 15 | 16 | 17 | **构造方法:** 18 | 19 | 20 | 21 | **公开数据的论文:** 22 | 23 | 1. Lison, Pierre, and Jörg Tiedemann. "Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles." [Link](https://pdfs.semanticscholar.org/10df/593bdf6d0e4a5f0daa8c224a8bdddf9e3167.pdf) 24 | 2. Lison, Pierre, Jörg Tiedemann, and Milen Kouylekov. "Opensubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora." [Link](http://www.lrec-conf.org/proceedings/lrec2018/pdf/294.pdf) 25 | 26 | **使用数据的论文:** 27 | 28 | 1. Li, Jiwei, et al. "A diversity-promoting objective function for neural conversation models." 29 | 2. Li, Jiwei, et al. "Adversarial learning for neural dialogue generation." -------------------------------------------------------------------------------- /datasets/Personality_Assignment_Dataset.md: -------------------------------------------------------------------------------- 1 | # Personality Assignment Dataset 2 | 3 | **类型:** 单轮、中文、开放领域 4 | 5 | **来源:** 微博 6 | 7 | **规模: ** 9, 697, 651 pairs 8 | 9 | **下载链接:** 10 | 11 | [http://coai.cs.tsinghua.edu.cn/file/ijcai_data.zip](http://coai.cs.tsinghua.edu.cn/file/ijcai_data.zip) 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 1. Weibo Dataset(WD) - 微博抓取数据,9, 697, 651 pairs 18 | 19 | 2. Profile Binary Subset(PB) - 微博抓取数据中,通过表达式匹配profile的相关句子{姓名,性别,年龄,城市,体重,星座},76, 930 pairs,13个人标注为某个设定下的正负例 20 | 21 | 3. Profile Related Subset - PB当中跟profile相关的正例,42, 193 pairs 22 | 23 | 4. Manual Dataset - 人工写的与profile相关的正负例句子 24 | 25 | 26 | 27 | **公开数据的论文:** 28 | 29 | Qian, Qiao, et al. "Assigning personality/identity to a chatting machine for coherent conversation generation." [Link](https://arxiv.org/pdf/1706.02861.pdf) 30 | 31 | **使用数据的论文:** 32 | 33 | -------------------------------------------------------------------------------- /datasets/STC_Data.md: -------------------------------------------------------------------------------- 1 | # STC Data 2 | 3 | **类型:** 单轮、中文、开放领域 4 | 5 | **来源:** 微博 6 | 7 | **规模:** posts 38,016 / responses 618,104 8 | 9 | **下载链接:** 10 | 11 | http://data.noahlab.com.hk/conversation/ 12 | 13 | 14 | 15 | **构造方法:** 16 | 17 | 1. 删除post小于10个中文字符或response小于5个中文字符的pairs 18 | 19 | 2. 只保留post的前100条response 20 | 21 | 3. 过滤广告 22 | 23 | 24 | 25 | **公开数据的论文:** 26 | 27 | Wang, Hao, et al. "A dataset for research on short-text conversations." 
**Paper releasing the data:**

Qian, Qiao, et al. "Assigning personality/identity to a chatting machine for coherent conversation generation." [Link](https://arxiv.org/pdf/1706.02861.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/STC_Data.md:
--------------------------------------------------------------------------------
# STC Data

**Type:** single-turn, Chinese, open domain

**Source:** Weibo

**Size:** 38,016 posts / 618,104 responses

**Download:**

http://data.noahlab.com.hk/conversation/

**Construction:**

1. Remove pairs whose post has fewer than 10 Chinese characters or whose response has fewer than 5 (the same length filter sketched under Noah NRM Data above).
2. Keep only the first 100 responses for each post.
3. Filter out advertisements.

**Paper releasing the data:**

Wang, Hao, et al. "A dataset for research on short-text conversations." [Link](http://w.hangli-hl.com/uploads/3/1/6/8/3168008/emnlp_2013.pdf)

**Papers using the data:**

--------------------------------------------------------------------------------
/datasets/Twitter.md:
--------------------------------------------------------------------------------
# Twitter

**Type:** single-turn, English, open domain

**Source:** Twitter, two months of data from the summer of 2009

**Size:**

1.3 million pairs

69% of posts have only one response; the most-replied post has 242

**Download:**

http://www.cs.washington.edu/homes/aritter/twitter_chat/ (dead link)

Later derivative: http://research.microsoft.com/convo/

The relationships among these datasets have not been fully sorted out. As far as we can tell, many papers that use a Twitter dataset trace back to this [dataset](https://www.aclweb.org/anthology/N10-1020) or to a [derived version](https://arxiv.org/pdf/1506.06714.pdf) of it; the exact lineage will be documented later.

**Known issues:**

1. Spelling variation. Clustering the vocabulary with the Jcluster word-clustering algorithm shows that the same word appears in a large number of spelling variants.

**Paper releasing the data:**

Ritter, Alan, Colin Cherry, and Bill Dolan. "Unsupervised modeling of twitter conversations." [Link](https://www.aclweb.org/anthology/N10-1020)

**Related papers:**

1. Ritter, Alan, Colin Cherry, and William B. Dolan. "Data-driven response generation in social media."
2. Sordoni, Alessandro, et al. "A neural network approach to context-sensitive generation of conversational responses."
3. Other related papers: Microsoft Data-Driven Conversation Dataset [Link](https://www.microsoft.com/en-us/research/project/data-driven-conversation/?from=http%3A%2F%2Fresearch.microsoft.com%2Fconvo%2F#!publications)

--------------------------------------------------------------------------------
/datasets/Ubuntu_Dialogue_Corpus_v2.md:
--------------------------------------------------------------------------------
# Ubuntu Dialogue Corpus v2.0

**Type:** multi-turn, English, Ubuntu-domain chat-room logs

**Source:** Ubuntu-related chat rooms on the Freenode Internet Relay Chat (IRC) network.

**Size:**

930,000 dialogues

7,100,000 utterances

7.71 turns per dialogue

**Download:**

https://github.com/rkadlec/ubuntu-ranking-dataset-creator

**Paper releasing the data:**

Lowe, Ryan, et al. "The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems." [Link](https://arxiv.org/pdf/1506.08909.pdf)

**Papers using the data:**

1. Wu, Yu, et al. "Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots."
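Response selection on this corpus (and on the Douban Conversation Corpus above) is typically evaluated with R_n@k: among n candidates containing exactly one true response, count a hit when the true response is ranked in the top k. A minimal sketch, assuming the model's scores for each test case place the true candidate at index 0:

```python
# Illustrative R_n@k (recall at position k) for response selection; a common
# evaluation convention, not code shipped with the dataset. Each test case is
# the model's scores for n candidates, ground truth at index 0.
def recall_at_k(score_lists, k):
    hits = 0
    for scores in score_lists:
        # rank of the true candidate = how many candidates scored higher
        rank = sum(1 for s in scores[1:] if s > scores[0])
        if rank < k:
            hits += 1
    return hits / len(score_lists)

# Two test cases with n=10 candidates each.
cases = [
    [0.9, 0.1, 0.2, 0.3, 0.1, 0.0, 0.2, 0.1, 0.4, 0.3],  # true response ranked 1st
    [0.4, 0.8, 0.6, 0.1, 0.2, 0.0, 0.1, 0.3, 0.2, 0.1],  # true response ranked 3rd
]
print(recall_at_k(cases, 1))  # 0.5 -> R10@1
print(recall_at_k(cases, 5))  # 1.0 -> R10@5
```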
--------------------------------------------------------------------------------
/models/models_list.md:
--------------------------------------------------------------------------------
# Open-Source Models

- Seq2Seq
  - Single-turn dialogue; works with Weibo and similar data
  - Links
    - https://github.com/tensorflow/nmt
    - https://github.com/google/seq2seq
  - The plainest seq2seq; many related implementations exist, including open-source Transformer-based versions worth consulting
- SeqGAN
  - Single-turn dialogue; works with Weibo and similar data
  - Open-source link
    - https://github.com/LantaoYu/SeqGAN
    - Quite a few other implementations exist and are also worth a look
  - Paper [Link](https://arxiv.org/pdf/1609.05473.pdf)
- SMN (Sequential Matching Network)
  - Multi-turn dialogue
  - Dataset: [Douban_Conversation_Corpus](../datasets/Douban_Conversation_Corpus.md)
  - Open-source link: https://github.com/MarkWuNLP/MultiTurnResponseSelection
  - Paper [Link](https://pdfs.semanticscholar.org/a6ee/c00f10346ce27d4f69f9e38f5665fffe8056.pdf)
- DAM (Deep Attention Matching Network)
  - Multi-turn dialogue
  - Datasets: [Ubuntu Dataset V1](../datasets/Ubuntu_Dialogue_Corpus_v2.md) / [Douban_Conversation_Corpus](../datasets/Douban_Conversation_Corpus.md)
  - Open-source link: https://github.com/baidu/Dialogue/tree/master/DAM
  - Paper [Link](https://www.aclweb.org/anthology/P18-1103/)
- ESIM (Enhanced Sequential Inference Model)
  - Open-sourced by Alibaba; multi-turn dialogue
  - Datasets: [Ubuntu Dataset](../datasets/Ubuntu_Dialogue_Corpus_v2.md) / [E-commerce_Dataset](../datasets/E-commerce_Dialogue_Corpus.md)
  - Open-source link: https://github.com/alibaba/esim-response-selection
  - Paper [Link](https://arxiv.org/pdf/1901.02609.pdf)
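SMN, DAM, and ESIM differ in architecture but share one task interface: score each candidate response against a multi-turn context, then rank the candidates. The sketch below shows that interface with a toy bag-of-words scorer standing in for the neural matcher; all names and example strings are illustrative.

```python
# Illustrative task interface shared by SMN/DAM/ESIM-style models: given a
# multi-turn context and candidate responses, score each candidate and rank.
# The bag-of-words cosine scorer is a toy stand-in for the neural matcher.
from collections import Counter
import math

def bow_score(context_turns, candidate):
    """Toy matcher: cosine similarity between context and candidate word bags."""
    ctx = Counter(w for turn in context_turns for w in turn.split())
    cand = Counter(candidate.split())
    dot = sum(ctx[w] * cand[w] for w in cand)
    norm = (math.sqrt(sum(v * v for v in ctx.values()))
            * math.sqrt(sum(v * v for v in cand.values())))
    return dot / norm if norm else 0.0

def rank_candidates(context_turns, candidates, scorer=bow_score):
    return sorted(((scorer(context_turns, c), c) for c in candidates), reverse=True)

context = ["how do I install java on ubuntu",
           "which version do you need"]
candidates = ["sudo apt-get install openjdk-8-jdk should install java",
              "try rebooting the machine",
              "I like turtles"]
for score, cand in rank_candidates(context, candidates):
    print(f"{score:.3f}  {cand}")
```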