├── img └── meld.jpeg ├── LICENSE └── README.md /img/meld.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jojonki/NLP-Corpora/HEAD/img/meld.jpeg -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NLP Corpora 2 | 3 | 4 | This is a list of NLP corpora. You can report a new corpus at the Issues page. 5 | 6 | 7 | 8 | - [NLP Corpora](#nlp-corpora) 9 | - [Emoji Meanings](#emoji-meanings) 10 | - [Corpora](#corpora) 11 | - [Dialog (Task-oriented)](#dialog-task-oriented) 12 | - [Dialog (Others)](#dialog-others) 13 | - [Question Answering](#question-answering) 14 | - [Translation](#translation) 15 | - [Sentiment Analysis](#sentiment-analysis) 16 | - [Recommendation](#recommendation) 17 | - [Others](#others) 18 | - [References](#references) 19 | 20 | 21 | 22 | ## Emoji Meanings 23 | - 🤖🤖 Machine-to-Machine conversations which were synthetically generated. 24 | - 👦🧒 Human-to-Human conversations (including dialogs written by crowd workers, e.g. on Mechanical Turk). 25 | - 👦🤖 Human-to-Machine (dialog systems) conversations. 26 | - 📝 Written text, not intended to be spoken. 
27 | - 🗣 Spoken dialogs, including dialogs written as if spoken (generally written by crowd workers). 28 | - 🧙‍ The data was collected using a [Wizard-of-Oz](https://en.wikipedia.org/wiki/Wizard_of_Oz_experiment) scheme. 29 | 30 | ## Corpora 31 | ### Dialog (Task-oriented) 32 | 33 | - [Permuted Dialog bAbI tasks, IBM, EMNLP 2018.](https://github.com/IBM/permuted-bAbI-dialog-tasks) 34 | 35 | ``` 36 | 28 what do you think of this option: resto_rome_expensive_french_8stars-1|what do you think of this option: resto_rome_expensive_french_8stars-2 37 | 29 do you have something else sure let me find an other option for you 38 | 30 what do you think of this option: resto_rome_expensive_french_8stars-2 39 | 31 do you have something else sure let me find an other option for you 40 | 32 what do you think of this option: resto_rome_expensive_french_1stars 41 | 33 that looks great great let me do the reservation 42 | ``` 43 | 44 | 45 | - [MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling, EMNLP 2018.](https://arxiv.org/pdf/1807.06517.pdf)👦👦🧙‍ 46 | 2480 single-domain dialogs and 7375 multi-domain dialogs (usually from 2 up to 5 domains). 47 | This dataset is designed for Dialog State Tracking. 48 | The [Code](https://github.com/nmrksic/neural-belief-tracker) and the [Corpus](http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/) are available. 49 | 50 | ``` 51 | 3 domain(s): ['bus', 'hotel', 'train'] 52 | Usr: I need to book a hotel in the east that has 4 stars. 53 | Sys: I can help you with that. What is your price range? 54 | Usr: That doesn't matter as long as it has free wifi and parking. 55 | Sys: If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House. 56 | Usr: Could you book the Wartworth for one night, 1 person? 57 | Sys: What day will you be staying? 58 | Usr: Friday and Can you book it for me and get a reference number ? 59 | Sys: Booking was successful. 
60 | Reference number is : BMUKPTG6. Can I help you with anything else today? 61 | Usr: I am looking to book a train that is leaving from Cambridge to Bishops Stortford on Friday. 62 | Sys: There are a number of trains leaving throughout the day. What time would you like to travel? 63 | Usr: I want to get there by 19:45 at the latest. 64 | Sys: Okay! The latest train you can take leaves at 17:29, and arrives by 18:07. Would you like for me to book that for you? 65 | Usr: Yes please. I also need the travel time, departure time, and price. 66 | Sys: Reference number is : UIFV8FAS. The price is 10.1 GBP and the trip will take about 38 minutes. May I be of any other assistance? 67 | Usr: Yes. Sorry, but suddenly my plans changed. Can you change the Wartworth booking to Monday for 3 people and 4 nights? 68 | Sys: I have made that change and your reference number is YF86GE4J 69 | Usr: Thank you very much, goodbye. 70 | Sys: You're welcome. Have a nice day! 71 | ``` 72 | 73 | - [CMU SimDial: Synthetic Task-oriented Dialog Generator with Controllable Complexity, SIGDIAL 2018.](https://github.com/snakeztc/SimDial)👦🤖🗣 74 | SimDial is a multi-domain dialog generator which can generate conversations for slot-filling domains with configurable complexity. 75 | 76 | ``` 77 | Domain: Restaurant 78 | Sys: Welcome to the restaurant recommendation system. What do you need? 79 | User (0.83): Hello robot. I am looking for a restaurant. 80 | Sys: Which city are you interested in? 81 | User (0.75): I am in San Jose. And Korean. Uhm sorry, I am interested in Chinese food. 82 | Sys: I believe you said San Jose. I believe you said Chinese. 83 | Sys: QUERY goal: restaurant food-pref: Chinese location: 84 | San Jose 85 | KB: RET restaurant: 17 86 | Sys: Restaurant 17 is a good choice. What else can I do for you? 87 | User (0.6): Not done yet. Is that restaurant open now? 
88 | Sys: QUERY goal: open restaurant: 17 food-pref: Chinese 89 | location: San Jose 90 | KB: RET open: yes 91 | Sys: Yes, it is open right now. What else? 92 | User (0.69): No 93 | ``` 94 | 95 | - [The (6) dialog bAbI tasks, Facebook, 2016.](https://research.fb.com/downloads/babi/) 👦🤖🗣 96 | A set of six tasks for testing end-to-end task-oriented dialog systems in the restaurant reservation domain. Tasks 1-5 are synthetically generated, while Task 6 is built from human-to-computer dialogs (DSTC2). 97 | For each task, there are 1000 dialogs for training, 1000 for development and 1000 for testing. 98 | 99 | ``` 100 | U: 101 | S: Hello , welcome to the Cambridge restaurant system? You can ask for restaurants by area , price range or food type . How may I help you? 102 | U: cheap restaurant that serves german food 103 | S: I'm sorry but there is no restaurant serving german food 104 | U: what about asian oriental foo 105 | S: You are looking for a german restaurant right? U: asian 106 | S: Let me confirm , You are looking for a restaurant in the cheap price range right? 107 | ``` 108 | 109 | - [babi plus, Heriot-Watt, 2017.](https://arxiv.org/abs/1709.07840) 🤖🤖🗣 110 | Disfluencies were synthetically added to the dialogs of bAbI Task 1. [Their code](https://github.com/ishalyminov/memn2n) is also available. 111 | 112 | ``` 113 | sys: hello what can I help you with today? 114 | usr: I’d like to book a uhm yeah I’d like to book a 115 | table in a expensive price range no sorry in a 116 | cheap price range 117 | sys: I’m on it. Any preference on a type of cuisine? 118 | usr: with indian food no sorry with spanish food 119 | ``` 120 | 121 | - [A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset, Stanford, 2017.](https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/)👦👦🧙‍ 122 | In-car assistant dialogs covering three domains (weather, calendar, POI). 123 | The 3,031 multi-turn dialogues were collected in a Wizard-of-Oz manner. 
124 | 125 | ``` 126 | DRIVER: I need to find the time and parties attending my optometrist appointment. 127 | CAR : I have 3 appointments scheduled, with Alex, your sister, and Jeff. Which are you referring to? 128 | DRIVER: I want to know about the one that Alex is joining me at. 129 | CAR : That optometrist appointment is at 4 pm. 130 | DRIVER: Thanks. 131 | CAR : No problem. 132 | ``` 133 | 134 | 135 | - [Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, Maluuba, 2017.](https://datasets.maluuba.com/Frames)👦👦🧙‍ 136 | A corpus of 1369 human-human dialogues with an average of 15 turns per dialog, collected in a Wizard-of-Oz setting. 137 | 138 | ``` 139 | User I'd like to book a trip to boston from London on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700. Frame #1 140 | Act 1 inform(intent=book) 141 | Act 2 inform(dst_city = boston, or_city = London, str_date = Saturday\, August 13\, 2016, n_adults = 8, budget = 1700)" 142 | Wizard "Hi...I checked a few options for you, and unfortunately, we do not currently have any trips that meet this criteria. Would you like to book an alternate travel option?" Frame #1 143 | Act 1 no_result 144 | Act 2 suggest(dst_city) 145 | User "Yes, how about going to Detroit from London on August 13, 2016 for 5 adults. For this trip, my budget would be 1900." Frame #2 146 | Act 1 inform(dst_city = Detroit, n_adults = 5, budget = 1900, ref = [1{or_city = London, str = August 13\, 2016}]) 147 | Wizard "I checked the availability for those dates and there were no trips available. Would you like to select some alternate dates?" Frame #2 148 | Act 1 no_result(str_date, end_date) 149 | Act 2 suggest(str_date, end_date) 150 | ``` 151 | 152 | 153 | - [DSTC 1-6](https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge/)👦🤖🗣 154 | The Dialog State Tracking Challenge (DSTC) is a series of research community tasks. The link describes each DSTC task. 155 | - DSTC1: Bus schedules. 
156 | - DSTC2/3: Restaurant reservations. 157 | - DSTC4: Tourist information. 158 | - DSTC5: Tourist information, provided in several languages. 159 | - DSTC6: It consists of 3 parallel tracks: End-to-End Goal Oriented Dialog Learning, End-to-End Conversation Modeling, and Dialogue Breakdown Detection. 160 | 161 | ### Dialog (Others) 162 | 163 | - [Multimodal EmotionLines Dataset. 2018](https://affective-meld.github.io/)👦🧒🗣 164 | MELD is a multimodal (audio/vision/text, with emotion labels) dataset built from the Friends TV show. It has more than 1300 dialogues and 13000 utterances. 165 | 166 | 167 | - [Document Grounded Conversations. EMNLP 2018](https://github.com/festvox/datasets-CMU_DoG) 168 | Conversations that are about the contents of a specified document. The documents were Wikipedia articles about movies. 169 | The dataset contains 4112 conversations with an average of 21.43 turns per conversation. 170 | 171 | 172 | ``` 173 | user2: Hey have you seen the inception? 174 | user1: No, I have not but have heard of it. What is it about 175 | user2: It’s about extractors that perform experiments using military technology on people to retrieve info about their targets. 176 | ``` 177 | 178 | - [Personalizing Dialogue Agents, Facebook, 2018.](https://github.com/facebookresearch/ParlAI/tree/master/projects/personachat)👦🧒📝 179 | Two-person dialogs conditioned on personas. 180 | This contains 164,356 utterances (10,981 dialogs). 181 | 182 | ``` 183 | [PERSON 1:] Hi 184 | [PERSON 2:] Hello ! How are you today ? 185 | [PERSON 1:] I am good thank you , how are you. 186 | [PERSON 2:] Great, thanks ! My children and I were just about to watch Game of Thrones. 187 | [PERSON 1:] Nice ! How old are your children? 188 | [PERSON 2:] I have four that range in age from 10 to 21. You? 189 | [PERSON 1:] I do not have children at the moment. 190 | [PERSON 2:] That just means you get to keep all the popcorn for yourself. 191 | [PERSON 1:] And Cheetos at the moment! 
192 | [PERSON 2:] Good choice. Do you watch Game of Thrones? 193 | [PERSON 1:] No, I do not have much time for TV. 194 | [PERSON 2:] I usually spend my time painting: but, I love the show. 195 | ``` 196 | 197 | - [Edina: Building an Open Domain Socialbot with Self-dialogues, 2017.](https://github.com/jfainberg/self_dialogue_corpus)👦👦 198 | The data was collected on AMT (Amazon Mechanical Turk) in a _self-dialog_ manner and was used in the Alexa Prize 2017. 199 | Given a topic, each worker created a self-dialog alone. 200 | There are 24,283 self-dialogues, 3,653,313 words, across 141,945 turns, from 2,717 Workers. 201 | 202 | ``` 203 | What is your absolute favorite movie? 204 | I think Beauty and the Beast is my favorite. 205 | The new one? 206 | No, the cartoon. Something about it just feels magical. 207 | It is my favorite Disney movie. 208 | What’s your favorite movie in general? 209 | I think my favorite is The Sound of Music. 210 | Really? Other than cartoons and stuff I can never get into musicals. 211 | I love musicals. I really liked Phantom of the Opera. 212 | ``` 213 | 214 | 215 | - [DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset, 2017.](http://yanran.li/dailydialog)👦🧒📝 216 | The dialogs reflect everyday communication and cover various topics from daily life. The dataset contains 13,118 multi-turn dialogs. 217 | 218 | ``` 219 | A: I’m worried about something. 220 | B: What’s that? 221 | A: Well, I have to drive to school for a meeting this morning, 222 | and I’m going to end up getting stuck in rush-hour traffic. 223 | B: That’s annoying, but nothing to worry about. 224 | Just breathe deeply when you feel yourself getting upset. 225 | ``` 226 | 227 | - [The Ubuntu Dialogue Corpus, 2015.](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)🧒🧒📝 228 | Human-to-human dialogs from Ubuntu chat logs. 229 | 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. 
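Many of the dialog samples above use a `SPEAKER: text` layout (for example `A:`/`B:`, `Usr:`/`Sys:`, or `DRIVER:`/`CAR :`), sometimes with an utterance wrapped onto a following line. The snippet below is a minimal sketch for splitting such a pasted sample into (speaker, utterance) turns. The function name `parse_turns` and the speaker pattern are assumptions made for illustration; none of the corpora listed here ships this helper, and the actual corpus files generally use their own formats (JSON, TSV, and so on).

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a short alphanumeric speaker label, optional spaces, a colon.
# It covers labels such as "A", "Usr", "Sys", "DRIVER", "CAR ", and "user2".
SPEAKER_RE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 ]{0,15}?)\s*:\s*(.*)$")


def parse_turns(sample: str) -> List[Tuple[str, str]]:
    """Split a 'SPEAKER: text' transcript into (speaker, utterance) pairs.

    Lines without a speaker label are treated as continuations of the
    previous utterance; blank lines are skipped.
    """
    turns: List[Tuple[str, str]] = []
    for line in sample.splitlines():
        if not line.strip():
            continue
        match = SPEAKER_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns:
            # Wrapped line: append it to the most recent utterance.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line.strip()}")
    return turns


if __name__ == "__main__":
    demo = (
        "A: I'm worried about something.\n"
        "B: What's that?\n"
        "A: Well, I have to drive to school for a meeting this morning,\n"
        "and I'm going to end up getting stuck in rush-hour traffic.\n"
    )
    for speaker, utterance in parse_turns(demo):
        print(f"{speaker} -> {utterance}")
```

This is only suited to quick inspection of the README samples: a wrapped line that itself begins with a short word followed by a colon would be misread as a new speaker, and labels with confidence scores such as `User (0.83):` are not matched by this pattern.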
230 | 231 | ### Question Answering 232 | 233 | - [CommonsenseQA, 2018](https://www.tau-nlp.org/commonsenseqa) 234 | CommonsenseQA is a new QA dataset that contains 9,500 examples and aims to test commonsense knowledge. 235 | 236 | 237 | - [Open Book Question Answering, 2018.](http://data.allenai.org/OpenBookQA) 238 | > OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1326 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. 239 | 240 | ``` 241 | Question: 242 | Which of these would let the most heat travel through? 243 | A) a new pair of jeans. 244 | B) a steel spoon in a cafeteria. 245 | C) a cotton candy at a store. 246 | D) a calvin klein cotton hat. 247 | Science Fact: 248 | Metal is a thermal conductor. 249 | Common Knowledge: 250 | Steel is made of metal. 251 | Heat travels through a thermal conductor. 252 | ``` 253 | 254 | 255 | - [CoQA, 2018](https://stanfordnlp.github.io/coqa/)🧒🧒🗣 256 | CoQA is a large-scale dataset for building Conversational Question Answering systems. It contains 127K questions with answers obtained from 8K conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. 257 | ``` 258 | Jessica went to sit in her rocking chair. Today was her birthday 259 | and she was turning 80. Her granddaughter Annie was coming 260 | over in the afternoon and Jessica was very excited to see 261 | her. Her daughter Melanie and Melanie’s husband Josh were 262 | coming as well. Jessica had . . . 263 | 264 | Q1: Who had a birthday? 265 | A1: Jessica 266 | R1: Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. 
267 | Q2: How old would she be? 268 | A2: 80 269 | R2: she was turning 80 270 | ``` 271 | 272 | 273 | - [SQuAD 2.0, 2018](https://rajpurkar.github.io/SQuAD-explorer/)📝 274 | 275 | > SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. 276 | 277 | - [Spoken SQuAD, 2018](https://github.com/chiahsuan156/Spoken-SQuAD/)🗣 278 | An artificially generated corpus based on SQuAD. Google TTS was used to generate spoken versions of the SQuAD articles. The generated speech was then passed to CMU Sphinx (a speech recognizer), and the resulting transcripts were checked for whether they still contain a correct answer. 279 | 280 | - [ODSQA, 2018](https://github.com/chiahsuan156/Spoken-SQuAD/)🗣 281 | This is a __Chinese corpus__ with over three thousand questions from 20 Chinese speakers. 282 | 283 | ### Translation 284 | 285 | 🚧 286 | 287 | ### Sentiment Analysis 288 | 289 | 🚧 290 | 291 | 292 | ### Recommendation 293 | - [ReDial. Microsoft Research. NIPS 2018](https://redialdata.github.io/website) 294 | A large-scale dataset of real-world dialogues centered around movie recommendations. 295 | ReDial contains over 10K such conversations. 296 | 297 | ``` 298 | SEEKER: those sound good ! i ’m going to look into those movies. 299 | HUMAN: i hope you enjoy, have a nice one 300 | HRED: have you seen foxcatcher ? it ’s about a man who has a 301 | rich guy. 302 | OURS: i hope i was able to help you find a good movie to watch 303 | SEEKER: thank you for your help ! have a great night ! good bye 304 | ``` 305 | 306 | ### Others 307 | - [WikiHow. 2018](https://github.com/mahnazkoupaee/WikiHow-Dataset) 308 | > WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base. Each article consists of multiple paragraphs and each paragraph starts with a sentence summarizing it. 
By merging the paragraphs to form the article and the paragraph outlines to form the summary, the resulting version of the dataset contains more than 200,000 long-sequence pairs. 309 | 310 | - [Spider 1.0, EMNLP 2018](https://yale-lily.github.io/spider) 311 | A human-labeled dataset for complex, cross-domain semantic parsing and text-to-SQL tasks. 312 | 313 | ``` 314 | Question: What are the name and budget of the departments with average instructor salary greater than the overall average? 315 | SQL: 316 | SELECT T2.name, T2.budget 317 | FROM instructor as T1 JOIN department as 318 | T2 ON T1.department_id = T2.id 319 | GROUP BY T1.department_id 320 | HAVING avg(T1.salary) > (SELECT avg(salary) FROM instructor) 321 | ``` 322 | 323 | 324 | 325 | - [YouTube AV 50K: an Annotated Corpus for Comments in Autonomous Vehicles, 2018](https://arxiv.org/abs/1807.11227)🧒📝 326 | Comments from 50K YouTube videos related to autonomous vehicles. There are 30,456 comments in total, including 19,126 annotated comments. 327 | 328 | - [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. 329 | 2018.](https://rowanzellers.com/swag/) 330 | SWAG (Situations With Adversarial Generations) is a dataset for grounded commonsense inference. It consists of 113k multiple choice questions about grounded situations. 331 | 332 | ``` 333 | On stage, a woman takes a seat at the piano. She 334 | a) sits on a bench as her sister plays with the doll. 335 | b) smiles with someone as the music plays. 336 | c) is in the crowd, watching the dancers. 337 | d) nervously sets her fingers on the keys. <--- 338 | 339 | A girl is going across a set of monkey bars. She 340 | a) jumps up across the monkey bars. 341 | b) struggles onto the monkey bars to grab her head. 342 | c) gets to the end and stands on a wooden plank. <--- 343 | d) jumps up and does a back flip. 
344 | ``` 345 | 346 | - [Friends TV Corpus, 2011](https://sites.google.com/site/friendstvcorpus/) 347 | The data was created for an analysis of the use of various linguistic structures and a comparison of their use in inter-gender and intra-gender conversation environments in "Friends". The paper's title is "A STUDY INTO THE USE OF LINGUISTIC STRUCTURES USED INTER-GENDER AND INTRA-GENDER IN THE TV SHOW ‘FRIENDS’". 348 | 349 | ``` 350 | 1 1 MONICA F Monica: There's nothing to tell! He's just some guy I work with! There's nothing to tell! He's just some guy I work with! There_EX 's_VBZ nothing_PN1 to_TO tell_VVI !_! He_PPHS1 's_VBZ just_RR some_DD guy_NN1 I_PPIS1 work_VV0 with_IW !_! 0101.txt 351 | 101 1 JOEY M Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him! C'mon, you're going out with the guy! There's gotta be something wrong with him! C'm_VV0 on_RP you_PPY 're_VBR going_VVG out_RP with_IW the_AT guy_NN1 !_! There_EX 's_VHZ got_VVN 352 | ``` 353 | 354 | ## References 355 | - [A Survey of Available Corpora for Building Data-Driven Dialogue Systems](https://breakend.github.io/DialogDatasets/) 356 | - [Natural Language Processing Corpora, NLP for Hackers](https://nlpforhackers.io/corpora/) 357 | --------------------------------------------------------------------------------