├── .gitignore
├── README.md
└── LICENSE


/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Claire-datasets

This repository lists public conversational datasets in text formats.

* [Raw datasets](#raw-datasets)
  * [French](#french)
    * [Parliamentary Proceedings](#parliamentary-proceedings)
    * [Theatre](#theatre)
    * [Interviews](#interviews)
    * [Free Conversations](#free-conversations)
    * [Meetings](#meetings)
    * [Debates](#debates)
    * [Assistance](#assistance)
    * [Presentation, Formal Address](#presentation-formal-address)
  * [English](#english)
    * [Parliamentary Proceedings](#parliamentary-proceedings-1)
    * [Spoken Dialogue](#spoken-dialogue)
    * [Broadcast](#broadcast)
    * [Meetings](#meetings-1)
    * [Assistance](#assistance-1)
    * [Free Chat](#free-chat)
    * [Misc](#misc)
* [Normalized datasets](#normalized-datasets)
* [Contact](#contact)


## Raw datasets

### French

| Dataset | Description | Words | Turns | Conversations | License (and conditions) |
|---|---|---|---|---|---|
| **Parliamentary Proceedings** | | | | | |
| Assemblée Nationale | Parliamentary proceedings from the French National Assembly | 133M | 1.6M | 4.5k | Open License 2.0 |
| **Theatre** | | | | | |
| Theatre Classique | Classic stage plays | 12.8M | 441k | 25k | CC BY-NC-SA 4.0 (please cite) |
| Theatre Gratuit | Stage plays | 2.7M | 155k | 4k | |
| **Interviews** | | | | | |
| ESLO (1/5) | Guided conversations | 4.2M | 329k | 399 | CC BY-NC-SA 4.0 (please cite) |
| TCOF (adults) | Guided conversations (between adults) | 765k | 49k | 237 | CC BY-NC-SA 2.0 (please cite) |
| CFPP | Interviews of people in Paris in 2000 | 608k | 48k | 42 | CC BY-NC-SA 3.0 (please cite) |
| ORFEO/Valibel (1/2) | Guided conversations of Belgian French speakers | 458k | 19k | 67 | CC BY-NC-SA 4.0 (please cite) |
| PFC (1/2) | Guided interviews | 268k | 15k | 173 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/CFPB | Interviews of people in Brussels | 138k | 11k | 12 | CC BY-NC-SA 4.0 |
| ACSYNT | Guided interviews from southwestern France | 61k | 2.7k | 144 | CC BY-SA 4.0 (please cite) |
| **Free Conversations** | | | | | |
| OFROM | Conversations in French-speaking Switzerland | 590k | 44k | 151 | CC BY-NC-SA 3.0 (please cite) |
| ESLO (2/5) | Diverse conversations | 480k | 47k | 98 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/CRFP | Diverse conversations | 405k | 9k | 124 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/C-ORAL-ROM | Diverse conversations | 248k | 6k | 152 | CC BY-NC-SA 4.0 (please cite) |
| PFC (2/2) | Diverse conversations | 230k | 14k | 146 | CC BY-NC-SA 4.0 (please cite) |
| CLAPI | Diverse conversations | 122k | 15k | 14 | CC BY-NC-SA 4.0 |
| CID | Dialogues between two friends | 118k | 9k | 8 | CC BY-NC-SA 4.0 (please cite) |
| Rhapsodie | Diverse conversations | 28k | 1k | 41 | CC BY-NC-SA 3.0 (please cite) |
| Paris Stories | Diverse conversations in Paris | 28k | 351 | 54 | CC BY-SA 4.0 |
| LinTO (1/3) | Diverse conversations | 26k | 2k | 4 | CC BY-SA 4.0 (please cite) |
| **Meetings** | | | | | |
| SUMM-RE | Meeting-style conversations (transcribed with Whisper large-v2 ASR) | 1.3M | 39k | 283 | CC BY-SA 4.0 (please cite) |
| ORFEO/Reunions-de-Travail | Real meetings | 210k | 12k | 29 | CC BY-NC-SA 4.0 |
| LinTO (2/3) | Meetings on speech recognition | 41k | 1.8k | 6 | CC BY-SA 4.0 (please cite) |
| **Debates** | | | | | |
| FREDSum | French political debates | 406k | 7k | 144 | CC BY-SA 4.0 (please cite) |
| ESLO (3/5) | Conferences | 76k | 2k | 4 | CC BY-NC-SA 4.0 (please cite) |
| **Assistance** | | | | | |
| ESLO (4/5) | In-person assistance and call centers | 95k | 11k | 143 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/Fleuron | Interactions created to teach foreign students about university life | 33k | 2k | 51 | CC BY-NC-SA 4.0 (please cite) |
| OTG | Dialogues in a tourism office | 27k | 4k | 315 | CC BY-SA 3.0 (contact before usage) |
| Accueil UBS | University telephone answering service | 7.2k | 1k | 41 | CC BY-SA 3.0 (contact before usage) |
| **Presentation, Formal Address** | | | | | |
| ESLO (5/5) | Conference presentations | 43k | 120 | 9 | CC BY-NC-SA 4.0 (please cite) |
| LinTO (3/3) | Technical presentations (AI topics) with Q/A | 38k | 1.5k | 4 | CC BY-SA 4.0 (please cite) |
| ORFEO/Valibel (2/2) | Formal university addresses | 12k | 5 | 5 | CC BY-NC-SA 4.0 (please cite) |
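
The Words, Turns and Conversations columns above mix plain integers (`399`) with abbreviated values (`4.5k`, `133M`), so comparing or summing dataset sizes needs a small conversion step. The following is a minimal sketch of one way to do that; `parse_count` is a hypothetical helper written for this illustration, not something shipped in this repository, and it assumes `k`/`K` means thousands and `M` means millions. Approximate entries such as `<1K` are rejected on purpose rather than guessed.

```python
import re
from decimal import Decimal

# Assumed meaning of the suffixes used in the tables: k/K = thousand, M = million.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}
_COUNT_RE = re.compile(r"(\d+(?:\.\d+)?)([kKmM]?)")


def parse_count(text: str) -> int:
    """Convert a table count such as '133M', '4.5k' or '399' to an integer."""
    match = _COUNT_RE.fullmatch(text.strip())
    if match is None:
        # Covers approximate entries like '<1K' and empty cells.
        raise ValueError(f"not an exact count: {text!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding on values like '4.2M'.
    return int(Decimal(number) * _MULTIPLIERS[suffix.lower()])


# Word counts copied from three rows of the French table.
words = {"Assemblée Nationale": "133M", "ESLO (1/5)": "4.2M", "ACSYNT": "61k"}
total = sum(parse_count(value) for value in words.values())
print(total)  # 137261000
```

Because the table values are rounded, totals computed this way are only approximate.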

### English

| Dataset | Description | Words | Turns | Conversations | License (and conditions) |
|---|---|---|---|---|---|
| **Parliamentary Proceedings** | | | | | |
| Europarl | The Europarl parallel corpus | 56M | 214K | 11K | No copyright restrictions. If you use this data in your research, please contact phi@jhu.edu |
| **Spoken Dialogue** | | | | | |
| Charlotte Narratives | The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. | 200K | 2.7K | 93 | Available for download and use for research and development, including commercial development |
| Switchboard | The corpus consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. | 3M | 290K | 2320 | LDC User Agreement for Non-Members |
| **Broadcast** | | | | | |
| MediaSum (GitHub) | MediaSum dataset for summarization. A collection of transcripts of CNN and NPR interviews with short summaries. | 720M | 13M | 458K | For research purposes only |
| **Meetings** | | | | | |
| AMI (project page) | The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. | 712K | 75K | 139 | CC BY 4.0 |
| ICSI (project page) | About 70 hours of meeting recordings. | 804K | 64K | <1K | CC BY 4.0 |
| **Assistance** | | | | | |
| ReDial (GitHub) | ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. | 1.5M | 139K | 11K | CC BY 4.0 |
| OpenDialKG (GitHub) | OpenDialKG is a dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. | 1M | 84K | 12K | CC BY-NC 4.0 |
| ABCD (GitHub) | Action-Based Conversations Dataset. | 1.5M | 142K | 10K | MIT |
| AirDialogue (GitHub) | AirDialogue is a benchmark dataset for goal-oriented dialogue generation research. | 37M | 4.6M | 361K | Apache License 2.0 |
| MULTIWOZ2_2 (pfb30) | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully labeled collection of human-human written conversations spanning multiple domains and topics. | 1.9M | 143K | 10.4K | Apache License 2.0 |
| MulDoGO2 (GitHub) | Conversations from the airline, fastfood, finance, insurance, media, and software domains. | 10M | 892K | 63K | CDLA Permissive License |
| **Free Chat** | | | | | |
| Chit-Chat (GitHub) | Open-domain conversational dataset from the BYU Perception, Control & Cognition lab's Chit-Chat Challenge. | 2.3M | 7.1K | 258K | MIT License |
| DailyDialog | High-quality multi-turn dialog dataset. | 1.2M | 102K | 13K | CC BY-NC-SA 4.0 |
| **Misc** | | | | | |
| British National Corpus (BNC) | Collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English, both spoken and written, from the late twentieth century. | 110M | 663K | 0.9K | BNC License |