├── .gitignore
├── .github
    └── pull_request_template.md
├── LICENSE
├── CODE_OF_CONDUCT.md
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *~


--------------------------------------------------------------------------------
/.github/pull_request_template.md:
--------------------------------------------------------------------------------
 1 | # Pull request guidelines
 2 | 
 3 | Welcome to the 🐸open-speech-corpora project! We are excited to see your interest, and we appreciate your support!
 4 | 
 5 | This repository is governed by the Contributor Covenant Code of Conduct. For more details, see the [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) file.
 6 | 
 7 | Before accepting your pull request, you will be asked to sign a [Contributor License Agreement](https://cla-assistant.io/coqui-ai/open-speech-corpora).
 8 | 
 9 | This [Contributor License Agreement](https://cla-assistant.io/coqui-ai/open-speech-corpora):
10 | 
11 | - Protects you, Coqui, and the users of the code.
12 | - Does not change your rights to use your contributions for any purpose.
13 | - Does not change the license of the 🐸open-speech-corpora project. It just makes the terms of your contribution clearer and lets us know you are OK to contribute.
14 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Josh Meyer
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
  1 | # Contributor Covenant Code of Conduct
  2 | 
  3 | ## Our Pledge
  4 | 
  5 | We as members, contributors, and leaders pledge to make participation in our
  6 | community a harassment-free experience for everyone, regardless of age, body
  7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
  8 | identity and expression, level of experience, education, socio-economic status,
  9 | nationality, personal appearance, race, caste, color, religion, or sexual identity
 10 | and orientation.
 11 | 
 12 | We pledge to act and interact in ways that contribute to an open, welcoming,
 13 | diverse, inclusive, and healthy community.
 14 | 
 15 | ## Our Standards
 16 | 
 17 | Examples of behavior that contributes to a positive environment for our
 18 | community include:
 19 | 
 20 | * Demonstrating empathy and kindness toward other people
 21 | * Being respectful of differing opinions, viewpoints, and experiences
 22 | * Giving and gracefully accepting constructive feedback
 23 | * Accepting responsibility and apologizing to those affected by our mistakes,
 24 |   and learning from the experience
 25 | * Focusing on what is best not just for us as individuals, but for the
 26 |   overall community
 27 | 
 28 | Examples of unacceptable behavior include:
 29 | 
 30 | * The use of sexualized language or imagery, and sexual attention or
 31 |   advances of any kind
 32 | * Trolling, insulting or derogatory comments, and personal or political attacks
 33 | * Public or private harassment
 34 | * Publishing others' private information, such as a physical or email
 35 |   address, without their explicit permission
 36 | * Other conduct which could reasonably be considered inappropriate in a
 37 |   professional setting
 38 | 
 39 | ## Enforcement Responsibilities
 40 | 
 41 | Community leaders are responsible for clarifying and enforcing our standards of
 42 | acceptable behavior and will take appropriate and fair corrective action in
 43 | response to any behavior that they deem inappropriate, threatening, offensive,
 44 | or harmful.
 45 | 
 46 | Community leaders have the right and responsibility to remove, edit, or reject
 47 | comments, commits, code, wiki edits, issues, and other contributions that are
 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
 49 | decisions when appropriate.
 50 | 
 51 | ## Scope
 52 | 
 53 | This Code of Conduct applies within all community spaces, and also applies when
 54 | an individual is officially representing the community in public spaces.
 55 | Examples of representing our community include using an official e-mail address,
 56 | posting via an official social media account, or acting as an appointed
 57 | representative at an online or offline event.
 58 | 
 59 | ## Enforcement
 60 | 
 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
 62 | reported to the community leaders responsible for enforcement by emailing
 63 | [coc-report@coqui.ai](mailto:coc-report@coqui.ai).
 64 | All complaints will be reviewed and investigated promptly and fairly.
 65 | 
 66 | All community leaders are obligated to respect the privacy and security of the
 67 | reporter of any incident.
 68 | 
 69 | ## Enforcement Guidelines
 70 | 
 71 | Community leaders will follow these Community Impact Guidelines in determining
 72 | the consequences for any action they deem in violation of this Code of Conduct:
 73 | 
 74 | ### 1. Correction
 75 | 
 76 | **Community Impact**: Use of inappropriate language or other behavior deemed
 77 | unprofessional or unwelcome in the community.
 78 | 
 79 | **Consequence**: A private, written warning from community leaders, providing
 80 | clarity around the nature of the violation and an explanation of why the
 81 | behavior was inappropriate. A public apology may be requested.
 82 | 
 83 | ### 2. Warning
 84 | 
 85 | **Community Impact**: A violation through a single incident or series
 86 | of actions.
 87 | 
 88 | **Consequence**: A warning with consequences for continued behavior. No
 89 | interaction with the people involved, including unsolicited interaction with
 90 | those enforcing the Code of Conduct, for a specified period of time. This
 91 | includes avoiding interactions in community spaces as well as external channels
 92 | like social media. Violating these terms may lead to a temporary or
 93 | permanent ban.
 94 | 
 95 | ### 3. Temporary Ban
 96 | 
 97 | **Community Impact**: A serious violation of community standards, including
 98 | sustained inappropriate behavior.
 99 | 
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 | 
106 | ### 4. Permanent Ban
107 | 
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 | 
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 | 
115 | ## Attribution
116 | 
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | [https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0].
120 | 
121 | Community Impact Guidelines were inspired by
122 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
123 | 
124 | For answers to common questions about this code of conduct, see the FAQ at
125 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available
126 | at [https://www.contributor-covenant.org/translations][translations].
127 | 
128 | [homepage]: https://www.contributor-covenant.org
129 | [v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html
130 | [Mozilla CoC]: https://github.com/mozilla/diversity
131 | [FAQ]: https://www.contributor-covenant.org/faq
132 | [translations]: https://www.contributor-covenant.org/translations
133 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # 💎 Open Speech Corpora
  2 | 
  3 | A list of open speech corpora for Speech Technology research and development.
  4 | 
  5 | This list prefers free (i.e. no $ cost) and truly open corpora (e.g. released under a [Creative Commons license](https://en.wikipedia.org/wiki/Creative_Commons_license) or a [Community Data License Agreement](https://en.wikipedia.org/wiki/Linux_Foundation#Community_Data_License_Agreement_%28CDLA%29)). Not every corpus listed here meets those criteria, but all of them are accessible and usable for research and/or commercial use.
  6 | 
  7 | Feel free to propose additions to the list!
  8 | 
  9 | *There's a long backlog of corpora to be added in the [Issues](https://github.com/coqui-ai/open-speech-corpora/issues), and Pull Requests are very welcome :)*
 10 | 
 11 | ## 📜 [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
 12 | 
 13 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
 14 | | --- | --- | --- | --- | --- | --- |
 15 | | Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | <https://voice.mozilla.org/en/datasets> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 16 | | Yesno | Hebrew | 6 mins | one male | <http://www.openslr.org/1/> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 17 | | LJ Speech Corpus | English | ~24 hours | [one female](https://librivox.org/reader/11049) | <https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 18 | | NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 19 | | NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 20 | | NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 21 | | NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 22 | | NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 23 | | NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 24 | | NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 25 | | NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 26 | | NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 27 | | NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 28 | | Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
 29 | | Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | <https://commons.wikimedia.org/wiki/Category:Odia_pronunciation> | mostly(?) [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 30 | | Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 31 | | Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
 32 | 
 33 | ## 📜 [CC-BY](https://creativecommons.org/licenses/by/4.0/)
 34 | 
 35 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
 36 | | --- | --- | --- | --- | --- | --- |
 37 | | ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | <http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
 38 | | Althingi Parliamentary Speech Corpus  | Icelandic | 542 hours and 25 minutes | 196 speakers | <http://www.malfong.is/index.php?dlid=73&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
 39 | | Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | <http://www.malfong.is/index.php?dlid=8&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
 40 | | Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | <http://www.malfong.is/index.php?dlid=5&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
 41 | | The Malromur Corpus | Icelandic | 152 hours | 563 speakers | <http://www.malfong.is/index.php?dlid=65&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
 42 | | Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | <http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz> | [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) |
 43 | | African Speech Technology English-English Speech Corpus | English | ~21 hours | | <https://repo.sadilar.org/handle/20.500.12185/283> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
 44 | | African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | <https://repo.sadilar.org/handle/20.500.12185/305> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
 45 | | NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/280> | CC-BY 3.0 |
 46 | | NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/274> | CC-BY 3.0 |
 47 | | NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | <https://repo.sadilar.org/handle/20.500.12185/272> | CC-BY 3.0 |
 48 | | NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | <https://repo.sadilar.org/handle/20.500.12185/279> | CC-BY 3.0 |
 49 | | NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/275> | CC-BY 3.0 |
 50 | | NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/270> | CC-BY 3.0 |
 51 | | NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | <https://repo.sadilar.org/handle/20.500.12185/278> | CC-BY 3.0 |
 52 | | NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/281> | CC-BY 3.0 |
 53 | | NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/271> | CC-BY 3.0 |
 54 | | NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | <https://repo.sadilar.org/handle/20.500.12185/276> | CC-BY 3.0 |
 55 | | NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female/103 male) | <https://repo.sadilar.org/handle/20.500.12185/277> | CC-BY 3.0 |
 56 | | Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins| 20 speakers | <https://repo.sadilar.org/handle/20.500.12185/445> | CC-BY 3.0 |
 57 | | Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | <https://repo.sadilar.org/handle/20.500.12185/448> | CC-BY 3.0 |
 58 | | Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | <https://repo.sadilar.org/handle/20.500.12185/442> | CC-BY 3.0 |
 59 | | LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | <http://www.openslr.org/12/> | CC-BY 4.0 |
 60 | | Zeroth-Korean | Korean | 52.8 hours | 115 speakers | <http://www.openslr.org/40/> | CC-BY 4.0 |
 61 | | Speech Commands | English | 17.8 hours  | >1,000 speakers | <https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html> | CC-BY 4.0 |
 62 | | ParlamentParla | Catalan | 320 hours  |  | <https://www.openslr.org/59/> | CC-BY 4.0 |
 63 | |  SIWIS | French | ~10 hours  | one female | <http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 64 | |  VCTK | English | 44 hours | 109 speakers  | <http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 65 | |  LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male)  | <http://www.openslr.org/60/> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 66 | |  Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | <https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 67 | |  Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | <https://github.com/Helsinki-NLP/prosody> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 68 | | Tuva Speech Database | Norwegian | 24 hours | 40 speakers | <https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang=> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 69 | | COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | <https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 70 | | Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | <https://zenodo.org/record/4110812#.X9j0RmBOkYM> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 71 | | Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
 72 | 
 73 | ## 📜 [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)
 74 | 
 75 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
 76 | | --- | --- | --- | --- | --- | --- |
 77 | | Iban | Iban | 8 hours |  | <http://www.openslr.org/24/> <https://github.com/sarahjuan/iban> | CC-BY-SA 2.0 |
 78 | | Vystadial 2013 | English; Czech | 41 hours; 15 hours |  | <http://www.openslr.org/6/> | CC-BY-SA 3.0 US |
 79 | | Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | <https://lindat.cz/repository/xmlui/handle/11234/1-1740> | CC-BY-SA 4.0 |
 80 | | Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | <https://github.com/Jakobovski/free-spoken-digit-dataset> | CC-BY-SA 4.0 |
 81 | | Google Javanese | Javanese | 296 hours| 1019 speakers| <http://www.openslr.org/35/> | CC-BY-SA 4.0 |
 82 | | Google Nepali | Nepali | 165 hours| 527 speakers| <http://www.openslr.org/54/> | CC-BY-SA 4.0 |
 83 | | Google Bengali | Bengali | 229 hours| 508 speakers| <http://www.openslr.org/53/> | CC-BY-SA 4.0 |
 84 | | Google Sinhala | Sinhala | 224 hours| 478 speakers| <http://www.openslr.org/52/> | CC-BY-SA 4.0 |
 85 | | Google Sundanese | Sundanese | 333 hours| 542 speakers| <http://www.openslr.org/36/> | CC-BY-SA 4.0 |
 86 | | Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | <https://nats.gitlab.io/swc/> | CC-BY-SA 4.0 |
 87 | | Chuvash TTS | Chuvash | 4 hours | 1 speaker | <https://github.com/ftyers/Turkic_TTS> | CC-BY-SA 4.0  |
 88 | | Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz>; male speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz> | CC-BY-SA 4.0  |
 89 | | Malayalam Speech Corpus by [SMC](https://blog.smc.org.in/malayalam-speech-corpus/) | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0  |
 90 | | Google Malayalam | Malayalam | 3.02 hours| 24 speakers| <http://www.openslr.org/63/> | CC-BY-SA 4.0 |
 91 | 
 92 | ## 📜 [CC-BY-ND](https://creativecommons.org/licenses/by-nd/4.0/)
 93 | 
 94 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
 95 | | --- | --- | --- | --- | --- | --- |
 96 | | IBM Recorded Debates v1 | English | 5 hours | 10 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |
 97 | | IBM Recorded Debates v2 | English | ~14 hours  | 14 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |
 98 | 
 99 | ## 📜 [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)
100 | 
101 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
102 | | --- | --- | --- | --- | --- | --- |
103 | | TV3Parla | Catalan | 240 hours  |  | <http://laklak.eu/share/tv3_0.3.tar.gz> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
104 | | Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request  |  | <https://github.com/snakers4/open_stt/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_stt/blob/master/LICENSE)|
105 | | Russian Open TTS Corpus | Russian | 145 hours  | 3 males | <https://github.com/snakers4/open_tts/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_tts/blob/master/LICENSE)|
106 | | OVM – Otázky Václava Moravce | Czech | 35 hours  |  | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3> | [CC-BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/) |
107 | 
108 | ## 📜 [CC-BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/)
109 | 
110 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
111 | | --- | --- | --- | --- | --- | --- |
112 | | CHiME-Home | English | 6.8 hours |  | <https://archive.org/details/chime-home> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
113 | | Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours |  | <http://ota.ox.ac.uk/text/2563.zip> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
114 | 
115 | ## 📜 [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
116 | 
117 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
118 | | --- | --- | --- | --- | --- | --- |
119 | | Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | <https://voice.mozilla.org/en/datasets> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (some audio) / [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) (most audio) / [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) (all text) |
120 | | TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | <http://www.openslr.org/7/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
121 | | TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | <http://www.openslr.org/19/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
122 | | TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | <http://www.openslr.org/51/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
123 | | Pansori TEDxKR | Korean | 3 hours | 41 speakers | <http://www.openslr.org/58/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
124 | | Primewords Mandarin | Mandarin | 100 hours | 296 speakers | <http://www.openslr.org/47/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)|
125 | | MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | <https://ict.fbk.eu/must-c-release-v1-0/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
126 | | Czech Parliament Meetings | Czech | 88 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
127 | | BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | <https://github.com/csikasote/BembaSpeech> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
128 | 
129 | ## 📜 [CDLA-Permissive](https://cdla.io/permissive-1-0/)
130 | 
131 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
132 | | --- | --- | --- | --- | --- | --- |
133 | | DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | <https://s3.amazonaws.com/dipco/DiPCo.tgz> | [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/) |
134 | 
135 | ## 📜 [GNU General Public License](https://www.gnu.org/licenses/gpl.html)
136 | 
137 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
138 | | --- | --- | --- | --- | --- | --- |
139 | | VoxForge | English | ~120 hours | ~2966 speakers | <http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/> <https://voice.mozilla.org/en/datasets> | GNU-GPL 3.0 |
140 | | VoxForge | Russian |  | | <http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/> <http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/>| GNU-GPL 3.0 |
141 | | VoxForge | German |  | | <http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/> | GNU-GPL 3.0 |
142 | 
143 | 
144 | ## 📜 [Apache License](https://www.apache.org/licenses/LICENSE-2.0)
145 | 
146 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
147 | | --- | --- | --- | --- | --- | --- |
148 | | AISHELL-1 | Mandarin | 170 hours | 400 speakers | <http://www.openslr.org/33/> | Apache 2.0 |
149 | | Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | <http://www.openslr.org/46/> | Apache 2.0 |
150 | | African Accented French | French | 22 hours | 232 speakers | <http://www.openslr.org/57/> | Apache 2.0 |
151 | | THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | <http://www.openslr.org/18/> | Apache 2.0 |
152 | | Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
153 | | Living Audio Dataset - English | English | 50:50 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
154 | | Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
155 | | Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
156 | 
157 | 
158 | 
159 | ## 📜 [MIT License](https://opensource.org/licenses/MIT)
160 | 
161 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
162 | | --- | --- | --- | --- | --- | --- |
163 | | ALFFA | Amharic; Hausa (paid); Swahili; Wolof |  |  | <http://www.openslr.org/25/> <https://github.com/besacier/ALFFA_PUBLIC> | MIT |
164 | 
165 | 
166 | ## 📜 [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)
167 | 
168 | 
169 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
170 | | --- | --- | --- | --- | --- | --- |
171 | | M-AILABS German Corpus | German | 237 hours and 22 minutes |  | <http://www.caito.de/data/Training/stt_tts/de_DE.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
172 | | M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes |  | <http://www.caito.de/data/Training/stt_tts/en_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
173 | | M-AILABS US English Corpus | American English | 102 hours and 7 minutes |  | <http://www.caito.de/data/Training/stt_tts/en_US.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
174 | | M-AILABS Spanish Corpus | Spanish (Spain) | 108 hours and 34 minutes |  | <http://www.caito.de/data/Training/stt_tts/es_ES.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
175 | | M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes |  | <http://www.caito.de/data/Training/stt_tts/it_IT.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
176 | | M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes |  | <http://www.caito.de/data/Training/stt_tts/uk_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
177 | | M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes |  | <http://www.caito.de/data/Training/stt_tts/ru_RU.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
178 | | M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes |  | <http://www.caito.de/data/Training/stt_tts/fr_FR.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
179 | | M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes |  | <http://www.caito.de/data/Training/stt_tts/pl_PL.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
180 | 
181 | ## 📜 [Custom License](https://en.wikipedia.org/wiki/Copyright)
182 | 
183 | | CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
184 | | --- | --- | --- | --- | --- | --- |
185 | | Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | <http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz> | [Fluent Speech Commands Public License](https://groups.google.com/a/fluent.ai/forum/#!msg/fluent-speech-commands/MXh_7Y-3QC8/9i2pHPW9AwAJ) |
186 | | CMU Wilderness | 700 languages | ~14,000 hours total; ~20 hours per language (alignments distributed without audio or text) |  | <https://github.com/festvox/datasets-CMU_Wilderness> | <https://live.bible.is/terms> |
187 | | CHiME-5 | English | 50 hours | 48 speakers | <http://spandh.dcs.shef.ac.uk/chime_challenge/data.html> | [CHiME-5 License](http://spandh.dcs.shef.ac.uk/chime_challenge/download.html) |
188 | | Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | <https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access> | [NASA Media Usage Guidelines](https://www.nasa.gov/multimedia/guidelines/index.html) |
189 | | Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | <https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e> | [Microsoft Speech Corpus (Indian Languages) License](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e) |
190 | | Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | <https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187> | [Microsoft Research Data License Agreement](https://msrodr-api.azurewebsites.net//licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/file) |
191 | | Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | <https://research.snips.ai/datasets/keyword-spotting> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
192 | | Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | <https://research.snips.ai/datasets/spoken-language-understanding> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
193 | | CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | <http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz> | [AN4](http://www.speech.cs.cmu.edu/databases/an4/LICENSE.html) |
194 | | FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | <https://ftspeech.dk> | [FT Speech License](https://ftspeech.dk/LICENSE.html) |
195 | | FalaBrasil-LAPS-Constituicao | Brazilian Portuguese | 9 hours | 1 speaker | <https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
196 | | FalaBrasil-LaPSMail | Brazilian Portuguese | 1 hour | 25 speakers | <https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
197 | | FalaBrasil-LaPS Benchmark | Brazilian Portuguese | 1 hour | 1 speaker | <https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
198 | 


--------------------------------------------------------------------------------