├── mosel-logo-white.png
├── .github
│   ├── issue_template.md
│   └── pull_request_template.md
├── README.md
└── LICENSE
/mosel-logo-white.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hlt-mt/mosel/HEAD/mosel-logo-white.png
--------------------------------------------------------------------------------
/.github/issue_template.md:
--------------------------------------------------------------------------------
1 | # Dataset Issue :bug:
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/.github/pull_request_template.md:
--------------------------------------------------------------------------------
1 | # New Dataset Request :grapes:
2 |
3 | ## Required information
4 | Please provide the following required information when submitting a new dataset:
5 |
6 | * **Name:** Official name of the dataset you would like to add
7 | * **Link:** Link to where the data is mainly distributed (it can be GitHub, HuggingFace, and so on)
8 | * **License Type:** Describe the license of the collection (e.g., CC BY 4.0)
9 | * **License Source:** Describe where information about the license of the collection can be found (e.g., a link to the LICENSE page)
10 | * **Source of the speech:** Describe the source(s) from which the speech data has been obtained (e.g., YouTube videos with CC-permissive license)
11 | * **License of the speech:** Describe where information about the license of the speech contained in the collection can be found
12 | * **Hours:** The total number of hours contained in the resource
13 | * **Languages:** The list of languages covered, using two-letter ISO 639-1 codes
14 | * **Label:** Yes or No, whether the dataset contains labels for the corresponding speech
15 |
16 | ## Additional Information
17 |
18 | Please describe any additional information you believe would be useful for users. You can remove or add options in the list below:
19 |
20 | - [ ] *Domain:* e.g., news broadcast, scientific presentation
21 | - [ ] *Type of source speech:* read/natural
22 | - [ ] *Multiple speakers (within each speech segment):* yes/no
23 | - [ ] *Background noise or music:* yes/no
24 | - [ ] *Supported tasks:* e.g., automatic speech recognition, speech translation
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | # MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
5 |
| Name | License | Hours | Languages | Label |
|---|---|---|---|---|
| CommonVoice | CC 0 | 6,732 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| CoVoST2 | CC 0 | 687 | en, fr, it, es, pt, et, nl, sv, lv, sl | ✅ |
| CSS10 | Public Domain | 99 | nl, fi, fr, de, el, hu, es | ✅ |
| EMU | CC BY 3.0 | 56 | pl | ✅ |
| EU Parliament | CC BY 4.0 | 32 | pl | ✅ |
| FLEURS | CC BY 4.0 | 215 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| Large Corpus of Czech Parliament Plenary Hearings | CC BY 4.0 | 444 | cs | ✅ |
| LibriLight | Public Domain | 57,706 | en | ❌ |
| LibriTTS | CC BY 4.0 | 585 | en | ✅ |
| LibriSpeech | CC BY 4.0 | 360 | en | ✅ |
| LibriVoxDeEn | Public Domain | 547 | de | ✅ |
| MC Speech | CC 0 | 22 | pl | ✅ |
| Multilingual LibriSpeech | CC BY 4.0 | 50,687 | nl, en, fr, de, it, pl, pt, es | ✅ |
| SIWIS | CC BY 4.0 | 11 | fr | ✅ |
| Speech Commands | CC BY 4.0 | 18 | en | ✅ |
| VCTK | CC BY 4.0 | 44 | en | ✅ |
| VoxPopuli | CC 0 | 383,500 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ❌ |
| VoxPopuli | CC 0 | 1,791 | hr, cs, nl, en, et, fi, fr, de, hu, it, lt, pl, ro, sk, sl, es | ✅ |
| YouTube-Commons | CC BY 4.0 | 3,261 | bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es | ❌ |
| YouTube-Commons | CC BY 4.0 | 443,396 | bg, cs, nl, en, et, fi, fr, de, el, hu, it, lv, lt, pl, pt, ro, es, sv | ✅ |
| MOSEL :grapes: | CC BY 4.0 | 441,206 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| **Datasets added after MOSEL release** | | | | |
| Yodas | CC BY 3.0 | 369,510 | 149 | ✅ |
| LoquaciousSet | CC BY 3.0/4.0 | 25,000 | en | ✅ |
| Granary :corn: | CC BY 3.0/4.0 | ~1M | bg, cs, da, de, el, en, es, et, fi, fr, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv, uk, ru | ✅ |
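
To query the inventory programmatically, something like the following minimal Python sketch could serve as a starting point. It hard-codes a handful of rows copied from the table above (not the full list) and shows how to select datasets by language code and label availability. The `Dataset` class and the `make`, `parse_hours`, `covering`, and `total_hours` helpers are hypothetical names introduced only for this illustration; they are not part of the MOSEL repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """One row of the table (hypothetical structure, for illustration only)."""
    name: str
    license: str
    hours: int
    languages: frozenset
    labeled: bool


def parse_hours(text: str) -> int:
    """Convert an hours cell such as '57,706' into an integer."""
    return int(text.replace(",", ""))


def make(name, license, hours, languages, labeled):
    """Build a Dataset from the raw cell strings of a table row."""
    codes = frozenset(code.strip() for code in languages.split(","))
    return Dataset(name, license, parse_hours(hours), codes, labeled)


# A few rows copied from the table above; this is NOT the full inventory.
DATASETS = [
    make("CSS10", "Public Domain", "99", "nl, fi, fr, de, el, hu, es", True),
    make("EMU", "CC BY 3.0", "56", "pl", True),
    make("LibriLight", "Public Domain", "57,706", "en", False),
    make("LibriSpeech", "CC BY 4.0", "360", "en", True),
    make("SIWIS", "CC BY 4.0", "11", "fr", True),
]


def covering(datasets, language, labeled_only=False):
    """Return the datasets whose language list contains the given code."""
    return [
        d for d in datasets
        if language in d.languages and (d.labeled or not labeled_only)
    ]


def total_hours(datasets):
    """Sum the hours of whole datasets (not per-language hours)."""
    return sum(d.hours for d in datasets)


if __name__ == "__main__":
    english = covering(DATASETS, "en")
    print([d.name for d in english], total_hours(english))
```

Note that the table reports hours per dataset, not per language, so `total_hours(covering(..., "fr"))` is the combined size of the datasets that include French, which is an upper bound on the French hours rather than the French hours themselves.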