├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── extension.md
├── requirements.txt
└── voxpopuli
    ├── __init__.py
    ├── download_audios.py
    ├── get_asr_data.py
    ├── get_lm_data.py
    ├── get_s2s_data.py
    ├── get_unlabelled_data.py
    ├── segmentation
    │   ├── __init__.py
    │   ├── cut_from_labels.py
    │   ├── cut_with_align_files.py
    │   ├── get_segment_pyannote_speaker.py
    │   └── run_pyannote_sd.py
    ├── text
    │   ├── __init__.py
    │   ├── wer_tools.py
    │   └── word_align_tools.py
    └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Code of Conduct
2 | 
3 | ## Our Pledge
4 | 
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to make participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 
12 | ## Our Standards
13 | 
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 | 
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 | 
23 | Examples of unacceptable behavior by participants include:
24 | 
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 | advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 | address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 | professional setting
33 | 
34 | ## Our Responsibilities
35 | 
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 | 
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 | 
46 | ## Scope
47 | 
48 | This Code of Conduct applies within all project spaces, and it also applies when
49 | an individual is representing the project or its community in public spaces.
50 | Examples of representing a project or community include using an official
51 | project e-mail address, posting via an official social media account, or acting
52 | as an appointed representative at an online or offline event. Representation of
53 | a project may be further defined and clarified by project maintainers.
54 | 
55 | ## Enforcement
56 | 
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at .
All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to voxpopuli 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 26 | disclosure of security bugs. In those cases, please go through the process 27 | outlined on that page and do not file a public issue. 28 | 29 | ## License 30 | By contributing to voxpopuli, you agree that your contributions will be licensed 31 | under the LICENSE file in the root directory of this source tree. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 
14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | Section 1 -- Definitions. 71 | 72 | a. Adapted Material means material subject to Copyright and Similar 73 | Rights that is derived from or based upon the Licensed Material 74 | and in which the Licensed Material is translated, altered, 75 | arranged, transformed, or otherwise modified in a manner requiring 76 | permission under the Copyright and Similar Rights held by the 77 | Licensor. 
For purposes of this Public License, where the Licensed 78 | Material is a musical work, performance, or sound recording, 79 | Adapted Material is always produced where the Licensed Material is 80 | synched in timed relation with a moving image. 81 | 82 | b. Adapter's License means the license You apply to Your Copyright 83 | and Similar Rights in Your contributions to Adapted Material in 84 | accordance with the terms and conditions of this Public License. 85 | 86 | c. Copyright and Similar Rights means copyright and/or similar rights 87 | closely related to copyright including, without limitation, 88 | performance, broadcast, sound recording, and Sui Generis Database 89 | Rights, without regard to how the rights are labeled or 90 | categorized. For purposes of this Public License, the rights 91 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 92 | Rights. 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. NonCommercial means not primarily intended for or directed towards 116 | commercial advantage or monetary compensation. For purposes of 117 | this Public License, the exchange of the Licensed Material for 118 | other material subject to Copyright and Similar Rights by digital 119 | file-sharing or similar means is NonCommercial provided there is 120 | no payment of monetary compensation in connection with the 121 | exchange. 122 | 123 | j. Share means to provide material to the public by any means or 124 | process that requires permission under the Licensed Rights, such 125 | as reproduction, public display, public performance, distribution, 126 | dissemination, communication, or importation, and to make material 127 | available to the public including in ways that members of the 128 | public may access the material from a place and at a time 129 | individually chosen by them. 130 | 131 | k. Sui Generis Database Rights means rights other than copyright 132 | resulting from Directive 96/9/EC of the European Parliament and of 133 | the Council of 11 March 1996 on the legal protection of databases, 134 | as amended and/or succeeded, as well as other essentially 135 | equivalent rights anywhere in the world. 136 | 137 | l. You means the individual or entity exercising the Licensed Rights 138 | under this Public License. Your has a corresponding meaning. 139 | 140 | Section 2 -- Scope. 141 | 142 | a. License grant. 143 | 144 | 1. 
Subject to the terms and conditions of this Public License, 145 | the Licensor hereby grants You a worldwide, royalty-free, 146 | non-sublicensable, non-exclusive, irrevocable license to 147 | exercise the Licensed Rights in the Licensed Material to: 148 | 149 | a. reproduce and Share the Licensed Material, in whole or 150 | in part, for NonCommercial purposes only; and 151 | 152 | b. produce, reproduce, and Share Adapted Material for 153 | NonCommercial purposes only. 154 | 155 | 2. Exceptions and Limitations. For the avoidance of doubt, where 156 | Exceptions and Limitations apply to Your use, this Public 157 | License does not apply, and You do not need to comply with 158 | its terms and conditions. 159 | 160 | 3. Term. The term of this Public License is specified in Section 161 | 6(a). 162 | 163 | 4. Media and formats; technical modifications allowed. The 164 | Licensor authorizes You to exercise the Licensed Rights in 165 | all media and formats whether now known or hereafter created, 166 | and to make technical modifications necessary to do so. The 167 | Licensor waives and/or agrees not to assert any right or 168 | authority to forbid You from making technical modifications 169 | necessary to exercise the Licensed Rights, including 170 | technical modifications necessary to circumvent Effective 171 | Technological Measures. For purposes of this Public License, 172 | simply making modifications authorized by this Section 2(a) 173 | (4) never produces Adapted Material. 174 | 175 | 5. Downstream recipients. 176 | 177 | a. Offer from the Licensor -- Licensed Material. Every 178 | recipient of the Licensed Material automatically 179 | receives an offer from the Licensor to exercise the 180 | Licensed Rights under the terms and conditions of this 181 | Public License. 182 | 183 | b. No downstream restrictions. You may not offer or impose 184 | any additional or different terms or conditions on, or 185 | apply any Effective Technological Measures to, the 186 | Licensed Material if doing so restricts exercise of the 187 | Licensed Rights by any recipient of the Licensed 188 | Material. 189 | 190 | 6. No endorsement. Nothing in this Public License constitutes or 191 | may be construed as permission to assert or imply that You 192 | are, or that Your use of the Licensed Material is, connected 193 | with, or sponsored, endorsed, or granted official status by, 194 | the Licensor or others designated to receive attribution as 195 | provided in Section 3(a)(1)(A)(i). 196 | 197 | b. Other rights. 198 | 199 | 1. Moral rights, such as the right of integrity, are not 200 | licensed under this Public License, nor are publicity, 201 | privacy, and/or other similar personality rights; however, to 202 | the extent possible, the Licensor waives and/or agrees not to 203 | assert any such rights held by the Licensor to the limited 204 | extent necessary to allow You to exercise the Licensed 205 | Rights, but not otherwise. 206 | 207 | 2. Patent and trademark rights are not licensed under this 208 | Public License. 209 | 210 | 3. To the extent possible, the Licensor waives any right to 211 | collect royalties from You for the exercise of the Licensed 212 | Rights, whether directly or through a collecting society 213 | under any voluntary or waivable statutory or compulsory 214 | licensing scheme. In all other cases the Licensor expressly 215 | reserves any right to collect such royalties, including when 216 | the Licensed Material is used other than for NonCommercial 217 | purposes. 
218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material (including in modified 227 | form), You must: 228 | 229 | a. retain the following if it is supplied by the Licensor 230 | with the Licensed Material: 231 | 232 | i. identification of the creator(s) of the Licensed 233 | Material and any others designated to receive 234 | attribution, in any reasonable manner requested by 235 | the Licensor (including by pseudonym if 236 | designated); 237 | 238 | ii. a copyright notice; 239 | 240 | iii. a notice that refers to this Public License; 241 | 242 | iv. a notice that refers to the disclaimer of 243 | warranties; 244 | 245 | v. a URI or hyperlink to the Licensed Material to the 246 | extent reasonably practicable; 247 | 248 | b. indicate if You modified the Licensed Material and 249 | retain an indication of any previous modifications; and 250 | 251 | c. indicate the Licensed Material is licensed under this 252 | Public License, and include the text of, or the URI or 253 | hyperlink to, this Public License. 254 | 255 | 2. You may satisfy the conditions in Section 3(a)(1) in any 256 | reasonable manner based on the medium, means, and context in 257 | which You Share the Licensed Material. For example, it may be 258 | reasonable to satisfy the conditions by providing a URI or 259 | hyperlink to a resource that includes the required 260 | information. 261 | 262 | 3. If requested by the Licensor, You must remove any of the 263 | information required by Section 3(a)(1)(A) to the extent 264 | reasonably practicable. 265 | 266 | 4. If You Share Adapted Material You produce, the Adapter's 267 | License You apply must not prevent recipients of the Adapted 268 | Material from complying with this Public License. 269 | 270 | Section 4 -- Sui Generis Database Rights. 271 | 272 | Where the Licensed Rights include Sui Generis Database Rights that 273 | apply to Your use of the Licensed Material: 274 | 275 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 276 | to extract, reuse, reproduce, and Share all or a substantial 277 | portion of the contents of the database for NonCommercial purposes 278 | only; 279 | 280 | b. if You include all or a substantial portion of the database 281 | contents in a database in which You have Sui Generis Database 282 | Rights, then the database in which You have Sui Generis Database 283 | Rights (but not its individual contents) is Adapted Material; and 284 | 285 | c. You must comply with the conditions in Section 3(a) if You Share 286 | all or a substantial portion of the contents of the database. 287 | 288 | For the avoidance of doubt, this Section 4 supplements and does not 289 | replace Your obligations under this Public License where the Licensed 290 | Rights include other Copyright and Similar Rights. 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | Section 6 -- Term and Termination. 321 | 322 | a. This Public License applies for the term of the Copyright and 323 | Similar Rights licensed here. However, if You fail to comply with 324 | this Public License, then Your rights under this Public License 325 | terminate automatically. 326 | 327 | b. Where Your right to use the Licensed Material has terminated under 328 | Section 6(a), it reinstates: 329 | 330 | 1. automatically as of the date the violation is cured, provided 331 | it is cured within 30 days of Your discovery of the 332 | violation; or 333 | 334 | 2. upon express reinstatement by the Licensor. 335 | 336 | For the avoidance of doubt, this Section 6(b) does not affect any 337 | right the Licensor may have to seek remedies for Your violations 338 | of this Public License. 339 | 340 | c. For the avoidance of doubt, the Licensor may also offer the 341 | Licensed Material under separate terms or conditions or stop 342 | distributing the Licensed Material at any time; however, doing so 343 | will not terminate this Public License. 344 | 345 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 346 | License. 347 | 348 | Section 7 -- Other Terms and Conditions. 349 | 350 | a. The Licensor shall not be bound by any additional or different 351 | terms or conditions communicated by You unless expressly agreed. 352 | 353 | b. Any arrangements, understandings, or agreements regarding the 354 | Licensed Material not stated herein are separate from and 355 | independent of the terms and conditions of this Public License. 356 | 357 | Section 8 -- Interpretation. 358 | 359 | a. For the avoidance of doubt, this Public License does not, and 360 | shall not be interpreted to, reduce, limit, restrict, or impose 361 | conditions on any use of the Licensed Material that could lawfully 362 | be made without permission under this Public License. 363 | 364 | b. To the extent possible, if any provision of this Public License is 365 | deemed unenforceable, it shall be automatically reformed to the 366 | minimum extent necessary to make it enforceable. If the provision 367 | cannot be reformed, it shall be severed from this Public License 368 | without affecting the enforceability of the remaining terms and 369 | conditions. 370 | 371 | c. 
No term or condition of this Public License will be waived and no 372 | failure to comply consented to unless expressly agreed to by the 373 | Licensor. 374 | 375 | d. Nothing in this Public License constitutes or may be interpreted 376 | as a limitation upon, or waiver of, any privileges and immunities 377 | that apply to the Licensor or You, including from the legal 378 | processes of any jurisdiction or authority. 379 | 380 | ======================================================================= 381 | 382 | Creative Commons is not a party to its public 383 | licenses. Notwithstanding, Creative Commons may elect to apply one of 384 | its public licenses to material it publishes and in those instances 385 | will be considered the “Licensor.” The text of the Creative Commons 386 | public licenses is dedicated to the public domain under the CC0 Public 387 | Domain Dedication. Except for the limited purpose of indicating that 388 | material is shared under a Creative Commons public license or as 389 | otherwise permitted by the Creative Commons policies published at 390 | creativecommons.org/policies, Creative Commons does not authorize the 391 | use of the trademark "Creative Commons" or any other trademark or logo 392 | of Creative Commons without its prior written consent including, 393 | without limitation, in connection with any unauthorized modifications 394 | to any of its public licenses or any other arrangements, 395 | understandings, or agreements concerning use of licensed material. For 396 | the avoidance of doubt, this paragraph does not form part of the 397 | public licenses. 398 | 399 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | VoxPopuli 2 | ===== 3 | [https://aclanthology.org/2021.acl-long.80](https://aclanthology.org/2021.acl-long.80) 4 | 5 | A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. 6 | 7 | # Overview 8 | 9 | VoxPopuli provides 10 | - 400K hours of unlabelled speech data for 23 languages 11 | - 1.8K hours of transcribed speech data for 16 languages 12 | - 17.3K hours of speech-to-speech interpretation data for 15x15 directions 13 | - 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents) 14 | 15 | The raw data is collected from 2009-2020 [European Parliament event recordings](https://multimedia.europarl.europa.eu/en/home). 16 | We acknowledge the European Parliament for creating and sharing these materials. 17 | 18 | #### Detailed statistics 19 | 20 |
**Unlabelled and transcribed data**

21 | 22 | | Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens | 23 | |:---:|:---:|:---:|:---:|:---:|:---:|:---:| 24 | | English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M | 25 | | German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M | 26 | | French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M | 27 | | Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M | 28 | | Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M | 29 | | Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M | 30 | | Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M | 31 | | Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M | 32 | | Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M | 33 | | Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M | 34 | | Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M | 35 | | Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K | 36 | | Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M | 37 | | Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M | 38 | | Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M | 39 | | Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M | 40 | | Portuguese | Pt | 4.4K/17.5K | - | - | - | - | 41 | | Bulgarian | Bg | 4.3K/17.6K | - | - | - | - | 42 | | Greek | El | 4.4K/17.7K | - | - | - | - | 43 | | Latvian | Lv | 4.4K/13.1K | - | - | - | - | 44 | | Maltese | Mt | 4.4K/9.1K | - | - | - | - | 45 | | Swedish | Sv | 4.5K/16.3K | - | - | - | - | 46 | | Danish | Da | 4.3K/13.6K | - | - | - | - | 47 | | Total | | 100K/384K | 1791 | 4295 | 15M | 467M | 48 | 49 |

50 | 51 |
**Speech-to-speech interpretation data**

52 | 53 | | Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total | 54 | |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| 55 | | En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K | 56 | | De | 187 | - | 196 | 204 | 214 | 217 | 198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K | 57 | | Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K | 58 | | Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K | 59 | | Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 | 60 | | It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 | 61 | | Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 | 62 | | Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 | 63 | | Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 | 64 | | Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 | 65 | | Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 | 66 | | Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 | 67 | | Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 | 68 | | Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 | 69 | | Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 | 70 | | Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K | 71 | 72 |

73 | 74 |
**Accented speech transcribed data**

75 | 76 | | Accent | Code | Transcribed Hours | Transcribed Speakers | 77 | |:---:|:---:|:---:|:---:| 78 | | Dutch | en_nl | 3.52 | 45 | 79 | | German | en_de | 3.52 | 84 | 80 | | Czech | en_cs | 3.30 | 26 | 81 | | Polish | en_pl | 3.23 | 33 | 82 | | French | en_fr | 2.56 | 27 | 83 | | Hungarian | en_hu | 2.33 | 23 | 84 | | Finnish | en_fi | 2.18 | 20 | 85 | | Romanian | en_ro | 1.85 | 27 | 86 | | Slovak | en_sk | 1.46 | 17 | 87 | | Spanish | en_es | 1.42 | 18 | 88 | | Italian | en_it | 1.11 | 15 | 89 | | Estonian | en_et | 1.08 | 6 | 90 | | Lithuanian | en_lt | 0.65 | 7 | 91 | | Croatian | en_hr | 0.42 | 9 | 92 | | Slovene | en_sl | 0.25 | 7 | 93 | 94 |

95 | 96 | # What's New 97 | - __2022-02-01__: New labelled accented English speech data released. 98 | - __2022-01-15__: New [wav2vec 2.0 pre-trained models](https://github.com/facebookresearch/voxpopuli#wav2vec-20) released. 99 | - __2021-07-26__: New unlabelled data (additional 300K hours) released. 100 | - __2021-03-03__: VoxPopuli released. 101 | 102 | # Getting Data 103 | We provide raw audios as well as scripts to segment and align them with transcription/interpretation. The output format 104 | is [Ogg Vorbis](https://en.wikipedia.org/wiki/Vorbis) (16000Hz, 16-bit, mono-channel), 105 | which is supported by common libraries such as `libsndfile` and `libsox` (they have Python frontends 106 | by [soundfile](https://github.com/bastibe/python-soundfile), [torchaudio](https://github.com/pytorch/audio), etc.). 107 | 108 | As the first step, clone this repo for the processing scripts 109 | ```bash 110 | git clone https://github.com/facebookresearch/voxpopuli.git 111 | ``` 112 | and install required PyPI packages: 113 | ```bash 114 | pip install -r requirements.txt 115 | ``` 116 | 117 | 118 | ### Unlabelled Data 119 | First, download raw audios via 120 | ```bash 121 | python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET] 122 | ``` 123 | which saves audios to `${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg`. 124 | 125 | `SUBSET` specifies the data subset to download: 126 | 127 | | --subset | # Languages | Hours | Years | Size | 128 | |:---:|:---:|:---:|:---:|:---:| 129 | | en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G | 130 | | en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G | 131 | | 10k | 23 | 10K | 2019-2020 | 170G | 132 | | 100k | 23 | 100K | 2009-2020 | 1.7T | 133 | | 400k | 23 | 400K | 2009-2020 | 6.4T | 134 | 135 | Then, segment these audios via 136 | ```bash 137 | python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET] 138 | ``` 139 | which outputs to `${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg` 140 | 141 | ### Transcribed (ASR) Data 142 | First, download raw audios via 143 | ```bash 144 | python -m voxpopuli.download_audios --root [ROOT] --subset asr 145 | ``` 146 | which saves audios to `${ROOT}/raw_audios/original/[year]/[recording_id].ogg`. 147 | 148 | Then, segment these audios and align them with transcripts via 149 | ```bash 150 | python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE] 151 | ``` 152 | which outputs 153 | - audios `${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg` 154 | - per-split manifest (ID, transcript, speaker ID) `${ROOT}/transcribed_data/[language]/asr_[split].tsv` 155 | 156 | **Accented transcribed data** 157 | To retrieve the transcribed accented speech data, follow the above steps with `--lang [LANGUAGE]_accented` (e.g. `--lang en_accented`). 158 | Note that the accented speech data is only composed of a test set for now. 159 | 160 | ### Speech-to-Speech Interpretation Data 161 | First, follow the instructions above to set up ASR data (source audios and transcripts). 162 | 163 | Then, download target audios via 164 | ```bash 165 | python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE] 166 | ``` 167 | which saves audios to `${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg`. 
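The recordings downloaded in the steps above are plain Ogg Vorbis files, so they can be sanity-checked with any of the frontends mentioned earlier before running the segmentation scripts. Below is a minimal sketch using soundfile; the file path is hypothetical and should be replaced with an actual file under `${ROOT}/raw_audios/`.

```python
# Minimal sketch: sanity-check a downloaded VoxPopuli recording with soundfile.
# The path is hypothetical -- substitute a real file under ${ROOT}/raw_audios/.
import soundfile as sf

path = "raw_audios/en/2020/recording.ogg"  # hypothetical recording ID

info = sf.info(path)
print(info.samplerate, info.channels, info.format)  # VoxPopuli releases: 16000, 1, OGG

# Read the first ten seconds as a float32 NumPy array.
wav, sr = sf.read(path, frames=16000 * 10, dtype="float32")
assert sr == 16000 and wav.ndim == 1
```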
168 | 169 | Finally, segment these audios and match them with source ones via 170 | ```bash 171 | python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE] 172 | ``` 173 | which outputs 174 | - target audios `${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg` 175 | - manifest (source ID, transcript, speaker ID, target ID) `${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv` 176 | 177 | We also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments. 178 | To use them instead of machine transcriptions in the alignments, add `--use-annotated-target` to the command line. 179 | 180 | ### Language Modeling (LM) Data 181 | We combine VoxPopuli transcripts and text data from [Europarl](https://www.statmt.org/europarl/) for LM training. 182 | 183 | Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via 184 | ```bash 185 | python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE] 186 | ``` 187 | which outputs 188 | - sentences `${ROOT}/lm_data/[language]/sentences.txt` 189 | - vocabulary `${ROOT}/lm_data/[language]/vocabulary.txt` 190 | 191 | To train an n-gram LM with [KenLM](https://github.com/kpu/kenlm), run 192 | ```bash 193 | ${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa 194 | ${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin 195 | ``` 196 | 197 | # Pre-trained Models 198 | ## wav2vec 2.0 199 | We provide pre-trained wav2vec 2.0 models 200 | (implemented in [fairseq](https://github.com/pytorch/fairseq) and [wav2letter/flashlight](https://github.com/facebookresearch/flashlight)) 201 | for downstream speech tasks. Each language is covered by a monolingual _Base_ model and multilingual _Large_ models that 202 | combine languages in the same family or all languages. See also [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) 203 | for larger-scale (up to 2B) multilingual models trained on VoxPopuli (400K hours). 204 | 205 |
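As a quick way to verify a downloaded checkpoint, the sketch below loads a wav2vec 2.0 model with fairseq and extracts frame-level features. This is only an illustrative sketch: it assumes `pip install fairseq`, relies on fairseq's `checkpoint_utils` API (which may change between versions), and uses a checkpoint file name from the table below as a placeholder.

```python
# Minimal sketch: load a VoxPopuli wav2vec 2.0 checkpoint with fairseq and
# extract frame-level features (fairseq installed; path is a placeholder).
import torch
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec2_base_10k.pt"]  # any checkpoint from the table below
)
model = models[0].eval()

wav = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz mono audio
with torch.no_grad():
    out = model(wav, features_only=True, mask=False)
print(out["x"].shape)  # (batch, frames, dim): 768 for Base, 1024 for Large
```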
**Download**

206 | 207 | | Language(s) | Family | PT Hours | Base Model (95M) | Large Model (317M) | 208 | |:----------------:|:--------------:|:----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| 209 | | Es (V1/V2) | Romance | 4.4K/21.4K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_es.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 210 | | Fr (V1/V2) | Romance | 4.5K/22.8K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_fr.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 211 | | It (V1/V2) | Romance | 4.6K/21.9K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_it.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 212 | | Pt (V2) | Romance | 17.5K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pt_v2.pt) | [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 213 | | Ro (V2) | Romance | 17.9K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_ro_v2.pt) | [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 214 | | Nl (V1/V2) | West Germanic | 4.5K/19.0K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_nl.pt) / [V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 215 | | En (V2) | West Germanic | 24.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_en_v2.pt) | [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 216 | | De (V2) | West Germanic | 23.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_de_v2.pt) | [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 217 | | Sv (V1/V2) | North Germanic | 4.5K/16.3K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_sv.pt) / [V2 North Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt) | 218 | | Da (V2) | North Germanic | 13.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_da_v2.pt) | [fairseq V2 North 
Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt) | 219 | | Bg (V2) | Slavic | 17.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_bg_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 220 | | Cs (V2) | Slavic | 18.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_cs_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 221 | | Hr (V2) | Slavic | 8.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hr_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 222 | | Pl (V2) | Slavic | 21.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pl_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 223 | | Sk (V2) | Slavic | 12.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sk_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 224 | | Sl (V2) | Slavic | 11.3K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sl_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 225 | | Et (V2) | Uralic | 10.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_et_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 226 | | Fi (V2) | Uralic | 14.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fi_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 227 | | Hu (V2) | Uralic | 17.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hu_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 228 | | Lv (V2) | Baltic | 13.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lv_v2.pt) | [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt) | 229 | | Lt (V2) | Baltic | 14.4K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lt_v2.pt) | [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt) | 230 | | El (V2) | Greek | 17.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_el_v2.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_el_v2.pt) | 231 | | Mt (V2) | Semitic | 9.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_mt_v2.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_mt_v2.pt) | 232 | | All 23 languages | - | 10K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_10k.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_10k.pt) | 233 | | All 23 languages | - | 100K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_100k.pt) / [wav2letter](https://dl.fbaipublicfiles.com/voxpopuli/vox_populi_100k_500iters.tar.gz) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_100k.pt) | 234 | 235 |

236 | 
237 | In [our paper](https://arxiv.org/pdf/2101.00390.pdf) (Section 4.3.1), we evaluated a subset of these models on the [Common Voice](https://commonvoice.mozilla.org/) corpus
238 | in the normal setting and the [few-shot phoneme recognition setting](https://github.com/facebookresearch/CPC_audio#cross-lingual-transfer).
239 | 
240 | ## Wav2letter C++ implementation
241 | 
242 | A wav2letter implementation, as well as a checkpoint pretrained on VoxPopuli 100k (Base model), is also available in the [Wav2letter repository](https://github.com/flashlight/wav2letter/tree/master/recipes/joint_training_vox_populi).
243 | 
244 | The complete fine-tuned ASR baselines for this codebase should be available soon.
245 | The wav2letter implementation follows [this paper](https://arxiv.org/abs/2011.00093).
246 | 
247 | ## ASR and LM
248 | For the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec2 models (Base 10K) as well as n-gram LMs (trained with [KenLM](https://github.com/kpu/kenlm)) and their lexicons.
249 | 
250 |
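The released `.bin` language models can be queried directly from Python through the KenLM bindings. A minimal sketch, assuming the `kenlm` package is installed (`pip install kenlm`) and using a file name from the table below:

```python
# Minimal sketch: query a downloaded VoxPopuli n-gram LM via the kenlm
# Python bindings (assumes `pip install kenlm`; file name is a placeholder).
import kenlm

lm = kenlm.Model("en_5gram_lm.bin")
sentence = "the european parliament"
print(lm.score(sentence, bos=True, eos=True))  # total log10 probability
print(lm.perplexity(sentence))
```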
**Download**

251 | 252 | | Language | ASR (fairseq) | LM (kenLM) | Lexicon | 253 | |:---:|:---:|:---:|:---:| 254 | | Cs | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_cs.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_cs.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_lm.lexicon) | 255 | | De | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_de.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_de.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_lm.lexicon) | 256 | | En | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_en.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_en.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_lm.lexicon) | 257 | | Es | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_es.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_es.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_lm.lexicon) | 258 | | Et | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_et.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_et.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_lm.lexicon) | 259 | | Fi | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fi.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fi.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_lm.lexicon) | 260 | | Fr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fr.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_lm.lexicon) | 261 | | Hr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hr.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_lm.lexicon) | 262 | | Hu | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hu.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hu.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_lm.lexicon) | 263 | | It | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_it.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_it.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_lm.lexicon) | 264 | | Lt | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_lt.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_lt.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_lm.lexicon) | 265 | | Nl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_nl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_nl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_lm.lexicon) | 266 | | Pl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_pl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_pl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_lm.lexicon) | 267 | | Ro | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_ro.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_ro.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_lm.lexicon) | 268 | | Sk | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sk.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sk.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_lm.lexicon) | 269 | | Sl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_lm.lexicon) | 270 | 271 |

272 | 273 | We also provide [CoVoST 2](https://github.com/facebookresearch/covost) + 274 | [EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ASR Transformer models that are self-trained on 3000h VoxPopuli 275 | unlabelled data. 276 | 277 |
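The WER figures below follow the standard definition: word-level edit distance divided by the number of reference words. A minimal sketch using the `editdistance` package from `requirements.txt` (an illustrative helper, not the repo's own `voxpopuli/text/wer_tools.py`):

```python
# Minimal sketch: word error rate via the `editdistance` package (listed in
# requirements.txt). Illustrative only; not the repo's wer_tools implementation.
import editdistance

def wer(ref: str, hyp: str) -> float:
    ref_words, hyp_words = ref.split(), hyp.split()
    return editdistance.eval(ref_words, hyp_words) / max(len(ref_words), 1)

print(wer("the european parliament", "the europea parliament"))  # ~0.333
```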
**Download**

278 | 279 | | Language | CoVoST 2 Test (WER) | EuroParl-ST Test (WER) | Model (fairseq) | 280 | |:---:|:---:|:---:|:---:| 281 | | De | 17.3 | 21.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de.tar) | 282 | | Es | 13.2 | 15.3 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es.tar) | 283 | | Fr | 17.0 | 19.0 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr.tar) | 284 | 285 |

286 | 287 | Please refer to the [S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use 288 | of Transformer model checkpoints. 289 | 290 | ## Speech-to-Text Translation (ST) 291 | We provide [CoVoST 2](https://github.com/facebookresearch/covost) + 292 | [EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ST Transformer models that are jointly trained with 400h VoxPopuli 293 | weakly labelled data. 294 | 295 |
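The BLEU figures below are corpus-level scores. A minimal sketch of scoring system output with sacrebleu (an assumption: sacrebleu is not a dependency of this repo and must be installed separately; the sentences are made-up placeholders):

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (`pip install sacrebleu`
# assumed; sentences are made-up placeholders).
import sacrebleu

hyps = ["the commission supports this proposal"]
refs = [["the commission supports the proposal"]]  # one reference stream
print(sacrebleu.corpus_bleu(hyps, refs).score)
```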
**Download**

296 | 297 | | Direction | CoVoST 2 Test (BLEU) | EuroParl-ST Test (BLEU) | Model (fairseq) | 298 | |:---:|:---:|:---:|:---:| 299 | | De-En | 23.4 | 24.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de-en.tar) | 300 | | Es-En | 29.7 | 28.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es-en.tar) | 301 | | Fr-En | 30.3 | 31.1 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr-en.tar) | 302 | 303 |

304 | 
305 | Please refer to the
306 | [S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use of these checkpoints.
307 | 
308 | # License
309 | | | License |
310 | |:---:|:---:|
311 | | VoxPopuli Data | [CC0](https://creativecommons.org/share-your-work/public-domain/cc0/) (see also European Parliament's [legal notice](https://www.europarl.europa.eu/legal-notice/en/) for the raw data) |
312 | | LM Data | (Please check out the [Europarl website](https://www.statmt.org/europarl/) for the Europarl portion) |
313 | | Pre-trained Models | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |
314 | | Code | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |
315 | 
316 | # Contact
317 | Changhan Wang (changhan@fb.com), Morgane Rivière (mriviere@fb.com), Ann Lee (annl@fb.com)
318 | 
319 | # Citation
320 | ```
321 | @inproceedings{wang-etal-2021-voxpopuli,
322 | title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
323 | author = "Wang, Changhan and
324 | Riviere, Morgane and
325 | Lee, Ann and
326 | Wu, Anne and
327 | Talnikar, Chaitanya and
328 | Haziza, Daniel and
329 | Williamson, Mary and
330 | Pino, Juan and
331 | Dupoux, Emmanuel",
332 | booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
333 | month = aug,
334 | year = "2021",
335 | address = "Online",
336 | publisher = "Association for Computational Linguistics",
337 | url = "https://aclanthology.org/2021.acl-long.80",
338 | pages = "993--1003",
339 | }
340 | ```
341 | 
--------------------------------------------------------------------------------
/extension.md:
--------------------------------------------------------------------------------
1 | # Extension
2 | 
3 | We provide additional scripts for customizing our data processing pipelines.
4 | 
5 | ### [Experimental] Segmenting Unlabelled Data with Speaker Diarization
6 | 
7 | Our current pipeline segments unlabelled data with a voice activity detection (VAD) algorithm, which has no awareness
8 | of the speakers. As a result, the output clips may contain speaker changes, which can be undesirable for downstream applications.
9 | We propose a two-step segmentation (speaker diarization followed by VAD) to mitigate this issue.
10 | 
11 | First, apply the speaker diarization (SD) model provided by pyannote:
12 | 
13 | ```bash
14 | python -m voxpopuli.segmentation.run_pyannote_sd \
15 | --root [ROOT] -l [LANGUAGE_LIST] \
16 | --segment-min [MIN_SEGMENT_DURATION_IN_SECONDS]
17 | ```
18 | 
19 | Then, apply VAD on top of the SD outputs to segment the audios:
20 | ```bash
21 | python -m voxpopuli.segmentation.get_segment_pyannote_speaker \
22 | --root [ROOT] --languages [LANGUAGE_LIST] -o [OUTPUT_DIR] \
23 | --max-dur-vad [MAX_SEGMENT_DURATION_IN_SECONDS]
24 | ```
25 | 
26 | We also provide pre-computed segments on the 10k subset. Apply the segmentation directly via
27 | ```bash
28 | python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset 10k_sd
29 | ```
30 | which outputs to `${ROOT}/unlabelled_data_sd/[language]/[year]/[segment_id].ogg`
31 | 
32 | ### Customizing Force-Alignment for Transcribed (ASR) Data
33 | 
34 | To segment the labelled data you will need the decoded texts corresponding to each audio segment.
35 | They are available upon request: please contact us or post an issue.
36 | 37 | If you want to use the force-aligned text for any purpose (like VAD),
38 | it is available [here](https://dl.fbaipublicfiles.com/voxpopuli/align_data.tar.gz).
39 |
40 | To segment paragraphs into utterances for the given language $LANG, run:
41 |
42 | ```bash
43 | python -m voxpopuli.segmentation.cut_with_align_files \
44 |     --dir_wer ${DIR_DOWNLOAD_WER}/${LANG}/wer \
45 |     --dir_align ${DIR_DOWNLOAD_WER}/${LANG}/align/ \
46 |     --dir_audio $VOX_POPULI_DIR \
47 |     -o $OUTPUT_DIRECTORY \
48 |     --path_chars ${DIR_DOWNLOAD}/${LANG}/${LANG}_grapheme.tokens \
49 |     --lang $LANG
50 | ```
51 | -------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
1 | tqdm
2 | torchaudio
3 | num2words
4 | edlib
5 | editdistance
6 | -------------------------------------------------------------------------------- /voxpopuli/__init__.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | LANGUAGES = [
7 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
8 |     "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da"
9 | ]
10 | LANGUAGES_V2 = [f"{x}_v2" for x in LANGUAGES]
11 |
12 | YEARS = list(range(2009, 2020 + 1))
13 |
14 | ASR_LANGUAGES = [
15 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
16 |     "sk", "sl", "et", "lt"
17 | ]
18 | ASR_ACCENTED_LANGUAGES = [
19 |     "en_accented"
20 | ]
21 |
22 | S2S_SRC_LANGUAGES = ASR_LANGUAGES
23 |
24 | S2S_TGT_LANGUAGES = [
25 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
26 |     "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da"
27 | ]
28 |
29 | S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION = ["en", "fr", "es"]
30 |
31 | DOWNLOAD_BASE_URL = "https://dl.fbaipublicfiles.com/voxpopuli"
32 | -------------------------------------------------------------------------------- /voxpopuli/download_audios.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
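# Example invocation (flags match the argparse options defined below; [ROOT]
# is a placeholder for the data root, as in the README):
#   python -m voxpopuli.download_audios --root [ROOT] --subset 10k
# Archives are fetched from DOWNLOAD_BASE_URL and extracted under ${root}/raw_audios/.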
5 | 6 | import argparse 7 | import os 8 | from pathlib import Path 9 | 10 | from tqdm import tqdm 11 | from torchaudio.datasets.utils import download_url, extract_archive 12 | 13 | from voxpopuli import LANGUAGES, LANGUAGES_V2, YEARS, DOWNLOAD_BASE_URL 14 | 15 | 16 | def get_args(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument( 19 | "--root", "-r", type=str, required=True, help="data root path" 20 | ) 21 | parser.add_argument( 22 | "--subset", "-s", type=str, required=True, 23 | choices=["400k", "100k", "10k", "asr"] + LANGUAGES + LANGUAGES_V2, 24 | help="data subset to download" 25 | ) 26 | return parser.parse_args() 27 | 28 | 29 | def download(args): 30 | if args.subset in LANGUAGES_V2: 31 | languages = [args.subset.split("_")[0]] 32 | years = YEARS + [f"{y}_2" for y in YEARS] 33 | elif args.subset in LANGUAGES: 34 | languages = [args.subset] 35 | years = YEARS 36 | else: 37 | languages = { 38 | "400k": LANGUAGES, 39 | "100k": LANGUAGES, 40 | "10k": LANGUAGES, 41 | "asr": ["original"] 42 | }.get(args.subset, None) 43 | years = { 44 | "400k": YEARS + [f"{y}_2" for y in YEARS], 45 | "100k": YEARS, 46 | "10k": [2019, 2020], 47 | "asr": YEARS 48 | }.get(args.subset, None) 49 | 50 | url_list = [] 51 | for l in languages: 52 | for y in years: 53 | url_list.append(f"{DOWNLOAD_BASE_URL}/audios/{l}_{y}.tar") 54 | 55 | out_root = Path(args.root) / "raw_audios" 56 | out_root.mkdir(exist_ok=True, parents=True) 57 | print(f"{len(url_list)} files to download...") 58 | for url in tqdm(url_list): 59 | tar_path = out_root / Path(url).name 60 | download_url(url, out_root.as_posix(), Path(url).name) 61 | extract_archive(tar_path.as_posix()) 62 | os.remove(tar_path) 63 | 64 | 65 | def main(): 66 | args = get_args() 67 | download(args) 68 | 69 | 70 | if __name__ == '__main__': 71 | main() 72 | -------------------------------------------------------------------------------- /voxpopuli/get_asr_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
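# Example invocation (flags match the argparse options defined below; [ROOT]
# is a placeholder for the data root):
#   python -m voxpopuli.get_asr_data --root [ROOT] --lang en
# Downloads the ASR metadata TSV, cuts per-utterance segments out of the
# original session audio and writes per-split asr_{train,dev,test}.tsv manifests.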
5 | import csv
6 | import argparse
7 | from tqdm import tqdm
8 | from ast import literal_eval
9 | import gzip
10 | from pathlib import Path
11 | from typing import Dict, List, Tuple
12 | from collections import defaultdict
13 |
14 | import torch
15 | import torchaudio
16 | from torchaudio.datasets.utils import download_url
17 |
18 | from voxpopuli import ASR_LANGUAGES, ASR_ACCENTED_LANGUAGES, DOWNLOAD_BASE_URL
19 | from voxpopuli.utils import multiprocess_run
20 |
21 |
22 | SPLITS = ["train", "dev", "test"]
23 |
24 |
25 | def cut_session(info: Tuple[str, Dict[str, List[Tuple[float, float]]]]) -> None:
26 |     in_path, out_path_to_timestamps = info
27 |     waveform, sr = torchaudio.load(in_path)
28 |     duration = waveform.size(1)
29 |     for out_path, timestamps in out_path_to_timestamps.items():
30 |         segment = torch.cat(
31 |             [waveform[:, int(s * sr): min(int(t * sr), duration)]
32 |              for s, t in timestamps],
33 |             dim=1
34 |         )
35 |         torchaudio.save(out_path, segment, sr)
36 |
37 |
38 | def get(args):
39 |     in_root = Path(args.root) / "raw_audios" / "original"
40 |     out_root = Path(args.root) / "transcribed_data" / args.lang
41 |     out_root.mkdir(exist_ok=True, parents=True)
42 |     # Get metadata TSV
43 |     url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{args.lang}.tsv.gz"
44 |     tsv_path = out_root / Path(url).name
45 |     if not tsv_path.exists():
46 |         download_url(url, out_root.as_posix(), Path(url).name)
47 |     with gzip.open(tsv_path, "rt") as f:
48 |         metadata = [x for x in csv.DictReader(f, delimiter="|")]
49 |     # Get segment into list
50 |     items = defaultdict(dict)
51 |     manifest = []
52 |     for r in tqdm(metadata):
53 |         split = r["split"]
54 |         if split not in SPLITS:
55 |             continue
56 |         event_id = r["session_id"]
57 |         year = event_id[:4]
58 |         in_path = in_root / year / f"{event_id}_original.ogg"
59 |         cur_out_root = out_root / year
60 |         cur_out_root.mkdir(exist_ok=True, parents=True)
61 |         out_path = cur_out_root / "{}-{}.ogg".format(event_id, r["id_"])
62 |         timestamps = [(t[0], t[1]) for t in literal_eval(r["vad"])]
63 |         items[in_path.as_posix()][out_path.as_posix()] = timestamps
64 |         manifest.append(
65 |             (
66 |                 out_path.stem,
67 |                 r["original_text"],
68 |                 r["normed_text"],
69 |                 r["speaker_id"],
70 |                 split,
71 |                 r["gender"],
72 |                 r.get("is_gold_transcript", str(False)),
73 |                 r.get("accent", str(None))
74 |             )
75 |         )
76 |     items = list(items.items())
77 |     # Segment
78 |     multiprocess_run(items, cut_session)
79 |     # Output per-split manifest
80 |     header = [
81 |         "id", "raw_text", "normalized_text", "speaker_id", "split",
82 |         "gender", "is_gold_transcript", "accent"
83 |     ]
84 |     for split in SPLITS:
85 |         with open(out_root / f"asr_{split}.tsv", "w") as f_o:
86 |             f_o.write("\t".join(header) + "\n")
87 |             for cols in manifest:
88 |                 if cols[4] == split:
89 |                     f_o.write("\t".join(cols) + "\n")
90 |
91 |
92 | def get_args():
93 |     parser = argparse.ArgumentParser("Prepare transcribed data")
94 |     parser.add_argument(
95 |         "--root",
96 |         help="data root path",
97 |         type=str,
98 |         required=True,
99 |     )
100 |     parser.add_argument(
101 |         "--lang",
102 |         required=True,
103 |         type=str,
104 |         choices=ASR_LANGUAGES + ASR_ACCENTED_LANGUAGES,
105 |     )
106 |     return parser.parse_args()
107 |
108 |
109 | def main():
110 |     args = get_args()
111 |     get(args)
112 |
113 |
114 | if __name__ == "__main__":
115 |     main()
116 | -------------------------------------------------------------------------------- /voxpopuli/get_lm_data.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc.
and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import csv 8 | import gzip 9 | import logging 10 | from multiprocessing import Pool 11 | import re 12 | import os 13 | import string 14 | from typing import List, Optional, Set, Tuple 15 | from pathlib import Path 16 | import tarfile 17 | 18 | from num2words import num2words 19 | import tqdm 20 | from torchaudio.datasets.utils import download_url 21 | 22 | from voxpopuli.text import ( 23 | LANG_TOKENS, 24 | REMOVE_TRANSLATOR, 25 | SPACE_TRANSLATOR, 26 | SPACE, 27 | WHITESPACE_NORMALIZER, 28 | is_valid_text, 29 | ) 30 | from voxpopuli import DOWNLOAD_BASE_URL 31 | 32 | PUNCTUATIONS_TO_REMOVE = ( 33 | string.punctuation.replace("'", "") 34 | .replace("-", "") 35 | .replace("–", "") 36 | .replace("/", "") 37 | + "«»‟″“”…‘•„‚≤ᵉ" 38 | ) 39 | PUNCTUATIONS_TO_SPACE = "-/–·" 40 | 41 | 42 | def remove_parentheses(text: str) -> str: 43 | # remove all substring within () or [] 44 | out = "" 45 | num_p = 0 46 | start_i = 0 47 | for i, c in enumerate(text): 48 | if c == "(" or c == "[": 49 | if num_p == 0 and i > start_i: 50 | out += text[start_i:i] 51 | num_p += 1 52 | elif c == ")" or c == "]": 53 | num_p -= 1 54 | if num_p == 0: 55 | start_i = i + 1 56 | 57 | if len(text) > start_i: 58 | out += text[start_i:] 59 | 60 | return out 61 | 62 | 63 | def digit2text(text: str, lang: str) -> str: 64 | out = text.strip(" ") 65 | if len(text) == 0 or all([not c.isdigit() for c in text]): 66 | return text 67 | 68 | # remove leading and trailing punctuations 69 | is_negative = text[0] == "-" 70 | out = text.lstrip((string.punctuation)) 71 | out_tmp = out.rstrip((string.punctuation)) 72 | suffix = "" if out == out_tmp else out[len(out_tmp) :] 73 | out = out_tmp.replace(",", ".") 74 | out = out.replace(":", ".") 75 | 76 | # leading characters, e.g. a10, h1n1, $10 77 | m = re.search(r"^(\D+)", out) 78 | if m: 79 | prefix = m.groups()[0] 80 | return prefix + " " + digit2text(out[len(prefix) :], lang) + suffix 81 | 82 | # leading digits, e.g. 50th, 1900s 83 | to_format = "cardinal" 84 | # trailing characters as ordinal numbers, e.g. 50th 85 | # TODO: more rules for multiple languages, e.g. date 86 | m = re.search(r"\b(\d+)(st|nd|th)\b", out.lower()) 87 | if m: 88 | to_format = "ordinal" 89 | out = m.groups()[0] 90 | 91 | # different cases for xx.xx 92 | if "." in out: 93 | segs = out.split(".") 94 | if all([len(s) == 3 for s in segs[1:]]): # 12.000.000 95 | out = out.replace(".", "") 96 | else: # date 18.4.2009, IP address, time 18.30, etc. 
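            # e.g. "18.4.2009" falls through to this branch: each dot-separated
            # piece is verbalized by a recursive call and the results are
            # re-joined with spaces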
97 | norm_segs = [] 98 | for s in segs: 99 | norm_segs.append(digit2text(s, lang)) 100 | return " ".join(norm_segs) + suffix 101 | 102 | m = re.search(r"\b(\d+)(\D+)", out) 103 | if m: 104 | suffix = " " + digit2text(out[len(m.groups()[0]) :], lang) + suffix 105 | out = m.groups()[0] 106 | 107 | if is_negative: 108 | out = "-" + out 109 | 110 | try: 111 | num = int(out) 112 | except ValueError: 113 | try: 114 | num = float(out) 115 | except Exception as e: 116 | num = out 117 | logging.warning(f"cannot transform '{out}' to numbers") 118 | 119 | try: 120 | d = num2words(num, lang=lang, to=to_format) 121 | except NotImplementedError: # lang not supported, default to en 122 | assert lang != "en" 123 | d = digit2text(out, lang="en") 124 | except Exception as e: 125 | d = "" 126 | logging.warning(f"cannot process {out} ({num}) with {lang} in {to_format} mode") 127 | 128 | if suffix: 129 | d = d + suffix 130 | 131 | return d 132 | 133 | 134 | def process_digits(text: str, lang: str) -> str: 135 | words = text.split() 136 | out = [digit2text(w, lang) for w in words] 137 | 138 | return " ".join(out) 139 | 140 | 141 | def load_from_tsv_gz(in_file: Path) -> List[str]: 142 | output = [] 143 | with gzip.open(in_file, "rt") as f: 144 | reader = csv.DictReader( 145 | f, 146 | delimiter="|", 147 | quotechar=None, 148 | doublequote=False, 149 | lineterminator="\n", 150 | quoting=csv.QUOTE_NONE, 151 | ) 152 | 153 | for e in reader: 154 | e = dict(e) 155 | if e["split"] != "train": 156 | continue 157 | text = e["normed_text"] 158 | text = text.translate(REMOVE_TRANSLATOR) 159 | output.append(text) 160 | 161 | return output 162 | 163 | 164 | def process_text( 165 | text: str, lang: str, tokens: Optional[Set[str]] = None 166 | ) -> Tuple[str, Set]: 167 | # TODO: more rules, e.g. "%" -> percent, "°c" -> "degree celsius", "‰", etc. 
168 | # for multiple languages 169 | out = text.lower() 170 | out = remove_parentheses(out) 171 | out = out.replace("’", "'") 172 | out = out.translate(SPACE_TRANSLATOR) 173 | out = process_digits(out, lang) 174 | out = out.translate(REMOVE_TRANSLATOR) 175 | out = re.sub("'+", "'", out) 176 | out = out.strip("'").replace("' ", " ").replace(" '", " ") 177 | out = WHITESPACE_NORMALIZER.sub(SPACE, out) 178 | 179 | vocab = set() 180 | if tokens: 181 | for w in out.split(): 182 | if is_valid_text(w, tokens): 183 | vocab.add(w) 184 | 185 | return out, vocab 186 | 187 | 188 | def main(args): 189 | out_root = Path(args.root) / "lm_data" / args.lang 190 | out_root.mkdir(exist_ok=True, parents=True) 191 | asr_root = Path(args.root) / "transcribed_data" / args.lang 192 | asr_root.mkdir(exist_ok=True, parents=True) 193 | 194 | # Get VoxPopuli transcript 195 | url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{args.lang}.tsv.gz" 196 | path = asr_root / Path(url).name 197 | if not path.exists(): 198 | download_url(url, asr_root.as_posix(), Path(url).name) 199 | text = load_from_tsv_gz(path) 200 | # Get Europarl data 201 | if args.lang != "hr": 202 | for filename in ["europarl.tgz", "tools.tgz"]: 203 | url = f"https://www.statmt.org/europarl/v7/{filename}" 204 | if not (out_root / filename).exists(): 205 | download_url(url, out_root.as_posix(), filename) 206 | with tarfile.open(out_root / "europarl.tgz", "r:gz") as f: 207 | members = [ 208 | i for i in f.getmembers() 209 | if i.name.startswith(f"txt/{args.lang}") 210 | and not (out_root / i.name).exists() 211 | ] 212 | f.extractall(out_root, members=members) 213 | with tarfile.open(out_root / "tools.tgz", "r:gz") as f: 214 | f.extractall(out_root) 215 | cur_text = set() 216 | paths = list((out_root / "txt" / args.lang).glob("*.txt")) 217 | for p in tqdm.tqdm(paths): 218 | cur_out_path = p.with_suffix('.out') 219 | script_path = out_root / "tools" / "split-sentences.perl" 220 | os.system( 221 | f"perl {script_path.as_posix()} -l {args.lang} -q " 222 | f"< {p.as_posix()} > {cur_out_path.as_posix()}" 223 | ) 224 | with open(cur_out_path) as f_o: 225 | cur_text.update(r.strip() for r in f_o if not r.startswith("<")) 226 | text.extend(cur_text) 227 | assert len(text) > 0, "Cannot load any text. Aborting." 228 | 229 | tokens = LANG_TOKENS[args.lang] 230 | 231 | out_text = [] 232 | vocab = set() 233 | with Pool(args.n_proc) as p: 234 | for norm_text, uniq_vocab in tqdm.tqdm( 235 | p.starmap(process_text, [(t, args.lang, tokens) for t in text]) 236 | ): 237 | out_text.append(norm_text) 238 | if tokens: 239 | vocab |= uniq_vocab 240 | 241 | out_path = out_root / "sentences.txt" 242 | with open(out_path, "w") as o: 243 | for line in out_text: 244 | o.write(line + "\n") 245 | 246 | vocab_path = out_root / "vocabulary.txt" 247 | vocab = sorted(vocab) 248 | with open(vocab_path, "w") as o: 249 | o.write(" ".join(vocab)) 250 | 251 | 252 | if __name__ == "__main__": 253 | parser = argparse.ArgumentParser("LM data preparation") 254 | parser.add_argument( 255 | "--root", 256 | help="data root path", 257 | type=str, 258 | required=True, 259 | ) 260 | parser.add_argument( 261 | "--lang", 262 | type=str, 263 | required=True, 264 | choices=LANG_TOKENS.keys(), 265 | help=f"Language of the input text. 
VoxPopuli provides labelled data in ({', '.join(LANG_TOKENS.keys())})", 266 | ) 267 | parser.add_argument( 268 | "--n-proc", 269 | type=int, 270 | default=8, 271 | help="Number of processes to use", 272 | ) 273 | args = parser.parse_args() 274 | 275 | main(args) 276 | -------------------------------------------------------------------------------- /voxpopuli/get_s2s_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | from pathlib import Path 8 | import csv 9 | import gzip 10 | from typing import Tuple, List 11 | from collections import defaultdict 12 | 13 | import torchaudio 14 | from torchaudio.datasets.utils import download_url 15 | from tqdm import tqdm 16 | 17 | from voxpopuli import (S2S_SRC_LANGUAGES, S2S_TGT_LANGUAGES, DOWNLOAD_BASE_URL, 18 | S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION) 19 | from voxpopuli.utils import multiprocess_run 20 | 21 | 22 | def parse_src_id(id_): 23 | event_id, utt_id = id_.split("_", 1) 24 | event_id, lang = event_id.rsplit("-", 1) 25 | return event_id, lang, utt_id 26 | 27 | 28 | def _segment(info: Tuple[str, List[Tuple[str, float, float]]]): 29 | in_path, out_path_and_timestamps = info 30 | waveform, sr = torchaudio.load(in_path) 31 | for out_path, start, end in out_path_and_timestamps: 32 | start, end = int(start * sr), min(waveform.size(1), int(end * sr)) 33 | torchaudio.save(out_path, waveform[:, start: end], sr) 34 | 35 | 36 | def get(args): 37 | src_lang, tgt_lang = args.source_lang, args.target_lang 38 | if args.use_annotated_target: 39 | assert tgt_lang in S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION 40 | in_root = Path(args.root) / "raw_audios" / tgt_lang 41 | asr_root = Path(args.root) / "transcribed_data" / src_lang 42 | out_root = asr_root / tgt_lang 43 | out_root.mkdir(exist_ok=True, parents=True) 44 | # Get metadata TSV 45 | url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{src_lang}.tsv.gz" 46 | tsv_path = asr_root / Path(url).name 47 | if not tsv_path.exists(): 48 | download_url(url, asr_root.as_posix(), Path(url).name) 49 | with gzip.open(tsv_path, "rt") as f: 50 | src_metadata = [x for x in csv.DictReader(f, delimiter="|")] 51 | src_metadata = { 52 | "{}-{}".format(r["session_id"], r["id_"]): ( 53 | r["original_text"], r["speaker_id"] 54 | ) 55 | for r in src_metadata 56 | } 57 | ref_sfx = "_ref" if args.use_annotated_target else "" 58 | url = f"{DOWNLOAD_BASE_URL}/annotations/s2s/s2s_{tgt_lang}{ref_sfx}.tsv.gz" 59 | tsv_path = out_root / Path(url).name 60 | if not tsv_path.exists(): 61 | download_url(url, out_root.as_posix(), Path(url).name) 62 | with gzip.open(tsv_path, "rt") as f: 63 | tgt_metadata = [x for x in csv.DictReader(f, delimiter="\t")] 64 | # Get segment into list 65 | items = defaultdict(list) 66 | manifest = [] 67 | print("Loading manifest...") 68 | for r in tqdm(tgt_metadata): 69 | src_id = r["id"] 70 | event_id, _src_lang, utt_id = parse_src_id(src_id) 71 | if _src_lang != src_lang: 72 | continue 73 | year = event_id[:4] 74 | in_path = in_root / year / f"{event_id}_{tgt_lang}.ogg" 75 | cur_out_root = out_root / year 76 | cur_out_root.mkdir(exist_ok=True, parents=True) 77 | tgt_id = f"{event_id}-{tgt_lang}_{utt_id}" 78 | out_path = cur_out_root / f"{tgt_id}.ogg" 79 | items[in_path.as_posix()].append( 80 | (out_path.as_posix(), float(r["start_time"]), 
float(r["end_time"])) 81 | ) 82 | src_text, src_speaker_id = src_metadata[src_id] 83 | tgt_text = r["tgt_text"] if args.use_annotated_target else "" 84 | manifest.append((src_id, src_text, src_speaker_id, tgt_id, tgt_text)) 85 | items = list(items.items()) 86 | # Segment 87 | print(f"Segmenting {len(items):,} files...") 88 | multiprocess_run(items, _segment) 89 | # Output per-data-split list 90 | header = ["src_id", "src_text", "src_speaker_id", "tgt_id", "tgt_text"] 91 | with open(out_root / f"s2s{ref_sfx}.tsv", "w") as f_o: 92 | f_o.write("\t".join(header) + "\n") 93 | for cols in manifest: 94 | f_o.write("\t".join(cols) + "\n") 95 | 96 | 97 | def get_args(): 98 | parser = argparse.ArgumentParser("Prepare S2S interpretation data") 99 | parser.add_argument( 100 | "--root", 101 | help="data root path", 102 | type=str, 103 | required=True, 104 | ) 105 | parser.add_argument( 106 | "--source-lang", 107 | required=True, 108 | type=str, 109 | choices=S2S_SRC_LANGUAGES, 110 | ) 111 | parser.add_argument( 112 | "--target-lang", 113 | required=True, 114 | type=str, 115 | choices=S2S_TGT_LANGUAGES, 116 | ) 117 | parser.add_argument("--use-annotated-target", action="store_true") 118 | return parser.parse_args() 119 | 120 | 121 | def main(): 122 | args = get_args() 123 | get(args) 124 | 125 | 126 | if __name__ == '__main__': 127 | main() 128 | -------------------------------------------------------------------------------- /voxpopuli/get_unlabelled_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import gzip 8 | import csv 9 | from pathlib import Path 10 | from collections import defaultdict 11 | from typing import Tuple, List 12 | 13 | from tqdm import tqdm 14 | from torchaudio.datasets.utils import download_url 15 | import torchaudio 16 | 17 | from voxpopuli import LANGUAGES, LANGUAGES_V2, DOWNLOAD_BASE_URL 18 | from voxpopuli.utils import multiprocess_run 19 | 20 | 21 | def _segment(item: Tuple[str, List[Tuple[str, float, float]], str]): 22 | in_path, segments, out_root = item 23 | _in_path = Path(in_path) 24 | event_id = _in_path.stem 25 | lang, year = _in_path.parent.parent.stem, _in_path.parent.stem 26 | waveform, sr = torchaudio.load(in_path) 27 | for i, s, e in segments: 28 | start, end = int(s * sr), min(waveform.size(1), int(e * sr)) 29 | out_path = Path(out_root) / lang / year / f'{event_id}_{i}.ogg' 30 | torchaudio.save(out_path.as_posix(), waveform[:, start: end], sr) 31 | 32 | 33 | def get_metadata(out_root, subset): 34 | def predicate(id_): 35 | is_plenary = id_.find("PLENARY") > -1 36 | if subset in {"10k", "10k_sd"}: 37 | return is_plenary and 20190101 <= int(id_[:8]) < 20200801 38 | elif subset in {"100k"}: 39 | return is_plenary 40 | elif subset in LANGUAGES: 41 | return is_plenary and id_.endswith(subset) 42 | elif subset in LANGUAGES_V2: 43 | return id_.endswith(subset.split("_")[0]) 44 | return True 45 | 46 | filename = "unlabelled_sd" if subset == "10k_sd" else "unlabelled_v2" 47 | url = f"{DOWNLOAD_BASE_URL}/annotations/{filename}.tsv.gz" 48 | tsv_path = out_root / Path(url).name 49 | if not tsv_path.exists(): 50 | download_url(url, out_root.as_posix(), Path(url).name) 51 | if subset == '10k_sd': 52 | with gzip.open(tsv_path, mode="rt") as f: 53 | rows = [ 54 | (r["session_id"], r["id_"], r["start_time"], r["end_time"]) 
55 | for r in csv.DictReader(f, delimiter="|") 56 | if predicate(r["session_id"]) 57 | ] 58 | else: 59 | with gzip.open(tsv_path, mode="rt") as f: 60 | rows = [ 61 | (r["event_id"], r["segment_no"], r["start"], r["end"]) 62 | for r in csv.DictReader(f, delimiter="\t") 63 | if predicate(r["event_id"]) 64 | ] 65 | return rows 66 | 67 | 68 | def get(args): 69 | audio_root = Path(args.root) / "raw_audios" 70 | out_root = Path(args.root) / "unlabelled_data" 71 | out_root.mkdir(exist_ok=True, parents=True) 72 | items = defaultdict(list) 73 | print("Loading manifest...") 74 | manifest = get_metadata(out_root, args.subset) 75 | for event_id, seg_no, start, end in tqdm(manifest): 76 | lang, year = event_id.rsplit("_", 1)[1], event_id[:4] 77 | cur_out_root = out_root / lang / year 78 | cur_out_root.mkdir(exist_ok=True, parents=True) 79 | path = audio_root / lang / year / f"{event_id}.ogg" 80 | items[path.as_posix()].append((seg_no, float(start), float(end))) 81 | items = [(k, v, out_root.as_posix()) for k, v in items.items()] 82 | print(f"Segmenting {len(items):,} files...") 83 | multiprocess_run(items, _segment) 84 | 85 | 86 | def get_args(): 87 | parser = argparse.ArgumentParser("Prepare unlabelled data") 88 | parser.add_argument( 89 | "--root", "-r", type=str, required=True, help="data root path" 90 | ) 91 | parser.add_argument( 92 | "--subset", "-s", type=str, required=True, 93 | choices=["400k", "100k", "10k", "10k_sd"] + LANGUAGES + LANGUAGES_V2, 94 | help="data subset to download" 95 | ) 96 | return parser.parse_args() 97 | 98 | 99 | def main(): 100 | args = get_args() 101 | get(args) 102 | 103 | 104 | if __name__ == "__main__": 105 | main() 106 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
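# Shared helpers for the segmentation scripts. For example (hypothetical file
# names): given 20190101-0001-PLENARY_en.ogg with a sibling
# 20190101-0001-PLENARY_en.pyannote.dia_ami.pkl written by run_pyannote_sd.py,
# get_pyannote_segments(path, "dia_ami") returns a list of
# (start_sec, end_sec, speaker_label) tuples filtered by min_duration.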
5 |
6 | from pathlib import Path
7 | from dataclasses import dataclass
8 | from typing import List, Union
9 | import pickle as pkl
10 | import json
11 | import enum
12 |
13 | import torch
14 | import torchaudio
15 |
16 |
17 | @dataclass
18 | class Timestamp:
19 |     t_start: float
20 |     t_end: float
21 |
22 |
23 | class LangCode(enum.Enum):
24 |     HR = "hr"
25 |     HU = "hu"
26 |     IT = "it"
27 |     SL = "sl"
28 |     ES = "es"
29 |     BG = "bg"
30 |     NL = "nl"
31 |     ET = "et"
32 |     DE = "de"
33 |     MT = "mt"
34 |     PT = "pt"
35 |     DA = "da"
36 |     EN = "en"
37 |     FI = "fi"
38 |     LV = "lv"
39 |     PL = "pl"
40 |     RO = "ro"
41 |     FR = "fr"
42 |     LT = "lt"
43 |     SK = "sk"
44 |     SV = "sv"
45 |     CS = "cs"
46 |     EL = "el"
47 |
48 |     @classmethod
49 |     def has_value(cls, value):
50 |         return value in cls._value2member_map_
51 |
52 |
53 | def load_segments_from_pkl(pkl_path, min_duration):
54 |     with open(pkl_path, "rb") as f:
55 |         annotation = pkl.load(f)
56 |     segments = [
57 |         (round(segment.start, 3), round(segment.end, 3), label)
58 |         for segment, track, label in annotation.itertracks(yield_label=True)
59 |     ]
60 |     segments = [(s, t, l) for s, t, l in segments if t - s >= min_duration]
61 |     return segments
62 |
63 |
64 | def get_pyannote_segments(path_audio, pyannote_cfg, min_duration=0.1):
65 |     pkl_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.pkl"
66 |     if pkl_path.is_file():
67 |         return load_segments_from_pkl(pkl_path, min_duration)
68 |
69 |     json_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.json"
70 |     if json_path.is_file():
71 |         with open(json_path, "r") as f:
72 |             segments = json.load(f)
73 |         return [(s, t, l) for s, t, l in segments if t - s >= min_duration]
74 |
75 |     raise FileNotFoundError(f"{pkl_path} and {json_path} not found")
76 |
77 |
78 | def is_id_valid(name: str):
79 |
80 |     # An id should have the following format
81 |     # YYYYMMDD-XXXX-[NAME]
82 |     # YYYYMMDD : is the date of the session
83 |     # XXXX : is a 4-digit identification number
84 |     # [NAME] : can be any string
85 |
86 |     data = name.split("-")
87 |     if len(data) < 3:
88 |         return False
89 |
90 |     date = data[0]
91 |     if len(date) != 8 or any((not x.isdigit()) for x in date):
92 |         return False
93 |
94 |     if int(date[4:6]) > 12:
95 |         return False
96 |     if int(date[6:]) > 31:
97 |         return False
98 |
99 |     session_id = data[1]
100 |     if any((not x.isdigit()) for x in session_id):
101 |         return False
102 |
103 |     return True
104 |
105 |
106 | def get_batches(list_like, batch_size: int):
107 |     for i in list(range(0, len(list_like), batch_size)):
108 |         yield list_like[i : min(i + batch_size, len(list_like))]
109 |
110 |
111 | def is_plenary(_id: str):
112 |     return _id.find("-PLENARY") > -1
113 |
114 |
115 | def to_wav2letter_format(data: torch.tensor, sr: int) -> torch.tensor:
116 |     r"""
117 |     Wav2letter needs mono 16kHz inputs
118 |     """
119 |     if len(data.size()) == 2:
120 |         data = data.mean(dim=0, keepdim=True)
121 |     elif len(data.size()) == 1:
122 |         data = data.view(1, -1)
123 |     else:
124 |         raise ValueError("Invalid tensor format")
125 |     if sr != 16000:
126 |         data = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(data)
127 |     data = torch.clamp(data, min=-1.0, max=1.0)
128 |     return data
129 |
130 |
131 | def correct_name_fbcluster_output(name_in: str) -> str:
132 |     r"""A quick patch to solve some discrepancies in the output names
133 |     of the align / WER pipelines without having to relaunch everything"""
134 |
135 |     split_ = name_in.split("-")
136 |     if len(split_) == 3:
137 |         return "-".join(split_[:2])
138 |
139 |
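    # names without exactly three dash-separated fields are returned unchanged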
return name_in 140 | 141 | 142 | def get_all_years_for_lang(path_root: Union[str, Path], lang: str) -> List[str]: 143 | path_lang = Path(path_root) / lang 144 | return [ 145 | x.stem 146 | for x in path_lang.glob("*") 147 | if (len(x.stem) == 4 and x.is_dir() and all(p.isdigit() for p in x.stem)) 148 | ] 149 | 150 | 151 | def get_all_sessions_lang_year(path_root: Path, lang: str, year: str) -> List[str]: 152 | 153 | audio = list((path_root / lang / year).glob(f"*_{lang}.ogg")) 154 | return [x.stem.split("_")[0] for x in audio] 155 | 156 | 157 | def get_path_full_audio(path_root: Path, session_id: str, lang: str) -> Path: 158 | year = session_id[:4] 159 | return path_root / lang / year / f"{session_id}_{lang}.ogg" 160 | 161 | 162 | def get_all_audio_for_lang(path_root: Path, lang: str) -> List[Path]: 163 | 164 | audio_paths = [] 165 | years = get_all_years_for_lang(path_root, lang) 166 | for year in years: 167 | all_sessions = get_all_sessions_lang_year(path_root, lang, year) 168 | loc = [ 169 | get_path_full_audio(path_root, session_id, lang) 170 | for session_id in all_sessions 171 | ] 172 | audio_paths += loc 173 | return audio_paths 174 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/cut_from_labels.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | import soundfile as sf 6 | import csv 7 | import argparse 8 | import tqdm 9 | import numpy as np 10 | import ast 11 | from pathlib import Path 12 | from voxpopuli.segmentation import Timestamp, get_path_full_audio 13 | from typing import Callable, Dict, List, Tuple 14 | from multiprocessing import Pool 15 | 16 | 17 | VadData = List[Timestamp] 18 | 19 | 20 | def parse_seq_path(seq_path: str) -> Tuple[str, str, str]: 21 | out = seq_path.split("/") 22 | assert len(out) == 3 23 | return out[0], out[1], out[2] 24 | 25 | 26 | def get_path_paragraph(row, idx_: Dict[str, int]) -> Path: 27 | base_path = Path(row[idx_["session_id"]]) / row[idx_["paragraph_id"]] 28 | if "lang" in idx_: 29 | base_path = Path(row[idx_["lang"]]) / base_path 30 | return base_path 31 | 32 | 33 | def get_path_fully_segmented(row, idx_: Dict[str, int]) -> Path: 34 | return get_path_paragraph(row, idx_) / row[idx_["id_"]] 35 | 36 | 37 | def get_ts_base(row, idx_: Dict[str, int]) -> List[Timestamp]: 38 | return [Timestamp(float(row[idx_["start_time"]]), float(row[idx_["end_time"]]))] 39 | 40 | 41 | def get_ts_speaker(row, idx_: Dict[str, int]) -> List[Timestamp]: 42 | return [ 43 | Timestamp(float(row[idx_["speaker_start"]]), float(row[idx_["speaker_end"]])) 44 | ] 45 | 46 | 47 | def get_ts_vad(row, idx_: Dict[str, int]) -> List[Timestamp]: 48 | vad = ast.literal_eval(row[idx_["vad"]]) 49 | return [Timestamp(x[0], x[1]) for x in vad] 50 | 51 | 52 | def load_annot_file( 53 | path_input: Path, 54 | path_extractor: Callable, 55 | timestamp_extractor: Callable, 56 | suffix: str = ".flac", 57 | ) -> Dict[Tuple[str, str], Dict[Path, VadData]]: 58 | with open(path_input, "r") as csvfile: 59 | data = csv.reader(csvfile, delimiter="|") 60 | 61 | names = next(data) 62 | idx_ = {x: i for i, x in enumerate(names)} 63 | idx_name = idx_["session_id"] 64 | idx_lang = idx_.get("lang", None) 65 | 66 | out = {} 67 | for row in data: 68 | session_name = row[idx_name] 69 | path_seq = path_extractor(row, 
idx_).with_suffix(suffix) 70 | vad = timestamp_extractor(row, idx_) 71 | lang = "original" if idx_lang is None else row[idx_lang] 72 | 73 | index = session_name, lang 74 | if index not in out: 75 | out[index] = {} 76 | out[index][path_seq] = vad 77 | 78 | return out 79 | 80 | 81 | def cut_session( 82 | root_original: Path, 83 | root_out: Path, 84 | session_name: str, 85 | ts_2_names: Dict[str, List[Timestamp]], 86 | lang: str, 87 | ) -> None: 88 | 89 | sound, sr = sf.read(str(get_path_full_audio(root_original, session_name, lang))) 90 | for loc_path, vad in ts_2_names.items(): 91 | full_path = root_out / loc_path 92 | full_path.parent.mkdir(exist_ok=True, parents=True) 93 | sf.write( 94 | full_path, 95 | cut_with_vad(sound, sr, vad), 96 | sr, 97 | subtype="PCM_16", 98 | ) 99 | 100 | 101 | def cut_with_vad(sound: np.array, sr: int, vad: List[Timestamp]) -> np.array: 102 | 103 | out = [] 104 | for ts in vad: 105 | out += [sound[int(ts.t_start * sr) : int(ts.t_end * sr)]] 106 | return np.concatenate(out, axis=0) 107 | 108 | 109 | class FileSegmenter: 110 | def __init__( 111 | self, 112 | root_original: Path, 113 | root_out: Path, 114 | annot_dict: Dict[str, Dict[Path, List[Timestamp]]] 115 | ): 116 | 117 | self.root_original = root_original 118 | self.root_out = root_out 119 | self.annot_dict = annot_dict 120 | 121 | def cut_session(self, session_id_lang: Tuple[str, str]): 122 | session_id, lang = session_id_lang 123 | cut_session( 124 | self.root_original, 125 | self.root_out, 126 | session_id, 127 | self.annot_dict[session_id_lang], 128 | lang, 129 | ) 130 | 131 | def run(self, n_procs: int = 8): 132 | 133 | with Pool(processes=n_procs) as pool: 134 | for _ in tqdm.tqdm( 135 | pool.imap_unordered(self.cut_session, self.annot_dict), 136 | total=len(self.annot_dict), 137 | ): 138 | pass 139 | 140 | 141 | def main(args): 142 | 143 | path_data = Path(args.root_original) 144 | path_out = Path(args.output) 145 | path_annotations = Path(args.tsv_file) 146 | 147 | path_extractor = get_path_fully_segmented 148 | if args.mode == "labelled": 149 | timestamp_extractor = get_ts_vad 150 | elif args.mode == "per_speaker_vad": 151 | timestamp_extractor = get_ts_base 152 | elif args.mode == "per_speaker": 153 | timestamp_extractor = get_ts_speaker 154 | path_extractor = get_path_paragraph 155 | else: 156 | raise RuntimeError(f"Invalid mode {args.mode}") 157 | 158 | annot_dict = load_annot_file(path_annotations, path_extractor, timestamp_extractor) 159 | segmenter = FileSegmenter(path_data, path_out, annot_dict) 160 | segmenter.run(n_procs=args.n_procs) 161 | 162 | 163 | if __name__ == "__main__": 164 | 165 | parser = argparse.ArgumentParser("Segment the data from the given .tsv file. 
" 166 | "Can be used for a customed segmentation of the 10k timetsamps") 167 | parser.add_argument( 168 | "--root_original", 169 | help="Root directory where the original data are stored.", 170 | type=str, 171 | required=True, 172 | ) 173 | parser.add_argument( 174 | "--tsv_file", 175 | help="Path to the .tsv file containing the labels.", 176 | type=str, 177 | required=True, 178 | ) 179 | parser.add_argument( 180 | "-o", "--output", help="Path to the outpit directory.", type=str, required=True 181 | ) 182 | parser.add_argument( 183 | "--n-procs", help="Number of processes to run", type=int, default=8 184 | ) 185 | parser.add_argument( 186 | "--lang", help="Lang to consider", type=str, required=True 187 | ) 188 | parser.add_argument( 189 | "--mode", 190 | required=True, 191 | type=str, 192 | choices=["labelled", "per_speaker", "per_speaker_vad"], 193 | help="labelled to segment the labelled data. " 194 | "per_speaker to cut the 10k data per speaker " 195 | "per_speaker_vad to add the vad of top of the segmentation of the 10k data." 196 | ) 197 | 198 | main(parser.parse_args()) 199 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/cut_with_align_files.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import torchaudio 8 | import shutil 9 | import os 10 | import torch 11 | import json 12 | import string 13 | from pathlib import Path 14 | from typing import NamedTuple, List, Optional, Set, Tuple 15 | from multiprocessing import Pool 16 | 17 | from voxpopuli.text.wer_tools import ( 18 | WordAlignFile, 19 | load_word_align_file, 20 | get_partial_transcriptions, 21 | get_wer, 22 | get_ler, 23 | create_word_align_file, 24 | reinsert_punctuation, 25 | ) 26 | from voxpopuli.text.word_align_tools import ( 27 | AlignedData, 28 | AlignedWord, 29 | cut_align_data, 30 | load_audio_align_wav2letter, 31 | ) 32 | 33 | from voxpopuli.segmentation import is_id_valid, to_wav2letter_format 34 | 35 | 36 | class CutIndex(NamedTuple): 37 | index_word: int 38 | index_align: int 39 | 40 | 41 | class SilCutConfig(NamedTuple): 42 | padding_start: float 43 | padding_end: float 44 | min_size_sil: float 45 | min_size_audio: Optional[float] = None 46 | 47 | 48 | class FullSegConfig(NamedTuple): 49 | segmentation_cfg: SilCutConfig 50 | vad_cfg: SilCutConfig 51 | target_size_segment: int 52 | sil_symbol: str = "$" 53 | 54 | 55 | def save_timestamp(ts_segmentation, ts_vad, path_out): 56 | 57 | out = { 58 | "start": ts_segmentation[0], 59 | "end": ts_segmentation[1], 60 | "vad": [(x[0] + ts_segmentation[0], x[1] + ts_segmentation[0]) for x in ts_vad], 61 | } 62 | 63 | with open(path_out, "w") as f: 64 | json.dump(out, f, indent=2) 65 | 66 | 67 | def save_transcription(target: str, decoded: str, path_out: Path): 68 | path_out = path_out.with_suffix(".json") 69 | out = { 70 | "target": target, 71 | "decoded": decoded, 72 | "wer": get_wer(target, decoded), 73 | "ler": get_ler(target, decoded), 74 | } 75 | 76 | with open(path_out, "w", encoding="utf8") as file: 77 | json.dump(out, file, indent=2, ensure_ascii=False) 78 | 79 | 80 | def add_punc_from_tsv(path_tsv, align_text, chars, punc): 81 | 82 | with open(path_tsv, "r") as f: 83 | text = f.read() 84 | return reinsert_punctuation(text, align_text, chars, punc) 85 | 86 | 
87 | def cut_with_segment( 88 | data: torch.tensor, 89 | sr: int, 90 | audio_align_data: AlignedData, 91 | index_align: List[int], 92 | padding_start: float = 0.1, 93 | padding_end: float = 0.2, 94 | ) -> List[torch.tensor]: 95 | 96 | last_start = 0 97 | out = [] 98 | timestamps = [] 99 | if len(index_align) == 0: 100 | return [data], [(0, data.size(0) / sr)] 101 | 102 | for cut_index in index_align: 103 | 104 | last_end = audio_align_data.data[cut_index].start + padding_end 105 | s = int(last_start * sr) 106 | e = int(last_end * sr) 107 | out.append(data[s:e]) 108 | timestamps.append((last_start, last_end)) 109 | last_start = max(last_end, audio_align_data.data[cut_index].end - padding_start) 110 | 111 | if index_align[-1] < len(audio_align_data[-1]): 112 | s = int(last_start * sr) 113 | out.append(data[s:]) 114 | timestamps.append((last_start, data.size(0) / sr)) 115 | 116 | return out, timestamps 117 | 118 | 119 | def segment_word_align( 120 | audio_align_data: AlignedData, 121 | word_align_data: WordAlignFile, 122 | sil_symbol: str = "$", 123 | size_min_sil: float = 0.5, 124 | target_size_segment: float = 1, 125 | punc_mark=None, 126 | ) -> List[CutIndex]: 127 | 128 | out = [] 129 | cum_size = 0 130 | index_target_transcription = 0 131 | index_char_transcription = 0 132 | 133 | if punc_mark is None: 134 | punc_mark = [] 135 | 136 | target = word_align_data.target.split() 137 | 138 | for index_align, align in enumerate(audio_align_data.data[:-1]): 139 | 140 | curr_size = align.end - align.start 141 | cum_size += curr_size 142 | 143 | if align.word != sil_symbol: 144 | has_punc = target[index_target_transcription][-1] in punc_mark 145 | if not has_punc: 146 | if align.word != target[index_target_transcription]: 147 | print(word_align_data.file_id) 148 | print(word_align_data.target) 149 | print(align.word, target[index_target_transcription]) 150 | assert align.word == target[index_target_transcription] 151 | index_char_transcription += len(target[index_target_transcription]) + 1 152 | index_target_transcription += 1 153 | continue 154 | if index_align == 0: 155 | continue 156 | if word_align_data.align_path[index_target_transcription].action != "=": 157 | continue 158 | if not has_punc: 159 | if cum_size < target_size_segment: 160 | continue 161 | if curr_size < size_min_sil: 162 | continue 163 | 164 | if has_punc: 165 | index_char_transcription += 1 166 | 167 | out.append( 168 | CutIndex(index_word=index_char_transcription - 1, index_align=index_align) 169 | ) 170 | cum_size = 0 171 | 172 | return out 173 | 174 | 175 | def cut_sils( 176 | data: torch.tensor, 177 | sr: int, 178 | audio_align_data: AlignedData, 179 | padding_start: float = 0.5, 180 | padding_end: float = 0.2, 181 | sil_symbol: str = "$", 182 | min_size_sil: float = 0.8, 183 | min_size_audio: float = 0.5, 184 | ) -> List[torch.tensor]: 185 | 186 | out = [] 187 | start = 0 188 | ts_vad = [] 189 | for align in audio_align_data.data: 190 | 191 | if align.word == sil_symbol: 192 | if align.end - align.start > min_size_sil: 193 | end = align.start + padding_end 194 | if end - start > min_size_audio: 195 | s = int(start * sr) 196 | e = int(end * sr) 197 | out.append(data[s:e]) 198 | ts_vad.append((start, end)) 199 | start = max(end, align.end - padding_start) 200 | if float(data.size(0)) / sr - start > min_size_audio: 201 | s = int(start * sr) 202 | out.append(data[s:]) 203 | ts_vad.append((start, data.size(0) / sr)) 204 | 205 | if len(out) > 0: 206 | return torch.cat(out, dim=0), ts_vad 207 | else: 208 | return None, None 209 | 
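# cut_sils above acts as an alignment-based VAD: every aligned "$" silence
# longer than min_size_sil is removed (keeping padding_end seconds of its head
# with the left chunk and padding_start seconds of its tail with the right
# chunk), surviving chunks shorter than min_size_audio are dropped, and the
# rest are concatenated into a single tensor.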
210 | 211 | def remove_extremities( 212 | data: torch.tensor, 213 | sr: int, 214 | audio_align_data: AlignedData, 215 | padding_start: float = 0.5, 216 | padding_end: float = 0.2, 217 | sil_symbol: str = "$", 218 | ) -> Tuple[torch.tensor, AlignedData]: 219 | 220 | index_start = 0 221 | while audio_align_data.data[index_start].word == sil_symbol: 222 | index_start += 1 223 | 224 | index_end = -1 225 | while audio_align_data.data[index_end].word == sil_symbol: 226 | index_end -= 1 227 | 228 | start = max(0, audio_align_data.data[index_start].start - padding_start) 229 | out_data = [ 230 | AlignedWord(max(0, x.start - start), max(0, x.end - start), x.word) 231 | for x in audio_align_data.data[index_start : index_end + 1] 232 | ] 233 | e = int( 234 | min(data.size(0), (audio_align_data.data[index_end].end + padding_end) * sr) 235 | ) 236 | s = int(start * sr) 237 | return data[s:e], AlignedData(audio_align_data.file_id, out_data) 238 | 239 | 240 | def get_matches( 241 | word_align_file: List[WordAlignFile], audio_align_file: List[AlignedData] 242 | ) -> List[Tuple[WordAlignFile, AlignedData]]: 243 | 244 | word_align_file.sort(key=lambda x: x.file_id) 245 | audio_align_file.sort(key=lambda x: x.file_id) 246 | 247 | i_ = 0 248 | out = [] 249 | max_i = len(audio_align_file) 250 | for w_d in word_align_file: 251 | while i_ < max_i and audio_align_file[i_].file_id < w_d.file_id: 252 | i_ += 1 253 | 254 | if i_ < max_i and audio_align_file[i_].file_id == w_d.file_id: 255 | out.append((w_d, audio_align_file[i_])) 256 | 257 | return out 258 | 259 | 260 | def process_file( 261 | word_align_file: WordAlignFile, 262 | audio_align_file: AlignedData, 263 | path_audio: Path, 264 | dir_out: Path, 265 | full_seg_cfg: FullSegConfig, 266 | punc_mark=None, 267 | ) -> None: 268 | 269 | name_out = word_align_file.file_id 270 | dir_out.mkdir(exist_ok=True) 271 | cut_index = segment_word_align( 272 | audio_align_file, 273 | word_align_file, 274 | sil_symbol=full_seg_cfg.sil_symbol, 275 | size_min_sil=full_seg_cfg.segmentation_cfg.min_size_sil, 276 | target_size_segment=full_seg_cfg.target_size_segment, 277 | punc_mark=punc_mark, 278 | ) 279 | 280 | trans_list = get_partial_transcriptions( 281 | word_align_file, [x.index_word for x in cut_index] 282 | ) 283 | 284 | audio, sr = torchaudio.load(path_audio) 285 | audio = to_wav2letter_format(audio, sr) 286 | audio = audio.mean(dim=0) 287 | sr = 16000 288 | 289 | segs, ts_segmentation = cut_with_segment( 290 | audio, 291 | sr, 292 | audio_align_file, 293 | [x.index_align for x in cut_index], 294 | padding_start=full_seg_cfg.segmentation_cfg.padding_start, 295 | padding_end=full_seg_cfg.segmentation_cfg.padding_end, 296 | ) 297 | new_align = cut_align_data( 298 | audio_align_file, 299 | [x.index_align for x in cut_index], 300 | sil_symbol=full_seg_cfg.sil_symbol, 301 | padding_start=full_seg_cfg.segmentation_cfg.padding_start, 302 | padding_end=full_seg_cfg.segmentation_cfg.padding_end, 303 | ) 304 | 305 | for index, seg in enumerate(segs): 306 | 307 | seg, curr_align = remove_extremities(seg, sr, new_align[index]) 308 | seg_no_sil, ts_vad = cut_sils( 309 | seg, 310 | sr, 311 | curr_align, 312 | min_size_sil=full_seg_cfg.vad_cfg.min_size_sil, 313 | padding_start=full_seg_cfg.vad_cfg.padding_start, 314 | padding_end=full_seg_cfg.vad_cfg.padding_end, 315 | min_size_audio=full_seg_cfg.vad_cfg.min_size_audio, 316 | ) 317 | 318 | if seg_no_sil is None: 319 | continue 320 | 321 | if seg_no_sil.size(0) == 0: 322 | continue 323 | 324 | path_out = dir_out / 
f"{name_out}_{index}.flac" 325 | torchaudio.save(str(path_out), seg_no_sil, sr) 326 | path_trans = dir_out / f"{name_out}_{index}_trans.json" 327 | target, decoded = trans_list[index] 328 | save_transcription(target, decoded, path_trans) 329 | 330 | path_timestamps = dir_out / f"{name_out}_{index}_timestamps.json" 331 | save_timestamp(ts_segmentation[index], ts_vad, path_timestamps) 332 | 333 | 334 | def process_session_lang( 335 | path_wer: Path, 336 | path_align: Path, 337 | dir_audio: Path, 338 | dir_out: Path, 339 | full_seg_cfg: FullSegConfig, 340 | max_wer: Optional[float] = None, 341 | max_ler: Optional[float] = None, 342 | chars=string.ascii_lowercase, 343 | punc_mark=None, 344 | ): 345 | 346 | word_align_data = load_word_align_file(path_wer) 347 | audio_align_data = load_audio_align_wav2letter(path_align) 348 | 349 | if max_wer is not None: 350 | word_align_data = [x for x in word_align_data if x.wer < max_wer] 351 | 352 | if max_ler is not None: 353 | word_align_data = [x for x in word_align_data if x.ler < max_ler] 354 | 355 | matches = get_matches(word_align_data, audio_align_data) 356 | print(f"{path_wer.stem} : {len(matches)} matches found") 357 | 358 | for w_d, a_d in matches: 359 | align_text = " ".join([x.word for x in a_d.data if x.word != "$"]) 360 | if len(align_text) == 0: 361 | continue 362 | try: 363 | if punc_mark is not None: 364 | path_tsv = dir_audio / f"{w_d.file_id}.tsv" 365 | align_text = add_punc_from_tsv(path_tsv, align_text, chars, punc_mark) 366 | final_wd = create_word_align_file(w_d.file_id, align_text, w_d.decoded) 367 | dir_session = dir_out / final_wd.file_id 368 | path_audio = dir_audio / f"{final_wd.file_id}.flac" 369 | if not path_audio.is_file(): 370 | print(f"ERROR: {str(path_audio)} not found") 371 | continue 372 | dir_out.mkdir(exist_ok=True, parents=True) 373 | process_file( 374 | final_wd, 375 | a_d, 376 | path_audio, 377 | dir_session, 378 | full_seg_cfg, 379 | punc_mark=punc_mark, 380 | ) 381 | 382 | path_speaker = dir_audio / f"{final_wd.file_id}.speaker" 383 | path_out_speaker = dir_session / f"{final_wd.file_id}.speaker" 384 | if path_out_speaker.is_file(): 385 | os.remove(path_out_speaker) 386 | shutil.copyfile(path_speaker, path_out_speaker) 387 | except FileNotFoundError: 388 | continue 389 | 390 | 391 | class FinalAudioSegmenter: 392 | def __init__( 393 | self, 394 | root_audio: Path, 395 | root_wer: Path, 396 | root_align: Path, 397 | root_out: Path, 398 | lang: str, 399 | full_seg_cfg: FullSegConfig, 400 | max_wer: Optional[float] = None, 401 | max_ler: Optional[float] = None, 402 | chars=string.ascii_lowercase, 403 | punc_mark=";.?!", 404 | ): 405 | 406 | self.root_audio = root_audio 407 | self.root_wer = root_wer 408 | self.root_align = root_align 409 | self.root_out = root_out 410 | self.full_seg_cfg = full_seg_cfg 411 | self.max_wer = max_wer 412 | self.max_ler = max_ler 413 | self.lang = lang 414 | self.chars = chars 415 | self.punc_mark = punc_mark 416 | 417 | def processs_session(self, session_id: str): 418 | 419 | path_wer = self.root_wer / f"{session_id}_{self.lang}_wer_no_lm_wav2letter.json" 420 | path_align = self.root_align / f"{session_id}_{self.lang}_align_wav2letter.txt" 421 | 422 | dir_audio = self.get_dir_paragraph(session_id) 423 | 424 | if not dir_audio.is_dir(): 425 | raise RuntimeError(f"ERROR: paragraph data not found at {dir_audio}") 426 | 427 | dir_out = self.root_out / session_id 428 | process_session_lang( 429 | path_wer, 430 | path_align, 431 | dir_audio, 432 | dir_out, 433 | self.full_seg_cfg, 434 | 
self.max_wer, 435 | self.max_ler, 436 | chars=self.chars, 437 | punc_mark=self.punc_mark, 438 | ) 439 | 440 | def get_dir_paragraph(self, session_id: str): 441 | return self.root_audio / "original" / session_id / "paragraphs" 442 | 443 | def process_db(self, session_ids: List[str], num_proc: int = 8): 444 | 445 | print(f"Launching the segmentation on {len(session_ids)} sessions") 446 | with Pool(num_proc) as pool: 447 | out = list( 448 | pool.imap_unordered(self.processs_session, session_ids, chunksize=30) 449 | ) 450 | 451 | 452 | def get_session_ids(root_align: Path, root_wer: Path, lang: str) -> Set[str]: 453 | 454 | files_align = [ 455 | x.name 456 | for x in root_align.glob(f"*_{lang}_align_wav2letter.txt") 457 | if is_id_valid(x.name[:-24]) 458 | ] 459 | files_wer = [ 460 | x.name 461 | for x in root_wer.glob(f"*_{lang}_wer_no_lm_wav2letter.json") 462 | if is_id_valid(x.name[:-29]) 463 | ] 464 | ids_align = {x[:-24] for x in files_align} 465 | ids_wer = {x[:-29] for x in files_wer} 466 | 467 | return ids_align.intersection(ids_wer) 468 | 469 | 470 | if __name__ == "__main__": 471 | 472 | parser = argparse.ArgumentParser( 473 | "Using the decoded data and the word alignment, segment the labelled " 474 | "sequences in small chunk with their estimated WER" 475 | ) 476 | parser.add_argument( 477 | "--dir_wer", 478 | type=str, 479 | required=True, 480 | help="Directory containing the decoding output", 481 | ) 482 | parser.add_argument( 483 | "--dir_align", 484 | type=str, 485 | required=True, 486 | help="Directory containing the alignment output", 487 | ) 488 | parser.add_argument( 489 | "--dir_audio", 490 | type=str, 491 | required=True, 492 | help="Directory containing the audio data", 493 | ) 494 | parser.add_argument( 495 | "--n_proc", 496 | type=int, 497 | default=8, 498 | help="Number of processes to use", 499 | ) 500 | parser.add_argument("--lang", type=str, required=True, help="Language Code.") 501 | parser.add_argument( 502 | "-o", "--output", type=str, required=True, help="Output directory." 503 | ) 504 | parser_segmentation = parser.add_argument_group("Segmentation parameters") 505 | parser_segmentation.add_argument( 506 | "--target_size_segment", 507 | type=int, 508 | default=20, 509 | help="Target size of each segment", 510 | ) 511 | parser_segmentation.add_argument( 512 | "--padding_start_seg", 513 | type=float, 514 | default=0.4, 515 | help="Padding start segmentation", 516 | ) 517 | parser_segmentation.add_argument( 518 | "--padding_end_seg", type=float, default=0.4, help="Padding end segmentation" 519 | ) 520 | parser_segmentation.add_argument( 521 | "--min_size_sil_seg", 522 | type=float, 523 | default=0.7, 524 | help="Minimum size of a silence when cutting a sequence.", 525 | ) 526 | parser_segmentation.add_argument( 527 | "--max_wer", 528 | type=float, 529 | default=None, 530 | help="Ignores all sequences with a Word Error Rate (WER) higher than " 531 | "the given value", 532 | ) 533 | parser_segmentation.add_argument( 534 | "--max_ler", 535 | type=float, 536 | default=100, 537 | help="Ignores all sequences with a Letter Error Rate (LER) higher than " 538 | "the given value", 539 | ) 540 | parser_segmentation.add_argument( 541 | "--ignore_punctuation", 542 | action="store_true", 543 | help="Activates to ignore all punctuation and cut only by silence.", 544 | ) 545 | parser_segmentation.add_argument( 546 | "--path_chars", 547 | type=str, 548 | default=None, 549 | help="Path to the char file containing the tokens of the considered language. 
(Default tokens are English Latin letters)",
550 |     )
551 |     parser_sil = parser.add_argument_group("VAD extraction parameters")
552 |     parser_sil.add_argument(
553 |         "--padding_start_vad",
554 |         type=float,
555 |         default=0.2,
556 |         help="Padding start for VAD extraction",
557 |     )
558 |     parser_sil.add_argument(
559 |         "--padding_end_vad", type=float, default=0.5, help="Padding end for VAD extraction"
560 |     )
561 |     parser_sil.add_argument(
562 |         "--min_size_sil_vad",
563 |         type=float,
564 |         default=1,
565 |         help="Minimum size of a silence when considering voice activity.",
566 |     )
567 |     parser_sil.add_argument(
568 |         "--min_size_audio_vad",
569 |         type=float,
570 |         default=0.5,
571 |         help="Isolated audio segments smaller than the given threshold will"
572 |         " be removed",
573 |     )
574 |     args = parser.parse_args()
575 |
576 |     args.dir_audio = Path(args.dir_audio)
577 |     args.dir_wer = Path(args.dir_wer)
578 |     args.dir_align = Path(args.dir_align)
579 |     args.output = Path(args.output)
580 |
581 |     seg_cfg = SilCutConfig(
582 |         padding_start=args.padding_start_seg,
583 |         padding_end=args.padding_end_seg,
584 |         min_size_sil=args.min_size_sil_seg,
585 |     )
586 |     vad_cfg = SilCutConfig(
587 |         padding_start=args.padding_start_vad,
588 |         padding_end=args.padding_end_vad,
589 |         min_size_sil=args.min_size_sil_vad,
590 |         min_size_audio=args.min_size_audio_vad,
591 |     )
592 |     full_seg_cfg = FullSegConfig(
593 |         segmentation_cfg=seg_cfg,
594 |         vad_cfg=vad_cfg,
595 |         target_size_segment=args.target_size_segment,
596 |     )
597 |
598 |     id_list = get_session_ids(args.dir_align, args.dir_wer, args.lang)
599 |     print(f"{len(id_list)} sessions found")
600 |
601 |     letters = string.ascii_lowercase
602 |     if args.path_chars is not None:
603 |         with open(args.path_chars, "r") as f:
604 |             letters = "".join([x.strip() for x in f.readlines()])
605 |
606 |     punc_mark = None if args.ignore_punctuation else ".;?!"
607 |
608 |     segmenter = FinalAudioSegmenter(
609 |         args.dir_audio,
610 |         args.dir_wer,
611 |         args.dir_align,
612 |         args.output,
613 |         args.lang,
614 |         full_seg_cfg,
615 |         max_wer=args.max_wer,
616 |         max_ler=args.max_ler,
617 |         chars=letters,
618 |         punc_mark=punc_mark,
619 |     )
620 |     segmenter.process_db(list(id_list), num_proc=args.n_proc)
621 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/get_segment_pyannote_speaker.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
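# Example invocation (see extension.md; flags match the argparse options below):
#   python -m voxpopuli.segmentation.get_segment_pyannote_speaker \
#       --root [ROOT] --languages [LANGUAGE_LIST] -o [OUTPUT_DIR] \
#       --max-dur-vad [MAX_SEGMENT_DURATION_IN_SECONDS]
# run_pyannote_sd.py must have been run first so that the *.pyannote.*.pkl
# (or .json) files sit next to each session audio.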
5 | 6 | import os 7 | import argparse 8 | import shutil 9 | from tqdm import tqdm 10 | from pathlib import Path 11 | from typing import List, Tuple, Union 12 | from multiprocessing import Pool 13 | 14 | from auditok import AudioRegion 15 | import soundfile as sf 16 | 17 | from voxpopuli.segmentation import ( 18 | get_all_audio_for_lang, get_pyannote_segments, LangCode 19 | ) 20 | 21 | 22 | def save_timestamp(path_out: Union[str, Path], start: float, end: float) -> None: 23 | 24 | with open(path_out, "w") as f: 25 | f.write(f"{start}\t{end}") 26 | 27 | 28 | def load_timestamp(path_data: Union[str, Path]) -> Tuple[float, float]: 29 | with open(path_data, "r") as f: 30 | data = f.readline().strip() 31 | 32 | start, end = data.split() 33 | return float(start), float(end) 34 | 35 | 36 | def get_path_timestamp(path_audio: Union[str, Path], timestamp_suffix: str) -> Path: 37 | return Path(path_audio).with_suffix(timestamp_suffix) 38 | 39 | 40 | def split_with_vad_wav( 41 | wav_path: Path, 42 | out_dir: Path, 43 | min_dur: float, 44 | max_dur: float, 45 | max_silence: float, 46 | strict_min_dur: bool, 47 | shift: float = 0, 48 | ) -> None: 49 | 50 | assert Path(wav_path).suffix == ".wav" 51 | audio_region = AudioRegion.load(str(wav_path)) 52 | out_dir = Path(out_dir) 53 | regions = audio_region.split( 54 | min_dur=min_dur, 55 | max_dur=max_dur, 56 | max_silence=max_silence, 57 | strict_min_dur=strict_min_dur, 58 | ) 59 | 60 | waveform, sr = sf.read(wav_path, dtype="float32") 61 | out = [] 62 | for i, r in enumerate(regions): 63 | start = int(r._meta.start * sr) 64 | end = int(r._meta.end * sr) 65 | path_seg = out_dir / f"{out_dir.stem}_{i}.flac" 66 | path_timestamp = get_path_timestamp(path_seg, ".vad.timestamp") 67 | save_timestamp(path_timestamp, r._meta.start + shift, r._meta.end + shift) 68 | sf.write( 69 | str(path_seg), waveform[start:end], sr, subtype="PCM_16", format="FLAC" 70 | ) 71 | out.append(path_seg) 72 | 73 | return out 74 | 75 | 76 | def split_vad_non_wav( 77 | audio_path: Path, 78 | out_dir: Path, 79 | min_dur: float, 80 | max_dur: float, 81 | max_silence: float, 82 | strict_min_dur: bool, 83 | shift: float = 0, 84 | ) -> None: 85 | path_wav = Path(audio_path).with_suffix(".wav") 86 | to_wav(audio_path, path_wav) 87 | out = split_with_vad_wav( 88 | path_wav, out_dir, min_dur, max_dur, max_silence, strict_min_dur, shift 89 | ) 90 | os.remove(path_wav) 91 | return out 92 | 93 | 94 | def to_wav(path_in: Path, path_out: Path) -> None: 95 | 96 | assert Path(path_out).suffix == ".wav" 97 | waveform, sr = sf.read(str(path_in), dtype="float32") 98 | sf.write(str(path_out), waveform, sr, format="WAV") 99 | 100 | 101 | def split_audio( 102 | audio_path: Path, 103 | segments: List[Tuple[float, float, str]], 104 | out_root: Union[str, Path], 105 | pyannote_suffix: str, 106 | ) -> List[Path]: 107 | 108 | out_root = Path(out_root) 109 | if out_root.is_dir(): 110 | shutil.rmtree(out_root) 111 | out_root.mkdir(parents=True) 112 | 113 | sr = sf.info(audio_path).samplerate 114 | audio_path = Path(audio_path) 115 | 116 | def save_clip(i, start, end): 117 | name = f"{i:03d}_{start:.0f}-{end:.0f}" 118 | out_audio_path = out_root / f"{name}.flac" 119 | save_timestamp(get_path_timestamp(out_audio_path, pyannote_suffix), start, end) 120 | clip, _ = sf.read(audio_path, start=int(start * sr), stop=int(end * sr)) 121 | sf.write(out_audio_path, clip, sr, subtype="PCM_16", format="FLAC") 122 | return out_audio_path 123 | 124 | last_start, last_end, last_speaker = segments[0] 125 | 126 | out_paths = [] 127 | 
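    # merge runs of consecutive segments sharing a speaker label: a clip is
    # only flushed when the speaker changes, so each output file covers one
    # uninterrupted speaker turn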
for i, (start_t, end_t, speaker) in enumerate(segments): 128 | if speaker == last_speaker: 129 | last_end = end_t 130 | continue 131 | out_audio_path = save_clip(i, last_start, last_end) 132 | last_start = start_t 133 | last_speaker = speaker 134 | last_end = end_t 135 | out_paths.append(out_audio_path) 136 | 137 | out_paths.append(save_clip(len(segments), last_start, last_end))  # the final span must be returned too, so the VAD pass below also processes it 138 | 139 | return out_paths 140 | 141 | 142 | def get_segments( 143 | path_audio, pyannote_cfg, min_duration 144 | ) -> Optional[List[Tuple[float, float, str]]]: 145 | try: 146 | return get_pyannote_segments( 147 | path_audio, pyannote_cfg, min_duration=min_duration 148 | ) 149 | except FileNotFoundError: 150 | return None 151 | 152 | 153 | class FileSegmenter: 154 | def __init__( 155 | self, 156 | root_in: str, 157 | out_dir: str, 158 | pyannote_cfg="sad_ami", 159 | min_duration=1.0, 160 | split_vad=True, 161 | min_dur_vad=15, 162 | max_dur_vad=30, 163 | max_silence_vad=1.5, 164 | strict_min_dur_vad=True, 165 | ): 166 | 167 | self.root_in = root_in 168 | self.out_dir = out_dir 169 | self.pyannote_cfg = pyannote_cfg 170 | self.min_duration = min_duration 171 | self.split_vad = split_vad 172 | self.min_dur_vad = min_dur_vad 173 | self.max_dur_vad = max_dur_vad 174 | self.max_silence_vad = max_silence_vad 175 | self.strict_min_dur_vad = strict_min_dur_vad 176 | 177 | def get_root_lang_id(self, id_: str, lang_code: str) -> Path: 178 | return Path(self.root_in) / id_ / f"{id_}_{lang_code}" 179 | 180 | def get_out_root(self, id_, lang_code) -> Path: 181 | return Path(self.out_dir) / lang_code / id_ / "paragraphs" 182 | 183 | def split_audio(self, audio_path: Path): 184 | 185 | 186 | lang = audio_path.stem.split("_")[-1] 187 | id_ = audio_path.stem.split("_")[0] 188 | if not audio_path.exists(): 189 | return False 190 | segments = get_segments(audio_path, self.pyannote_cfg, self.min_duration) 191 | if segments is None: 192 | return False 193 | 194 | out_root = self.get_out_root(id_, lang) 195 | 196 | pyannote_suffix = f".pyannote.{self.pyannote_cfg}" 197 | out_audio = split_audio(audio_path, segments, out_root, pyannote_suffix) 198 | 199 | if not self.split_vad: 200 | return True 201 | 202 | for audio_path in out_audio: 203 | dir_out = audio_path.parent / audio_path.stem 204 | dir_out.mkdir() 205 | path_timestamp_audio = get_path_timestamp(audio_path, pyannote_suffix) 206 | shift = load_timestamp(path_timestamp_audio)[0] 207 | vad_seq = split_vad_non_wav( 208 | audio_path, 209 | dir_out, 210 | min_dur=self.min_dur_vad, 211 | max_dur=self.max_dur_vad, 212 | max_silence=self.max_silence_vad, 213 | strict_min_dur=self.strict_min_dur_vad, 214 | shift=shift, 215 | ) 216 | os.remove(audio_path) 217 | if len(vad_seq) == 0: 218 | shutil.rmtree(dir_out) 219 | os.remove(audio_path.with_suffix(f".pyannote.{self.pyannote_cfg}"))  # drop the per-clip timestamp file as well 220 | 221 | return True 222 | 223 | 224 | def get_all(args): 225 | audio_paths = [] 226 | root = Path(args.root) 227 | for lang in args.languages: 228 | audio_paths += get_all_audio_for_lang(root, lang) 229 | if args.max_num is not None: 230 | audio_paths = audio_paths[: args.max_num] 231 | 232 | segmenter = FileSegmenter( 233 | args.root, 234 | args.output, 235 | pyannote_cfg=args.pyannote_cfg, 236 | min_duration=args.min_duration, 237 | split_vad=not args.no_vad, 238 | min_dur_vad=args.min_dur_vad, 239 | max_dur_vad=args.max_dur_vad, 240 | max_silence_vad=args.max_silence_vad, 241 | ) 242 | found = 0 243 | with Pool(args.nproc) as p: 244 | for x in tqdm( 245 | p.imap_unordered(segmenter.split_audio, audio_paths),
total=len(audio_paths) 246 | ): 247 | found += int(x) 248 | 249 | print(f"{found} audio files segmented") 250 | 251 | 252 | def main(): 253 | parser = argparse.ArgumentParser( 254 | description="Cut the data by speaker. " "run_pyannote_sd.py must have been run before" 255 | ) 256 | parser.add_argument("--root", type=str, required=True, help="Input root directory") 257 | parser.add_argument( 258 | "-o", 259 | "--output", 260 | type=str, 261 | default=None, 262 | help="Output directory, if different from the input " "one", 263 | ) 264 | parser.add_argument( 265 | "--languages", 266 | type=str, 267 | nargs="*", 268 | help="If given, the translated languages to deal with", 269 | ) 270 | parser.add_argument( 271 | "--max-num", 272 | default=None, 273 | type=int, 274 | help="If given, maximum number of sessions to deal with", 275 | ) 276 | parser.add_argument("--nproc", default=8, type=int, help="Number of processes") 277 | parser.add_argument( 278 | "--pyannote-cfg", 279 | default="dia_ami", 280 | type=str, 281 | choices=["dia", "dia_ami", "sad_ami"], help="Pyannote configuration.", 282 | ) 283 | parser.add_argument( 284 | "--min-duration", 285 | default=1.0, 286 | type=float, 287 | help="Ignore all speaker segments lasting less than the given number of seconds", 288 | ) 289 | parser.add_argument( 290 | "--no-vad", 291 | action="store_true", 292 | help="Do not apply the VAD after the speaker segmentation", 293 | ) 294 | parser.add_argument( 295 | "--min-dur-vad", 296 | default=15, 297 | type=int, 298 | help="Min size of a sequence (in seconds) after applying the VAD.", 299 | ) 300 | parser.add_argument( 301 | "--max-dur-vad", 302 | default=30, 303 | type=int, 304 | help="Max size of a sequence (in seconds) after applying the VAD.", 305 | ) 306 | parser.add_argument( 307 | "--max-silence-vad", 308 | default=1.5, 309 | type=float, 310 | help="Maximum length of a silence allowed in the voice activity detection" 311 | " (the lower the stricter)", 312 | ) 313 | args = parser.parse_args() 314 | 315 | if args.output is None: 316 | args.output = args.root 317 | 318 | if args.languages is None: 319 | args.languages = [x.value for x in LangCode] 320 | 321 | get_all(args) 322 | 323 | 324 | if __name__ == "__main__": 325 | main() 326 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/run_pyannote_sd.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree.
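# Usage sketch (assumed invocation; <sessions_root> is a placeholder): # python -m voxpopuli.segmentation.run_pyannote_sd --root <sessions_root> -l original --segment-min 10 # The pipelines below call torch.cuda.set_device(), so at least one CUDA device is expected.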
5 | ### 6 | # Run pyannote speaker diarization (SD) models 7 | ### 8 | 9 | import os.path as op 10 | import pickle as pkl 11 | import argparse 12 | import torchaudio 13 | from tqdm import tqdm 14 | import torch 15 | import json 16 | from pathlib import Path 17 | from tempfile import TemporaryDirectory 18 | from typing import List, Optional, Tuple 19 | from voxpopuli.segmentation import get_batches, get_all_audio_for_lang 20 | 21 | 22 | def check(path_audio: Path, pyannote_cfg="dia_ami"): 23 | rttm_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.rttm" 24 | pkl_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.pkl" 25 | if rttm_path.exists() and pkl_path.exists(): 26 | return True 27 | json_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.json" 28 | if json_path.exists(): 29 | return True 30 | return False 31 | 32 | 33 | def segment_audio_overlap( 34 | path_audio: Path, dir_out: Path, max_size_sec: int 35 | ) -> Tuple[List[Path], float]: 36 | 37 | info = torchaudio.info(str(path_audio))[0]  # legacy torchaudio API: info() returns (SignalInfo, EncodingInfo) 38 | s_data = info.length // info.channels 39 | sr = info.rate 40 | frames = int(sr * max_size_sec) 41 | if frames % 2 == 1:  # keep the frame count even so that frames // 2 below is an exact half-window hop 42 | frames += 1 43 | 44 | n_cuts = s_data // frames 45 | if s_data % frames > min(sr, s_data): 46 | n_cuts += 1 47 | 48 | n_cuts += n_cuts - 1  # windows overlap by half, so there are 2n - 1 of them 49 | 50 | out = [] 51 | offset = 0 52 | print(f"{path_audio.parent.name} : {n_cuts} segments to save") 53 | for index in range(n_cuts): 54 | num_frames = min(frames, s_data - offset) 55 | if num_frames <= 0: 56 | break 57 | data = torchaudio.load(str(path_audio), num_frames=num_frames, offset=offset)[0] 58 | path_out = dir_out / f"{path_audio.stem}_{index}.flac" 59 | torchaudio.save(str(path_out), data, sr) 60 | offset += frames // 2 61 | out.append(path_out) 62 | print(f"{path_audio.parent.name} : {n_cuts} segments saved") 63 | 64 | return out, max_size_sec / 2 65 | 66 | 67 | def merge_segments(path_list_pkl: List[Path], size_overlap: float): 68 | 69 | out = [] 70 | shift = 0 71 | last_start = None 72 | for i_pkl, pkl_path in enumerate(path_list_pkl): 73 | with open(pkl_path, "rb") as f: 74 | annotation = pkl.load(f) 75 | segments = [ 76 | ( 77 | shift + round(segment.start, 3), 78 | shift + round(segment.end, 3), 79 | f"{i_pkl}_{label}", 80 | ) 81 | for segment, track, label in annotation.itertracks(yield_label=True) 82 | ] 83 | if len(segments) == 0: 84 | continue 85 | 86 | start_index = 0 87 | if last_start is not None:  # find the segment of this chunk that best matches the last merged one 88 | min_diff = size_overlap + 1 89 | for i, pack in enumerate(segments): 90 | s = pack[0] 91 | d = abs(s - last_start) 92 | if d < min_diff: 93 | min_diff = d 94 | start_index = i 95 | 96 | if len(out) > 0: 97 | s, e, l = segments[start_index] 98 | out[-1] = last_start, e, l  # stitch the junction segment across the two chunks 99 | start_index += 1 100 | 101 | if start_index < len(segments): 102 | out += segments[start_index:] 103 | 104 | if len(out) > 0: 105 | last_start = out[-1][0] 106 | 107 | shift += size_overlap  # each successive chunk starts half a window later 108 | return out 109 | 110 | 111 | def get_segments( 112 | audio_path: str, 113 | device: int = 0, 114 | pyannote_cfg="dia_ami", 115 | max_size_sec: int = 10 * 60, 116 | ): 117 | 118 | print(audio_path) 119 | if not op.exists(audio_path): 120 | return 121 | 122 | torch.cuda.set_device(device) 123 | 124 | id_ = Path(audio_path).parent.name 125 | 126 | with TemporaryDirectory() as tmp_dir: 127 | tmp_dir = Path(tmp_dir) 128 | list_str, overlap = segment_audio_overlap( 129 | Path(audio_path), tmp_dir, max_size_sec 130 | ) 131 | pyannote_pipeline = torch.hub.load( 132 | "pyannote/pyannote-audio", pyannote_cfg, pipeline=True 133 |
) 134 | path_pkls = [] 135 | for index, path_ in enumerate(list_str): 136 | print(f"{id_}: running pyannote on {index + 1} / {len(list_str)}") 137 | sd = pyannote_pipeline({"uri": "filename", "audio": path_}) 138 | rttm_path = (  # one .rttm/.pkl pair per chunk, kept in the temporary directory; deriving these names from audio_path would overwrite the same two files at every iteration 139 | path_.parent / f"{path_.stem}.pyannote.{pyannote_cfg}.rttm" 140 | ) 141 | pkl_path = ( 142 | path_.parent / f"{path_.stem}.pyannote.{pyannote_cfg}.pkl" 143 | ) 144 | with open(rttm_path, "w") as f: 145 | sd.write_rttm(f) 146 | with open(pkl_path, "wb") as f: 147 | pkl.dump(sd, f) 148 | path_pkls.append(pkl_path) 149 | 150 | out_seg = merge_segments(path_pkls, overlap) 151 | 152 | path_out = Path(audio_path).parent / f"{Path(audio_path).stem}.pyannote.{pyannote_cfg}.json"  # keep the name consistent with check() above 153 | 154 | with open(path_out, "w") as f: 155 | json.dump(out_seg, f, indent=2) 156 | 157 | 158 | def get(audio_path: Path, device: int = 0, pyannote_cfg="dia_ami"): 159 | assert pyannote_cfg in {"dia_ami", "dia", "sad_ami"} 160 | 161 | if not audio_path.exists(): 162 | return 163 | 164 | torch.cuda.set_device(device) 165 | pyannote_pipeline = torch.hub.load( 166 | "pyannote/pyannote-audio", pyannote_cfg, pipeline=True 167 | ) 168 | sd = pyannote_pipeline({"uri": "filename", "audio": audio_path}) 169 | rttm_path = audio_path.parent / f"{audio_path.stem}.pyannote.{pyannote_cfg}.rttm" 170 | pkl_path = audio_path.parent / f"{audio_path.stem}.pyannote.{pyannote_cfg}.pkl" 171 | with open(rttm_path, "w") as f: 172 | sd.write_rttm(f) 173 | with open(pkl_path, "wb") as f: 174 | pkl.dump(sd, f) 175 | 176 | 177 | def get_multiprocess(i, items, pyannote_cfg="dia_ami", max_size_min_input: Optional[int] = None): 178 | if i >= len(items): 179 | return 180 | 181 | if max_size_min_input is not None: 182 | get_segments( 183 | items[i], i, pyannote_cfg=pyannote_cfg, max_size_sec=max_size_min_input * 60 184 | ) 185 | else: 186 | get(items[i], i, pyannote_cfg=pyannote_cfg) 187 | 188 | 189 | def main(args): 190 | languages = [lang if lang != "original" else "" for lang in args.languages] 191 | 192 | root = Path(args.root) 193 | audio_paths = [] 194 | for lang in languages: 195 | audio_paths += get_all_audio_for_lang(root, lang) 196 | 197 | if not args.overwrite: 198 | audio_paths = [x for x in audio_paths if not check(x, args.pyannote_cfg)] 199 | 200 | if args.max_num is not None: 201 | audio_paths = audio_paths[: args.max_num] 202 | n_devices = torch.cuda.device_count() 203 | 204 | if n_devices < 2: 205 | for d in audio_paths: 206 | print(d) 207 | get_multiprocess(0, [d], args.pyannote_cfg, args.segment_min)  # honor --pyannote-cfg and --segment-min on the single-device path as well 208 | else: 209 | batches = list(get_batches(audio_paths, batch_size=n_devices)) 210 | for batch in tqdm(batches): 211 | torch.multiprocessing.spawn( 212 | fn=get_multiprocess, 213 | args=(batch, args.pyannote_cfg, args.segment_min), 214 | nprocs=n_devices, 215 | ) 216 | 217 | 218 | if __name__ == "__main__": 219 | parser = argparse.ArgumentParser( 220 | description="Speaker diarization with pyannote." 221 | " Compute the speaker boundaries for the given audio files" 222 | ) 223 | parser.add_argument( 224 | "--root", type=str, required=True, help="Root directory containing the session directories" 225 | ) 226 | parser.add_argument( 227 | "--max-num", 228 | default=None, 229 | type=int, 230 | help="If given, maximal number of sessions to deal with.", 231 | ) 232 | parser.add_argument( 233 | "--overwrite", 234 | action="store_true", 235 | help="Set to true to overwrite previous results", 236 | ) 237 | parser.add_argument( 238 | "-l", 239 | "--languages", 240 | type=str, 241 | nargs="*", 242 | required=True, 243 | help="Languages to deal with. 
'original' stands for the original audio.", 244 | ) 245 | parser.add_argument( 246 | "--segment-min", 247 | type=int, 248 | default=None, 249 | help="If given, will split the input audio into several " 250 | "overlapping chunks of segment_min minutes and merge the " 251 | "resulting segmentation. A single speaker may then end up " 252 | "with several labels if they speak across several chunks, " 253 | "and the output file will be in JSON format " 254 | "(to avoid confusion with the regular diarization output).", 255 | ) 256 | parser.add_argument( 257 | "--pyannote-cfg", 258 | type=str, 259 | choices=["dia_ami", "dia", "sad_ami"], 260 | help="Pyannote configuration.", 261 | default="dia_ami", 262 | ) 263 | args = parser.parse_args() 264 | 265 | main(args) 266 | -------------------------------------------------------------------------------- /voxpopuli/text/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import re 7 | import string 8 | from typing import Set 9 | 10 | 11 | PUNCTUATIONS_TO_REMOVE = ( 12 | string.punctuation.replace("'", "") 13 | .replace("-", "") 14 | .replace("–", "") 15 | .replace("/", "") 16 | + "«»‟″“”…‘•„‚≤ᵉ" 17 | ) 18 | PUNCTUATIONS_TO_SPACE = "-/–·" 19 | REMOVE_TRANSLATOR = str.maketrans("", "", PUNCTUATIONS_TO_REMOVE) 20 | SPACE_TRANSLATOR = str.maketrans( 21 | PUNCTUATIONS_TO_SPACE, " " * len(PUNCTUATIONS_TO_SPACE) 22 | ) 23 | 24 | SPACE = chr(32) 25 | WHITESPACE_NORMALIZER = re.compile(r"\s+") 26 | 27 | # fmt: off 28 | LANG_TOKENS = { 29 | "cs": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "í", "ó", "ú", "ý", "č", "ď", "ě", "ň", "ř", "š", "ť", "ů", "ž",}, 30 | "de": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ß", "ä", "ö", "ü",}, 31 | "en": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",}, 32 | "es": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "í", "ñ", "ó", "ú", "ü",}, 33 | "et": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ä", "õ", "ö", "ü", "š", "ž",}, 34 | "fi": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ä", "å", "ö",}, 35 | "fr": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "à", "â", "æ", "ç", "è", "é", "ê", "ë", "î", "ï", "ô", "ù", "û", "ü", "œ", "ÿ",}, 36 | "hr": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ć", "č", "đ", "š", "ž",}, 37 | "hu": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "ó", "ö", "ú", "ü", "ő", "ű",}, 38 | "it": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", 
"z", "à", "è", "é", "ì", "í", "ï", "ò", "ó", "ù",}, 39 | "lt": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ą", "č", "ė", "ę", "į", "š", "ū", "ų", "ž",}, 40 | "nl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "à", "ç", "è", "é", "ê", "ë", "í", "ï", "ö", "ü",}, 41 | "pl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ó", "ą", "ć", "ę", "ł", "ń", "ś", "ź", "ż",}, 42 | "ro": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "â", "î", "ă", "ș", "ț",}, 43 | "sk": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "ä", "é", "í", "ó", "ô", "ú", "ý", "č", "ď", "ĺ", "ľ", "ň", "ŕ", "š", "ť", "ž",}, 44 | "sl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "č", "š", "ž",}, 45 | } 46 | # fmt: on 47 | 48 | 49 | def correct_name_fbcluster_output(name_in: str) -> str: 50 | r"""A quick patch to solve some discreepancies from the output names 51 | in the align / WER pipeliness without having to relaunch everything""" 52 | 53 | split_ = name_in.split("-") 54 | if len(split_) == 3: 55 | return "-".join(split_[:2]) 56 | 57 | return name_in 58 | 59 | 60 | def is_valid_text(text: str, tokens: Set[str]) -> bool: 61 | chars = "".join(text.split()) 62 | return all(x in tokens for x in chars) 63 | -------------------------------------------------------------------------------- /voxpopuli/text/wer_tools.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
5 | ### 6 | # Tools to compute word / letter error rates (WER / LER) and to align decoded transcriptions with their references 7 | ### 8 | 9 | import json 10 | from typing import Iterable, NamedTuple, List, Tuple 11 | from pathlib import Path 12 | 13 | import edlib 14 | import editdistance 15 | 16 | from voxpopuli.text import correct_name_fbcluster_output 17 | 18 | 19 | class CharAlignToken(NamedTuple): 20 | index_decoded: int 21 | action: str 22 | 23 | 24 | class WordAlignFile(NamedTuple): 25 | file_id: str 26 | target: str 27 | decoded: str 28 | wer: float 29 | ler: float 30 | align_path: List[CharAlignToken] 31 | 32 | 33 | def quick_norm(str_in, char_set): 34 | 35 | str_in = str_in.lower().strip() 36 | str_in = " ".join(str_in.split()) 37 | out = "".join([x for x in str_in if x in char_set]) 38 | return out 39 | 40 | 41 | def get_wer(query, decoded): 42 | return get_ler(query.split(), decoded.split()) 43 | 44 | 45 | def get_ler(query, decoded): 46 | d = editdistance.eval(query, decoded) 47 | return 100 * float(d) / (1e-8 + len(query)) 48 | 49 | 50 | def expand_cigar_format(path_cigar: str) -> str: 51 | # expand a run-length CIGAR string, e.g. "3=1X2D" -> "===XDD" 52 | out = "" 53 | size = len(path_cigar) 54 | i_ = 0 55 | while i_ < size: 56 | j = i_ + 1 57 | while path_cigar[j].isdigit(): 58 | j += 1 59 | n = int(path_cigar[i_:j]) 60 | v = path_cigar[j] 61 | out += n * v 62 | i_ = j + 1 63 | 64 | return out 65 | 66 | 67 | def get_align_index_path(query: Iterable, target: Iterable) -> List[CharAlignToken]: 68 | # map every position of the query onto its aligned position in the target 69 | path_ = edlib.align(query, target, task="path")["cigar"] 70 | if path_ is None: 71 | return [] 72 | path_ = expand_cigar_format(path_) 73 | 74 | index_out = 0 75 | index_path = 0 76 | out = [] 77 | for index_query in range(len(query)): 78 | while path_[index_path] == "D":  # "D": characters present only in the target 79 | index_out += 1 80 | index_path += 1 81 | 82 | action = path_[index_path] 83 | 84 | out.append(CharAlignToken(index_out, action)) 85 | if action == "=": 86 | assert query[index_query] == target[index_out] 87 | if action in ["=", "X"]: 88 | index_out += 1 89 | 90 | index_path += 1 91 | 92 | return out 93 | 94 | 95 | def get_partial_transcriptions( 96 | data: WordAlignFile, word_cuts: List[int] 97 | ) -> List[Tuple[str, str]]: 98 | 99 | last_index = 0 100 | last_index_decoded = data.align_path[0].index_decoded 101 | 102 | output = [] 103 | for word_index in word_cuts: 104 | i_decoded = data.align_path[word_index].index_decoded 105 | # Go until the end of the next word 106 | while i_decoded < len(data.decoded) and data.decoded[i_decoded] != " ": 107 | i_decoded += 1 108 | while word_index < len(data.target) and data.target[word_index] != " ": 109 | word_index += 1 110 | out_target = data.target[last_index:word_index] 111 | out_decoded = data.decoded[last_index_decoded:i_decoded] 112 | last_index = word_index 113 | last_index_decoded = i_decoded 114 | output.append((out_target, out_decoded)) 115 | 116 | if last_index < len(data.target): 117 | out_target = data.target[last_index:] 118 | out_decoded = data.decoded[last_index_decoded:] 119 | output.append((out_target, out_decoded)) 120 | 121 | return output 122 | 123 | 124 | def reinsert_punctuation( 125 | str_original: str, str_normed: str, char_set: str, punc_list: str 126 | ) -> str: 127 | 128 | quick_norm_ref = quick_norm(str_original, char_set + punc_list) 129 | align_path = get_align_index_path(quick_norm_ref, str_normed) 130 | punc_indexes = [(i, x) for i, x in enumerate(quick_norm_ref) if x in punc_list] 131 | last_index = 0 132 | 133 | out = "" 134 | 135 | for p_index, punc in punc_indexes: 136 | i_normed = align_path[p_index].index_decoded 137 | if
i_normed <= last_index: 138 | continue 139 | while i_normed < len(str_normed) and str_normed[i_normed] != " ": 140 | i_normed += 1 141 | loc_norm = str_normed[last_index:i_normed] 142 | out += loc_norm + punc + " " 143 | last_index = i_normed 144 | 145 | if last_index < len(str_normed): 146 | out += str_normed[last_index:] 147 | 148 | return out 149 | 150 | 151 | def create_word_align_file(file_id: str, target: str, decoded: str) -> WordAlignFile: 152 | 153 | return WordAlignFile( 154 | file_id=file_id, 155 | target=target, 156 | decoded=decoded, 157 | wer=get_wer(target, decoded), 158 | ler=get_ler(target, decoded), 159 | align_path=get_align_index_path(target, decoded), 160 | ) 161 | 162 | 163 | def load_word_align_file(path_file: Path) -> List[WordAlignFile]: 164 | 165 | with open(path_file, "r") as file: 166 | data = json.load(file) 167 | 168 | out = [] 169 | 170 | for file_data in data: 171 | align_path = get_align_index_path( 172 | file_data["target"], file_data["word_prediction_no_lm"] 173 | ) 174 | if len(align_path) == 0: 175 | continue 176 | out.append( 177 | WordAlignFile( 178 | file_id=correct_name_fbcluster_output(file_data["sample_id"]), 179 | target=file_data["target"], 180 | decoded=file_data["word_prediction_no_lm"], 181 | wer=file_data["wer"], 182 | ler=file_data["ler"], 183 | align_path=align_path, 184 | ) 185 | ) 186 | return out 187 | -------------------------------------------------------------------------------- /voxpopuli/text/word_align_tools.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | 7 | from pathlib import Path 8 | from typing import List, NamedTuple 9 | 10 | from voxpopuli.text import correct_name_fbcluster_output 11 | 12 | 13 | class AlignedWord(NamedTuple): 14 | start: float 15 | end: float 16 | word: str 17 | 18 | 19 | class AlignedData(NamedTuple): 20 | file_id: str 21 | data: List[AlignedWord] 22 | 23 | 24 | def load_audio_align_wav2letter(input_path: Path) -> List[AlignedData]: 25 | 26 | with open(input_path, "r") as file: 27 | data = file.readlines() 28 | 29 | output = [] 30 | 31 | for line in data: 32 | file_id, segments = line.split("\t") 33 | file_id = correct_name_fbcluster_output(file_id) 34 | segments = segments.split("\\n")  # word entries are separated by a literal backslash-n marker, not a newline 35 | samples = [] 36 | for s in segments: 37 | (_, _, start, duration, word) = s.split(" ") 38 | end = float(start) + float(duration) 39 | samples.append(AlignedWord(float(start), end, word.strip())) 40 | output.append(AlignedData(file_id, samples)) 41 | 42 | return output 43 | 44 | 45 | def cut_align_data( 46 | audio_align_data: AlignedData, 47 | index_align: List[int], 48 | sil_symbol: str = "$", 49 | padding_start: float = 0.1, 50 | padding_end: float = 0.2, 51 | ) -> List[AlignedData]: 52 | 53 | base_name = audio_align_data.file_id 54 | out = [] 55 | last_index = 0 56 | last_end = 0 57 | shift = 0 58 | 59 | if len(index_align) == 0: 60 | return [audio_align_data] 61 | 62 | for cut_index in index_align: 63 | 64 | last_end = audio_align_data.data[cut_index].start + padding_end 65 | out_align = [ 66 | AlignedWord(max(0, x.start - shift), max(0, x.end - shift), x.word) 67 | for x in audio_align_data.data[last_index:cut_index] 68 | ] 69 | out.append(AlignedData(f"{base_name}_{len(out)}", out_align)) 70 | last_index = cut_index 71 | shift = max(last_end,
audio_align_data.data[last_index].end - padding_start) 72 | 73 | if last_index < len(audio_align_data.data): 74 | out_align = [ 75 | AlignedWord(max(0, x.start - shift), max(0, x.end - shift), x.word) 76 | for x in audio_align_data.data[index_align[-1] :] 77 | ] 78 | out.append(AlignedData(f"{base_name}_{len(out)}", out_align)) 79 | 80 | return out 81 | -------------------------------------------------------------------------------- /voxpopuli/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | from typing import Callable, Optional 7 | 8 | from tqdm.contrib.concurrent import process_map 9 | 10 | 11 | def multiprocess_run( 12 | a_list: list, func: Callable, n_workers: Optional[int] = None 13 | ) -> None: 14 | process_map(func, a_list, max_workers=n_workers, chunksize=1) 15 | --------------------------------------------------------------------------------
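# Usage sketch for multiprocess_run above (a hypothetical example; the mapped function must be defined at module top level so the worker processes can pickle it):
#
#     from voxpopuli.utils import multiprocess_run
#
#     def square(x):
#         return x * x
#
#     multiprocess_run(list(range(100)), square, n_workers=4)  # shows a tqdm progress bar; return values are discarded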