├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── extension.md
├── requirements.txt
└── voxpopuli
    ├── __init__.py
    ├── download_audios.py
    ├── get_asr_data.py
    ├── get_lm_data.py
    ├── get_s2s_data.py
    ├── get_unlabelled_data.py
    ├── segmentation
    │   ├── __init__.py
    │   ├── cut_from_labels.py
    │   ├── cut_with_align_files.py
    │   ├── get_segment_pyannote_speaker.py
    │   └── run_pyannote_sd.py
    ├── text
    │   ├── __init__.py
    │   ├── wer_tools.py
    │   └── word_align_tools.py
    └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Code of Conduct
2 | 
3 | ## Our Pledge
4 | 
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to make participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 
12 | ## Our Standards
13 | 
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 | 
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 | 
23 | Examples of unacceptable behavior by participants include:
24 | 
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 | advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 | address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 | professional setting
33 | 
34 | ## Our Responsibilities
35 | 
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 | 
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 | 
46 | ## Scope
47 | 
48 | This Code of Conduct applies within all project spaces, and it also applies when
49 | an individual is representing the project or its community in public spaces.
50 | Examples of representing a project or community include using an official
51 | project e-mail address, posting via an official social media account, or acting
52 | as an appointed representative at an online or offline event. Representation of
53 | a project may be further defined and clarified by project maintainers.
54 | 
55 | ## Enforcement
56 | 
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at .
All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to voxpopuli 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 26 | disclosure of security bugs. In those cases, please go through the process 27 | outlined on that page and do not file a public issue. 28 | 29 | ## License 30 | By contributing to voxpopuli, you agree that your contributions will be licensed 31 | under the LICENSE file in the root directory of this source tree. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 
14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | Section 1 -- Definitions. 71 | 72 | a. Adapted Material means material subject to Copyright and Similar 73 | Rights that is derived from or based upon the Licensed Material 74 | and in which the Licensed Material is translated, altered, 75 | arranged, transformed, or otherwise modified in a manner requiring 76 | permission under the Copyright and Similar Rights held by the 77 | Licensor. 
For purposes of this Public License, where the Licensed 78 | Material is a musical work, performance, or sound recording, 79 | Adapted Material is always produced where the Licensed Material is 80 | synched in timed relation with a moving image. 81 | 82 | b. Adapter's License means the license You apply to Your Copyright 83 | and Similar Rights in Your contributions to Adapted Material in 84 | accordance with the terms and conditions of this Public License. 85 | 86 | c. Copyright and Similar Rights means copyright and/or similar rights 87 | closely related to copyright including, without limitation, 88 | performance, broadcast, sound recording, and Sui Generis Database 89 | Rights, without regard to how the rights are labeled or 90 | categorized. For purposes of this Public License, the rights 91 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 92 | Rights. 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. NonCommercial means not primarily intended for or directed towards 116 | commercial advantage or monetary compensation. For purposes of 117 | this Public License, the exchange of the Licensed Material for 118 | other material subject to Copyright and Similar Rights by digital 119 | file-sharing or similar means is NonCommercial provided there is 120 | no payment of monetary compensation in connection with the 121 | exchange. 122 | 123 | j. Share means to provide material to the public by any means or 124 | process that requires permission under the Licensed Rights, such 125 | as reproduction, public display, public performance, distribution, 126 | dissemination, communication, or importation, and to make material 127 | available to the public including in ways that members of the 128 | public may access the material from a place and at a time 129 | individually chosen by them. 130 | 131 | k. Sui Generis Database Rights means rights other than copyright 132 | resulting from Directive 96/9/EC of the European Parliament and of 133 | the Council of 11 March 1996 on the legal protection of databases, 134 | as amended and/or succeeded, as well as other essentially 135 | equivalent rights anywhere in the world. 136 | 137 | l. You means the individual or entity exercising the Licensed Rights 138 | under this Public License. Your has a corresponding meaning. 139 | 140 | Section 2 -- Scope. 141 | 142 | a. License grant. 143 | 144 | 1. 
Subject to the terms and conditions of this Public License, 145 | the Licensor hereby grants You a worldwide, royalty-free, 146 | non-sublicensable, non-exclusive, irrevocable license to 147 | exercise the Licensed Rights in the Licensed Material to: 148 | 149 | a. reproduce and Share the Licensed Material, in whole or 150 | in part, for NonCommercial purposes only; and 151 | 152 | b. produce, reproduce, and Share Adapted Material for 153 | NonCommercial purposes only. 154 | 155 | 2. Exceptions and Limitations. For the avoidance of doubt, where 156 | Exceptions and Limitations apply to Your use, this Public 157 | License does not apply, and You do not need to comply with 158 | its terms and conditions. 159 | 160 | 3. Term. The term of this Public License is specified in Section 161 | 6(a). 162 | 163 | 4. Media and formats; technical modifications allowed. The 164 | Licensor authorizes You to exercise the Licensed Rights in 165 | all media and formats whether now known or hereafter created, 166 | and to make technical modifications necessary to do so. The 167 | Licensor waives and/or agrees not to assert any right or 168 | authority to forbid You from making technical modifications 169 | necessary to exercise the Licensed Rights, including 170 | technical modifications necessary to circumvent Effective 171 | Technological Measures. For purposes of this Public License, 172 | simply making modifications authorized by this Section 2(a) 173 | (4) never produces Adapted Material. 174 | 175 | 5. Downstream recipients. 176 | 177 | a. Offer from the Licensor -- Licensed Material. Every 178 | recipient of the Licensed Material automatically 179 | receives an offer from the Licensor to exercise the 180 | Licensed Rights under the terms and conditions of this 181 | Public License. 182 | 183 | b. No downstream restrictions. You may not offer or impose 184 | any additional or different terms or conditions on, or 185 | apply any Effective Technological Measures to, the 186 | Licensed Material if doing so restricts exercise of the 187 | Licensed Rights by any recipient of the Licensed 188 | Material. 189 | 190 | 6. No endorsement. Nothing in this Public License constitutes or 191 | may be construed as permission to assert or imply that You 192 | are, or that Your use of the Licensed Material is, connected 193 | with, or sponsored, endorsed, or granted official status by, 194 | the Licensor or others designated to receive attribution as 195 | provided in Section 3(a)(1)(A)(i). 196 | 197 | b. Other rights. 198 | 199 | 1. Moral rights, such as the right of integrity, are not 200 | licensed under this Public License, nor are publicity, 201 | privacy, and/or other similar personality rights; however, to 202 | the extent possible, the Licensor waives and/or agrees not to 203 | assert any such rights held by the Licensor to the limited 204 | extent necessary to allow You to exercise the Licensed 205 | Rights, but not otherwise. 206 | 207 | 2. Patent and trademark rights are not licensed under this 208 | Public License. 209 | 210 | 3. To the extent possible, the Licensor waives any right to 211 | collect royalties from You for the exercise of the Licensed 212 | Rights, whether directly or through a collecting society 213 | under any voluntary or waivable statutory or compulsory 214 | licensing scheme. In all other cases the Licensor expressly 215 | reserves any right to collect such royalties, including when 216 | the Licensed Material is used other than for NonCommercial 217 | purposes. 
218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material (including in modified 227 | form), You must: 228 | 229 | a. retain the following if it is supplied by the Licensor 230 | with the Licensed Material: 231 | 232 | i. identification of the creator(s) of the Licensed 233 | Material and any others designated to receive 234 | attribution, in any reasonable manner requested by 235 | the Licensor (including by pseudonym if 236 | designated); 237 | 238 | ii. a copyright notice; 239 | 240 | iii. a notice that refers to this Public License; 241 | 242 | iv. a notice that refers to the disclaimer of 243 | warranties; 244 | 245 | v. a URI or hyperlink to the Licensed Material to the 246 | extent reasonably practicable; 247 | 248 | b. indicate if You modified the Licensed Material and 249 | retain an indication of any previous modifications; and 250 | 251 | c. indicate the Licensed Material is licensed under this 252 | Public License, and include the text of, or the URI or 253 | hyperlink to, this Public License. 254 | 255 | 2. You may satisfy the conditions in Section 3(a)(1) in any 256 | reasonable manner based on the medium, means, and context in 257 | which You Share the Licensed Material. For example, it may be 258 | reasonable to satisfy the conditions by providing a URI or 259 | hyperlink to a resource that includes the required 260 | information. 261 | 262 | 3. If requested by the Licensor, You must remove any of the 263 | information required by Section 3(a)(1)(A) to the extent 264 | reasonably practicable. 265 | 266 | 4. If You Share Adapted Material You produce, the Adapter's 267 | License You apply must not prevent recipients of the Adapted 268 | Material from complying with this Public License. 269 | 270 | Section 4 -- Sui Generis Database Rights. 271 | 272 | Where the Licensed Rights include Sui Generis Database Rights that 273 | apply to Your use of the Licensed Material: 274 | 275 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 276 | to extract, reuse, reproduce, and Share all or a substantial 277 | portion of the contents of the database for NonCommercial purposes 278 | only; 279 | 280 | b. if You include all or a substantial portion of the database 281 | contents in a database in which You have Sui Generis Database 282 | Rights, then the database in which You have Sui Generis Database 283 | Rights (but not its individual contents) is Adapted Material; and 284 | 285 | c. You must comply with the conditions in Section 3(a) if You Share 286 | all or a substantial portion of the contents of the database. 287 | 288 | For the avoidance of doubt, this Section 4 supplements and does not 289 | replace Your obligations under this Public License where the Licensed 290 | Rights include other Copyright and Similar Rights. 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | Section 6 -- Term and Termination. 321 | 322 | a. This Public License applies for the term of the Copyright and 323 | Similar Rights licensed here. However, if You fail to comply with 324 | this Public License, then Your rights under this Public License 325 | terminate automatically. 326 | 327 | b. Where Your right to use the Licensed Material has terminated under 328 | Section 6(a), it reinstates: 329 | 330 | 1. automatically as of the date the violation is cured, provided 331 | it is cured within 30 days of Your discovery of the 332 | violation; or 333 | 334 | 2. upon express reinstatement by the Licensor. 335 | 336 | For the avoidance of doubt, this Section 6(b) does not affect any 337 | right the Licensor may have to seek remedies for Your violations 338 | of this Public License. 339 | 340 | c. For the avoidance of doubt, the Licensor may also offer the 341 | Licensed Material under separate terms or conditions or stop 342 | distributing the Licensed Material at any time; however, doing so 343 | will not terminate this Public License. 344 | 345 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 346 | License. 347 | 348 | Section 7 -- Other Terms and Conditions. 349 | 350 | a. The Licensor shall not be bound by any additional or different 351 | terms or conditions communicated by You unless expressly agreed. 352 | 353 | b. Any arrangements, understandings, or agreements regarding the 354 | Licensed Material not stated herein are separate from and 355 | independent of the terms and conditions of this Public License. 356 | 357 | Section 8 -- Interpretation. 358 | 359 | a. For the avoidance of doubt, this Public License does not, and 360 | shall not be interpreted to, reduce, limit, restrict, or impose 361 | conditions on any use of the Licensed Material that could lawfully 362 | be made without permission under this Public License. 363 | 364 | b. To the extent possible, if any provision of this Public License is 365 | deemed unenforceable, it shall be automatically reformed to the 366 | minimum extent necessary to make it enforceable. If the provision 367 | cannot be reformed, it shall be severed from this Public License 368 | without affecting the enforceability of the remaining terms and 369 | conditions. 370 | 371 | c. 
No term or condition of this Public License will be waived and no 372 | failure to comply consented to unless expressly agreed to by the 373 | Licensor. 374 | 375 | d. Nothing in this Public License constitutes or may be interpreted 376 | as a limitation upon, or waiver of, any privileges and immunities 377 | that apply to the Licensor or You, including from the legal 378 | processes of any jurisdiction or authority. 379 | 380 | ======================================================================= 381 | 382 | Creative Commons is not a party to its public 383 | licenses. Notwithstanding, Creative Commons may elect to apply one of 384 | its public licenses to material it publishes and in those instances 385 | will be considered the “Licensor.” The text of the Creative Commons 386 | public licenses is dedicated to the public domain under the CC0 Public 387 | Domain Dedication. Except for the limited purpose of indicating that 388 | material is shared under a Creative Commons public license or as 389 | otherwise permitted by the Creative Commons policies published at 390 | creativecommons.org/policies, Creative Commons does not authorize the 391 | use of the trademark "Creative Commons" or any other trademark or logo 392 | of Creative Commons without its prior written consent including, 393 | without limitation, in connection with any unauthorized modifications 394 | to any of its public licenses or any other arrangements, 395 | understandings, or agreements concerning use of licensed material. For 396 | the avoidance of doubt, this paragraph does not form part of the 397 | public licenses. 398 | 399 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | VoxPopuli 2 | ===== 3 | [https://aclanthology.org/2021.acl-long.80](https://aclanthology.org/2021.acl-long.80) 4 | 5 | A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. 6 | 7 | # Overview 8 | 9 | VoxPopuli provides 10 | - 400K hours of unlabelled speech data for 23 languages 11 | - 1.8K hours of transcribed speech data for 16 languages 12 | - 17.3K hours of speech-to-speech interpretation data for 15x15 directions 13 | - 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents) 14 | 15 | The raw data is collected from 2009-2020 [European Parliament event recordings](https://multimedia.europarl.europa.eu/en/home). 16 | We acknowledge the European Parliament for creating and sharing these materials. 17 | 18 | #### Detailed statistics 19 | 20 |
**Unlabelled and transcribed data**

21 | 22 | | Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens | 23 | |:---:|:---:|:---:|:---:|:---:|:---:|:---:| 24 | | English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M | 25 | | German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M | 26 | | French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M | 27 | | Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M | 28 | | Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M | 29 | | Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M | 30 | | Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M | 31 | | Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M | 32 | | Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M | 33 | | Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M | 34 | | Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M | 35 | | Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K | 36 | | Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M | 37 | | Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M | 38 | | Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M | 39 | | Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M | 40 | | Portuguese | Pt | 4.4K/17.5K | - | - | - | - | 41 | | Bulgarian | Bg | 4.3K/17.6K | - | - | - | - | 42 | | Greek | El | 4.4K/17.7K | - | - | - | - | 43 | | Latvian | Lv | 4.4K/13.1K | - | - | - | - | 44 | | Maltese | Mt | 4.4K/9.1K | - | - | - | - | 45 | | Swedish | Sv | 4.5K/16.3K | - | - | - | - | 46 | | Danish | Da | 4.3K/13.6K | - | - | - | - | 47 | | Total | | 100K/384K | 1791 | 4295 | 15M | 467M | 48 | 49 |

50 | 51 |
**Speech-to-speech interpretation data**

52 | 53 | | Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total | 54 | |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| 55 | | En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K | 56 | | De | 187 | - | 196 | 204 | 214 | 217 | 198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K | 57 | | Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K | 58 | | Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K | 59 | | Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 | 60 | | It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 | 61 | | Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 | 62 | | Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 | 63 | | Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 | 64 | | Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 | 65 | | Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 | 66 | | Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 | 67 | | Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 | 68 | | Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 | 69 | | Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 | 70 | | Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K | 71 | 72 |

73 | 74 |
**Accented speech transcribed data**

75 | 76 | | Accent | Code | Transcribed Hours | Transcribed Speakers | 77 | |:---:|:---:|:---:|:---:| 78 | | Dutch | en_nl | 3.52 | 45 | 79 | | German | en_de | 3.52 | 84 | 80 | | Czech | en_cs | 3.30 | 26 | 81 | | Polish | en_pl | 3.23 | 33 | 82 | | French | en_fr | 2.56 | 27 | 83 | | Hungarian | en_hu | 2.33 | 23 | 84 | | Finnish | en_fi | 2.18 | 20 | 85 | | Romanian | en_ro | 1.85 | 27 | 86 | | Slovak | en_sk | 1.46 | 17 | 87 | | Spanish | en_es | 1.42 | 18 | 88 | | Italian | en_it | 1.11 | 15 | 89 | | Estonian | en_et | 1.08 | 6 | 90 | | Lithuanian | en_lt | 0.65 | 7 | 91 | | Croatian | en_hr | 0.42 | 9 | 92 | | Slovene | en_sl | 0.25 | 7 | 93 | 94 |

95 | 96 | # What's New 97 | - __2022-02-01__: New labelled accented English speech data released. 98 | - __2022-01-15__: New [wav2vec 2.0 pre-trained models](https://github.com/facebookresearch/voxpopuli#wav2vec-20) released. 99 | - __2021-07-26__: New unlabelled data (additional 300K hours) released. 100 | - __2021-03-03__: VoxPopuli released. 101 | 102 | # Getting Data 103 | We provide raw audios as well as scripts to segment and align them with transcription/interpretation. The output format 104 | is [Ogg Vorbis](https://en.wikipedia.org/wiki/Vorbis) (16000Hz, 16-bit, mono-channel), 105 | which is supported by common libraries such as `libsndfile` and `libsox` (they have Python frontends 106 | by [soundfile](https://github.com/bastibe/python-soundfile), [torchaudio](https://github.com/pytorch/audio), etc.). 107 | 108 | As the first step, clone this repo for the processing scripts 109 | ```bash 110 | git clone https://github.com/facebookresearch/voxpopuli.git 111 | ``` 112 | and install required PyPI packages: 113 | ```bash 114 | pip install -r requirements.txt 115 | ``` 116 | 117 | 118 | ### Unlabelled Data 119 | First, download raw audios via 120 | ```bash 121 | python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET] 122 | ``` 123 | which saves audios to `${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg`. 124 | 125 | `SUBSET` specifies the data subset to download: 126 | 127 | | --subset | # Languages | Hours | Years | Size | 128 | |:---:|:---:|:---:|:---:|:---:| 129 | | en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G | 130 | | en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G | 131 | | 10k | 23 | 10K | 2019-2020 | 170G | 132 | | 100k | 23 | 100K | 2009-2020 | 1.7T | 133 | | 400k | 23 | 400K | 2009-2020 | 6.4T | 134 | 135 | Then, segment these audios via 136 | ```bash 137 | python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET] 138 | ``` 139 | which outputs to `${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg` 140 | 141 | ### Transcribed (ASR) Data 142 | First, download raw audios via 143 | ```bash 144 | python -m voxpopuli.download_audios --root [ROOT] --subset asr 145 | ``` 146 | which saves audios to `${ROOT}/raw_audios/original/[year]/[recording_id].ogg`. 147 | 148 | Then, segment these audios and align them with transcripts via 149 | ```bash 150 | python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE] 151 | ``` 152 | which outputs 153 | - audios `${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg` 154 | - per-split manifest (ID, transcript, speaker ID) `${ROOT}/transcribed_data/[language]/asr_[split].tsv` 155 | 156 | **Accented transcribed data** 157 | To retrieve the transcribed accented speech data, follow the above steps with `--lang [LANGUAGE]_accented` (e.g. `--lang en_accented`). 158 | Note that the accented speech data is only composed of a test set for now. 159 | 160 | ### Speech-to-Speech Interpretation Data 161 | First, follow the instructions above to set up ASR data (source audios and transcripts). 162 | 163 | Then, download target audios via 164 | ```bash 165 | python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE] 166 | ``` 167 | which saves audios to `${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg`. 
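The recordings downloaded in the steps above are plain Ogg Vorbis files, so they can be sanity-checked with any of the frontends mentioned earlier before running the segmentation scripts. Below is a minimal sketch using soundfile; the file path is hypothetical and should be replaced with an actual file under `${ROOT}/raw_audios/`.

```python
# Minimal sketch: sanity-check a downloaded VoxPopuli recording with soundfile.
# The path is hypothetical -- substitute a real file under ${ROOT}/raw_audios/.
import soundfile as sf

path = "raw_audios/en/2020/recording.ogg"  # hypothetical recording ID

info = sf.info(path)
print(info.samplerate, info.channels, info.format)  # VoxPopuli releases: 16000, 1, OGG

# Read the first ten seconds as a float32 NumPy array.
wav, sr = sf.read(path, frames=16000 * 10, dtype="float32")
assert sr == 16000 and wav.ndim == 1
```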
168 | 169 | Finally, segment these audios and match them with source ones via 170 | ```bash 171 | python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE] 172 | ``` 173 | which outputs 174 | - target audios `${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg` 175 | - manifest (source ID, transcript, speaker ID, target ID) `${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv` 176 | 177 | We also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments. 178 | To use them instead of machine transcriptions in the alignments, add `--use-annotated-target` to the command line. 179 | 180 | ### Language Modeling (LM) Data 181 | We combine VoxPopuli transcripts and text data from [Europarl](https://www.statmt.org/europarl/) for LM training. 182 | 183 | Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via 184 | ```bash 185 | python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE] 186 | ``` 187 | which outputs 188 | - sentences `${ROOT}/lm_data/[language]/sentences.txt` 189 | - vocabulary `${ROOT}/lm_data/[language]/vocabulary.txt` 190 | 191 | To train an n-gram LM with [KenLM](https://github.com/kpu/kenlm), run 192 | ```bash 193 | ${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa 194 | ${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin 195 | ``` 196 | 197 | # Pre-trained Models 198 | ## wav2vec 2.0 199 | We provide pre-trained wav2vec 2.0 models 200 | (implemented in [fairseq](https://github.com/pytorch/fairseq) and [wav2letter/flashlight](https://github.com/facebookresearch/flashlight)) 201 | for downstream speech tasks. Each language is covered by a monolingual _Base_ model and multilingual _Large_ models that 202 | combine languages in the same family or all languages. See also [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) 203 | for larger-scale (up to 2B) multilingual models trained on VoxPopuli (400K hours). 204 | 205 |
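As a quick way to verify a downloaded checkpoint, the sketch below loads a wav2vec 2.0 model with fairseq and extracts frame-level features. This is only an illustrative sketch: it assumes `pip install fairseq`, relies on fairseq's `checkpoint_utils` API (which may change between versions), and uses a checkpoint file name from the table below as a placeholder.

```python
# Minimal sketch: load a VoxPopuli wav2vec 2.0 checkpoint with fairseq and
# extract frame-level features (fairseq installed; path is a placeholder).
import torch
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec2_base_10k.pt"]  # any checkpoint from the table below
)
model = models[0].eval()

wav = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz mono audio
with torch.no_grad():
    out = model(wav, features_only=True, mask=False)
print(out["x"].shape)  # (batch, frames, dim): 768 for Base, 1024 for Large
```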
**Download**

206 | 207 | | Language(s) | Family | PT Hours | Base Model (95M) | Large Model (317M) | 208 | |:----------------:|:--------------:|:----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| 209 | | Es (V1/V2) | Romance | 4.4K/21.4K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_es_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_es.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 210 | | Fr (V1/V2) | Romance | 4.5K/22.8K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fr_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_fr.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 211 | | It (V1/V2) | Romance | 4.6K/21.9K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_it_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_it.pt) / [V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 212 | | Pt (V2) | Romance | 17.5K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pt_v2.pt) | [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 213 | | Ro (V2) | Romance | 17.9K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_ro_v2.pt) | [fairseq V2 Romance](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_romance_v2.pt) | 214 | | Nl (V1/V2) | West Germanic | 4.5K/19.0K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_nl_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_nl.pt) / [V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 215 | | En (V2) | West Germanic | 24.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_en_v2.pt) | [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 216 | | De (V2) | West Germanic | 23.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_de_v2.pt) | [fairseq V2 West Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_west_germanic_v2.pt) | 217 | | Sv (V1/V2) | North Germanic | 4.5K/16.3K | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv.pt) / [V2](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sv_v2.pt) | fairseq [V1](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_sv.pt) / [V2 North Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt) | 218 | | Da (V2) | North Germanic | 13.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_da_v2.pt) | [fairseq V2 North 
Germanic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_north_germanic_v2.pt) | 219 | | Bg (V2) | Slavic | 17.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_bg_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 220 | | Cs (V2) | Slavic | 18.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_cs_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 221 | | Hr (V2) | Slavic | 8.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hr_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 222 | | Pl (V2) | Slavic | 21.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_pl_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 223 | | Sk (V2) | Slavic | 12.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sk_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 224 | | Sl (V2) | Slavic | 11.3K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_sl_v2.pt) | [fairseq V2 Slavic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_slavic_v2.pt) | 225 | | Et (V2) | Uralic | 10.6K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_et_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 226 | | Fi (V2) | Uralic | 14.2K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_fi_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 227 | | Hu (V2) | Uralic | 17.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_hu_v2.pt) | [fairseq V2 Uralic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_uralic_v2.pt) | 228 | | Lv (V2) | Baltic | 13.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lv_v2.pt) | [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt) | 229 | | Lt (V2) | Baltic | 14.4K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_lt_v2.pt) | [fairseq V2 Baltic](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_baltic_v2.pt) | 230 | | El (V2) | Greek | 17.7K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_el_v2.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_el_v2.pt) | 231 | | Mt (V2) | Semitic | 9.1K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_mt_v2.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_mt_v2.pt) | 232 | | All 23 languages | - | 10K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_10k.pt) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_10k.pt) | 233 | | All 23 languages | - | 100K | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_base_100k.pt) / [wav2letter](https://dl.fbaipublicfiles.com/voxpopuli/vox_populi_100k_500iters.tar.gz) | [fairseq](https://dl.fbaipublicfiles.com/voxpopuli/models/wav2vec2_large_100k.pt) | 234 | 235 |

236 | 
237 | In [our paper](https://arxiv.org/pdf/2101.00390.pdf) (Section 4.3.1), we evaluated a subset of these models on the [Common Voice](https://commonvoice.mozilla.org/) corpus
238 | in the normal setting and the [few-shot phoneme recognition setting](https://github.com/facebookresearch/CPC_audio#cross-lingual-transfer).
239 | 
240 | ## Wav2letter C++ implementation
241 | 
242 | A wav2letter implementation, as well as a checkpoint pretrained on VoxPopuli 100k (Base model), is also available in the [Wav2letter repository](https://github.com/flashlight/wav2letter/tree/master/recipes/joint_training_vox_populi).
243 | 
244 | The complete fine-tuned ASR baselines for this codebase should be available soon.
245 | The wav2letter implementation follows [this paper](https://arxiv.org/abs/2011.00093).
246 | 
247 | ## ASR and LM
248 | For the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec2 models (Base 10K) as well as n-gram LMs (trained with [KenLM](https://github.com/kpu/kenlm)) and their lexicons.
249 | 
250 |
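The released `.bin` language models can be queried directly from Python through the KenLM bindings. A minimal sketch, assuming the `kenlm` package is installed (`pip install kenlm`) and using a file name from the table below:

```python
# Minimal sketch: query a downloaded VoxPopuli n-gram LM via the kenlm
# Python bindings (assumes `pip install kenlm`; file name is a placeholder).
import kenlm

lm = kenlm.Model("en_5gram_lm.bin")
sentence = "the european parliament"
print(lm.score(sentence, bos=True, eos=True))  # total log10 probability
print(lm.perplexity(sentence))
```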
**Download**

251 | 252 | | Language | ASR (fairseq) | LM (kenLM) | Lexicon | 253 | |:---:|:---:|:---:|:---:| 254 | | Cs | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_cs.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_cs.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/cs/cs_lm.lexicon) | 255 | | De | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_de.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_de.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/de/de_lm.lexicon) | 256 | | En | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_en.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_en.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/en/en_lm.lexicon) | 257 | | Es | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_es.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_es.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/es/es_lm.lexicon) | 258 | | Et | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_et.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_et.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/et/et_lm.lexicon) | 259 | | Fi | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fi.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fi.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fi/fi_lm.lexicon) | 260 | | Fr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_fr.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_fr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/fr/fr_lm.lexicon) | 261 | | Hr | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hr.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hr.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hr/hr_lm.lexicon) | 262 | | Hu | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_hu.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_hu.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/hu/hu_lm.lexicon) | 263 | | It | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_it.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_it.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/it/it_lm.lexicon) | 264 | | Lt | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_lt.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_lt.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/lt/lt_lm.lexicon) | 265 | | Nl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_nl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_nl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/nl/nl_lm.lexicon) | 266 | | Pl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_pl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_pl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/pl/pl_lm.lexicon) | 267 | | Ro | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_ro.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_ro.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/ro/ro_lm.lexicon) | 268 | | Sk | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sk.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sk.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_5gram_lm.bin) | [lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sk/sk_lm.lexicon) | 269 | | Sl | [baseline](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/s2t_transformer_s_sl.tar), [fine-tuned wav2vec2](https://dl.fbaipublicfiles.com/voxpopuli/models/vp_asr/wav2vec2_base_10k_ft_sl.tar) | [3-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_3gram_lm.bin), [5-gram](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_5gram_lm.bin) | 
[lexicon](https://dl.fbaipublicfiles.com/voxpopuli/lm/sl/sl_lm.lexicon) | 270 | 271 |

272 | 273 | We also provide [CoVoST 2](https://github.com/facebookresearch/covost) + 274 | [EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ASR Transformer models that are self-trained on 3000h VoxPopuli 275 | unlabelled data. 276 | 277 |
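The WER figures below follow the standard definition: word-level edit distance divided by the number of reference words. A minimal sketch using the `editdistance` package from `requirements.txt` (an illustrative helper, not the repo's own `voxpopuli/text/wer_tools.py`):

```python
# Minimal sketch: word error rate via the `editdistance` package (listed in
# requirements.txt). Illustrative only; not the repo's wer_tools implementation.
import editdistance

def wer(ref: str, hyp: str) -> float:
    ref_words, hyp_words = ref.split(), hyp.split()
    return editdistance.eval(ref_words, hyp_words) / max(len(ref_words), 1)

print(wer("the european parliament", "the europea parliament"))  # ~0.333
```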
**Download**

278 | 279 | | Language | CoVoST 2 Test (WER) | EuroParl-ST Test (WER) | Model (fairseq) | 280 | |:---:|:---:|:---:|:---:| 281 | | De | 17.3 | 21.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de.tar) | 282 | | Es | 13.2 | 15.3 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es.tar) | 283 | | Fr | 17.0 | 19.0 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr.tar) | 284 | 285 |

286 | 287 | Please refer to the [S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use 288 | of Transformer model checkpoints. 289 | 290 | ## Speech-to-Text Translation (ST) 291 | We provide [CoVoST 2](https://github.com/facebookresearch/covost) + 292 | [EuroParl-ST](https://www.mllp.upv.es/europarl-st/) ST Transformer models that are jointly trained with 400h VoxPopuli 293 | weakly labelled data. 294 | 295 |
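The BLEU figures below are corpus-level scores. A minimal sketch of scoring system output with sacrebleu (an assumption: sacrebleu is not a dependency of this repo and must be installed separately; the sentences are made-up placeholders):

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (`pip install sacrebleu`
# assumed; sentences are made-up placeholders).
import sacrebleu

hyps = ["the commission supports this proposal"]
refs = [["the commission supports the proposal"]]  # one reference stream
print(sacrebleu.corpus_bleu(hyps, refs).score)
```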
**Download**

296 | 297 | | Direction | CoVoST 2 Test (BLEU) | EuroParl-ST Test (BLEU) | Model (fairseq) | 298 | |:---:|:---:|:---:|:---:| 299 | | De-En | 23.4 | 24.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_de-en.tar) | 300 | | Es-En | 29.7 | 28.4 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_es-en.tar) | 301 | | Fr-En | 30.3 | 31.1 | [s2t_transformer_l](https://dl.fbaipublicfiles.com/voxpopuli/models/cvst_epst/s2t_transformer_l_fr-en.tar) | 302 | 303 |

304 | 
305 | Please refer to the
306 | [S2T examples](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text) for the use of these checkpoints.
307 | 
308 | # License
309 | | | License |
310 | |:---:|:---:|
311 | | VoxPopuli Data | [CC0](https://creativecommons.org/share-your-work/public-domain/cc0/) (see also European Parliament's [legal notice](https://www.europarl.europa.eu/legal-notice/en/) for the raw data) |
312 | | LM Data | (Please check out the [Europarl website](https://www.statmt.org/europarl/) for the Europarl portion) |
313 | | Pre-trained Models | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |
314 | | Code | [CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/master/LICENSE) |
315 | 
316 | # Contact
317 | Changhan Wang (changhan@fb.com), Morgane Rivière (mriviere@fb.com), Ann Lee (annl@fb.com)
318 | 
319 | # Citation
320 | ```
321 | @inproceedings{wang-etal-2021-voxpopuli,
322 | title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
323 | author = "Wang, Changhan and
324 | Riviere, Morgane and
325 | Lee, Ann and
326 | Wu, Anne and
327 | Talnikar, Chaitanya and
328 | Haziza, Daniel and
329 | Williamson, Mary and
330 | Pino, Juan and
331 | Dupoux, Emmanuel",
332 | booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
333 | month = aug,
334 | year = "2021",
335 | address = "Online",
336 | publisher = "Association for Computational Linguistics",
337 | url = "https://aclanthology.org/2021.acl-long.80",
338 | pages = "993--1003",
339 | }
340 | ```
341 | 
--------------------------------------------------------------------------------
/extension.md:
--------------------------------------------------------------------------------
1 | # Extension
2 | 
3 | We provide additional scripts for customizing our data processing pipelines.
4 | 
5 | ### [Experimental] Segmenting Unlabelled Data with Speaker Diarization
6 | 
7 | Our current pipeline segments unlabelled data with a voice activity detection (VAD) algorithm, which has no awareness
8 | of the speakers. As a result, the output clips may contain speaker changes, which can be undesirable for downstream applications.
9 | We propose a two-step segmentation (speaker diarization followed by VAD) to mitigate this issue.
10 | 
11 | First, apply the speaker diarization (SD) model provided by pyannote:
12 | 
13 | ```bash
14 | python -m voxpopuli.segmentation.run_pyannote_sd \
15 | --root [ROOT] -l [LANGUAGE_LIST] \
16 | --segment-min [MIN_SEGMENT_DURATION_IN_SECONDS]
17 | ```
18 | 
19 | Then, apply VAD on top of the SD outputs to segment the audios:
20 | ```bash
21 | python -m voxpopuli.segmentation.get_segment_pyannote_speaker \
22 | --root [ROOT] --languages [LANGUAGE_LIST] -o [OUTPUT_DIR] \
23 | --max-dur-vad [MAX_SEGMENT_DURATION_IN_SECONDS]
24 | ```
25 | 
26 | We also provide pre-computed segments on the 10k subset. Apply the segmentation directly via
27 | ```bash
28 | python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset 10k_sd
29 | ```
30 | which outputs to `${ROOT}/unlabelled_data_sd/[language]/[year]/[segment_id].ogg`
31 | 
32 | ### Customizing Force-Alignment for Transcribed (ASR) Data
33 | 
34 | To segment the labelled data you will need the decoded texts corresponding to each audio segment.
35 | They are available upon request: please contact us or post an issue.
36 | 37 | If you want to use the force-aligned text for any purpose (like VAD),
38 | it is available [here](https://dl.fbaipublicfiles.com/voxpopuli/align_data.tar.gz).
39 |
40 | To segment paragraphs into utterances for the given language $LANG, run:
41 |
42 | ```bash
43 | python -m voxpopuli.segmentation.cut_with_align_files \
44 |     --dir_wer ${DIR_DOWNLOAD_WER}/${LANG}/wer \
45 |     --dir_align ${DIR_DOWNLOAD_WER}/${LANG}/align/ \
46 |     --dir_audio $VOX_POPULI_DIR \
47 |     -o $OUTPUT_DIRECTORY \
48 |     --path_chars ${DIR_DOWNLOAD}/${LANG}/${LANG}_grapheme.tokens \
49 |     --lang $LANG
50 | ```
51 | -------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
1 | tqdm
2 | torchaudio
3 | num2words
4 | edlib
5 | editdistance
6 | -------------------------------------------------------------------------------- /voxpopuli/__init__.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | LANGUAGES = [
7 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
8 |     "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da"
9 | ]
10 | LANGUAGES_V2 = [f"{x}_v2" for x in LANGUAGES]
11 |
12 | YEARS = list(range(2009, 2020 + 1))
13 |
14 | ASR_LANGUAGES = [
15 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
16 |     "sk", "sl", "et", "lt"
17 | ]
18 | ASR_ACCENTED_LANGUAGES = [
19 |     "en_accented"
20 | ]
21 |
22 | S2S_SRC_LANGUAGES = ASR_LANGUAGES
23 |
24 | S2S_TGT_LANGUAGES = [
25 |     "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
26 |     "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da"
27 | ]
28 |
29 | S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION = ["en", "fr", "es"]
30 |
31 | DOWNLOAD_BASE_URL = "https://dl.fbaipublicfiles.com/voxpopuli"
32 | -------------------------------------------------------------------------------- /voxpopuli/download_audios.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
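# Example invocation (flags match the argparse options defined below; [ROOT]
# is a placeholder for the data root, as in the README):
#   python -m voxpopuli.download_audios --root [ROOT] --subset 10k
# Archives are fetched from DOWNLOAD_BASE_URL and extracted under ${root}/raw_audios/.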
5 | 6 | import argparse 7 | import os 8 | from pathlib import Path 9 | 10 | from tqdm import tqdm 11 | from torchaudio.datasets.utils import download_url, extract_archive 12 | 13 | from voxpopuli import LANGUAGES, LANGUAGES_V2, YEARS, DOWNLOAD_BASE_URL 14 | 15 | 16 | def get_args(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument( 19 | "--root", "-r", type=str, required=True, help="data root path" 20 | ) 21 | parser.add_argument( 22 | "--subset", "-s", type=str, required=True, 23 | choices=["400k", "100k", "10k", "asr"] + LANGUAGES + LANGUAGES_V2, 24 | help="data subset to download" 25 | ) 26 | return parser.parse_args() 27 | 28 | 29 | def download(args): 30 | if args.subset in LANGUAGES_V2: 31 | languages = [args.subset.split("_")[0]] 32 | years = YEARS + [f"{y}_2" for y in YEARS] 33 | elif args.subset in LANGUAGES: 34 | languages = [args.subset] 35 | years = YEARS 36 | else: 37 | languages = { 38 | "400k": LANGUAGES, 39 | "100k": LANGUAGES, 40 | "10k": LANGUAGES, 41 | "asr": ["original"] 42 | }.get(args.subset, None) 43 | years = { 44 | "400k": YEARS + [f"{y}_2" for y in YEARS], 45 | "100k": YEARS, 46 | "10k": [2019, 2020], 47 | "asr": YEARS 48 | }.get(args.subset, None) 49 | 50 | url_list = [] 51 | for l in languages: 52 | for y in years: 53 | url_list.append(f"{DOWNLOAD_BASE_URL}/audios/{l}_{y}.tar") 54 | 55 | out_root = Path(args.root) / "raw_audios" 56 | out_root.mkdir(exist_ok=True, parents=True) 57 | print(f"{len(url_list)} files to download...") 58 | for url in tqdm(url_list): 59 | tar_path = out_root / Path(url).name 60 | download_url(url, out_root.as_posix(), Path(url).name) 61 | extract_archive(tar_path.as_posix()) 62 | os.remove(tar_path) 63 | 64 | 65 | def main(): 66 | args = get_args() 67 | download(args) 68 | 69 | 70 | if __name__ == '__main__': 71 | main() 72 | -------------------------------------------------------------------------------- /voxpopuli/get_asr_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
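# Example invocation (flags match the argparse options defined below; [ROOT]
# is a placeholder for the data root):
#   python -m voxpopuli.get_asr_data --root [ROOT] --lang en
# Downloads the ASR metadata TSV, cuts per-utterance segments out of the
# original session audio and writes per-split asr_{train,dev,test}.tsv manifests.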
5 | import csv
6 | import argparse
7 | from tqdm import tqdm
8 | from ast import literal_eval
9 | import gzip
10 | from pathlib import Path
11 | from typing import Dict, List, Tuple
12 | from collections import defaultdict
13 |
14 | import torch
15 | import torchaudio
16 | from torchaudio.datasets.utils import download_url
17 |
18 | from voxpopuli import ASR_LANGUAGES, ASR_ACCENTED_LANGUAGES, DOWNLOAD_BASE_URL
19 | from voxpopuli.utils import multiprocess_run
20 |
21 |
22 | SPLITS = ["train", "dev", "test"]
23 |
24 |
25 | def cut_session(info: Tuple[str, Dict[str, List[Tuple[float, float]]]]) -> None:
26 |     in_path, out_path_to_timestamps = info
27 |     waveform, sr = torchaudio.load(in_path)
28 |     duration = waveform.size(1)
29 |     for out_path, timestamps in out_path_to_timestamps.items():
30 |         segment = torch.cat(
31 |             [waveform[:, int(s * sr): min(int(t * sr), duration)]
32 |              for s, t in timestamps],
33 |             dim=1
34 |         )
35 |         torchaudio.save(out_path, segment, sr)
36 |
37 |
38 | def get(args):
39 |     in_root = Path(args.root) / "raw_audios" / "original"
40 |     out_root = Path(args.root) / "transcribed_data" / args.lang
41 |     out_root.mkdir(exist_ok=True, parents=True)
42 |     # Get metadata TSV
43 |     url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{args.lang}.tsv.gz"
44 |     tsv_path = out_root / Path(url).name
45 |     if not tsv_path.exists():
46 |         download_url(url, out_root.as_posix(), Path(url).name)
47 |     with gzip.open(tsv_path, "rt") as f:
48 |         metadata = [x for x in csv.DictReader(f, delimiter="|")]
49 |     # Get segment into list
50 |     items = defaultdict(dict)
51 |     manifest = []
52 |     for r in tqdm(metadata):
53 |         split = r["split"]
54 |         if split not in SPLITS:
55 |             continue
56 |         event_id = r["session_id"]
57 |         year = event_id[:4]
58 |         in_path = in_root / year / f"{event_id}_original.ogg"
59 |         cur_out_root = out_root / year
60 |         cur_out_root.mkdir(exist_ok=True, parents=True)
61 |         out_path = cur_out_root / "{}-{}.ogg".format(event_id, r["id_"])
62 |         timestamps = [(t[0], t[1]) for t in literal_eval(r["vad"])]
63 |         items[in_path.as_posix()][out_path.as_posix()] = timestamps
64 |         manifest.append(
65 |             (
66 |                 out_path.stem,
67 |                 r["original_text"],
68 |                 r["normed_text"],
69 |                 r["speaker_id"],
70 |                 split,
71 |                 r["gender"],
72 |                 r.get("is_gold_transcript", str(False)),
73 |                 r.get("accent", str(None))
74 |             )
75 |         )
76 |     items = list(items.items())
77 |     # Segment
78 |     multiprocess_run(items, cut_session)
79 |     # Output per-split manifest
80 |     header = [
81 |         "id", "raw_text", "normalized_text", "speaker_id", "split",
82 |         "gender", "is_gold_transcript", "accent"
83 |     ]
84 |     for split in SPLITS:
85 |         with open(out_root / f"asr_{split}.tsv", "w") as f_o:
86 |             f_o.write("\t".join(header) + "\n")
87 |             for cols in manifest:
88 |                 if cols[4] == split:
89 |                     f_o.write("\t".join(cols) + "\n")
90 |
91 |
92 | def get_args():
93 |     parser = argparse.ArgumentParser("Prepare transcribed data")
94 |     parser.add_argument(
95 |         "--root",
96 |         help="data root path",
97 |         type=str,
98 |         required=True,
99 |     )
100 |     parser.add_argument(
101 |         "--lang",
102 |         required=True,
103 |         type=str,
104 |         choices=ASR_LANGUAGES + ASR_ACCENTED_LANGUAGES,
105 |     )
106 |     return parser.parse_args()
107 |
108 |
109 | def main():
110 |     args = get_args()
111 |     get(args)
112 |
113 |
114 | if __name__ == "__main__":
115 |     main()
116 | -------------------------------------------------------------------------------- /voxpopuli/get_lm_data.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc.
and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import csv 8 | import gzip 9 | import logging 10 | from multiprocessing import Pool 11 | import re 12 | import os 13 | import string 14 | from typing import List, Optional, Set, Tuple 15 | from pathlib import Path 16 | import tarfile 17 | 18 | from num2words import num2words 19 | import tqdm 20 | from torchaudio.datasets.utils import download_url 21 | 22 | from voxpopuli.text import ( 23 | LANG_TOKENS, 24 | REMOVE_TRANSLATOR, 25 | SPACE_TRANSLATOR, 26 | SPACE, 27 | WHITESPACE_NORMALIZER, 28 | is_valid_text, 29 | ) 30 | from voxpopuli import DOWNLOAD_BASE_URL 31 | 32 | PUNCTUATIONS_TO_REMOVE = ( 33 | string.punctuation.replace("'", "") 34 | .replace("-", "") 35 | .replace("–", "") 36 | .replace("/", "") 37 | + "«»‟″“”…‘•„‚≤ᵉ" 38 | ) 39 | PUNCTUATIONS_TO_SPACE = "-/–·" 40 | 41 | 42 | def remove_parentheses(text: str) -> str: 43 | # remove all substring within () or [] 44 | out = "" 45 | num_p = 0 46 | start_i = 0 47 | for i, c in enumerate(text): 48 | if c == "(" or c == "[": 49 | if num_p == 0 and i > start_i: 50 | out += text[start_i:i] 51 | num_p += 1 52 | elif c == ")" or c == "]": 53 | num_p -= 1 54 | if num_p == 0: 55 | start_i = i + 1 56 | 57 | if len(text) > start_i: 58 | out += text[start_i:] 59 | 60 | return out 61 | 62 | 63 | def digit2text(text: str, lang: str) -> str: 64 | out = text.strip(" ") 65 | if len(text) == 0 or all([not c.isdigit() for c in text]): 66 | return text 67 | 68 | # remove leading and trailing punctuations 69 | is_negative = text[0] == "-" 70 | out = text.lstrip((string.punctuation)) 71 | out_tmp = out.rstrip((string.punctuation)) 72 | suffix = "" if out == out_tmp else out[len(out_tmp) :] 73 | out = out_tmp.replace(",", ".") 74 | out = out.replace(":", ".") 75 | 76 | # leading characters, e.g. a10, h1n1, $10 77 | m = re.search(r"^(\D+)", out) 78 | if m: 79 | prefix = m.groups()[0] 80 | return prefix + " " + digit2text(out[len(prefix) :], lang) + suffix 81 | 82 | # leading digits, e.g. 50th, 1900s 83 | to_format = "cardinal" 84 | # trailing characters as ordinal numbers, e.g. 50th 85 | # TODO: more rules for multiple languages, e.g. date 86 | m = re.search(r"\b(\d+)(st|nd|th)\b", out.lower()) 87 | if m: 88 | to_format = "ordinal" 89 | out = m.groups()[0] 90 | 91 | # different cases for xx.xx 92 | if "." in out: 93 | segs = out.split(".") 94 | if all([len(s) == 3 for s in segs[1:]]): # 12.000.000 95 | out = out.replace(".", "") 96 | else: # date 18.4.2009, IP address, time 18.30, etc. 
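            # e.g. "18.4.2009" falls through to this branch: each dot-separated
            # piece is verbalized by a recursive call and the results are
            # re-joined with spaces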
97 | norm_segs = [] 98 | for s in segs: 99 | norm_segs.append(digit2text(s, lang)) 100 | return " ".join(norm_segs) + suffix 101 | 102 | m = re.search(r"\b(\d+)(\D+)", out) 103 | if m: 104 | suffix = " " + digit2text(out[len(m.groups()[0]) :], lang) + suffix 105 | out = m.groups()[0] 106 | 107 | if is_negative: 108 | out = "-" + out 109 | 110 | try: 111 | num = int(out) 112 | except ValueError: 113 | try: 114 | num = float(out) 115 | except Exception as e: 116 | num = out 117 | logging.warning(f"cannot transform '{out}' to numbers") 118 | 119 | try: 120 | d = num2words(num, lang=lang, to=to_format) 121 | except NotImplementedError: # lang not supported, default to en 122 | assert lang != "en" 123 | d = digit2text(out, lang="en") 124 | except Exception as e: 125 | d = "" 126 | logging.warning(f"cannot process {out} ({num}) with {lang} in {to_format} mode") 127 | 128 | if suffix: 129 | d = d + suffix 130 | 131 | return d 132 | 133 | 134 | def process_digits(text: str, lang: str) -> str: 135 | words = text.split() 136 | out = [digit2text(w, lang) for w in words] 137 | 138 | return " ".join(out) 139 | 140 | 141 | def load_from_tsv_gz(in_file: Path) -> List[str]: 142 | output = [] 143 | with gzip.open(in_file, "rt") as f: 144 | reader = csv.DictReader( 145 | f, 146 | delimiter="|", 147 | quotechar=None, 148 | doublequote=False, 149 | lineterminator="\n", 150 | quoting=csv.QUOTE_NONE, 151 | ) 152 | 153 | for e in reader: 154 | e = dict(e) 155 | if e["split"] != "train": 156 | continue 157 | text = e["normed_text"] 158 | text = text.translate(REMOVE_TRANSLATOR) 159 | output.append(text) 160 | 161 | return output 162 | 163 | 164 | def process_text( 165 | text: str, lang: str, tokens: Optional[Set[str]] = None 166 | ) -> Tuple[str, Set]: 167 | # TODO: more rules, e.g. "%" -> percent, "°c" -> "degree celsius", "‰", etc. 
168 | # for multiple languages 169 | out = text.lower() 170 | out = remove_parentheses(out) 171 | out = out.replace("’", "'") 172 | out = out.translate(SPACE_TRANSLATOR) 173 | out = process_digits(out, lang) 174 | out = out.translate(REMOVE_TRANSLATOR) 175 | out = re.sub("'+", "'", out) 176 | out = out.strip("'").replace("' ", " ").replace(" '", " ") 177 | out = WHITESPACE_NORMALIZER.sub(SPACE, out) 178 | 179 | vocab = set() 180 | if tokens: 181 | for w in out.split(): 182 | if is_valid_text(w, tokens): 183 | vocab.add(w) 184 | 185 | return out, vocab 186 | 187 | 188 | def main(args): 189 | out_root = Path(args.root) / "lm_data" / args.lang 190 | out_root.mkdir(exist_ok=True, parents=True) 191 | asr_root = Path(args.root) / "transcribed_data" / args.lang 192 | asr_root.mkdir(exist_ok=True, parents=True) 193 | 194 | # Get VoxPopuli transcript 195 | url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{args.lang}.tsv.gz" 196 | path = asr_root / Path(url).name 197 | if not path.exists(): 198 | download_url(url, asr_root.as_posix(), Path(url).name) 199 | text = load_from_tsv_gz(path) 200 | # Get Europarl data 201 | if args.lang != "hr": 202 | for filename in ["europarl.tgz", "tools.tgz"]: 203 | url = f"https://www.statmt.org/europarl/v7/{filename}" 204 | if not (out_root / filename).exists(): 205 | download_url(url, out_root.as_posix(), filename) 206 | with tarfile.open(out_root / "europarl.tgz", "r:gz") as f: 207 | members = [ 208 | i for i in f.getmembers() 209 | if i.name.startswith(f"txt/{args.lang}") 210 | and not (out_root / i.name).exists() 211 | ] 212 | f.extractall(out_root, members=members) 213 | with tarfile.open(out_root / "tools.tgz", "r:gz") as f: 214 | f.extractall(out_root) 215 | cur_text = set() 216 | paths = list((out_root / "txt" / args.lang).glob("*.txt")) 217 | for p in tqdm.tqdm(paths): 218 | cur_out_path = p.with_suffix('.out') 219 | script_path = out_root / "tools" / "split-sentences.perl" 220 | os.system( 221 | f"perl {script_path.as_posix()} -l {args.lang} -q " 222 | f"< {p.as_posix()} > {cur_out_path.as_posix()}" 223 | ) 224 | with open(cur_out_path) as f_o: 225 | cur_text.update(r.strip() for r in f_o if not r.startswith("<")) 226 | text.extend(cur_text) 227 | assert len(text) > 0, "Cannot load any text. Aborting." 228 | 229 | tokens = LANG_TOKENS[args.lang] 230 | 231 | out_text = [] 232 | vocab = set() 233 | with Pool(args.n_proc) as p: 234 | for norm_text, uniq_vocab in tqdm.tqdm( 235 | p.starmap(process_text, [(t, args.lang, tokens) for t in text]) 236 | ): 237 | out_text.append(norm_text) 238 | if tokens: 239 | vocab |= uniq_vocab 240 | 241 | out_path = out_root / "sentences.txt" 242 | with open(out_path, "w") as o: 243 | for line in out_text: 244 | o.write(line + "\n") 245 | 246 | vocab_path = out_root / "vocabulary.txt" 247 | vocab = sorted(vocab) 248 | with open(vocab_path, "w") as o: 249 | o.write(" ".join(vocab)) 250 | 251 | 252 | if __name__ == "__main__": 253 | parser = argparse.ArgumentParser("LM data preparation") 254 | parser.add_argument( 255 | "--root", 256 | help="data root path", 257 | type=str, 258 | required=True, 259 | ) 260 | parser.add_argument( 261 | "--lang", 262 | type=str, 263 | required=True, 264 | choices=LANG_TOKENS.keys(), 265 | help=f"Language of the input text. 
VoxPopuli provides labelled data in ({', '.join(LANG_TOKENS.keys())})", 266 | ) 267 | parser.add_argument( 268 | "--n-proc", 269 | type=int, 270 | default=8, 271 | help="Number of processes to use", 272 | ) 273 | args = parser.parse_args() 274 | 275 | main(args) 276 | -------------------------------------------------------------------------------- /voxpopuli/get_s2s_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | from pathlib import Path 8 | import csv 9 | import gzip 10 | from typing import Tuple, List 11 | from collections import defaultdict 12 | 13 | import torchaudio 14 | from torchaudio.datasets.utils import download_url 15 | from tqdm import tqdm 16 | 17 | from voxpopuli import (S2S_SRC_LANGUAGES, S2S_TGT_LANGUAGES, DOWNLOAD_BASE_URL, 18 | S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION) 19 | from voxpopuli.utils import multiprocess_run 20 | 21 | 22 | def parse_src_id(id_): 23 | event_id, utt_id = id_.split("_", 1) 24 | event_id, lang = event_id.rsplit("-", 1) 25 | return event_id, lang, utt_id 26 | 27 | 28 | def _segment(info: Tuple[str, List[Tuple[str, float, float]]]): 29 | in_path, out_path_and_timestamps = info 30 | waveform, sr = torchaudio.load(in_path) 31 | for out_path, start, end in out_path_and_timestamps: 32 | start, end = int(start * sr), min(waveform.size(1), int(end * sr)) 33 | torchaudio.save(out_path, waveform[:, start: end], sr) 34 | 35 | 36 | def get(args): 37 | src_lang, tgt_lang = args.source_lang, args.target_lang 38 | if args.use_annotated_target: 39 | assert tgt_lang in S2S_TGT_LANGUAGES_WITH_HUMAN_TRANSCRIPTION 40 | in_root = Path(args.root) / "raw_audios" / tgt_lang 41 | asr_root = Path(args.root) / "transcribed_data" / src_lang 42 | out_root = asr_root / tgt_lang 43 | out_root.mkdir(exist_ok=True, parents=True) 44 | # Get metadata TSV 45 | url = f"{DOWNLOAD_BASE_URL}/annotations/asr/asr_{src_lang}.tsv.gz" 46 | tsv_path = asr_root / Path(url).name 47 | if not tsv_path.exists(): 48 | download_url(url, asr_root.as_posix(), Path(url).name) 49 | with gzip.open(tsv_path, "rt") as f: 50 | src_metadata = [x for x in csv.DictReader(f, delimiter="|")] 51 | src_metadata = { 52 | "{}-{}".format(r["session_id"], r["id_"]): ( 53 | r["original_text"], r["speaker_id"] 54 | ) 55 | for r in src_metadata 56 | } 57 | ref_sfx = "_ref" if args.use_annotated_target else "" 58 | url = f"{DOWNLOAD_BASE_URL}/annotations/s2s/s2s_{tgt_lang}{ref_sfx}.tsv.gz" 59 | tsv_path = out_root / Path(url).name 60 | if not tsv_path.exists(): 61 | download_url(url, out_root.as_posix(), Path(url).name) 62 | with gzip.open(tsv_path, "rt") as f: 63 | tgt_metadata = [x for x in csv.DictReader(f, delimiter="\t")] 64 | # Get segment into list 65 | items = defaultdict(list) 66 | manifest = [] 67 | print("Loading manifest...") 68 | for r in tqdm(tgt_metadata): 69 | src_id = r["id"] 70 | event_id, _src_lang, utt_id = parse_src_id(src_id) 71 | if _src_lang != src_lang: 72 | continue 73 | year = event_id[:4] 74 | in_path = in_root / year / f"{event_id}_{tgt_lang}.ogg" 75 | cur_out_root = out_root / year 76 | cur_out_root.mkdir(exist_ok=True, parents=True) 77 | tgt_id = f"{event_id}-{tgt_lang}_{utt_id}" 78 | out_path = cur_out_root / f"{tgt_id}.ogg" 79 | items[in_path.as_posix()].append( 80 | (out_path.as_posix(), float(r["start_time"]), 
float(r["end_time"])) 81 | ) 82 | src_text, src_speaker_id = src_metadata[src_id] 83 | tgt_text = r["tgt_text"] if args.use_annotated_target else "" 84 | manifest.append((src_id, src_text, src_speaker_id, tgt_id, tgt_text)) 85 | items = list(items.items()) 86 | # Segment 87 | print(f"Segmenting {len(items):,} files...") 88 | multiprocess_run(items, _segment) 89 | # Output per-data-split list 90 | header = ["src_id", "src_text", "src_speaker_id", "tgt_id", "tgt_text"] 91 | with open(out_root / f"s2s{ref_sfx}.tsv", "w") as f_o: 92 | f_o.write("\t".join(header) + "\n") 93 | for cols in manifest: 94 | f_o.write("\t".join(cols) + "\n") 95 | 96 | 97 | def get_args(): 98 | parser = argparse.ArgumentParser("Prepare S2S interpretation data") 99 | parser.add_argument( 100 | "--root", 101 | help="data root path", 102 | type=str, 103 | required=True, 104 | ) 105 | parser.add_argument( 106 | "--source-lang", 107 | required=True, 108 | type=str, 109 | choices=S2S_SRC_LANGUAGES, 110 | ) 111 | parser.add_argument( 112 | "--target-lang", 113 | required=True, 114 | type=str, 115 | choices=S2S_TGT_LANGUAGES, 116 | ) 117 | parser.add_argument("--use-annotated-target", action="store_true") 118 | return parser.parse_args() 119 | 120 | 121 | def main(): 122 | args = get_args() 123 | get(args) 124 | 125 | 126 | if __name__ == '__main__': 127 | main() 128 | -------------------------------------------------------------------------------- /voxpopuli/get_unlabelled_data.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import gzip 8 | import csv 9 | from pathlib import Path 10 | from collections import defaultdict 11 | from typing import Tuple, List 12 | 13 | from tqdm import tqdm 14 | from torchaudio.datasets.utils import download_url 15 | import torchaudio 16 | 17 | from voxpopuli import LANGUAGES, LANGUAGES_V2, DOWNLOAD_BASE_URL 18 | from voxpopuli.utils import multiprocess_run 19 | 20 | 21 | def _segment(item: Tuple[str, List[Tuple[str, float, float]], str]): 22 | in_path, segments, out_root = item 23 | _in_path = Path(in_path) 24 | event_id = _in_path.stem 25 | lang, year = _in_path.parent.parent.stem, _in_path.parent.stem 26 | waveform, sr = torchaudio.load(in_path) 27 | for i, s, e in segments: 28 | start, end = int(s * sr), min(waveform.size(1), int(e * sr)) 29 | out_path = Path(out_root) / lang / year / f'{event_id}_{i}.ogg' 30 | torchaudio.save(out_path.as_posix(), waveform[:, start: end], sr) 31 | 32 | 33 | def get_metadata(out_root, subset): 34 | def predicate(id_): 35 | is_plenary = id_.find("PLENARY") > -1 36 | if subset in {"10k", "10k_sd"}: 37 | return is_plenary and 20190101 <= int(id_[:8]) < 20200801 38 | elif subset in {"100k"}: 39 | return is_plenary 40 | elif subset in LANGUAGES: 41 | return is_plenary and id_.endswith(subset) 42 | elif subset in LANGUAGES_V2: 43 | return id_.endswith(subset.split("_")[0]) 44 | return True 45 | 46 | filename = "unlabelled_sd" if subset == "10k_sd" else "unlabelled_v2" 47 | url = f"{DOWNLOAD_BASE_URL}/annotations/{filename}.tsv.gz" 48 | tsv_path = out_root / Path(url).name 49 | if not tsv_path.exists(): 50 | download_url(url, out_root.as_posix(), Path(url).name) 51 | if subset == '10k_sd': 52 | with gzip.open(tsv_path, mode="rt") as f: 53 | rows = [ 54 | (r["session_id"], r["id_"], r["start_time"], r["end_time"]) 
55 | for r in csv.DictReader(f, delimiter="|") 56 | if predicate(r["session_id"]) 57 | ] 58 | else: 59 | with gzip.open(tsv_path, mode="rt") as f: 60 | rows = [ 61 | (r["event_id"], r["segment_no"], r["start"], r["end"]) 62 | for r in csv.DictReader(f, delimiter="\t") 63 | if predicate(r["event_id"]) 64 | ] 65 | return rows 66 | 67 | 68 | def get(args): 69 | audio_root = Path(args.root) / "raw_audios" 70 | out_root = Path(args.root) / "unlabelled_data" 71 | out_root.mkdir(exist_ok=True, parents=True) 72 | items = defaultdict(list) 73 | print("Loading manifest...") 74 | manifest = get_metadata(out_root, args.subset) 75 | for event_id, seg_no, start, end in tqdm(manifest): 76 | lang, year = event_id.rsplit("_", 1)[1], event_id[:4] 77 | cur_out_root = out_root / lang / year 78 | cur_out_root.mkdir(exist_ok=True, parents=True) 79 | path = audio_root / lang / year / f"{event_id}.ogg" 80 | items[path.as_posix()].append((seg_no, float(start), float(end))) 81 | items = [(k, v, out_root.as_posix()) for k, v in items.items()] 82 | print(f"Segmenting {len(items):,} files...") 83 | multiprocess_run(items, _segment) 84 | 85 | 86 | def get_args(): 87 | parser = argparse.ArgumentParser("Prepare unlabelled data") 88 | parser.add_argument( 89 | "--root", "-r", type=str, required=True, help="data root path" 90 | ) 91 | parser.add_argument( 92 | "--subset", "-s", type=str, required=True, 93 | choices=["400k", "100k", "10k", "10k_sd"] + LANGUAGES + LANGUAGES_V2, 94 | help="data subset to download" 95 | ) 96 | return parser.parse_args() 97 | 98 | 99 | def main(): 100 | args = get_args() 101 | get(args) 102 | 103 | 104 | if __name__ == "__main__": 105 | main() 106 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
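# Shared helpers for the segmentation scripts. For example (hypothetical file
# names): given 20190101-0001-PLENARY_en.ogg with a sibling
# 20190101-0001-PLENARY_en.pyannote.dia_ami.pkl written by run_pyannote_sd.py,
# get_pyannote_segments(path, "dia_ami") returns a list of
# (start_sec, end_sec, speaker_label) tuples filtered by min_duration.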
5 |
6 | from pathlib import Path
7 | from dataclasses import dataclass
8 | from typing import List, Union
9 | import pickle as pkl
10 | import json
11 | import enum
12 |
13 | import torch
14 | import torchaudio
15 |
16 |
17 | @dataclass
18 | class Timestamp:
19 |     t_start: float
20 |     t_end: float
21 |
22 |
23 | class LangCode(enum.Enum):
24 |     HR = "hr"
25 |     HU = "hu"
26 |     IT = "it"
27 |     SL = "sl"
28 |     ES = "es"
29 |     BG = "bg"
30 |     NL = "nl"
31 |     ET = "et"
32 |     DE = "de"
33 |     MT = "mt"
34 |     PT = "pt"
35 |     DA = "da"
36 |     EN = "en"
37 |     FI = "fi"
38 |     LV = "lv"
39 |     PL = "pl"
40 |     RO = "ro"
41 |     FR = "fr"
42 |     LT = "lt"
43 |     SK = "sk"
44 |     SV = "sv"
45 |     CS = "cs"
46 |     EL = "el"
47 |
48 |     @classmethod
49 |     def has_value(cls, value):
50 |         return value in cls._value2member_map_
51 |
52 |
53 | def load_segments_from_pkl(pkl_path, min_duration):
54 |     with open(pkl_path, "rb") as f:
55 |         annotation = pkl.load(f)
56 |     segments = [
57 |         (round(segment.start, 3), round(segment.end, 3), label)
58 |         for segment, track, label in annotation.itertracks(yield_label=True)
59 |     ]
60 |     segments = [(s, t, l) for s, t, l in segments if t - s >= min_duration]
61 |     return segments
62 |
63 |
64 | def get_pyannote_segments(path_audio, pyannote_cfg, min_duration=0.1):
65 |     pkl_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.pkl"
66 |     if pkl_path.is_file():
67 |         return load_segments_from_pkl(pkl_path, min_duration)
68 |
69 |     json_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.json"
70 |     if json_path.is_file():
71 |         with open(json_path, "r") as f:
72 |             segments = json.load(f)
73 |         return [(s, t, l) for s, t, l in segments if t - s >= min_duration]
74 |
75 |     raise FileNotFoundError(f"{pkl_path} and {json_path} not found")
76 |
77 |
78 | def is_id_valid(name: str):
79 |
80 |     # An id should have the following format
81 |     # YYYYMMDD-XXXX-[NAME]
82 |     # YYYYMMDD : is the date of the session
83 |     # XXXX : is a 4-digit identification number
84 |     # [NAME] : can be any string
85 |
86 |     data = name.split("-")
87 |     if len(data) < 3:
88 |         return False
89 |
90 |     date = data[0]
91 |     if len(date) != 8 or any((not x.isdigit()) for x in date):
92 |         return False
93 |
94 |     if int(date[4:6]) > 12:
95 |         return False
96 |     if int(date[6:]) > 31:
97 |         return False
98 |
99 |     session_id = data[1]
100 |     if any((not x.isdigit()) for x in session_id):
101 |         return False
102 |
103 |     return True
104 |
105 |
106 | def get_batches(list_like, batch_size: int):
107 |     for i in list(range(0, len(list_like), batch_size)):
108 |         yield list_like[i : min(i + batch_size, len(list_like))]
109 |
110 |
111 | def is_plenary(_id: str):
112 |     return _id.find("-PLENARY") > -1
113 |
114 |
115 | def to_wav2letter_format(data: torch.tensor, sr: int) -> torch.tensor:
116 |     r"""
117 |     Wav2letter needs mono 16kHz inputs
118 |     """
119 |     if len(data.size()) == 2:
120 |         data = data.mean(dim=0, keepdim=True)
121 |     elif len(data.size()) == 1:
122 |         data = data.view(1, -1)
123 |     else:
124 |         raise ValueError("Invalid tensor format")
125 |     if sr != 16000:
126 |         data = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(data)
127 |     data = torch.clamp(data, min=-1.0, max=1.0)
128 |     return data
129 |
130 |
131 | def correct_name_fbcluster_output(name_in: str) -> str:
132 |     r"""A quick patch to solve some discrepancies in the output names
133 |     of the align / WER pipelines without having to relaunch everything"""
134 |
135 |     split_ = name_in.split("-")
136 |     if len(split_) == 3:
137 |         return "-".join(split_[:2])
138 |
139 |
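    # names without exactly three dash-separated fields are returned unchanged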
return name_in 140 | 141 | 142 | def get_all_years_for_lang(path_root: Union[str, Path], lang: str) -> List[str]: 143 | path_lang = Path(path_root) / lang 144 | return [ 145 | x.stem 146 | for x in path_lang.glob("*") 147 | if (len(x.stem) == 4 and x.is_dir() and all(p.isdigit() for p in x.stem)) 148 | ] 149 | 150 | 151 | def get_all_sessions_lang_year(path_root: Path, lang: str, year: str) -> List[str]: 152 | 153 | audio = list((path_root / lang / year).glob(f"*_{lang}.ogg")) 154 | return [x.stem.split("_")[0] for x in audio] 155 | 156 | 157 | def get_path_full_audio(path_root: Path, session_id: str, lang: str) -> Path: 158 | year = session_id[:4] 159 | return path_root / lang / year / f"{session_id}_{lang}.ogg" 160 | 161 | 162 | def get_all_audio_for_lang(path_root: Path, lang: str) -> List[Path]: 163 | 164 | audio_paths = [] 165 | years = get_all_years_for_lang(path_root, lang) 166 | for year in years: 167 | all_sessions = get_all_sessions_lang_year(path_root, lang, year) 168 | loc = [ 169 | get_path_full_audio(path_root, session_id, lang) 170 | for session_id in all_sessions 171 | ] 172 | audio_paths += loc 173 | return audio_paths 174 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/cut_from_labels.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | import soundfile as sf 6 | import csv 7 | import argparse 8 | import tqdm 9 | import numpy as np 10 | import ast 11 | from pathlib import Path 12 | from voxpopuli.segmentation import Timestamp, get_path_full_audio 13 | from typing import Callable, Dict, List, Tuple 14 | from multiprocessing import Pool 15 | 16 | 17 | VadData = List[Timestamp] 18 | 19 | 20 | def parse_seq_path(seq_path: str) -> Tuple[str, str, str]: 21 | out = seq_path.split("/") 22 | assert len(out) == 3 23 | return out[0], out[1], out[2] 24 | 25 | 26 | def get_path_paragraph(row, idx_: Dict[str, int]) -> Path: 27 | base_path = Path(row[idx_["session_id"]]) / row[idx_["paragraph_id"]] 28 | if "lang" in idx_: 29 | base_path = Path(row[idx_["lang"]]) / base_path 30 | return base_path 31 | 32 | 33 | def get_path_fully_segmented(row, idx_: Dict[str, int]) -> Path: 34 | return get_path_paragraph(row, idx_) / row[idx_["id_"]] 35 | 36 | 37 | def get_ts_base(row, idx_: Dict[str, int]) -> List[Timestamp]: 38 | return [Timestamp(float(row[idx_["start_time"]]), float(row[idx_["end_time"]]))] 39 | 40 | 41 | def get_ts_speaker(row, idx_: Dict[str, int]) -> List[Timestamp]: 42 | return [ 43 | Timestamp(float(row[idx_["speaker_start"]]), float(row[idx_["speaker_end"]])) 44 | ] 45 | 46 | 47 | def get_ts_vad(row, idx_: Dict[str, int]) -> List[Timestamp]: 48 | vad = ast.literal_eval(row[idx_["vad"]]) 49 | return [Timestamp(x[0], x[1]) for x in vad] 50 | 51 | 52 | def load_annot_file( 53 | path_input: Path, 54 | path_extractor: Callable, 55 | timestamp_extractor: Callable, 56 | suffix: str = ".flac", 57 | ) -> Dict[Tuple[str, str], Dict[Path, VadData]]: 58 | with open(path_input, "r") as csvfile: 59 | data = csv.reader(csvfile, delimiter="|") 60 | 61 | names = next(data) 62 | idx_ = {x: i for i, x in enumerate(names)} 63 | idx_name = idx_["session_id"] 64 | idx_lang = idx_.get("lang", None) 65 | 66 | out = {} 67 | for row in data: 68 | session_name = row[idx_name] 69 | path_seq = path_extractor(row, 
idx_).with_suffix(suffix) 70 | vad = timestamp_extractor(row, idx_) 71 | lang = "original" if idx_lang is None else row[idx_lang] 72 | 73 | index = session_name, lang 74 | if index not in out: 75 | out[index] = {} 76 | out[index][path_seq] = vad 77 | 78 | return out 79 | 80 | 81 | def cut_session( 82 | root_original: Path, 83 | root_out: Path, 84 | session_name: str, 85 | ts_2_names: Dict[str, List[Timestamp]], 86 | lang: str, 87 | ) -> None: 88 | 89 | sound, sr = sf.read(str(get_path_full_audio(root_original, session_name, lang))) 90 | for loc_path, vad in ts_2_names.items(): 91 | full_path = root_out / loc_path 92 | full_path.parent.mkdir(exist_ok=True, parents=True) 93 | sf.write( 94 | full_path, 95 | cut_with_vad(sound, sr, vad), 96 | sr, 97 | subtype="PCM_16", 98 | ) 99 | 100 | 101 | def cut_with_vad(sound: np.array, sr: int, vad: List[Timestamp]) -> np.array: 102 | 103 | out = [] 104 | for ts in vad: 105 | out += [sound[int(ts.t_start * sr) : int(ts.t_end * sr)]] 106 | return np.concatenate(out, axis=0) 107 | 108 | 109 | class FileSegmenter: 110 | def __init__( 111 | self, 112 | root_original: Path, 113 | root_out: Path, 114 | annot_dict: Dict[str, Dict[Path, List[Timestamp]]] 115 | ): 116 | 117 | self.root_original = root_original 118 | self.root_out = root_out 119 | self.annot_dict = annot_dict 120 | 121 | def cut_session(self, session_id_lang: Tuple[str, str]): 122 | session_id, lang = session_id_lang 123 | cut_session( 124 | self.root_original, 125 | self.root_out, 126 | session_id, 127 | self.annot_dict[session_id_lang], 128 | lang, 129 | ) 130 | 131 | def run(self, n_procs: int = 8): 132 | 133 | with Pool(processes=n_procs) as pool: 134 | for _ in tqdm.tqdm( 135 | pool.imap_unordered(self.cut_session, self.annot_dict), 136 | total=len(self.annot_dict), 137 | ): 138 | pass 139 | 140 | 141 | def main(args): 142 | 143 | path_data = Path(args.root_original) 144 | path_out = Path(args.output) 145 | path_annotations = Path(args.tsv_file) 146 | 147 | path_extractor = get_path_fully_segmented 148 | if args.mode == "labelled": 149 | timestamp_extractor = get_ts_vad 150 | elif args.mode == "per_speaker_vad": 151 | timestamp_extractor = get_ts_base 152 | elif args.mode == "per_speaker": 153 | timestamp_extractor = get_ts_speaker 154 | path_extractor = get_path_paragraph 155 | else: 156 | raise RuntimeError(f"Invalid mode {args.mode}") 157 | 158 | annot_dict = load_annot_file(path_annotations, path_extractor, timestamp_extractor) 159 | segmenter = FileSegmenter(path_data, path_out, annot_dict) 160 | segmenter.run(n_procs=args.n_procs) 161 | 162 | 163 | if __name__ == "__main__": 164 | 165 | parser = argparse.ArgumentParser("Segment the data from the given .tsv file. 
" 166 | "Can be used for a customed segmentation of the 10k timetsamps") 167 | parser.add_argument( 168 | "--root_original", 169 | help="Root directory where the original data are stored.", 170 | type=str, 171 | required=True, 172 | ) 173 | parser.add_argument( 174 | "--tsv_file", 175 | help="Path to the .tsv file containing the labels.", 176 | type=str, 177 | required=True, 178 | ) 179 | parser.add_argument( 180 | "-o", "--output", help="Path to the outpit directory.", type=str, required=True 181 | ) 182 | parser.add_argument( 183 | "--n-procs", help="Number of processes to run", type=int, default=8 184 | ) 185 | parser.add_argument( 186 | "--lang", help="Lang to consider", type=str, required=True 187 | ) 188 | parser.add_argument( 189 | "--mode", 190 | required=True, 191 | type=str, 192 | choices=["labelled", "per_speaker", "per_speaker_vad"], 193 | help="labelled to segment the labelled data. " 194 | "per_speaker to cut the 10k data per speaker " 195 | "per_speaker_vad to add the vad of top of the segmentation of the 10k data." 196 | ) 197 | 198 | main(parser.parse_args()) 199 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/cut_with_align_files.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import argparse 7 | import torchaudio 8 | import shutil 9 | import os 10 | import torch 11 | import json 12 | import string 13 | from pathlib import Path 14 | from typing import NamedTuple, List, Optional, Set, Tuple 15 | from multiprocessing import Pool 16 | 17 | from voxpopuli.text.wer_tools import ( 18 | WordAlignFile, 19 | load_word_align_file, 20 | get_partial_transcriptions, 21 | get_wer, 22 | get_ler, 23 | create_word_align_file, 24 | reinsert_punctuation, 25 | ) 26 | from voxpopuli.text.word_align_tools import ( 27 | AlignedData, 28 | AlignedWord, 29 | cut_align_data, 30 | load_audio_align_wav2letter, 31 | ) 32 | 33 | from voxpopuli.segmentation import is_id_valid, to_wav2letter_format 34 | 35 | 36 | class CutIndex(NamedTuple): 37 | index_word: int 38 | index_align: int 39 | 40 | 41 | class SilCutConfig(NamedTuple): 42 | padding_start: float 43 | padding_end: float 44 | min_size_sil: float 45 | min_size_audio: Optional[float] = None 46 | 47 | 48 | class FullSegConfig(NamedTuple): 49 | segmentation_cfg: SilCutConfig 50 | vad_cfg: SilCutConfig 51 | target_size_segment: int 52 | sil_symbol: str = "$" 53 | 54 | 55 | def save_timestamp(ts_segmentation, ts_vad, path_out): 56 | 57 | out = { 58 | "start": ts_segmentation[0], 59 | "end": ts_segmentation[1], 60 | "vad": [(x[0] + ts_segmentation[0], x[1] + ts_segmentation[0]) for x in ts_vad], 61 | } 62 | 63 | with open(path_out, "w") as f: 64 | json.dump(out, f, indent=2) 65 | 66 | 67 | def save_transcription(target: str, decoded: str, path_out: Path): 68 | path_out = path_out.with_suffix(".json") 69 | out = { 70 | "target": target, 71 | "decoded": decoded, 72 | "wer": get_wer(target, decoded), 73 | "ler": get_ler(target, decoded), 74 | } 75 | 76 | with open(path_out, "w", encoding="utf8") as file: 77 | json.dump(out, file, indent=2, ensure_ascii=False) 78 | 79 | 80 | def add_punc_from_tsv(path_tsv, align_text, chars, punc): 81 | 82 | with open(path_tsv, "r") as f: 83 | text = f.read() 84 | return reinsert_punctuation(text, align_text, chars, punc) 85 | 86 | 
87 | def cut_with_segment( 88 | data: torch.tensor, 89 | sr: int, 90 | audio_align_data: AlignedData, 91 | index_align: List[int], 92 | padding_start: float = 0.1, 93 | padding_end: float = 0.2, 94 | ) -> List[torch.tensor]: 95 | 96 | last_start = 0 97 | out = [] 98 | timestamps = [] 99 | if len(index_align) == 0: 100 | return [data], [(0, data.size(0) / sr)] 101 | 102 | for cut_index in index_align: 103 | 104 | last_end = audio_align_data.data[cut_index].start + padding_end 105 | s = int(last_start * sr) 106 | e = int(last_end * sr) 107 | out.append(data[s:e]) 108 | timestamps.append((last_start, last_end)) 109 | last_start = max(last_end, audio_align_data.data[cut_index].end - padding_start) 110 | 111 | if index_align[-1] < len(audio_align_data[-1]): 112 | s = int(last_start * sr) 113 | out.append(data[s:]) 114 | timestamps.append((last_start, data.size(0) / sr)) 115 | 116 | return out, timestamps 117 | 118 | 119 | def segment_word_align( 120 | audio_align_data: AlignedData, 121 | word_align_data: WordAlignFile, 122 | sil_symbol: str = "$", 123 | size_min_sil: float = 0.5, 124 | target_size_segment: float = 1, 125 | punc_mark=None, 126 | ) -> List[CutIndex]: 127 | 128 | out = [] 129 | cum_size = 0 130 | index_target_transcription = 0 131 | index_char_transcription = 0 132 | 133 | if punc_mark is None: 134 | punc_mark = [] 135 | 136 | target = word_align_data.target.split() 137 | 138 | for index_align, align in enumerate(audio_align_data.data[:-1]): 139 | 140 | curr_size = align.end - align.start 141 | cum_size += curr_size 142 | 143 | if align.word != sil_symbol: 144 | has_punc = target[index_target_transcription][-1] in punc_mark 145 | if not has_punc: 146 | if align.word != target[index_target_transcription]: 147 | print(word_align_data.file_id) 148 | print(word_align_data.target) 149 | print(align.word, target[index_target_transcription]) 150 | assert align.word == target[index_target_transcription] 151 | index_char_transcription += len(target[index_target_transcription]) + 1 152 | index_target_transcription += 1 153 | continue 154 | if index_align == 0: 155 | continue 156 | if word_align_data.align_path[index_target_transcription].action != "=": 157 | continue 158 | if not has_punc: 159 | if cum_size < target_size_segment: 160 | continue 161 | if curr_size < size_min_sil: 162 | continue 163 | 164 | if has_punc: 165 | index_char_transcription += 1 166 | 167 | out.append( 168 | CutIndex(index_word=index_char_transcription - 1, index_align=index_align) 169 | ) 170 | cum_size = 0 171 | 172 | return out 173 | 174 | 175 | def cut_sils( 176 | data: torch.tensor, 177 | sr: int, 178 | audio_align_data: AlignedData, 179 | padding_start: float = 0.5, 180 | padding_end: float = 0.2, 181 | sil_symbol: str = "$", 182 | min_size_sil: float = 0.8, 183 | min_size_audio: float = 0.5, 184 | ) -> List[torch.tensor]: 185 | 186 | out = [] 187 | start = 0 188 | ts_vad = [] 189 | for align in audio_align_data.data: 190 | 191 | if align.word == sil_symbol: 192 | if align.end - align.start > min_size_sil: 193 | end = align.start + padding_end 194 | if end - start > min_size_audio: 195 | s = int(start * sr) 196 | e = int(end * sr) 197 | out.append(data[s:e]) 198 | ts_vad.append((start, end)) 199 | start = max(end, align.end - padding_start) 200 | if float(data.size(0)) / sr - start > min_size_audio: 201 | s = int(start * sr) 202 | out.append(data[s:]) 203 | ts_vad.append((start, data.size(0) / sr)) 204 | 205 | if len(out) > 0: 206 | return torch.cat(out, dim=0), ts_vad 207 | else: 208 | return None, None 209 | 
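# cut_sils above acts as an alignment-based VAD: every aligned "$" silence
# longer than min_size_sil is removed (keeping padding_end seconds of its head
# with the left chunk and padding_start seconds of its tail with the right
# chunk), surviving chunks shorter than min_size_audio are dropped, and the
# rest are concatenated into a single tensor.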
210 | 211 | def remove_extremities( 212 | data: torch.tensor, 213 | sr: int, 214 | audio_align_data: AlignedData, 215 | padding_start: float = 0.5, 216 | padding_end: float = 0.2, 217 | sil_symbol: str = "$", 218 | ) -> Tuple[torch.tensor, AlignedData]: 219 | 220 | index_start = 0 221 | while audio_align_data.data[index_start].word == sil_symbol: 222 | index_start += 1 223 | 224 | index_end = -1 225 | while audio_align_data.data[index_end].word == sil_symbol: 226 | index_end -= 1 227 | 228 | start = max(0, audio_align_data.data[index_start].start - padding_start) 229 | out_data = [ 230 | AlignedWord(max(0, x.start - start), max(0, x.end - start), x.word) 231 | for x in audio_align_data.data[index_start : index_end + 1] 232 | ] 233 | e = int( 234 | min(data.size(0), (audio_align_data.data[index_end].end + padding_end) * sr) 235 | ) 236 | s = int(start * sr) 237 | return data[s:e], AlignedData(audio_align_data.file_id, out_data) 238 | 239 | 240 | def get_matches( 241 | word_align_file: List[WordAlignFile], audio_align_file: List[AlignedData] 242 | ) -> List[Tuple[WordAlignFile, AlignedData]]: 243 | 244 | word_align_file.sort(key=lambda x: x.file_id) 245 | audio_align_file.sort(key=lambda x: x.file_id) 246 | 247 | i_ = 0 248 | out = [] 249 | max_i = len(audio_align_file) 250 | for w_d in word_align_file: 251 | while i_ < max_i and audio_align_file[i_].file_id < w_d.file_id: 252 | i_ += 1 253 | 254 | if i_ < max_i and audio_align_file[i_].file_id == w_d.file_id: 255 | out.append((w_d, audio_align_file[i_])) 256 | 257 | return out 258 | 259 | 260 | def process_file( 261 | word_align_file: WordAlignFile, 262 | audio_align_file: AlignedData, 263 | path_audio: Path, 264 | dir_out: Path, 265 | full_seg_cfg: FullSegConfig, 266 | punc_mark=None, 267 | ) -> None: 268 | 269 | name_out = word_align_file.file_id 270 | dir_out.mkdir(exist_ok=True) 271 | cut_index = segment_word_align( 272 | audio_align_file, 273 | word_align_file, 274 | sil_symbol=full_seg_cfg.sil_symbol, 275 | size_min_sil=full_seg_cfg.segmentation_cfg.min_size_sil, 276 | target_size_segment=full_seg_cfg.target_size_segment, 277 | punc_mark=punc_mark, 278 | ) 279 | 280 | trans_list = get_partial_transcriptions( 281 | word_align_file, [x.index_word for x in cut_index] 282 | ) 283 | 284 | audio, sr = torchaudio.load(path_audio) 285 | audio = to_wav2letter_format(audio, sr) 286 | audio = audio.mean(dim=0) 287 | sr = 16000 288 | 289 | segs, ts_segmentation = cut_with_segment( 290 | audio, 291 | sr, 292 | audio_align_file, 293 | [x.index_align for x in cut_index], 294 | padding_start=full_seg_cfg.segmentation_cfg.padding_start, 295 | padding_end=full_seg_cfg.segmentation_cfg.padding_end, 296 | ) 297 | new_align = cut_align_data( 298 | audio_align_file, 299 | [x.index_align for x in cut_index], 300 | sil_symbol=full_seg_cfg.sil_symbol, 301 | padding_start=full_seg_cfg.segmentation_cfg.padding_start, 302 | padding_end=full_seg_cfg.segmentation_cfg.padding_end, 303 | ) 304 | 305 | for index, seg in enumerate(segs): 306 | 307 | seg, curr_align = remove_extremities(seg, sr, new_align[index]) 308 | seg_no_sil, ts_vad = cut_sils( 309 | seg, 310 | sr, 311 | curr_align, 312 | min_size_sil=full_seg_cfg.vad_cfg.min_size_sil, 313 | padding_start=full_seg_cfg.vad_cfg.padding_start, 314 | padding_end=full_seg_cfg.vad_cfg.padding_end, 315 | min_size_audio=full_seg_cfg.vad_cfg.min_size_audio, 316 | ) 317 | 318 | if seg_no_sil is None: 319 | continue 320 | 321 | if seg_no_sil.size(0) == 0: 322 | continue 323 | 324 | path_out = dir_out / 
f"{name_out}_{index}.flac" 325 | torchaudio.save(str(path_out), seg_no_sil, sr) 326 | path_trans = dir_out / f"{name_out}_{index}_trans.json" 327 | target, decoded = trans_list[index] 328 | save_transcription(target, decoded, path_trans) 329 | 330 | path_timestamps = dir_out / f"{name_out}_{index}_timestamps.json" 331 | save_timestamp(ts_segmentation[index], ts_vad, path_timestamps) 332 | 333 | 334 | def process_session_lang( 335 | path_wer: Path, 336 | path_align: Path, 337 | dir_audio: Path, 338 | dir_out: Path, 339 | full_seg_cfg: FullSegConfig, 340 | max_wer: Optional[float] = None, 341 | max_ler: Optional[float] = None, 342 | chars=string.ascii_lowercase, 343 | punc_mark=None, 344 | ): 345 | 346 | word_align_data = load_word_align_file(path_wer) 347 | audio_align_data = load_audio_align_wav2letter(path_align) 348 | 349 | if max_wer is not None: 350 | word_align_data = [x for x in word_align_data if x.wer < max_wer] 351 | 352 | if max_ler is not None: 353 | word_align_data = [x for x in word_align_data if x.ler < max_ler] 354 | 355 | matches = get_matches(word_align_data, audio_align_data) 356 | print(f"{path_wer.stem} : {len(matches)} matches found") 357 | 358 | for w_d, a_d in matches: 359 | align_text = " ".join([x.word for x in a_d.data if x.word != "$"]) 360 | if len(align_text) == 0: 361 | continue 362 | try: 363 | if punc_mark is not None: 364 | path_tsv = dir_audio / f"{w_d.file_id}.tsv" 365 | align_text = add_punc_from_tsv(path_tsv, align_text, chars, punc_mark) 366 | final_wd = create_word_align_file(w_d.file_id, align_text, w_d.decoded) 367 | dir_session = dir_out / final_wd.file_id 368 | path_audio = dir_audio / f"{final_wd.file_id}.flac" 369 | if not path_audio.is_file(): 370 | print(f"ERROR: {str(path_audio)} not found") 371 | continue 372 | dir_out.mkdir(exist_ok=True, parents=True) 373 | process_file( 374 | final_wd, 375 | a_d, 376 | path_audio, 377 | dir_session, 378 | full_seg_cfg, 379 | punc_mark=punc_mark, 380 | ) 381 | 382 | path_speaker = dir_audio / f"{final_wd.file_id}.speaker" 383 | path_out_speaker = dir_session / f"{final_wd.file_id}.speaker" 384 | if path_out_speaker.is_file(): 385 | os.remove(path_out_speaker) 386 | shutil.copyfile(path_speaker, path_out_speaker) 387 | except FileNotFoundError: 388 | continue 389 | 390 | 391 | class FinalAudioSegmenter: 392 | def __init__( 393 | self, 394 | root_audio: Path, 395 | root_wer: Path, 396 | root_align: Path, 397 | root_out: Path, 398 | lang: str, 399 | full_seg_cfg: FullSegConfig, 400 | max_wer: Optional[float] = None, 401 | max_ler: Optional[float] = None, 402 | chars=string.ascii_lowercase, 403 | punc_mark=";.?!", 404 | ): 405 | 406 | self.root_audio = root_audio 407 | self.root_wer = root_wer 408 | self.root_align = root_align 409 | self.root_out = root_out 410 | self.full_seg_cfg = full_seg_cfg 411 | self.max_wer = max_wer 412 | self.max_ler = max_ler 413 | self.lang = lang 414 | self.chars = chars 415 | self.punc_mark = punc_mark 416 | 417 | def processs_session(self, session_id: str): 418 | 419 | path_wer = self.root_wer / f"{session_id}_{self.lang}_wer_no_lm_wav2letter.json" 420 | path_align = self.root_align / f"{session_id}_{self.lang}_align_wav2letter.txt" 421 | 422 | dir_audio = self.get_dir_paragraph(session_id) 423 | 424 | if not dir_audio.is_dir(): 425 | raise RuntimeError(f"ERROR: paragraph data not found at {dir_audio}") 426 | 427 | dir_out = self.root_out / session_id 428 | process_session_lang( 429 | path_wer, 430 | path_align, 431 | dir_audio, 432 | dir_out, 433 | self.full_seg_cfg, 434 | 
self.max_wer, 435 | self.max_ler, 436 | chars=self.chars, 437 | punc_mark=self.punc_mark, 438 | ) 439 | 440 | def get_dir_paragraph(self, session_id: str): 441 | return self.root_audio / "original" / session_id / "paragraphs" 442 | 443 | def process_db(self, session_ids: List[str], num_proc: int = 8): 444 | 445 | print(f"Launching the segmentation on {len(session_ids)} sessions") 446 | with Pool(num_proc) as pool: 447 | out = list( 448 | pool.imap_unordered(self.processs_session, session_ids, chunksize=30) 449 | ) 450 | 451 | 452 | def get_session_ids(root_align: Path, root_wer: Path, lang: str) -> Set[str]: 453 | 454 | files_align = [ 455 | x.name 456 | for x in root_align.glob(f"*_{lang}_align_wav2letter.txt") 457 | if is_id_valid(x.name[:-24]) 458 | ] 459 | files_wer = [ 460 | x.name 461 | for x in root_wer.glob(f"*_{lang}_wer_no_lm_wav2letter.json") 462 | if is_id_valid(x.name[:-29]) 463 | ] 464 | ids_align = {x[:-24] for x in files_align} 465 | ids_wer = {x[:-29] for x in files_wer} 466 | 467 | return ids_align.intersection(ids_wer) 468 | 469 | 470 | if __name__ == "__main__": 471 | 472 | parser = argparse.ArgumentParser( 473 | "Using the decoded data and the word alignment, segment the labelled " 474 | "sequences in small chunk with their estimated WER" 475 | ) 476 | parser.add_argument( 477 | "--dir_wer", 478 | type=str, 479 | required=True, 480 | help="Directory containing the decoding output", 481 | ) 482 | parser.add_argument( 483 | "--dir_align", 484 | type=str, 485 | required=True, 486 | help="Directory containing the alignment output", 487 | ) 488 | parser.add_argument( 489 | "--dir_audio", 490 | type=str, 491 | required=True, 492 | help="Directory containing the audio data", 493 | ) 494 | parser.add_argument( 495 | "--n_proc", 496 | type=int, 497 | default=8, 498 | help="Number of processes to use", 499 | ) 500 | parser.add_argument("--lang", type=str, required=True, help="Language Code.") 501 | parser.add_argument( 502 | "-o", "--output", type=str, required=True, help="Output directory." 503 | ) 504 | parser_segmentation = parser.add_argument_group("Segmentation parameters") 505 | parser_segmentation.add_argument( 506 | "--target_size_segment", 507 | type=int, 508 | default=20, 509 | help="Target size of each segment", 510 | ) 511 | parser_segmentation.add_argument( 512 | "--padding_start_seg", 513 | type=float, 514 | default=0.4, 515 | help="Padding start segmentation", 516 | ) 517 | parser_segmentation.add_argument( 518 | "--padding_end_seg", type=float, default=0.4, help="Padding end segmentation" 519 | ) 520 | parser_segmentation.add_argument( 521 | "--min_size_sil_seg", 522 | type=float, 523 | default=0.7, 524 | help="Minimum size of a silence when cutting a sequence.", 525 | ) 526 | parser_segmentation.add_argument( 527 | "--max_wer", 528 | type=float, 529 | default=None, 530 | help="Ignores all sequences with a Word Error Rate (WER) higher than " 531 | "the given value", 532 | ) 533 | parser_segmentation.add_argument( 534 | "--max_ler", 535 | type=float, 536 | default=100, 537 | help="Ignores all sequences with a Letter Error Rate (LER) higher than " 538 | "the given value", 539 | ) 540 | parser_segmentation.add_argument( 541 | "--ignore_punctuation", 542 | action="store_true", 543 | help="Activates to ignore all punctuation and cut only by silence.", 544 | ) 545 | parser_segmentation.add_argument( 546 | "--path_chars", 547 | type=str, 548 | default=None, 549 | help="Path to the char file containing the tokens of the considered language. 
(Default tokens are English Latin letters)",
550 |     )
551 |     parser_sil = parser.add_argument_group("VAD extraction parameters")
552 |     parser_sil.add_argument(
553 |         "--padding_start_vad",
554 |         type=float,
555 |         default=0.2,
556 |         help="Padding start for VAD extraction",
557 |     )
558 |     parser_sil.add_argument(
559 |         "--padding_end_vad", type=float, default=0.5, help="Padding end for VAD extraction"
560 |     )
561 |     parser_sil.add_argument(
562 |         "--min_size_sil_vad",
563 |         type=float,
564 |         default=1,
565 |         help="Minimum size of a silence when considering voice activity.",
566 |     )
567 |     parser_sil.add_argument(
568 |         "--min_size_audio_vad",
569 |         type=float,
570 |         default=0.5,
571 |         help="Isolated audio segments smaller than the given threshold will"
572 |         " be removed",
573 |     )
574 |     args = parser.parse_args()
575 |
576 |     args.dir_audio = Path(args.dir_audio)
577 |     args.dir_wer = Path(args.dir_wer)
578 |     args.dir_align = Path(args.dir_align)
579 |     args.output = Path(args.output)
580 |
581 |     seg_cfg = SilCutConfig(
582 |         padding_start=args.padding_start_seg,
583 |         padding_end=args.padding_end_seg,
584 |         min_size_sil=args.min_size_sil_seg,
585 |     )
586 |     vad_cfg = SilCutConfig(
587 |         padding_start=args.padding_start_vad,
588 |         padding_end=args.padding_end_vad,
589 |         min_size_sil=args.min_size_sil_vad,
590 |         min_size_audio=args.min_size_audio_vad,
591 |     )
592 |     full_seg_cfg = FullSegConfig(
593 |         segmentation_cfg=seg_cfg,
594 |         vad_cfg=vad_cfg,
595 |         target_size_segment=args.target_size_segment,
596 |     )
597 |
598 |     id_list = get_session_ids(args.dir_align, args.dir_wer, args.lang)
599 |     print(f"{len(id_list)} sessions found")
600 |
601 |     letters = string.ascii_lowercase
602 |     if args.path_chars is not None:
603 |         with open(args.path_chars, "r") as f:
604 |             letters = "".join([x.strip() for x in f.readlines()])
605 |
606 |     punc_mark = None if args.ignore_punctuation else ".;?!"
607 |
608 |     segmenter = FinalAudioSegmenter(
609 |         args.dir_audio,
610 |         args.dir_wer,
611 |         args.dir_align,
612 |         args.output,
613 |         args.lang,
614 |         full_seg_cfg,
615 |         max_wer=args.max_wer,
616 |         max_ler=args.max_ler,
617 |         chars=letters,
618 |         punc_mark=punc_mark,
619 |     )
620 |     segmenter.process_db(list(id_list), num_proc=args.n_proc)
621 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/get_segment_pyannote_speaker.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
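# Example invocation (see extension.md; flags match the argparse options below):
#   python -m voxpopuli.segmentation.get_segment_pyannote_speaker \
#       --root [ROOT] --languages [LANGUAGE_LIST] -o [OUTPUT_DIR] \
#       --max-dur-vad [MAX_SEGMENT_DURATION_IN_SECONDS]
# run_pyannote_sd.py must have been run first so that the *.pyannote.*.pkl
# (or .json) files sit next to each session audio.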
5 | 6 | import os 7 | import argparse 8 | import shutil 9 | from tqdm import tqdm 10 | from pathlib import Path 11 | from typing import List, Tuple, Union 12 | from multiprocessing import Pool 13 | 14 | from auditok import AudioRegion 15 | import soundfile as sf 16 | 17 | from voxpopuli.segmentation import ( 18 | get_all_audio_for_lang, get_pyannote_segments, LangCode 19 | ) 20 | 21 | 22 | def save_timestamp(path_out: Union[str, Path], start: float, end: float) -> None: 23 | 24 | with open(path_out, "w") as f: 25 | f.write(f"{start}\t{end}") 26 | 27 | 28 | def load_timestamp(path_data: Union[str, Path]) -> Tuple[float, float]: 29 | with open(path_data, "r") as f: 30 | data = f.readline().strip() 31 | 32 | start, end = data.split() 33 | return float(start), float(end) 34 | 35 | 36 | def get_path_timestamp(path_audio: Union[str, Path], timestamp_suffix: str) -> Path: 37 | return Path(path_audio).with_suffix(timestamp_suffix) 38 | 39 | 40 | def split_with_vad_wav( 41 | wav_path: Path, 42 | out_dir: Path, 43 | min_dur: float, 44 | max_dur: float, 45 | max_silence: float, 46 | strict_min_dur: bool, 47 | shift: float = 0, 48 | ) -> None: 49 | 50 | assert Path(wav_path).suffix == ".wav" 51 | audio_region = AudioRegion.load(str(wav_path)) 52 | out_dir = Path(out_dir) 53 | regions = audio_region.split( 54 | min_dur=min_dur, 55 | max_dur=max_dur, 56 | max_silence=max_silence, 57 | strict_min_dur=strict_min_dur, 58 | ) 59 | 60 | waveform, sr = sf.read(wav_path, dtype="float32") 61 | out = [] 62 | for i, r in enumerate(regions): 63 | start = int(r._meta.start * sr) 64 | end = int(r._meta.end * sr) 65 | path_seg = out_dir / f"{out_dir.stem}_{i}.flac" 66 | path_timestamp = get_path_timestamp(path_seg, ".vad.timestamp") 67 | save_timestamp(path_timestamp, r._meta.start + shift, r._meta.end + shift) 68 | sf.write( 69 | str(path_seg), waveform[start:end], sr, subtype="PCM_16", format="FLAC" 70 | ) 71 | out.append(path_seg) 72 | 73 | return out 74 | 75 | 76 | def split_vad_non_wav( 77 | audio_path: Path, 78 | out_dir: Path, 79 | min_dur: float, 80 | max_dur: float, 81 | max_silence: float, 82 | strict_min_dur: bool, 83 | shift: float = 0, 84 | ) -> None: 85 | path_wav = Path(audio_path).with_suffix(".wav") 86 | to_wav(audio_path, path_wav) 87 | out = split_with_vad_wav( 88 | path_wav, out_dir, min_dur, max_dur, max_silence, strict_min_dur, shift 89 | ) 90 | os.remove(path_wav) 91 | return out 92 | 93 | 94 | def to_wav(path_in: Path, path_out: Path) -> None: 95 | 96 | assert Path(path_out).suffix == ".wav" 97 | waveform, sr = sf.read(str(path_in), dtype="float32") 98 | sf.write(str(path_out), waveform, sr, format="WAV") 99 | 100 | 101 | def split_audio( 102 | audio_path: Path, 103 | segments: List[Tuple[float, float, str]], 104 | out_root: Union[str, Path], 105 | pyannote_suffix: str, 106 | ) -> List[Path]: 107 | 108 | out_root = Path(out_root) 109 | if out_root.is_dir(): 110 | shutil.rmtree(out_root) 111 | out_root.mkdir(parents=True) 112 | 113 | sr = sf.info(audio_path).samplerate 114 | audio_path = Path(audio_path) 115 | 116 | def save_clip(i, start, end): 117 | name = f"{i:03d}_{start:.0f}-{end:.0f}" 118 | out_audio_path = out_root / f"{name}.flac" 119 | save_timestamp(get_path_timestamp(out_audio_path, pyannote_suffix), start, end) 120 | clip, _ = sf.read(audio_path, start=int(start * sr), stop=int(end * sr)) 121 | sf.write(out_audio_path, clip, sr, subtype="PCM_16", format="FLAC") 122 | return out_audio_path 123 | 124 | last_start, last_end, last_speaker = segments[0] 125 | 126 | out_paths = [] 127 | 
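    # merge runs of consecutive segments sharing a speaker label: a clip is
    # only flushed when the speaker changes, so each output file covers one
    # uninterrupted speaker turn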
for i, (start_t, end_t, speaker) in enumerate(segments): 128 | if speaker == last_speaker: 129 | last_end = end_t 130 | continue 131 | out_audio_path = save_clip(i, last_start, last_end) 132 | last_start = start_t 133 | last_speaker = speaker 134 | last_end = end_t 135 | out_paths.append(out_audio_path) 136 | 137 | out_paths.append(save_clip(len(segments), last_start, last_end))  # the final span must be returned too, so the VAD pass below also processes it 138 | 139 | return out_paths 140 | 141 | 142 | def get_segments( 143 | path_audio, pyannote_cfg, min_duration 144 | ) -> Optional[List[Tuple[float, float, str]]]: 145 | try: 146 | return get_pyannote_segments( 147 | path_audio, pyannote_cfg, min_duration=min_duration 148 | ) 149 | except FileNotFoundError: 150 | return None 151 | 152 | 153 | class FileSegmenter: 154 | def __init__( 155 | self, 156 | root_in: str, 157 | out_dir: str, 158 | pyannote_cfg="sad_ami", 159 | min_duration=1.0, 160 | split_vad=True, 161 | min_dur_vad=15, 162 | max_dur_vad=30, 163 | max_silence_vad=1.5, 164 | strict_min_dur_vad=True, 165 | ): 166 | 167 | self.root_in = root_in 168 | self.out_dir = out_dir 169 | self.pyannote_cfg = pyannote_cfg 170 | self.min_duration = min_duration 171 | self.split_vad = split_vad 172 | self.min_dur_vad = min_dur_vad 173 | self.max_dur_vad = max_dur_vad 174 | self.max_silence_vad = max_silence_vad 175 | self.strict_min_dur_vad = strict_min_dur_vad 176 | 177 | def get_root_lang_id(self, id_: str, lang_code: str) -> Path: 178 | return Path(self.root_in) / id_ / f"{id_}_{lang_code}" 179 | 180 | def get_out_root(self, id_, lang_code) -> Path: 181 | return Path(self.out_dir) / lang_code / id_ / "paragraphs" 182 | 183 | def split_audio(self, audio_path: Path): 184 | 185 | 186 | lang = audio_path.stem.split("_")[-1] 187 | id_ = audio_path.stem.split("_")[0] 188 | if not audio_path.exists(): 189 | return False 190 | segments = get_segments(audio_path, self.pyannote_cfg, self.min_duration) 191 | if segments is None: 192 | return False 193 | 194 | out_root = self.get_out_root(id_, lang) 195 | 196 | pyannote_suffix = f".pyannote.{self.pyannote_cfg}" 197 | out_audio = split_audio(audio_path, segments, out_root, pyannote_suffix) 198 | 199 | if not self.split_vad: 200 | return True 201 | 202 | for audio_path in out_audio: 203 | dir_out = audio_path.parent / audio_path.stem 204 | dir_out.mkdir() 205 | path_timestamp_audio = get_path_timestamp(audio_path, pyannote_suffix) 206 | shift = load_timestamp(path_timestamp_audio)[0] 207 | vad_seq = split_vad_non_wav( 208 | audio_path, 209 | dir_out, 210 | min_dur=self.min_dur_vad, 211 | max_dur=self.max_dur_vad, 212 | max_silence=self.max_silence_vad, 213 | strict_min_dur=self.strict_min_dur_vad, 214 | shift=shift, 215 | ) 216 | os.remove(audio_path) 217 | if len(vad_seq) == 0: 218 | shutil.rmtree(dir_out) 219 | os.remove(audio_path.with_suffix(f".pyannote.{self.pyannote_cfg}"))  # drop the per-clip timestamp file as well 220 | 221 | return True 222 | 223 | 224 | def get_all(args): 225 | audio_paths = [] 226 | root = Path(args.root) 227 | for lang in args.languages: 228 | audio_paths += get_all_audio_for_lang(root, lang) 229 | if args.max_num is not None: 230 | audio_paths = audio_paths[: args.max_num] 231 | 232 | segmenter = FileSegmenter( 233 | args.root, 234 | args.output, 235 | pyannote_cfg=args.pyannote_cfg, 236 | min_duration=args.min_duration, 237 | split_vad=not args.no_vad, 238 | min_dur_vad=args.min_dur_vad, 239 | max_dur_vad=args.max_dur_vad, 240 | max_silence_vad=args.max_silence_vad, 241 | ) 242 | found = 0 243 | with Pool(args.nproc) as p: 244 | for x in tqdm( 245 | p.imap_unordered(segmenter.split_audio, audio_paths),
total=len(audio_paths) 246 | ): 247 | found += int(x) 248 | 249 | print(f"{found} audio files segmented") 250 | 251 | 252 | def main(): 253 | parser = argparse.ArgumentParser( 254 | description="Cut the data by speaker. " "run_pyannote_sd.py must have been run before" 255 | ) 256 | parser.add_argument("--root", type=str, required=True, help="Input root directory") 257 | parser.add_argument( 258 | "-o", 259 | "--output", 260 | type=str, 261 | default=None, 262 | help="Output directory, if different from the input " "one", 263 | ) 264 | parser.add_argument( 265 | "--languages", 266 | type=str, 267 | nargs="*", 268 | help="If given, the translated languages to deal with", 269 | ) 270 | parser.add_argument( 271 | "--max-num", 272 | default=None, 273 | type=int, 274 | help="If given, maximum number of sessions to deal with", 275 | ) 276 | parser.add_argument("--nproc", default=8, type=int, help="Number of processes") 277 | parser.add_argument( 278 | "--pyannote-cfg", 279 | default="dia_ami", 280 | type=str, 281 | choices=["dia", "dia_ami", "sad_ami"], help="Pyannote configuration.", 282 | ) 283 | parser.add_argument( 284 | "--min-duration", 285 | default=1.0, 286 | type=float, 287 | help="Ignore all speaker segments lasting less than the given number of seconds", 288 | ) 289 | parser.add_argument( 290 | "--no-vad", 291 | action="store_true", 292 | help="Do not apply the VAD after the speaker segmentation", 293 | ) 294 | parser.add_argument( 295 | "--min-dur-vad", 296 | default=15, 297 | type=int, 298 | help="Min size of a sequence (in seconds) after applying the VAD.", 299 | ) 300 | parser.add_argument( 301 | "--max-dur-vad", 302 | default=30, 303 | type=int, 304 | help="Max size of a sequence (in seconds) after applying the VAD.", 305 | ) 306 | parser.add_argument( 307 | "--max-silence-vad", 308 | default=1.5, 309 | type=float, 310 | help="Maximum length of a silence allowed in the voice activity detection" 311 | " (the lower the stricter)", 312 | ) 313 | args = parser.parse_args() 314 | 315 | if args.output is None: 316 | args.output = args.root 317 | 318 | if args.languages is None: 319 | args.languages = [x.value for x in LangCode] 320 | 321 | get_all(args) 322 | 323 | 324 | if __name__ == "__main__": 325 | main() 326 | -------------------------------------------------------------------------------- /voxpopuli/segmentation/run_pyannote_sd.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree.
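# Usage sketch (assumed invocation; <sessions_root> is a placeholder): # python -m voxpopuli.segmentation.run_pyannote_sd --root <sessions_root> -l original --segment-min 10 # The pipelines below call torch.cuda.set_device(), so at least one CUDA device is expected.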
5 | ### 6 | # Run pyannote speaker diarization (SD) models 7 | ### 8 | 9 | import os.path as op 10 | import pickle as pkl 11 | import argparse 12 | import torchaudio 13 | from tqdm import tqdm 14 | import torch 15 | import json 16 | from pathlib import Path 17 | from tempfile import TemporaryDirectory 18 | from typing import List, Optional, Tuple 19 | from voxpopuli.segmentation import get_batches, get_all_audio_for_lang 20 | 21 | 22 | def check(path_audio: Path, pyannote_cfg="dia_ami"): 23 | rttm_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.rttm" 24 | pkl_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.pkl" 25 | if rttm_path.exists() and pkl_path.exists(): 26 | return True 27 | json_path = path_audio.parent / f"{path_audio.stem}.pyannote.{pyannote_cfg}.json" 28 | if json_path.exists(): 29 | return True 30 | return False 31 | 32 | 33 | def segment_audio_overlap( 34 | path_audio: Path, dir_out: Path, max_size_sec: int 35 | ) -> Tuple[List[Path], float]: 36 | 37 | info = torchaudio.info(str(path_audio))[0]  # legacy torchaudio API: info() returns (SignalInfo, EncodingInfo) 38 | s_data = info.length // info.channels 39 | sr = info.rate 40 | frames = int(sr * max_size_sec) 41 | if frames % 2 == 1:  # keep the frame count even so that frames // 2 below is an exact half-window hop 42 | frames += 1 43 | 44 | n_cuts = s_data // frames 45 | if s_data % frames > min(sr, s_data): 46 | n_cuts += 1 47 | 48 | n_cuts += n_cuts - 1  # windows overlap by half, so there are 2n - 1 of them 49 | 50 | out = [] 51 | offset = 0 52 | print(f"{path_audio.parent.name} : {n_cuts} segments to save") 53 | for index in range(n_cuts): 54 | num_frames = min(frames, s_data - offset) 55 | if num_frames <= 0: 56 | break 57 | data = torchaudio.load(str(path_audio), num_frames=num_frames, offset=offset)[0] 58 | path_out = dir_out / f"{path_audio.stem}_{index}.flac" 59 | torchaudio.save(str(path_out), data, sr) 60 | offset += frames // 2 61 | out.append(path_out) 62 | print(f"{path_audio.parent.name} : {n_cuts} segments saved") 63 | 64 | return out, max_size_sec / 2 65 | 66 | 67 | def merge_segments(path_list_pkl: List[Path], size_overlap: float): 68 | 69 | out = [] 70 | shift = 0 71 | last_start = None 72 | for i_pkl, pkl_path in enumerate(path_list_pkl): 73 | with open(pkl_path, "rb") as f: 74 | annotation = pkl.load(f) 75 | segments = [ 76 | ( 77 | shift + round(segment.start, 3), 78 | shift + round(segment.end, 3), 79 | f"{i_pkl}_{label}", 80 | ) 81 | for segment, track, label in annotation.itertracks(yield_label=True) 82 | ] 83 | if len(segments) == 0: 84 | continue 85 | 86 | start_index = 0 87 | if last_start is not None:  # find the segment of this chunk that best matches the last merged one 88 | min_diff = size_overlap + 1 89 | for i, pack in enumerate(segments): 90 | s = pack[0] 91 | d = abs(s - last_start) 92 | if d < min_diff: 93 | min_diff = d 94 | start_index = i 95 | 96 | if len(out) > 0: 97 | s, e, l = segments[start_index] 98 | out[-1] = last_start, e, l  # stitch the junction segment across the two chunks 99 | start_index += 1 100 | 101 | if start_index < len(segments): 102 | out += segments[start_index:] 103 | 104 | if len(out) > 0: 105 | last_start = out[-1][0] 106 | 107 | shift += size_overlap  # each successive chunk starts half a window later 108 | return out 109 | 110 | 111 | def get_segments( 112 | audio_path: str, 113 | device: int = 0, 114 | pyannote_cfg="dia_ami", 115 | max_size_sec: int = 10 * 60, 116 | ): 117 | 118 | print(audio_path) 119 | if not op.exists(audio_path): 120 | return 121 | 122 | torch.cuda.set_device(device) 123 | 124 | id_ = Path(audio_path).parent.name 125 | 126 | with TemporaryDirectory() as tmp_dir: 127 | tmp_dir = Path(tmp_dir) 128 | list_str, overlap = segment_audio_overlap( 129 | Path(audio_path), tmp_dir, max_size_sec 130 | ) 131 | pyannote_pipeline = torch.hub.load( 132 | "pyannote/pyannote-audio", pyannote_cfg, pipeline=True 133 |
) 134 | path_pkls = [] 135 | for index, path_ in enumerate(list_str): 136 | print(f"{id_}: running pyannote on {index + 1} / {len(list_str)}") 137 | sd = pyannote_pipeline({"uri": "filename", "audio": path_}) 138 | rttm_path = (  # one .rttm/.pkl pair per chunk, kept in the temporary directory; deriving these names from audio_path would overwrite the same two files at every iteration 139 | path_.parent / f"{path_.stem}.pyannote.{pyannote_cfg}.rttm" 140 | ) 141 | pkl_path = ( 142 | path_.parent / f"{path_.stem}.pyannote.{pyannote_cfg}.pkl" 143 | ) 144 | with open(rttm_path, "w") as f: 145 | sd.write_rttm(f) 146 | with open(pkl_path, "wb") as f: 147 | pkl.dump(sd, f) 148 | path_pkls.append(pkl_path) 149 | 150 | out_seg = merge_segments(path_pkls, overlap) 151 | 152 | path_out = Path(audio_path).parent / f"{Path(audio_path).stem}.pyannote.{pyannote_cfg}.json"  # keep the name consistent with check() above 153 | 154 | with open(path_out, "w") as f: 155 | json.dump(out_seg, f, indent=2) 156 | 157 | 158 | def get(audio_path: Path, device: int = 0, pyannote_cfg="dia_ami"): 159 | assert pyannote_cfg in {"dia_ami", "dia", "sad_ami"} 160 | 161 | if not audio_path.exists(): 162 | return 163 | 164 | torch.cuda.set_device(device) 165 | pyannote_pipeline = torch.hub.load( 166 | "pyannote/pyannote-audio", pyannote_cfg, pipeline=True 167 | ) 168 | sd = pyannote_pipeline({"uri": "filename", "audio": audio_path}) 169 | rttm_path = audio_path.parent / f"{audio_path.stem}.pyannote.{pyannote_cfg}.rttm" 170 | pkl_path = audio_path.parent / f"{audio_path.stem}.pyannote.{pyannote_cfg}.pkl" 171 | with open(rttm_path, "w") as f: 172 | sd.write_rttm(f) 173 | with open(pkl_path, "wb") as f: 174 | pkl.dump(sd, f) 175 | 176 | 177 | def get_multiprocess(i, items, pyannote_cfg="dia_ami", max_size_min_input: Optional[int] = None): 178 | if i >= len(items): 179 | return 180 | 181 | if max_size_min_input is not None: 182 | get_segments( 183 | items[i], i, pyannote_cfg=pyannote_cfg, max_size_sec=max_size_min_input * 60 184 | ) 185 | else: 186 | get(items[i], i, pyannote_cfg=pyannote_cfg) 187 | 188 | 189 | def main(args): 190 | languages = [lang if lang != "original" else "" for lang in args.languages] 191 | 192 | root = Path(args.root) 193 | audio_paths = [] 194 | for lang in languages: 195 | audio_paths += get_all_audio_for_lang(root, lang) 196 | 197 | if not args.overwrite: 198 | audio_paths = [x for x in audio_paths if not check(x, args.pyannote_cfg)] 199 | 200 | if args.max_num is not None: 201 | audio_paths = audio_paths[: args.max_num] 202 | n_devices = torch.cuda.device_count() 203 | 204 | if n_devices < 2: 205 | for d in audio_paths: 206 | print(d) 207 | get_multiprocess(0, [d], args.pyannote_cfg, args.segment_min)  # honor --pyannote-cfg and --segment-min on the single-device path as well 208 | else: 209 | batches = list(get_batches(audio_paths, batch_size=n_devices)) 210 | for batch in tqdm(batches): 211 | torch.multiprocessing.spawn( 212 | fn=get_multiprocess, 213 | args=(batch, args.pyannote_cfg, args.segment_min), 214 | nprocs=n_devices, 215 | ) 216 | 217 | 218 | if __name__ == "__main__": 219 | parser = argparse.ArgumentParser( 220 | description="Speaker diarization with pyannote." 221 | " Compute the speaker boundaries for the given audio files" 222 | ) 223 | parser.add_argument( 224 | "--root", type=str, required=True, help="Root directory containing the session directories" 225 | ) 226 | parser.add_argument( 227 | "--max-num", 228 | default=None, 229 | type=int, 230 | help="If given, maximal number of sessions to deal with.", 231 | ) 232 | parser.add_argument( 233 | "--overwrite", 234 | action="store_true", 235 | help="Set to true to overwrite previous results", 236 | ) 237 | parser.add_argument( 238 | "-l", 239 | "--languages", 240 | type=str, 241 | nargs="*", 242 | required=True, 243 | help="Languages to deal with. 
'original' stands for the original audio.", 244 | ) 245 | parser.add_argument( 246 | "--segment-min", 247 | type=int, 248 | default=None, 249 | help="If given, will split the input audio into several " 250 | "overlapping chunks of segment_min minutes and merge the " 251 | "resulting segmentation. A single speaker may then end up " 252 | "with several labels if they speak across several chunks, " 253 | "and the output file will be in JSON format " 254 | "(to avoid confusion with the regular diarization output).", 255 | ) 256 | parser.add_argument( 257 | "--pyannote-cfg", 258 | type=str, 259 | choices=["dia_ami", "dia", "sad_ami"], 260 | help="Pyannote configuration.", 261 | default="dia_ami", 262 | ) 263 | args = parser.parse_args() 264 | 265 | main(args) 266 | -------------------------------------------------------------------------------- /voxpopuli/text/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import re 7 | import string 8 | from typing import Set 9 | 10 | 11 | PUNCTUATIONS_TO_REMOVE = ( 12 | string.punctuation.replace("'", "") 13 | .replace("-", "") 14 | .replace("–", "") 15 | .replace("/", "") 16 | + "«»‟″“”…‘•„‚≤ᵉ" 17 | ) 18 | PUNCTUATIONS_TO_SPACE = "-/–·" 19 | REMOVE_TRANSLATOR = str.maketrans("", "", PUNCTUATIONS_TO_REMOVE) 20 | SPACE_TRANSLATOR = str.maketrans( 21 | PUNCTUATIONS_TO_SPACE, " " * len(PUNCTUATIONS_TO_SPACE) 22 | ) 23 | 24 | SPACE = chr(32) 25 | WHITESPACE_NORMALIZER = re.compile(r"\s+") 26 | 27 | # fmt: off 28 | LANG_TOKENS = { 29 | "cs": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "í", "ó", "ú", "ý", "č", "ď", "ě", "ň", "ř", "š", "ť", "ů", "ž",}, 30 | "de": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ß", "ä", "ö", "ü",}, 31 | "en": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",}, 32 | "es": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "í", "ñ", "ó", "ú", "ü",}, 33 | "et": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ä", "õ", "ö", "ü", "š", "ž",}, 34 | "fi": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ä", "å", "ö",}, 35 | "fr": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "à", "â", "æ", "ç", "è", "é", "ê", "ë", "î", "ï", "ô", "ù", "û", "ü", "œ", "ÿ",}, 36 | "hr": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ć", "č", "đ", "š", "ž",}, 37 | "hu": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "é", "ó", "ö", "ú", "ü", "ő", "ű",}, 38 | "it": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", 
"z", "à", "è", "é", "ì", "í", "ï", "ò", "ó", "ù",}, 39 | "lt": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ą", "č", "ė", "ę", "į", "š", "ū", "ų", "ž",}, 40 | "nl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "à", "ç", "è", "é", "ê", "ë", "í", "ï", "ö", "ü",}, 41 | "pl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ó", "ą", "ć", "ę", "ł", "ń", "ś", "ź", "ż",}, 42 | "ro": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "â", "î", "ă", "ș", "ț",}, 43 | "sk": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "á", "ä", "é", "í", "ó", "ô", "ú", "ý", "č", "ď", "ĺ", "ľ", "ň", "ŕ", "š", "ť", "ž",}, 44 | "sl": {"'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "č", "š", "ž",}, 45 | } 46 | # fmt: on 47 | 48 | 49 | def correct_name_fbcluster_output(name_in: str) -> str: 50 | r"""A quick patch to solve some discreepancies from the output names 51 | in the align / WER pipeliness without having to relaunch everything""" 52 | 53 | split_ = name_in.split("-") 54 | if len(split_) == 3: 55 | return "-".join(split_[:2]) 56 | 57 | return name_in 58 | 59 | 60 | def is_valid_text(text: str, tokens: Set[str]) -> bool: 61 | chars = "".join(text.split()) 62 | return all(x in tokens for x in chars) 63 | -------------------------------------------------------------------------------- /voxpopuli/text/wer_tools.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
5 | ### 6 | # Tools to compute word / letter error rates (WER / LER) and to align decoded transcriptions with their references 7 | ### 8 | 9 | import json 10 | from typing import Iterable, NamedTuple, List, Tuple 11 | from pathlib import Path 12 | 13 | import edlib 14 | import editdistance 15 | 16 | from voxpopuli.text import correct_name_fbcluster_output 17 | 18 | 19 | class CharAlignToken(NamedTuple): 20 | index_decoded: int 21 | action: str 22 | 23 | 24 | class WordAlignFile(NamedTuple): 25 | file_id: str 26 | target: str 27 | decoded: str 28 | wer: float 29 | ler: float 30 | align_path: List[CharAlignToken] 31 | 32 | 33 | def quick_norm(str_in, char_set): 34 | 35 | str_in = str_in.lower().strip() 36 | str_in = " ".join(str_in.split()) 37 | out = "".join([x for x in str_in if x in char_set]) 38 | return out 39 | 40 | 41 | def get_wer(query, decoded): 42 | return get_ler(query.split(), decoded.split()) 43 | 44 | 45 | def get_ler(query, decoded): 46 | d = editdistance.eval(query, decoded) 47 | return 100 * float(d) / (1e-8 + len(query)) 48 | 49 | 50 | def expand_cigar_format(path_cigar: str) -> str: 51 | # expand a run-length CIGAR string, e.g. "3=1X2D" -> "===XDD" 52 | out = "" 53 | size = len(path_cigar) 54 | i_ = 0 55 | while i_ < size: 56 | j = i_ + 1 57 | while path_cigar[j].isdigit(): 58 | j += 1 59 | n = int(path_cigar[i_:j]) 60 | v = path_cigar[j] 61 | out += n * v 62 | i_ = j + 1 63 | 64 | return out 65 | 66 | 67 | def get_align_index_path(query: Iterable, target: Iterable) -> List[CharAlignToken]: 68 | # map every position of the query onto its aligned position in the target 69 | path_ = edlib.align(query, target, task="path")["cigar"] 70 | if path_ is None: 71 | return [] 72 | path_ = expand_cigar_format(path_) 73 | 74 | index_out = 0 75 | index_path = 0 76 | out = [] 77 | for index_query in range(len(query)): 78 | while path_[index_path] == "D":  # "D": characters present only in the target 79 | index_out += 1 80 | index_path += 1 81 | 82 | action = path_[index_path] 83 | 84 | out.append(CharAlignToken(index_out, action)) 85 | if action == "=": 86 | assert query[index_query] == target[index_out] 87 | if action in ["=", "X"]: 88 | index_out += 1 89 | 90 | index_path += 1 91 | 92 | return out 93 | 94 | 95 | def get_partial_transcriptions( 96 | data: WordAlignFile, word_cuts: List[int] 97 | ) -> List[Tuple[str, str]]: 98 | 99 | last_index = 0 100 | last_index_decoded = data.align_path[0].index_decoded 101 | 102 | output = [] 103 | for word_index in word_cuts: 104 | i_decoded = data.align_path[word_index].index_decoded 105 | # Go until the end of the next word 106 | while i_decoded < len(data.decoded) and data.decoded[i_decoded] != " ": 107 | i_decoded += 1 108 | while word_index < len(data.target) and data.target[word_index] != " ": 109 | word_index += 1 110 | out_target = data.target[last_index:word_index] 111 | out_decoded = data.decoded[last_index_decoded:i_decoded] 112 | last_index = word_index 113 | last_index_decoded = i_decoded 114 | output.append((out_target, out_decoded)) 115 | 116 | if last_index < len(data.target): 117 | out_target = data.target[last_index:] 118 | out_decoded = data.decoded[last_index_decoded:] 119 | output.append((out_target, out_decoded)) 120 | 121 | return output 122 | 123 | 124 | def reinsert_punctuation( 125 | str_original: str, str_normed: str, char_set: str, punc_list: str 126 | ) -> str: 127 | 128 | quick_norm_ref = quick_norm(str_original, char_set + punc_list) 129 | align_path = get_align_index_path(quick_norm_ref, str_normed) 130 | punc_indexes = [(i, x) for i, x in enumerate(quick_norm_ref) if x in punc_list] 131 | last_index = 0 132 | 133 | out = "" 134 | 135 | for p_index, punc in punc_indexes: 136 | i_normed = align_path[p_index].index_decoded 137 | if
i_normed <= last_index: 138 | continue 139 | while i_normed < len(str_normed) and str_normed[i_normed] != " ": 140 | i_normed += 1 141 | loc_norm = str_normed[last_index:i_normed] 142 | out += loc_norm + punc + " " 143 | last_index = i_normed 144 | 145 | if last_index < len(str_normed): 146 | out += str_normed[last_index:] 147 | 148 | return out 149 | 150 | 151 | def create_word_align_file(file_id: str, target: str, decoded: str) -> WordAlignFile: 152 | 153 | return WordAlignFile( 154 | file_id=file_id, 155 | target=target, 156 | decoded=decoded, 157 | wer=get_wer(target, decoded), 158 | ler=get_ler(target, decoded), 159 | align_path=get_align_index_path(target, decoded), 160 | ) 161 | 162 | 163 | def load_word_align_file(path_file: Path) -> List[WordAlignFile]: 164 | 165 | with open(path_file, "r") as file: 166 | data = json.load(file) 167 | 168 | out = [] 169 | 170 | for file_data in data: 171 | align_path = get_align_index_path( 172 | file_data["target"], file_data["word_prediction_no_lm"] 173 | ) 174 | if len(align_path) == 0: 175 | continue 176 | out.append( 177 | WordAlignFile( 178 | file_id=correct_name_fbcluster_output(file_data["sample_id"]), 179 | target=file_data["target"], 180 | decoded=file_data["word_prediction_no_lm"], 181 | wer=file_data["wer"], 182 | ler=file_data["ler"], 183 | align_path=align_path, 184 | ) 185 | ) 186 | return out 187 | -------------------------------------------------------------------------------- /voxpopuli/text/word_align_tools.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | 7 | from pathlib import Path 8 | from typing import List, NamedTuple 9 | 10 | from voxpopuli.text import correct_name_fbcluster_output 11 | 12 | 13 | class AlignedWord(NamedTuple): 14 | start: float 15 | end: float 16 | word: str 17 | 18 | 19 | class AlignedData(NamedTuple): 20 | file_id: str 21 | data: List[AlignedWord] 22 | 23 | 24 | def load_audio_align_wav2letter(input_path: Path) -> List[AlignedData]: 25 | 26 | with open(input_path, "r") as file: 27 | data = file.readlines() 28 | 29 | output = [] 30 | 31 | for line in data: 32 | file_id, segments = line.split("\t") 33 | file_id = correct_name_fbcluster_output(file_id) 34 | segments = segments.split("\\n")  # word entries are separated by a literal backslash-n marker, not a newline 35 | samples = [] 36 | for s in segments: 37 | (_, _, start, duration, word) = s.split(" ") 38 | end = float(start) + float(duration) 39 | samples.append(AlignedWord(float(start), end, word.strip())) 40 | output.append(AlignedData(file_id, samples)) 41 | 42 | return output 43 | 44 | 45 | def cut_align_data( 46 | audio_align_data: AlignedData, 47 | index_align: List[int], 48 | sil_symbol: str = "$", 49 | padding_start: float = 0.1, 50 | padding_end: float = 0.2, 51 | ) -> List[AlignedData]: 52 | 53 | base_name = audio_align_data.file_id 54 | out = [] 55 | last_index = 0 56 | last_end = 0 57 | shift = 0 58 | 59 | if len(index_align) == 0: 60 | return [audio_align_data] 61 | 62 | for cut_index in index_align: 63 | 64 | last_end = audio_align_data.data[cut_index].start + padding_end 65 | out_align = [ 66 | AlignedWord(max(0, x.start - shift), max(0, x.end - shift), x.word) 67 | for x in audio_align_data.data[last_index:cut_index] 68 | ] 69 | out.append(AlignedData(f"{base_name}_{len(out)}", out_align)) 70 | last_index = cut_index 71 | shift = max(last_end,
audio_align_data.data[last_index].end - padding_start) 72 | 73 | if last_index < len(audio_align_data.data): 74 | out_align = [ 75 | AlignedWord(max(0, x.start - shift), max(0, x.end - shift), x.word) 76 | for x in audio_align_data.data[index_align[-1] :] 77 | ] 78 | out.append(AlignedData(f"{base_name}_{len(out)}", out_align)) 79 | 80 | return out 81 | -------------------------------------------------------------------------------- /voxpopuli/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | from typing import Callable, Optional 7 | 8 | from tqdm.contrib.concurrent import process_map 9 | 10 | 11 | def multiprocess_run( 12 | a_list: list, func: Callable, n_workers: Optional[int] = None 13 | ) -> None: 14 | process_map(func, a_list, max_workers=n_workers, chunksize=1) 15 | --------------------------------------------------------------------------------
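# Usage sketch for multiprocess_run above (a hypothetical example; the mapped function must be defined at module top level so the worker processes can pickle it):
#
#     from voxpopuli.utils import multiprocess_run
#
#     def square(x):
#         return x * x
#
#     multiprocess_run(list(range(100)), square, n_workers=4)  # shows a tqdm progress bar; return values are discarded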