├── .gitignore
├── ACKNOWLEDGEMENTS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING-RESEARCH.md
├── LICENSE
├── README.md
├── create_datasets.sh
├── fisher
│   ├── README.md
│   ├── combine_eval_splits.py
│   ├── extract-utterance-audios.py
│   ├── extract_cs_words_from_raw_data.py
│   ├── make_cs_splits.py
│   ├── make_mapping_files.py
│   ├── prepare-sets.sh
│   ├── setup_all.sh
│   ├── split_train_and_make_lid.py
│   └── splits_data
│       ├── README.md
│       ├── dev
│       │   └── README.md
│       ├── dev2
│       │   └── README.md
│       ├── test
│       │   └── README.md
│       └── train
│           └── README.md
├── mapping_files
│   ├── README.md
│   ├── fisher_mapping.csv
│   └── miami_mapping.csv
├── miami
│   ├── common_words
│   │   ├── eng.txt
│   │   └── spa.txt
│   ├── create_test_sets.py
│   ├── download_miami_data.sh
│   ├── process_miami_data.py
│   ├── readme.md
│   └── setup_all.sh
└── requirements.txt
/.gitignore: -------------------------------------------------------------------------------- 1 | env/* 2 | **/data/* 3 | **/output/* 4 | **/.DS_Store 5 | **/speech/* -------------------------------------------------------------------------------- /ACKNOWLEDGEMENTS: -------------------------------------------------------------------------------- 1 | Acknowledgements 2 | Portions of this CODE-SWITCHED-SPEECH-TRANSLATION software may utilize the following copyrighted 3 | material, the use of which is hereby acknowledged. 4 | 5 | _____________________ 6 | 7 | Jackson L. Lee (pylangacq) 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in 17 | all copies or substantial portions of the Software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 25 | THE SOFTWARE. 26 | 27 | _____________________ 28 | Ingy döt Net, Kirill Simonov (PyYAML) 29 | 30 | Permission is hereby granted, free of charge, to any person obtaining a copy of 31 | this software and associated documentation files (the "Software"), to deal in 32 | the Software without restriction, including without limitation the rights to 33 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 34 | of the Software, and to permit persons to whom the Software is furnished to do 35 | so, subject to the following conditions: 36 | 37 | The above copyright notice and this permission notice shall be included in all 38 | copies or substantial portions of the Software. 39 | 40 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 41 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 42 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE 43 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 44 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 45 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 46 | SOFTWARE. 47 | 48 | _____________________ 49 | Wes McKinney (pandas) 50 | 51 | 1. Definitions. 52 | 53 | "License" shall mean the terms and conditions for use, reproduction, 54 | and distribution as defined by Sections 1 through 9 of this document. 55 | 56 | "Licensor" shall mean the copyright owner or entity authorized by 57 | the copyright owner that is granting the License. 58 | 59 | "Legal Entity" shall mean the union of the acting entity and all 60 | other entities that control, are controlled by, or are under common 61 | control with that entity. For the purposes of this definition, 62 | "control" means (i) the power, direct or indirect, to cause the 63 | direction or management of such entity, whether by contract or 64 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 65 | outstanding shares, or (iii) beneficial ownership of such entity. 66 | 67 | "You" (or "Your") shall mean an individual or Legal Entity 68 | exercising permissions granted by this License. 69 | 70 | "Source" form shall mean the preferred form for making modifications, 71 | including but not limited to software source code, documentation 72 | source, and configuration files. 73 | 74 | "Object" form shall mean any form resulting from mechanical 75 | transformation or translation of a Source form, including but 76 | not limited to compiled object code, generated documentation, 77 | and conversions to other media types. 78 | 79 | "Work" shall mean the work of authorship, whether in Source or 80 | Object form, made available under the License, as indicated by a 81 | copyright notice that is included in or attached to the work 82 | (an example is provided in the Appendix below). 83 | 84 | "Derivative Works" shall mean any work, whether in Source or Object 85 | form, that is based on (or derived from) the Work and for which the 86 | editorial revisions, annotations, elaborations, or other modifications 87 | represent, as a whole, an original work of authorship. For the purposes 88 | of this License, Derivative Works shall not include works that remain 89 | separable from, or merely link (or bind by name) to the interfaces of, 90 | the Work and Derivative Works thereof. 91 | 92 | "Contribution" shall mean any work of authorship, including 93 | the original version of the Work and any modifications or additions 94 | to that Work or Derivative Works thereof, that is intentionally 95 | submitted to Licensor for inclusion in the Work by the copyright owner 96 | or by an individual or Legal Entity authorized to submit on behalf of 97 | the copyright owner. For the purposes of this definition, "submitted" 98 | means any form of electronic, verbal, or written communication sent 99 | to the Licensor or its representatives, including but not limited to 100 | communication on electronic mailing lists, source code control systems, 101 | and issue tracking systems that are managed by, or on behalf of, the 102 | Licensor for the purpose of discussing and improving the Work, but 103 | excluding communication that is conspicuously marked or otherwise 104 | designated in writing by the copyright owner as "Not a Contribution." 
105 | 106 | "Contributor" shall mean Licensor and any individual or Legal Entity 107 | on behalf of whom a Contribution has been received by Licensor and 108 | subsequently incorporated within the Work. 109 | 110 | 2. Grant of Copyright License. Subject to the terms and conditions of 111 | this License, each Contributor hereby grants to You a perpetual, 112 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 113 | copyright license to reproduce, prepare Derivative Works of, 114 | publicly display, publicly perform, sublicense, and distribute the 115 | Work and such Derivative Works in Source or Object form. 116 | 117 | 3. Grant of Patent License. Subject to the terms and conditions of 118 | this License, each Contributor hereby grants to You a perpetual, 119 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 120 | (except as stated in this section) patent license to make, have made, 121 | use, offer to sell, sell, import, and otherwise transfer the Work, 122 | where such license applies only to those patent claims licensable 123 | by such Contributor that are necessarily infringed by their 124 | Contribution(s) alone or by combination of their Contribution(s) 125 | with the Work to which such Contribution(s) was submitted. If You 126 | institute patent litigation against any entity (including a 127 | cross-claim or counterclaim in a lawsuit) alleging that the Work 128 | or a Contribution incorporated within the Work constitutes direct 129 | or contributory patent infringement, then any patent licenses 130 | granted to You under this License for that Work shall terminate 131 | as of the date such litigation is filed. 132 | 133 | 4. Redistribution. You may reproduce and distribute copies of the 134 | Work or Derivative Works thereof in any medium, with or without 135 | modifications, and in Source or Object form, provided that You 136 | meet the following conditions: 137 | 138 | (a) You must give any other recipients of the Work or 139 | Derivative Works a copy of this License; and 140 | 141 | (b) You must cause any modified files to carry prominent notices 142 | stating that You changed the files; and 143 | 144 | (c) You must retain, in the Source form of any Derivative Works 145 | that You distribute, all copyright, patent, trademark, and 146 | attribution notices from the Source form of the Work, 147 | excluding those notices that do not pertain to any part of 148 | the Derivative Works; and 149 | 150 | (d) If the Work includes a "NOTICE" text file as part of its 151 | distribution, then any Derivative Works that You distribute must 152 | include a readable copy of the attribution notices contained 153 | within such NOTICE file, excluding those notices that do not 154 | pertain to any part of the Derivative Works, in at least one 155 | of the following places: within a NOTICE text file distributed 156 | as part of the Derivative Works; within the Source form or 157 | documentation, if provided along with the Derivative Works; or, 158 | within a display generated by the Derivative Works, if and 159 | wherever such third-party notices normally appear. The contents 160 | of the NOTICE file are for informational purposes only and 161 | do not modify the License. You may add Your own attribution 162 | notices within Derivative Works that You distribute, alongside 163 | or as an addendum to the NOTICE text from the Work, provided 164 | that such additional attribution notices cannot be construed 165 | as modifying the License. 
166 | 167 | You may add Your own copyright statement to Your modifications and 168 | may provide additional or different license terms and conditions 169 | for use, reproduction, or distribution of Your modifications, or 170 | for any such Derivative Works as a whole, provided Your use, 171 | reproduction, and distribution of the Work otherwise complies with 172 | the conditions stated in this License. 173 | 174 | 5. Submission of Contributions. Unless You explicitly state otherwise, 175 | any Contribution intentionally submitted for inclusion in the Work 176 | by You to the Licensor shall be under the terms and conditions of 177 | this License, without any additional terms or conditions. 178 | Notwithstanding the above, nothing herein shall supersede or modify 179 | the terms of any separate license agreement you may have executed 180 | with Licensor regarding such Contributions. 181 | 182 | 6. Trademarks. This License does not grant permission to use the trade 183 | names, trademarks, service marks, or product names of the Licensor, 184 | except as required for reasonable and customary use in describing the 185 | origin of the Work and reproducing the content of the NOTICE file. 186 | 187 | 7. Disclaimer of Warranty. Unless required by applicable law or 188 | agreed to in writing, Licensor provides the Work (and each 189 | Contributor provides its Contributions) on an "AS IS" BASIS, 190 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 191 | implied, including, without limitation, any warranties or conditions 192 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 193 | PARTICULAR PURPOSE. You are solely responsible for determining the 194 | appropriateness of using or redistributing the Work and assume any 195 | risks associated with Your exercise of permissions under this License. 196 | 197 | 8. Limitation of Liability. In no event and under no legal theory, 198 | whether in tort (including negligence), contract, or otherwise, 199 | unless required by applicable law (such as deliberate and grossly 200 | negligent acts) or agreed to in writing, shall any Contributor be 201 | liable to You for damages, including any direct, indirect, special, 202 | incidental, or consequential damages of any character arising as a 203 | result of this License or out of the use or inability to use the 204 | Work (including but not limited to damages for loss of goodwill, 205 | work stoppage, computer failure or malfunction, or any and all 206 | other commercial damages or losses), even if such Contributor 207 | has been advised of the possibility of such damages. 208 | 209 | 9. Accepting Warranty or Additional Liability. While redistributing 210 | the Work or Derivative Works thereof, You may choose to offer, 211 | and charge a fee for, acceptance of support, warranty, indemnity, 212 | or other liability obligations and/or rights consistent with this 213 | License. However, in accepting such obligations, You may act only 214 | on Your own behalf and on Your sole responsibility, not on behalf 215 | of any other Contributor, and only if You agree to indemnify, 216 | defend, and hold each Contributor harmless for any liability 217 | incurred by, or claims asserted against, such Contributor by reason 218 | of your accepting any such warranty or additional liability. 
219 | 220 | _____________________ 221 | NumPy Developers (numpy) 222 | 223 | Redistribution and use in source and binary forms, with or without 224 | modification, are permitted provided that the following conditions are 225 | met: 226 | 227 | * Redistributions of source code must retain the above copyright 228 | notice, this list of conditions and the following disclaimer. 229 | 230 | * Redistributions in binary form must reproduce the above 231 | copyright notice, this list of conditions and the following 232 | disclaimer in the documentation and/or other materials provided 233 | with the distribution. 234 | 235 | * Neither the name of the NumPy Developers nor the names of any 236 | contributors may be used to endorse or promote products derived 237 | from this software without specific prior written permission. 238 | 239 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 240 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 241 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 242 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 243 | OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 244 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 245 | LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 246 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 247 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 248 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 249 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 250 | 251 | _____________________ 252 | tqdm developers (tqdm) 253 | `tqdm` is a product of collaborative work. 254 | Unless otherwise stated, all authors (see commit logs) retain copyright 255 | for their respective work, and release the work under the MIT licence 256 | (text below). 257 | 258 | Exceptions or notable authors are listed below 259 | in reverse chronological order: 260 | 261 | * files: * 262 | MPLv2.0 2015-2021 (c) Casper da Costa-Luis 263 | [casperdcl](https://github.com/casperdcl). 264 | * files: tqdm/_tqdm.py 265 | MIT 2016 (c) [PR #96] on behalf of Google Inc. 266 | * files: tqdm/_tqdm.py setup.py README.rst MANIFEST.in .gitignore 267 | MIT 2013 (c) Noam Yorav-Raphael, original author. 268 | 269 | [PR #96]: https://github.com/tqdm/tqdm/pull/96 270 | 271 | 272 | Mozilla Public Licence (MPL) v. 2.0 - Exhibit A 273 | ----------------------------------------------- 274 | 275 | This Source Code Form is subject to the terms of the 276 | Mozilla Public License, v. 2.0. 277 | If a copy of the MPL was not distributed with this project, 278 | You can obtain one at https://mozilla.org/MPL/2.0/. 279 | 280 | 281 | MIT License (MIT) 282 | ----------------- 283 | 284 | Copyright (c) 2013 noamraph 285 | 286 | Permission is hereby granted, free of charge, to any person obtaining a copy of 287 | this software and associated documentation files (the "Software"), to deal in 288 | the Software without restriction, including without limitation the rights to 289 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 290 | the Software, and to permit persons to whom the Software is furnished to do so, 291 | subject to the following conditions: 292 | 293 | The above copyright notice and this permission notice shall be included in all 294 | copies or substantial portions of the Software. 
295 | 296 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 297 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 298 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 299 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 300 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 301 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 302 | 303 | _____________________ 304 | NLTK Team (NLTK) 305 | 1. Definitions. 306 | 307 | "License" shall mean the terms and conditions for use, reproduction, 308 | and distribution as defined by Sections 1 through 9 of this document. 309 | 310 | "Licensor" shall mean the copyright owner or entity authorized by 311 | the copyright owner that is granting the License. 312 | 313 | "Legal Entity" shall mean the union of the acting entity and all 314 | other entities that control, are controlled by, or are under common 315 | control with that entity. For the purposes of this definition, 316 | "control" means (i) the power, direct or indirect, to cause the 317 | direction or management of such entity, whether by contract or 318 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 319 | outstanding shares, or (iii) beneficial ownership of such entity. 320 | 321 | "You" (or "Your") shall mean an individual or Legal Entity 322 | exercising permissions granted by this License. 323 | 324 | "Source" form shall mean the preferred form for making modifications, 325 | including but not limited to software source code, documentation 326 | source, and configuration files. 327 | 328 | "Object" form shall mean any form resulting from mechanical 329 | transformation or translation of a Source form, including but 330 | not limited to compiled object code, generated documentation, 331 | and conversions to other media types. 332 | 333 | "Work" shall mean the work of authorship, whether in Source or 334 | Object form, made available under the License, as indicated by a 335 | copyright notice that is included in or attached to the work 336 | (an example is provided in the Appendix below). 337 | 338 | "Derivative Works" shall mean any work, whether in Source or Object 339 | form, that is based on (or derived from) the Work and for which the 340 | editorial revisions, annotations, elaborations, or other modifications 341 | represent, as a whole, an original work of authorship. For the purposes 342 | of this License, Derivative Works shall not include works that remain 343 | separable from, or merely link (or bind by name) to the interfaces of, 344 | the Work and Derivative Works thereof. 345 | 346 | "Contribution" shall mean any work of authorship, including 347 | the original version of the Work and any modifications or additions 348 | to that Work or Derivative Works thereof, that is intentionally 349 | submitted to Licensor for inclusion in the Work by the copyright owner 350 | or by an individual or Legal Entity authorized to submit on behalf of 351 | the copyright owner. 
For the purposes of this definition, "submitted" 352 | means any form of electronic, verbal, or written communication sent 353 | to the Licensor or its representatives, including but not limited to 354 | communication on electronic mailing lists, source code control systems, 355 | and issue tracking systems that are managed by, or on behalf of, the 356 | Licensor for the purpose of discussing and improving the Work, but 357 | excluding communication that is conspicuously marked or otherwise 358 | designated in writing by the copyright owner as "Not a Contribution." 359 | 360 | "Contributor" shall mean Licensor and any individual or Legal Entity 361 | on behalf of whom a Contribution has been received by Licensor and 362 | subsequently incorporated within the Work. 363 | 364 | 2. Grant of Copyright License. Subject to the terms and conditions of 365 | this License, each Contributor hereby grants to You a perpetual, 366 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 367 | copyright license to reproduce, prepare Derivative Works of, 368 | publicly display, publicly perform, sublicense, and distribute the 369 | Work and such Derivative Works in Source or Object form. 370 | 371 | 3. Grant of Patent License. Subject to the terms and conditions of 372 | this License, each Contributor hereby grants to You a perpetual, 373 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 374 | (except as stated in this section) patent license to make, have made, 375 | use, offer to sell, sell, import, and otherwise transfer the Work, 376 | where such license applies only to those patent claims licensable 377 | by such Contributor that are necessarily infringed by their 378 | Contribution(s) alone or by combination of their Contribution(s) 379 | with the Work to which such Contribution(s) was submitted. If You 380 | institute patent litigation against any entity (including a 381 | cross-claim or counterclaim in a lawsuit) alleging that the Work 382 | or a Contribution incorporated within the Work constitutes direct 383 | or contributory patent infringement, then any patent licenses 384 | granted to You under this License for that Work shall terminate 385 | as of the date such litigation is filed. 386 | 387 | 4. Redistribution. 
You may reproduce and distribute copies of the 388 | Work or Derivative Works thereof in any medium, with or without 389 | modifications, and in Source or Object form, provided that You 390 | meet the following conditions: 391 | 392 | (a) You must give any other recipients of the Work or 393 | Derivative Works a copy of this License; and 394 | 395 | (b) You must cause any modified files to carry prominent notices 396 | stating that You changed the files; and 397 | 398 | (c) You must retain, in the Source form of any Derivative Works 399 | that You distribute, all copyright, patent, trademark, and 400 | attribution notices from the Source form of the Work, 401 | excluding those notices that do not pertain to any part of 402 | the Derivative Works; and 403 | 404 | (d) If the Work includes a "NOTICE" text file as part of its 405 | distribution, then any Derivative Works that You distribute must 406 | include a readable copy of the attribution notices contained 407 | within such NOTICE file, excluding those notices that do not 408 | pertain to any part of the Derivative Works, in at least one 409 | of the following places: within a NOTICE text file distributed 410 | as part of the Derivative Works; within the Source form or 411 | documentation, if provided along with the Derivative Works; or, 412 | within a display generated by the Derivative Works, if and 413 | wherever such third-party notices normally appear. The contents 414 | of the NOTICE file are for informational purposes only and 415 | do not modify the License. You may add Your own attribution 416 | notices within Derivative Works that You distribute, alongside 417 | or as an addendum to the NOTICE text from the Work, provided 418 | that such additional attribution notices cannot be construed 419 | as modifying the License. 420 | 421 | You may add Your own copyright statement to Your modifications and 422 | may provide additional or different license terms and conditions 423 | for use, reproduction, or distribution of Your modifications, or 424 | for any such Derivative Works as a whole, provided Your use, 425 | reproduction, and distribution of the Work otherwise complies with 426 | the conditions stated in this License. 427 | 428 | 5. Submission of Contributions. Unless You explicitly state otherwise, 429 | any Contribution intentionally submitted for inclusion in the Work 430 | by You to the Licensor shall be under the terms and conditions of 431 | this License, without any additional terms or conditions. 432 | Notwithstanding the above, nothing herein shall supersede or modify 433 | the terms of any separate license agreement you may have executed 434 | with Licensor regarding such Contributions. 435 | 436 | 6. Trademarks. This License does not grant permission to use the trade 437 | names, trademarks, service marks, or product names of the Licensor, 438 | except as required for reasonable and customary use in describing the 439 | origin of the Work and reproducing the content of the NOTICE file. 440 | 441 | 7. Disclaimer of Warranty. Unless required by applicable law or 442 | agreed to in writing, Licensor provides the Work (and each 443 | Contributor provides its Contributions) on an "AS IS" BASIS, 444 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 445 | implied, including, without limitation, any warranties or conditions 446 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 447 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 448 | appropriateness of using or redistributing the Work and assume any 449 | risks associated with Your exercise of permissions under this License. 450 | 451 | 8. Limitation of Liability. In no event and under no legal theory, 452 | whether in tort (including negligence), contract, or otherwise, 453 | unless required by applicable law (such as deliberate and grossly 454 | negligent acts) or agreed to in writing, shall any Contributor be 455 | liable to You for damages, including any direct, indirect, special, 456 | incidental, or consequential damages of any character arising as a 457 | result of this License or out of the use or inability to use the 458 | Work (including but not limited to damages for loss of goodwill, 459 | work stoppage, computer failure or malfunction, or any and all 460 | other commercial damages or losses), even if such Contributor 461 | has been advised of the possibility of such damages. 462 | 463 | 9. Accepting Warranty or Additional Liability. While redistributing 464 | the Work or Derivative Works thereof, You may choose to offer, 465 | and charge a fee for, acceptance of support, warranty, indemnity, 466 | or other liability obligations and/or rights consistent with this 467 | License. However, in accepting such obligations, You may act only 468 | on Your own behalf and on Your sole responsibility, not on behalf 469 | of any other Contributor, and only if You agree to indemnify, 470 | defend, and hold each Contributor harmless for any liability 471 | incurred by, or claims asserted against, such Contributor by reason 472 | of your accepting any such warranty or additional liability. 473 | 474 | _____________________ 475 | 476 | Leonard Richardson (beautifulsoup4) 477 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 478 | 479 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 480 | 481 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 482 | _____________________ 483 | 484 | Brian McFee, librosa development team (librosa) 485 | Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. 486 | 487 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 488 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | opensource-conduct@group.apple.com. 65 | All complaints will be reviewed and investigated promptly and fairly. 
66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /CONTRIBUTING-RESEARCH.md: -------------------------------------------------------------------------------- 1 | # Contribution Guide 2 | 3 | Thanks for your interest in contributing. 
This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository. 4 | 5 | ## Before you get started 6 | 7 | We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md). 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (C) 2022 Apple Inc. All Rights Reserved. 2 | 3 | IMPORTANT: This Apple software is supplied to you by Apple 4 | Inc. ("Apple") in consideration of your agreement to the following 5 | terms, and your use, installation, modification or redistribution of 6 | this Apple software constitutes acceptance of these terms. If you do 7 | not agree with these terms, please do not use, install, modify or 8 | redistribute this Apple software. 9 | 10 | In consideration of your agreement to abide by the following terms, and 11 | subject to these terms, Apple grants you a personal, non-exclusive 12 | license, under Apple's copyrights in this original Apple software (the 13 | "Apple Software"), to use, reproduce, modify and redistribute the Apple 14 | Software, with or without modifications, in source and/or binary forms; 15 | provided that if you redistribute the Apple Software in its entirety and 16 | without modifications, you must retain this notice and the following 17 | text and disclaimers in all such redistributions of the Apple Software. 18 | Neither the name, trademarks, service marks or logos of Apple Inc. may 19 | be used to endorse or promote products derived from the Apple Software 20 | without specific prior written permission from Apple. Except as 21 | expressly stated in this notice, no other rights or licenses, express or 22 | implied, are granted by Apple herein, including but not limited to any 23 | patent rights that may be infringed by your derivative works or by other 24 | works in which the Apple Software may be incorporated. 25 | 26 | The Apple Software is provided by Apple on an "AS IS" basis. APPLE 27 | MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION 28 | THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS 29 | FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND 30 | OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS. 31 | 32 | IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL 33 | OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 34 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 35 | INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, 36 | MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED 37 | AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), 38 | STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE 39 | POSSIBILITY OF SUCH DAMAGE. 40 | 41 | ------------------------------------------------------------------------------- 42 | SOFTWARE DISTRIBUTED WITH CODE-SWITCHED-SPEECH-TRANSLATION: 43 | 44 | The CODE-SWITCHED-SPEECH-TRANSLATION software includes a number of subcomponents with separate 45 | copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
46 | ------------------------------------------------------------------------------- 47 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This repository contains the code and instructions needed to reproduce the dataset splits for ["End-to-End Speech Translation for Code Switched Speech"](LINK_TODO). 3 | 4 | You can create both datasets with the `bash create_datasets.sh` command, following the instructions in the [Instructions Section](#instructions). The `fisher` and `miami` directories contain the per-dataset scripts that `bash create_datasets.sh` uses. 5 | 6 | A mapping between the original data and the new code-switched and monolingual splits used in the paper can be found in `mapping_files`. Note that running `bash create_datasets.sh` will create these mappings. 7 | 8 | ## Instructions 9 | 0. Install the prerequisite libraries for Linux/macOS. These include `ffmpeg`, `sox`, `wget`, and `python` (e.g. `apt-get install sox`). 10 | 1. Run `pip install -r requirements.txt` to set up the Python environment. 11 | 2. Collect the data needed for the Fisher corpus ([LDC2010T04](https://catalog.ldc.upenn.edu/LDC2010T04) and [LDC2010S01](https://catalog.ldc.upenn.edu/LDC2010S01)) and export their paths: `export LDC2010S01={path_to_LDC2010S01}` and `export LDC2010T04={path_to_LDC2010T04}/fisher_spa_tr`. 12 | 3. Run `bash create_datasets.sh` to generate both the Miami and Fisher datasets. 13 | 14 | 15 | ## Example 16 | 17 | Example utterance: 18 | - (Audio clip) 19 | - Transcript (code-switched): *y ti bueno tiene dos papás **which can be a little can be a little challenging**.* 20 | - Translation (English only): *and she has two fathers which can be a little, can be a little challenging.* 21 | 22 | The data files are composed of three parts (a loading sketch follows the list): 23 | 1. The transcript for the dataset split (in `{dataset_name}.transcript`) 24 | 2. The translation for the dataset split (in `{dataset_name}.translation`) 25 | 3. The audio for the dataset split (in `{dataset_name}.yaml` and `{dataset_name}/clips/*.wav` or `{dataset_name}/clips.zip`) 26 |
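Once a split has been generated, the three files are parallel: line *i* of each describes the same utterance. Below is a minimal loading sketch (not part of the release scripts); the `output/fisher/eval/cs` path is just one example of a generated split directory, with `{dataset_name}` = `fisher`:

```python
import yaml  # PyYAML, installed via requirements.txt

split_dir = "output/fisher/eval/cs"  # any generated split directory

with open(f"{split_dir}/fisher.yaml") as f:
    clips = yaml.safe_load(f)  # one {"wav": ...} entry per utterance
with open(f"{split_dir}/fisher.transcript") as f:
    transcripts = [line.strip() for line in f]
with open(f"{split_dir}/fisher.translation") as f:
    translations = [line.strip() for line in f]

# the three files line up row-by-row
assert len(clips) == len(transcripts) == len(translations)
for clip, source, target in zip(clips, transcripts, translations):
    print(clip["wav"], "|", source, "->", target)
```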
27 | ## Citation 28 | If you found this repository helpful in your research, please consider citing 29 | ``` 30 | Orion Weller, Matthias Sperber, Telmo Pessoa Pires, Hendra Setiawan, Christian Gollan, Dominic Telaar, Matthias Paulik: End-to-End Speech Translation for Code Switched Speech (Findings of the Association for Computational Linguistics: ACL 2022) 31 | ``` 32 | -------------------------------------------------------------------------------- /create_datasets.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | cd miami 9 | bash setup_all.sh 10 | cd ../fisher 11 | bash setup_all.sh 12 | cd ../ 13 | -------------------------------------------------------------------------------- /fisher/README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This directory contains all the scripts needed to download the Fisher translation data and preprocess the corpus for speech translation (the LDC audio must be obtained separately; see the root README). 3 | 4 | ## 1-Step Setup 5 | 0. Run `setup_all.sh` to download the data and process it. For granular instructions, see the `Multi-Step Setup` section below. 6 | 7 | ## Multi-Step Setup 8 | 0. See the comments in the `setup_all.sh` file for step-by-step instructions. 9 | 10 | 11 | ## Paper Reference 12 | The Fisher corpus is distributed in these LDC releases ([here](https://catalog.ldc.upenn.edu/LDC2010T04) and [here](https://catalog.ldc.upenn.edu/LDC2010S01)) and was published as part of [this paper](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher-corpus.pdf) -------------------------------------------------------------------------------- /fisher/combine_eval_splits.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # The Fisher data is already split into dev/dev2/test; 7 | # this script combines these three into one `eval` test set 8 | import os 9 | import yaml 10 | import shutil 11 | from distutils.dir_util import copy_tree 12 | 13 | DATASET_NAMES = ["cs", "mono"] 14 | SPLITS = ["dev", "dev2", "test"] 15 | 16 | # combine the evaluation sets 17 | eval_name = "eval" 18 | base_output_path = f"output/fisher/{eval_name}" 19 | 20 | all_eval = {"cs": None, "mono": None} 21 | for name in DATASET_NAMES: 22 | data_for_type = [[], [], []] 23 | for split in SPLITS: 24 | print(f"Loading the data for {name}, {split}...") 25 | base_path = f"output/fisher/{split}/{name}" 26 | transcript = [] 27 | translation = [] 28 | with open(f"{base_path}/fisher.yaml", "r") as fin: 29 | yaml_data = yaml.safe_load(fin) 30 | with open(f"{base_path}/fisher.transcript", "r") as fin: 31 | for line in fin: 32 | transcript.append(line.strip()) 33 | with open(f"{base_path}/fisher.translation", "r") as fin: 34 | for line in fin: 35 | translation.append(line.strip()) 36 | assert len(transcript) == len(yaml_data) == len(translation) 37 | print(f"Length of the original data is {len(transcript)}") 38 | 39 | data_for_type[0].extend(yaml_data) 40 | data_for_type[1].extend(transcript) 41 | data_for_type[2].extend(translation) 42 | 43 | all_eval[name] = data_for_type 44 | 45 | 46 | print("Writing the combined data out...") 47 | for (name, datasets) in zip(DATASET_NAMES, [all_eval["cs"], all_eval["mono"]]): 48 | print(f"Length of the data {name} is {len(datasets[0])}") 49 | 50 | if not os.path.isdir(os.path.join(base_output_path, name, "clips")): 51 | os.makedirs(os.path.join(base_output_path, name, "clips")) 52 | 53 | with open(os.path.join(base_output_path, name, "fisher.yaml"), "w") as fout: 54 | fout.write(yaml.dump(datasets[0])) 55 | with open(os.path.join(base_output_path, name, "fisher.transcript"), "w") as fout: 56 | for line in datasets[1]: 57 | assert "\n" not in line, line 58 | fout.write(line) 59 | fout.write("\n") 60 | with open(os.path.join(base_output_path, name, "fisher.translation"), "w") as fout: 61 | for line in datasets[2]: 62 | assert "\n" not in line, line 63 | fout.write(line) 64 | fout.write("\n") 65 | 66 | print("Moving clip data...") 67 | for eval_split in SPLITS: 68 | copy_tree( 69 | os.path.join(base_output_path.replace("eval", eval_split), name, "clips"), 70 | os.path.join(base_output_path, name, "clips"), 71 | ) 72 | 73 | # make it a zip file 74 | shutil.make_archive( 75 | os.path.join(base_output_path, name, "clips"), 76 | "zip", 77 | os.path.join(base_output_path, name, "clips"), 78 | ) 79 | -------------------------------------------------------------------------------- /fisher/extract-utterance-audios.py: -------------------------------------------------------------------------------- 1 | #!
/usr/bin/env python3 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | # this file extracts the utterance from the larger audio file given the mapping 9 | 10 | import sys 11 | import os 12 | import subprocess 13 | 14 | if len(sys.argv) != 3: 15 | print("Usage: %s <utterance-map-file> <source-audio-dir>" % sys.argv[0]) 16 | sys.exit(1) 17 | srcAudioDir=sys.argv[2] 18 | 19 | utterance = None 20 | mapping = {}  # (utterance id, line number) -> raw token line, read from stdin 21 | for line in sys.stdin: 22 | if line.startswith('##'): 23 | utterance = line.strip().split(' ')[2] 24 | lineno = 1 25 | else: 26 | mapping[(utterance,repr(lineno))] = line.strip() 27 | lineno += 1 28 | 29 | for lineno, line in enumerate(open(sys.argv[1])): 30 | utterances, ids = line.split() 31 | output = " ".join(mapping[(utterances,x)] for x in ids.split('_')) 32 | uttList=[mapping[(utterances,x)] for x in ids.split('_')] 33 | firstToks=uttList[0].split('+')  # token fields: audio-file+channel+start+end+speaker 34 | firstToks[4] = firstToks[4].replace(' ', '~') 35 | uttStart=float(firstToks[2]) 36 | uttDur=float(uttList[-1].split('+')[3])-uttStart  # last token's end time minus first token's start 37 | audioName="%s-utt%06d" % (os.path.basename(sys.argv[1]), lineno+1) 38 | uttID="%s-%s-c%s-%s" % (audioName, firstToks[0], firstToks[1], firstToks[4]) 39 | spkID="%s-c%s-%s" % (firstToks[0], firstToks[1], firstToks[4]) 40 | wavFilename=os.path.join(os.path.basename(sys.argv[1]), os.path.join(firstToks[0][:-4], audioName)) 41 | print(uttID, wavFilename, spkID, lineno+1, output, uttStart, uttDur) # used in the `prepare-sets.sh` bash script 42 | directory = os.path.dirname(wavFilename) 43 | try: 44 | os.stat(directory) 45 | except OSError: 46 | os.makedirs(directory) 47 | cmd="/usr/bin/sox %s -c 1 --encoding signed-integer %s.wav remix %d trim %f %f rate 16000" % (os.path.join(srcAudioDir, firstToks[0]), wavFilename, int(firstToks[1])+1, uttStart, uttDur) 48 | print(uttID, repr(subprocess.check_output(cmd.split(" "))), file=sys.stderr) 49 | -------------------------------------------------------------------------------- /fisher/extract_cs_words_from_raw_data.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved.
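#
# Context, inferred from the code below rather than stated in the original: the
# Spanish `.es` transcripts mark code-switched English spans with tags of the
# form <foreign lang="English"> ... </foreign>, and fix_small_errors() repairs
# small typos in that markup before the files are parsed with BeautifulSoup.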
4 | # 5 | 6 | # this file takes the raw Fisher data with the code-switched annotations and processes it 7 | import glob 8 | import os 9 | from bs4 import BeautifulSoup 10 | import numpy as np 11 | 12 | 13 | def rawcount(filename): 14 | """Count lines by scanning raw bytes (fast for large files).""" 15 | lines = 0 16 | buf_size = 1024 * 1024 17 | with open(filename, "rb") as f: 18 | read_f = f.raw.read 19 | buf = read_f(buf_size) 20 | while buf: 21 | lines += buf.count(b"\n") 22 | buf = read_f(buf_size) 23 | 24 | return lines 25 | 26 | 27 | def fix_small_errors(line) -> str: 28 | """The data has some small errors to fix""" 29 | if 'lang+"English"' in line: 30 | line = line.replace('lang+"English"', 'lang="English"') 31 | if 'lan="English"' in line: 32 | line = line.replace('lan="English"', 'lang="English"') 33 | if " /foreign>" in line: 34 | line = line.replace(" /foreign>", "") 35 | if ' meeting ' in line: 36 | line = line.replace( 37 | ' meeting ', 38 | ' meeting ', 39 | ) 40 | 41 | return line 42 | 43 | 44 | # go through all Spanish files; the English files don't have any markup since they 45 | # were generated from AMT and kept the raw text 46 | file_info = {} 47 | for file_path in glob.glob("fisher-callhome-corpus-tags/corpus/ldc/fisher_*.es"): 48 | file_name = file_path.split("/")[-1] 49 | cs_info = [] 50 | file_info[file_name] = { 51 | "line_count": rawcount(file_path), 52 | } 53 | 54 | tokens_per_line = [] 55 | with open(file_path, "r") as fin: 56 | for line_idx, line in enumerate(fin): 57 | line = line.strip() # remove newline 58 | if ( 59 | " ${C}/ids \ 28 | 2>${C}.prepare-audio.log 29 | } 30 | 31 | for SET in fisher_{train,dev,dev2,test}; do 32 | process_audio ${SET} 33 | done 34 | 35 | # make YAML audio mapping 36 | for convname in fisher_{train,dev,dev2,test}/*fsp; do 37 | for filename in $convname/*.wav; do 38 | echo "- { wav: $filename }" >> $(dirname $convname).yaml 39 | done 40 | done 41 | -------------------------------------------------------------------------------- /fisher/setup_all.sh: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved.
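#
# Pipeline overview (summarizing the commented steps below):
#   1. clone the tag-annotated fisher-callhome-corpus and extract the code-switched words
#   2. clone the clean corpus (keep_tags branch) and copy its train/dev/dev2/test splits
#   3. prepare-sets.sh cuts per-utterance 16 kHz audio with sox and writes the YAML clip lists
#   4. build the CS/monolingual splits, the combined `eval` set, the LID labels, and the mapping files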
4 | # 5 | 6 | # get the Fisher data with CS tags 7 | git clone https://github.com/orionw/fisher-callhome-corpus.git 8 | mv fisher-callhome-corpus fisher-callhome-corpus-tags 9 | cd fisher-callhome-corpus-tags 10 | make 11 | cd ../ 12 | python extract_cs_words_from_raw_data.py # makes indexes of CS data and keeps the CS words 13 | 14 | # make the clean data without the CS tags to use 15 | git clone -b keep_tags https://github.com/orionw/fisher-callhome-corpus.git 16 | cd fisher-callhome-corpus 17 | make 18 | cp corpus/ldc/fisher_dev.{en,es}* ../splits_data/dev/ 19 | cp corpus/ldc/fisher_train.{en,es}* ../splits_data/train/ 20 | cp corpus/ldc/fisher_dev2.{en,es}* ../splits_data/dev2/ 21 | cp corpus/ldc/fisher_test.{en,es}* ../splits_data/test/ 22 | cd ../ 23 | 24 | # prepare the speech data (process to 16K, match to the other data lines) 25 | bash prepare-sets.sh 26 | cp fisher_train.yaml splits_data/train/ 27 | cp fisher_test.yaml splits_data/test/ 28 | cp fisher_dev.yaml splits_data/dev/ 29 | cp fisher_dev2.yaml splits_data/dev2/ 30 | mkdir speech 31 | mv fisher_train speech 32 | mv fisher_dev speech 33 | mv fisher_dev2 speech 34 | mv fisher_test speech 35 | 36 | # make the CS and Monolingual splits 37 | sed -i "s/\r//g" splits_data/*/* # something adds extra carriage returns 38 | python make_cs_splits.py 39 | # make the `eval` set consisting of dev dev2 test 40 | python combine_eval_splits.py 41 | python split_train_and_make_lid.py # split into training and dev CS sets and determine the LID 42 | python make_mapping_files.py # if you want the mapping files, optional 43 | -------------------------------------------------------------------------------- /fisher/split_train_and_make_lid.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 
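#
# LID label convention, as implemented below: an utterance gets label 0
# (English) when more than half of its words are annotated code-switched
# (English) words, label 1 (Spanish) when fewer than half are, and exact
# 50/50 ties are broken uniformly at random (hence the fixed seed below).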
4 | # 5 | 6 | # This file creates the LID labels for training/dev as well as splitting the training CS set into dev/train 7 | import os 8 | import random 9 | import shutil 10 | import yaml 11 | import string 12 | import numpy as np 13 | 14 | random.seed(1) 15 | 16 | 17 | def create_and_save_labels_for_cs_train_data( 18 | transcript, transcript_train, cs_words, output_path, desc, name 19 | ) -> tuple: 20 | """Only used for fisher_train_cs to save train and dev set labels""" 21 | labels1 = [] 22 | labels2 = [] 23 | cs_words1, cs_words2 = cs_words 24 | for idx, instance in enumerate(transcript): 25 | words = instance.translate(str.maketrans("", "", string.punctuation)) 26 | cs = cs_words1[idx].translate(str.maketrans("", "", string.punctuation)) 27 | 28 | cs_count = len(cs.strip().split(" ")) 29 | all_count = len(words.strip().split(" ")) 30 | if cs_count / all_count > 0.5: 31 | labels1.append(0) # english 32 | elif cs_count / all_count < 0.5: 33 | labels1.append(1) # spanish 34 | else: 35 | labels1.append(int(random.random() > 0.5)) 36 | 37 | for idx, instance in enumerate(transcript_train): 38 | words = instance.translate(str.maketrans("", "", string.punctuation)) 39 | cs = cs_words2[idx].translate(str.maketrans("", "", string.punctuation)) 40 | 41 | cs_count = len(cs.strip().split(" ")) 42 | all_count = len(words.strip().split(" ")) 43 | if cs_count / all_count > 0.5: 44 | labels2.append(0) # english 45 | elif cs_count / all_count < 0.5: 46 | labels2.append(1) # spanish 47 | else: 48 | labels2.append(int(random.random() > 0.5)) 49 | 50 | print( 51 | f"Averages: labels1={np.array(labels1).mean()} labels2={np.array(labels2).mean()}" 52 | ) 53 | 54 | if not os.path.isdir(os.path.join(output_path, desc + "_dev")): 55 | os.makedirs(os.path.join(output_path, desc + "_dev")) 56 | if not os.path.isdir(os.path.join(output_path, desc + "_train")): 57 | os.makedirs(os.path.join(output_path, desc + "_train")) 58 | 59 | with open(os.path.join(output_path, desc + "_dev", "lid_labels.txt"), "w") as fout: 60 | for label in labels1: 61 | fout.write(str(label)) 62 | fout.write("\n") 63 | 64 | with open( 65 | os.path.join(output_path, desc + "_train", "lid_labels.txt"), "w" 66 | ) as fout: 67 | for label in labels2: 68 | fout.write(str(label)) 69 | fout.write("\n") 70 | 71 | return labels1, labels2 72 | 73 | 74 | def write_out_data( 75 | yaml_data, transcript, translation, base_path, output_path, desc, name 76 | ): 77 | """A helper function for writing out all the data""" 78 | if not os.path.isdir(os.path.join(output_path, desc, "clips")): 79 | os.makedirs(os.path.join(output_path, desc, "clips")) 80 | 81 | with open(os.path.join(output_path, desc, f"{name}.yaml"), "w") as fout: 82 | fout.write(yaml.dump(yaml_data)) 83 | with open(os.path.join(output_path, desc, f"{name}.transcript"), "w") as fout: 84 | for line in transcript: 85 | assert "\n" not in line, line 86 | fout.write(line) 87 | fout.write("\n") 88 | with open(os.path.join(output_path, desc, f"{name}.translation"), "w") as fout: 89 | for line in translation: 90 | assert "\n" not in line, line 91 | fout.write(line) 92 | fout.write("\n") 93 | 94 | for instance in yaml_data: 95 | audio_path = instance["wav"] 96 | shutil.copy( 97 | os.path.join(base_path, audio_path), 98 | os.path.join(output_path, desc, "clips", audio_path.split("/")[-1]), 99 | ) 100 | 101 | # make it a zip file 102 | shutil.make_archive( 103 | os.path.join(output_path, desc, "clips"), 104 | "zip", 105 | os.path.join(output_path, desc,
"clips"), 106 | ) 107 | 108 | 109 | def sample_yaml_data( 110 | yaml_data, transcript, translation, num_idxs_to_sample, return_both: bool = False, should_write_out: bool = False 111 | ): 112 | """A helper function for sampling from the data""" 113 | split_idx = np.array( 114 | random.sample(list(range(len(yaml_data))), k=num_idxs_to_sample) 115 | ) 116 | if should_write_out: 117 | with open("train_vs_dev_cs.txt", "w") as fout: 118 | for line in split_idx.tolist(): 119 | fout.write(str(line)) 120 | fout.write("\n") 121 | bool_split = np.isin(np.arange(len(transcript)), split_idx) 122 | transcript1 = np.array(transcript)[bool_split].tolist() 123 | translation1 = np.array(translation)[bool_split].tolist() 124 | yaml_data1 = np.array(yaml_data)[bool_split].tolist() 125 | if return_both: 126 | cs_words = [] 127 | with open("cs_corpus/fisher_train_cs_words_cs_only.es", "r") as fin: 128 | for line in fin: 129 | cs_words.append(line.strip()) 130 | assert len(cs_words) == len(yaml_data) 131 | cs_words1 = np.array(cs_words)[bool_split].tolist() 132 | cs_words2 = np.array(cs_words)[~bool_split].tolist() 133 | 134 | transcript2 = np.array(transcript)[~bool_split].tolist() 135 | translation2 = np.array(translation)[~bool_split].tolist() 136 | yaml_data2 = np.array(yaml_data)[~bool_split].tolist() 137 | return ( 138 | yaml_data1, 139 | transcript1, 140 | translation1, 141 | yaml_data2, 142 | transcript2, 143 | translation2, 144 | (cs_words1, cs_words2), 145 | ) 146 | else: 147 | return yaml_data1, transcript1, translation1 148 | 149 | 150 | def create_and_save_cs_labels_only(yaml_data, transcript, translation): 151 | """A function that only creates the LID labels and saves them (only used for Fisher Eval CS)""" 152 | assert len(yaml_data) == len(transcript) == len(translation) 153 | 154 | cs_words_list = [] 155 | for file_type in ["dev", "dev2", "test"]: 156 | with open(f"cs_corpus/fisher_{file_type}_cs_words_cs_only.es", "r") as fin: 157 | for line in fin: 158 | cs_words_list.append(line.strip()) 159 | 160 | assert len(cs_words_list) == len( 161 | yaml_data 162 | ), f"CS words: {len(cs_words_list)} len_data={len(yaml_data)}" 163 | 164 | labels = [] 165 | for idx, instance in enumerate(transcript): 166 | transcript_str = instance.translate(str.maketrans("", "", string.punctuation)) 167 | cs_words = cs_words_list[idx].translate( 168 | str.maketrans("", "", string.punctuation) 169 | ) 170 | 171 | cs_count = len(cs_words.strip().split(" ")) 172 | all_count = len(transcript_str.strip().split(" ")) 173 | if cs_count / all_count > 0.5: 174 | labels.append(0) # english 175 | elif cs_count / all_count < 0.5: 176 | labels.append(1) # spanish 177 | else: 178 | labels.append(int(random.random() > 0.5)) 179 | 180 | assert len(labels) == len(yaml_data) 181 | 182 | with open(os.path.join("output/fisher/eval/cs/fisher.labels"), "w") as fout: 183 | for label in labels: 184 | fout.write(str(label)) 185 | fout.write("\n") 186 | 187 | 188 | def gather_lid_data(): 189 | output_path = "output/lid" 190 | num_idxs_to_sample = None 191 | if not os.path.isdir(output_path): 192 | os.makedirs(output_path) 193 | 194 | data_paths = [ 195 | ("fisher_eval_cs", "fisher", "output/fisher/eval/cs"), 196 | ("fisher_train_cs", "fisher", "output/fisher/train/cs"), 197 | # for the monolingual ones, sample some of them for use in LID training 198 | ("fisher_train_mono", "fisher", "output/fisher/train/mono"), 199 | ("miami_train_mono", "miami", "../miami/output/miami/mono_train"), 200 | ] 201 | for (desc, name, base_path) in data_paths: 202 | 
print(f"Working on {desc}") 203 | transcript = [] 204 | translation = [] 205 | with open(f"{base_path}/{name}.yaml", "r") as fin: 206 | yaml_data = yaml.safe_load(fin) 207 | with open(f"{base_path}/{name}.transcript", "r") as fin: 208 | for line in fin: 209 | transcript.append(line.strip()) 210 | with open(f"{base_path}/{name}.translation", "r") as fin: 211 | for line in fin: 212 | translation.append(line.strip()) 213 | assert len(transcript) == len(yaml_data) == len(translation) 214 | 215 | if desc == "fisher_eval_cs": 216 | create_and_save_cs_labels_only(yaml_data, transcript, translation) 217 | elif desc == "fisher_train_cs": 218 | # need to split this into train and dev, then save 219 | yaml_data, transcript, translation, yaml_data_train, transcript_train, translation_train, cs_words = sample_yaml_data(yaml_data, transcript, translation, 220 | int(0.1 * len(yaml_data)), return_both=True, 221 | should_write_out=True) 222 | print(f"Length of the data {base_path}/{name + '_dev'} is {len(yaml_data)}") 223 | create_and_save_labels_for_cs_train_data(transcript, transcript_train, cs_words, output_path, desc, name) 224 | write_out_data(yaml_data, transcript, translation, base_path, output_path, desc + "_dev", name) 225 | 226 | print(f"Length of the data {base_path}/{name + '_train'} is {len(yaml_data_train)}") 227 | write_out_data(yaml_data_train, transcript_train, translation_train, base_path, output_path, desc + "_train", name) 228 | num_idxs_to_sample = len(yaml_data_train) # make fisher cs the base 229 | else: # is monolingual 230 | yaml_data, transcript, translation = sample_yaml_data(yaml_data, transcript, translation, min(len(yaml_data), num_idxs_to_sample)) 231 | print(f"Length of the data {base_path}/{name} is {len(yaml_data)}") 232 | write_out_data(yaml_data, transcript, translation, base_path, output_path, desc, name) 233 | 234 | 235 | if __name__ == "__main__": 236 | gather_lid_data() 237 | -------------------------------------------------------------------------------- /fisher/splits_data/README.md: -------------------------------------------------------------------------------- 1 | These folders are used to hold the initial Fisher files that are later used for processing. 
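2 | 
3 | For example, once you have a checkout of `fisher-callhome-corpus` (see the per-split READMEs below), the split files can be dropped into these folders with a short sketch like the following, run from this `splits_data` directory. The corpus location is an assumption; adjust it to wherever your checkout actually lives.
4 | 
5 | ```python
6 | # a minimal sketch for populating these folders; the corpus path below
7 | # is an assumption, so adjust it to your own checkout
8 | import glob
9 | import shutil
10 | 
11 | CORPUS = "../../fisher-callhome-corpus/corpus/ldc"  # hypothetical location
12 | for split in ["dev", "dev2", "test", "train"]:
13 |     # copy e.g. fisher_dev.en.0 ... fisher_dev.yaml into ./dev/
14 |     for path in glob.glob(f"{CORPUS}/fisher_{split}.*"):
15 |         shutil.copy(path, split)
16 | ```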
-------------------------------------------------------------------------------- /fisher/splits_data/dev/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_dev.en.0` 4 | - `fisher_dev.en.1` 5 | - `fisher_dev.en.2` 6 | - `fisher_dev.en.3` 7 | - `fisher_dev.es` 8 | - `fisher_dev.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/dev2/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_dev2.en.0` 4 | - `fisher_dev2.en.1` 5 | - `fisher_dev2.en.2` 6 | - `fisher_dev2.en.3` 7 | - `fisher_dev2.es` 8 | - `fisher_dev2.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/test/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_test.en.0` 4 | - `fisher_test.en.1` 5 | - `fisher_test.en.2` 6 | - `fisher_test.en.3` 7 | - `fisher_test.es` 8 | - `fisher_test.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/train/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_train.en` 4 | - `fisher_train.es` 5 | - `fisher_train.yaml` -------------------------------------------------------------------------------- /mapping_files/README.md: -------------------------------------------------------------------------------- 1 | # Mapping Files 2 | We also provide mapping files for both Fisher and Miami that map the instances in each dataset to their respective splits in our data. `n/a` values in the `split` columns mean that the instance was not used in our data (e.g., due to a missing translation). For the Fisher corpus, the `audio_file` column refers to the audio file mappings in `LDC2010T04`, where the duration of each audio file can be found (not included here due to licensing). 3 | 4 | These files may be useful if you are looking to make additional modifications beyond what this repository provides.
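5 | 
6 | For example, here is a quick sketch of inspecting the Miami mapping with pandas (the column names below follow `miami/create_test_sets.py`, which generates the file):
7 | 
8 | ```python
9 | import pandas as pd
10 | 
11 | mapping = pd.read_csv("miami_mapping.csv")
12 | used = mapping[mapping["split"] != "n/a"]  # drop instances not used in our data
13 | print(used.groupby(["split", "cs_type"]).size())  # counts per split and CS type
14 | ```
15 | 
16 | The Fisher mapping file can be inspected the same way, substituting its `audio_file` column where needed.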
-------------------------------------------------------------------------------- /miami/common_words/eng.txt: -------------------------------------------------------------------------------- 1 | the 2 | and 3 | to 4 | of 5 | a 6 | in 7 | is 8 | that 9 | for 10 | I 11 | you 12 | it 13 | with 14 | on 15 | as 16 | are 17 | be 18 | this 19 | was 20 | have 21 | or 22 | at 23 | not 24 | your 25 | from 26 | we 27 | by 28 | will 29 | can 30 | but 31 | they 32 | an 33 | he 34 | all 35 | has 36 | if 37 | their 38 | one 39 | do 40 | more 41 | n't 42 | my 43 | his 44 | so 45 | there 46 | about 47 | which 48 | when 49 | what 50 | out 51 | up 52 | our 53 | who 54 | also 55 | had 56 | time 57 | some 58 | would 59 | were 60 | like 61 | been 62 | just 63 | her 64 | new 65 | other 66 | them 67 | she 68 | people 69 | these 70 | no 71 | get 72 | how 73 | me 74 | into 75 | than 76 | only 77 | its 78 | most 79 | may 80 | any 81 | many 82 | make 83 | then 84 | well 85 | first 86 | very 87 | over 88 | now 89 | could 90 | after 91 | even 92 | because 93 | us 94 | said 95 | good 96 | way 97 | two 98 | should 99 | work 100 | use 101 | through 102 | see 103 | know 104 | did 105 | much 106 | where 107 | years 108 | need 109 | him 110 | back 111 | such 112 | those 113 | being 114 | day 115 | take 116 | while 117 | here 118 | before 119 | does 120 | great 121 | year 122 | go 123 | help 124 | want 125 | really 126 | think 127 | best 128 | life 129 | each 130 | made 131 | right 132 | world 133 | business 134 | home 135 | own 136 | down 137 | still 138 | used 139 | find 140 | around 141 | going 142 | every 143 | both 144 | last 145 | off 146 | too 147 | same 148 | information 149 | little 150 | another 151 | look 152 | few 153 | long 154 | part 155 | since 156 | things 157 | place 158 | am 159 | between 160 | during 161 | different 162 | must 163 | come 164 | using 165 | however 166 | without 167 | high 168 | why 169 | something 170 | online 171 | system 172 | better 173 | three 174 | never 175 | always 176 | love 177 | say 178 | might 179 | next 180 | company 181 | state 182 | number 183 | again 184 | free 185 | lot 186 | under 187 | family 188 | found 189 | within 190 | give 191 | set 192 | school 193 | important 194 | water 195 | able 196 | keep 197 | got 198 | sure 199 | end 200 | money 201 | service 202 | small 203 | put 204 | experience 205 | having 206 | once 207 | available 208 | health 209 | support 210 | often 211 | including 212 | days 213 | away 214 | old 215 | area 216 | feel 217 | read 218 | show 219 | big 220 | against 221 | thing 222 | order 223 | program 224 | though 225 | city 226 | group 227 | services 228 | site 229 | making 230 | course 231 | point 232 | children 233 | times 234 | team 235 | game 236 | along 237 | let 238 | house 239 | today 240 | body 241 | working 242 | case 243 | man 244 | real 245 | provide 246 | care 247 | public 248 | top 249 | looking 250 | several 251 | start 252 | less 253 | process 254 | become 255 | actually 256 | local 257 | together 258 | person 259 | change 260 | book 261 | enough 262 | getting 263 | week 264 | power 265 | until 266 | market 267 | fact 268 | god 269 | food 270 | students 271 | full 272 | women 273 | community 274 | name 275 | second 276 | data 277 | government 278 | says 279 | others 280 | ever 281 | yet 282 | research 283 | done 284 | left 285 | far 286 | large 287 | called 288 | doing 289 | already 290 | development 291 | social 292 | open 293 | possible 294 | side 295 | play 296 | means 297 | needs 298 | try 299 | came 300 | ca 301 | based 302 | hard 
303 | thought 304 | products 305 | national 306 | quality 307 | level 308 | live 309 | design 310 | makes 311 | project 312 | line 313 | night 314 | least 315 | whether 316 | job 317 | car 318 | example 319 | include 320 | following 321 | given 322 | website 323 | past 324 | plan 325 | offer 326 | buy 327 | call 328 | went 329 | simply 330 | hand 331 | music 332 | easy 333 | problem 334 | men 335 | country 336 | took 337 | four 338 | members 339 | form 340 | personal 341 | control 342 | energy 343 | room 344 | head 345 | pay 346 | create 347 | run 348 | kind 349 | credit 350 | almost 351 | believe 352 | quite 353 | mind 354 | law 355 | early 356 | comes 357 | states 358 | usually 359 | companies 360 | web 361 | taking 362 | started 363 | later 364 | although 365 | story 366 | per 367 | future 368 | known 369 | someone 370 | across 371 | rather 372 | young 373 | whole 374 | special 375 | everything 376 | months 377 | anything 378 | training 379 | url 380 | bit 381 | seen 382 | product 383 | american 384 | please 385 | management 386 | cost 387 | either 388 | light 389 | university 390 | face 391 | due 392 | nothing 393 | human 394 | event 395 | history 396 | probably 397 | friends 398 | learn 399 | current 400 | tell 401 | general 402 | price 403 | list 404 | type 405 | building 406 | industry 407 | bad 408 | check 409 | everyone 410 | office 411 | idea 412 | internet 413 | news 414 | million 415 | video 416 | among 417 | air 418 | especially 419 | told 420 | results 421 | post 422 | hours 423 | international 424 | center 425 | understand 426 | above 427 | addition 428 | major 429 | education 430 | white 431 | particular 432 | problems 433 | media 434 | according 435 | upon 436 | page 437 | continue 438 | black 439 | study 440 | issues 441 | inside 442 | technology 443 | five 444 | value 445 | further 446 | access 447 | reason 448 | short 449 | TRUE 450 | simple 451 | natural 452 | amount 453 | search 454 | result 455 | taken 456 | main 457 | heart 458 | space 459 | financial 460 | ago 461 | trying 462 | question 463 | living 464 | likely 465 | interest 466 | various 467 | insurance 468 | common 469 | move 470 | child 471 | yourself 472 | report 473 | certain 474 | share 475 | single 476 | close 477 | instead 478 | bring 479 | works 480 | age 481 | s 482 | season 483 | hope 484 | coming 485 | areas 486 | ask 487 | medical 488 | low 489 | games 490 | turn 491 | key 492 | party 493 | add 494 | month 495 | seems 496 | view 497 | fun 498 | matter 499 | words 500 | needed -------------------------------------------------------------------------------- /miami/common_words/spa.txt: -------------------------------------------------------------------------------- 1 | de 2 | la 3 | que 4 | el 5 | y 6 | en 7 | a 8 | los 9 | del 10 | se 11 | las 12 | por 13 | un 14 | con 15 | para 16 | no 17 | una 18 | es 19 | su 20 | al 21 | lo 22 | como 23 | más 24 | o 25 | este 26 | pero 27 | sus 28 | esta 29 | si 30 | ha 31 | me 32 | ya 33 | le 34 | son 35 | sobre 36 | entre 37 | ser 38 | fue 39 | sin 40 | todo 41 | también 42 | desde 43 | cuando 44 | muy 45 | años 46 | está 47 | todos 48 | hay 49 | tiene 50 | nos 51 | porque 52 | dos 53 | hasta 54 | donde 55 | parte 56 | así 57 | han 58 | puede 59 | mi 60 | año 61 | cada 62 | uno 63 | vez 64 | bien 65 | hace 66 | trabajo 67 | nacional 68 | estado 69 | otros 70 | gobierno 71 | eso 72 | tiempo 73 | además 74 | mismo 75 | ese 76 | hacer 77 | país 78 | yo 79 | durante 80 | te 81 | día 82 | tanto 83 | vida 84 | esto 85 | forma 86 | estos 87 | sólo 88 | personas 89 | ni
90 | otro 91 | ahora 92 | hoy 93 | era 94 | caso 95 | están 96 | les 97 | mejor 98 | lugar 99 | qué 100 | quien 101 | cual 102 | esa 103 | ciudad 104 | general 105 | mundo 106 | siempre 107 | menos 108 | desarrollo 109 | contra 110 | cuenta 111 | tres 112 | ver 113 | más 114 | mayor 115 | otra 116 | mucho 117 | dijo 118 | tienen 119 | sido 120 | presidente 121 | ante 122 | según 123 | tener 124 | primera 125 | sea 126 | debe 127 | después 128 | aunque 129 | ley 130 | sistema 131 | manera 132 | solo 133 | poder 134 | nuevo 135 | ellos 136 | todas 137 | social 138 | información 139 | momento 140 | sino 141 | nuestro 142 | otras 143 | antes 144 | luego 145 | estas 146 | tu 147 | algo 148 | había 149 | días 150 | nuestra 151 | primer 152 | nada 153 | hecho 154 | poco 155 | pueden 156 | proyecto 157 | será 158 | va 159 | grupo 160 | fueron 161 | través 162 | algunos 163 | tan 164 | tipo 165 | medio 166 | gente 167 | decir 168 | equipo 169 | nueva 170 | importante 171 | san 172 | toda 173 | mientras 174 | pues 175 | centro 176 | acuerdo 177 | programa 178 | salud 179 | pasado 180 | empresa 181 | muchos 182 | fin 183 | dentro 184 | nivel 185 | partido 186 | servicios 187 | casa 188 | educación 189 | servicio 190 | seguridad 191 | proceso 192 | horas 193 | él 194 | política 195 | tal 196 | artículo 197 | universidad 198 | historia 199 | cosas 200 | cualquier 201 | sí 202 | unos 203 | hacia 204 | misma 205 | estar 206 | ello 207 | tema 208 | cómo 209 | empresas 210 | gracias 211 | calidad 212 | quienes 213 | embargo 214 | público 215 | frente 216 | agua 217 | situación 218 | ella 219 | sociedad 220 | creo 221 | nosotros 222 | final 223 | muchas 224 | méxico 225 | derecho 226 | zona 227 | argentina 228 | bajo 229 | estamos 230 | respecto 231 | entonces 232 | sector 233 | ejemplo 234 | estaba 235 | tras 236 | semana 237 | personal 238 | casi 239 | tenemos 240 | recursos 241 | diferentes 242 | dice 243 | veces 244 | punto 245 | estados 246 | uso 247 | actividades 248 | partir 249 | haber 250 | dar 251 | relación 252 | internacional 253 | número 254 | meses 255 | niños 256 | parece 257 | aún 258 | derechos 259 | datos 260 | aquí 261 | grandes 262 | nunca 263 | problemas 264 | mercado 265 | países 266 | cambio 267 | nombre 268 | he 269 | persona 270 | nuestros 271 | segundo 272 | hizo 273 | sentido 274 | cuatro 275 | fecha 276 | da 277 | posible 278 | comunidad 279 | mujeres 280 | lado 281 | obra 282 | familia 283 | junto 284 | director 285 | problema 286 | condiciones 287 | total 288 | actividad 289 | falta 290 | buena 291 | tengo 292 | investigación 293 | algunas 294 | bueno 295 | españa 296 | productos 297 | producción 298 | último 299 | presente 300 | casos 301 | comisión 302 | pública 303 | fuera 304 | igual 305 | atención 306 | van 307 | realidad 308 | objetivo 309 | estudio 310 | mediante 311 | control 312 | verdad 313 | provincia 314 | puntos 315 | pueblo 316 | buenos 317 | sociales 318 | hemos 319 | experiencia 320 | apoyo 321 | hombre 322 | varios 323 | medios 324 | resultados 325 | obras 326 | local 327 | chile 328 | dirección 329 | realizar 330 | deben 331 | base 332 | mes 333 | cuanto 334 | gestión 335 | trata 336 | buen 337 | municipal 338 | siendo 339 | julio 340 | alguna 341 | unidos 342 | trabajadores 343 | ayer 344 | proyectos 345 | incluso 346 | cultura 347 | esos 348 | mañana 349 | llegar 350 | dicho 351 | región 352 | segunda 353 | población 354 | plan 355 | paso 356 | mundial 357 | conocer 358 | participación 359 | estoy 360 | jóvenes 361 | mujer 362 | cargo 363 | primero 364 | 
administración 365 | nuevos 366 | hora 367 | cuales 368 | ciento 369 | comunicación 370 | especial 371 | claro 372 | pesos 373 | espacio 374 | estudios 375 | dios 376 | nuevas 377 | juego 378 | mal 379 | encuentra 380 | cinco 381 | mis 382 | capital 383 | valor 384 | seguir 385 | autoridades 386 | podría 387 | justicia 388 | escuela 389 | tuvo 390 | mayoría 391 | área 392 | saber 393 | luis 394 | organización 395 | cuerpo 396 | ministerio 397 | acción 398 | diciembre 399 | largo 400 | nadie 401 | formación 402 | encuentro 403 | ir 404 | consejo 405 | actual 406 | construcción 407 | vamos 408 | necesario 409 | capacidad 410 | acciones 411 | noche 412 | hacen 413 | ex 414 | cabo 415 | estudiantes 416 | idea 417 | minutos 418 | debido 419 | mayo 420 | orden 421 | campo 422 | octubre 423 | haya 424 | presencia 425 | tarde 426 | modo 427 | permite 428 | podemos 429 | red 430 | temas 431 | edad 432 | tenía 433 | últimos 434 | federal 435 | anterior 436 | respuesta 437 | internet 438 | ahí 439 | puesto 440 | cantidad 441 | usted 442 | real 443 | serie 444 | existe 445 | próximo 446 | dinero 447 | dio 448 | principal 449 | sería 450 | materia 451 | libro 452 | acceso 453 | marco 454 | maría 455 | alto 456 | noviembre 457 | calle 458 | siguiente 459 | central 460 | alumnos 461 | web 462 | algún 463 | posibilidad 464 | modelo 465 | grupos 466 | medida 467 | soy 468 | quiere 469 | cierto 470 | futuro 471 | análisis 472 | mano 473 | humanos 474 | instituto 475 | superior 476 | propio 477 | señor 478 | santa 479 | favor 480 | municipio 481 | cerca 482 | tierra 483 | políticas 484 | programas 485 | ambiente 486 | oportunidad 487 | domingo 488 | economía 489 | crisis 490 | marzo 491 | mejores 492 | interés 493 | etc. 494 | conocimiento 495 | sigue 496 | necesidad 497 | haciendo 498 | cosa 499 | unas 500 | serán -------------------------------------------------------------------------------- /miami/create_test_sets.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 
4 | # 5 | 6 | # this file takes all of the Miami data and turns it into splits 7 | import os 8 | import yaml 9 | import json 10 | from tqdm import tqdm 11 | import shutil 12 | import numpy as np 13 | import random 14 | import pandas as pd 15 | 16 | random.seed(1) 17 | 18 | DATASET_NAMES = ["cs", "mono"] 19 | 20 | def map_cs(x: str) -> str: 21 | if x == "n/a": 22 | return x 23 | elif "cs" in x: 24 | return "cs" 25 | else: 26 | return "mono" 27 | 28 | def map_split(x: str) -> str: 29 | if x == "n/a": 30 | return x 31 | elif "train" in x: 32 | return "train" 33 | else: 34 | return "test" 35 | 36 | def split_data(): 37 | print("Loading the data...") 38 | base_path = "output/miami/all" 39 | base_output_path = "output/miami" 40 | transcript = [] 41 | translation = [] 42 | with open(f"{base_path}/miami.yaml", "r") as fin: 43 | yaml_data = yaml.safe_load(fin) 44 | with open(f"{base_path}/miami.transcript", "r") as fin: 45 | for line in fin: 46 | transcript.append(line.strip()) 47 | with open(f"{base_path}/miami.translation", "r") as fin: 48 | for line in fin: 49 | translation.append(line.strip()) 50 | assert len(transcript) == len(yaml_data) == len(translation), [len(transcript), len(yaml_data), len(translation)] 51 | print(f"Length of the original data is {len(transcript)}") 52 | 53 | mono = [[], [], []] 54 | cs = [[], [], []] 55 | print("Separating the data...") 56 | data_type = [] 57 | mono_count = 0 58 | mono_map = {} 59 | offsets = [] 60 | durations = [] 61 | files = [] 62 | local_file_lines = [] 63 | for idx in tqdm(range(len(yaml_data)), leave=True): 64 | # get values for making a mapping file 65 | local_line_num = yaml_data[idx]["wav"].split("/")[1].split("_")[-1].split(".")[0].replace("p", "") 66 | files.append(yaml_data[idx]["wav"].split("/")[1].split("_")[0]) 67 | offsets.append(yaml_data[idx]["offset"]) 68 | durations.append(yaml_data[idx]["duration"]) 69 | local_file_lines.append(local_line_num) 70 | 71 | if ( 72 | translation[idx] not in ["", "\n"] and transcript[idx] != translation[idx] 73 | ): # an identical transcript and translation would not be helpful 74 | 75 | yaml_instance = yaml_data[idx] 76 | if yaml_instance["duration"] < 0.3: # remove instances that are too short 77 | data_type.append("n/a") 78 | continue 79 | 80 | yaml_instance["offset"] = 0 # each segment becomes its own clip file 81 | 82 | if yaml_data[idx]["code_switched"]: 83 | cs[0].append(yaml_instance) 84 | cs[1].append(transcript[idx]) 85 | cs[2].append(translation[idx]) 86 | data_type.append("cs") 87 | else: 88 | mono[0].append(yaml_instance) 89 | mono[1].append(transcript[idx]) 90 | mono[2].append(translation[idx]) 91 | data_type.append("mono") 92 | mono_map[mono_count] = idx 93 | mono_count += 1 94 | else: 95 | data_type.append("n/a") 96 | 97 | # split the mono data 98 | mono[0] = np.array(mono[0]) 99 | mono[1] = np.array(mono[1]) 100 | mono[2] = np.array(mono[2]) 101 | split_mono_idx = np.array( 102 | random.sample(list(range(len(mono[0]))), len(mono[0]) // 2) 103 | ) 104 | global_map_from_mono = sorted([mono_map[cur_idx] for cur_idx in split_mono_idx]) 105 | bool_split = np.isin(np.arange(len(mono[0])), split_mono_idx) 106 | mono_train = [ 107 | mono[0][bool_split].tolist(), 108 | mono[1][bool_split].tolist(), 109 | mono[2][bool_split].tolist(), 110 | ] 111 | mono = [ 112 | mono[0][~bool_split].tolist(), 113 | mono[1][~bool_split].tolist(), 114 | mono[2][~bool_split].tolist(), 115 | ] 116 | 117 | # make a mapping file for others to use 118 | data_type = [dtype if idx not in global_map_from_mono else "mono_train" \ 119 | for idx, 
dtype in enumerate(data_type)] 120 | mapping_val = pd.DataFrame({"global_idx": list(range(len(data_type))), "split": data_type, 121 | "file": files, "file_line_num": local_file_lines, 122 | "offset": offsets, "duration": durations}) 123 | 124 | mapping_val["cs_type"] = mapping_val.split.apply(lambda x: map_cs(x)) 125 | mapping_val["split"] = mapping_val.split.apply(lambda x: map_split(x)) 126 | mapping_val.to_csv("miami_mapping.csv", index=None) 127 | 128 | print("Writing the data out...") 129 | for (name, datasets) in zip(DATASET_NAMES + ["mono_train"], [cs, mono, mono_train]): 130 | print(f"Length of the data {name} is {len(datasets[0])}") 131 | if not os.path.isdir(os.path.join(base_output_path, name)): 132 | os.makedirs(os.path.join(base_output_path, name)) 133 | with open(os.path.join(base_output_path, name, f"miami.jsonl"), "w") as fout: 134 | for segment in datasets[0]: 135 | fout.write(json.dumps(segment)) 136 | fout.write("\n") 137 | with open(os.path.join(base_output_path, name, f"miami.yaml"), "w") as fout: 138 | fout.write(yaml.dump(datasets[0], allow_unicode=True)) 139 | with open( 140 | os.path.join(base_output_path, name, f"miami.transcript"), "w" 141 | ) as fout: 142 | for line in datasets[1]: 143 | assert "\n" not in line, line 144 | fout.write(line) 145 | fout.write("\n") 146 | with open( 147 | os.path.join(base_output_path, name, f"miami.translation"), "w" 148 | ) as fout: 149 | for line in datasets[2]: 150 | assert "\n" not in line, line 151 | fout.write(line) 152 | fout.write("\n") 153 | 154 | print("Moving clip data...") 155 | mono_clips = [item["wav"] for item in mono[0]] 156 | mono_train_clips = [item["wav"] for item in mono_train[0]] 157 | cs_clips = [item["wav"] for item in cs[0]] 158 | assert len(mono_clips) == len(mono[0]) 159 | assert len(cs_clips) == len(cs[0]) 160 | assert len(mono_train_clips) == len(mono_train[0]) 161 | for (name, file_paths) in zip( 162 | DATASET_NAMES + ["mono_train"], [cs_clips, mono_clips, mono_train_clips] 163 | ): 164 | for file_path in file_paths: 165 | if not os.path.isdir(os.path.join(base_output_path, name, "clips")): 166 | os.makedirs(os.path.join(base_output_path, name, "clips")) 167 | shutil.copy( 168 | os.path.join(base_path, file_path), 169 | os.path.join(base_output_path, name, file_path), 170 | ) 171 | 172 | # make it a zip file 173 | shutil.make_archive( 174 | os.path.join(base_output_path, name, "clips"), 175 | "zip", 176 | os.path.join(base_output_path, name, "clips"), 177 | ) 178 | 179 | 180 | if __name__ == "__main__": 181 | split_data() 182 | -------------------------------------------------------------------------------- /miami/download_miami_data.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | # This script downloads the miami corpus from its repository and converts it into 16K audio 9 | 10 | # start by downloading their repository 11 | DIRECTORY="data/miami" 12 | if [ ! 
-d "$DIRECTORY" ]; then 13 | echo "cloning corpus, which contains the CHAT files with text and mappings" 14 | cd data 15 | git clone https://github.com/donnekgit/miami.git 16 | mkdir miami/audio 17 | cd ../ 18 | fi 19 | 20 | echo "downloading audio files" 21 | 22 | declare -a audio=("herring1" "herring2" "herring3" 23 | "herring5" "herring6" "herring7" "herring8" 24 | "herring9" "herring10" "herring11" "herring12" 25 | "herring13" "herring14" "herring15" "herring16" 26 | "herring17" "maria1" "maria2" "maria3" "maria4" 27 | "maria7" "maria10" "maria16" "maria18" "maria19" 28 | "maria20" "maria21" "maria24" "maria27" "maria30" 29 | "maria31" "maria40" "sastre1" "sastre2" "sastre3" 30 | "sastre4" "sastre5" "sastre6" "sastre7" "sastre8" 31 | "sastre9" "sastre10" "sastre11" "sastre12" "sastre13" 32 | "zeledon1" "zeledon2" "zeledon3" "zeledon4" "zeledon5" 33 | "zeledon6" "zeledon7" "zeledon8" "zeledon9" "zeledon11" 34 | "zeledon13" "zeledon14") 35 | 36 | # Download each of the above files 37 | for i in "${audio[@]}" 38 | do 39 | if [ ! -f "data/miami/audio/$i.mp3" ]; then 40 | echo "Downloading $i" 41 | wget -P data/miami/audio/ http://bangortalk.bangor.ac.uk/$i.mp3 42 | fi 43 | done 44 | 45 | # convert each file to 16 bit wav files 46 | for i in "${audio[@]}" 47 | do 48 | if [ ! -f "data/miami/audio/$i.wav" ]; then 49 | echo "converting mp3 to wav $i" 50 | ffmpeg -i data/miami/audio/$i.mp3 -acodec pcm_s16le -ac 1 -ar 16000 data/miami/audio/$i.wav 51 | fi 52 | done 53 | 54 | -------------------------------------------------------------------------------- /miami/process_miami_data.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # This file processes the Miami dataset into CS and monolingual test sets 7 | import re 8 | import os 9 | import glob 10 | import pylangacq 11 | import librosa 12 | import yaml 13 | import json 14 | import soundfile as sf 15 | import string 16 | from tqdm import tqdm 17 | import random 18 | from nltk.tokenize.treebank import TreebankWordDetokenizer 19 | 20 | DETOKENIZER = TreebankWordDetokenizer() 21 | 22 | ONE_SECOND = 16000 23 | 24 | # their language mapping to our language tags 25 | LANG_MAP = { 26 | "s:spa": "spa", 27 | "s:eng": "eng", 28 | "s:eng&spa": "unknown", 29 | "s:eng&spag": "unknown", 30 | "s:spa&eng": "unknown", 31 | "s:spa+eng": "first spa, second eng", 32 | "s:eng+spa": "first eng, second spa", 33 | "s:eng&spa+eng": "unknown+extra", 34 | "s:ita": "italian", 35 | "s:fra": "french", 36 | } 37 | 38 | MAP_FOR_WORD_PREDS = { 39 | "first spa, second eng": "spa", 40 | "first eng, second spa": "spa", 41 | "eng": "eng", 42 | "spa": "spa", 43 | "italian": "italian", 44 | "french": "french", 45 | } 46 | 47 | 48 | ##### simple string cleaning functions ##### 49 | def remove_punct(s: str) -> str: 50 | return s.translate(str.maketrans("", "", string.punctuation)) 51 | 52 | 53 | def verify_text(text: str): 54 | illegal_chars = ["[", "]", "(", ")", "/", "+", "&"] 55 | for char in illegal_chars: 56 | if char in text: 57 | raise Exception("had illegal char", char, text) 58 | 59 | 60 | def remove_leading_spaces_punct(sent: list) -> str: 61 | new_sent = DETOKENIZER.detokenize(sent) 62 | # doesn't get second sentence 63 | if " ." in new_sent: 64 | new_sent = new_sent.replace(" .", ".") 65 | if " ?" 
in new_sent: 66 | new_sent = new_sent.replace(" ?", "?") 67 | if " ," in new_sent: 68 | new_sent = new_sent.replace(" ,", ",") 69 | return new_sent 70 | 71 | 72 | def clean_underscores(sent: str) -> str: 73 | if "o_k" in sent: # don't want to remove for o_k 74 | sent = sent.replace("o_k", "ok") 75 | 76 | new_sent = sent.replace("_", " ") 77 | return new_sent 78 | 79 | 80 | def clean_up_common_markup_errors(sent: str) -> str: 81 | all_chars_to_replace = [ 82 | "(.)", 83 | "(..)", 84 | "+//", 85 | "<", 86 | ">", 87 | "+/.", 88 | "+/?", 89 | "/", 90 | "...", 91 | "..", 92 | "++", 93 | "+/", 94 | "xxx", 95 | "+", 96 | '+"', 97 | "+,", 98 | "[", 99 | "]", 100 | "“", 101 | ] 102 | for char_phrase in all_chars_to_replace: 103 | sent = sent.replace(char_phrase, "") 104 | 105 | if '".' in sent: 106 | sent = sent.replace('".', ".") 107 | if ":." in sent: 108 | sent = sent.replace(":.", ".") 109 | 110 | if "@s:eng&spa" in sent: 111 | sent = sent.replace("@s:eng&spa", "") 112 | 113 | if re.search('".*"', sent) is None: 114 | sent = sent.replace('"', "") 115 | return sent 116 | 117 | 118 | def clean_translation(text: str) -> str: 119 | text = clean_word_text(text) 120 | text = [word for word in re.sub(r"\([^)]*\)", "", text).split(" ") if word != ""] 121 | text = remove_leading_spaces_punct(text) 122 | return text 123 | 124 | 125 | def clean_word_text(transcript): 126 | transcript = clean_up_common_markup_errors(transcript) 127 | if len(transcript) and transcript[0] == ",": # strip a leading comma before detokenizing 128 | transcript = transcript[1:] 129 | 130 | detokenized_transcript = remove_leading_spaces_punct(transcript.split(" ")) 131 | clean_sent = clean_underscores(detokenized_transcript) 132 | 133 | return clean_sent 134 | 135 | 136 | def make_transcript_manually(raw_utt: str) -> str: 137 | """ 138 | Some of the utterances have disfluencies, which the pylangacq software excludes from the transcript. 139 | Thus, we have to manually take the raw utterance transcription in CHAT form to keep them 140 | """ 141 | filter_raw_utt = raw_utt.replace("<", "").replace(">", "") 142 | only_words = filter_raw_utt.split(" ")[:-1] # last one is timing 143 | new_words = [] 144 | for word in only_words: 145 | # strip the CHAT language annotation (e.g. word@s:eng) and other markup 146 | if "@" in word: 147 | word = word[: word.find("@")] # annotation for code-switching 148 | 149 | if not len(word): 150 | continue 151 | 152 | if word[0] == "[" or word[-1] == "]": 153 | continue # don't need the markup 154 | 155 | if word[0] == "&": 156 | continue # don't need partial starts 157 | 158 | word = clean_up_common_markup_errors(word) 159 | if len(word) == 0: 160 | continue 161 | 162 | if "+//." in word: 163 | word = word.replace("+//.", ".") 164 | if '".' 
in word: 165 | word = word.replace('".', ".") 166 | 167 | if "(" in word or ")" in word: 168 | # NOTE: this is where we remove parentheticals 169 | word = re.sub(r"\([^)]*\)", "", word) 170 | 171 | new_words.append(word) 172 | 173 | detokenized_sent = remove_leading_spaces_punct(new_words) 174 | clean_sent = clean_underscores(detokenized_sent) 175 | return clean_sent 176 | 177 | 178 | def gather_cs_statistics_and_words(utterance, raw_utt: str, transcript: str, file_lang: list, cur_lang: str): 179 | # for tagging each word, use a list of the most common words 180 | common_spanish_words = [] 181 | with open("common_words/spa.txt", "r") as fin: 182 | for line in fin: 183 | common_spanish_words.append(line.strip()) 184 | 185 | common_english_words = [] 186 | with open("common_words/eng.txt", "r") as fin: 187 | for line in fin: 188 | common_english_words.append(line.strip()) 189 | 190 | # drop words that appear in both lists (loanwords like "internet", etc.) 191 | common_spanish_words = list( 192 | set(common_spanish_words) 193 | - set(common_english_words).intersection(set(common_spanish_words)) 194 | ) 195 | 196 | def get_lang_id(input_word): # parse the CHAT language id 197 | word = input_word.split("@")[1] 198 | word = ( 199 | word.replace(">", "") 200 | .replace("[/]", "") 201 | .replace('"', "") 202 | .replace("”", "") 203 | .replace(".", "") 204 | .replace("]", "") 205 | .replace(",", "") 206 | ) 207 | return LANG_MAP[word] 208 | 209 | eng = ( 210 | utterance.tiers["%eng"] if "%eng" in utterance.tiers else None 211 | ) # English translation 212 | word_to_lang_map = [ 213 | (word.split("@")[0], get_lang_id(word)) 214 | for word in raw_utt.split(" ") 215 | if "@" in word 216 | ] 217 | is_cs = any( 218 | ["unknown" not in lang for (_, lang) in word_to_lang_map] 219 | ) # any non-unknown annotation means the utterance is code-switched 220 | is_cs_any = len(word_to_lang_map) 221 | num_words = len( 222 | [ 223 | item 224 | for item in raw_utt.split(" ")[:-1] 225 | if ("[" not in item and "." not in item) 226 | ] 227 | ) 228 | cs_percent = 0 if not is_cs else len(word_to_lang_map) / num_words 229 | 230 | # refine the choice of the main language, really only used for statistical purposes 231 | # this just flips the main language if the CS percent is greater than 0.5 232 | if (eng is None and cur_lang == "spa" and cs_percent > 0.5 and len(transcript.split(" ")) >= 3): 233 | cur_lang = "eng" 234 | if (eng is not None and cur_lang == "eng" and cs_percent > 0.5 and len(transcript.split(" ")) >= 3): 235 | cur_lang = "spa" 236 | 237 | # let's try to get word-level tags for the CS data. We have to do this by manually parsing the sentence
238 | clean_transcript = remove_punct(transcript) 239 | cs_words = [ 240 | remove_punct(clean_word_text(word)) 241 | for (word, lang) in word_to_lang_map 242 | if "unknown" not in lang 243 | ] 244 | cs_words_lang = [ 245 | lang for (word, lang) in word_to_lang_map if "unknown" not in lang 246 | ] 247 | for idx, cs_word in enumerate(cs_words): 248 | if cs_word not in clean_transcript: 249 | # try to clean up the word to see if it's in the clean transcript 250 | if cs_word.replace("(", "").replace(")", "") in clean_transcript: 251 | cs_words[idx] = cs_word.replace("(", "").replace(")", "") 252 | elif cs_word.split("(")[0] in clean_transcript: 253 | cs_words[idx] = cs_word.split("(")[0] 254 | 255 | tagged_words = "" 256 | main_lang, embedded_lang = file_lang[0], file_lang[-1] 257 | if "[- spa]" in raw_utt or "[-spa]" in raw_utt: 258 | main_lang, embedded_lang = "spa", "eng" 259 | if "[- eng]" in raw_utt or "[-eng]" in raw_utt: 260 | main_lang, embedded_lang = "eng", "spa" 261 | 262 | for word in clean_transcript.split(" "): 263 | if word in cs_words: # first see if they were annotated 264 | index = cs_words.index(word) 265 | annote_lang = MAP_FOR_WORD_PREDS[cs_words_lang[index]] # annotated language; the tag below uses the file-level embedded language 266 | tagged_words += f"{word}={embedded_lang} " 267 | else: # try to rely on the backup common words if they're not annotated 268 | if word in common_spanish_words: 269 | tagged_words += f"{word}=spa " 270 | elif word in common_english_words: 271 | tagged_words += f"{word}=eng " 272 | else: 273 | tagged_words += f"{word}={main_lang} " 274 | 275 | tagged_words = tagged_words.strip() 276 | return tagged_words, eng, cs_percent, is_cs, is_cs_any 277 | 278 | 279 | 280 | def write_out(final_path, all_segments, all_transcripts, all_translations): 281 | with open(os.path.join(final_path, "miami.yaml"), "w") as fout: 282 | fout.write(yaml.dump(all_segments, allow_unicode=True)) 283 | 284 | with open(os.path.join(final_path, "miami.transcript"), "w") as fout: 285 | for line in all_transcripts: 286 | fout.write(line) 287 | fout.write("\n") 288 | 289 | with open(os.path.join(final_path, "miami.translation"), "w") as fout: 290 | for line in all_translations: 291 | fout.write(line) 292 | fout.write("\n") 293 | 294 | with open(os.path.join(final_path, "miami.jsonl"), "w") as fout: 295 | for segment in all_segments: 296 | fout.write(json.dumps(segment)) 297 | fout.write("\n") 298 | 299 | 300 | def prepare_miami_data(): 301 | all_segments = [] 302 | all_transcripts = [] 303 | all_translations = [] 304 | 305 | final_path = "output/miami/all" 306 | if not os.path.isdir(final_path): 307 | os.makedirs(os.path.join(final_path, "clips")) 308 | 309 | chat_file_location = "data/miami/beta" # beta has the most up-to-date transcripts 310 | for chat_file_path in tqdm( 311 | glob.glob(os.path.join(chat_file_location, "*.cha")), leave=True 312 | ): 313 | clip_name = chat_file_path.split("/")[-1].replace(".cha", "") 314 | cur_reader = pylangacq.read_chat(chat_file_path) 315 | all_words = cur_reader.words(by_utterances=True) 316 | assert len(cur_reader._files) == 1 317 | file_lang = cur_reader._files[0].header["Languages"] 318 | 319 | # get wav data 320 | wav_path = chat_file_path.replace("beta", "audio").replace("cha", "wav") 321 | wav_data, sampling_rate = librosa.load( 322 | wav_path, sr=ONE_SECOND 323 | ) # already at 16khz/16bit/mono 324 | assert sampling_rate == ONE_SECOND 325 | 326 | for idx, utterance in enumerate(cur_reader.utterances()): 327 | word_utterance = all_words[idx] 328 | transcript = " 
".join(word_utterance) 329 | if not len(transcript): 330 | continue 331 | transcript = clean_word_text(transcript) 332 | raw_utt = utterance.tiers[utterance.participant] 333 | 334 | # the main language can be overriden if marked that way 335 | if "[- eng]" in raw_utt or "[-eng]" in raw_utt: 336 | cur_lang = "eng" 337 | elif "[- spa]" in raw_utt or "[-spa]" in raw_utt: 338 | cur_lang = "spa" 339 | else: 340 | cur_lang = file_lang[0] 341 | 342 | ## Check if we really want to keep cleaning this utterance ## 343 | if "www" in raw_utt: 344 | continue # means untranscribed text, skip 345 | if word_utterance == ["."]: 346 | continue # we don't want empty lines 347 | 348 | if "[" in raw_utt: # some markup to deal with 349 | # see https://talkbank.org/manuals/CHAT.pdf for details 350 | markings = re.findall("\[.*?\]", raw_utt) 351 | for mark in markings: 352 | if mark in [ 353 | "[!]", 354 | "[?]", 355 | "[!!]", 356 | "[*]", 357 | "[/-]", 358 | "[//]", 359 | '["]', 360 | ] or mark in ["[- spa]", "[-spa]", "[-eng]", "[- eng]"]: 361 | """ 362 | Markup definitions that we can skip/remove for ST purposes: 363 | [!] means stressing 364 | [!!] means constrastive stressing 365 | [?] means uncertainty in transcription, but best guess 366 | [=! ...] is some kind of para-linguistic communication, laugh, yell, etc. 367 | [# ...] indicates duration of previous <> tag 368 | [*] means the word is incorrect semantically/grammatically, typically followed by the [* correct_word] 369 | [/-] is for false starts but still spoken 370 | [//] for abandended and retracing speech 371 | 372 | """ 373 | continue 374 | elif "[=!" in mark or "[= !" in mark or "[*" in mark: # see above 375 | continue 376 | elif mark in ["[/]", "[//]", "[///]"]: 377 | # indicates trailing or correction while speaking, pylangacq gets rid of them, do it manually 378 | if raw_utt is None: 379 | continue 380 | transcript = make_transcript_manually(raw_utt) 381 | break 382 | else: 383 | raise Exception(f"Encountered new mark {mark}") 384 | 385 | time_marks = utterance.time_marks 386 | if time_marks is None: 387 | continue # don't know why there are no time marks, but skip. 
# Happens approx. 3 times outside of maria18.cha, where there are ~20 instances 389 | 390 | # get the audio clip and validate it 391 | start_time, end_time = time_marks 392 | start_time_s, end_time_s = start_time / 1000, end_time / 1000 393 | duration_s = end_time_s - start_time_s 394 | wav_clip = wav_data[ 395 | int(start_time_s * ONE_SECOND) : int(end_time_s * ONE_SECOND) 396 | ] 397 | if int(end_time_s * ONE_SECOND) < wav_data.shape[0]: 398 | # clips may run past the end of the file (which we allow); otherwise the lengths must match 399 | error_str = f"Wav Clip:{wav_clip.shape[0]} vs duration:{duration_s * ONE_SECOND}" 400 | assert (duration_s * ONE_SECOND - wav_clip.shape[0]) < 1, error_str 401 | 402 | cur_clip_name = clip_name + "_p" + str(idx) 403 | clip_path = os.path.join("clips", cur_clip_name + ".wav") 404 | sf.write(os.path.join(final_path, clip_path), wav_clip, ONE_SECOND) 405 | 406 | # for LID and statistics, gather the lang id for each word 407 | tagged_words, eng, cs_percent, is_cs, is_cs_any = gather_cs_statistics_and_words(utterance, raw_utt, transcript, file_lang, cur_lang) 408 | speakers = utterance.participant # just in case it's needed someday for speaker ID 409 | 410 | all_segments.append( 411 | { 412 | "wav": clip_path, 413 | "offset": start_time_s, 414 | "duration": duration_s, 415 | "cs_percent": cs_percent, 416 | "speaker_id": speakers, 417 | "code_switched": is_cs, 418 | "main_lang": cur_lang, 419 | "code_switched_any": is_cs_any, 420 | "tagged_words": tagged_words, 421 | } 422 | ) 423 | translation = clean_translation(eng) if eng is not None else "" 424 | assert transcript is not None 425 | 426 | # validate the sentences 427 | verify_text(transcript) 428 | verify_text(translation) 429 | all_transcripts.append(transcript) 430 | all_translations.append(translation) 431 | assert len(all_transcripts) == len(all_segments) == len(all_translations) 432 | write_out(final_path, all_segments, all_transcripts, all_translations) 433 | 434 | 435 | if __name__ == "__main__": 436 | prepare_miami_data() 437 | -------------------------------------------------------------------------------- /miami/readme.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This repository contains all the scripts needed to download the Bangor Miami Corpus and preprocess it for Speech Translation. 3 | 4 | ## 1-Step Setup 5 | 0. Run `setup_all.sh` to download the data and process it. For more granular instructions, see the `Multi-Step Setup` below 6 | 7 | ## Multi-Step Setup 8 | 0. Gather the data by running `bash download_miami_data.sh`, which will place the data in `./data` 9 | 1. Format the data by running `python process_miami_data.py`, which will output the data in `output/miami/all`. It will contain three files: a `miami.yaml` file containing the segment timestamps, a `miami.transcript` file containing the transcripts, and a `miami.translation` file containing the translations 10 | 2. Create code-switched and non-code-switched sections by running `python create_test_sets.py` 11 | 3. 
To create LID data, run `python split_train_and_make_lid.py` from within the `fisher` directory (its relative paths assume it is run from there) 12 | 13 | 14 | ## Paper Reference 15 | The Bangor Miami corpus is found [here](https://biling.talkbank.org/access/Bangor/Miami.html) and was published as part of [this paper](https://www.researchgate.net/publication/292243516_Building_bilingual_corpora) -------------------------------------------------------------------------------- /miami/setup_all.sh: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # this script should set everything up 7 | mkdir -p output 8 | mkdir -p output/miami 9 | mkdir -p data 10 | 11 | bash download_miami_data.sh 12 | python process_miami_data.py 13 | python create_test_sets.py 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pylangacq==0.15.0 2 | PyYAML==5.4.1 3 | pandas==1.3.0 4 | numpy==1.20.0 5 | tqdm==4.61.2 6 | nltk==3.6.2 7 | beautifulsoup4==4.10.0 8 | librosa==0.8.1 --------------------------------------------------------------------------------