├── .gitignore
├── ACKNOWLEDGEMENTS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING-RESEARCH.md
├── LICENSE
├── README.md
├── create_datasets.sh
├── fisher
│   ├── README.md
│   ├── combine_eval_splits.py
│   ├── extract-utterance-audios.py
│   ├── extract_cs_words_from_raw_data.py
│   ├── make_cs_splits.py
│   ├── make_mapping_files.py
│   ├── prepare-sets.sh
│   ├── setup_all.sh
│   ├── split_train_and_make_lid.py
│   └── splits_data
│       ├── README.md
│       ├── dev
│       │   └── README.md
│       ├── dev2
│       │   └── README.md
│       ├── test
│       │   └── README.md
│       └── train
│           └── README.md
├── mapping_files
│   ├── README.md
│   ├── fisher_mapping.csv
│   └── miami_mapping.csv
├── miami
│   ├── common_words
│   │   ├── eng.txt
│   │   └── spa.txt
│   ├── create_test_sets.py
│   ├── download_miami_data.sh
│   ├── process_miami_data.py
│   ├── readme.md
│   └── setup_all.sh
└── requirements.txt
/.gitignore: -------------------------------------------------------------------------------- 1 | env/* 2 | **/data/* 3 | **/output/* 4 | **/.DS_Store 5 | **/speech/* -------------------------------------------------------------------------------- /ACKNOWLEDGEMENTS: -------------------------------------------------------------------------------- 1 | Acknowledgements 2 | Portions of this CODE-SWITCHED-SPEECH-TRANSLATION software may utilize the following copyrighted 3 | material, the use of which is hereby acknowledged. 4 | 5 | _____________________ 6 | 7 | Jackson L. Lee (pylangacq) 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in 17 | all copies or substantial portions of the Software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 25 | THE SOFTWARE. 26 | 27 | _____________________ 28 | Ingy döt Net, Kirill Simonov (PyYAML) 29 | 30 | Permission is hereby granted, free of charge, to any person obtaining a copy of 31 | this software and associated documentation files (the "Software"), to deal in 32 | the Software without restriction, including without limitation the rights to 33 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 34 | of the Software, and to permit persons to whom the Software is furnished to do 35 | so, subject to the following conditions: 36 | 37 | The above copyright notice and this permission notice shall be included in all 38 | copies or substantial portions of the Software. 39 | 40 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 41 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 42 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE 43 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 44 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 45 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 46 | SOFTWARE. 47 | 48 | _____________________ 49 | Wes McKinney (pandas) 50 | 51 | 1. Definitions. 52 | 53 | "License" shall mean the terms and conditions for use, reproduction, 54 | and distribution as defined by Sections 1 through 9 of this document. 55 | 56 | "Licensor" shall mean the copyright owner or entity authorized by 57 | the copyright owner that is granting the License. 58 | 59 | "Legal Entity" shall mean the union of the acting entity and all 60 | other entities that control, are controlled by, or are under common 61 | control with that entity. For the purposes of this definition, 62 | "control" means (i) the power, direct or indirect, to cause the 63 | direction or management of such entity, whether by contract or 64 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 65 | outstanding shares, or (iii) beneficial ownership of such entity. 66 | 67 | "You" (or "Your") shall mean an individual or Legal Entity 68 | exercising permissions granted by this License. 69 | 70 | "Source" form shall mean the preferred form for making modifications, 71 | including but not limited to software source code, documentation 72 | source, and configuration files. 73 | 74 | "Object" form shall mean any form resulting from mechanical 75 | transformation or translation of a Source form, including but 76 | not limited to compiled object code, generated documentation, 77 | and conversions to other media types. 78 | 79 | "Work" shall mean the work of authorship, whether in Source or 80 | Object form, made available under the License, as indicated by a 81 | copyright notice that is included in or attached to the work 82 | (an example is provided in the Appendix below). 83 | 84 | "Derivative Works" shall mean any work, whether in Source or Object 85 | form, that is based on (or derived from) the Work and for which the 86 | editorial revisions, annotations, elaborations, or other modifications 87 | represent, as a whole, an original work of authorship. For the purposes 88 | of this License, Derivative Works shall not include works that remain 89 | separable from, or merely link (or bind by name) to the interfaces of, 90 | the Work and Derivative Works thereof. 91 | 92 | "Contribution" shall mean any work of authorship, including 93 | the original version of the Work and any modifications or additions 94 | to that Work or Derivative Works thereof, that is intentionally 95 | submitted to Licensor for inclusion in the Work by the copyright owner 96 | or by an individual or Legal Entity authorized to submit on behalf of 97 | the copyright owner. For the purposes of this definition, "submitted" 98 | means any form of electronic, verbal, or written communication sent 99 | to the Licensor or its representatives, including but not limited to 100 | communication on electronic mailing lists, source code control systems, 101 | and issue tracking systems that are managed by, or on behalf of, the 102 | Licensor for the purpose of discussing and improving the Work, but 103 | excluding communication that is conspicuously marked or otherwise 104 | designated in writing by the copyright owner as "Not a Contribution." 
105 | 106 | "Contributor" shall mean Licensor and any individual or Legal Entity 107 | on behalf of whom a Contribution has been received by Licensor and 108 | subsequently incorporated within the Work. 109 | 110 | 2. Grant of Copyright License. Subject to the terms and conditions of 111 | this License, each Contributor hereby grants to You a perpetual, 112 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 113 | copyright license to reproduce, prepare Derivative Works of, 114 | publicly display, publicly perform, sublicense, and distribute the 115 | Work and such Derivative Works in Source or Object form. 116 | 117 | 3. Grant of Patent License. Subject to the terms and conditions of 118 | this License, each Contributor hereby grants to You a perpetual, 119 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 120 | (except as stated in this section) patent license to make, have made, 121 | use, offer to sell, sell, import, and otherwise transfer the Work, 122 | where such license applies only to those patent claims licensable 123 | by such Contributor that are necessarily infringed by their 124 | Contribution(s) alone or by combination of their Contribution(s) 125 | with the Work to which such Contribution(s) was submitted. If You 126 | institute patent litigation against any entity (including a 127 | cross-claim or counterclaim in a lawsuit) alleging that the Work 128 | or a Contribution incorporated within the Work constitutes direct 129 | or contributory patent infringement, then any patent licenses 130 | granted to You under this License for that Work shall terminate 131 | as of the date such litigation is filed. 132 | 133 | 4. Redistribution. You may reproduce and distribute copies of the 134 | Work or Derivative Works thereof in any medium, with or without 135 | modifications, and in Source or Object form, provided that You 136 | meet the following conditions: 137 | 138 | (a) You must give any other recipients of the Work or 139 | Derivative Works a copy of this License; and 140 | 141 | (b) You must cause any modified files to carry prominent notices 142 | stating that You changed the files; and 143 | 144 | (c) You must retain, in the Source form of any Derivative Works 145 | that You distribute, all copyright, patent, trademark, and 146 | attribution notices from the Source form of the Work, 147 | excluding those notices that do not pertain to any part of 148 | the Derivative Works; and 149 | 150 | (d) If the Work includes a "NOTICE" text file as part of its 151 | distribution, then any Derivative Works that You distribute must 152 | include a readable copy of the attribution notices contained 153 | within such NOTICE file, excluding those notices that do not 154 | pertain to any part of the Derivative Works, in at least one 155 | of the following places: within a NOTICE text file distributed 156 | as part of the Derivative Works; within the Source form or 157 | documentation, if provided along with the Derivative Works; or, 158 | within a display generated by the Derivative Works, if and 159 | wherever such third-party notices normally appear. The contents 160 | of the NOTICE file are for informational purposes only and 161 | do not modify the License. You may add Your own attribution 162 | notices within Derivative Works that You distribute, alongside 163 | or as an addendum to the NOTICE text from the Work, provided 164 | that such additional attribution notices cannot be construed 165 | as modifying the License. 
166 | 167 | You may add Your own copyright statement to Your modifications and 168 | may provide additional or different license terms and conditions 169 | for use, reproduction, or distribution of Your modifications, or 170 | for any such Derivative Works as a whole, provided Your use, 171 | reproduction, and distribution of the Work otherwise complies with 172 | the conditions stated in this License. 173 | 174 | 5. Submission of Contributions. Unless You explicitly state otherwise, 175 | any Contribution intentionally submitted for inclusion in the Work 176 | by You to the Licensor shall be under the terms and conditions of 177 | this License, without any additional terms or conditions. 178 | Notwithstanding the above, nothing herein shall supersede or modify 179 | the terms of any separate license agreement you may have executed 180 | with Licensor regarding such Contributions. 181 | 182 | 6. Trademarks. This License does not grant permission to use the trade 183 | names, trademarks, service marks, or product names of the Licensor, 184 | except as required for reasonable and customary use in describing the 185 | origin of the Work and reproducing the content of the NOTICE file. 186 | 187 | 7. Disclaimer of Warranty. Unless required by applicable law or 188 | agreed to in writing, Licensor provides the Work (and each 189 | Contributor provides its Contributions) on an "AS IS" BASIS, 190 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 191 | implied, including, without limitation, any warranties or conditions 192 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 193 | PARTICULAR PURPOSE. You are solely responsible for determining the 194 | appropriateness of using or redistributing the Work and assume any 195 | risks associated with Your exercise of permissions under this License. 196 | 197 | 8. Limitation of Liability. In no event and under no legal theory, 198 | whether in tort (including negligence), contract, or otherwise, 199 | unless required by applicable law (such as deliberate and grossly 200 | negligent acts) or agreed to in writing, shall any Contributor be 201 | liable to You for damages, including any direct, indirect, special, 202 | incidental, or consequential damages of any character arising as a 203 | result of this License or out of the use or inability to use the 204 | Work (including but not limited to damages for loss of goodwill, 205 | work stoppage, computer failure or malfunction, or any and all 206 | other commercial damages or losses), even if such Contributor 207 | has been advised of the possibility of such damages. 208 | 209 | 9. Accepting Warranty or Additional Liability. While redistributing 210 | the Work or Derivative Works thereof, You may choose to offer, 211 | and charge a fee for, acceptance of support, warranty, indemnity, 212 | or other liability obligations and/or rights consistent with this 213 | License. However, in accepting such obligations, You may act only 214 | on Your own behalf and on Your sole responsibility, not on behalf 215 | of any other Contributor, and only if You agree to indemnify, 216 | defend, and hold each Contributor harmless for any liability 217 | incurred by, or claims asserted against, such Contributor by reason 218 | of your accepting any such warranty or additional liability. 
219 | 220 | _____________________ 221 | NumPy Developers (numpy) 222 | 223 | Redistribution and use in source and binary forms, with or without 224 | modification, are permitted provided that the following conditions are 225 | met: 226 | 227 | * Redistributions of source code must retain the above copyright 228 | notice, this list of conditions and the following disclaimer. 229 | 230 | * Redistributions in binary form must reproduce the above 231 | copyright notice, this list of conditions and the following 232 | disclaimer in the documentation and/or other materials provided 233 | with the distribution. 234 | 235 | * Neither the name of the NumPy Developers nor the names of any 236 | contributors may be used to endorse or promote products derived 237 | from this software without specific prior written permission. 238 | 239 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 240 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 241 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 242 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 243 | OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 244 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 245 | LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 246 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 247 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 248 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 249 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 250 | 251 | _____________________ 252 | tqdm developers (tqdm) 253 | `tqdm` is a product of collaborative work. 254 | Unless otherwise stated, all authors (see commit logs) retain copyright 255 | for their respective work, and release the work under the MIT licence 256 | (text below). 257 | 258 | Exceptions or notable authors are listed below 259 | in reverse chronological order: 260 | 261 | * files: * 262 | MPLv2.0 2015-2021 (c) Casper da Costa-Luis 263 | [casperdcl](https://github.com/casperdcl). 264 | * files: tqdm/_tqdm.py 265 | MIT 2016 (c) [PR #96] on behalf of Google Inc. 266 | * files: tqdm/_tqdm.py setup.py README.rst MANIFEST.in .gitignore 267 | MIT 2013 (c) Noam Yorav-Raphael, original author. 268 | 269 | [PR #96]: https://github.com/tqdm/tqdm/pull/96 270 | 271 | 272 | Mozilla Public Licence (MPL) v. 2.0 - Exhibit A 273 | ----------------------------------------------- 274 | 275 | This Source Code Form is subject to the terms of the 276 | Mozilla Public License, v. 2.0. 277 | If a copy of the MPL was not distributed with this project, 278 | You can obtain one at https://mozilla.org/MPL/2.0/. 279 | 280 | 281 | MIT License (MIT) 282 | ----------------- 283 | 284 | Copyright (c) 2013 noamraph 285 | 286 | Permission is hereby granted, free of charge, to any person obtaining a copy of 287 | this software and associated documentation files (the "Software"), to deal in 288 | the Software without restriction, including without limitation the rights to 289 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 290 | the Software, and to permit persons to whom the Software is furnished to do so, 291 | subject to the following conditions: 292 | 293 | The above copyright notice and this permission notice shall be included in all 294 | copies or substantial portions of the Software. 
295 | 296 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 297 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 298 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 299 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 300 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 301 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 302 | 303 | _____________________ 304 | NLTK Team (NLTK) 305 | 1. Definitions. 306 | 307 | "License" shall mean the terms and conditions for use, reproduction, 308 | and distribution as defined by Sections 1 through 9 of this document. 309 | 310 | "Licensor" shall mean the copyright owner or entity authorized by 311 | the copyright owner that is granting the License. 312 | 313 | "Legal Entity" shall mean the union of the acting entity and all 314 | other entities that control, are controlled by, or are under common 315 | control with that entity. For the purposes of this definition, 316 | "control" means (i) the power, direct or indirect, to cause the 317 | direction or management of such entity, whether by contract or 318 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 319 | outstanding shares, or (iii) beneficial ownership of such entity. 320 | 321 | "You" (or "Your") shall mean an individual or Legal Entity 322 | exercising permissions granted by this License. 323 | 324 | "Source" form shall mean the preferred form for making modifications, 325 | including but not limited to software source code, documentation 326 | source, and configuration files. 327 | 328 | "Object" form shall mean any form resulting from mechanical 329 | transformation or translation of a Source form, including but 330 | not limited to compiled object code, generated documentation, 331 | and conversions to other media types. 332 | 333 | "Work" shall mean the work of authorship, whether in Source or 334 | Object form, made available under the License, as indicated by a 335 | copyright notice that is included in or attached to the work 336 | (an example is provided in the Appendix below). 337 | 338 | "Derivative Works" shall mean any work, whether in Source or Object 339 | form, that is based on (or derived from) the Work and for which the 340 | editorial revisions, annotations, elaborations, or other modifications 341 | represent, as a whole, an original work of authorship. For the purposes 342 | of this License, Derivative Works shall not include works that remain 343 | separable from, or merely link (or bind by name) to the interfaces of, 344 | the Work and Derivative Works thereof. 345 | 346 | "Contribution" shall mean any work of authorship, including 347 | the original version of the Work and any modifications or additions 348 | to that Work or Derivative Works thereof, that is intentionally 349 | submitted to Licensor for inclusion in the Work by the copyright owner 350 | or by an individual or Legal Entity authorized to submit on behalf of 351 | the copyright owner. 
For the purposes of this definition, "submitted" 352 | means any form of electronic, verbal, or written communication sent 353 | to the Licensor or its representatives, including but not limited to 354 | communication on electronic mailing lists, source code control systems, 355 | and issue tracking systems that are managed by, or on behalf of, the 356 | Licensor for the purpose of discussing and improving the Work, but 357 | excluding communication that is conspicuously marked or otherwise 358 | designated in writing by the copyright owner as "Not a Contribution." 359 | 360 | "Contributor" shall mean Licensor and any individual or Legal Entity 361 | on behalf of whom a Contribution has been received by Licensor and 362 | subsequently incorporated within the Work. 363 | 364 | 2. Grant of Copyright License. Subject to the terms and conditions of 365 | this License, each Contributor hereby grants to You a perpetual, 366 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 367 | copyright license to reproduce, prepare Derivative Works of, 368 | publicly display, publicly perform, sublicense, and distribute the 369 | Work and such Derivative Works in Source or Object form. 370 | 371 | 3. Grant of Patent License. Subject to the terms and conditions of 372 | this License, each Contributor hereby grants to You a perpetual, 373 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 374 | (except as stated in this section) patent license to make, have made, 375 | use, offer to sell, sell, import, and otherwise transfer the Work, 376 | where such license applies only to those patent claims licensable 377 | by such Contributor that are necessarily infringed by their 378 | Contribution(s) alone or by combination of their Contribution(s) 379 | with the Work to which such Contribution(s) was submitted. If You 380 | institute patent litigation against any entity (including a 381 | cross-claim or counterclaim in a lawsuit) alleging that the Work 382 | or a Contribution incorporated within the Work constitutes direct 383 | or contributory patent infringement, then any patent licenses 384 | granted to You under this License for that Work shall terminate 385 | as of the date such litigation is filed. 386 | 387 | 4. Redistribution. 
You may reproduce and distribute copies of the 388 | Work or Derivative Works thereof in any medium, with or without 389 | modifications, and in Source or Object form, provided that You 390 | meet the following conditions: 391 | 392 | (a) You must give any other recipients of the Work or 393 | Derivative Works a copy of this License; and 394 | 395 | (b) You must cause any modified files to carry prominent notices 396 | stating that You changed the files; and 397 | 398 | (c) You must retain, in the Source form of any Derivative Works 399 | that You distribute, all copyright, patent, trademark, and 400 | attribution notices from the Source form of the Work, 401 | excluding those notices that do not pertain to any part of 402 | the Derivative Works; and 403 | 404 | (d) If the Work includes a "NOTICE" text file as part of its 405 | distribution, then any Derivative Works that You distribute must 406 | include a readable copy of the attribution notices contained 407 | within such NOTICE file, excluding those notices that do not 408 | pertain to any part of the Derivative Works, in at least one 409 | of the following places: within a NOTICE text file distributed 410 | as part of the Derivative Works; within the Source form or 411 | documentation, if provided along with the Derivative Works; or, 412 | within a display generated by the Derivative Works, if and 413 | wherever such third-party notices normally appear. The contents 414 | of the NOTICE file are for informational purposes only and 415 | do not modify the License. You may add Your own attribution 416 | notices within Derivative Works that You distribute, alongside 417 | or as an addendum to the NOTICE text from the Work, provided 418 | that such additional attribution notices cannot be construed 419 | as modifying the License. 420 | 421 | You may add Your own copyright statement to Your modifications and 422 | may provide additional or different license terms and conditions 423 | for use, reproduction, or distribution of Your modifications, or 424 | for any such Derivative Works as a whole, provided Your use, 425 | reproduction, and distribution of the Work otherwise complies with 426 | the conditions stated in this License. 427 | 428 | 5. Submission of Contributions. Unless You explicitly state otherwise, 429 | any Contribution intentionally submitted for inclusion in the Work 430 | by You to the Licensor shall be under the terms and conditions of 431 | this License, without any additional terms or conditions. 432 | Notwithstanding the above, nothing herein shall supersede or modify 433 | the terms of any separate license agreement you may have executed 434 | with Licensor regarding such Contributions. 435 | 436 | 6. Trademarks. This License does not grant permission to use the trade 437 | names, trademarks, service marks, or product names of the Licensor, 438 | except as required for reasonable and customary use in describing the 439 | origin of the Work and reproducing the content of the NOTICE file. 440 | 441 | 7. Disclaimer of Warranty. Unless required by applicable law or 442 | agreed to in writing, Licensor provides the Work (and each 443 | Contributor provides its Contributions) on an "AS IS" BASIS, 444 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 445 | implied, including, without limitation, any warranties or conditions 446 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 447 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 448 | appropriateness of using or redistributing the Work and assume any 449 | risks associated with Your exercise of permissions under this License. 450 | 451 | 8. Limitation of Liability. In no event and under no legal theory, 452 | whether in tort (including negligence), contract, or otherwise, 453 | unless required by applicable law (such as deliberate and grossly 454 | negligent acts) or agreed to in writing, shall any Contributor be 455 | liable to You for damages, including any direct, indirect, special, 456 | incidental, or consequential damages of any character arising as a 457 | result of this License or out of the use or inability to use the 458 | Work (including but not limited to damages for loss of goodwill, 459 | work stoppage, computer failure or malfunction, or any and all 460 | other commercial damages or losses), even if such Contributor 461 | has been advised of the possibility of such damages. 462 | 463 | 9. Accepting Warranty or Additional Liability. While redistributing 464 | the Work or Derivative Works thereof, You may choose to offer, 465 | and charge a fee for, acceptance of support, warranty, indemnity, 466 | or other liability obligations and/or rights consistent with this 467 | License. However, in accepting such obligations, You may act only 468 | on Your own behalf and on Your sole responsibility, not on behalf 469 | of any other Contributor, and only if You agree to indemnify, 470 | defend, and hold each Contributor harmless for any liability 471 | incurred by, or claims asserted against, such Contributor by reason 472 | of your accepting any such warranty or additional liability. 473 | 474 | _____________________ 475 | 476 | Leonard Richardson (beautifulsoup4) 477 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 478 | 479 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 480 | 481 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 482 | _____________________ 483 | 484 | Brian McFee, librosa development team (librosa) 485 | Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. 486 | 487 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 488 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | opensource-conduct@group.apple.com. 65 | All complaints will be reviewed and investigated promptly and fairly. 
66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /CONTRIBUTING-RESEARCH.md: -------------------------------------------------------------------------------- 1 | # Contribution Guide 2 | 3 | Thanks for your interest in contributing. 
This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository. 4 | 5 | ## Before you get started 6 | 7 | We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md). 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (C) 2022 Apple Inc. All Rights Reserved. 2 | 3 | IMPORTANT: This Apple software is supplied to you by Apple 4 | Inc. ("Apple") in consideration of your agreement to the following 5 | terms, and your use, installation, modification or redistribution of 6 | this Apple software constitutes acceptance of these terms. If you do 7 | not agree with these terms, please do not use, install, modify or 8 | redistribute this Apple software. 9 | 10 | In consideration of your agreement to abide by the following terms, and 11 | subject to these terms, Apple grants you a personal, non-exclusive 12 | license, under Apple's copyrights in this original Apple software (the 13 | "Apple Software"), to use, reproduce, modify and redistribute the Apple 14 | Software, with or without modifications, in source and/or binary forms; 15 | provided that if you redistribute the Apple Software in its entirety and 16 | without modifications, you must retain this notice and the following 17 | text and disclaimers in all such redistributions of the Apple Software. 18 | Neither the name, trademarks, service marks or logos of Apple Inc. may 19 | be used to endorse or promote products derived from the Apple Software 20 | without specific prior written permission from Apple. Except as 21 | expressly stated in this notice, no other rights or licenses, express or 22 | implied, are granted by Apple herein, including but not limited to any 23 | patent rights that may be infringed by your derivative works or by other 24 | works in which the Apple Software may be incorporated. 25 | 26 | The Apple Software is provided by Apple on an "AS IS" basis. APPLE 27 | MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION 28 | THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS 29 | FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND 30 | OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS. 31 | 32 | IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL 33 | OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 34 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 35 | INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, 36 | MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED 37 | AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), 38 | STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE 39 | POSSIBILITY OF SUCH DAMAGE. 40 | 41 | ------------------------------------------------------------------------------- 42 | SOFTWARE DISTRIBUTED WITH CODE-SWITCHED-SPEECH-TRANSLATION: 43 | 44 | The CODE-SWITCHED-SPEECH-TRANSLATION software includes a number of subcomponents with separate 45 | copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
46 | ------------------------------------------------------------------------------- 47 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This repository contains the code and instructions needed to reproduce the dataset splits for ["End-to-End Speech Translation for Code Switched Speech"](LINK_TODO). 3 | 4 | You can create both datasets with the `bash create_datasets.sh` command, following the instructions in the [Instructions Section](#instructions). The `fisher` and `miami` directories contain the per-dataset scripts that `bash create_datasets.sh` uses. 5 | 6 | A mapping between the original data and the new code-switched and monolingual splits used in the paper can be found in `mapping_files`. Note that running `bash create_datasets.sh` will create these mappings. 7 | 8 | ## Instructions 9 | 0. Install the prerequisite libraries for Linux/macOS. These include `ffmpeg`, `sox`, `wget`, and `python` (e.g. `apt-get install sox`). 10 | 1. Run `pip install -r requirements.txt` to set up the Python environment. 11 | 2. Collect the data needed for the Fisher corpus ([LDC2010T04](https://catalog.ldc.upenn.edu/LDC2010T04) and [LDC2010S01](https://catalog.ldc.upenn.edu/LDC2010S01)) and export their paths: `export LDC2010S01={path_to_LDC2010S01}` and `export LDC2010T04={path_to_LDC2010T04}/fisher_spa_tr`. 12 | 3. Run `bash create_datasets.sh` to generate both the Miami and Fisher datasets. 13 | 14 | 15 | ## Example 16 | 17 | Example utterance: 18 | - (Audio clip) 19 | - Transcript (code-switched): *y ti bueno tiene dos papás **which can be a little can be a little challenging**.* 20 | - Translation (English only): *and she has two fathers which can be a little, can be a little challenging.* 21 | 22 | The data files are composed of three parts (a loading sketch follows the list): 23 | 1. The transcript for the dataset split (in `{dataset_name}.transcript`) 24 | 2. The translation for the dataset split (in `{dataset_name}.translation`) 25 | 3. The audio for the dataset split (in `{dataset_name}.yaml` and `{dataset_name}/clips/*.wav` or `{dataset_name}/clips.zip`) 26 |
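Once a split has been generated, the three files are parallel: line *i* of each describes the same utterance. Below is a minimal loading sketch (not part of the release scripts); the `output/fisher/eval/cs` path is just one example of a generated split directory, with `{dataset_name}` = `fisher`:

```python
import yaml  # PyYAML, installed via requirements.txt

split_dir = "output/fisher/eval/cs"  # any generated split directory

with open(f"{split_dir}/fisher.yaml") as f:
    clips = yaml.safe_load(f)  # one {"wav": ...} entry per utterance
with open(f"{split_dir}/fisher.transcript") as f:
    transcripts = [line.strip() for line in f]
with open(f"{split_dir}/fisher.translation") as f:
    translations = [line.strip() for line in f]

# the three files line up row-by-row
assert len(clips) == len(transcripts) == len(translations)
for clip, source, target in zip(clips, transcripts, translations):
    print(clip["wav"], "|", source, "->", target)
```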
27 | ## Citation 28 | If you found this repository helpful in your research, please consider citing 29 | ``` 30 | Orion Weller, Matthias Sperber, Telmo Pessoa Pires, Hendra Setiawan, Christian Gollan, Dominic Telaar, Matthias Paulik: End-to-End Speech Translation for Code Switched Speech (Findings of the Association for Computational Linguistics: ACL 2022) 31 | ``` 32 | -------------------------------------------------------------------------------- /create_datasets.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | cd miami 9 | bash setup_all.sh 10 | cd ../fisher 11 | bash setup_all.sh 12 | cd ../ 13 | -------------------------------------------------------------------------------- /fisher/README.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This directory contains all the scripts needed to download the Fisher translation data and preprocess the corpus for speech translation (the LDC audio must be obtained separately; see the root README). 3 | 4 | ## 1-Step Setup 5 | 0. Run `setup_all.sh` to download the data and process it. For granular instructions, see the `Multi-Step Setup` section below. 6 | 7 | ## Multi-Step Setup 8 | 0. See the comments in the `setup_all.sh` file for step-by-step instructions. 9 | 10 | 11 | ## Paper Reference 12 | The Fisher corpus is distributed in these LDC releases ([here](https://catalog.ldc.upenn.edu/LDC2010T04) and [here](https://catalog.ldc.upenn.edu/LDC2010S01)) and was published as part of [this paper](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher-corpus.pdf) -------------------------------------------------------------------------------- /fisher/combine_eval_splits.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # The Fisher data is already split into dev/dev2/test; 7 | # this script combines these three into one `eval` test set 8 | import os 9 | import yaml 10 | import shutil 11 | from distutils.dir_util import copy_tree 12 | 13 | DATASET_NAMES = ["cs", "mono"] 14 | SPLITS = ["dev", "dev2", "test"] 15 | 16 | # combine the evaluation sets 17 | eval_name = "eval" 18 | base_output_path = f"output/fisher/{eval_name}" 19 | 20 | all_eval = {"cs": None, "mono": None} 21 | for name in DATASET_NAMES: 22 | data_for_type = [[], [], []] 23 | for split in SPLITS: 24 | print(f"Loading the data for {name}, {split}...") 25 | base_path = f"output/fisher/{split}/{name}" 26 | transcript = [] 27 | translation = [] 28 | with open(f"{base_path}/fisher.yaml", "r") as fin: 29 | yaml_data = yaml.safe_load(fin) 30 | with open(f"{base_path}/fisher.transcript", "r") as fin: 31 | for line in fin: 32 | transcript.append(line.strip()) 33 | with open(f"{base_path}/fisher.translation", "r") as fin: 34 | for line in fin: 35 | translation.append(line.strip()) 36 | assert len(transcript) == len(yaml_data) == len(translation) 37 | print(f"Length of the original data is {len(transcript)}") 38 | 39 | data_for_type[0].extend(yaml_data) 40 | data_for_type[1].extend(transcript) 41 | data_for_type[2].extend(translation) 42 | 43 | all_eval[name] = data_for_type 44 | 45 | 46 | print("Writing the combined data out...") 47 | for (name, datasets) in zip(DATASET_NAMES, [all_eval["cs"], all_eval["mono"]]): 48 | print(f"Length of the data {name} is {len(datasets[0])}") 49 | 50 | if not os.path.isdir(os.path.join(base_output_path, name, "clips")): 51 | os.makedirs(os.path.join(base_output_path, name, "clips")) 52 | 53 | with open(os.path.join(base_output_path, name, "fisher.yaml"), "w") as fout: 54 | fout.write(yaml.dump(datasets[0])) 55 | with open(os.path.join(base_output_path, name, "fisher.transcript"), "w") as fout: 56 | for line in datasets[1]: 57 | assert "\n" not in line, line 58 | fout.write(line) 59 | fout.write("\n") 60 | with open(os.path.join(base_output_path, name, "fisher.translation"), "w") as fout: 61 | for line in datasets[2]: 62 | assert "\n" not in line, line 63 | fout.write(line) 64 | fout.write("\n") 65 | 66 | print("Moving clip data...") 67 | for eval_split in SPLITS: 68 | copy_tree( 69 | os.path.join(base_output_path.replace("eval", eval_split), name, "clips"), 70 | os.path.join(base_output_path, name, "clips"), 71 | ) 72 | 73 | # make it a zip file 74 | shutil.make_archive( 75 | os.path.join(base_output_path, name, "clips"), 76 | "zip", 77 | os.path.join(base_output_path, name, "clips"), 78 | ) 79 | -------------------------------------------------------------------------------- /fisher/extract-utterance-audios.py: -------------------------------------------------------------------------------- 1 | #!
/usr/bin/env python3 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | # this file extracts the utterance from the larger audio file given the mapping 9 | 10 | import sys 11 | import os 12 | import subprocess 13 | 14 | if len(sys.argv) != 3: 15 | print("Usage: %s <utterance-map-file> <source-audio-dir>" % sys.argv[0]) 16 | sys.exit(1) 17 | srcAudioDir=sys.argv[2] 18 | 19 | utterance = None 20 | mapping = {}  # (utterance id, line number) -> raw token line, read from stdin 21 | for line in sys.stdin: 22 | if line.startswith('##'): 23 | utterance = line.strip().split(' ')[2] 24 | lineno = 1 25 | else: 26 | mapping[(utterance,repr(lineno))] = line.strip() 27 | lineno += 1 28 | 29 | for lineno, line in enumerate(open(sys.argv[1])): 30 | utterances, ids = line.split() 31 | output = " ".join(mapping[(utterances,x)] for x in ids.split('_')) 32 | uttList=[mapping[(utterances,x)] for x in ids.split('_')] 33 | firstToks=uttList[0].split('+')  # token fields: audio-file+channel+start+end+speaker 34 | firstToks[4] = firstToks[4].replace(' ', '~') 35 | uttStart=float(firstToks[2]) 36 | uttDur=float(uttList[-1].split('+')[3])-uttStart  # last token's end time minus first token's start 37 | audioName="%s-utt%06d" % (os.path.basename(sys.argv[1]), lineno+1) 38 | uttID="%s-%s-c%s-%s" % (audioName, firstToks[0], firstToks[1], firstToks[4]) 39 | spkID="%s-c%s-%s" % (firstToks[0], firstToks[1], firstToks[4]) 40 | wavFilename=os.path.join(os.path.basename(sys.argv[1]), os.path.join(firstToks[0][:-4], audioName)) 41 | print(uttID, wavFilename, spkID, lineno+1, output, uttStart, uttDur) # used in the `prepare-sets.sh` bash script 42 | directory = os.path.dirname(wavFilename) 43 | try: 44 | os.stat(directory) 45 | except OSError: 46 | os.makedirs(directory) 47 | cmd="/usr/bin/sox %s -c 1 --encoding signed-integer %s.wav remix %d trim %f %f rate 16000" % (os.path.join(srcAudioDir, firstToks[0]), wavFilename, int(firstToks[1])+1, uttStart, uttDur) 48 | print(uttID, repr(subprocess.check_output(cmd.split(" "))), file=sys.stderr) 49 | -------------------------------------------------------------------------------- /fisher/extract_cs_words_from_raw_data.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved.
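#
# Context, inferred from the code below rather than stated in the original: the
# Spanish `.es` transcripts mark code-switched English spans with tags of the
# form <foreign lang="English"> ... </foreign>, and fix_small_errors() repairs
# small typos in that markup before the files are parsed with BeautifulSoup.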
4 | # 5 | 6 | # this file takes the raw Fisher data with the code-switched annotations and processes it 7 | import glob 8 | import os 9 | from bs4 import BeautifulSoup 10 | import numpy as np 11 | 12 | 13 | def rawcount(filename): 14 | """Count lines by scanning raw bytes (fast for large files).""" 15 | lines = 0 16 | buf_size = 1024 * 1024 17 | with open(filename, "rb") as f: 18 | read_f = f.raw.read 19 | buf = read_f(buf_size) 20 | while buf: 21 | lines += buf.count(b"\n") 22 | buf = read_f(buf_size) 23 | 24 | return lines 25 | 26 | 27 | def fix_small_errors(line) -> str: 28 | """The data has some small errors to fix""" 29 | if 'lang+"English"' in line: 30 | line = line.replace('lang+"English"', 'lang="English"') 31 | if 'lan="English"' in line: 32 | line = line.replace('lan="English"', 'lang="English"') 33 | if " /foreign>" in line: 34 | line = line.replace(" /foreign>", "") 35 | if ' meeting ' in line: 36 | line = line.replace( 37 | ' meeting ', 38 | ' meeting ', 39 | ) 40 | 41 | return line 42 | 43 | 44 | # go through all Spanish files; the English files don't have any markup since they 45 | # were generated from AMT and kept the raw text 46 | file_info = {} 47 | for file_path in glob.glob("fisher-callhome-corpus-tags/corpus/ldc/fisher_*.es"): 48 | file_name = file_path.split("/")[-1] 49 | cs_info = [] 50 | file_info[file_name] = { 51 | "line_count": rawcount(file_path), 52 | } 53 | 54 | tokens_per_line = [] 55 | with open(file_path, "r") as fin: 56 | for line_idx, line in enumerate(fin): 57 | line = line.strip() # remove newline 58 | if ( 59 | " ${C}/ids \ 28 | 2>${C}.prepare-audio.log 29 | } 30 | 31 | for SET in fisher_{train,dev,dev2,test}; do 32 | process_audio ${SET} 33 | done 34 | 35 | # make YAML audio mapping 36 | for convname in fisher_{train,dev,dev2,test}/*fsp; do 37 | for filename in $convname/*.wav; do 38 | echo "- { wav: $filename }" >> $(dirname $convname).yaml 39 | done 40 | done 41 | -------------------------------------------------------------------------------- /fisher/setup_all.sh: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved.
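#
# Pipeline overview (summarizing the commented steps below):
#   1. clone the tag-annotated fisher-callhome-corpus and extract the code-switched words
#   2. clone the clean corpus (keep_tags branch) and copy its train/dev/dev2/test splits
#   3. prepare-sets.sh cuts per-utterance 16 kHz audio with sox and writes the YAML clip lists
#   4. build the CS/monolingual splits, the combined `eval` set, the LID labels, and the mapping files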
4 | # 5 | 6 | # get the Fisher data with CS tags 7 | git clone https://github.com/orionw/fisher-callhome-corpus.git 8 | mv fisher-callhome-corpus fisher-callhome-corpus-tags 9 | cd fisher-callhome-corpus-tags 10 | make 11 | cd ../ 12 | python extract_cs_words_from_raw_data.py # makes indexes of CS data and keeps the CS words 13 | 14 | # make the clean data without the CS tags to use 15 | git clone -b keep_tags https://github.com/orionw/fisher-callhome-corpus.git 16 | cd fisher-callhome-corpus 17 | make 18 | cp corpus/ldc/fisher_dev.{en,es}* ../splits_data/dev/ 19 | cp corpus/ldc/fisher_train.{en,es}* ../splits_data/train/ 20 | cp corpus/ldc/fisher_dev2.{en,es}* ../splits_data/dev2/ 21 | cp corpus/ldc/fisher_test.{en,es}* ../splits_data/test/ 22 | cd ../ 23 | 24 | # prepare the speech data (process to 16K, match to the other data lines) 25 | bash prepare-sets.sh 26 | cp fisher_train.yaml splits_data/train/ 27 | cp fisher_test.yaml splits_data/test/ 28 | cp fisher_dev.yaml splits_data/dev/ 29 | cp fisher_dev2.yaml splits_data/dev2/ 30 | mkdir speech 31 | mv fisher_train speech 32 | mv fisher_dev speech 33 | mv fisher_dev2 speech 34 | mv fisher_test speech 35 | 36 | # make the CS and Monolingual splits 37 | sed -i "s/\r//g" splits_data/*/* # something adds extra carriage returns 38 | python make_cs_splits.py 39 | # make the `eval` set consisting of dev dev2 test 40 | python combine_eval_splits.py 41 | python split_train_and_make_lid.py # split into training and dev CS sets and determine the LID 42 | python make_mapping_files.py # if you want the mapping files, optional 43 | -------------------------------------------------------------------------------- /fisher/split_train_and_make_lid.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 
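#
# LID label convention, as implemented below: an utterance gets label 0
# (English) when more than half of its words are annotated code-switched
# (English) words, label 1 (Spanish) when fewer than half are, and exact
# 50/50 ties are broken uniformly at random (hence the fixed seed below).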
4 | # 5 | 6 | # This file creates the LID labels for training/dev as well as splitting the training CS set into dev/train 7 | import os 8 | import random 9 | import shutil 10 | import yaml 11 | import string 12 | import numpy as np 13 | 14 | random.seed(1) 15 | 16 | 17 | def create_and_save_labels_for_cs_train_data( 18 | transcript, transcript_train, cs_words, output_path, desc, name 19 | ) -> tuple: 20 | """Only used for fisher_train_cs to save train and dev set labels""" 21 | labels1 = [] 22 | labels2 = [] 23 | cs_words1, cs_words2 = cs_words 24 | for idx, instance in enumerate(transcript): 25 | words = instance.translate(str.maketrans("", "", string.punctuation)) 26 | cs = cs_words1[idx].translate(str.maketrans("", "", string.punctuation)) 27 | 28 | cs_count = len(cs.strip().split(" ")) 29 | all_count = len(words.strip().split(" ")) 30 | if cs_count / all_count > 0.5: 31 | labels1.append(0) # english 32 | elif cs_count / all_count < 0.5: 33 | labels1.append(1) # spanish 34 | else: 35 | labels1.append(int(random.random() > 0.5)) 36 | 37 | for idx, instance in enumerate(transcript_train): 38 | words = instance.translate(str.maketrans("", "", string.punctuation)) 39 | cs = cs_words2[idx].translate(str.maketrans("", "", string.punctuation)) 40 | 41 | cs_count = len(cs.strip().split(" ")) 42 | all_count = len(words.strip().split(" ")) 43 | if cs_count / all_count > 0.5: 44 | labels2.append(0) # english 45 | elif cs_count / all_count < 0.5: 46 | labels2.append(1) # spanish 47 | else: 48 | labels2.append(int(random.random() > 0.5)) 49 | 50 | print( 51 | f"Averages: labels1={np.array(labels1).mean()} labels2={np.array(labels2).mean()}" 52 | ) 53 | 54 | if not os.path.isdir(os.path.join(output_path, desc + "_dev")): 55 | os.makedirs(os.path.join(output_path, desc + "_dev")) 56 | if not os.path.isdir(os.path.join(output_path, desc + "_train")): 57 | os.makedirs(os.path.join(output_path, desc + "_train")) 58 | 59 | with open(os.path.join(output_path, desc + "_dev", "lid_labels.txt"), "w") as fout: 60 | for label in labels1: 61 | fout.write(str(label)) 62 | fout.write("\n") 63 | 64 | with open( 65 | os.path.join(output_path, desc + "_train", "lid_labels.txt"), "w" 66 | ) as fout: 67 | for label in labels2: 68 | fout.write(str(label)) 69 | fout.write("\n") 70 | 71 | return labels1, labels2 72 | 73 | 74 | def write_out_data( 75 | yaml_data, transcript, translation, base_path, output_path, desc, name 76 | ): 77 | """A helper function for writing out all the data""" 78 | if not os.path.isdir(os.path.join(output_path, desc, "clips")): 79 | os.makedirs(os.path.join(output_path, desc, "clips")) 80 | 81 | with open(os.path.join(output_path, desc, f"{name}.yaml"), "w") as fout: 82 | fout.write(yaml.dump(yaml_data)) 83 | with open(os.path.join(output_path, desc, f"{name}.transcript"), "w") as fout: 84 | for line in transcript: 85 | assert "\n" not in line, line 86 | fout.write(line) 87 | fout.write("\n") 88 | with open(os.path.join(output_path, desc, f"{name}.translation"), "w") as fout: 89 | for line in translation: 90 | assert "\n" not in line, line 91 | fout.write(line) 92 | fout.write("\n") 93 | 94 | for instance in yaml_data: 95 | audio_path = instance["wav"] 96 | shutil.copy( 97 | os.path.join(base_path, audio_path), 98 | os.path.join(output_path, desc, "clips", audio_path.split("/")[-1]), 99 | ) 100 | 101 | # make it a zip file 102 | shutil.make_archive( 103 | os.path.join(output_path, desc, "clips"), 104 | "zip", 105 | os.path.join(output_path, desc,
"clips"), 106 | ) 107 | 108 | 109 | def sample_yaml_data( 110 | yaml_data, transcript, translation, num_idxs_to_sample, return_both: bool = False, should_write_out: bool = False 111 | ): 112 | """A helper function for sampling from the data""" 113 | split_idx = np.array( 114 | random.sample(list(range(len(yaml_data))), k=num_idxs_to_sample) 115 | ) 116 | if should_write_out: 117 | with open("train_vs_dev_cs.txt", "w") as fout: 118 | for line in split_idx.tolist(): 119 | fout.write(str(line)) 120 | fout.write("\n") 121 | bool_split = np.isin(np.arange(len(transcript)), split_idx) 122 | transcript1 = np.array(transcript)[bool_split].tolist() 123 | translation1 = np.array(translation)[bool_split].tolist() 124 | yaml_data1 = np.array(yaml_data)[bool_split].tolist() 125 | if return_both: 126 | cs_words = [] 127 | with open("cs_corpus/fisher_train_cs_words_cs_only.es", "r") as fin: 128 | for line in fin: 129 | cs_words.append(line.strip()) 130 | assert len(cs_words) == len(yaml_data) 131 | cs_words1 = np.array(cs_words)[bool_split].tolist() 132 | cs_words2 = np.array(cs_words)[~bool_split].tolist() 133 | 134 | transcript2 = np.array(transcript)[~bool_split].tolist() 135 | translation2 = np.array(translation)[~bool_split].tolist() 136 | yaml_data2 = np.array(yaml_data)[~bool_split].tolist() 137 | return ( 138 | yaml_data1, 139 | transcript1, 140 | translation1, 141 | yaml_data2, 142 | transcript2, 143 | translation2, 144 | (cs_words1, cs_words2), 145 | ) 146 | else: 147 | return yaml_data1, transcript1, translation1 148 | 149 | 150 | def create_and_save_cs_labels_only(yaml_data, transcript, translation): 151 | """A function that only creates the LID labels and saves them (only used for Fisher Eval CS)""" 152 | assert len(yaml_data) == len(transcript) == len(translation) 153 | 154 | cs_words_list = [] 155 | for file_type in ["dev", "dev2", "test"]: 156 | with open(f"cs_corpus/fisher_{file_type}_cs_words_cs_only.es", "r") as fin: 157 | for line in fin: 158 | cs_words_list.append(line.strip()) 159 | 160 | assert len(cs_words_list) == len( 161 | yaml_data 162 | ), f"CS words: {len(cs_words_list)} len_data={len(yaml_data)}" 163 | 164 | labels = [] 165 | for idx, instance in enumerate(transcript): 166 | transcript_str = instance.translate(str.maketrans("", "", string.punctuation)) 167 | cs_words = cs_words_list[idx].translate( 168 | str.maketrans("", "", string.punctuation) 169 | ) 170 | 171 | cs_count = len(cs_words.strip().split(" ")) 172 | all_count = len(transcript_str.strip().split(" ")) 173 | if cs_count / all_count > 0.5: 174 | labels.append(0) # english 175 | elif cs_count / all_count < 0.5: 176 | labels.append(1) # spanish 177 | else: 178 | labels.append(int(random.random() > 0.5)) 179 | 180 | assert len(labels) == len(yaml_data) 181 | 182 | with open(os.path.join("output/fisher/eval/cs/fisher.labels"), "w") as fout: 183 | for label in labels: 184 | fout.write(str(label)) 185 | fout.write("\n") 186 | 187 | 188 | def gather_lid_data(): 189 | output_path = "output/lid" 190 | num_idxs_to_sample = None 191 | if not os.path.isdir(output_path): 192 | os.makedirs(output_path) 193 | 194 | data_paths = [ 195 | ("fisher_eval_cs", "fisher", "output/fisher/eval/cs"), 196 | ("fisher_train_cs", "fisher", "output/fisher/train/cs"), 197 | # for the monolingual ones, sample some of them for use in LID training 198 | ("fisher_train_mono", "fisher", "output/fisher/train/mono"), 199 | ("miami_train_mono", "miami", "../miami/output/miami/mono_train"), 200 | ] 201 | for (desc, name, base_path) in data_paths: 202 | 
print(f"Working on {desc}") 203 | transcript = [] 204 | translation = [] 205 | with open(f"{base_path}/{name}.yaml", "r") as fin: 206 | yaml_data = yaml.safe_load(fin) 207 | with open(f"{base_path}/{name}.transcript", "r") as fin: 208 | for line in fin: 209 | transcript.append(line.strip()) 210 | with open(f"{base_path}/{name}.translation", "r") as fin: 211 | for line in fin: 212 | translation.append(line.strip()) 213 | assert len(transcript) == len(yaml_data) == len(translation) 214 | 215 | if desc == "fisher_eval_cs": 216 | create_and_save_cs_labels_only(yaml_data, transcript, translation) 217 | elif desc == "fisher_train_cs": 218 | # need to split this into train and dev, then save 219 | yaml_data, transcript, translation, yaml_data_train, transcript_train, translation_train, cs_words = sample_yaml_data(yaml_data, transcript, translation, 220 | int(0.1 * len(yaml_data)), return_both=True, 221 | should_write_out=True) 222 | print(f"Length of the data {base_path}/{name + '_dev'} is {len(yaml_data)}") 223 | create_and_save_labels_for_cs_train_data(transcript, transcript_train, cs_words, output_path, desc, name) 224 | write_out_data(yaml_data, transcript, translation, base_path, output_path, desc + "_dev", name) 225 | 226 | print(f"Length of the data {base_path}/{name + '_train'} is {len(yaml_data_train)}") 227 | write_out_data(yaml_data_train, transcript_train, translation_train, base_path, output_path, desc + "_train", name) 228 | num_idxs_to_sample = len(yaml_data_train) # make fisher cs the base 229 | else: # is monolingual 230 | yaml_data, transcript, translation = sample_yaml_data(yaml_data, transcript, translation, min(len(yaml_data), num_idxs_to_sample)) 231 | print(f"Length of the data {base_path}/{name} is {len(yaml_data)}") 232 | write_out_data(yaml_data, transcript, translation, base_path, output_path, desc, name) 233 | 234 | 235 | if __name__ == "__main__": 236 | gather_lid_data() 237 | -------------------------------------------------------------------------------- /fisher/splits_data/README.md: -------------------------------------------------------------------------------- 1 | These folders are used to hold the initial Fisher files that are later used for processing. 
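2 | 
3 | For example, once you have a checkout of `fisher-callhome-corpus` (see the per-split READMEs below), the split files can be dropped into these folders with a short sketch like the following, run from this `splits_data` directory. The corpus location is an assumption; adjust it to wherever your checkout actually lives.
4 | 
5 | ```python
6 | # a minimal sketch for populating these folders; the corpus path below
7 | # is an assumption, so adjust it to your own checkout
8 | import glob
9 | import shutil
10 | 
11 | CORPUS = "../../fisher-callhome-corpus/corpus/ldc"  # hypothetical location
12 | for split in ["dev", "dev2", "test", "train"]:
13 |     # copy e.g. fisher_dev.en.0 ... fisher_dev.yaml into ./dev/
14 |     for path in glob.glob(f"{CORPUS}/fisher_{split}.*"):
15 |         shutil.copy(path, split)
16 | ```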
-------------------------------------------------------------------------------- /fisher/splits_data/dev/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_dev.en.0` 4 | - `fisher_dev.en.1` 5 | - `fisher_dev.en.2` 6 | - `fisher_dev.en.3` 7 | - `fisher_dev.es` 8 | - `fisher_dev.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/dev2/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_dev2.en.0` 4 | - `fisher_dev2.en.1` 5 | - `fisher_dev2.en.2` 6 | - `fisher_dev2.en.3` 7 | - `fisher_dev2.es` 8 | - `fisher_dev2.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/test/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_test.en.0` 4 | - `fisher_test.en.1` 5 | - `fisher_test.en.2` 6 | - `fisher_test.en.3` 7 | - `fisher_test.es` 8 | - `fisher_test.yaml` -------------------------------------------------------------------------------- /fisher/splits_data/train/README.md: -------------------------------------------------------------------------------- 1 | # Contents 2 | This folder should contain these files gathered from `fisher-callhome-corpus` (see `../../README.md`): 3 | - `fisher_train.en` 4 | - `fisher_train.es` 5 | - `fisher_train.yaml` -------------------------------------------------------------------------------- /mapping_files/README.md: -------------------------------------------------------------------------------- 1 | # Mapping Files 2 | We also provide mapping files for both Fisher and Miami that map the instances in each dataset to their respective splits in our data. `n/a` values in the `split` columns mean that the instance was not used in our data (e.g., due to a missing translation). For the Fisher corpus, the `audio_file` column refers to the audio file mappings in `LDC2010T04`, where the duration of each audio file can be found (not included here due to licensing). 3 | 4 | These files may be useful if you are looking to make additional modifications beyond what this repository provides.
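5 | 
6 | For example, here is a quick sketch of inspecting the Miami mapping with pandas (the column names below follow `miami/create_test_sets.py`, which generates the file):
7 | 
8 | ```python
9 | import pandas as pd
10 | 
11 | mapping = pd.read_csv("miami_mapping.csv")
12 | used = mapping[mapping["split"] != "n/a"]  # drop instances not used in our data
13 | print(used.groupby(["split", "cs_type"]).size())  # counts per split and CS type
14 | ```
15 | 
16 | The Fisher mapping file can be inspected the same way, substituting its `audio_file` column where needed.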
-------------------------------------------------------------------------------- /miami/common_words/eng.txt: -------------------------------------------------------------------------------- 1 | the 2 | and 3 | to 4 | of 5 | a 6 | in 7 | is 8 | that 9 | for 10 | I 11 | you 12 | it 13 | with 14 | on 15 | as 16 | are 17 | be 18 | this 19 | was 20 | have 21 | or 22 | at 23 | not 24 | your 25 | from 26 | we 27 | by 28 | will 29 | can 30 | but 31 | they 32 | an 33 | he 34 | all 35 | has 36 | if 37 | their 38 | one 39 | do 40 | more 41 | n't 42 | my 43 | his 44 | so 45 | there 46 | about 47 | which 48 | when 49 | what 50 | out 51 | up 52 | our 53 | who 54 | also 55 | had 56 | time 57 | some 58 | would 59 | were 60 | like 61 | been 62 | just 63 | her 64 | new 65 | other 66 | them 67 | she 68 | people 69 | these 70 | no 71 | get 72 | how 73 | me 74 | into 75 | than 76 | only 77 | its 78 | most 79 | may 80 | any 81 | many 82 | make 83 | then 84 | well 85 | first 86 | very 87 | over 88 | now 89 | could 90 | after 91 | even 92 | because 93 | us 94 | said 95 | good 96 | way 97 | two 98 | should 99 | work 100 | use 101 | through 102 | see 103 | know 104 | did 105 | much 106 | where 107 | years 108 | need 109 | him 110 | back 111 | such 112 | those 113 | being 114 | day 115 | take 116 | while 117 | here 118 | before 119 | does 120 | great 121 | year 122 | go 123 | help 124 | want 125 | really 126 | think 127 | best 128 | life 129 | each 130 | made 131 | right 132 | world 133 | business 134 | home 135 | own 136 | down 137 | still 138 | used 139 | find 140 | around 141 | going 142 | every 143 | both 144 | last 145 | off 146 | too 147 | same 148 | information 149 | little 150 | another 151 | look 152 | few 153 | long 154 | part 155 | since 156 | things 157 | place 158 | am 159 | between 160 | during 161 | different 162 | must 163 | come 164 | using 165 | however 166 | without 167 | high 168 | why 169 | something 170 | online 171 | system 172 | better 173 | three 174 | never 175 | always 176 | love 177 | say 178 | might 179 | next 180 | company 181 | state 182 | number 183 | again 184 | free 185 | lot 186 | under 187 | family 188 | found 189 | within 190 | give 191 | set 192 | school 193 | important 194 | water 195 | able 196 | keep 197 | got 198 | sure 199 | end 200 | money 201 | service 202 | small 203 | put 204 | experience 205 | having 206 | once 207 | available 208 | health 209 | support 210 | often 211 | including 212 | days 213 | away 214 | old 215 | area 216 | feel 217 | read 218 | show 219 | big 220 | against 221 | thing 222 | order 223 | program 224 | though 225 | city 226 | group 227 | services 228 | site 229 | making 230 | course 231 | point 232 | children 233 | times 234 | team 235 | game 236 | along 237 | let 238 | house 239 | today 240 | body 241 | working 242 | case 243 | man 244 | real 245 | provide 246 | care 247 | public 248 | top 249 | looking 250 | several 251 | start 252 | less 253 | process 254 | become 255 | actually 256 | local 257 | together 258 | person 259 | change 260 | book 261 | enough 262 | getting 263 | week 264 | power 265 | until 266 | market 267 | fact 268 | god 269 | food 270 | students 271 | full 272 | women 273 | community 274 | name 275 | second 276 | data 277 | government 278 | says 279 | others 280 | ever 281 | yet 282 | research 283 | done 284 | left 285 | far 286 | large 287 | called 288 | doing 289 | already 290 | development 291 | social 292 | open 293 | possible 294 | side 295 | play 296 | means 297 | needs 298 | try 299 | came 300 | ca 301 | based 302 | hard 
303 | thought 304 | products 305 | national 306 | quality 307 | level 308 | live 309 | design 310 | makes 311 | project 312 | line 313 | night 314 | least 315 | whether 316 | job 317 | car 318 | example 319 | include 320 | following 321 | given 322 | website 323 | past 324 | plan 325 | offer 326 | buy 327 | call 328 | went 329 | simply 330 | hand 331 | music 332 | easy 333 | problem 334 | men 335 | country 336 | took 337 | four 338 | members 339 | form 340 | personal 341 | control 342 | energy 343 | room 344 | head 345 | pay 346 | create 347 | run 348 | kind 349 | credit 350 | almost 351 | believe 352 | quite 353 | mind 354 | law 355 | early 356 | comes 357 | states 358 | usually 359 | companies 360 | web 361 | taking 362 | started 363 | later 364 | although 365 | story 366 | per 367 | future 368 | known 369 | someone 370 | across 371 | rather 372 | young 373 | whole 374 | special 375 | everything 376 | months 377 | anything 378 | training 379 | url 380 | bit 381 | seen 382 | product 383 | american 384 | please 385 | management 386 | cost 387 | either 388 | light 389 | university 390 | face 391 | due 392 | nothing 393 | human 394 | event 395 | history 396 | probably 397 | friends 398 | learn 399 | current 400 | tell 401 | general 402 | price 403 | list 404 | type 405 | building 406 | industry 407 | bad 408 | check 409 | everyone 410 | office 411 | idea 412 | internet 413 | news 414 | million 415 | video 416 | among 417 | air 418 | especially 419 | told 420 | results 421 | post 422 | hours 423 | international 424 | center 425 | understand 426 | above 427 | addition 428 | major 429 | education 430 | white 431 | particular 432 | problems 433 | media 434 | according 435 | upon 436 | page 437 | continue 438 | black 439 | study 440 | issues 441 | inside 442 | technology 443 | five 444 | value 445 | further 446 | access 447 | reason 448 | short 449 | TRUE 450 | simple 451 | natural 452 | amount 453 | search 454 | result 455 | taken 456 | main 457 | heart 458 | space 459 | financial 460 | ago 461 | trying 462 | question 463 | living 464 | likely 465 | interest 466 | various 467 | insurance 468 | common 469 | move 470 | child 471 | yourself 472 | report 473 | certain 474 | share 475 | single 476 | close 477 | instead 478 | bring 479 | works 480 | age 481 | s 482 | season 483 | hope 484 | coming 485 | areas 486 | ask 487 | medical 488 | low 489 | games 490 | turn 491 | key 492 | party 493 | add 494 | month 495 | seems 496 | view 497 | fun 498 | matter 499 | words 500 | needed -------------------------------------------------------------------------------- /miami/common_words/spa.txt: -------------------------------------------------------------------------------- 1 | de 2 | la 3 | que 4 | el 5 | y 6 | en 7 | a 8 | los 9 | del 10 | se 11 | las 12 | por 13 | un 14 | con 15 | para 16 | no 17 | una 18 | es 19 | su 20 | al 21 | lo 22 | como 23 | más 24 | o 25 | este 26 | pero 27 | sus 28 | esta 29 | si 30 | ha 31 | me 32 | ya 33 | le 34 | son 35 | sobre 36 | entre 37 | ser 38 | fue 39 | sin 40 | todo 41 | también 42 | desde 43 | cuando 44 | muy 45 | años 46 | está 47 | todos 48 | hay 49 | tiene 50 | nos 51 | porque 52 | dos 53 | hasta 54 | donde 55 | parte 56 | así 57 | han 58 | puede 59 | mi 60 | año 61 | cada 62 | uno 63 | vez 64 | bien 65 | hace 66 | trabajo 67 | nacional 68 | estado 69 | otros 70 | gobierno 71 | eso 72 | tiempo 73 | además 74 | mismo 75 | ese 76 | hacer 77 | país 78 | yo 79 | durante 80 | te 81 | día 82 | tanto 83 | vida 84 | esto 85 | forma 86 | estos 87 | sólo 88 | personas 89 | ni
90 | otro 91 | ahora 92 | hoy 93 | era 94 | caso 95 | están 96 | les 97 | mejor 98 | lugar 99 | qué 100 | quien 101 | cual 102 | esa 103 | ciudad 104 | general 105 | mundo 106 | siempre 107 | menos 108 | desarrollo 109 | contra 110 | cuenta 111 | tres 112 | ver 113 | más 114 | mayor 115 | otra 116 | mucho 117 | dijo 118 | tienen 119 | sido 120 | presidente 121 | ante 122 | según 123 | tener 124 | primera 125 | sea 126 | debe 127 | después 128 | aunque 129 | ley 130 | sistema 131 | manera 132 | solo 133 | poder 134 | nuevo 135 | ellos 136 | todas 137 | social 138 | información 139 | momento 140 | sino 141 | nuestro 142 | otras 143 | antes 144 | luego 145 | estas 146 | tu 147 | algo 148 | había 149 | días 150 | nuestra 151 | primer 152 | nada 153 | hecho 154 | poco 155 | pueden 156 | proyecto 157 | será 158 | va 159 | grupo 160 | fueron 161 | través 162 | algunos 163 | tan 164 | tipo 165 | medio 166 | gente 167 | decir 168 | equipo 169 | nueva 170 | importante 171 | san 172 | toda 173 | mientras 174 | pues 175 | centro 176 | acuerdo 177 | programa 178 | salud 179 | pasado 180 | empresa 181 | muchos 182 | fin 183 | dentro 184 | nivel 185 | partido 186 | servicios 187 | casa 188 | educación 189 | servicio 190 | seguridad 191 | proceso 192 | horas 193 | él 194 | política 195 | tal 196 | artículo 197 | universidad 198 | historia 199 | cosas 200 | cualquier 201 | sí 202 | unos 203 | hacia 204 | misma 205 | estar 206 | ello 207 | tema 208 | cómo 209 | empresas 210 | gracias 211 | calidad 212 | quienes 213 | embargo 214 | público 215 | frente 216 | agua 217 | situación 218 | ella 219 | sociedad 220 | creo 221 | nosotros 222 | final 223 | muchas 224 | méxico 225 | derecho 226 | zona 227 | argentina 228 | bajo 229 | estamos 230 | respecto 231 | entonces 232 | sector 233 | ejemplo 234 | estaba 235 | tras 236 | semana 237 | personal 238 | casi 239 | tenemos 240 | recursos 241 | diferentes 242 | dice 243 | veces 244 | punto 245 | estados 246 | uso 247 | actividades 248 | partir 249 | haber 250 | dar 251 | relación 252 | internacional 253 | número 254 | meses 255 | niños 256 | parece 257 | aún 258 | derechos 259 | datos 260 | aquí 261 | grandes 262 | nunca 263 | problemas 264 | mercado 265 | países 266 | cambio 267 | nombre 268 | he 269 | persona 270 | nuestros 271 | segundo 272 | hizo 273 | sentido 274 | cuatro 275 | fecha 276 | da 277 | posible 278 | comunidad 279 | mujeres 280 | lado 281 | obra 282 | familia 283 | junto 284 | director 285 | problema 286 | condiciones 287 | total 288 | actividad 289 | falta 290 | buena 291 | tengo 292 | investigación 293 | algunas 294 | bueno 295 | españa 296 | productos 297 | producción 298 | último 299 | presente 300 | casos 301 | comisión 302 | pública 303 | fuera 304 | igual 305 | atención 306 | van 307 | realidad 308 | objetivo 309 | estudio 310 | mediante 311 | control 312 | verdad 313 | provincia 314 | puntos 315 | pueblo 316 | buenos 317 | sociales 318 | hemos 319 | experiencia 320 | apoyo 321 | hombre 322 | varios 323 | medios 324 | resultados 325 | obras 326 | local 327 | chile 328 | dirección 329 | realizar 330 | deben 331 | base 332 | mes 333 | cuanto 334 | gestión 335 | trata 336 | buen 337 | municipal 338 | siendo 339 | julio 340 | alguna 341 | unidos 342 | trabajadores 343 | ayer 344 | proyectos 345 | incluso 346 | cultura 347 | esos 348 | mañana 349 | llegar 350 | dicho 351 | región 352 | segunda 353 | población 354 | plan 355 | paso 356 | mundial 357 | conocer 358 | participación 359 | estoy 360 | jóvenes 361 | mujer 362 | cargo 363 | primero 364 | 
administración 365 | nuevos 366 | hora 367 | cuales 368 | ciento 369 | comunicación 370 | especial 371 | claro 372 | pesos 373 | espacio 374 | estudios 375 | dios 376 | nuevas 377 | juego 378 | mal 379 | encuentra 380 | cinco 381 | mis 382 | capital 383 | valor 384 | seguir 385 | autoridades 386 | podría 387 | justicia 388 | escuela 389 | tuvo 390 | mayoría 391 | área 392 | saber 393 | luis 394 | organización 395 | cuerpo 396 | ministerio 397 | acción 398 | diciembre 399 | largo 400 | nadie 401 | formación 402 | encuentro 403 | ir 404 | consejo 405 | actual 406 | construcción 407 | vamos 408 | necesario 409 | capacidad 410 | acciones 411 | noche 412 | hacen 413 | ex 414 | cabo 415 | estudiantes 416 | idea 417 | minutos 418 | debido 419 | mayo 420 | orden 421 | campo 422 | octubre 423 | haya 424 | presencia 425 | tarde 426 | modo 427 | permite 428 | podemos 429 | red 430 | temas 431 | edad 432 | tenía 433 | últimos 434 | federal 435 | anterior 436 | respuesta 437 | internet 438 | ahí 439 | puesto 440 | cantidad 441 | usted 442 | real 443 | serie 444 | existe 445 | próximo 446 | dinero 447 | dio 448 | principal 449 | sería 450 | materia 451 | libro 452 | acceso 453 | marco 454 | maría 455 | alto 456 | noviembre 457 | calle 458 | siguiente 459 | central 460 | alumnos 461 | web 462 | algún 463 | posibilidad 464 | modelo 465 | grupos 466 | medida 467 | soy 468 | quiere 469 | cierto 470 | futuro 471 | análisis 472 | mano 473 | humanos 474 | instituto 475 | superior 476 | propio 477 | señor 478 | santa 479 | favor 480 | municipio 481 | cerca 482 | tierra 483 | políticas 484 | programas 485 | ambiente 486 | oportunidad 487 | domingo 488 | economía 489 | crisis 490 | marzo 491 | mejores 492 | interés 493 | etc. 494 | conocimiento 495 | sigue 496 | necesidad 497 | haciendo 498 | cosa 499 | unas 500 | serán -------------------------------------------------------------------------------- /miami/create_test_sets.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 
4 | # 5 | 6 | # this file takes all of the Miami data and turns it into splits 7 | import os 8 | import yaml 9 | import json 10 | from tqdm import tqdm 11 | import shutil 12 | import numpy as np 13 | import random 14 | import pandas as pd 15 | 16 | random.seed(1) 17 | 18 | DATASET_NAMES = ["cs", "mono"] 19 | 20 | def map_cs(x: str) -> str: 21 | if x == "n/a": 22 | return x 23 | elif "cs" in x: 24 | return "cs" 25 | else: 26 | return "mono" 27 | 28 | def map_split(x: str) -> str: 29 | if x == "n/a": 30 | return x 31 | elif "train" in x: 32 | return "train" 33 | else: 34 | return "test" 35 | 36 | def split_data(): 37 | print("Loading the data...") 38 | base_path = "output/miami/all" 39 | base_output_path = "output/miami" 40 | transcript = [] 41 | translation = [] 42 | with open(f"{base_path}/miami.yaml", "r") as fin: 43 | yaml_data = yaml.safe_load(fin) 44 | with open(f"{base_path}/miami.transcript", "r") as fin: 45 | for line in fin: 46 | transcript.append(line.strip()) 47 | with open(f"{base_path}/miami.translation", "r") as fin: 48 | for line in fin: 49 | translation.append(line.strip()) 50 | assert len(transcript) == len(yaml_data) == len(translation), [len(transcript), len(yaml_data), len(translation)] 51 | print(f"Length of the original data is {len(transcript)}") 52 | 53 | mono = [[], [], []] 54 | cs = [[], [], []] 55 | print("Separating the data...") 56 | data_type = [] 57 | mono_count = 0 58 | mono_map = {} 59 | offsets = [] 60 | durations = [] 61 | files = [] 62 | local_file_lines = [] 63 | for idx in tqdm(range(len(yaml_data)), leave=True): 64 | # get values for making a mapping file 65 | local_line_num = yaml_data[idx]["wav"].split("/")[1].split("_")[-1].split(".")[0].replace("p", "") 66 | files.append(yaml_data[idx]["wav"].split("/")[1].split("_")[0]) 67 | offsets.append(yaml_data[idx]["offset"]) 68 | durations.append(yaml_data[idx]["duration"]) 69 | local_file_lines.append(local_line_num) 70 | 71 | if ( 72 | translation[idx] not in ["", "\n"] and transcript[idx] != translation[idx] 73 | ): # an identical transcript and translation would not be helpful 74 | 75 | yaml_instance = yaml_data[idx] 76 | if yaml_instance["duration"] < 0.3: # remove instances that are too short 77 | data_type.append("n/a") 78 | continue 79 | 80 | yaml_instance["offset"] = 0 # each segment becomes its own clip file 81 | 82 | if yaml_data[idx]["code_switched"]: 83 | cs[0].append(yaml_instance) 84 | cs[1].append(transcript[idx]) 85 | cs[2].append(translation[idx]) 86 | data_type.append("cs") 87 | else: 88 | mono[0].append(yaml_instance) 89 | mono[1].append(transcript[idx]) 90 | mono[2].append(translation[idx]) 91 | data_type.append("mono") 92 | mono_map[mono_count] = idx 93 | mono_count += 1 94 | else: 95 | data_type.append("n/a") 96 | 97 | # split the mono data 98 | mono[0] = np.array(mono[0]) 99 | mono[1] = np.array(mono[1]) 100 | mono[2] = np.array(mono[2]) 101 | split_mono_idx = np.array( 102 | random.sample(list(range(len(mono[0]))), len(mono[0]) // 2) 103 | ) 104 | global_map_from_mono = sorted([mono_map[cur_idx] for cur_idx in split_mono_idx]) 105 | bool_split = np.isin(np.arange(len(mono[0])), split_mono_idx) 106 | mono_train = [ 107 | mono[0][bool_split].tolist(), 108 | mono[1][bool_split].tolist(), 109 | mono[2][bool_split].tolist(), 110 | ] 111 | mono = [ 112 | mono[0][~bool_split].tolist(), 113 | mono[1][~bool_split].tolist(), 114 | mono[2][~bool_split].tolist(), 115 | ] 116 | 117 | # make a mapping file for others to use 118 | data_type = [dtype if idx not in global_map_from_mono else "mono_train" \ 119 | for idx, 
dtype in enumerate(data_type)] 120 | mapping_val = pd.DataFrame({"global_idx": list(range(len(data_type))), "split": data_type, 121 | "file": files, "file_line_num": local_file_lines, 122 | "offset": offsets, "duration": durations}) 123 | 124 | mapping_val["cs_type"] = mapping_val.split.apply(lambda x: map_cs(x)) 125 | mapping_val["split"] = mapping_val.split.apply(lambda x: map_split(x)) 126 | mapping_val.to_csv("miami_mapping.csv", index=None) 127 | 128 | print("Writing the data out...") 129 | for (name, datasets) in zip(DATASET_NAMES + ["mono_train"], [cs, mono, mono_train]): 130 | print(f"Length of the data {name} is {len(datasets[0])}") 131 | if not os.path.isdir(os.path.join(base_output_path, name)): 132 | os.makedirs(os.path.join(base_output_path, name)) 133 | with open(os.path.join(base_output_path, name, f"miami.jsonl"), "w") as fout: 134 | for segment in datasets[0]: 135 | fout.write(json.dumps(segment)) 136 | fout.write("\n") 137 | with open(os.path.join(base_output_path, name, f"miami.yaml"), "w") as fout: 138 | fout.write(yaml.dump(datasets[0], allow_unicode=True)) 139 | with open( 140 | os.path.join(base_output_path, name, f"miami.transcript"), "w" 141 | ) as fout: 142 | for line in datasets[1]: 143 | assert "\n" not in line, line 144 | fout.write(line) 145 | fout.write("\n") 146 | with open( 147 | os.path.join(base_output_path, name, f"miami.translation"), "w" 148 | ) as fout: 149 | for line in datasets[2]: 150 | assert "\n" not in line, line 151 | fout.write(line) 152 | fout.write("\n") 153 | 154 | print("Moving clip data...") 155 | mono_clips = [item["wav"] for item in mono[0]] 156 | mono_train_clips = [item["wav"] for item in mono_train[0]] 157 | cs_clips = [item["wav"] for item in cs[0]] 158 | assert len(mono_clips) == len(mono[0]) 159 | assert len(cs_clips) == len(cs[0]) 160 | assert len(mono_train_clips) == len(mono_train[0]) 161 | for (name, file_paths) in zip( 162 | DATASET_NAMES + ["mono_train"], [cs_clips, mono_clips, mono_train_clips] 163 | ): 164 | for file_path in file_paths: 165 | if not os.path.isdir(os.path.join(base_output_path, name, "clips")): 166 | os.makedirs(os.path.join(base_output_path, name, "clips")) 167 | shutil.copy( 168 | os.path.join(base_path, file_path), 169 | os.path.join(base_output_path, name, file_path), 170 | ) 171 | 172 | # make it a zip file 173 | shutil.make_archive( 174 | os.path.join(base_output_path, name, "clips"), 175 | "zip", 176 | os.path.join(base_output_path, name, "clips"), 177 | ) 178 | 179 | 180 | if __name__ == "__main__": 181 | split_data() 182 | -------------------------------------------------------------------------------- /miami/download_miami_data.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | # 4 | # For licensing see accompanying LICENSE file. 5 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 6 | # 7 | 8 | # This script downloads the miami corpus from its repository and converts it into 16K audio 9 | 10 | # start by downloading their repository 11 | DIRECTORY="data/miami" 12 | if [ ! 
-d "$DIRECTORY" ]; then 13 | echo "cloning corpus, which contains the CHAT files with text and mappings" 14 | cd data 15 | git clone https://github.com/donnekgit/miami.git 16 | mkdir miami/audio 17 | cd ../ 18 | fi 19 | 20 | echo "downloading audio files" 21 | 22 | declare -a audio=("herring1" "herring2" "herring3" 23 | "herring5" "herring6" "herring7" "herring8" 24 | "herring9" "herring10" "herring11" "herring12" 25 | "herring13" "herring14" "herring15" "herring16" 26 | "herring17" "maria1" "maria2" "maria3" "maria4" 27 | "maria7" "maria10" "maria16" "maria18" "maria19" 28 | "maria20" "maria21" "maria24" "maria27" "maria30" 29 | "maria31" "maria40" "sastre1" "sastre2" "sastre3" 30 | "sastre4" "sastre5" "sastre6" "sastre7" "sastre8" 31 | "sastre9" "sastre10" "sastre11" "sastre12" "sastre13" 32 | "zeledon1" "zeledon2" "zeledon3" "zeledon4" "zeledon5" 33 | "zeledon6" "zeledon7" "zeledon8" "zeledon9" "zeledon11" 34 | "zeledon13" "zeledon14") 35 | 36 | # Download each of the above files 37 | for i in "${audio[@]}" 38 | do 39 | if [ ! -f "data/miami/audio/$i.mp3" ]; then 40 | echo "Downloading $i" 41 | wget -P data/miami/audio/ http://bangortalk.bangor.ac.uk/$i.mp3 42 | fi 43 | done 44 | 45 | # convert each file to 16 bit wav files 46 | for i in "${audio[@]}" 47 | do 48 | if [ ! -f "data/miami/audio/$i.wav" ]; then 49 | echo "converting mp3 to wav $i" 50 | ffmpeg -i data/miami/audio/$i.mp3 -acodec pcm_s16le -ac 1 -ar 16000 data/miami/audio/$i.wav 51 | fi 52 | done 53 | 54 | -------------------------------------------------------------------------------- /miami/process_miami_data.py: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # This file processes the Miami dataset into CS and monolingual test sets 7 | import re 8 | import os 9 | import glob 10 | import pylangacq 11 | import librosa 12 | import yaml 13 | import json 14 | import soundfile as sf 15 | import string 16 | from tqdm import tqdm 17 | import random 18 | from nltk.tokenize.treebank import TreebankWordDetokenizer 19 | 20 | DETOKENIZER = TreebankWordDetokenizer() 21 | 22 | ONE_SECOND = 16000 23 | 24 | # their language mapping to our language tags 25 | LANG_MAP = { 26 | "s:spa": "spa", 27 | "s:eng": "eng", 28 | "s:eng&spa": "unknown", 29 | "s:eng&spag": "unknown", 30 | "s:spa&eng": "unknown", 31 | "s:spa+eng": "first spa, second eng", 32 | "s:eng+spa": "first eng, second spa", 33 | "s:eng&spa+eng": "unknown+extra", 34 | "s:ita": "italian", 35 | "s:fra": "french", 36 | } 37 | 38 | MAP_FOR_WORD_PREDS = { 39 | "first spa, second eng": "spa", 40 | "first eng, second spa": "spa", 41 | "eng": "eng", 42 | "spa": "spa", 43 | "italian": "italian", 44 | "french": "french", 45 | } 46 | 47 | 48 | ##### simple string cleaning functions ##### 49 | def remove_punct(s: str) -> str: 50 | return s.translate(str.maketrans("", "", string.punctuation)) 51 | 52 | 53 | def verify_text(text: str): 54 | illegal_chars = ["[", "]", "(", ")", "/", "+", "&"] 55 | for char in illegal_chars: 56 | if char in text: 57 | raise Exception("had illegal char", char, text) 58 | 59 | 60 | def remove_leading_spaces_punct(sent: list) -> str: 61 | new_sent = DETOKENIZER.detokenize(sent) 62 | # doesn't get second sentence 63 | if " ." in new_sent: 64 | new_sent = new_sent.replace(" .", ".") 65 | if " ?" 
in new_sent: 66 | new_sent = new_sent.replace(" ?", "?") 67 | if " ," in new_sent: 68 | new_sent = new_sent.replace(" ,", ",") 69 | return new_sent 70 | 71 | 72 | def clean_underscores(sent: str) -> str: 73 | if "o_k" in sent: # don't want to remove for o_k 74 | sent = sent.replace("o_k", "ok") 75 | 76 | new_sent = sent.replace("_", " ") 77 | return new_sent 78 | 79 | 80 | def clean_up_common_markup_errors(sent: str) -> str: 81 | all_chars_to_replace = [ 82 | "(.)", 83 | "(..)", 84 | "+//", 85 | "<", 86 | ">", 87 | "+/.", 88 | "+/?", 89 | "/", 90 | "...", 91 | "..", 92 | "++", 93 | "+/", 94 | "xxx", 95 | "+", 96 | '+"', 97 | "+,", 98 | "[", 99 | "]", 100 | "“", 101 | ] 102 | for char_phrase in all_chars_to_replace: 103 | sent = sent.replace(char_phrase, "") 104 | 105 | if '".' in sent: 106 | sent = sent.replace('".', ".") 107 | if ":." in sent: 108 | sent = sent.replace(":.", ".") 109 | 110 | if "@s:eng&spa" in sent: 111 | sent = sent.replace("@s:eng&spa", "") 112 | 113 | if re.search('".*"', sent) is None: 114 | sent = sent.replace('"', "") 115 | return sent 116 | 117 | 118 | def clean_translation(text: str) -> str: 119 | text = clean_word_text(text) 120 | text = [word for word in re.sub(r"\([^)]*\)", "", text).split(" ") if word != ""] 121 | text = remove_leading_spaces_punct(text) 122 | return text 123 | 124 | 125 | def clean_word_text(transcript): 126 | transcript = clean_up_common_markup_errors(transcript) 127 | if len(transcript) and transcript[0] == ",": # strip a leading comma before detokenizing 128 | transcript = transcript[1:] 129 | 130 | detokenized_transcript = remove_leading_spaces_punct(transcript.split(" ")) 131 | clean_sent = clean_underscores(detokenized_transcript) 132 | 133 | return clean_sent 134 | 135 | 136 | def make_transcript_manually(raw_utt: str) -> str: 137 | """ 138 | Some of the utterances have disfluencies, which the pylangacq software excludes from the transcript. 139 | Thus, we have to manually take the raw utterance transcription in CHAT form to keep them 140 | """ 141 | filter_raw_utt = raw_utt.replace("<", "").replace(">", "") 142 | only_words = filter_raw_utt.split(" ")[:-1] # last one is timing 143 | new_words = [] 144 | for word in only_words: 145 | # strip the CHAT language annotation (e.g. word@s:eng) and other markup 146 | if "@" in word: 147 | word = word[: word.find("@")] # annotation for code-switching 148 | 149 | if not len(word): 150 | continue 151 | 152 | if word[0] == "[" or word[-1] == "]": 153 | continue # don't need the markup 154 | 155 | if word[0] == "&": 156 | continue # don't need partial starts 157 | 158 | word = clean_up_common_markup_errors(word) 159 | if len(word) == 0: 160 | continue 161 | 162 | if "+//." in word: 163 | word = word.replace("+//.", ".") 164 | if '".' 
in word: 165 | word = word.replace('".', ".") 166 | 167 | if "(" in word or ")" in word: 168 | # NOTE: this is where we remove parentheticals 169 | word = re.sub(r"\([^)]*\)", "", word) 170 | 171 | new_words.append(word) 172 | 173 | detokenized_sent = remove_leading_spaces_punct(new_words) 174 | clean_sent = clean_underscores(detokenized_sent) 175 | return clean_sent 176 | 177 | 178 | def gather_cs_statistics_and_words(utterance, raw_utt: str, transcript: str, file_lang: list, cur_lang: str): 179 | # for tagging each word, use a list of the most common words 180 | common_spanish_words = [] 181 | with open("common_words/spa.txt", "r") as fin: 182 | for line in fin: 183 | common_spanish_words.append(line.strip()) 184 | 185 | common_english_words = [] 186 | with open("common_words/eng.txt", "r") as fin: 187 | for line in fin: 188 | common_english_words.append(line.strip()) 189 | 190 | # drop words that appear in both lists (loanwords like "internet", etc.) 191 | common_spanish_words = list( 192 | set(common_spanish_words) 193 | - set(common_english_words).intersection(set(common_spanish_words)) 194 | ) 195 | 196 | def get_lang_id(input_word): # parse the CHAT language id 197 | word = input_word.split("@")[1] 198 | word = ( 199 | word.replace(">", "") 200 | .replace("[/]", "") 201 | .replace('"', "") 202 | .replace("”", "") 203 | .replace(".", "") 204 | .replace("]", "") 205 | .replace(",", "") 206 | ) 207 | return LANG_MAP[word] 208 | 209 | eng = ( 210 | utterance.tiers["%eng"] if "%eng" in utterance.tiers else None 211 | ) # English translation 212 | word_to_lang_map = [ 213 | (word.split("@")[0], get_lang_id(word)) 214 | for word in raw_utt.split(" ") 215 | if "@" in word 216 | ] 217 | is_cs = any( 218 | ["unknown" not in lang for (_, lang) in word_to_lang_map] 219 | ) # any non-unknown annotation means the utterance is code-switched 220 | is_cs_any = len(word_to_lang_map) 221 | num_words = len( 222 | [ 223 | item 224 | for item in raw_utt.split(" ")[:-1] 225 | if ("[" not in item and "." not in item) 226 | ] 227 | ) 228 | cs_percent = 0 if not is_cs else len(word_to_lang_map) / num_words 229 | 230 | # refine the choice of the main language, really only used for statistical purposes 231 | # this just flips the main language if the CS percent is greater than 0.5 232 | if (eng is None and cur_lang == "spa" and cs_percent > 0.5 and len(transcript.split(" ")) >= 3): 233 | cur_lang = "eng" 234 | if (eng is not None and cur_lang == "eng" and cs_percent > 0.5 and len(transcript.split(" ")) >= 3): 235 | cur_lang = "spa" 236 | 237 | # let's try to get word-level tags for the CS data. We have to do this by manually parsing the sentence
238 | clean_transcript = remove_punct(transcript) 239 | cs_words = [ 240 | remove_punct(clean_word_text(word)) 241 | for (word, lang) in word_to_lang_map 242 | if "unknown" not in lang 243 | ] 244 | cs_words_lang = [ 245 | lang for (word, lang) in word_to_lang_map if "unknown" not in lang 246 | ] 247 | for idx, cs_word in enumerate(cs_words): 248 | if cs_word not in clean_transcript: 249 | # try to clean up the word to see if it's in the clean transcript 250 | if cs_word.replace("(", "").replace(")", "") in clean_transcript: 251 | cs_words[idx] = cs_word.replace("(", "").replace(")", "") 252 | elif cs_word.split("(")[0] in clean_transcript: 253 | cs_words[idx] = cs_word.split("(")[0] 254 | 255 | tagged_words = "" 256 | main_lang, embedded_lang = file_lang[0], file_lang[-1] 257 | if "[- spa]" in raw_utt or "[-spa]" in raw_utt: 258 | main_lang, embedded_lang = "spa", "eng" 259 | if "[- eng]" in raw_utt or "[-eng]" in raw_utt: 260 | main_lang, embedded_lang = "eng", "spa" 261 | 262 | for word in clean_transcript.split(" "): 263 | if word in cs_words: # first see if they were annotated 264 | index = cs_words.index(word) 265 | annote_lang = MAP_FOR_WORD_PREDS[cs_words_lang[index]] # annotated language; the tag below uses the file-level embedded language 266 | tagged_words += f"{word}={embedded_lang} " 267 | else: # try to rely on the backup common words if they're not annotated 268 | if word in common_spanish_words: 269 | tagged_words += f"{word}=spa " 270 | elif word in common_english_words: 271 | tagged_words += f"{word}=eng " 272 | else: 273 | tagged_words += f"{word}={main_lang} " 274 | 275 | tagged_words = tagged_words.strip() 276 | return tagged_words, eng, cs_percent, is_cs, is_cs_any 277 | 278 | 279 | 280 | def write_out(final_path, all_segments, all_transcripts, all_translations): 281 | with open(os.path.join(final_path, "miami.yaml"), "w") as fout: 282 | fout.write(yaml.dump(all_segments, allow_unicode=True)) 283 | 284 | with open(os.path.join(final_path, "miami.transcript"), "w") as fout: 285 | for line in all_transcripts: 286 | fout.write(line) 287 | fout.write("\n") 288 | 289 | with open(os.path.join(final_path, "miami.translation"), "w") as fout: 290 | for line in all_translations: 291 | fout.write(line) 292 | fout.write("\n") 293 | 294 | with open(os.path.join(final_path, "miami.jsonl"), "w") as fout: 295 | for segment in all_segments: 296 | fout.write(json.dumps(segment)) 297 | fout.write("\n") 298 | 299 | 300 | def prepare_miami_data(): 301 | all_segments = [] 302 | all_transcripts = [] 303 | all_translations = [] 304 | 305 | final_path = "output/miami/all" 306 | if not os.path.isdir(final_path): 307 | os.makedirs(os.path.join(final_path, "clips")) 308 | 309 | chat_file_location = "data/miami/beta" # beta has the most up-to-date transcripts 310 | for chat_file_path in tqdm( 311 | glob.glob(os.path.join(chat_file_location, "*.cha")), leave=True 312 | ): 313 | clip_name = chat_file_path.split("/")[-1].replace(".cha", "") 314 | cur_reader = pylangacq.read_chat(chat_file_path) 315 | all_words = cur_reader.words(by_utterances=True) 316 | assert len(cur_reader._files) == 1 317 | file_lang = cur_reader._files[0].header["Languages"] 318 | 319 | # get wav data 320 | wav_path = chat_file_path.replace("beta", "audio").replace("cha", "wav") 321 | wav_data, sampling_rate = librosa.load( 322 | wav_path, sr=ONE_SECOND 323 | ) # already at 16khz/16bit/mono 324 | assert sampling_rate == ONE_SECOND 325 | 326 | for idx, utterance in enumerate(cur_reader.utterances()): 327 | word_utterance = all_words[idx] 328 | transcript = " 
".join(word_utterance) 329 | if not len(transcript): 330 | continue 331 | transcript = clean_word_text(transcript) 332 | raw_utt = utterance.tiers[utterance.participant] 333 | 334 | # the main language can be overriden if marked that way 335 | if "[- eng]" in raw_utt or "[-eng]" in raw_utt: 336 | cur_lang = "eng" 337 | elif "[- spa]" in raw_utt or "[-spa]" in raw_utt: 338 | cur_lang = "spa" 339 | else: 340 | cur_lang = file_lang[0] 341 | 342 | ## Check if we really want to keep cleaning this utterance ## 343 | if "www" in raw_utt: 344 | continue # means untranscribed text, skip 345 | if word_utterance == ["."]: 346 | continue # we don't want empty lines 347 | 348 | if "[" in raw_utt: # some markup to deal with 349 | # see https://talkbank.org/manuals/CHAT.pdf for details 350 | markings = re.findall("\[.*?\]", raw_utt) 351 | for mark in markings: 352 | if mark in [ 353 | "[!]", 354 | "[?]", 355 | "[!!]", 356 | "[*]", 357 | "[/-]", 358 | "[//]", 359 | '["]', 360 | ] or mark in ["[- spa]", "[-spa]", "[-eng]", "[- eng]"]: 361 | """ 362 | Markup definitions that we can skip/remove for ST purposes: 363 | [!] means stressing 364 | [!!] means constrastive stressing 365 | [?] means uncertainty in transcription, but best guess 366 | [=! ...] is some kind of para-linguistic communication, laugh, yell, etc. 367 | [# ...] indicates duration of previous <> tag 368 | [*] means the word is incorrect semantically/grammatically, typically followed by the [* correct_word] 369 | [/-] is for false starts but still spoken 370 | [//] for abandended and retracing speech 371 | 372 | """ 373 | continue 374 | elif "[=!" in mark or "[= !" in mark or "[*" in mark: # see above 375 | continue 376 | elif mark in ["[/]", "[//]", "[///]"]: 377 | # indicates trailing or correction while speaking, pylangacq gets rid of them, do it manually 378 | if raw_utt is None: 379 | continue 380 | transcript = make_transcript_manually(raw_utt) 381 | break 382 | else: 383 | raise Exception(f"Encountered new mark {mark}") 384 | 385 | time_marks = utterance.time_marks 386 | if time_marks is None: 387 | continue # don't know why there are no time marks, but skip. 
# Happens approx. 3 times outside of maria18.cha, where there are ~20 instances 389 | 390 | # get the audio clip and validate it 391 | start_time, end_time = time_marks 392 | start_time_s, end_time_s = start_time / 1000, end_time / 1000 393 | duration_s = end_time_s - start_time_s 394 | wav_clip = wav_data[ 395 | int(start_time_s * ONE_SECOND) : int(end_time_s * ONE_SECOND) 396 | ] 397 | if int(end_time_s * ONE_SECOND) < wav_data.shape[0]: 398 | # clips may run past the end of the file (which we allow); otherwise the lengths must match 399 | error_str = f"Wav Clip:{wav_clip.shape[0]} vs duration:{duration_s * ONE_SECOND}" 400 | assert (duration_s * ONE_SECOND - wav_clip.shape[0]) < 1, error_str 401 | 402 | cur_clip_name = clip_name + "_p" + str(idx) 403 | clip_path = os.path.join("clips", cur_clip_name + ".wav") 404 | sf.write(os.path.join(final_path, clip_path), wav_clip, ONE_SECOND) 405 | 406 | # for LID and statistics, gather the lang id for each word 407 | tagged_words, eng, cs_percent, is_cs, is_cs_any = gather_cs_statistics_and_words(utterance, raw_utt, transcript, file_lang, cur_lang) 408 | speakers = utterance.participant # just in case it's needed someday for speaker ID 409 | 410 | all_segments.append( 411 | { 412 | "wav": clip_path, 413 | "offset": start_time_s, 414 | "duration": duration_s, 415 | "cs_percent": cs_percent, 416 | "speaker_id": speakers, 417 | "code_switched": is_cs, 418 | "main_lang": cur_lang, 419 | "code_switched_any": is_cs_any, 420 | "tagged_words": tagged_words, 421 | } 422 | ) 423 | translation = clean_translation(eng) if eng is not None else "" 424 | assert transcript is not None 425 | 426 | # validate the sentences 427 | verify_text(transcript) 428 | verify_text(translation) 429 | all_transcripts.append(transcript) 430 | all_translations.append(translation) 431 | assert len(all_transcripts) == len(all_segments) == len(all_translations) 432 | write_out(final_path, all_segments, all_transcripts, all_translations) 433 | 434 | 435 | if __name__ == "__main__": 436 | prepare_miami_data() 437 | -------------------------------------------------------------------------------- /miami/readme.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This repository contains all the scripts needed to download the Bangor Miami Corpus and preprocess it for Speech Translation. 3 | 4 | ## 1-Step Setup 5 | 0. Run `setup_all.sh` to download the data and process it. For more granular instructions, see the `Multi-Step Setup` below 6 | 7 | ## Multi-Step Setup 8 | 0. Gather the data by running `bash download_miami_data.sh`, which will place the data in `./data` 9 | 1. Format the data by running `python process_miami_data.py`, which will output the data in `output/miami/all`. It will contain three files: a `miami.yaml` file containing the segment timestamps, a `miami.transcript` file containing the transcripts, and a `miami.translation` file containing the translations 10 | 2. Create code-switched and non-code-switched sections by running `python create_test_sets.py` 11 | 3. 
To create LID data, run `python split_train_and_make_lid.py` from within the `fisher` directory (its relative paths assume it is run from there) 12 | 13 | 14 | ## Paper Reference 15 | The Bangor Miami corpus is found [here](https://biling.talkbank.org/access/Bangor/Miami.html) and was published as part of [this paper](https://www.researchgate.net/publication/292243516_Building_bilingual_corpora) -------------------------------------------------------------------------------- /miami/setup_all.sh: -------------------------------------------------------------------------------- 1 | # 2 | # For licensing see accompanying LICENSE file. 3 | # Copyright (C) 2022 Apple Inc. All Rights Reserved. 4 | # 5 | 6 | # this script should set everything up 7 | mkdir -p output 8 | mkdir -p output/miami 9 | mkdir -p data 10 | 11 | bash download_miami_data.sh 12 | python process_miami_data.py 13 | python create_test_sets.py 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pylangacq==0.15.0 2 | PyYAML==5.4.1 3 | pandas==1.3.0 4 | numpy==1.20.0 5 | tqdm==4.61.2 6 | nltk==3.6.2 7 | beautifulsoup4==4.10.0 8 | librosa==0.8.1 --------------------------------------------------------------------------------