├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── align ├── align.py ├── audio.py ├── catalog_tool.py ├── export.py ├── generate_lm.py ├── generate_package.py ├── meta.py ├── sample_collections.py ├── sdb_tool.py ├── search.py ├── stats.py ├── text.py └── utils.py ├── bin ├── align.sh ├── catalog_tool.sh ├── createenv.sh ├── export.sh ├── getmodel.sh ├── gettestdata.sh ├── lm-dependencies.sh ├── meta.sh ├── play2script.py ├── sdb_tool.sh ├── statistics.sh └── taskcluster.py ├── data ├── all-wav.catalog ├── all.catalog ├── test1.catalog └── test2.catalog ├── doc ├── algo.md ├── export.md ├── files.md ├── lm.md ├── metrics.md └── tools.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | models 2 | dependencies 3 | data/test* 4 | data/export 5 | 6 | .idea 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | MANIFEST 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *.cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | target/ 78 | 79 | # Jupyter Notebook 80 | .ipynb_checkpoints 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 | 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Community Participation Guidelines 2 | 3 | This repository is governed by Mozilla's code of conduct and etiquette guidelines. 4 | For more details, please read the 5 | [Mozilla Community Participation Guidelines](https://www.mozilla.org/about/governance/policies/participation/). 6 | 7 | ## How to Report 8 | For more information on how to report violations of the Community Participation Guidelines, please read our '[How to Report](https://www.mozilla.org/about/governance/policies/participation/reporting/)' page. 9 | 10 | 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Mozilla Public License Version 2.0 2 | ================================== 3 | 4 | 1. 
Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. "Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor's Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. "Incompatible With Secondary Licenses" 25 | means 26 | 27 | (a) that the initial Contributor has attached the notice described 28 | in Exhibit B to the Covered Software; or 29 | 30 | (b) that the Covered Software was made available under the terms of 31 | version 1.1 or earlier of the License, but not also under the 32 | terms of a Secondary License. 33 | 34 | 1.6. "Executable Form" 35 | means any form of the work other than Source Code Form. 36 | 37 | 1.7. "Larger Work" 38 | means a work that combines Covered Software with other material, in 39 | a separate file or files, that is not Covered Software. 40 | 41 | 1.8. "License" 42 | means this document. 43 | 44 | 1.9. "Licensable" 45 | means having the right to grant, to the maximum extent possible, 46 | whether at the time of the initial grant or subsequently, any and 47 | all of the rights conveyed by this License. 48 | 49 | 1.10. "Modifications" 50 | means any of the following: 51 | 52 | (a) any file in Source Code Form that results from an addition to, 53 | deletion from, or modification of the contents of Covered 54 | Software; or 55 | 56 | (b) any new file in Source Code Form that contains any Covered 57 | Software. 58 | 59 | 1.11. "Patent Claims" of a Contributor 60 | means any patent claim(s), including without limitation, method, 61 | process, and apparatus claims, in any patent Licensable by such 62 | Contributor that would be infringed, but for the grant of the 63 | License, by the making, using, selling, offering for sale, having 64 | made, import, or transfer of either its Contributions or its 65 | Contributor Version. 66 | 67 | 1.12. "Secondary License" 68 | means either the GNU General Public License, Version 2.0, the GNU 69 | Lesser General Public License, Version 2.1, the GNU Affero General 70 | Public License, Version 3.0, or any later versions of those 71 | licenses. 72 | 73 | 1.13. "Source Code Form" 74 | means the form of the work preferred for making modifications. 75 | 76 | 1.14. "You" (or "Your") 77 | means an individual or a legal entity exercising rights under this 78 | License. For legal entities, "You" includes any entity that 79 | controls, is controlled by, or is under common control with You. For 80 | purposes of this definition, "control" means (a) the power, direct 81 | or indirect, to cause the direction or management of such entity, 82 | whether by contract or otherwise, or (b) ownership of more than 83 | fifty percent (50%) of the outstanding shares or beneficial 84 | ownership of such entity. 85 | 86 | 2. License Grants and Conditions 87 | -------------------------------- 88 | 89 | 2.1. 
Grants 90 | 91 | Each Contributor hereby grants You a world-wide, royalty-free, 92 | non-exclusive license: 93 | 94 | (a) under intellectual property rights (other than patent or trademark) 95 | Licensable by such Contributor to use, reproduce, make available, 96 | modify, display, perform, distribute, and otherwise exploit its 97 | Contributions, either on an unmodified basis, with Modifications, or 98 | as part of a Larger Work; and 99 | 100 | (b) under Patent Claims of such Contributor to make, use, sell, offer 101 | for sale, have made, import, and otherwise transfer either its 102 | Contributions or its Contributor Version. 103 | 104 | 2.2. Effective Date 105 | 106 | The licenses granted in Section 2.1 with respect to any Contribution 107 | become effective for each Contribution on the date the Contributor first 108 | distributes such Contribution. 109 | 110 | 2.3. Limitations on Grant Scope 111 | 112 | The licenses granted in this Section 2 are the only rights granted under 113 | this License. No additional rights or licenses will be implied from the 114 | distribution or licensing of Covered Software under this License. 115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a 116 | Contributor: 117 | 118 | (a) for any code that a Contributor has removed from Covered Software; 119 | or 120 | 121 | (b) for infringements caused by: (i) Your and any other third party's 122 | modifications of Covered Software, or (ii) the combination of its 123 | Contributions with other software (except as part of its Contributor 124 | Version); or 125 | 126 | (c) under Patent Claims infringed by Covered Software in the absence of 127 | its Contributions. 128 | 129 | This License does not grant any rights in the trademarks, service marks, 130 | or logos of any Contributor (except as may be necessary to comply with 131 | the notice requirements in Section 3.4). 132 | 133 | 2.4. Subsequent Licenses 134 | 135 | No Contributor makes additional grants as a result of Your choice to 136 | distribute the Covered Software under a subsequent version of this 137 | License (see Section 10.2) or under the terms of a Secondary License (if 138 | permitted under the terms of Section 3.3). 139 | 140 | 2.5. Representation 141 | 142 | Each Contributor represents that the Contributor believes its 143 | Contributions are its original creation(s) or it has sufficient rights 144 | to grant the rights to its Contributions conveyed by this License. 145 | 146 | 2.6. Fair Use 147 | 148 | This License is not intended to limit any rights You have under 149 | applicable copyright doctrines of fair use, fair dealing, or other 150 | equivalents. 151 | 152 | 2.7. Conditions 153 | 154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 155 | in Section 2.1. 156 | 157 | 3. Responsibilities 158 | ------------------- 159 | 160 | 3.1. Distribution of Source Form 161 | 162 | All distribution of Covered Software in Source Code Form, including any 163 | Modifications that You create or to which You contribute, must be under 164 | the terms of this License. You must inform recipients that the Source 165 | Code Form of the Covered Software is governed by the terms of this 166 | License, and how they can obtain a copy of this License. You may not 167 | attempt to alter or restrict the recipients' rights in the Source Code 168 | Form. 169 | 170 | 3.2. 
Distribution of Executable Form 171 | 172 | If You distribute Covered Software in Executable Form then: 173 | 174 | (a) such Covered Software must also be made available in Source Code 175 | Form, as described in Section 3.1, and You must inform recipients of 176 | the Executable Form how they can obtain a copy of such Source Code 177 | Form by reasonable means in a timely manner, at a charge no more 178 | than the cost of distribution to the recipient; and 179 | 180 | (b) You may distribute such Executable Form under the terms of this 181 | License, or sublicense it under different terms, provided that the 182 | license for the Executable Form does not attempt to limit or alter 183 | the recipients' rights in the Source Code Form under this License. 184 | 185 | 3.3. Distribution of a Larger Work 186 | 187 | You may create and distribute a Larger Work under terms of Your choice, 188 | provided that You also comply with the requirements of this License for 189 | the Covered Software. If the Larger Work is a combination of Covered 190 | Software with a work governed by one or more Secondary Licenses, and the 191 | Covered Software is not Incompatible With Secondary Licenses, this 192 | License permits You to additionally distribute such Covered Software 193 | under the terms of such Secondary License(s), so that the recipient of 194 | the Larger Work may, at their option, further distribute the Covered 195 | Software under the terms of either this License or such Secondary 196 | License(s). 197 | 198 | 3.4. Notices 199 | 200 | You may not remove or alter the substance of any license notices 201 | (including copyright notices, patent notices, disclaimers of warranty, 202 | or limitations of liability) contained within the Source Code Form of 203 | the Covered Software, except that You may alter any license notices to 204 | the extent required to remedy known factual inaccuracies. 205 | 206 | 3.5. Application of Additional Terms 207 | 208 | You may choose to offer, and to charge a fee for, warranty, support, 209 | indemnity or liability obligations to one or more recipients of Covered 210 | Software. However, You may do so only on Your own behalf, and not on 211 | behalf of any Contributor. You must make it absolutely clear that any 212 | such warranty, support, indemnity, or liability obligation is offered by 213 | You alone, and You hereby agree to indemnify every Contributor for any 214 | liability incurred by such Contributor as a result of warranty, support, 215 | indemnity or liability terms You offer. You may include additional 216 | disclaimers of warranty and limitations of liability specific to any 217 | jurisdiction. 218 | 219 | 4. Inability to Comply Due to Statute or Regulation 220 | --------------------------------------------------- 221 | 222 | If it is impossible for You to comply with any of the terms of this 223 | License with respect to some or all of the Covered Software due to 224 | statute, judicial order, or regulation then You must: (a) comply with 225 | the terms of this License to the maximum extent possible; and (b) 226 | describe the limitations and the code they affect. Such description must 227 | be placed in a text file included with all distributions of the Covered 228 | Software under this License. Except to the extent prohibited by statute 229 | or regulation, such description must be sufficiently detailed for a 230 | recipient of ordinary skill to be able to understand it. 231 | 232 | 5. Termination 233 | -------------- 234 | 235 | 5.1. 
The rights granted under this License will terminate automatically 236 | if You fail to comply with any of its terms. However, if You become 237 | compliant, then the rights granted under this License from a particular 238 | Contributor are reinstated (a) provisionally, unless and until such 239 | Contributor explicitly and finally terminates Your grants, and (b) on an 240 | ongoing basis, if such Contributor fails to notify You of the 241 | non-compliance by some reasonable means prior to 60 days after You have 242 | come back into compliance. Moreover, Your grants from a particular 243 | Contributor are reinstated on an ongoing basis if such Contributor 244 | notifies You of the non-compliance by some reasonable means, this is the 245 | first time You have received notice of non-compliance with this License 246 | from such Contributor, and You become compliant prior to 30 days after 247 | Your receipt of the notice. 248 | 249 | 5.2. If You initiate litigation against any entity by asserting a patent 250 | infringement claim (excluding declaratory judgment actions, 251 | counter-claims, and cross-claims) alleging that a Contributor Version 252 | directly or indirectly infringes any patent, then the rights granted to 253 | You by any and all Contributors for the Covered Software under Section 254 | 2.1 of this License shall terminate. 255 | 256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all 257 | end user license agreements (excluding distributors and resellers) which 258 | have been validly granted by You or Your distributors under this License 259 | prior to termination shall survive termination. 260 | 261 | ************************************************************************ 262 | * * 263 | * 6. Disclaimer of Warranty * 264 | * ------------------------- * 265 | * * 266 | * Covered Software is provided under this License on an "as is" * 267 | * basis, without warranty of any kind, either expressed, implied, or * 268 | * statutory, including, without limitation, warranties that the * 269 | * Covered Software is free of defects, merchantable, fit for a * 270 | * particular purpose or non-infringing. The entire risk as to the * 271 | * quality and performance of the Covered Software is with You. * 272 | * Should any Covered Software prove defective in any respect, You * 273 | * (not any Contributor) assume the cost of any necessary servicing, * 274 | * repair, or correction. This disclaimer of warranty constitutes an * 275 | * essential part of this License. No use of any Covered Software is * 276 | * authorized under this License except under this disclaimer. * 277 | * * 278 | ************************************************************************ 279 | 280 | ************************************************************************ 281 | * * 282 | * 7. Limitation of Liability * 283 | * -------------------------- * 284 | * * 285 | * Under no circumstances and under no legal theory, whether tort * 286 | * (including negligence), contract, or otherwise, shall any * 287 | * Contributor, or anyone who distributes Covered Software as * 288 | * permitted above, be liable to You for any direct, indirect, * 289 | * special, incidental, or consequential damages of any character * 290 | * including, without limitation, damages for lost profits, loss of * 291 | * goodwill, work stoppage, computer failure or malfunction, or any * 292 | * and all other commercial damages or losses, even if such party * 293 | * shall have been informed of the possibility of such damages. 
This * 294 | * limitation of liability shall not apply to liability for death or * 295 | * personal injury resulting from such party's negligence to the * 296 | * extent applicable law prohibits such limitation. Some * 297 | * jurisdictions do not allow the exclusion or limitation of * 298 | * incidental or consequential damages, so this exclusion and * 299 | * limitation may not apply to You. * 300 | * * 301 | ************************************************************************ 302 | 303 | 8. Litigation 304 | ------------- 305 | 306 | Any litigation relating to this License may be brought only in the 307 | courts of a jurisdiction where the defendant maintains its principal 308 | place of business and such litigation shall be governed by laws of that 309 | jurisdiction, without reference to its conflict-of-law provisions. 310 | Nothing in this Section shall prevent a party's ability to bring 311 | cross-claims or counter-claims. 312 | 313 | 9. Miscellaneous 314 | ---------------- 315 | 316 | This License represents the complete agreement concerning the subject 317 | matter hereof. If any provision of this License is held to be 318 | unenforceable, such provision shall be reformed only to the extent 319 | necessary to make it enforceable. Any law or regulation which provides 320 | that the language of a contract shall be construed against the drafter 321 | shall not be used to construe this License against a Contributor. 322 | 323 | 10. Versions of the License 324 | --------------------------- 325 | 326 | 10.1. New Versions 327 | 328 | Mozilla Foundation is the license steward. Except as provided in Section 329 | 10.3, no one other than the license steward has the right to modify or 330 | publish new versions of this License. Each version will be given a 331 | distinguishing version number. 332 | 333 | 10.2. Effect of New Versions 334 | 335 | You may distribute the Covered Software under the terms of the version 336 | of the License under which You originally received the Covered Software, 337 | or under the terms of any subsequent version published by the license 338 | steward. 339 | 340 | 10.3. Modified Versions 341 | 342 | If you create software not governed by this License, and you want to 343 | create a new license for such software, you may create and use a 344 | modified version of this License if you rename the license and remove 345 | any references to the name of the license steward (except to note that 346 | such modified license differs from this License). 347 | 348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary 349 | Licenses 350 | 351 | If You choose to distribute Source Code Form that is Incompatible With 352 | Secondary Licenses under the terms of this version of the License, the 353 | notice described in Exhibit B of this License must be attached. 354 | 355 | Exhibit A - Source Code Form License Notice 356 | ------------------------------------------- 357 | 358 | This Source Code Form is subject to the terms of the Mozilla Public 359 | License, v. 2.0. If a copy of the MPL was not distributed with this 360 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 361 | 362 | If it is not possible or desirable to put the notice in a particular 363 | file, then You may include the notice in a location (such as a LICENSE 364 | file in a relevant directory) where a recipient would be likely to look 365 | for such a notice. 366 | 367 | You may add additional accurate notices of copyright ownership. 
368 | 369 | Exhibit B - "Incompatible With Secondary Licenses" Notice 370 | --------------------------------------------------------- 371 | 372 | This Source Code Form is "Incompatible With Secondary Licenses", as 373 | defined by the Mozilla Public License, v. 2.0. 374 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DSAlign 2 | DeepSpeech based forced alignment tool 3 | 4 | ## Installation 5 | 6 | It is recommended to use this tool from within a virtual environment. 7 | After cloning and changing to the root of the project, 8 | there is a script for creating one with all requirements in the git-ignored dir `venv`: 9 | 10 | ```shell script 11 | $ bin/createenv.sh 12 | $ ls venv 13 | bin include lib lib64 pyvenv.cfg share 14 | ``` 15 | 16 | `bin/align.sh` will automatically use it. 17 | 18 | Internally DSAlign uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT engine. 19 | To function, it requires a couple of files that are specific to 20 | the language of the speech data you want to align. 21 | If you want to align English, there is already a helper script that will download and prepare 22 | all required data: 23 | 24 | ```shell script 25 | $ bin/getmodel.sh 26 | [...] 27 | $ ls models/en/ 28 | alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie 29 | ``` 30 | 31 | ## Overview and documentation 32 | 33 | A typical application of the aligner is done in three phases: 34 | 35 | 1. __Preparing__ the data. Although most of this has to be done individually, 36 | there are some [tools for data preparation, statistics and maintenance](doc/tools.md). 37 | All involved file formats are described [here](doc/files.md). 38 | 2. __Aligning__ the data using [the alignment tool and its algorithm](doc/algo.md). 39 | 3. __Exporting__ aligned data using [the data-set exporter](doc/export.md). 40 | 41 | ## Quickstart example 42 | 43 | ### Example data 44 | 45 | There is a script for downloading and preparing some public domain speech and transcript data. 46 | It requires `ffmpeg` for some sample conversion. 47 | 48 | ```shell script 49 | $ bin/gettestdata.sh 50 | $ ls data 51 | test1 test2 52 | ``` 53 | 54 | ### Alignment using example data 55 | 56 | Now the aligner can be called either "manually" (specifying all involved files directly): 57 | 58 | ```shell script 59 | $ bin/align.sh --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log 60 | ``` 61 | 62 | Or "automatically" by specifying a so-called catalog file that bundles all involved paths: 63 | 64 | ```shell script 65 | $ bin/align.sh --catalog data/test1.catalog 66 | ```
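A catalog file is a JSON array of entries whose `audio`, `tlog`, `script` and `aligned` paths are resolved relative to the catalog file's own directory (see `main()` in `align/align.py`). As a minimal sketch - the entry values here are illustrative, not copied from the repository's `data/test1.catalog`:

```json
[
    {
        "audio": "test1/audio.wav",
        "tlog": "test1/transcript.log",
        "script": "test1/transcript.txt",
        "aligned": "test1/aligned.json"
    }
]
```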
67 | -------------------------------------------------------------------------------- /align/align.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import logging 4 | import argparse 5 | import deepspeech 6 | import subprocess 7 | import os.path as path 8 | import numpy as np 9 | import textdistance 10 | import multiprocessing 11 | from collections import Counter 12 | from search import FuzzySearch 13 | from glob import glob 14 | from text import Alphabet, TextCleaner, levenshtein, similarity 15 | from utils import enweight, log_progress 16 | from audio import DEFAULT_RATE, read_frames_from_file, vad_split 17 | from generate_lm import convert_and_filter_topk, build_lm 18 | from generate_package import create_bundle 19 | 20 | BEAM_WIDTH = 500 21 | LM_ALPHA = 1 22 | LM_BETA = 1.85 23 | 24 | ALGORITHMS = ['WNG', 'jaro_winkler', 'editex', 'levenshtein', 'mra', 'hamming'] 25 | SIM_DESC = 'From 0.0 (not equal at all) to 100.0 (totally equal)' 26 | NAMED_NUMBERS = { 27 | 'tlen': ('transcript length', int, None), 28 | 'mlen': ('match length', int, None), 29 | 'SWS': ('Smith-Waterman score', float, 'From 0.0 (not equal at all) to 100.0+ (pretty equal)'), 30 | 'WNG': ('weighted N-gram similarity', float, SIM_DESC), 31 | 'jaro_winkler': ('Jaro-Winkler similarity', float, SIM_DESC), 32 | 'editex': ('Editex similarity', float, SIM_DESC), 33 | 'levenshtein': ('Levenshtein similarity', float, SIM_DESC), 34 | 'mra': ('MRA similarity', float, SIM_DESC), 35 | 'hamming': ('Hamming similarity', float, SIM_DESC), 36 | 'CER': ('character error rate', float, 'From 0.0 (no wrong characters) to 100.0+ (total miss)'), 37 | 'WER': ('word error rate', float, 'From 0.0 (no different words) to 100.0+ (total miss)') 38 | } 39 | 40 | 41 | def fail(message, code=1): 42 | logging.fatal(message) 43 | exit(code) 44 | 45 | 46 | def read_script(script_path): 47 | tc = TextCleaner(alphabet, 48 | dashes_to_ws=not args.text_keep_dashes, 49 | normalize_space=not args.text_keep_ws, 50 | to_lower=not args.text_keep_casing) 51 | with open(script_path, 'r', encoding='utf-8') as script_file: 52 | content = script_file.read() 53 | if script_path.endswith('.script'): 54 | for phrase in json.loads(content): 55 | tc.add_original_text(phrase['text'], meta=phrase) 56 | elif args.text_meaningful_newlines: 57 | for phrase in content.split('\n'): 58 | tc.add_original_text(phrase) 59 | else: 60 | tc.add_original_text(content) 61 | return tc 62 | 63 | 64 | model = None 65 | 66 | def init_stt(output_graph_path, scorer_path): 67 | global model 68 | model = deepspeech.Model(output_graph_path) 69 | model.enableExternalScorer(scorer_path) 70 | logging.debug('Process {}: Loaded models'.format(os.getpid())) 71 | 72 | 73 | def stt(sample): 74 | time_start, time_end, audio = sample 75 | logging.debug('Process {}: Transcribing...'.format(os.getpid())) 76 | transcript = model.stt(audio) 77 |
logging.debug('Process {}: {}'.format(os.getpid(), transcript)) 78 | return time_start, time_end, ' '.join(transcript.split()) 79 | 80 | 81 | def align(triple): 82 | tlog, script, aligned = triple 83 | 84 | logging.debug("Loading script from %s..." % script) 85 | tc = read_script(script) 86 | search = FuzzySearch(tc.clean_text, 87 | max_candidates=args.align_max_candidates, 88 | candidate_threshold=args.align_candidate_threshold, 89 | match_score=args.align_match_score, 90 | mismatch_score=args.align_mismatch_score, 91 | gap_score=args.align_gap_score) 92 | 93 | logging.debug("Loading transcription log from %s..." % tlog) 94 | with open(tlog, 'r', encoding='utf-8') as transcription_log_file: 95 | fragments = json.load(transcription_log_file) 96 | end_fragments = (args.start + args.num_samples) if args.num_samples else len(fragments) 97 | fragments = fragments[args.start:end_fragments] 98 | for index, fragment in enumerate(fragments): 99 | meta = {} 100 | for key, value in list(fragment.items()): 101 | if key not in ['start', 'end', 'transcript']: 102 | meta[key] = value 103 | del fragment[key] 104 | fragment['meta'] = meta 105 | fragment['index'] = index 106 | fragment['transcript'] = fragment['transcript'].strip() 107 | 108 | reasons = Counter() 109 | 110 | def skip(index, reason): 111 | logging.info('Fragment {}: {}'.format(index, reason)) 112 | reasons[reason] += 1 113 | 114 | def split_match(fragments, start=0, end=-1): 115 | n = len(fragments) 116 | if n < 1: 117 | return 118 | elif n == 1: 119 | weighted_fragments = [(0, fragments[0])] 120 | else: 121 | # so we later know the original index of each fragment 122 | weighted_fragments = enumerate(fragments) 123 | # assigns high values to long statements near the center of the list 124 | weighted_fragments = enweight(weighted_fragments) 125 | weighted_fragments = map(lambda fw: (fw[0], (1 - fw[1]) * len(fw[0][1]['transcript'])), weighted_fragments) 126 | # fragments with highest weights first 127 | weighted_fragments = sorted(weighted_fragments, key=lambda fw: fw[1], reverse=True) 128 | # strip weights 129 | weighted_fragments = list(map(lambda fw: fw[0], weighted_fragments)) 130 | for index, fragment in weighted_fragments: 131 | match = search.find_best(fragment['transcript'], start=start, end=end) 132 | match_start, match_end, sws_score, match_substitutions = match 133 | if sws_score > (n - 1) / (2 * n): 134 | fragment['match-start'] = match_start 135 | fragment['match-end'] = match_end 136 | fragment['sws'] = sws_score 137 | fragment['substitutions'] = match_substitutions 138 | for f in split_match(fragments[0:index], start=start, end=match_start): 139 | yield f 140 | yield fragment 141 | for f in split_match(fragments[index + 1:], start=match_end, end=end): 142 | yield f 143 | return 144 | for _, _ in weighted_fragments: 145 | yield None 146 | 147 | matched_fragments = split_match(fragments) 148 | matched_fragments = list(filter(lambda f: f is not None, matched_fragments)) 149 | 150 | similarity_algos = {} 151 | 152 | def phrase_similarity(algo, a, b): 153 | if algo in similarity_algos: 154 | return similarity_algos[algo](a, b) 155 | algo_impl = lambda aa, bb: None 156 | if algo.lower() == 'wng': 157 | algo_impl = similarity_algos[algo] = lambda aa, bb: similarity( 158 | aa, 159 | bb, 160 | direction=1, 161 | min_ngram_size=args.align_wng_min_size, 162 | max_ngram_size=args.align_wng_max_size, 163 | size_factor=args.align_wng_size_factor, 164 | position_factor=args.align_wng_position_factor) 165 | elif algo in ALGORITHMS: 166 | 
algo_impl = similarity_algos[algo] = getattr(textdistance, algo).normalized_similarity 167 | else: 168 | logging.fatal('Unknown similarity metric "{}"'.format(algo)) 169 | exit(1) 170 | return algo_impl(a, b) 171 | 172 | def get_similarities(a, b, n, gap_text, gap_meta, direction): 173 | if direction < 0: 174 | a, b, gap_text, gap_meta = a[::-1], b[::-1], gap_text[::-1], gap_meta[::-1] 175 | similarities = list(map( 176 | lambda i: (args.align_word_snap_factor if gap_text[i + 1] == ' ' else 1) * 177 | (args.align_phrase_snap_factor if gap_meta[i + 1] is None else 1) * 178 | (phrase_similarity(args.align_similarity_algo, a, b + gap_text[1:i + 1])), 179 | range(n))) 180 | best = max((v, i) for i, v in enumerate(similarities))[1] if n > 0 else 0 181 | return best, similarities 182 | 183 | for index in range(len(matched_fragments) + 1): 184 | if index > 0: 185 | a = matched_fragments[index - 1] 186 | a_start, a_end = a['match-start'], a['match-end'] 187 | a_len = a_end - a_start 188 | a_stretch = int(a_len * args.align_stretch_fraction) 189 | a_shrink = int(a_len * args.align_shrink_fraction) 190 | a_end = a_end - a_shrink 191 | a_ext = a_shrink + a_stretch 192 | else: 193 | a = None 194 | a_start = a_end = 0 195 | if index < len(matched_fragments): 196 | b = matched_fragments[index] 197 | b_start, b_end = b['match-start'], b['match-end'] 198 | b_len = b_end - b_start 199 | b_stretch = int(b_len * args.align_stretch_fraction) 200 | b_shrink = int(b_len * args.align_shrink_fraction) 201 | b_start = b_start + b_shrink 202 | b_ext = b_shrink + b_stretch 203 | else: 204 | b = None 205 | b_start = b_end = len(search.text) 206 | 207 | assert a_end <= b_start 208 | assert a_start <= a_end 209 | assert b_start <= b_end 210 | if a_end == b_start or a_start == a_end or b_start == b_end: 211 | continue 212 | gap_text = tc.clean_text[a_end - 1:b_start + 1] 213 | gap_meta = tc.meta[a_end - 1:b_start + 1] 214 | 215 | if a: 216 | a_best_index, a_similarities = get_similarities(a['transcript'], 217 | tc.clean_text[a_start:a_end], 218 | min(len(gap_text) - 1, a_ext), 219 | gap_text, 220 | gap_meta, 221 | 1) 222 | a_best_end = a_best_index + a_end 223 | if b: 224 | b_best_index, b_similarities = get_similarities(b['transcript'], 225 | tc.clean_text[b_start:b_end], 226 | min(len(gap_text) - 1, b_ext), 227 | gap_text, 228 | gap_meta, 229 | -1) 230 | b_best_start = b_start - b_best_index 231 | 232 | if a and b and a_best_end > b_best_start: 233 | overlap_start = b_start - len(b_similarities) 234 | a_similarities = a_similarities[overlap_start - a_end:] 235 | b_similarities = b_similarities[:len(a_similarities)] 236 | best_index = max((sum(v), i) for i, v in enumerate(zip(a_similarities, b_similarities)))[1] 237 | a_best_end = b_best_start = overlap_start + best_index 238 | 239 | if a: 240 | a['match-end'] = a_best_end 241 | if b: 242 | b['match-start'] = b_best_start 243 | 244 | def apply_number(number_key, index, fragment, show, get_value): 245 | kl = number_key.lower() 246 | should_output = getattr(args, 'output_' + kl) 247 | min_val, max_val = getattr(args, 'output_min_' + kl), getattr(args, 'output_max_' + kl) 248 | if kl.endswith('len') and min_val is None: 249 | min_val = 1 250 | if should_output or min_val or max_val: 251 | val = get_value() 252 | if not kl.endswith('len'): 253 | show.insert(0, '{}: {:.2f}'.format(number_key, val)) 254 | if should_output: 255 | fragment[kl] = val 256 | reason_base = '{} ({})'.format(NAMED_NUMBERS[number_key][0], number_key) 257 | reason = None 258 | if min_val and val < 
min_val: 259 | reason = reason_base + ' too low' 260 | elif max_val and val > max_val: 261 | reason = reason_base + ' too high' 262 | if reason: 263 | skip(index, reason) 264 | return True 265 | return False 266 | 267 | substitutions = Counter() 268 | result_fragments = [] 269 | for fragment in matched_fragments: 270 | index = fragment['index'] 271 | time_start = fragment['start'] 272 | time_end = fragment['end'] 273 | fragment_transcript = fragment['transcript'] 274 | result_fragment = { 275 | 'start': time_start, 276 | 'end': time_end 277 | } 278 | sample_numbers = [] 279 | 280 | if apply_number('tlen', index, result_fragment, sample_numbers, lambda: len(fragment_transcript)): 281 | continue 282 | result_fragment['transcript'] = fragment_transcript 283 | 284 | if 'match-start' not in fragment or 'match-end' not in fragment: 285 | skip(index, 'No match for transcript') 286 | continue 287 | match_start, match_end = fragment['match-start'], fragment['match-end'] 288 | if match_end - match_start <= 0: 289 | skip(index, 'Empty match for transcript') 290 | continue 291 | original_start = tc.get_original_offset(match_start) 292 | original_end = tc.get_original_offset(match_end) 293 | result_fragment['text-start'] = original_start 294 | result_fragment['text-end'] = original_end 295 | 296 | meta_dict = {} 297 | for meta in list(tc.collect_meta(match_start, match_end)) + [fragment['meta']]: 298 | for key, value in meta.items(): 299 | if key == 'text': 300 | continue 301 | if key in meta_dict: 302 | values = meta_dict[key] 303 | else: 304 | values = meta_dict[key] = [] 305 | if value not in values: 306 | values.append(value) 307 | result_fragment['meta'] = meta_dict 308 | 309 | result_fragment['aligned-raw'] = tc.original_text[original_start:original_end] 310 | 311 | fragment_matched = tc.clean_text[match_start:match_end] 312 | if apply_number('mlen', index, result_fragment, sample_numbers, lambda: len(fragment_matched)): 313 | continue 314 | result_fragment['aligned'] = fragment_matched 315 | 316 | if apply_number('SWS', index, result_fragment, sample_numbers, lambda: 100 * fragment['sws']): 317 | continue 318 | 319 | should_skip = False 320 | for algo in ALGORITHMS: 321 | should_skip = should_skip or apply_number(algo, index, result_fragment, sample_numbers, 322 | lambda: 100 * phrase_similarity(algo, 323 | fragment_matched, 324 | fragment_transcript)) 325 | if should_skip: 326 | continue 327 | 328 | if apply_number('CER', index, result_fragment, sample_numbers, 329 | lambda: 100 * levenshtein(fragment_transcript, fragment_matched) / 330 | len(fragment_matched)): 331 | continue 332 | 333 | if apply_number('WER', index, result_fragment, sample_numbers, 334 | lambda: 100 * levenshtein(fragment_transcript.split(), fragment_matched.split()) / 335 | len(fragment_matched.split())): 336 | continue 337 | 338 | substitutions += fragment['substitutions'] 339 | 340 | result_fragments.append(result_fragment) 341 | logging.debug('Fragment %d aligned with %s' % (index, ' '.join(sample_numbers))) 342 | logging.debug('- T: ' + args.text_context * ' ' + '"%s"' % fragment_transcript) 343 | logging.debug('- O: %s|%s|%s' % ( 344 | tc.clean_text[match_start - args.text_context:match_start], 345 | fragment_matched, 346 | tc.clean_text[match_end:match_end + args.text_context])) 347 | if args.play: 348 | subprocess.check_call(['play', 349 | '--no-show-progress', 350 | args.audio, 351 | 'trim', 352 | str(time_start / 1000.0), 353 | '=' + str(time_end / 1000.0)]) 354 | with open(aligned, 'w', encoding='utf-8') as 
result_file: 355 | result_file.write(json.dumps(result_fragments, indent=4 if args.output_pretty else None, ensure_ascii=False)) 356 | return aligned, len(result_fragments), len(fragments) - len(result_fragments), reasons 357 | 358 | 359 | def main(): 360 | # Debug helpers 361 | logging.basicConfig() 362 | logging.root.setLevel(args.loglevel if args.loglevel else 20) 363 | 364 | def progress(it=None, desc='Processing', total=None): 365 | logging.info(desc) 366 | return it if args.no_progress else log_progress(it, interval=args.progress_interval, total=total) 367 | 368 | def resolve(base_path, spec_path): 369 | if spec_path is None: 370 | return None 371 | if not path.isabs(spec_path): 372 | spec_path = path.join(base_path, spec_path) 373 | return spec_path 374 | 375 | def exists(file_path): 376 | if file_path is None: 377 | return False 378 | return os.path.isfile(file_path) 379 | 380 | to_prepare = [] 381 | 382 | def enqueue_or_fail(audio, tlog, script, aligned, prefix=''): 383 | if exists(aligned) and not args.force: 384 | fail(prefix + 'Alignment file "{}" already exists - use --force to overwrite'.format(aligned)) 385 | if tlog is None: 386 | if args.ignore_missing: 387 | return 388 | fail(prefix + 'Missing transcription log path') 389 | if not exists(audio) and not exists(tlog): 390 | if args.ignore_missing: 391 | return 392 | fail(prefix + 'Both audio file "{}" and transcription log "{}" are missing'.format(audio, tlog)) 393 | if not exists(script): 394 | if args.ignore_missing: 395 | return 396 | fail(prefix + 'Missing script "{}"'.format(script)) 397 | to_prepare.append((audio, tlog, script, aligned)) 398 | 399 | if (args.audio or args.tlog) and args.script and args.aligned and not args.catalog: 400 | enqueue_or_fail(args.audio, args.tlog, args.script, args.aligned) 401 | elif args.catalog: 402 | if not exists(args.catalog): 403 | fail('Unable to load catalog file "{}"'.format(args.catalog)) 404 | catalog = path.abspath(args.catalog) 405 | catalog_dir = path.dirname(catalog) 406 | with open(catalog, 'r', encoding='utf-8') as catalog_file: 407 | catalog_entries = json.load(catalog_file) 408 | for entry in progress(catalog_entries, desc='Reading catalog'): 409 | enqueue_or_fail(resolve(catalog_dir, entry['audio']), 410 | resolve(catalog_dir, entry['tlog']), 411 | resolve(catalog_dir, entry['script']), 412 | resolve(catalog_dir, entry['aligned']), 413 | prefix='Problem loading catalog "{}" - '.format(catalog)) 414 | else: 415 | fail('You have to specify either a combination of "--audio/--tlog,--script,--aligned" or "--catalog"') 416 | 417 | logging.debug('Start') 418 | 419 | to_align = [] 420 | output_graph_path = None 421 | for audio_path, tlog_path, script_path, aligned_path in to_prepare: 422 | if not exists(tlog_path): 423 | generated_scorer = False 424 | if output_graph_path is None: 425 | logging.debug('Looking for model files in "{}"...'.format(model_dir)) 426 | output_graph_path = glob(model_dir + "/*.pbmm")[0] 427 | lang_scorer_path = glob(model_dir + "/*.scorer")[0] 428 | kenlm_path = 'dependencies/kenlm/build/bin' 429 | if not path.exists(kenlm_path): 430 | kenlm_path = None 431 | deepspeech_path = 'dependencies/deepspeech' 432 | if not path.exists(deepspeech_path): 433 | deepspeech_path = None 434 | if kenlm_path and deepspeech_path and not args.stt_no_own_lm: 435 | tc = read_script(script_path) 436 | if not tc.clean_text.strip(): 437 | logging.error('Cleaned transcript is empty for {}'.format(path.basename(script_path))) 438 | continue
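# Summary of the per-document scorer generation below: the cleaned script text
# is written to a temporary file, convert_and_filter_topk() keeps its 500000
# most frequent words, build_lm() builds a pruned 5-gram KenLM model from them,
# and create_bundle() packages lm.binary plus the vocabulary into a .scorer
# (the two float constants are presumably the scorer's default LM alpha/beta
# weights); all intermediate files are removed again afterwards.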
439 | clean_text_path = script_path + '.clean' 440 | with open(clean_text_path, 'w', encoding='utf-8') as clean_text_file: 441 | clean_text_file.write(tc.clean_text) 442 | 443 | scorer_path = script_path + '.scorer' 444 | if not path.exists(scorer_path): 445 | # Generate LM 446 | data_lower, vocab_str = convert_and_filter_topk(scorer_path, clean_text_path, 500000) 447 | build_lm(scorer_path, kenlm_path, 5, '85%', '0|0|1', True, 255, 8, 'trie', data_lower, vocab_str) 448 | os.remove(scorer_path + '.' + 'lower.txt.gz') 449 | os.remove(scorer_path + '.' + 'lm.arpa') 450 | os.remove(scorer_path + '.' + 'lm_filtered.arpa') 451 | os.remove(clean_text_path) 452 | 453 | # Generate scorer 454 | create_bundle(alphabet_path, scorer_path + '.' + 'lm.binary', scorer_path + '.' + 'vocab-500000.txt', scorer_path, False, 0.931289039105002, 1.1834137581510284) 455 | os.remove(scorer_path + '.' + 'lm.binary') 456 | os.remove(scorer_path + '.' + 'vocab-500000.txt') 457 | 458 | generated_scorer = True 459 | else: 460 | scorer_path = lang_scorer_path 461 | 462 | logging.debug('Loading acoustic model from "{}", alphabet from "{}" and scorer from "{}"...' 463 | .format(output_graph_path, alphabet_path, scorer_path)) 464 | 465 | # Run VAD on the input file 466 | logging.debug('Transcribing VAD segments...') 467 | frames = read_frames_from_file(audio_path, model_format, args.audio_vad_frame_length) 468 | segments = vad_split(frames, 469 | model_format, 470 | num_padding_frames=args.audio_vad_padding, 471 | threshold=args.audio_vad_threshold, 472 | aggressiveness=args.audio_vad_aggressiveness) 473 | 474 | def pre_filter(): 475 | for i, segment in enumerate(segments): 476 | segment_buffer, time_start, time_end = segment 477 | time_length = time_end - time_start 478 | if args.stt_min_duration and time_length < args.stt_min_duration: 479 | logging.info('Fragment {}: Audio too short for STT'.format(i)) 480 | continue 481 | if args.stt_max_duration and time_length > args.stt_max_duration: 482 | logging.info('Fragment {}: Audio too long for STT'.format(i)) 483 | continue 484 | yield (time_start, time_end, np.frombuffer(segment_buffer, dtype=np.int16)) 485 | 486 | samples = list(progress(pre_filter(), desc='VAD splitting')) 487 | 488 | pool = multiprocessing.Pool(initializer=init_stt, 489 | initargs=(output_graph_path, scorer_path), 490 | processes=args.stt_workers) 491 | transcripts = list(progress(pool.imap(stt, samples), desc='Transcribing', total=len(samples))) 492 | 493 | fragments = [] 494 | for time_start, time_end, segment_transcript in transcripts: 495 | if segment_transcript is None: 496 | continue 497 | fragments.append({ 498 | 'start': time_start, 499 | 'end': time_end, 500 | 'transcript': segment_transcript 501 | }) 502 | logging.debug('Excluded {} empty transcripts'.format(len(transcripts) - len(fragments))) 503 | 504 | logging.debug('Writing transcription log to file "{}"...'.format(tlog_path)) 505 | with open(tlog_path, 'w', encoding='utf-8') as tlog_file: 506 | tlog_file.write(json.dumps(fragments, indent=4 if args.output_pretty else None, ensure_ascii=False)) 507 | 508 | # Remove scorer if generated 509 | if generated_scorer: 510 | os.remove(scorer_path) 511 | if not path.isfile(tlog_path): 512 | fail('Problem loading transcript from "{}"'.format(tlog_path)) 513 | to_align.append((tlog_path, script_path, aligned_path)) 514 | 515 | total_fragments = 0 516 | dropped_fragments = 0 517 | reasons = Counter() 518 | 519 | index = 0 520 | pool = multiprocessing.Pool(processes=args.align_workers) 521 | for aligned_file,
file_total_fragments, file_dropped_fragments, file_reasons in \ 522 | progress(pool.imap_unordered(align, to_align), desc='Aligning', total=len(to_align)): 523 | if args.no_progress: 524 | index += 1 525 | logging.info('Aligned file {} of {} - wrote results to "{}"'.format(index, len(to_align), aligned_file)) 526 | total_fragments += file_total_fragments 527 | dropped_fragments += file_dropped_fragments 528 | reasons += file_reasons 529 | 530 | logging.info('Aligned {} fragments'.format(total_fragments)) 531 | if total_fragments > 0 and dropped_fragments > 0: 532 | logging.info('Dropped {} fragments {:0.2f}%:'.format(dropped_fragments, 533 | dropped_fragments * 100.0 / total_fragments)) 534 | for key, number in reasons.most_common(): 535 | logging.info(' - {}: {}'.format(key, number)) 536 | 537 | 538 | def parse_args(): 539 | parser = argparse.ArgumentParser(description='Force align speech data with a transcript.') 540 | 541 | parser.add_argument('--audio', type=str, 542 | help='Path to speech audio file') 543 | parser.add_argument('--tlog', type=str, 544 | help='Path to STT transcription log (.tlog)') 545 | parser.add_argument('--script', type=str, 546 | help='Path to original transcript (plain text or .script file)') 547 | parser.add_argument('--catalog', type=str, 548 | help='Path to a catalog file with paths to transcription log or audio, original script and ' 549 | '(target) alignment files') 550 | parser.add_argument('--aligned', type=str, 551 | help='Alignment result file (.aligned)') 552 | parser.add_argument('--force', action="store_true", 553 | help='Overwrite existing files') 554 | parser.add_argument('--ignore-missing', action="store_true", 555 | help='Ignores catalog entries with missing paths') 556 | parser.add_argument('--loglevel', type=int, required=False, default=20, 557 | help='Log level (between 0 and 50) - default: 20') 558 | parser.add_argument('--no-progress', action="store_true", 559 | help='Prevents showing progress indication') 560 | parser.add_argument('--progress-interval', type=float, default=1.0, 561 | help='Progress indication interval in seconds') 562 | parser.add_argument('--play', action="store_true", 563 | help='Play audio fragments as they are matched using SoX audio tool') 564 | parser.add_argument('--text-context', type=int, required=False, default=10, 565 | help='Size of textual context for logged statements - default: 10') 566 | parser.add_argument('--start', type=int, required=False, default=0, 567 | help='Start alignment process at given offset of transcribed fragments') 568 | parser.add_argument('--num-samples', type=int, required=False, 569 | help='Number of fragments to align') 570 | parser.add_argument('--alphabet', required=False, 571 | help='Path to an alphabet file (overriding the one from --stt-model-dir)') 572 | 573 | audio_group = parser.add_argument_group(title='Audio pre-processing options') 574 | audio_group.add_argument('--audio-vad-aggressiveness', type=int, choices=range(4), default=3, 575 | help='Aggressiveness of voice activity detection in a frame (default: 3)') 576 | audio_group.add_argument('--audio-vad-padding', type=int, default=10, 577 | help='Number of padding audio frames in VAD ring-buffer') 578 | audio_group.add_argument('--audio-vad-threshold', type=float, default=0.5, 579 | help='VAD ring-buffer threshold for voiced frames ' 580 | '(e.g. 
0.5 -> 50%% of the ring-buffer frames have to be voiced ' 581 | 'for triggering a split)') 582 | audio_group.add_argument('--audio-vad-frame-length', choices=[10, 20, 30], default=30, 583 | help='VAD audio frame length in ms (10, 20 or 30)') 584 | 585 | stt_group = parser.add_argument_group(title='STT options') 586 | stt_group.add_argument('--stt-model-rate', type=int, default=DEFAULT_RATE, 587 | help='Supported sample rate of the acoustic model') 588 | stt_group.add_argument('--stt-model-dir', required=False, 589 | help='Path to a directory with output_graph, scorer and (optional) alphabet file ' + 590 | '(default: "models/en")') 591 | stt_group.add_argument('--stt-no-own-lm', action="store_true", 592 | help='Deactivates creation of individual language models per document. ' + 593 | 'Uses the one from the model dir instead.') 594 | stt_group.add_argument('--stt-workers', type=int, required=False, default=1, 595 | help='Number of parallel STT workers - should be 1 for GPU based DeepSpeech') 596 | stt_group.add_argument('--stt-min-duration', type=int, required=False, default=100, 597 | help='Minimum speech fragment duration in milliseconds to transcribe (default: 100)') 598 | stt_group.add_argument('--stt-max-duration', type=int, required=False, 599 | help='Maximum speech fragment duration in milliseconds to transcribe (default: no limit)') 600 | 601 | text_group = parser.add_argument_group(title='Text pre-processing options') 602 | text_group.add_argument('--text-meaningful-newlines', action="store_true", 603 | help='Newlines from plain text file separate phrases/speakers. ' 604 | '(see --align-phrase-snap-factor)') 605 | text_group.add_argument('--text-keep-dashes', action="store_true", 606 | help='No replacing of dashes with spaces. Whether they are kept at all depends on the alphabet.') 607 | text_group.add_argument('--text-keep-ws', action="store_true", 608 | help='No normalization of whitespace. Keep it as it is.') 609 | text_group.add_argument('--text-keep-casing', action="store_true", 610 | help='No lower-casing of characters. Keep them as they are.') 611 |
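# Note: the three Smith-Waterman scores below parameterize FuzzySearch for the
# coarse (global) matching phase in align(); the snap factors and wng options
# only affect the subsequent fine-alignment of fragment boundaries.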
612 | align_group = parser.add_argument_group(title='Alignment algorithm options') 613 | align_group.add_argument('--align-workers', type=int, required=False, 614 | help='Number of parallel alignment workers - defaults to number of CPUs') 615 | align_group.add_argument('--align-max-candidates', type=int, required=False, default=10, 616 | help='How many global 3gram match candidates are tested at max (default: 10)') 617 | align_group.add_argument('--align-candidate-threshold', type=float, required=False, default=0.92, 618 | help='Factor for how many 3grams the next candidate should have at least ' + 619 | 'compared to its predecessor (default: 0.92)') 620 | align_group.add_argument('--align-match-score', type=int, required=False, default=100, 621 | help='Matching score for Smith-Waterman alignment (default: 100)') 622 | align_group.add_argument('--align-mismatch-score', type=int, required=False, default=-100, 623 | help='Mismatch score for Smith-Waterman alignment (default: -100)') 624 | align_group.add_argument('--align-gap-score', type=int, required=False, default=-100, 625 | help='Gap score for Smith-Waterman alignment (default: -100)') 626 | align_group.add_argument('--align-shrink-fraction', type=float, required=False, default=0.1, 627 | help='Length fraction by which the fragment may get shrunk during fine alignment') 628 | align_group.add_argument('--align-stretch-fraction', type=float, required=False, default=0.25, 629 | help='Length fraction by which the fragment may get stretched during fine alignment') 630 | align_group.add_argument('--align-word-snap-factor', type=float, required=False, default=1.5, 631 | help='Priority factor for snapping matched texts to word boundaries ' 632 | '(default: 1.5 - slightly snappy)') 633 | align_group.add_argument('--align-phrase-snap-factor', type=float, required=False, default=1.0, 634 | help='Priority factor for snapping matched texts to phrase boundaries ' 635 | '(default: 1.0 - no snapping)') 636 | align_group.add_argument('--align-similarity-algo', type=str, required=False, default='wng', 637 | help='Similarity algorithm during fine-alignment - one of ' 638 | 'wng|editex|levenshtein|mra|hamming|jaro_winkler (default: wng)') 639 | align_group.add_argument('--align-wng-min-size', type=int, required=False, default=1, 640 | help='Minimum N-gram size for weighted N-gram similarity ' 641 | 'during fine-alignment (default: 1)') 642 | align_group.add_argument('--align-wng-max-size', type=int, required=False, default=3, 643 | help='Maximum N-gram size for weighted N-gram similarity ' 644 | 'during fine-alignment (default: 3)') 645 | align_group.add_argument('--align-wng-size-factor', type=float, required=False, default=1, 646 | help='Size weight for weighted N-gram similarity ' 647 | 'during fine-alignment (default: 1)') 648 | align_group.add_argument('--align-wng-position-factor', type=float, required=False, default=2.5, 649 | help='Position weight for weighted N-gram similarity ' 650 | 'during fine-alignment (default: 2.5)') 651 | 652 | output_group = parser.add_argument_group(title='Output options') 653 | output_group.add_argument('--output-pretty', action="store_true", 654 | help='Writes indented JSON output') 655 | 656 | for short in NAMED_NUMBERS.keys(): 657 | long, atype, desc = NAMED_NUMBERS[short] 658 | desc = (' - value range: ' + desc) if desc else '' 659 | output_group.add_argument('--output-' + short.lower(), action="store_true", 660 | help='Writes {} ({}) to output'.format(long, short))
661 | for extreme in ['Min', 'Max']: 662 | output_group.add_argument('--output-' + extreme.lower() + '-' + short.lower(), type=atype, required=False, 663 | help='{}imum {} ({}) the STT transcript of the audio ' 664 | 'has to have when compared with the original text{}' 665 | .format(extreme, long, short, desc)) 666 | 667 | return parser.parse_args() 668 | 669 | 670 | if __name__ == '__main__': 671 | args = parse_args() 672 | model_dir = os.path.expanduser(args.stt_model_dir if args.stt_model_dir else 'models/en') 673 | if args.alphabet is not None: 674 | alphabet_path = args.alphabet 675 | else: 676 | alphabet_path = os.path.join(model_dir, 'alphabet.txt') 677 | if not os.path.isfile(alphabet_path): 678 | fail('Found no alphabet file') 679 | logging.debug('Loading alphabet from "{}"...'.format(alphabet_path)) 680 | alphabet = Alphabet(alphabet_path) 681 | model_format = (args.stt_model_rate, 1, 2) 682 | main() 683 | -------------------------------------------------------------------------------- /align/audio.py: -------------------------------------------------------------------------------- 1 | import os 2 | import io 3 | import sox 4 | import wave 5 | import tempfile 6 | import collections 7 | import numpy as np 8 | 9 | from webrtcvad import Vad 10 | from utils import LimitingPool 11 | 12 | DEFAULT_RATE = 16000 13 | DEFAULT_CHANNELS = 1 14 | DEFAULT_WIDTH = 2 15 | DEFAULT_FORMAT = (DEFAULT_RATE, DEFAULT_CHANNELS, DEFAULT_WIDTH) 16 | 17 | AUDIO_TYPE_NP = 'application/vnd.mozilla.np' 18 | AUDIO_TYPE_PCM = 'application/vnd.mozilla.pcm' 19 | AUDIO_TYPE_WAV = 'audio/wav' 20 | AUDIO_TYPE_OPUS = 'application/vnd.mozilla.opus' 21 | SERIALIZABLE_AUDIO_TYPES = [AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS] 22 | 23 | OPUS_PCM_LEN_SIZE = 4 24 | OPUS_RATE_SIZE = 4 25 | OPUS_CHANNELS_SIZE = 1 26 | OPUS_WIDTH_SIZE = 1 27 | OPUS_CHUNK_LEN_SIZE = 2 28 | 29 | 30 | class Sample: 31 | """Represents in-memory audio data of a certain (convertible) representation. 32 | Attributes: 33 | audio_type (str): See `__init__`. 34 | audio_format (tuple:(int, int, int)): See `__init__`. 35 | audio (obj): Audio data represented as indicated by `audio_type` 36 | duration (float): Audio duration of the sample in seconds 37 | """ 38 | def __init__(self, audio_type, raw_data, audio_format=None): 39 | """ 40 | Creates a Sample from a raw audio representation. 41 | :param audio_type: Audio data representation type 42 | Supported types: 43 | - AUDIO_TYPE_OPUS: Memory file representation (BytesIO) of Opus encoded audio 44 | wrapped by a custom container format (used in SDBs) 45 | - AUDIO_TYPE_WAV: Memory file representation (BytesIO) of a Wave file 46 | - AUDIO_TYPE_PCM: Binary representation (bytearray) of PCM encoded audio data (Wave file without header) 47 | - AUDIO_TYPE_NP: NumPy representation of audio data (np.float32) - typically used for GPU feeding 48 | :param raw_data: Audio data in the form of the provided representation type (see audio_type). 49 | For types AUDIO_TYPE_OPUS or AUDIO_TYPE_WAV data can also be passed as a bytearray. 50 | :param audio_format: Tuple of sample-rate, number of channels and sample-width. 51 | Required in case of audio_type = AUDIO_TYPE_PCM or AUDIO_TYPE_NP, 52 | as this information cannot be derived from raw audio data.
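Example (illustrative sketch only - assumes `pcm_bytes` holds raw 16 kHz mono 16 bit PCM data):
sample = Sample(AUDIO_TYPE_PCM, pcm_bytes, audio_format=DEFAULT_FORMAT)
sample.change_audio_type(AUDIO_TYPE_NP) # in-place conversion, e.g. for model feeding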
53 | """ 54 | self.audio_type = audio_type 55 | self.audio_format = audio_format 56 | if audio_type in SERIALIZABLE_AUDIO_TYPES: 57 | self.audio = raw_data if isinstance(raw_data, io.BytesIO) else io.BytesIO(raw_data) 58 | self.duration = read_duration(audio_type, self.audio) 59 | else: 60 | self.audio = raw_data 61 | if self.audio_format is None: 62 | raise ValueError('For audio type "{}" parameter "audio_format" is mandatory'.format(self.audio_type)) 63 | if audio_type == AUDIO_TYPE_PCM: 64 | self.duration = get_pcm_duration(len(self.audio), self.audio_format) 65 | elif audio_type == AUDIO_TYPE_NP: 66 | self.duration = get_np_duration(len(self.audio), self.audio_format) 67 | else: 68 | raise ValueError('Unsupported audio type: {}'.format(self.audio_type)) 69 | 70 | def change_audio_type(self, new_audio_type): 71 | """ 72 | In-place conversion of audio data into a different representation. 73 | :param new_audio_type: New audio-type - see `__init__`. 74 | Not supported: Converting from AUDIO_TYPE_NP into any other type. 75 | """ 76 | if self.audio_type == new_audio_type: 77 | return 78 | if new_audio_type == AUDIO_TYPE_PCM and self.audio_type in SERIALIZABLE_AUDIO_TYPES: 79 | self.audio_format, audio = read_audio(self.audio_type, self.audio) 80 | self.audio.close() 81 | self.audio = audio 82 | elif new_audio_type == AUDIO_TYPE_NP: 83 | self.change_audio_type(AUDIO_TYPE_PCM) 84 | self.audio = pcm_to_np(self.audio_format, self.audio) 85 | elif new_audio_type in SERIALIZABLE_AUDIO_TYPES: 86 | self.change_audio_type(AUDIO_TYPE_PCM) 87 | audio_bytes = io.BytesIO() 88 | write_audio(new_audio_type, audio_bytes, self.audio_format, self.audio) 89 | audio_bytes.seek(0) 90 | self.audio = audio_bytes 91 | else: 92 | raise RuntimeError('Changing audio representation type from "{}" to "{}" not supported' 93 | .format(self.audio_type, new_audio_type)) 94 | self.audio_type = new_audio_type 95 | 96 | 97 | def change_audio_types(samples, audio_type=AUDIO_TYPE_PCM, processes=None): 98 | def change_audio_type(sample): 99 | sample.change_audio_type(audio_type) 100 | return sample 101 | with LimitingPool(processes=processes) as pool: 102 | for current_sample in pool.map(change_audio_type, samples): 103 | yield current_sample 104 | 105 | 106 | def read_audio_format_from_wav_file(wav_file): 107 | return wav_file.getframerate(), wav_file.getnchannels(), wav_file.getsampwidth() 108 | 109 | 110 | def write_audio_format_to_wav_file(wav_file, audio_format=DEFAULT_FORMAT): 111 | rate, channels, width = audio_format 112 | wav_file.setframerate(rate) 113 | wav_file.setnchannels(channels) 114 | wav_file.setsampwidth(width) 115 | 116 | 117 | def get_num_samples(pcm_buffer_size, audio_format=DEFAULT_FORMAT): 118 | _, channels, width = audio_format 119 | return pcm_buffer_size // (channels * width) 120 | 121 | 122 | def get_pcm_duration(pcm_buffer_size, audio_format=DEFAULT_FORMAT): 123 | return get_num_samples(pcm_buffer_size, audio_format) / audio_format[0] 124 | 125 | 126 | def get_np_duration(np_len, audio_format=DEFAULT_FORMAT): 127 | return np_len / audio_format[0] 128 | 129 | 130 | def convert_audio(src_audio_path, dst_audio_path, file_type=None, audio_format=DEFAULT_FORMAT): 131 | sample_rate, channels, width = audio_format 132 | try: 133 | transformer = sox.Transformer() 134 | transformer.set_output_format(file_type=file_type, rate=sample_rate, channels=channels, bits=width*8) 135 | transformer.build(src_audio_path, dst_audio_path) 136 | except sox.core.SoxError: 137 | return False 138 | return True 139 | 140 | 141 
| def verify_wav_file(wav_path):
142 |     try:
143 |         with wave.open(wav_path, 'r') as wav_file:
144 |             if wav_file.getnframes() > 0:
145 |                 return True
146 |     except Exception:
147 |         return False
148 |     return False
149 |
150 |
151 | def ensure_wav_with_format(src_audio_path, audio_format=DEFAULT_FORMAT, tmp_dir=None):
152 |     if src_audio_path.endswith('.wav'):
153 |         with wave.open(src_audio_path, 'r') as src_audio_file:
154 |             if read_audio_format_from_wav_file(src_audio_file) == audio_format:
155 |                 return src_audio_path, False
156 |     fd, tmp_file_path = tempfile.mkstemp(suffix='.wav', dir=tmp_dir)
157 |     os.close(fd)
158 |     # sox re-opens the temporary file by path, so the mkstemp handle is closed right away
159 |     if convert_audio(src_audio_path, tmp_file_path, file_type='wav', audio_format=audio_format):
160 |         return tmp_file_path, True
161 |     os.remove(tmp_file_path)
162 |     return None, False
163 |
164 |
165 | def extract_audio(audio_file, start, end):
166 |     assert 0 <= start <= end
167 |     rate = audio_file.getframerate()
168 |     audio_file.setpos(int(start * rate))
169 |     return audio_file.readframes(int((end - start) * rate))
170 |
171 |
172 | class AudioFile:
173 |     def __init__(self, audio_path, as_path=False, audio_format=DEFAULT_FORMAT):
174 |         self.audio_path = audio_path
175 |         self.audio_format = audio_format
176 |         self.as_path = as_path
177 |         self.open_file = None
178 |         self.tmp_file_path = None
179 |
180 |     def __enter__(self):
181 |         if self.audio_path.endswith('.wav'):
182 |             self.open_file = wave.open(self.audio_path, 'r')
183 |             if read_audio_format_from_wav_file(self.open_file) == self.audio_format:
184 |                 if self.as_path:
185 |                     self.open_file.close()
186 |                     return self.audio_path
187 |                 return self.open_file
188 |             self.open_file.close()
189 |         fd, self.tmp_file_path = tempfile.mkstemp(suffix='.wav')
190 |         os.close(fd)
191 |         # the conversion below writes to tmp_file_path by path; the descriptor is not needed
192 |         if not convert_audio(self.audio_path, self.tmp_file_path, file_type='wav', audio_format=self.audio_format):
193 |             raise RuntimeError('Unable to convert "{}" to required format'.format(self.audio_path))
194 |         if self.as_path:
195 |             return self.tmp_file_path
196 |         self.open_file = wave.open(self.tmp_file_path, 'r')
197 |         return self.open_file
198 |
199 |     def __exit__(self, *args):
200 |         if self.open_file is not None and not self.as_path:
201 |             self.open_file.close()
202 |         self.open_file = None
203 |         if self.tmp_file_path is not None:
204 |             os.remove(self.tmp_file_path)
205 |             self.tmp_file_path = None
206 |
207 |
208 | def read_frames(wav_file, frame_duration_ms=30, yield_remainder=False):
209 |     audio_format = read_audio_format_from_wav_file(wav_file)
210 |     frame_size = int(audio_format[0] * (frame_duration_ms / 1000.0))
211 |     while True:
212 |         try:
213 |             data = wav_file.readframes(frame_size)
214 |             if not data or (not yield_remainder and get_pcm_duration(len(data), audio_format) * 1000 < frame_duration_ms):
215 |                 break
216 |             yield data
217 |         except EOFError:
218 |             break
219 |
220 |
221 | def read_frames_from_file(audio_path, audio_format=DEFAULT_FORMAT, frame_duration_ms=30, yield_remainder=False):
222 |     with AudioFile(audio_path, audio_format=audio_format) as wav_file:
223 |         for frame in read_frames(wav_file, frame_duration_ms=frame_duration_ms, yield_remainder=yield_remainder):
224 |             yield frame
225 |
226 |
227 | def vad_split(audio_frames,
228 |               audio_format=DEFAULT_FORMAT,
229 |               num_padding_frames=10,
230 |               threshold=0.5,
231 |               aggressiveness=3):
232 |     sample_rate, channels, width = audio_format
233 |     if channels != 1:
234 |         raise ValueError('VAD-splitting requires mono samples')
235 |     if width != 2:
236 |         raise ValueError('VAD-splitting requires 16 bit samples')
237 |     if sample_rate
not in [8000, 16000, 32000, 48000]: 238 | raise ValueError('VAD-splitting only supported for sample rates 8000, 16000, 32000, or 48000') 239 | if aggressiveness not in [0, 1, 2, 3]: 240 | raise ValueError('VAD-splitting aggressiveness mode has to be one of 0, 1, 2, or 3') 241 | ring_buffer = collections.deque(maxlen=num_padding_frames) 242 | triggered = False 243 | vad = Vad(int(aggressiveness)) 244 | voiced_frames = [] 245 | frame_duration_ms = 0 246 | frame_index = 0 247 | for frame_index, frame in enumerate(audio_frames): 248 | frame_duration_ms = get_pcm_duration(len(frame), audio_format) * 1000 249 | if int(frame_duration_ms) not in [10, 20, 30]: 250 | raise ValueError('VAD-splitting only supported for frame durations 10, 20, or 30 ms') 251 | is_speech = vad.is_speech(frame, sample_rate) 252 | if not triggered: 253 | ring_buffer.append((frame, is_speech)) 254 | num_voiced = len([f for f, speech in ring_buffer if speech]) 255 | if num_voiced > threshold * ring_buffer.maxlen: 256 | triggered = True 257 | for f, s in ring_buffer: 258 | voiced_frames.append(f) 259 | ring_buffer.clear() 260 | else: 261 | voiced_frames.append(frame) 262 | ring_buffer.append((frame, is_speech)) 263 | num_unvoiced = len([f for f, speech in ring_buffer if not speech]) 264 | if num_unvoiced > threshold * ring_buffer.maxlen: 265 | triggered = False 266 | yield b''.join(voiced_frames), \ 267 | frame_duration_ms * max(0, frame_index - len(voiced_frames)), \ 268 | frame_duration_ms * frame_index 269 | ring_buffer.clear() 270 | voiced_frames = [] 271 | if len(voiced_frames) > 0: 272 | yield b''.join(voiced_frames), \ 273 | frame_duration_ms * (frame_index - len(voiced_frames)), \ 274 | frame_duration_ms * (frame_index + 1) 275 | 276 | 277 | def pack_number(n, num_bytes): 278 | return n.to_bytes(num_bytes, 'big', signed=False) 279 | 280 | 281 | def unpack_number(data): 282 | return int.from_bytes(data, 'big', signed=False) 283 | 284 | 285 | def get_opus_frame_size(rate): 286 | return 60 * rate // 1000 287 | 288 | 289 | def write_opus(opus_file, audio_format, audio_data): 290 | rate, channels, width = audio_format 291 | frame_size = get_opus_frame_size(rate) 292 | import opuslib 293 | encoder = opuslib.Encoder(rate, channels, opuslib.APPLICATION_AUDIO) 294 | chunk_size = frame_size * channels * width 295 | opus_file.write(pack_number(len(audio_data), OPUS_PCM_LEN_SIZE)) 296 | opus_file.write(pack_number(rate, OPUS_RATE_SIZE)) 297 | opus_file.write(pack_number(channels, OPUS_CHANNELS_SIZE)) 298 | opus_file.write(pack_number(width, OPUS_WIDTH_SIZE)) 299 | for i in range(0, len(audio_data), chunk_size): 300 | chunk = audio_data[i:i + chunk_size] 301 | # Preventing non-deterministic encoding results from uninitialized remainder of the encoder buffer 302 | if len(chunk) < chunk_size: 303 | chunk = chunk + bytearray(chunk_size - len(chunk)) 304 | encoded = encoder.encode(chunk, frame_size) 305 | opus_file.write(pack_number(len(encoded), OPUS_CHUNK_LEN_SIZE)) 306 | opus_file.write(encoded) 307 | 308 | 309 | def read_opus_header(opus_file): 310 | opus_file.seek(0) 311 | pcm_buffer_size = unpack_number(opus_file.read(OPUS_PCM_LEN_SIZE)) 312 | rate = unpack_number(opus_file.read(OPUS_RATE_SIZE)) 313 | channels = unpack_number(opus_file.read(OPUS_CHANNELS_SIZE)) 314 | width = unpack_number(opus_file.read(OPUS_WIDTH_SIZE)) 315 | return pcm_buffer_size, (rate, channels, width) 316 | 317 | 318 | def read_opus(opus_file): 319 | pcm_buffer_size, audio_format = read_opus_header(opus_file) 320 | rate, channels, _ = audio_format 321 | 
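# Usage sketch for the VAD splitter above, assuming a 16 kHz mono 16-bit WAV
# at a hypothetical path; vad_split() yields (pcm_bytes, start_ms, end_ms):
frames = read_frames_from_file('recording.wav', frame_duration_ms=30)
for segment, start_ms, end_ms in vad_split(frames, aggressiveness=3):
    print('voiced segment {:.0f}-{:.0f} ms ({} bytes)'.format(start_ms, end_ms, len(segment)))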
frame_size = get_opus_frame_size(rate)
322 |     import opuslib
323 |     decoder = opuslib.Decoder(rate, channels)
324 |     audio_data = bytearray()
325 |     while len(audio_data) < pcm_buffer_size:
326 |         chunk_len = unpack_number(opus_file.read(OPUS_CHUNK_LEN_SIZE))
327 |         chunk = opus_file.read(chunk_len)
328 |         decoded = decoder.decode(chunk, frame_size)
329 |         audio_data.extend(decoded)
330 |     audio_data = audio_data[:pcm_buffer_size]
331 |     return audio_format, audio_data
332 |
333 |
334 | def write_wav(wav_file, audio_format, pcm_data):
335 |     with wave.open(wav_file, 'wb') as wav_file_writer:
336 |         write_audio_format_to_wav_file(wav_file_writer, audio_format=audio_format)
337 |         wav_file_writer.writeframes(pcm_data)
338 |
339 |
340 | def read_wav(wav_file):
341 |     wav_file.seek(0)
342 |     with wave.open(wav_file, 'rb') as wav_file_reader:
343 |         audio_format = read_audio_format_from_wav_file(wav_file_reader)
344 |         pcm_data = wav_file_reader.readframes(wav_file_reader.getnframes())
345 |     # wav_file is a caller-owned file-like object - it must not be closed here
346 |     return audio_format, pcm_data
347 |
348 |
349 | def read_audio(audio_type, audio_file):
350 |     if audio_type == AUDIO_TYPE_WAV:
351 |         return read_wav(audio_file)
352 |     if audio_type == AUDIO_TYPE_OPUS:
353 |         return read_opus(audio_file)
354 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
355 |
356 |
357 | def write_audio(audio_type, audio_file, audio_format, pcm_data):
358 |     if audio_type == AUDIO_TYPE_WAV:
359 |         return write_wav(audio_file, audio_format, pcm_data)
360 |     if audio_type == AUDIO_TYPE_OPUS:
361 |         return write_opus(audio_file, audio_format, pcm_data)
362 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
363 |
364 |
365 | def read_wav_duration(wav_file):
366 |     wav_file.seek(0)
367 |     with wave.open(wav_file, 'rb') as wav_file_reader:
368 |         return wav_file_reader.getnframes() / wav_file_reader.getframerate()
369 |
370 |
371 | def read_opus_duration(opus_file):
372 |     pcm_buffer_size, audio_format = read_opus_header(opus_file)
373 |     return get_pcm_duration(pcm_buffer_size, audio_format)
374 |
375 |
376 | def read_duration(audio_type, audio_file):
377 |     if audio_type == AUDIO_TYPE_WAV:
378 |         return read_wav_duration(audio_file)
379 |     if audio_type == AUDIO_TYPE_OPUS:
380 |         return read_opus_duration(audio_file)
381 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
382 |
383 |
384 | def pcm_to_np(audio_format, pcm_data):
385 |     _, channels, width = audio_format
386 |     if width not in [1, 2, 4]:
387 |         raise ValueError('Unsupported sample width: {}'.format(width))
388 |     dtype = [None, np.int8, np.int16, None, np.int32][width]
389 |     samples = np.frombuffer(pcm_data, dtype=dtype)
390 |     samples = samples[::channels]  # limited to mono for now
391 |     samples = samples.astype(np.float32) / np.iinfo(dtype).max
392 |     return np.expand_dims(samples, axis=1)
393 |
--------------------------------------------------------------------------------
/align/catalog_tool.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | Tool for combining and converting paths within catalog files
4 | """
5 | import sys
6 | import json
7 | import argparse
8 |
9 | from glob import glob
10 | from pathlib import Path
11 |
12 |
13 | def fail(message):
14 |     print(message)
15 |     sys.exit(1)
16 |
17 |
18 | def build_catalog():
19 |     catalog_paths = []
20 |     for source_glob in CLI_ARGS.sources:
21 |         catalog_paths.extend(glob(source_glob))
22 |     items = []
23 |     for catalog_original_path in catalog_paths:
24 |         catalog_path
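# Sketch for pcm_to_np() further above: 16-bit PCM becomes float32 roughly in
# [-1, 1] with shape (num_samples, 1); numpy is assumed imported as np, as
# elsewhere in audio.py:
pcm = np.array([0, 16384, -32768], dtype=np.int16).tobytes()
print(pcm_to_np((16000, 1, 2), pcm))   # approximately [[0.0], [0.5], [-1.0]], shape (3, 1)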
= Path(catalog_original_path).absolute() 25 | print('Loading catalog "{}"'.format(str(catalog_original_path))) 26 | if not catalog_path.is_file(): 27 | fail('Unable to find catalog file "{}"'.format(str(catalog_path))) 28 | with open(catalog_path, 'r', encoding='utf-8') as catalog_file: 29 | catalog_items = json.load(catalog_file) 30 | base_path = catalog_path.parent.absolute() 31 | for item in catalog_items: 32 | new_item = {} 33 | for entry, entry_original_path in item.items(): 34 | entry_path = Path(entry_original_path) 35 | entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path).absolute() 36 | if ((len(CLI_ARGS.check) == 1 and CLI_ARGS.check[0] == 'all') 37 | or entry in CLI_ARGS.check) and not entry_path.is_file(): 38 | note = 'Catalog "{}" - Missing file for "{}" ("{}")'.format( 39 | str(catalog_original_path), entry, str(entry_original_path)) 40 | if CLI_ARGS.on_miss == 'fail': 41 | fail(note + ' - aborting') 42 | if CLI_ARGS.on_miss == 'ignore': 43 | print(note + ' - keeping it as it is') 44 | new_item[entry] = str(entry_path) 45 | elif CLI_ARGS.on_miss == 'drop': 46 | print(note + ' - dropping catalog item') 47 | new_item = None 48 | break 49 | else: 50 | print(note + ' - removing entry from item') 51 | else: 52 | new_item[entry] = str(entry_path) 53 | if CLI_ARGS.output is not None and new_item is not None and len(new_item.keys()) > 0: 54 | items.append(new_item) 55 | if CLI_ARGS.output is not None: 56 | catalog_path = Path(CLI_ARGS.output).absolute() 57 | print('Writing catalog "{}"'.format(str(CLI_ARGS.output))) 58 | if CLI_ARGS.make_relative: 59 | base_path = catalog_path.parent 60 | for item in items: 61 | for entry in item.keys(): 62 | item[entry] = str(Path(item[entry]).relative_to(base_path)) 63 | if CLI_ARGS.order_by is not None: 64 | items.sort(key=lambda i: i[CLI_ARGS.order_by] if CLI_ARGS.order_by in i else '') 65 | with open(catalog_path, 'w', encoding='utf-8') as catalog_file: 66 | json.dump(items, catalog_file, indent=2) 67 | 68 | 69 | def handle_args(): 70 | parser = argparse.ArgumentParser(description='Tool for combining catalog files and/or ordering, checking and ' 71 | 'converting paths within catalog files') 72 | parser.add_argument('--output', help='Write collected catalog items to this new catalog file') 73 | parser.add_argument('--make-relative', action='store_true', 74 | help='Make all path entries of all items relative to new catalog file\'s parent directory') 75 | parser.add_argument('--check', 76 | help='Comma separated list of path entries to check for existence ' 77 | '("all" for checking every entry, default: no checks)') 78 | parser.add_argument('--on-miss', default='fail', choices=['fail', 'drop', 'remove', 'ignore'], 79 | help='What to do if a path is not existing: ' 80 | '"fail" (exit program), ' 81 | '"drop" (drop catalog item) or ' 82 | '"remove" (remove path entry from catalog item) or ' 83 | '"ignore" (keep it as it is)') 84 | parser.add_argument('--order-by', help='Path entry used for sorting items in target catalog') 85 | parser.add_argument('sources', nargs='+', help='Source catalog files (supporting wildcards)') 86 | return parser.parse_args() 87 | 88 | 89 | if __name__ == "__main__": 90 | CLI_ARGS = handle_args() 91 | CLI_ARGS.check = [] if CLI_ARGS.check is None else CLI_ARGS.check.split(',') 92 | build_catalog() 93 | -------------------------------------------------------------------------------- /align/export.py: -------------------------------------------------------------------------------- 1 | import os 2 
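# Illustrative input for the tool above (hypothetical paths): a catalog is a
# JSON list of items mapping entry names to files, relative paths being
# resolved against the catalog file's own directory:
#
#   [{"audio": "audio/part1.mp3", "aligned": "aligned/part1.aligned"},
#    {"audio": "/data/part2.wav", "aligned": "/data/part2.aligned"}]
#
# A typical combining run could then look like:
#   python catalog_tool.py --check all --on-miss drop --order-by audio \
#       --output data/all.catalog data/*.catalog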
| import io 3 | import sys 4 | import csv 5 | import math 6 | import time 7 | import json 8 | import wave 9 | import pickle 10 | import random 11 | import tarfile 12 | import logging 13 | import argparse 14 | import statistics 15 | import os.path as path 16 | 17 | from datetime import timedelta 18 | from collections import Counter 19 | from multiprocessing import Pool 20 | from audio import AUDIO_TYPE_PCM, AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS,\ 21 | ensure_wav_with_format, extract_audio, change_audio_types, write_audio_format_to_wav_file, verify_wav_file 22 | from sample_collections import SortingSDBWriter, LabeledSample 23 | from utils import parse_file_size, log_progress 24 | 25 | UNKNOWN = '' 26 | AUDIO_TYPE_LOOKUP = { 27 | 'wav': AUDIO_TYPE_WAV, 28 | 'opus': AUDIO_TYPE_OPUS 29 | } 30 | SET_NAMES = ['train', 'dev', 'test'] 31 | 32 | 33 | class Fragment: 34 | def __init__(self, catalog_index, alignment_index, quality=0, duration=0): 35 | self.catalog_index = catalog_index 36 | self.alignment_index = alignment_index 37 | self.quality = quality 38 | self.duration = duration 39 | self.meta = {} 40 | self.partition = 'other' 41 | self.list_name = 'other' 42 | 43 | 44 | def progress(it=None, desc=None, total=None): 45 | if desc is not None: 46 | logging.info(desc) 47 | return it if CLI_ARGS.no_progress else log_progress(it, interval=CLI_ARGS.progress_interval, total=total) 48 | 49 | 50 | def fail(message, code=1): 51 | logging.fatal(message) 52 | exit(code) 53 | 54 | 55 | def check_path(target_path, fs_type='file'): 56 | if not (path.isfile(target_path) if fs_type == 'file' else path.isdir(target_path)): 57 | fail('{} not existing: "{}"'.format(fs_type[0].upper() + fs_type[1:], target_path)) 58 | return path.abspath(target_path) 59 | 60 | 61 | def make_absolute(base_path, spec_path): 62 | if not path.isabs(spec_path): 63 | spec_path = path.join(base_path, spec_path) 64 | spec_path = path.abspath(spec_path) 65 | return spec_path if path.isfile(spec_path) else None 66 | 67 | 68 | def engroup(lst, get_key): 69 | groups = {} 70 | for obj in lst: 71 | key = get_key(obj) 72 | if key in groups: 73 | groups[key].append(obj) 74 | else: 75 | groups[key] = [obj] 76 | return groups 77 | 78 | 79 | def get_sample_size(population_size): 80 | margin_of_error = 0.01 81 | fraction_picking = 0.50 82 | z_score = 2.58 # Corresponds to confidence level 99% 83 | numerator = (z_score ** 2 * fraction_picking * (1 - fraction_picking)) / ( 84 | margin_of_error ** 2 85 | ) 86 | sample_size = 0 87 | for train_size in range(population_size, 0, -1): 88 | denominator = 1 + (z_score ** 2 * fraction_picking * (1 - fraction_picking)) / ( 89 | margin_of_error ** 2 * train_size 90 | ) 91 | sample_size = int(numerator / denominator) 92 | if 2 * sample_size + train_size <= population_size: 93 | break 94 | return sample_size 95 | 96 | 97 | def load_catalog(): 98 | catalog_entries = [] 99 | if CLI_ARGS.audio: 100 | if CLI_ARGS.aligned: 101 | catalog_entries.append((check_path(CLI_ARGS.audio), check_path(CLI_ARGS.aligned))) 102 | else: 103 | fail('If you specify "--audio", you also have to specify "--aligned"') 104 | elif CLI_ARGS.aligned: 105 | fail('If you specify "--aligned", you also have to specify "--audio"') 106 | elif CLI_ARGS.catalog: 107 | catalog = check_path(CLI_ARGS.catalog) 108 | catalog_dir = path.dirname(catalog) 109 | with open(catalog, 'r', encoding='utf-8') as catalog_file: 110 | catalog_file_entries = json.load(catalog_file) 111 | for entry in progress(catalog_file_entries, desc='Reading catalog'): 112 | audio = 
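# The loop in get_sample_size() above amounts to Cochran's sample size formula
# with finite population correction: with z = 2.58 (99% confidence), p = 0.5
# and margin of error e = 0.01, it picks the largest train_size for which
# train_size + 2 * n(train_size) still fits into the population, where
#
#   n(t) = (z**2 * p * (1 - p) / e**2) / (1 + z**2 * p * (1 - p) / (e**2 * t))
#
# so the dev and test sets each get n samples and the rest goes to training.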
make_absolute(catalog_dir, entry['audio']) 113 | aligned = make_absolute(catalog_dir, entry['aligned']) 114 | if audio is None or aligned is None: 115 | if CLI_ARGS.ignore_missing: 116 | continue 117 | if audio is None: 118 | fail('Problem loading catalog "{}": Missing referenced audio file "{}"' 119 | .format(CLI_ARGS.catalog, entry['audio'])) 120 | if aligned is None: 121 | fail('Problem loading catalog "{}": Missing referenced alignment file "{}"' 122 | .format(CLI_ARGS.catalog, entry['aligned'])) 123 | catalog_entries.append((audio, aligned)) 124 | else: 125 | fail('You have to either specify "--audio" and "--aligned" or "--catalog"') 126 | return catalog_entries 127 | 128 | 129 | def load_fragments(catalog_entries): 130 | def get_meta_list(ae, mf): 131 | if 'meta' in ae: 132 | meta_fields = ae['meta'] 133 | if isinstance(meta_fields, dict) and mf in meta_fields: 134 | mf = meta_fields[mf] 135 | return mf if isinstance(mf, list) else [mf] 136 | return [] 137 | 138 | required_metas = {} 139 | if CLI_ARGS.debias is not None: 140 | for debias_meta_field in CLI_ARGS.debias: 141 | required_metas[debias_meta_field] = True 142 | if CLI_ARGS.split and CLI_ARGS.split_field is not None: 143 | required_metas[CLI_ARGS.split_field] = True 144 | 145 | fragments = [] 146 | reasons = Counter() 147 | for catalog_index, catalog_entry in enumerate(progress(catalog_entries, desc='Loading alignments')): 148 | audio_path, aligned_path = catalog_entry 149 | with open(aligned_path, 'r', encoding='utf-8') as aligned_file: 150 | aligned = json.load(aligned_file) 151 | for alignment_index, alignment in enumerate(aligned): 152 | quality = eval(CLI_ARGS.criteria, {'math': math}, alignment) 153 | alignment['quality'] = quality 154 | if eval(CLI_ARGS.filter, {'math': math}, alignment): 155 | reasons['Filter'] += 1 156 | continue 157 | meta = {} 158 | keep = True 159 | for meta_field in required_metas.keys(): 160 | meta_list = get_meta_list(alignment, meta_field) 161 | if CLI_ARGS.split and CLI_ARGS.split_field == meta_field: 162 | if CLI_ARGS.split_drop_multiple and len(meta_list) > 1: 163 | reasons['Split drop multiple'] += 1 164 | keep = False 165 | break 166 | elif CLI_ARGS.split_drop_unknown and len(meta_list) == 0: 167 | reasons['Split drop unknown'] += 1 168 | keep = False 169 | break 170 | meta[meta_field] = meta_list[0] if meta_list else UNKNOWN 171 | if keep: 172 | duration = alignment['end'] - alignment['start'] 173 | fragment = Fragment(catalog_index, alignment_index, quality=quality, duration=duration) 174 | fragment.meta = meta 175 | for minimum, partition_name in CLI_ARGS.partition: 176 | if quality >= minimum: 177 | fragment.partition = partition_name 178 | break 179 | fragments.append(fragment) 180 | 181 | if len(fragments) == 0: 182 | fail('No samples left for export') 183 | 184 | if len(reasons.keys()) > 0: 185 | logging.info('Excluded number of samples (for each reason):') 186 | for reason, count in reasons.most_common(): 187 | logging.info(' - "{}": {}'.format(reason, count)) 188 | return fragments 189 | 190 | 191 | def debias(fragments): 192 | if CLI_ARGS.debias is not None: 193 | for debias in CLI_ARGS.debias: 194 | grouped = engroup(fragments, lambda f: f.meta[debias]) 195 | if UNKNOWN in grouped: 196 | fragments = grouped[UNKNOWN] 197 | del grouped[UNKNOWN] 198 | else: 199 | fragments = [] 200 | counts = list(map(lambda f: len(f), grouped.values())) 201 | mean = statistics.mean(counts) 202 | sigma = statistics.pstdev(counts, mu=mean) 203 | cap = int(mean + CLI_ARGS.debias_sigma_factor * 
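# Note on the eval() calls above: --criteria and --filter are trusted Python
# expressions evaluated once per alignment entry, with the entry's JSON fields
# as local names and math as the only global. Hypothetical examples (available
# field names depend on what the aligner wrote; start/end are milliseconds):
#   --criteria 'sws'                 # e.g. use a Smith-Waterman score as quality
#   --filter 'end - start < 1000'    # drop fragments shorter than one second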
sigma) 204 | counter = Counter() 205 | for group, group_fragments in progress(grouped.items(), desc='De-biasing "{}"'.format(debias)): 206 | if len(group_fragments) > cap: 207 | group_fragments.sort(key=lambda f: f.quality) 208 | counter[group] += len(group_fragments) - cap 209 | group_fragments = group_fragments[-cap:] 210 | fragments.extend(group_fragments) 211 | if len(counter.keys()) > 0: 212 | logging.info('Dropped for de-biasing "{}":'.format(debias)) 213 | for group, count in counter.most_common(): 214 | logging.info(' - "{}": {}'.format(group, count)) 215 | return fragments 216 | 217 | 218 | def parse_set_assignments(): 219 | set_assignments = {} 220 | for set_index, set_name in enumerate(SET_NAMES): 221 | attr_name = 'assign_' + set_name 222 | if hasattr(CLI_ARGS, attr_name): 223 | set_entities = getattr(CLI_ARGS, attr_name) 224 | if set_entities is not None: 225 | for entity_id in filter(None, str(set_entities).split(',')): 226 | if entity_id in set_assignments: 227 | fail('Unable to assign entity "{}" to set "{}", as it is already assigned to set "{}"' 228 | .format(entity_id, set_name, SET_NAMES[set_assignments[entity_id]])) 229 | set_assignments[entity_id] = set_index 230 | return set_assignments 231 | 232 | 233 | def check_targets(): 234 | if CLI_ARGS.target_dir is not None and CLI_ARGS.target_tar is not None: 235 | fail('Only one allowed: --target-dir or --target-tar') 236 | elif CLI_ARGS.target_dir is not None: 237 | CLI_ARGS.target_dir = check_path(CLI_ARGS.target_dir, fs_type='directory') 238 | elif CLI_ARGS.target_tar is not None: 239 | if CLI_ARGS.sdb: 240 | fail('Option --sdb not supported for --target-tar output. Use --target-dir instead.') 241 | CLI_ARGS.target_tar = path.abspath(CLI_ARGS.target_tar) 242 | if path.isfile(CLI_ARGS.target_tar): 243 | if not CLI_ARGS.force: 244 | fail('Target tar-file already existing - use --force to overwrite') 245 | elif path.exists(CLI_ARGS.target_tar): 246 | fail('Target tar-file path is existing, but not a file') 247 | elif not path.isdir(path.dirname(CLI_ARGS.target_tar)): 248 | fail('Unable to write tar-file: Path not existing') 249 | else: 250 | fail('Either --target-dir or --target-tar has to be provided') 251 | 252 | 253 | def split(fragments, set_assignments): 254 | lists = [] 255 | 256 | def assign_fragments(frags, name): 257 | lists.append(name) 258 | duration = 0 259 | for f in frags: 260 | f.list_name = name 261 | duration += f.duration 262 | logging.info('Built set "{}" (samples: {}, duration: {})' 263 | .format(name, len(frags), timedelta(milliseconds=duration))) 264 | 265 | if CLI_ARGS.split_seed is not None: 266 | random.seed(CLI_ARGS.split_seed) 267 | 268 | if CLI_ARGS.split and CLI_ARGS.split_field is not None: 269 | fragments = list(fragments) 270 | metas = engroup(fragments, lambda f: f.meta[CLI_ARGS.split_field]).items() 271 | metas = sorted(metas, key=lambda meta_frags: len(meta_frags[1])) 272 | metas = list(map(lambda meta_frags: meta_frags[0], metas)) 273 | partitions = engroup(fragments, lambda f: f.partition) 274 | partitions = list(map(lambda part_frags: (part_frags[0], 275 | get_sample_size(len(part_frags[1])), 276 | engroup(part_frags[1], lambda f: f.meta[CLI_ARGS.split_field]), 277 | [[], [], []]), 278 | partitions.items())) 279 | remaining_metas = [] 280 | for meta in metas: 281 | if meta in set_assignments: 282 | set_index = set_assignments[meta] 283 | for _, _, partition_portions, sample_sets in partitions: 284 | if meta in partition_portions: 285 | 
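# Worked example for the de-biasing cap above: group sizes [100, 100, 400]
# have mean 200 and population standard deviation sqrt(20000) ~ 141.4, so the
# default --debias-sigma-factor of 3.0 yields cap = int(200 + 3 * 141.4) = 624
# and nothing is trimmed; a factor of 1.0 yields cap = 341, dropping the 59
# lowest-quality samples from the largest group.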
sample_sets[set_index].extend(partition_portions[meta]) 286 | del partition_portions[meta] 287 | else: 288 | remaining_metas.append(meta) 289 | metas = remaining_metas 290 | for _, sample_size, _, sample_sets in partitions: 291 | while len(metas) > 0 and (len(sample_sets[1]) < sample_size or len(sample_sets[2]) < sample_size): 292 | for sample_set_index in [1, 2]: 293 | if len(metas) > 0 and sample_size > len(sample_sets[sample_set_index]): 294 | meta = metas.pop(0) 295 | for _, _, partition_portions, other_sample_sets in partitions: 296 | if meta in partition_portions: 297 | other_sample_sets[sample_set_index].extend(partition_portions[meta]) 298 | del partition_portions[meta] 299 | for partition, sample_size, partition_portions, sample_sets in partitions: 300 | for portion in partition_portions.values(): 301 | sample_sets[0].extend(portion) 302 | for set_index, set_name in enumerate(SET_NAMES): 303 | assign_fragments(sample_sets[set_index], partition + '-' + set_name) 304 | else: 305 | partitions = engroup(fragments, lambda f: f.partition) 306 | for partition, partition_fragments in partitions.items(): 307 | if CLI_ARGS.split: 308 | sample_size = get_sample_size(len(partition_fragments)) 309 | random.shuffle(partition_fragments) 310 | test_set = partition_fragments[:sample_size] 311 | partition_fragments = partition_fragments[sample_size:] 312 | dev_set = partition_fragments[:sample_size] 313 | train_set = partition_fragments[sample_size:] 314 | sample_sets = [train_set, dev_set, test_set] 315 | for set_index, set_name in enumerate(SET_NAMES): 316 | assign_fragments(sample_sets[set_index], partition + '-' + set_name) 317 | else: 318 | assign_fragments(partition_fragments, partition) 319 | return lists 320 | 321 | 322 | def check_overwrite(lists): 323 | if CLI_ARGS.target_dir is not None and not CLI_ARGS.force: 324 | for name in lists: 325 | suffixes = ['.meta'] + (['.sdb', '.sdb.tmp'] if CLI_ARGS.sdb else ['', '.csv']) 326 | for s in suffixes: 327 | p = path.join(CLI_ARGS.target_dir, name + s) 328 | if path.exists(p): 329 | fail('"{}" already existing - use --force to ignore'.format(p)) 330 | 331 | 332 | def parse_args(): 333 | parser = argparse.ArgumentParser(description='Export aligned speech samples.') 334 | 335 | parser.add_argument('--plan', type=str, 336 | help='Export plan (preparation-cache) to load and/or store') 337 | parser.add_argument('--audio', type=str, 338 | help='Take audio file as input (requires "--aligned ")') 339 | parser.add_argument('--aligned', type=str, 340 | help='Take alignment file ("<...>.aligned") as input (requires "--audio ")') 341 | parser.add_argument('--catalog', type=str, 342 | help='Take alignment and audio file references of provided catalog ("<...>.catalog") as input') 343 | parser.add_argument('--ignore-missing', action="store_true", 344 | help='Ignores catalog entries with missing files') 345 | 346 | parser.add_argument('--filter', type=str, default='False', 347 | help='Python expression that computes a boolean value from sample data fields. ' 348 | 'If the result is True, the sample will be dropped.') 349 | 350 | parser.add_argument('--criteria', type=str, default='100', 351 | help='Python expression that computes a number as quality indicator from sample data fields.') 352 | 353 | parser.add_argument('--debias', type=str, action='append', 354 | help='Sample meta field to group samples for debiasing (e.g. "speaker"). 
'
355 |                              'Group sizes will be capped according to --debias-sigma-factor')
356 |     parser.add_argument('--debias-sigma-factor', type=float, default=3.0,
357 |                         help='Standard deviation (sigma) factor that will determine '
358 |                              'the maximum number of samples per group (see --debias).')
359 |
360 |     parser.add_argument('--partition', type=str, action='append',
361 |                         help='Expression of the form "<number>:<partition>" where all samples with a quality indicator '
362 |                              '(--criteria) above or equal the given number and below the next bigger one are assigned '
363 |                              'to the specified partition. Samples below the lowest partition criteria are assigned to '
364 |                              'partition "other".')
365 |
366 |     parser.add_argument('--split', action="store_true",
367 |                         help='Split each partition except "other" into train/dev/test sub-sets.')
368 |     parser.add_argument('--split-field', type=str,
369 |                         help='Sample meta field that should be used for splitting (e.g. "speaker")')
370 |     parser.add_argument('--split-drop-multiple', action="store_true",
371 |                         help='Drop all samples with multiple --split-field assignments.')
372 |     parser.add_argument('--split-drop-unknown', action="store_true",
373 |                         help='Drop all samples with no --split-field assignment.')
374 |     for sub_set in SET_NAMES:
375 |         parser.add_argument('--assign-' + sub_set,
376 |                             help='Comma separated list of --split-field values that are to be assigned to sub-set '
377 |                                  '"{}"'.format(sub_set))
378 |     parser.add_argument('--split-seed', type=int,
379 |                         help='Random seed for set splitting')
380 |
381 |     parser.add_argument('--target-dir', type=str, required=False,
382 |                         help='Existing target directory for storing generated sets (files and directories)')
383 |     parser.add_argument('--target-tar', type=str, required=False,
384 |                         help='Target tar-file for storing generated sets (files and directories)')
385 |     parser.add_argument('--sdb', action="store_true",
386 |                         help='Writes Sample DBs instead of CSV and .wav files (requires --target-dir)')
387 |     parser.add_argument('--sdb-bucket-size', default='1GB',
388 |                         help='Memory bucket size for external sorting of SDBs')
389 |     parser.add_argument('--sdb-workers', type=int, default=None,
390 |                         help='Number of SDB encoding workers')
391 |     parser.add_argument('--sdb-buffered-samples', type=int, default=None,
392 |                         help='Number of samples per bucket buffer during finalization')
393 |     parser.add_argument('--sdb-audio-type', default='opus', choices=AUDIO_TYPE_LOOKUP.keys(),
394 |                         help='Audio representation inside target SDBs')
395 |     parser.add_argument('--tmp-dir', type=str, default=None,
396 |                         help='Directory for temporary files - defaults to system one')
397 |     parser.add_argument('--buffer', default='1MB',
398 |                         help='Buffer size for writing files (default: 1MB)')
399 |     parser.add_argument('--force', action="store_true",
400 |                         help='Overwrite existing files')
401 |     parser.add_argument('--skip-damaged', action="store_true",
402 |                         help='Skip damaged audio files and their samples instead of failing the export')
403 |     parser.add_argument('--no-meta', action="store_true",
404 |                         help='No writing of meta data files')
405 |     parser.add_argument('--rate', type=int, default=16000,
406 |                         help='Export wav-files with this sample rate')
407 |     parser.add_argument('--channels', type=int, default=1,
408 |                         help='Export wav-files with this number of channels')
409 |     parser.add_argument('--width', type=int, default=2,
410 |                         help='Export wav-files with this sample width (bytes)')
411 |
412 |     parser.add_argument('--workers', type=int, default=2,
413 |                         help='Number of workers for loading and re-sampling audio files. Default: 2')
414 |     parser.add_argument('--dry-run', action="store_true",
415 |                         help='Simulates export without writing or creating any file or directory')
416 |     parser.add_argument('--dry-run-fast', action="store_true",
417 |                         help='Simulates export without writing or creating any file or directory. '
418 |                              'In contrast to --dry-run this faster simulation will not load samples.')
419 |     parser.add_argument('--loglevel', type=int, default=20,
420 |                         help='Log level (between 0 and 50) - default: 20')
421 |     parser.add_argument('--no-progress', action="store_true",
422 |                         help='Prevents showing progress indication')
423 |     parser.add_argument('--progress-interval', type=float, default=1.0,
424 |                         help='Progress indication interval in seconds')
425 |
426 |     args = parser.parse_args()
427 |
428 |     args.buffer = parse_file_size(args.buffer)
429 |     args.sdb_bucket_size = parse_file_size(args.sdb_bucket_size)
430 |     args.dry_run = args.dry_run or args.dry_run_fast
431 |     partition_specs = []
432 |     if args.partition is not None:
433 |         for partition_expr in args.partition:
434 |             parts = partition_expr.split(':')
435 |             if len(parts) != 2:
436 |                 fail('Wrong partition specification: "{}"'.format(partition_expr))
437 |             partition_specs.append((float(parts[0]), str(parts[1])))
438 |     partition_specs.sort(key=lambda p: p[0], reverse=True)
439 |     args.partition = partition_specs
440 |     return args
441 |
442 |
443 | def load_sample(entry):
444 |     catalog_index, catalog_entry = entry
445 |     audio_path, aligned_path = catalog_entry
446 |     with open(aligned_path, 'r', encoding='utf-8') as aligned_file:
447 |         aligned = json.load(aligned_file)
448 |     tries = 2
449 |     while tries > 0:
450 |         wav_path, wav_is_temp = ensure_wav_with_format(audio_path, audio_format, tmp_dir=CLI_ARGS.tmp_dir)
451 |         # wav_path is None if the conversion failed; ensure_wav_with_format
452 |         # already removed its temporary file in that case
453 |         if wav_path is not None:
454 |             if verify_wav_file(wav_path):
455 |                 return catalog_index, wav_path, wav_is_temp, aligned
456 |             if wav_is_temp:
457 |                 os.remove(wav_path)
458 |         logging.warn('Problem converting "{}" into required format - retrying...'.format(audio_path))
459 |         time.sleep(10)
460 |         tries -= 1
461 |     return catalog_index, None, False, aligned
462 |
463 |
464 | def load_sample_dry(entry):
465 |     catalog_index, catalog_entry = entry
466 |     audio_path, aligned_path = catalog_entry
467 |     if path.isfile(audio_path):
468 |         logging.info('Would load file "{}"'.format(audio_path))
469 |     else:
470 |         fail('Audio file not found: "{}"'.format(audio_path))
471 |     if path.isfile(aligned_path):
472 |         logging.info('Would load file "{}"'.format(aligned_path))
473 |     else:
474 |         fail('Alignment file not found: "{}"'.format(aligned_path))
475 |     return catalog_index, '', False, []
476 |
477 |
478 | def load_samples(catalog_entries, fragments):
479 |     catalog_index_wise = engroup(fragments, lambda f: f.catalog_index)
480 |     pool = Pool(CLI_ARGS.workers)
481 |     ls = load_sample_dry if CLI_ARGS.dry_run_fast else load_sample
482 |     indexed_entries = map(lambda ci: (ci, catalog_entries[ci]), catalog_index_wise.keys())
483 |     for catalog_index, wav_path, wav_is_temp, aligned in pool.imap_unordered(ls, indexed_entries):
484 |         if wav_path is None:
485 |             src_audio_path = catalog_entries[catalog_index][0]
486 |             message = 'Unable to convert audio file "{}" to required format - skipping'.format(src_audio_path)
487 |             if CLI_ARGS.skip_damaged:
488 |                 logging.warn(message)
489 |                 continue
490 |             else:
491 |                 raise RuntimeError(message)
492 |         file_fragments = catalog_index_wise[catalog_index]
493 |         file_fragments.sort(key=lambda f: f.alignment_index)
494 |         if CLI_ARGS.dry_run_fast:
495 |             for fragment in file_fragments:
496 |                 yield b'', fragment, ''
497 |         else:
498 |             with wave.open(wav_path, 'rb') as source_wav_file:
499 |                 wav_duration = source_wav_file.getnframes() / source_wav_file.getframerate() * 1000
500 |                 for fragment in file_fragments:
501 |                     aligned_entry = aligned[fragment.alignment_index]
502 |                     try:
503 |                         start, end = aligned_entry['start'], aligned_entry['end']
504 |                         assert start < end <= wav_duration
505 |                         fragment_audio = extract_audio(source_wav_file, start / 1000.0, end / 1000.0)
506 |                     except Exception as ae:
507 |                         message = 'Problem extracting audio for alignment entry {} in catalog entry {}'\
508 |                             .format(fragment.alignment_index, fragment.catalog_index)
509 |                         if CLI_ARGS.skip_damaged:
510 |                             logging.warn(message)
511 |                             break
512 |                         else:
513 |                             raise RuntimeError(message) from ae
514 |                     yield fragment_audio, fragment, aligned_entry['aligned']
515 |         if wav_is_temp:
516 |             os.remove(wav_path)
517 |
518 |
519 | def write_meta(file, catalog_entries, id_plus_fragment_iter, total=None):
520 |     writer = csv.writer(file)
521 |     writer.writerow(['sample', 'split_entity', 'catalog_index', 'source_audio_file', 'aligned_file', 'alignment_index'])
522 |     has_split_entity = CLI_ARGS.split and CLI_ARGS.split_field is not None
523 |     for sample_id, fragment in progress(id_plus_fragment_iter, total=total):
524 |         split_entity = fragment.meta[CLI_ARGS.split_field] if has_split_entity else ''
525 |         source_audio_file, aligned_file = catalog_entries[fragment.catalog_index]
526 |         writer.writerow([sample_id,
527 |                          split_entity,
528 |                          fragment.catalog_index,
529 |                          source_audio_file,
530 |                          aligned_file,
531 |                          fragment.alignment_index])
532 |
533 |
534 | def write_csvs_and_samples(catalog_entries, lists, fragments):
535 |     created_directories = {}
536 |     tar = None
537 |     if CLI_ARGS.target_tar is not None:
538 |         if CLI_ARGS.dry_run:
539 |             logging.info('Would create tar-file "{}"'.format(CLI_ARGS.target_tar))
540 |         else:
541 |             base_tar = open(CLI_ARGS.target_tar, 'wb', buffering=CLI_ARGS.buffer)
542 |             tar = tarfile.open(fileobj=base_tar, mode='w')
543 |
544 |     class TargetFile:
545 |         def __init__(self, data_path, mode):
546 |             self.data_path = data_path
547 |             self.mode = mode
548 |             self.open_file = None
549 |
550 |         def __enter__(self):
551 |             parts = self.data_path.split('/')
552 |             dirs = ([CLI_ARGS.target_dir] if CLI_ARGS.target_dir is not None else []) + parts[:-1]
553 |             for i in range(1, len(dirs)):
554 |                 vp = '/'.join(dirs[:i + 1])
555 |                 if vp not in created_directories:
556 |                     if tar is None:
557 |                         dir_path = path.join(*dirs[:i + 1])
558 |                         if not path.isdir(dir_path):
559 |                             if CLI_ARGS.dry_run:
560 |                                 logging.info('Would create directory "{}"'.format(dir_path))
561 |                             else:
562 |                                 os.mkdir(dir_path)
563 |                     else:
564 |                         tdir = tarfile.TarInfo(vp)
565 |                         tdir.type = tarfile.DIRTYPE
566 |                         tar.addfile(tdir)
567 |                     created_directories[vp] = True
568 |             if CLI_ARGS.target_tar is None:
569 |                 file_path = path.join(CLI_ARGS.target_dir, *self.data_path.split('/'))
570 |                 if CLI_ARGS.dry_run:
571 |                     logging.info('Would write file "{}"'.format(file_path))
572 |                     self.open_file = io.BytesIO() if 'b' in self.mode else io.StringIO()
573 |                 else:
574 |                     self.open_file = open(file_path, self.mode)
575 |             else:
576 |                 self.open_file = io.BytesIO() if 'b' in self.mode else io.StringIO()
577 |             return self.open_file
578 |
579 |         def __exit__(self, *args):
580 |             if tar is not None:
581 |                 if
isinstance(self.open_file, io.StringIO): 582 | sfile = self.open_file 583 | sfile.seek(0) 584 | self.open_file = io.BytesIO(sfile.read().encode('utf8')) 585 | self.open_file.seek(0, 2) 586 | sfile.close() 587 | tfile = tarfile.TarInfo(self.data_path) 588 | tfile.size = self.open_file.tell() 589 | self.open_file.seek(0) 590 | tar.addfile(tfile, self.open_file) 591 | tar.members = [] 592 | if self.open_file is not None: 593 | self.open_file.close() 594 | 595 | group_lists = {} 596 | for list_name in lists: 597 | group_lists[list_name] = [] 598 | 599 | for pcm_data, fragment, transcript in progress(load_samples(catalog_entries, fragments), 600 | desc='Exporting samples', total=len(fragments)): 601 | group_list = group_lists[fragment.list_name] 602 | sample_path = '{}/sample-{:010d}.wav'.format(fragment.list_name, len(group_list)) 603 | with TargetFile(sample_path, "wb") as base_wav_file: 604 | with wave.open(base_wav_file, 'wb') as wav_file: 605 | write_audio_format_to_wav_file(wav_file) 606 | wav_file.writeframes(pcm_data) 607 | file_size = base_wav_file.tell() 608 | group_list.append((sample_path, file_size, fragment, transcript)) 609 | 610 | for list_name, group_list in group_lists.items(): 611 | csv_filename = list_name + '.csv' 612 | logging.info('Writing "{}"'.format(csv_filename)) 613 | with TargetFile(csv_filename, 'w') as csv_file: 614 | writer = csv.writer(csv_file) 615 | writer.writerow(['wav_filename', 'wav_filesize', 'transcript']) 616 | for rel_path, file_size, fragment, transcript in progress(group_list): 617 | writer.writerow([rel_path, file_size, transcript]) 618 | if not CLI_ARGS.no_meta: 619 | meta_filename = list_name + '.meta' 620 | logging.info('Writing "{}"'.format(meta_filename)) 621 | with TargetFile(meta_filename, 'w') as meta_file: 622 | path_fragment_list = map(lambda gi: (gi[0], gi[2]), group_list) 623 | write_meta(meta_file, catalog_entries, path_fragment_list, total=len(group_list)) 624 | 625 | if tar is not None: 626 | tar.close() 627 | 628 | 629 | def write_sdbs(catalog_entries, lists, fragments): 630 | audio_type = AUDIO_TYPE_LOOKUP[CLI_ARGS.sdb_audio_type] 631 | sdbs = {} 632 | for list_name in lists: 633 | sdb_path = os.path.join(CLI_ARGS.target_dir, list_name + '.sdb') 634 | if CLI_ARGS.dry_run: 635 | logging.info('Would create SDB "{}"'.format(sdb_path)) 636 | else: 637 | logging.info('Creating SDB "{}"'.format(sdb_path)) 638 | sdbs[list_name] = SortingSDBWriter(sdb_path, 639 | audio_type=audio_type, 640 | buffering=CLI_ARGS.buffer, 641 | cache_size=CLI_ARGS.sdb_bucket_size, 642 | buffered_samples=CLI_ARGS.sdb_buffered_samples) 643 | 644 | def to_samples(): 645 | for pcm_data, fragment, transcript in load_samples(catalog_entries, fragments): 646 | cs = LabeledSample(AUDIO_TYPE_PCM, pcm_data, transcript, audio_format=audio_format) 647 | cs.meta = fragment 648 | yield cs 649 | 650 | samples = change_audio_types(to_samples(), 651 | audio_type=audio_type, 652 | processes=CLI_ARGS.sdb_workers) if not CLI_ARGS.dry_run_fast else to_samples() 653 | set_counter = Counter() 654 | for sample in progress(samples, desc='Exporting samples', total=len(fragments)): 655 | list_name = sample.meta.list_name 656 | if not CLI_ARGS.dry_run: 657 | set_counter[list_name] += 1 658 | sdb = sdbs[list_name] 659 | sdb.add(sample) 660 | for list_name, sdb in sdbs.items(): 661 | meta_path = os.path.join(CLI_ARGS.target_dir, list_name + '.meta') 662 | if CLI_ARGS.dry_run: 663 | if not CLI_ARGS.no_meta: 664 | logging.info('Would write meta file "{}"'.format(meta_path)) 665 | else: 666 | 
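# The per-set CSV files written above follow the three-column DeepSpeech
# import format; a hypothetical row:
#
#   wav_filename,wav_filesize,transcript
#   train/sample-0000000042.wav,63404,hello world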
sdb_path = os.path.join(CLI_ARGS.target_dir, list_name + '.sdb') 667 | for _ in progress(sdb.finalize(), desc='Finalizing "{}"'.format(sdb_path), total=set_counter[list_name]): 668 | pass 669 | if not CLI_ARGS.no_meta: 670 | logging.info('Writing "{}"'.format(meta_path)) 671 | with open(meta_path, 'w') as meta_file: 672 | write_meta(meta_file, catalog_entries, enumerate(sdb.meta_list), total=len(sdb.meta_list)) 673 | 674 | 675 | def load_plan(): 676 | if CLI_ARGS.plan is not None and os.path.isfile(CLI_ARGS.plan): 677 | try: 678 | logging.info('Loading export-plan from "{}"'.format(CLI_ARGS.plan)) 679 | with open(CLI_ARGS.plan, 'rb') as plan_file: 680 | catalog_entries, lists, fragments = pickle.load(plan_file) 681 | return True, catalog_entries, lists, fragments 682 | except pickle.PickleError: 683 | logging.warn('Unable to load export-plan "{}" - rebuilding'.format(CLI_ARGS.plan)) 684 | os.remove(CLI_ARGS.plan) 685 | return False, None, None, None 686 | 687 | 688 | def save_plan(catalog_entries, lists, fragments): 689 | if CLI_ARGS.plan is not None: 690 | logging.info('Saving export-plan to "{}"'.format(CLI_ARGS.plan)) 691 | with open(CLI_ARGS.plan, 'wb') as plan_file: 692 | pickle.dump((catalog_entries, lists, fragments), plan_file) 693 | 694 | 695 | def main(): 696 | check_targets() 697 | has_plan, catalog_entries, lists, fragments = load_plan() 698 | if not has_plan: 699 | set_assignments = parse_set_assignments() 700 | catalog_entries = load_catalog() 701 | fragments = load_fragments(catalog_entries) 702 | fragments = debias(fragments) 703 | lists = split(fragments, set_assignments) 704 | save_plan(catalog_entries, lists, fragments) 705 | check_overwrite(lists) 706 | if CLI_ARGS.sdb: 707 | write_sdbs(catalog_entries, lists, fragments) 708 | else: 709 | write_csvs_and_samples(catalog_entries, lists, fragments) 710 | 711 | 712 | if __name__ == '__main__': 713 | CLI_ARGS = parse_args() 714 | audio_format = (CLI_ARGS.rate, CLI_ARGS.channels, CLI_ARGS.width) 715 | logging.basicConfig(stream=sys.stderr, level=CLI_ARGS.loglevel) 716 | logging.getLogger('sox').setLevel(logging.ERROR) 717 | main() 718 | -------------------------------------------------------------------------------- /align/generate_lm.py: -------------------------------------------------------------------------------- 1 | import gzip 2 | import io 3 | import os 4 | import subprocess 5 | from collections import Counter 6 | 7 | def convert_and_filter_topk(output_dir, input_txt, top_k): 8 | """ Convert to lowercase, count word occurrences and save top-k words to a file """ 9 | 10 | counter = Counter() 11 | data_lower = output_dir + "." 
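# Note on the plan functions above: the --plan file is a plain pickle of
# (catalog_entries, lists, fragments), so a re-run with the same plan skips
# catalog loading, filtering, de-biasing and splitting and goes straight to
# writing output.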
+ "lower.txt.gz" 12 | 13 | print("\nConverting to lowercase and counting word occurrences ...") 14 | with io.TextIOWrapper( 15 | io.BufferedWriter(gzip.open(data_lower, "w+")), encoding="utf-8" 16 | ) as file_out: 17 | 18 | # Open the input file either from input.txt or input.txt.gz 19 | _, file_extension = os.path.splitext(input_txt) 20 | if file_extension == ".gz": 21 | file_in = io.TextIOWrapper( 22 | io.BufferedReader(gzip.open(input_txt)), encoding="utf-8" 23 | ) 24 | else: 25 | file_in = open(input_txt, encoding="utf-8") 26 | 27 | for line in file_in: 28 | line_lower = line.lower() 29 | counter.update(line_lower.split()) 30 | file_out.write(line_lower) 31 | 32 | file_in.close() 33 | 34 | # Save top-k words 35 | print("\nSaving top {} words ...".format(top_k)) 36 | top_counter = counter.most_common(top_k) 37 | vocab_str = "\n".join(word for word, count in top_counter) 38 | vocab_path = "vocab-{}.txt".format(top_k) 39 | vocab_path = output_dir + "." + vocab_path 40 | with open(vocab_path, "w+", encoding="utf-8") as file: 41 | file.write(vocab_str) 42 | 43 | print("\nCalculating word statistics ...") 44 | total_words = sum(counter.values()) 45 | print(" Your text file has {} words in total".format(total_words)) 46 | print(" It has {} unique words".format(len(counter))) 47 | top_words_sum = sum(count for word, count in top_counter) 48 | word_fraction = (top_words_sum / total_words) * 100 49 | print( 50 | " Your top-{} words are {:.4f} percent of all words".format( 51 | top_k, word_fraction 52 | ) 53 | ) 54 | print(' Your most common word "{}" occurred {} times'.format(*top_counter[0])) 55 | last_word, last_count = top_counter[-1] 56 | print( 57 | ' The least common word in your top-k is "{}" with {} times'.format( 58 | last_word, last_count 59 | ) 60 | ) 61 | for i, (w, c) in enumerate(reversed(top_counter)): 62 | if c > last_count: 63 | print( 64 | ' The first word with {} occurrences is "{}" at place {}'.format( 65 | c, w, len(top_counter) - 1 - i 66 | ) 67 | ) 68 | break 69 | 70 | return data_lower, vocab_str 71 | 72 | 73 | def build_lm(output_dir, kenlm_bins, arpa_order, max_arpa_memory, arpa_prune, discount_fallback, binary_a_bits, binary_q_bits, binary_type, data_lower, vocab_str): 74 | print("\nCreating ARPA file ...") 75 | lm_path = output_dir + "." + "lm.arpa" 76 | subargs = [ 77 | os.path.join(kenlm_bins, "lmplz"), 78 | "--order", 79 | str(arpa_order), 80 | "--temp_prefix", 81 | output_dir, 82 | "--memory", 83 | max_arpa_memory, 84 | "--text", 85 | data_lower, 86 | "--arpa", 87 | lm_path, 88 | "--prune", 89 | *arpa_prune.split("|"), 90 | ] 91 | if discount_fallback: 92 | subargs += ["--discount_fallback"] 93 | subprocess.check_call(subargs) 94 | 95 | # Filter LM using vocabulary of top-k words 96 | print("\nFiltering ARPA file using vocabulary of top-k words ...") 97 | filtered_path = output_dir + "." + "lm_filtered.arpa" 98 | subprocess.run( 99 | [ 100 | os.path.join(kenlm_bins, "filter"), 101 | "single", 102 | "model:{}".format(lm_path), 103 | filtered_path, 104 | ], 105 | input=vocab_str.encode("utf-8"), 106 | check=True, 107 | ) 108 | 109 | # Quantize and produce trie binary. 110 | print("\nBuilding lm.binary ...") 111 | binary_path = output_dir + "." 
+ "lm.binary" 112 | subprocess.check_call( 113 | [ 114 | os.path.join(kenlm_bins, "build_binary"), 115 | "-s", 116 | "-a", 117 | str(binary_a_bits), 118 | "-q", 119 | str(binary_q_bits), 120 | "-v", 121 | binary_type, 122 | filtered_path, 123 | binary_path, 124 | ] 125 | ) -------------------------------------------------------------------------------- /align/generate_package.py: -------------------------------------------------------------------------------- 1 | import shutil 2 | import struct 3 | from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet 4 | 5 | 6 | class Alphabet(object): 7 | def __init__(self, config_file): 8 | self._config_file = config_file 9 | self._label_to_str = {} 10 | self._str_to_label = {} 11 | self._size = 0 12 | if config_file: 13 | with open(config_file, 'r', encoding='utf-8') as fin: 14 | for line in fin: 15 | if line[0:2] == '\\#': 16 | line = '#\n' 17 | elif line[0] == '#': 18 | continue 19 | self._label_to_str[self._size] = line[:-1] # remove the line ending 20 | self._str_to_label[line[:-1]] = self._size 21 | self._size += 1 22 | 23 | def serialize(self): 24 | # Serialization format is a sequence of (key, value) pairs, where key is 25 | # a uint16_t and value is a uint16_t length followed by `length` UTF-8 26 | # encoded bytes with the label. 27 | res = bytearray() 28 | 29 | # We start by writing the number of pairs in the buffer as uint16_t. 30 | res += struct.pack(' self.cache_size: 173 | self.finish_bucket() 174 | sample.change_audio_type(self.audio_type) 175 | sample.sample_id = '#unsorted:{}'.format(len(self.bucket)) 176 | self.meta_dict[sample.sample_id] = sample.meta 177 | self.bucket.append(sample) 178 | self.bucket_size += len(sample.audio.getbuffer()) 179 | return sample.sample_id 180 | 181 | def finalize(self): 182 | if self.tmp_sdb is None: 183 | return 184 | self.finish_bucket() 185 | num_samples = len(self.tmp_sdb) 186 | self.tmp_sdb.close() 187 | self.tmp_sdb = None 188 | if self.buffered_samples is None: 189 | avg_sample_size = self.overall_size / max(1, num_samples) 190 | max_cached_samples = self.cache_size / max(1, avg_sample_size) 191 | buffer_size = max(1, int(max_cached_samples / max(1, len(self.buckets)))) 192 | else: 193 | buffer_size = self.buffered_samples 194 | sdb_reader = SDB(self.tmp_sdb_filename, buffering=self.buffering, id_prefix='#pre-sorted') 195 | 196 | def buffered_view(bucket): 197 | start, end = bucket 198 | buffer = [] 199 | current_offset = start 200 | while current_offset < end: 201 | while len(buffer) < buffer_size and current_offset < end: 202 | buffer.insert(0, sdb_reader[current_offset]) 203 | current_offset += 1 204 | while len(buffer) > 0: 205 | yield buffer.pop(-1) 206 | 207 | bucket_views = list(map(buffered_view, self.buckets)) 208 | interleaved = heapq.merge(*bucket_views, key=lambda s: s.duration) 209 | with DirectSDBWriter(self.sdb_filename, 210 | buffering=self.buffering, 211 | audio_type=self.audio_type, 212 | id_prefix=self.id_prefix) as sdb_writer: 213 | for index, sample in enumerate(interleaved): 214 | old_id = sample.sample_id 215 | sdb_writer.add(sample) 216 | self.meta_list.append(self.meta_dict[old_id]) 217 | del self.meta_dict[old_id] 218 | yield index / num_samples 219 | sdb_reader.close() 220 | os.unlink(self.tmp_sdb_filename) 221 | 222 | def close(self): 223 | for _ in self.finalize(): 224 | pass 225 | 226 | def __exit__(self, exc_type, exc_val, exc_tb): 227 | self.close() 228 | 229 | 230 | class SDB: # pylint: disable=too-many-instance-attributes 231 | """Sample collection reader 
for reading a Sample DB (SDB) file""" 232 | def __init__(self, sdb_filename, buffering=BUFFER_SIZE, id_prefix=None): 233 | self.sdb_filename = sdb_filename 234 | self.id_prefix = sdb_filename if id_prefix is None else id_prefix 235 | self.sdb_file = open(sdb_filename, 'rb', buffering=buffering) 236 | self.offsets = [] 237 | if self.sdb_file.read(len(MAGIC)) != MAGIC: 238 | raise RuntimeError('No Sample Database') 239 | meta_chunk_len = self.read_big_int() 240 | self.meta = json.loads(self.sdb_file.read(meta_chunk_len).decode()) 241 | if SCHEMA_KEY not in self.meta: 242 | raise RuntimeError('Missing schema') 243 | self.schema = self.meta[SCHEMA_KEY] 244 | 245 | speech_columns = self.find_columns(content=CONTENT_TYPE_SPEECH, mime_type=SERIALIZABLE_AUDIO_TYPES) 246 | if not speech_columns: 247 | raise RuntimeError('No speech data (missing in schema)') 248 | self.speech_index = speech_columns[0] 249 | self.audio_type = self.schema[self.speech_index][MIME_TYPE_KEY] 250 | 251 | transcript_columns = self.find_columns(content=CONTENT_TYPE_TRANSCRIPT, mime_type=MIME_TYPE_TEXT) 252 | if not transcript_columns: 253 | raise RuntimeError('No transcript data (missing in schema)') 254 | self.transcript_index = transcript_columns[0] 255 | 256 | sample_chunk_len = self.read_big_int() 257 | self.sdb_file.seek(sample_chunk_len + BIGINT_SIZE, 1) 258 | num_samples = self.read_big_int() 259 | for _ in range(num_samples): 260 | self.offsets.append(self.read_big_int()) 261 | 262 | def read_int(self): 263 | return int.from_bytes(self.sdb_file.read(INT_SIZE), BIG_ENDIAN) 264 | 265 | def read_big_int(self): 266 | return int.from_bytes(self.sdb_file.read(BIGINT_SIZE), BIG_ENDIAN) 267 | 268 | def find_columns(self, content=None, mime_type=None): 269 | criteria = [] 270 | if content is not None: 271 | criteria.append((CONTENT_KEY, content)) 272 | if mime_type is not None: 273 | criteria.append((MIME_TYPE_KEY, mime_type)) 274 | if len(criteria) == 0: 275 | raise ValueError('At least one of "content" or "mime-type" has to be provided') 276 | matches = [] 277 | for index, column in enumerate(self.schema): 278 | matched = 0 279 | for field, value in criteria: 280 | if column[field] == value or (isinstance(value, list) and column[field] in value): 281 | matched += 1 282 | if matched == len(criteria): 283 | matches.append(index) 284 | return matches 285 | 286 | def read_row(self, row_index, *columns): 287 | columns = list(columns) 288 | column_data = [None] * len(columns) 289 | found = 0 290 | if not 0 <= row_index < len(self.offsets): 291 | raise ValueError('Wrong sample index: {} - has to be between 0 and {}' 292 | .format(row_index, len(self.offsets) - 1)) 293 | self.sdb_file.seek(self.offsets[row_index] + INT_SIZE) 294 | for index in range(len(self.schema)): 295 | chunk_len = self.read_int() 296 | if index in columns: 297 | column_data[columns.index(index)] = self.sdb_file.read(chunk_len) 298 | found += 1 299 | if found == len(columns): 300 | return tuple(column_data) 301 | else: 302 | self.sdb_file.seek(chunk_len, 1) 303 | return tuple(column_data) 304 | 305 | def __getitem__(self, i): 306 | audio_data, transcript = self.read_row(i, self.speech_index, self.transcript_index) 307 | transcript = transcript.decode() 308 | sample_id = '{}:{}'.format(self.id_prefix, i) 309 | return LabeledSample(self.audio_type, audio_data, transcript, sample_id=sample_id) 310 | 311 | def __iter__(self): 312 | for i in range(len(self.offsets)): 313 | yield self[i] 314 | 315 | def __len__(self): 316 | return len(self.offsets) 317 | 318 | def 
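# Reading sketch (hypothetical path): SDB objects support len(), indexing and
# iteration; sample ids default to '<filename>:<index>':
sdb = SDB('data/export/train.sdb')
print(len(sdb), 'samples, audio type', sdb.audio_type)
sample = sdb[0]            # a LabeledSample
print(sample.duration, sample.transcript)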
close(self): 319 | if self.sdb_file is not None: 320 | self.sdb_file.close() 321 | 322 | def __del__(self): 323 | self.close() 324 | 325 | 326 | class CSV: 327 | """Sample collection reader for reading a DeepSpeech CSV file""" 328 | def __init__(self, csv_filename): 329 | self.csv_filename = csv_filename 330 | self.rows = [] 331 | csv_dir = Path(csv_filename).parent 332 | with open(csv_filename, 'r') as csv_file: 333 | reader = csv.DictReader(csv_file) 334 | for row in reader: 335 | wav_filename = Path(row['wav_filename']) 336 | if not wav_filename.is_absolute(): 337 | wav_filename = csv_dir / wav_filename 338 | self.rows.append((str(wav_filename), int(row['wav_filesize']), row['transcript'])) 339 | self.rows.sort(key=lambda r: r[1]) 340 | 341 | def __getitem__(self, i): 342 | wav_filename, _, transcript = self.rows[i] 343 | with open(wav_filename, 'rb') as wav_file: 344 | return LabeledSample(AUDIO_TYPE_WAV, wav_file.read(), transcript, sample_id=wav_filename) 345 | 346 | def __iter__(self): 347 | for i in range(len(self.rows)): 348 | yield self[i] 349 | 350 | def __len__(self): 351 | return len(self.rows) 352 | 353 | 354 | def samples_from_file(filename, buffering=BUFFER_SIZE): 355 | """Retrieves the right sample collection reader from a filename""" 356 | ext = os.path.splitext(filename)[1].lower() 357 | if ext == '.sdb': 358 | return SDB(filename, buffering=buffering) 359 | if ext == '.csv': 360 | return CSV(filename) 361 | raise ValueError('Unknown file type: "{}"'.format(ext)) 362 | 363 | 364 | def samples_from_files(filenames, buffering=BUFFER_SIZE): 365 | """Retrieves a (potentially interleaving) sample collection reader from a list of filenames""" 366 | if len(filenames) == 0: 367 | raise ValueError('No files') 368 | if len(filenames) == 1: 369 | return samples_from_file(filenames[0], buffering=buffering) 370 | cols = list(map(partial(samples_from_file, buffering=buffering), filenames)) 371 | return Interleaved(*cols, key=lambda s: s.duration) 372 | -------------------------------------------------------------------------------- /align/sdb_tool.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ''' 3 | Builds Sample Databases (.sdb files) 4 | Use "python3 sdb_tool.py -h" for help 5 | ''' 6 | from __future__ import absolute_import, division, print_function 7 | 8 | # Make sure we can import stuff from util/ 9 | # This script needs to be run from the root of the DeepSpeech repository 10 | import os 11 | import sys 12 | sys.path.insert(1, os.path.join(sys.path[0], '..')) 13 | 14 | import argparse 15 | 16 | from utils import parse_file_size, log_progress 17 | from audio import change_audio_types, AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS 18 | from sample_collections import samples_from_files, DirectSDBWriter, SortingSDBWriter 19 | 20 | AUDIO_TYPE_LOOKUP = { 21 | 'wav': AUDIO_TYPE_WAV, 22 | 'opus': AUDIO_TYPE_OPUS 23 | } 24 | 25 | 26 | def progress(it=None, desc='Processing', total=None): 27 | print(desc, file=sys.stderr, flush=True) 28 | return it if CLI_ARGS.no_progress else log_progress(it, interval=CLI_ARGS.progress_interval, total=total) 29 | 30 | 31 | def add_samples(sdb_writer): 32 | samples = samples_from_files(CLI_ARGS.sources) 33 | for sample in progress(change_audio_types(samples, audio_type=sdb_writer.audio_type, processes=CLI_ARGS.workers), 34 | total=len(samples), 35 | desc='Writing "{}"...'.format(CLI_ARGS.target)): 36 | sdb_writer.add(sample) 37 | 38 | 39 | def build_sdb(): 40 | audio_type = 
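# Mixing sketch (hypothetical paths): samples_from_files() merges several
# collections into one duration-ordered stream via Interleaved, keyed on
# sample duration and assuming each source is itself duration-ordered:
for sample in samples_from_files(['dev.sdb', 'extra.csv']):
    print(sample.sample_id, sample.duration)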
AUDIO_TYPE_LOOKUP[CLI_ARGS.audio_type] 41 | if CLI_ARGS.sort: 42 | with SortingSDBWriter(CLI_ARGS.target, 43 | tmp_sdb_filename=CLI_ARGS.sort_tmp_file, 44 | cache_size=parse_file_size(CLI_ARGS.sort_cache_size), 45 | audio_type=audio_type) as sdb_writer: 46 | add_samples(sdb_writer) 47 | else: 48 | with DirectSDBWriter(CLI_ARGS.target, audio_type=audio_type) as sdb_writer: 49 | add_samples(sdb_writer) 50 | 51 | 52 | def handle_args(): 53 | parser = argparse.ArgumentParser(description='Tool for building Sample Databases (SDB files) ' 54 | 'from DeepSpeech CSV files and other SDB files') 55 | parser.add_argument('--workers', type=int, default=None, help='Number of encoding SDB workers') 56 | parser.add_argument('--audio-type', default='opus', choices=AUDIO_TYPE_LOOKUP.keys(), 57 | help='Audio representation inside target SDB') 58 | parser.add_argument('--sort', action='store_true', help='Force sample sorting by durations ' 59 | '(assumes SDB sources unsorted)') 60 | parser.add_argument('--sort-tmp-file', default=None, help='Overrides default tmp_file (target + ".tmp") ' 61 | 'for sorting through --sort option') 62 | parser.add_argument('--sort-cache-size', default='1GB', help='Cache (bucket) size for binary audio data ' 63 | 'for sorting through --sort option') 64 | parser.add_argument('--no-progress', action="store_true", help='Prevents showing progress indication') 65 | parser.add_argument('--progress-interval', type=float, default=1.0, help='Progress indication interval in seconds') 66 | parser.add_argument('sources', nargs='+', help='Source CSV and/or SDB files') 67 | parser.add_argument('target', help='SDB file to create') 68 | return parser.parse_args() 69 | 70 | 71 | if __name__ == "__main__": 72 | CLI_ARGS = handle_args() 73 | build_sdb() 74 | -------------------------------------------------------------------------------- /align/search.py: -------------------------------------------------------------------------------- 1 | from collections import Counter 2 | from text import ngrams, similarity 3 | 4 | 5 | class FuzzySearch(object): 6 | def __init__(self, 7 | text, 8 | max_candidates=10, 9 | candidate_threshold=0.92, 10 | match_score=100, 11 | mismatch_score=-100, 12 | gap_score=-100, 13 | char_similarities=None): 14 | self.text = text 15 | self.max_candidates = max_candidates 16 | self.candidate_threshold = candidate_threshold 17 | self.match_score = match_score 18 | self.mismatch_score = mismatch_score 19 | self.gap_score = gap_score 20 | self.char_similarities = char_similarities 21 | self.ngrams = {} 22 | for i, ngram in enumerate(ngrams(' ' + text + ' ', 3)): 23 | if ngram in self.ngrams: 24 | ngram_bucket = self.ngrams[ngram] 25 | else: 26 | ngram_bucket = self.ngrams[ngram] = [] 27 | ngram_bucket.append(i) 28 | 29 | @staticmethod 30 | def char_pair(a, b): 31 | if a > b: 32 | a, b = b, a 33 | return '' + a + b 34 | 35 | def char_similarity(self, a, b): 36 | key = FuzzySearch.char_pair(a, b) 37 | if self.char_similarities and key in self.char_similarities: 38 | return self.char_similarities[key] 39 | return self.match_score if a == b else self.mismatch_score 40 | 41 | def sw_align(self, a, start, end): 42 | b = self.text[start:end] 43 | n, m = len(a), len(b) 44 | # building scoring matrix 45 | f = [[0]] * (n + 1) 46 | for i in range(0, n + 1): 47 | f[i] = [0] * (m + 1) 48 | for i in range(1, n + 1): 49 | f[i][0] = self.gap_score * i 50 | for j in range(1, m + 1): 51 | f[0][j] = self.gap_score * j 52 | max_score = 0 53 | start_i, start_j = 0, 0 54 | for i in range(1, n + 1): 55 | 
for j in range(1, m + 1): 56 | match = f[i - 1][j - 1] + self.char_similarity(a[i - 1], b[j - 1]) 57 | insert = f[i][j - 1] + self.gap_score 58 | delete = f[i - 1][j] + self.gap_score 59 | score = max(0, match, insert, delete) 60 | f[i][j] = score 61 | if score > max_score: 62 | max_score = score 63 | start_i, start_j = i, j 64 | # backtracking 65 | substitutions = Counter() 66 | i, j = start_i, start_j 67 | while (j > 0 or i > 0) and f[i][j] != 0: 68 | if i > 0 and j > 0 and f[i][j] == (f[i - 1][j - 1] + self.char_similarity(a[i - 1], b[j - 1])): 69 | substitutions[FuzzySearch.char_pair(a[i - 1], b[j - 1])] += 1 70 | i, j = i - 1, j - 1 71 | elif i > 0 and f[i][j] == (f[i - 1][j] + self.gap_score): 72 | i -= 1 73 | elif j > 0 and f[i][j] == (f[i][j - 1] + self.gap_score): 74 | j -= 1 75 | else: 76 | raise Exception('Smith–Waterman failure') 77 | align_start = max(start, start + j - 1) 78 | align_end = min(end, start + start_j) 79 | score = f[start_i][start_j] / (self.match_score * max(align_end - align_start, n)) 80 | return align_start, align_end, score, substitutions 81 | 82 | def find_best(self, look_for, start=0, end=-1): 83 | end = len(self.text) if end < 0 else end 84 | if end - start < 2 * len(look_for): 85 | return self.sw_align(look_for, start, end) 86 | window_size = len(look_for) 87 | windows = {} 88 | for i, ngram in enumerate(ngrams(' ' + look_for + ' ', 3)): 89 | if ngram in self.ngrams: 90 | ngram_bucket = self.ngrams[ngram] 91 | for occurrence in ngram_bucket: 92 | if occurrence < start or occurrence > end: 93 | continue 94 | window = occurrence // window_size 95 | windows[window] = (windows[window] + 1) if window in windows else 1 96 | candidate_windows = sorted(windows.keys(), key=lambda w: windows[w], reverse=True) 97 | best = (-1, -1, 0, None) 98 | last_window_grams = 0.1 99 | for window in candidate_windows[:self.max_candidates]: 100 | ngram_factor = (windows[window] / last_window_grams) 101 | if ngram_factor < self.candidate_threshold: 102 | break 103 | last_window_grams = windows[window] 104 | interval_start = max(start, int((window - 1) * window_size)) 105 | interval_end = min(end, int((window + 2) * window_size)) 106 | search_result = self.sw_align(look_for, interval_start, interval_end) 107 | if search_result[2] > best[2]: 108 | best = search_result 109 | return best 110 | -------------------------------------------------------------------------------- /align/stats.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import argparse 5 | from os import path 6 | from pickle import load, dump 7 | from collections import Counter 8 | from datetime import timedelta 9 | from utils import log_progress 10 | 11 | 12 | def fail(message, code=1): 13 | print(message) 14 | exit(code) 15 | 16 | 17 | class AlignmentStatistics: 18 | def __init__(self): 19 | self.top = 100 20 | self.stat_ids = ['wng', 'sws', 'wer', 'cer', 'jaro_winkler', 'editex', 'levenshtein', 'mra', 'hamming'] 21 | self.stats = {} 22 | self.stats_duration = {} 23 | for stat_id in self.stat_ids: 24 | self.stats[stat_id] = Counter() 25 | self.stats_duration[stat_id] = Counter() 26 | 27 | self.total_files = 0 28 | self.total_utterances = 0 29 | self.total_duration = 0 30 | self.total_length = 0 31 | 32 | self.durations = Counter() 33 | self.lengths = Counter() 34 | 35 | self.meta_counters = {} 36 | 37 | @staticmethod 38 | def progress(lst, desc='Processing', total=None): 39 | return lst 40 | 41 | def load_aligned(self, aligned_path): 
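        # Parse one "<name>.aligned" JSON file and accumulate duration, length,
        # per-metric and meta statistics for each utterance it contains.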
42 | self.total_files += 1 43 | with open(aligned_path, 'r') as aligned_file: 44 | utterances = json.loads(aligned_file.read()) 45 | for utterance in utterances: 46 | self.total_utterances += 1 47 | duration = utterance['end'] - utterance['start'] 48 | self.durations[int(duration / 1000)] += 1 49 | self.total_duration += duration 50 | length = utterance['text-end'] - utterance['text-start'] 51 | self.lengths[length] += 1 52 | self.total_length += length 53 | for stat_id in self.stat_ids: 54 | if stat_id in utterance: 55 | self.stats[stat_id][int(utterance[stat_id])] += 1 56 | self.stats_duration[stat_id][int(utterance[stat_id])] += duration 57 | if 'meta' in utterance: 58 | for meta_type, instances in utterance['meta'].items(): 59 | if meta_type not in self.meta_counters: 60 | self.meta_counters[meta_type] = Counter() 61 | for instance in instances: 62 | self.meta_counters[meta_type][instance] += 1 63 | 64 | def load_catalog(self, catalog_path, ignore_missing=True): 65 | catalog = path.abspath(catalog_path) 66 | catalog_dir = path.dirname(catalog) 67 | with open(catalog, 'r') as catalog_file: 68 | catalog_entries = json.load(catalog_file) 69 | for entry in AlignmentStatistics.progress(catalog_entries, desc='Reading catalog'): 70 | aligned_path = entry['aligned'] 71 | if not path.isabs(aligned_path): 72 | aligned_path = path.join(catalog_dir, aligned_path) 73 | if path.isfile(aligned_path): 74 | self.load_aligned(aligned_path) 75 | else: 76 | if ignore_missing: 77 | continue 78 | else: 79 | fail('Problem loading catalog "{}": Missing referenced alignment file "{}"' 80 | .format(catalog_path, aligned_path)) 81 | 82 | def print_stats(self): 83 | print('Total number of files: {:,}'.format(self.total_files)) 84 | print('') 85 | print('Total number of utterances: {:,}'.format(self.total_utterances)) 86 | print('') 87 | print('Total aligned utterance character length: {:,}'.format(self.total_length)) 88 | print('') 89 | print('Total utterance duration: {} ({:,} hours)'.format( 90 | timedelta(milliseconds=self.total_duration), 91 | int(self.total_duration / (1000 * 60 * 60)))) 92 | print('') 93 | 94 | for meta_type, counter in self.meta_counters.items(): 95 | print('Overall number of instances of meta type "{}": {:,}'.format(meta_type, len(counter.keys()))) 96 | print('') 97 | print('{} most frequent "{}" instances:'.format(self.top, meta_type)) 98 | for value, count in counter.most_common(self.top): 99 | print(value.ljust(20) + '{:,}'.format(count).rjust(12)) 100 | 101 | for stat_id in self.stat_ids: 102 | counter = self.stats_duration[stat_id] 103 | if len(counter) == 0: 104 | continue 105 | print('') 106 | print(stat_id.upper() + ' (hours):') 107 | above = 0 108 | for value in sorted(counter): 109 | count = counter[value] / (60 * 60 * 1000) 110 | if value <= 100: 111 | print(str(value).ljust(10) + '{:12.2f}'.format(count).rjust(12)) 112 | else: 113 | above += count 114 | if above > 0: 115 | print('100+'.ljust(10) + '{:12.2f}'.format(above).rjust(12)) 116 | 117 | 118 | def main(args): 119 | parser = argparse.ArgumentParser(description='Print statistics about aligned speech samples.') 120 | 121 | parser.add_argument('--cache', type=str, 122 | help='Use provided file as statistics cache (if existing, all other input options are ignored)') 123 | parser.add_argument('--aligned', type=str, action='append', 124 | help='Read alignment file ("<...>.aligned") as input') 125 | parser.add_argument('--catalog', type=str, action='append', 126 | help='Read alignment references of provided catalog ("<...>.catalog") as input') 127 | 
parser.add_argument('--no-progress', action='store_true', 128 | help='Prevents showing progress bars') 129 | parser.add_argument('--progress-interval', type=float, default=1.0, 130 | help='Progress indication interval in seconds') 131 | 132 | args = parser.parse_args() 133 | 134 | def progress(it=None, desc='Processing', total=None): 135 | print(desc) 136 | return it if args.no_progress else log_progress(it, interval=args.progress_interval, total=total) 137 | AlignmentStatistics.progress = progress 138 | 139 | if args.cache is not None and path.exists(args.cache): 140 | with open(args.cache, 'rb') as stats_file: 141 | stats = load(stats_file) 142 | else: 143 | stats = AlignmentStatistics() 144 | if args.catalog is not None: 145 | for catalog_path in args.catalog: 146 | stats.load_catalog(catalog_path, ignore_missing=True) 147 | if args.aligned is not None: 148 | for aligned_path in args.aligned: 149 | stats.load_aligned(aligned_path) 150 | if args.cache is not None: 151 | with open(args.cache, 'wb') as stats_file: 152 | dump(stats, stats_file) 153 | 154 | stats.print_stats() 155 | 156 | 157 | if __name__ == '__main__': 158 | main(sys.argv[1:]) 159 | os.system('stty sane') 160 | -------------------------------------------------------------------------------- /align/text.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import codecs 4 | from six.moves import range 5 | from collections import Counter 6 | from utils import enweight 7 | 8 | 9 | class Alphabet(object): 10 | def __init__(self, config_file): 11 | self._config_file = config_file 12 | self._label_to_str = [] 13 | self._str_to_label = {} 14 | self._size = 0 15 | with codecs.open(config_file, 'r', 'utf-8') as fin: 16 | for line in fin: 17 | if line[0:2] == '\\#': 18 | line = '#\n' 19 | elif line[0] == '#': 20 | continue 21 | self._label_to_str += line[:-1] # remove the line ending 22 | self._str_to_label[line[:-1]] = self._size 23 | self._size += 1 24 | 25 | def string_from_label(self, label): 26 | return self._label_to_str[label] 27 | 28 | def has_label(self, string): 29 | return string in self._str_to_label 30 | 31 | def label_from_string(self, string): 32 | try: 33 | return self._str_to_label[string] 34 | except KeyError as e: 35 | raise KeyError( 36 | '''ERROR: Your transcripts contain characters which do not occur in data/alphabet.txt! 
Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.''' 37 | ).with_traceback(e.__traceback__) 38 | 39 | def decode(self, labels): 40 | res = '' 41 | for label in labels: 42 | res += self.string_from_label(label) 43 | return res 44 | 45 | def size(self): 46 | return self._size 47 | 48 | def config_file(self): 49 | return self._config_file 50 | 51 | 52 | class TextCleaner(object): 53 | def __init__(self, alphabet, to_lower=True, normalize_space=True, dashes_to_ws=True): 54 | self.alphabet = alphabet 55 | self.to_lower = to_lower 56 | self.normalize_space = normalize_space 57 | self.dashes_to_ws = dashes_to_ws 58 | self.original_text = '' 59 | self.clean_text = '' 60 | self.positions = [] 61 | self.meta = [] 62 | 63 | def add_original_text(self, original_text, meta=None): 64 | if len(self.positions) > 0: 65 | self.clean_text += ' ' 66 | self.original_text += ' ' 67 | self.positions.append(len(self.original_text) - 1) 68 | self.meta.append(None) 69 | ws = True 70 | else: 71 | ws = False 72 | cleaned = [] 73 | prepared_text = original_text.lower() if self.to_lower else original_text 74 | for position, c in enumerate(prepared_text): 75 | if self.dashes_to_ws and c == '-' and not self.alphabet.has_label('-'): 76 | c = ' ' 77 | if self.normalize_space and c.isspace(): 78 | if ws: 79 | continue 80 | else: 81 | ws = True 82 | c = ' ' 83 | if not self.alphabet.has_label(c): 84 | continue 85 | if not c.isspace(): 86 | ws = False 87 | cleaned.append(c) 88 | self.positions.append(len(self.original_text) + position) 89 | self.meta.append(meta) 90 | self.original_text += original_text 91 | self.clean_text += ''.join(cleaned) 92 | 93 | def get_original_offset(self, clean_offset): 94 | if clean_offset == len(self.positions): 95 | return self.positions[-1] + 1 96 | return self.positions[clean_offset] 97 | 98 | def collect_meta(self, from_clean_offset, to_clean_offset=None): 99 | if to_clean_offset is None: 100 | return self.meta[from_clean_offset] 101 | metas = [] 102 | for meta in self.meta[from_clean_offset:to_clean_offset + 1]: 103 | if meta is not None and meta not in metas: 104 | metas.append(meta) 105 | return metas 106 | 107 | 108 | class TextRange(object): 109 | def __init__(self, document, start, end): 110 | self.document = document 111 | self.start = start 112 | self.end = end 113 | 114 | @staticmethod 115 | def token_at(text, position): 116 | start = len(text) 117 | end = 0 118 | for step in [-1, 1]: 119 | pos = position 120 | while 0 <= pos < len(text) and not text[pos].isspace(): 121 | if pos < start: 122 | start = pos 123 | if pos > end: 124 | end = pos 125 | pos += step 126 | return TextRange(text, start, end + 1) if start <= end else TextRange(text, position, position) 127 | 128 | def neighbour_token(self, direction): 129 | return TextRange.token_at(self.document, self.start - 2 if direction < 0 else self.end + 1) 130 | 131 | def next_token(self): 132 | return self.neighbour_token(1) 133 | 134 | def prev_token(self): 135 | return self.neighbour_token(-1) 136 | 137 | def get_text(self): 138 | return self.document[self.start:self.end] 139 | 140 | def __add__(self, other): 141 | if not self.document == other.document: 142 | raise Exception("Unable to add token from other string") 143 | return TextRange(self.document, min(self.start, other.start), max(self.end, other.end)) 144 | 145 | def __eq__(self, other): 146 | return self.document == other.document and self.start == other.start and self.end == 
other.end 147 | 148 | def __len__(self): 149 | return self.end-self.start 150 | 151 | 152 | def ngrams(s, size): 153 | """ 154 | Lists all appearances of all N-grams of a string from left to right. 155 | :param s: String to decompose 156 | :param size: N-gram size 157 | :return: Produces strings representing all N-grams 158 | """ 159 | window = len(s) - size 160 | if window < 1 or size < 1: 161 | if window == 0: 162 | yield s 163 | return 164 | for i in range(0, window + 1): 165 | yield s[i:i + size] 166 | 167 | 168 | def weighted_ngrams(s, size, direction=0): 169 | """ 170 | Lists all appearances of all N-grams of a string from left to right together with a positional weight value. 171 | The positional weight progresses quadratically. 172 | :param s: String to decompose 173 | :param size: N-gram size 174 | :param direction: Order of assigning positional weights to N-grams: 175 | direction < 0: Weight of first N-gram is 1.0 and of last one 0.0 176 | direction > 0: Weight of first N-gram is 0.0 and of last one 1.0 177 | direction == 0: Weight of center N-gram(s) near or equal 0, weight of first and last N-gram 1.0 178 | :return: Produces (string, float) tuples representing the N-gram along with its assigned positional weight value 179 | """ 180 | return enweight(ngrams(s, size), direction=direction) 181 | 182 | 183 | def similarity(a, b, direction=0, min_ngram_size=1, max_ngram_size=3, size_factor=1, position_factor=1): 184 | """ 185 | Computes similarity value of two strings ranging from 0.0 (completely different) to 1.0 (completely equal). 186 | Counts intersection of weighted N-gram sets of both strings. 187 | :param a: String to compare 188 | :param b: String to compare 189 | :param direction: Order of equality importance: 190 | direction < 0: Left ends of strings more important to be similar 191 | direction > 0: Right ends of strings more important to be similar 192 | direction == 0: Left and right ends more important to be similar than center parts 193 | :param min_ngram_size: Minimum N-gram size to take into account 194 | :param max_ngram_size: Maximum N-gram size to take into account 195 | :param size_factor: Importance factor of the N-gram size (compared to the positional importance). 196 | :param position_factor: Importance factor of the N-gram position (compared to the size importance) 197 | :return: Number between 0.0 (completely different) and 1.0 (completely equal) 198 | """ 199 | if len(a) < len(b): 200 | a, b = b, a 201 | ca, cb = Counter(), Counter() 202 | for s, c in [(a, ca), (b, cb)]: 203 | for size in range(min_ngram_size, max_ngram_size + 1): 204 | for ng, position_weight in weighted_ngrams(s, size, direction=direction): 205 | c[ng] += size * size_factor + position_weight * position_weight * position_factor 206 | score = 0 207 | for key in set(ca.keys()) & set(cb.keys()): 208 | score += min(ca[key], cb[key]) 209 | return score / sum(ca.values()) 210 | 211 | 212 | # The following code is from: http://hetland.org/coding/python/levenshtein.py 213 | 214 | # This is a straightforward implementation of a well-known algorithm, and thus 215 | # probably shouldn't be covered by copyright to begin with. But in case it is, 216 | # the author (Magnus Lie Hetland) has, to the extent possible under law, 217 | # dedicated all copyright and related and neighboring rights to this software 218 | # to the public domain worldwide, by distributing it under the CC0 license, 219 | # version 1.0. This software is distributed without any warranty. 
For more 220 | # information, see <http://creativecommons.org/publicdomain/zero/1.0/> 221 | 222 | def levenshtein(a, b): 223 | """ 224 | Calculates the Levenshtein distance between a and b. 225 | """ 226 | n, m = len(a), len(b) 227 | if n > m: 228 | # Make sure n <= m, to use O(min(n,m)) space 229 | a, b = b, a 230 | n, m = m, n 231 | 232 | current = list(range(n+1)) 233 | for i in range(1, m+1): 234 | previous, current = current, [i]+[0]*n 235 | for j in range(1, n+1): 236 | add, delete = previous[j]+1, current[j-1]+1 237 | change = previous[j-1] 238 | if a[j-1] != b[i-1]: 239 | change = change + 1 240 | current[j] = min(add, delete, change) 241 | 242 | return current[n] 243 | -------------------------------------------------------------------------------- /align/utils.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import sys 4 | import time 5 | import heapq 6 | 7 | from multiprocessing.dummy import Pool as ThreadPool 8 | 9 | KILO = 1024 10 | KILOBYTE = 1 * KILO 11 | MEGABYTE = KILO * KILOBYTE 12 | GIGABYTE = KILO * MEGABYTE 13 | TERABYTE = KILO * GIGABYTE 14 | SIZE_PREFIX_LOOKUP = {'k': KILOBYTE, 'm': MEGABYTE, 'g': GIGABYTE, 't': TERABYTE} 15 | 16 | 17 | def parse_file_size(file_size): 18 | file_size = file_size.lower().strip() 19 | if len(file_size) == 0: 20 | return 0 21 | n = int(keep_only_digits(file_size)) 22 | if file_size[-1] == 'b': 23 | file_size = file_size[:-1] 24 | e = file_size[-1] 25 | return SIZE_PREFIX_LOOKUP[e] * n if e in SIZE_PREFIX_LOOKUP else n 26 | 27 | 28 | def keep_only_digits(txt): 29 | return ''.join(filter(str.isdigit, txt)) 30 | 31 | 32 | def secs_to_hours(secs): 33 | hours, remainder = divmod(secs, 3600) 34 | minutes, seconds = divmod(remainder, 60) 35 | return '%02d:%02d:%02d' % (hours, minutes, seconds) 36 | 37 | 38 | def log_progress(it, total=None, interval=60.0, step=None, entity='it', file=sys.stderr): 39 | if total is None and hasattr(it, '__len__'): 40 | total = len(it) 41 | if total is None: 42 | line_format = ' {:8d} (elapsed: {}, speed: {:.2f} {}/{})' 43 | else: 44 | line_format = ' {:' + str(len(str(total))) + 'd} of {} : {:6.2f}% (elapsed: {}, speed: {:.2f} {}/{}, ETA: {})' 45 | 46 | overall_start = time.time() 47 | interval_start = overall_start 48 | interval_steps = 0 49 | 50 | def print_interval(steps, time_now): 51 | elapsed = time_now - overall_start 52 | elapsed_str = secs_to_hours(elapsed) 53 | speed_unit = 's' 54 | interval_duration = time_now - interval_start 55 | print_speed = speed = interval_steps / (0.001 if interval_duration == 0.0 else interval_duration) 56 | if print_speed < 0.1: 57 | print_speed = print_speed * 60 58 | speed_unit = 'm' 59 | if print_speed < 1: 60 | print_speed = print_speed * 60 61 | speed_unit = 'h' 62 | elif print_speed > 1000: 63 | print_speed = print_speed / 1000.0 64 | speed_unit = 'ms' 65 | if total is None: 66 | line = line_format.format(global_step, elapsed_str, print_speed, entity, speed_unit) 67 | else: 68 | percent = global_step * 100.0 / total 69 | eta = secs_to_hours(((total - global_step) / speed) if speed > 0 else 0) 70 | line = line_format.format(global_step, total, percent, elapsed_str, print_speed, entity, speed_unit, eta) 71 | print(line, file=file, flush=True) 72 | 73 | for global_step, obj in enumerate(it, 1): 74 | interval_steps += 1 75 | yield obj 76 | t = time.time() 77 | if (step is None and t - interval_start > interval) or (step is not None and interval_steps >= step): 78 | print_interval(interval_steps, t) 79 | interval_steps = 0 80 | interval_start = t 81 | if interval_steps > 0:
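        # Flush one final progress line for the remainder that did not fill a
        # whole interval/step before the iterable was exhausted.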
82 | print_interval(interval_steps, time.time()) 83 | 84 | 85 | def circulate(items, center=None): 86 | count = len(list(items)) 87 | if count > 0: 88 | if center is None: 89 | center = count // 2 90 | center = min(max(center, 0), count - 1) 91 | yield center, items[center] 92 | for i in range(1, count): 93 | 94 | if center + i < count: 95 | yield center + i, items[center + i] 96 | if center - i >= 0: 97 | yield center - i, items[center - i] 98 | 99 | 100 | def by_len(items): 101 | indexed = list(enumerate(items)) 102 | return sorted(indexed, key=lambda e: len(e[1]), reverse=True) 103 | 104 | 105 | def enweight(items, direction=0): 106 | """ 107 | Enumerates all entries together with a positional weight value. 108 | The positional weight progresses quadratically. 109 | :param items: Items to enumerate 110 | :param direction: Order of assigning positional weights to N-grams: 111 | direction < 0: Weight of first N-gram is 1.0 and of last one 0.0 112 | direction > 0: Weight of first N-gram is 0.0 and of last one 1.0 113 | direction == 0: Weight of center N-gram(s) near or equal 0, weight of first and last N-gram 1.0 114 | :return: Produces (object, float) tuples representing the enumerated item 115 | along with its assigned positional weight value 116 | """ 117 | items = list(items) 118 | direction = -1 if direction < 0 else (1 if direction > 0 else 0) 119 | n = len(items) - 1 120 | if n < 1: 121 | if n == 0: 122 | yield items[0], 1 123 | return  # PEP 479: use a bare return, not StopIteration, to end a generator 124 | for i, item in enumerate(items): 125 | c = (i + n * (direction - 1) / 2) / n 126 | yield item, c * c * (4 - abs(direction) * 3) 127 | 128 | 129 | def greedy_minimum_search(a, b, compute, result_a=None, result_b=None): 130 | if a > b: 131 | a, b = b, a 132 | result_a, result_b = result_b, result_a 133 | if a == b: 134 | return result_a or result_b or compute(a) 135 | result_a = result_a or compute(a) 136 | result_b = result_b or compute(b) 137 | if b == a+1: 138 | return result_a if result_a[0] < result_b[0] else result_b 139 | c = (a+b) // 2 140 | if result_a[0] < result_b[0]: 141 | return greedy_minimum_search(a, c, compute, result_a=result_a) 142 | else: 143 | return greedy_minimum_search(c, b, compute, result_b=result_b) 144 | 145 | 146 | class Interleaved: 147 | """Collection that lazily combines sorted collections in an interleaving fashion. 148 | During iteration the next smallest element from all the sorted collections is always picked. 149 | The collections must support iter() and len().""" 150 | def __init__(self, *iterables, key=lambda obj: obj): 151 | self.iterables = iterables 152 | self.key = key 153 | self.len = sum(map(len, iterables)) 154 | 155 | def __iter__(self): 156 | return heapq.merge(*self.iterables, key=self.key) 157 | 158 | def __len__(self): 159 | return self.len 160 | 161 | 162 | class LimitingPool: 163 | """Limits unbounded ahead-processing of multiprocessing.Pool's imap method 164 | before items get consumed by the iteration caller.
165 | This prevents OOM issues in situations where items represent larger memory allocations.""" 166 | def __init__(self, processes=None, limit_factor=2, sleeping_for=0.1): 167 | self.processes = os.cpu_count() if processes is None else processes 168 | self.pool = ThreadPool(processes=processes) 169 | self.sleeping_for = sleeping_for 170 | self.max_ahead = self.processes * limit_factor 171 | self.processed = 0 172 | 173 | def __enter__(self): 174 | return self 175 | 176 | def limit(self, it): 177 | for obj in it: 178 | while self.processed >= self.max_ahead: 179 | time.sleep(self.sleeping_for) 180 | self.processed += 1 181 | yield obj 182 | 183 | def map(self, fun, it): 184 | for obj in self.pool.imap(fun, self.limit(it)): 185 | self.processed -= 1 186 | yield obj 187 | 188 | def __exit__(self, exc_type, exc_value, traceback): 189 | self.pool.close() 190 | -------------------------------------------------------------------------------- /bin/align.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/align.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/catalog_tool.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/catalog_tool.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/createenv.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | python3 -m venv venv 4 | source venv/bin/activate 5 | pip install -r requirements.txt -------------------------------------------------------------------------------- /bin/export.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/export.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/getmodel.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | version="0.7.1" 4 | dir="deepspeech-${version}-models" 5 | am="${dir}.pbmm" 6 | scorer="${dir}.scorer" 7 | 8 | mkdir -p models/en 9 | cd models/en 10 | 11 | if [[ ! -f $am ]] ; then 12 | wget "https://github.com/mozilla/DeepSpeech/releases/download/v${version}/${am}" 13 | fi 14 | 15 | if [[ ! -f $scorer ]] ; then 16 | wget "https://github.com/mozilla/DeepSpeech/releases/download/v${version}/${scorer}" 17 | fi 18 | 19 | if [[ ! -f "alphabet.txt" ]] ; then 20 | wget "https://raw.githubusercontent.com/mozilla/DeepSpeech/master/data/alphabet.txt" 21 | fi 22 | -------------------------------------------------------------------------------- /bin/gettestdata.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | bin=`pwd`/bin 4 | 5 | mkpush () { 6 | mkdir -p data/$1 7 | pushd data/$1 8 | $1 9 | popd 10 | } 11 | 12 | cwget () { 13 | url=$1 14 | file="${url##*/}" 15 | if [ ! 
-f $file ]; then 16 | wget $url 17 | fi 18 | } 19 | 20 | test1 () { 21 | cwget https://ia802607.us.archive.org/14/items/artfiction00jamegoog/artfiction00jamegoog_djvu.txt 22 | cp artfiction00jamegoog_djvu.txt transcript.txt 23 | cwget http://www.archive.org/download/art_of_fiction_jvw_librivox/art_of_fiction_jvw_librivox_64kb_mp3.zip 24 | unzip -o art_of_fiction_jvw_librivox_64kb_mp3.zip 25 | if [ ! -f "audio.wav" ]; then 26 | cat *.mp3 >joined.mp3 27 | ffmpeg -y -i joined.mp3 -ar 16000 -ac 1 audio.wav 28 | fi 29 | } 30 | 31 | test2 () { 32 | cwget https://www.ibiblio.org/xml/examples/shakespeare/as_you.xml 33 | python "$bin/play2script.py" script as_you.xml transcript.script 34 | python "$bin/play2script.py" lines as_you.xml transcript-lines.txt 35 | python "$bin/play2script.py" plain as_you.xml transcript-plain.txt 36 | cwget http://www.archive.org/download/as_you_like_it_0902_librivox/as_you_like_it_0902_librivox_64kb_mp3.zip 37 | unzip -o as_you_like_it_0902_librivox_64kb_mp3.zip 38 | if [ ! -f "audio.wav" ]; then 39 | cat *.mp3 >joined.mp3 40 | ffmpeg -y -i joined.mp3 -ar 16000 -ac 1 audio.wav 41 | fi 42 | } 43 | 44 | mkpush test1 45 | mkpush test2 46 | -------------------------------------------------------------------------------- /bin/lm-dependencies.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | basedir="$(pwd)" 4 | 5 | mkdir -p dependencies 6 | pushd dependencies 7 | 8 | 9 | wget -N https://kheafield.com/code/kenlm.tar.gz 10 | tar -xzvf kenlm.tar.gz 11 | pushd kenlm 12 | 13 | mkdir -p build 14 | pushd build 15 | cmake .. 16 | make -j 4 17 | popd 18 | 19 | popd 20 | 21 | 22 | source $basedir/venv/bin/activate 23 | mkdir -p deepspeech 24 | pushd deepspeech 25 | python $basedir/bin/taskcluster.py --target . 
--branch v0.7.1 -------------------------------------------------------------------------------- /bin/meta.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/meta.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/play2script.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | from xml.dom import minidom 4 | 5 | 6 | def fail(): 7 | print('Usage: play2script.py (script|plain|lines) ') 8 | exit(1) 9 | 10 | 11 | def get_text(elements): 12 | return ' '.join(map(lambda element: ' '.join(t.nodeValue.strip() for t in 13 | element.childNodes if t.nodeType == t.TEXT_NODE), 14 | elements)) 15 | 16 | 17 | def main(args): 18 | if len(args) != 3: 19 | fail() 20 | dom = minidom.parse(args[1]) 21 | if args[0] == 'script': 22 | script = [] 23 | for speech in dom.getElementsByTagName('SPEECH'): 24 | speaker = get_text(speech.getElementsByTagName('SPEAKER')) 25 | speaker = ' '.join(map(lambda p: p[0] + p[1:].lower(), speaker.split(' '))) 26 | text = get_text(speech.getElementsByTagName('LINE')) 27 | script.append({ 28 | 'speaker': speaker, 29 | 'text': text 30 | }) 31 | with open(args[2], 'w') as script_file: 32 | script_file.write(json.dumps(script)) 33 | elif args[0] in ['plain', 'lines']: 34 | with open(args[2], 'w') as script_file: 35 | for speech in dom.getElementsByTagName('SPEECH'): 36 | text = get_text(speech.getElementsByTagName('LINE')) 37 | script_file.write(text + (' ' if args[0] == 'plain' else '\n')) 38 | else: 39 | print('Unknown output specifier') 40 | fail() 41 | 42 | 43 | if __name__ == '__main__': 44 | main(sys.argv[1:]) 45 | -------------------------------------------------------------------------------- /bin/sdb_tool.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/sdb_tool.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/statistics.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/stats.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/taskcluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | from __future__ import print_function, absolute_import, division 4 | 5 | import argparse 6 | import platform 7 | import subprocess 8 | import sys 9 | import os 10 | import errno 11 | import stat 12 | 13 | import six.moves.urllib as urllib 14 | 15 | from pkg_resources import parse_version 16 | 17 | 18 | DEFAULT_SCHEMES = { 19 | 'deepspeech': 'https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.deepspeech.native_client.%(branch_name)s.%(arch_string)s/artifacts/public/%(artifact_name)s', 20 | 'tensorflow': 'https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.%(branch_name)s.%(arch_string)s/artifacts/public/%(artifact_name)s' 21 | } 22 | 23 | TASKCLUSTER_SCHEME = 
os.getenv('TASKCLUSTER_SCHEME', DEFAULT_SCHEMES['deepspeech']) 24 | 25 | def get_tc_url(arch_string, artifact_name='native_client.tar.xz', branch_name='master'): 26 | assert arch_string is not None 27 | assert artifact_name is not None 28 | assert artifact_name 29 | assert branch_name is not None 30 | assert branch_name 31 | 32 | return TASKCLUSTER_SCHEME % { 'arch_string': arch_string, 'artifact_name': artifact_name, 'branch_name': branch_name} 33 | 34 | def maybe_download_tc(target_dir, tc_url, progress=True): 35 | def report_progress(count, block_size, total_size): 36 | percent = (count * block_size * 100) // total_size 37 | sys.stdout.write("\rDownloading: %d%%" % percent) 38 | sys.stdout.flush() 39 | 40 | if percent >= 100: 41 | print('\n') 42 | 43 | assert target_dir is not None 44 | 45 | target_dir = os.path.abspath(target_dir) 46 | try: 47 | os.makedirs(target_dir) 48 | except OSError as e: 49 | if e.errno != errno.EEXIST: 50 | raise e 51 | assert os.path.isdir(os.path.dirname(target_dir)) 52 | 53 | tc_filename = os.path.basename(tc_url) 54 | target_file = os.path.join(target_dir, tc_filename) 55 | if not os.path.isfile(target_file): 56 | print('Downloading %s ...' % tc_url) 57 | urllib.request.urlretrieve(tc_url, target_file, reporthook=(report_progress if progress else None)) 58 | else: 59 | print('File already exists: %s' % target_file) 60 | 61 | return target_file 62 | 63 | def maybe_download_tc_bin(**kwargs): 64 | final_file = maybe_download_tc(kwargs['target_dir'], kwargs['tc_url'], kwargs['progress']) 65 | final_stat = os.stat(final_file) 66 | os.chmod(final_file, final_stat.st_mode | stat.S_IEXEC) 67 | 68 | def read(fname): 69 | return open(os.path.join(os.path.dirname(__file__), fname)).read() 70 | 71 | def main(): 72 | parser = argparse.ArgumentParser(description='Tooling to ease downloading of components from TaskCluster.') 73 | parser.add_argument('--target', required=False, 74 | help='Where to put the native client binary files') 75 | parser.add_argument('--arch', required=False, 76 | help='Which architecture to download binaries for. "arm" for ARM 7 (32-bit), "arm64" for ARM64, "gpu" for CUDA enabled x86_64 binaries, "cpu" for CPU-only x86_64 binaries, "osx" for CPU-only x86_64 OSX binaries. Optional ("cpu" by default)') 77 | parser.add_argument('--artifact', required=False, 78 | default='native_client.tar.xz', 79 | help='Name of the artifact to download. Defaults to "native_client.tar.xz"') 80 | parser.add_argument('--source', required=False, default=None, 81 | help='Name of the TaskCluster scheme to use.') 82 | parser.add_argument('--branch', required=False, 83 | help='Branch name to use. 
Defaulting to current content of VERSION file.') 84 | parser.add_argument('--decoder', action='store_true', 85 | help='Get URL to ds_ctcdecoder Python package.') 86 | 87 | args = parser.parse_args() 88 | 89 | if not args.target and not args.decoder: 90 | print('Pass either --target or --decoder.') 91 | exit(1) 92 | 93 | is_arm = 'arm' in platform.machine() 94 | is_mac = 'darwin' in sys.platform 95 | is_64bit = sys.maxsize > (2**31 - 1) 96 | is_ucs2 = sys.maxunicode < 0x10ffff 97 | 98 | if not args.arch: 99 | if is_arm: 100 | args.arch = 'arm64' if is_64bit else 'arm' 101 | elif is_mac: 102 | args.arch = 'osx' 103 | else: 104 | args.arch = 'cpu' 105 | 106 | if not args.branch: 107 | version_string = read('../VERSION').strip() 108 | ds_version = parse_version(version_string) 109 | args.branch = "v{}".format(version_string) 110 | else: 111 | ds_version = args.branch.lstrip('v') 112 | 113 | if args.decoder: 114 | plat = platform.system().lower() 115 | arch = platform.machine() 116 | 117 | if plat == 'linux' and arch == 'x86_64': 118 | plat = 'manylinux1' 119 | 120 | if plat == 'darwin': 121 | plat = 'macosx_10_10' 122 | 123 | m_or_mu = 'mu' if is_ucs2 else 'm' 124 | pyver = ''.join(map(str, sys.version_info[0:2])) 125 | 126 | artifact = "ds_ctcdecoder-{ds_version}-cp{pyver}-cp{pyver}{m_or_mu}-{platform}_{arch}.whl".format( 127 | ds_version=ds_version, 128 | pyver=pyver, 129 | m_or_mu=m_or_mu, 130 | platform=plat, 131 | arch=arch 132 | ) 133 | 134 | ctc_arch = args.arch + '-ctc' 135 | 136 | print(get_tc_url(ctc_arch, artifact, args.branch)) 137 | exit(0) 138 | 139 | if args.source is not None: 140 | if args.source in DEFAULT_SCHEMES: 141 | global TASKCLUSTER_SCHEME 142 | TASKCLUSTER_SCHEME = DEFAULT_SCHEMES[args.source] 143 | else: 144 | print('No such scheme: %s' % args.source) 145 | exit(1) 146 | 147 | maybe_download_tc(target_dir=args.target, tc_url=get_tc_url(args.arch, args.artifact, args.branch)) 148 | 149 | if '.tar.' 
in args.artifact: 150 | subprocess.check_call(['tar', 'xvf', os.path.join(args.target, args.artifact), '-C', args.target]) 151 | 152 | if __name__ == '__main__': 153 | main() 154 | -------------------------------------------------------------------------------- /data/all-wav.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/audio.wav", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | }, 8 | { 9 | "audio": "test2/audio.wav", 10 | "tlog": "test2/joined.tlog", 11 | "script": "test2/transcript.script", 12 | "aligned": "test2/aligned.json" 13 | } 14 | ] 15 | -------------------------------------------------------------------------------- /data/all.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/joined.mp3", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | }, 8 | { 9 | "audio": "test2/joined.mp3", 10 | "tlog": "test2/joined.tlog", 11 | "script": "test2/transcript.script", 12 | "aligned": "test2/aligned.json" 13 | } 14 | ] 15 | -------------------------------------------------------------------------------- /data/test1.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/joined.mp3", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | } 8 | ] 9 | -------------------------------------------------------------------------------- /data/test2.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test2/joined.mp3", 4 | "tlog": "test2/joined.tlog", 5 | "script": "test2/transcript.script", 6 | "aligned": "test2/aligned.json" 7 | } 8 | ] 9 | -------------------------------------------------------------------------------- /doc/algo.md: -------------------------------------------------------------------------------- 1 | ## Alignment algorithm and its parameters 2 | 3 | ### Step 1 - Splitting audio 4 | 5 | A voice activity detector (at the moment this is `webrtcvad`) is used 6 | to split the provided audio data into voice fragments. 7 | These fragments are essentially streams of continuous speech without any longer pauses 8 | (e.g. sentences). 9 | 10 | `--audio-vad-aggressiveness <level>` can be used to influence the length of the 11 | resulting fragments. 12 | 13 | ### Step 2 - Preparation of original text 14 | 15 | STT transcripts are typically provided in a normalized textual form with 16 | - no casing, 17 | - no punctuation and 18 | - normalized whitespace (single spaces only). 19 | 20 | To be able to align STT transcripts with the original text, it is necessary 21 | to internally convert the original text into the same form. 22 | 23 | This happens in two steps: 24 | 1. Normalization of whitespace, lower-casing all text and 25 | replacing some characters with spaces (e.g. dashes) 26 | 2. Removal of all characters that are not in the language's alphabet 27 | (see DeepSpeech model data) 28 | 29 | Be aware: *This conversion happens on a purely textual basis and will not remove unspoken content 30 | like markup/markdown tags or artifacts. This should be done beforehand. 31 | Reducing the difference between spoken and original text will improve alignment quality and speed.*
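For illustration, here is a stripped-down sketch of this cleaning (the actual implementation is the `TextCleaner` class in `align/text.py`, which additionally keeps a mapping from cleaned-text offsets back to original-text offsets; the function below is a simplified stand-in):

```python
# Simplified sketch of the step 1 + 2 text preparation - see align/text.py
# (TextCleaner) for the real implementation, which also tracks offsets.
def clean_text(text, alphabet, to_lower=True, dashes_to_ws=True):
    text = text.lower() if to_lower else text
    cleaned, ws = [], True  # ws: last emitted character was whitespace
    for c in text:
        if dashes_to_ws and c == '-' and '-' not in alphabet:
            c = ' '  # step 1: replace dashes with spaces
        if c.isspace():
            if ws:
                continue  # step 1: collapse whitespace runs to single spaces
            ws, c = True, ' '
        elif c in alphabet:
            ws = False
        else:
            continue  # step 2: drop characters outside the alphabet
        cleaned.append(c)
    return ''.join(cleaned).strip()

# e.g. clean_text('Hello -- World!', set('abcdefghijklmnopqrstuvwxyz '))
# -> 'hello world'
```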
32 | 33 | In the very unlikely situation that you have to change the default behavior (of step 1), 34 | there are some switches: 35 | 36 | `--text-keep-dashes` will prevent substitution of dashes with spaces. 37 | 38 | `--text-keep-ws` will keep whitespace untouched. 39 | 40 | `--text-keep-casing` will keep character casing as provided. 41 | 42 | ### Step 3 (optional) - Generating a document-specific language model 43 | 44 | If the [dependencies](lm.md) for 45 | individual language model generation are installed, such a per-document model will 46 | now be generated by default. 47 | 48 | Assuming your text document is named `original.txt`, these files will be generated: 49 | - `original.txt.clean` - cleaned version of the original text 50 | - `original.txt.arpa` - text file with probabilities in ARPA format 51 | - `original.txt.lm` - binary representation of the former one 52 | - `original.txt.trie` - prefix-tree optimized for probability lookup 53 | 54 | `--stt-no-own-lm` deactivates creation of individual language models per document and 55 | uses the one from the model directory instead. 56 | 57 | ### Step 4 - Transcription of voice fragments through STT 58 | 59 | After VAD splitting, the resulting fragments are transcribed into textual phrases. 60 | This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT. 61 | 62 | As this can take a long time, all resulting phrases are - together with their 63 | timestamps - saved as JSON into a transcription log file 64 | (the `audio` parameter path with suffix `.tlog` instead of `.wav`). 65 | Subsequent calls will look for that file and - if found - 66 | load it and skip the transcription phase. 67 | 68 | `--stt-model-dir <dir>` points DeepSpeech to the language-specific model data directory. 69 | It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it. 70 | 71 | ### Step 5 - Rough alignment 72 | 73 | The actual text alignment is based on a recursive divide-and-conquer approach: 74 | 75 | 1. Construct an ordered list of all phrases in the current interval 76 | (at the beginning this is the list of all phrases that are to be aligned), 77 | where long phrases close to the middle of the interval come first. 78 | 2. Iterate through the list and compute the best Smith-Waterman alignment 79 | (see the following sub-sections) with the document's original text... 80 | 3. ...until there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth 81 | dependent threshold (in most cases this should already be the first phrase). 82 | 4. Recursively continue with step 1 for the sub-intervals and original text ranges 83 | to the left and right of the phrase and its aligned text range within the original text. 84 | 5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum 85 | Smith-Waterman score on their recursion level (a simplified sketch of this recursion follows the list of advantages below). 86 | 87 | This approach assumes that all phrases were spoken in the same order as they appear in the 88 | original transcript. It has the following advantages compared to individual 89 | global phrase matching: 90 | 91 | - Long non-matching chunks of spoken text or the original transcript will automatically and 92 | cleanly get ignored. 93 | - Short phrases (with the risk of matching more than once per document) will automatically 94 | get aligned to their intended locations by longer ones which "squeeze" them in. 95 | - Smith-Waterman score thresholds can be kept lower 96 | (and thus better match lower-quality STT transcripts), as there is a lower chance for 97 | - long sequences to match at a wrong location and for 98 | - shorter sequences to match at a wrong location within their shortened intervals 99 | (as they get matched later and deeper in the recursion tree).
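The sketch below illustrates the recursion; `sw_align` and `threshold` are assumed callables standing in for the Smith-Waterman search and the depth-dependent threshold described above (this is an illustration, not the actual code in `align/align.py`):

```python
# Hypothetical sketch of the step 5 recursion - helper callables are assumed.
def align_interval(phrases, text_start, text_end, sw_align, threshold, depth=0):
    if not phrases:
        return []
    mid = (len(phrases) - 1) / 2
    # Simple stand-in for "long phrases close to the middle come first".
    order = sorted(range(len(phrases)), key=lambda i: (-len(phrases[i]), abs(i - mid)))
    for i in order:
        start, end, score = sw_align(phrases[i], text_start, text_end)
        if score > threshold(depth):  # (low) recursion-depth dependent threshold
            left = align_interval(phrases[:i], text_start, start, sw_align, threshold, depth + 1)
            right = align_interval(phrases[i + 1:], end, text_end, sw_align, threshold, depth + 1)
            return left + [(phrases[i], start, end, score)] + right
    return []  # nothing in this interval aligned well enough
```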
100 | 101 | #### Smith-Waterman candidate selection 102 | 103 | Finding the best match of a given phrase within the original (potentially long) transcript 104 | using vanilla Smith-Waterman is not feasible. 105 | 106 | So this tool follows a two-phase approach where the first goal is to get a list of alignment 107 | candidates. As the first step, the original text is virtually partitioned into windows of the 108 | same length as the search pattern. These windows are ordered descending by the number of 3-grams 109 | they share with the pattern. 110 | Best alignment candidates are then taken from the beginning of this ordered list. 111 | 112 | `--align-max-candidates <n>` sets the maximum number of candidate windows 113 | taken from the beginning of the list for further alignment. 114 | 115 | Multiplied by the number of 3-grams of the predecessor window, `--align-candidate-threshold <fraction>` 116 | gives the minimum number of 3-grams the next candidate window has to share with the pattern to also be 117 | considered a candidate. 118 | 119 | #### Smith-Waterman alignment 120 | 121 | For each candidate, the best possible alignment is computed using the 122 | [Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm 123 | within an extended interval of one window-size around the candidate window. 124 | 125 | `--align-match-score <score>` is the score per correctly matched character. Default: 100 126 | 127 | `--align-mismatch-score <score>` is the score per non-matching (exchanged) character. Default: -100 128 | 129 | `--align-gap-score <score>` is the score per character gap (removing 1 character from pattern or original). Default: -100 130 | 131 | The overall best score for the best match is normalized to a maximum of about 100 by dividing 132 | it by the maximum character count of either the match or the pattern. 133 | 134 | ### Step 6 - Gap alignment 135 | 136 | After recursive matching of fragments there are potential text leftovers between aligned original 137 | texts. 138 | 139 | Some examples: 140 | - Often: Missing (and therefore unaligned) STT transcripts of word endings (e.g. English past tense endings _-d_ and _-ed_) 141 | on phrase endings to the left of the gap 142 | - Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual 143 | alignments are now left unaligned in the gap 144 | - Big unmatched chunks of text, like 145 | - Preface, text summaries or any other kind of meta information 146 | - Copyright headers/footers 147 | - Table of contents 148 | - Chapter headers (if not spoken as they appear) 149 | - Captions of figures 150 | - Contents of tables 151 | - Line headers like character names in drama scripts 152 | - Depending on the (pre-processing) quality: OCR leftovers like 153 | - page headers 154 | - page numbers 155 | - reader's notes 156 | 157 | The basic challenge here is to figure out whether all or some of the gap text should be used to extend 158 | the phrase to the left and/or to the right of the gap. 159 | 160 | As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result, 161 | its score cannot be used for further fine-tuning.
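As a rough, hypothetical illustration of this extension search (not the tool's actual implementation), assuming a `similarity(a, b)` function like the n-gram based one in `align/text.py`:

```python
# Illustrative only: score every possible extension of an aligned text range
# into the gap and keep the best one. `similarity(a, b)` is assumed to return
# a value between 0.0 (completely different) and 1.0 (completely equal).
def best_extension(transcript, aligned_text, gap_text, stretch_factor, similarity):
    max_extra = min(int(len(aligned_text) * stretch_factor), len(gap_text))
    best_n, best_score = 0, similarity(transcript, aligned_text)
    for n in range(1, max_extra + 1):
        score = similarity(transcript, aligned_text + gap_text[:n])
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score  # number of gap characters to absorb
```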
The tool therefore provides a collection of 162 | so-called [text-distance metrics](metrics.md) to pick from via the `--align-similarity-algo <algorithm>` 163 | parameter. 164 | 165 | Using the selected distance metric, the gap alignment is done by looking for the best-scoring 166 | extension of the left and right phrases up to their maximum extension. 167 | 168 | `--align-stretch-factor <fraction>` is the maximum fraction of its text length by which a phrase 169 | may get stretched. 170 | 171 | For many languages it is worth putting some emphasis on matching to word boundaries 172 | (that is, whitespace-separated sub-sequences). 173 | 174 | `--align-snap-factor <factor>` controls the snappiness to word boundaries. 175 | 176 | If the best-scoring extensions overlap, the best-scoring sum of non-overlapping 177 | (but touching) extensions wins. 178 | 179 | ### Step 7 - Selection, filtering and output 180 | 181 | Finally the best alignment of all candidate windows is selected as the winner. 182 | It then has to survive a series of filters to get into the result file. 183 | 184 | For each text-distance metric there are two filter parameters: 185 | 186 | `--output-min-<METRIC-ID> <value>` only keeps utterances having the provided minimum value for the 187 | metric with id `METRIC-ID` 188 | 189 | `--output-max-<METRIC-ID> <value>` only keeps utterances having the provided maximum value for the 190 | metric with id `METRIC-ID` 191 | 192 | For each text-distance metric there is also the option to have it added to each utterance's entry: 193 | 194 | `--output-<METRIC-ID>` adds the computed value for `METRIC-ID` to the utterance's array entry 195 | 196 | Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%, 197 | where numbers >1.0 are theoretically possible). 198 | -------------------------------------------------------------------------------- /doc/export.md: -------------------------------------------------------------------------------- 1 | ## Export 2 | 3 | After files have been successfully aligned, one will likely want to export the aligned utterances 4 | as machine-learning training samples. 5 | 6 | This is where the export tool `bin/export.sh` comes in. 7 | 8 | ### Step 1 - Reading the input 9 | 10 | The exporter takes either a single audio file (`--audio