├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── align ├── align.py ├── audio.py ├── catalog_tool.py ├── export.py ├── generate_lm.py ├── generate_package.py ├── meta.py ├── sample_collections.py ├── sdb_tool.py ├── search.py ├── stats.py ├── text.py └── utils.py ├── bin ├── align.sh ├── catalog_tool.sh ├── createenv.sh ├── export.sh ├── getmodel.sh ├── gettestdata.sh ├── lm-dependencies.sh ├── meta.sh ├── play2script.py ├── sdb_tool.sh ├── statistics.sh └── taskcluster.py ├── data ├── all-wav.catalog ├── all.catalog ├── test1.catalog └── test2.catalog ├── doc ├── algo.md ├── export.md ├── files.md ├── lm.md ├── metrics.md └── tools.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | models 2 | dependencies 3 | data/test* 4 | data/export 5 | 6 | .idea 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | MANIFEST 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *.cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | target/ 78 | 79 | # Jupyter Notebook 80 | .ipynb_checkpoints 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 | 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Community Participation Guidelines 2 | 3 | This repository is governed by Mozilla's code of conduct and etiquette guidelines. 4 | For more details, please read the 5 | [Mozilla Community Participation Guidelines](https://www.mozilla.org/about/governance/policies/participation/). 6 | 7 | ## How to Report 8 | For more information on how to report violations of the Community Participation Guidelines, please read our '[How to Report](https://www.mozilla.org/about/governance/policies/participation/reporting/)' page. 9 | 10 | 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Mozilla Public License Version 2.0 2 | ================================== 3 | 4 | 1. 
Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. "Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor's Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. "Incompatible With Secondary Licenses" 25 | means 26 | 27 | (a) that the initial Contributor has attached the notice described 28 | in Exhibit B to the Covered Software; or 29 | 30 | (b) that the Covered Software was made available under the terms of 31 | version 1.1 or earlier of the License, but not also under the 32 | terms of a Secondary License. 33 | 34 | 1.6. "Executable Form" 35 | means any form of the work other than Source Code Form. 36 | 37 | 1.7. "Larger Work" 38 | means a work that combines Covered Software with other material, in 39 | a separate file or files, that is not Covered Software. 40 | 41 | 1.8. "License" 42 | means this document. 43 | 44 | 1.9. "Licensable" 45 | means having the right to grant, to the maximum extent possible, 46 | whether at the time of the initial grant or subsequently, any and 47 | all of the rights conveyed by this License. 48 | 49 | 1.10. "Modifications" 50 | means any of the following: 51 | 52 | (a) any file in Source Code Form that results from an addition to, 53 | deletion from, or modification of the contents of Covered 54 | Software; or 55 | 56 | (b) any new file in Source Code Form that contains any Covered 57 | Software. 58 | 59 | 1.11. "Patent Claims" of a Contributor 60 | means any patent claim(s), including without limitation, method, 61 | process, and apparatus claims, in any patent Licensable by such 62 | Contributor that would be infringed, but for the grant of the 63 | License, by the making, using, selling, offering for sale, having 64 | made, import, or transfer of either its Contributions or its 65 | Contributor Version. 66 | 67 | 1.12. "Secondary License" 68 | means either the GNU General Public License, Version 2.0, the GNU 69 | Lesser General Public License, Version 2.1, the GNU Affero General 70 | Public License, Version 3.0, or any later versions of those 71 | licenses. 72 | 73 | 1.13. "Source Code Form" 74 | means the form of the work preferred for making modifications. 75 | 76 | 1.14. "You" (or "Your") 77 | means an individual or a legal entity exercising rights under this 78 | License. For legal entities, "You" includes any entity that 79 | controls, is controlled by, or is under common control with You. For 80 | purposes of this definition, "control" means (a) the power, direct 81 | or indirect, to cause the direction or management of such entity, 82 | whether by contract or otherwise, or (b) ownership of more than 83 | fifty percent (50%) of the outstanding shares or beneficial 84 | ownership of such entity. 85 | 86 | 2. License Grants and Conditions 87 | -------------------------------- 88 | 89 | 2.1. 
Grants 90 | 91 | Each Contributor hereby grants You a world-wide, royalty-free, 92 | non-exclusive license: 93 | 94 | (a) under intellectual property rights (other than patent or trademark) 95 | Licensable by such Contributor to use, reproduce, make available, 96 | modify, display, perform, distribute, and otherwise exploit its 97 | Contributions, either on an unmodified basis, with Modifications, or 98 | as part of a Larger Work; and 99 | 100 | (b) under Patent Claims of such Contributor to make, use, sell, offer 101 | for sale, have made, import, and otherwise transfer either its 102 | Contributions or its Contributor Version. 103 | 104 | 2.2. Effective Date 105 | 106 | The licenses granted in Section 2.1 with respect to any Contribution 107 | become effective for each Contribution on the date the Contributor first 108 | distributes such Contribution. 109 | 110 | 2.3. Limitations on Grant Scope 111 | 112 | The licenses granted in this Section 2 are the only rights granted under 113 | this License. No additional rights or licenses will be implied from the 114 | distribution or licensing of Covered Software under this License. 115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a 116 | Contributor: 117 | 118 | (a) for any code that a Contributor has removed from Covered Software; 119 | or 120 | 121 | (b) for infringements caused by: (i) Your and any other third party's 122 | modifications of Covered Software, or (ii) the combination of its 123 | Contributions with other software (except as part of its Contributor 124 | Version); or 125 | 126 | (c) under Patent Claims infringed by Covered Software in the absence of 127 | its Contributions. 128 | 129 | This License does not grant any rights in the trademarks, service marks, 130 | or logos of any Contributor (except as may be necessary to comply with 131 | the notice requirements in Section 3.4). 132 | 133 | 2.4. Subsequent Licenses 134 | 135 | No Contributor makes additional grants as a result of Your choice to 136 | distribute the Covered Software under a subsequent version of this 137 | License (see Section 10.2) or under the terms of a Secondary License (if 138 | permitted under the terms of Section 3.3). 139 | 140 | 2.5. Representation 141 | 142 | Each Contributor represents that the Contributor believes its 143 | Contributions are its original creation(s) or it has sufficient rights 144 | to grant the rights to its Contributions conveyed by this License. 145 | 146 | 2.6. Fair Use 147 | 148 | This License is not intended to limit any rights You have under 149 | applicable copyright doctrines of fair use, fair dealing, or other 150 | equivalents. 151 | 152 | 2.7. Conditions 153 | 154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 155 | in Section 2.1. 156 | 157 | 3. Responsibilities 158 | ------------------- 159 | 160 | 3.1. Distribution of Source Form 161 | 162 | All distribution of Covered Software in Source Code Form, including any 163 | Modifications that You create or to which You contribute, must be under 164 | the terms of this License. You must inform recipients that the Source 165 | Code Form of the Covered Software is governed by the terms of this 166 | License, and how they can obtain a copy of this License. You may not 167 | attempt to alter or restrict the recipients' rights in the Source Code 168 | Form. 169 | 170 | 3.2. 
Distribution of Executable Form 171 | 172 | If You distribute Covered Software in Executable Form then: 173 | 174 | (a) such Covered Software must also be made available in Source Code 175 | Form, as described in Section 3.1, and You must inform recipients of 176 | the Executable Form how they can obtain a copy of such Source Code 177 | Form by reasonable means in a timely manner, at a charge no more 178 | than the cost of distribution to the recipient; and 179 | 180 | (b) You may distribute such Executable Form under the terms of this 181 | License, or sublicense it under different terms, provided that the 182 | license for the Executable Form does not attempt to limit or alter 183 | the recipients' rights in the Source Code Form under this License. 184 | 185 | 3.3. Distribution of a Larger Work 186 | 187 | You may create and distribute a Larger Work under terms of Your choice, 188 | provided that You also comply with the requirements of this License for 189 | the Covered Software. If the Larger Work is a combination of Covered 190 | Software with a work governed by one or more Secondary Licenses, and the 191 | Covered Software is not Incompatible With Secondary Licenses, this 192 | License permits You to additionally distribute such Covered Software 193 | under the terms of such Secondary License(s), so that the recipient of 194 | the Larger Work may, at their option, further distribute the Covered 195 | Software under the terms of either this License or such Secondary 196 | License(s). 197 | 198 | 3.4. Notices 199 | 200 | You may not remove or alter the substance of any license notices 201 | (including copyright notices, patent notices, disclaimers of warranty, 202 | or limitations of liability) contained within the Source Code Form of 203 | the Covered Software, except that You may alter any license notices to 204 | the extent required to remedy known factual inaccuracies. 205 | 206 | 3.5. Application of Additional Terms 207 | 208 | You may choose to offer, and to charge a fee for, warranty, support, 209 | indemnity or liability obligations to one or more recipients of Covered 210 | Software. However, You may do so only on Your own behalf, and not on 211 | behalf of any Contributor. You must make it absolutely clear that any 212 | such warranty, support, indemnity, or liability obligation is offered by 213 | You alone, and You hereby agree to indemnify every Contributor for any 214 | liability incurred by such Contributor as a result of warranty, support, 215 | indemnity or liability terms You offer. You may include additional 216 | disclaimers of warranty and limitations of liability specific to any 217 | jurisdiction. 218 | 219 | 4. Inability to Comply Due to Statute or Regulation 220 | --------------------------------------------------- 221 | 222 | If it is impossible for You to comply with any of the terms of this 223 | License with respect to some or all of the Covered Software due to 224 | statute, judicial order, or regulation then You must: (a) comply with 225 | the terms of this License to the maximum extent possible; and (b) 226 | describe the limitations and the code they affect. Such description must 227 | be placed in a text file included with all distributions of the Covered 228 | Software under this License. Except to the extent prohibited by statute 229 | or regulation, such description must be sufficiently detailed for a 230 | recipient of ordinary skill to be able to understand it. 231 | 232 | 5. Termination 233 | -------------- 234 | 235 | 5.1. 
The rights granted under this License will terminate automatically 236 | if You fail to comply with any of its terms. However, if You become 237 | compliant, then the rights granted under this License from a particular 238 | Contributor are reinstated (a) provisionally, unless and until such 239 | Contributor explicitly and finally terminates Your grants, and (b) on an 240 | ongoing basis, if such Contributor fails to notify You of the 241 | non-compliance by some reasonable means prior to 60 days after You have 242 | come back into compliance. Moreover, Your grants from a particular 243 | Contributor are reinstated on an ongoing basis if such Contributor 244 | notifies You of the non-compliance by some reasonable means, this is the 245 | first time You have received notice of non-compliance with this License 246 | from such Contributor, and You become compliant prior to 30 days after 247 | Your receipt of the notice. 248 | 249 | 5.2. If You initiate litigation against any entity by asserting a patent 250 | infringement claim (excluding declaratory judgment actions, 251 | counter-claims, and cross-claims) alleging that a Contributor Version 252 | directly or indirectly infringes any patent, then the rights granted to 253 | You by any and all Contributors for the Covered Software under Section 254 | 2.1 of this License shall terminate. 255 | 256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all 257 | end user license agreements (excluding distributors and resellers) which 258 | have been validly granted by You or Your distributors under this License 259 | prior to termination shall survive termination. 260 | 261 | ************************************************************************ 262 | * * 263 | * 6. Disclaimer of Warranty * 264 | * ------------------------- * 265 | * * 266 | * Covered Software is provided under this License on an "as is" * 267 | * basis, without warranty of any kind, either expressed, implied, or * 268 | * statutory, including, without limitation, warranties that the * 269 | * Covered Software is free of defects, merchantable, fit for a * 270 | * particular purpose or non-infringing. The entire risk as to the * 271 | * quality and performance of the Covered Software is with You. * 272 | * Should any Covered Software prove defective in any respect, You * 273 | * (not any Contributor) assume the cost of any necessary servicing, * 274 | * repair, or correction. This disclaimer of warranty constitutes an * 275 | * essential part of this License. No use of any Covered Software is * 276 | * authorized under this License except under this disclaimer. * 277 | * * 278 | ************************************************************************ 279 | 280 | ************************************************************************ 281 | * * 282 | * 7. Limitation of Liability * 283 | * -------------------------- * 284 | * * 285 | * Under no circumstances and under no legal theory, whether tort * 286 | * (including negligence), contract, or otherwise, shall any * 287 | * Contributor, or anyone who distributes Covered Software as * 288 | * permitted above, be liable to You for any direct, indirect, * 289 | * special, incidental, or consequential damages of any character * 290 | * including, without limitation, damages for lost profits, loss of * 291 | * goodwill, work stoppage, computer failure or malfunction, or any * 292 | * and all other commercial damages or losses, even if such party * 293 | * shall have been informed of the possibility of such damages. 
This * 294 | * limitation of liability shall not apply to liability for death or * 295 | * personal injury resulting from such party's negligence to the * 296 | * extent applicable law prohibits such limitation. Some * 297 | * jurisdictions do not allow the exclusion or limitation of * 298 | * incidental or consequential damages, so this exclusion and * 299 | * limitation may not apply to You. * 300 | * * 301 | ************************************************************************ 302 | 303 | 8. Litigation 304 | ------------- 305 | 306 | Any litigation relating to this License may be brought only in the 307 | courts of a jurisdiction where the defendant maintains its principal 308 | place of business and such litigation shall be governed by laws of that 309 | jurisdiction, without reference to its conflict-of-law provisions. 310 | Nothing in this Section shall prevent a party's ability to bring 311 | cross-claims or counter-claims. 312 | 313 | 9. Miscellaneous 314 | ---------------- 315 | 316 | This License represents the complete agreement concerning the subject 317 | matter hereof. If any provision of this License is held to be 318 | unenforceable, such provision shall be reformed only to the extent 319 | necessary to make it enforceable. Any law or regulation which provides 320 | that the language of a contract shall be construed against the drafter 321 | shall not be used to construe this License against a Contributor. 322 | 323 | 10. Versions of the License 324 | --------------------------- 325 | 326 | 10.1. New Versions 327 | 328 | Mozilla Foundation is the license steward. Except as provided in Section 329 | 10.3, no one other than the license steward has the right to modify or 330 | publish new versions of this License. Each version will be given a 331 | distinguishing version number. 332 | 333 | 10.2. Effect of New Versions 334 | 335 | You may distribute the Covered Software under the terms of the version 336 | of the License under which You originally received the Covered Software, 337 | or under the terms of any subsequent version published by the license 338 | steward. 339 | 340 | 10.3. Modified Versions 341 | 342 | If you create software not governed by this License, and you want to 343 | create a new license for such software, you may create and use a 344 | modified version of this License if you rename the license and remove 345 | any references to the name of the license steward (except to note that 346 | such modified license differs from this License). 347 | 348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary 349 | Licenses 350 | 351 | If You choose to distribute Source Code Form that is Incompatible With 352 | Secondary Licenses under the terms of this version of the License, the 353 | notice described in Exhibit B of this License must be attached. 354 | 355 | Exhibit A - Source Code Form License Notice 356 | ------------------------------------------- 357 | 358 | This Source Code Form is subject to the terms of the Mozilla Public 359 | License, v. 2.0. If a copy of the MPL was not distributed with this 360 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 361 | 362 | If it is not possible or desirable to put the notice in a particular 363 | file, then You may include the notice in a location (such as a LICENSE 364 | file in a relevant directory) where a recipient would be likely to look 365 | for such a notice. 366 | 367 | You may add additional accurate notices of copyright ownership. 
368 | 369 | Exhibit B - "Incompatible With Secondary Licenses" Notice 370 | --------------------------------------------------------- 371 | 372 | This Source Code Form is "Incompatible With Secondary Licenses", as 373 | defined by the Mozilla Public License, v. 2.0. 374 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DSAlign 2 | DeepSpeech based forced alignment tool 3 | 4 | ## Installation 5 | 6 | It is recommended to use this tool from within a virtual environment. 7 | After cloning and changing to the root of the project, 8 | there is a script for creating one with all requirements in the git-ignored dir `venv`: 9 | 10 | ```shell script 11 | $ bin/createenv.sh 12 | $ ls venv 13 | bin include lib lib64 pyvenv.cfg share 14 | ``` 15 | 16 | `bin/align.sh` will automatically use it. 17 | 18 | Internally DSAlign uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT engine. 19 | To function, it requires a couple of files that are specific to 20 | the language of the speech data you want to align. 21 | If you want to align English, there is already a helper script that will download and prepare 22 | all required data: 23 | 24 | ```shell script 25 | $ bin/getmodel.sh 26 | [...] 27 | $ ls models/en/ 28 | alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie 29 | ``` 30 | 31 | ## Overview and documentation 32 | 33 | A typical application of the aligner is done in three phases: 34 | 35 | 1. __Preparing__ the data. Although most of this has to be done individually, 36 | there are some [tools for data preparation, statistics and maintenance](doc/tools.md). 37 | All involved file formats are described [here](doc/files.md). 38 | 2. __Aligning__ the data using [the alignment tool and its algorithm](doc/algo.md). 39 | 3. __Exporting__ aligned data using [the data-set exporter](doc/export.md). 40 | 41 | ## Quickstart example 42 | 43 | ### Example data 44 | 45 | There is a script for downloading and preparing some public domain speech and transcript data. 46 | It requires `ffmpeg` for some sample conversion. 47 | 48 | ```shell script 49 | $ bin/gettestdata.sh 50 | $ ls data 51 | test1 test2 52 | ``` 53 | 54 | ### Alignment using example data 55 | 56 | Now the aligner can be called either "manually" (specifying all involved files directly): 57 | 58 | ```shell script 59 | $ bin/align.sh --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log 60 | ``` 61 | 62 | Or "automatically" by specifying a so-called catalog file that bundles all involved paths: 63 | 64 | ```shell script 65 | $ bin/align.sh --catalog data/test1.catalog 66 | ```
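A catalog file is a JSON array of entries whose `audio`, `tlog`, `script` and `aligned` paths are resolved relative to the catalog file's own directory (see `main()` in `align/align.py`). As a minimal sketch - the entry values here are illustrative, not copied from the repository's `data/test1.catalog`:

```json
[
    {
        "audio": "test1/audio.wav",
        "tlog": "test1/transcript.log",
        "script": "test1/transcript.txt",
        "aligned": "test1/aligned.json"
    }
]
```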
67 | -------------------------------------------------------------------------------- /align/align.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import logging 4 | import argparse 5 | import deepspeech 6 | import subprocess 7 | import os.path as path 8 | import numpy as np 9 | import textdistance 10 | import multiprocessing 11 | from collections import Counter 12 | from search import FuzzySearch 13 | from glob import glob 14 | from text import Alphabet, TextCleaner, levenshtein, similarity 15 | from utils import enweight, log_progress 16 | from audio import DEFAULT_RATE, read_frames_from_file, vad_split 17 | from generate_lm import convert_and_filter_topk, build_lm 18 | from generate_package import create_bundle 19 | 20 | BEAM_WIDTH = 500 21 | LM_ALPHA = 1 22 | LM_BETA = 1.85 23 | 24 | ALGORITHMS = ['WNG', 'jaro_winkler', 'editex', 'levenshtein', 'mra', 'hamming'] 25 | SIM_DESC = 'From 0.0 (not equal at all) to 100.0 (totally equal)' 26 | NAMED_NUMBERS = { 27 | 'tlen': ('transcript length', int, None), 28 | 'mlen': ('match length', int, None), 29 | 'SWS': ('Smith-Waterman score', float, 'From 0.0 (not equal at all) to 100.0+ (pretty equal)'), 30 | 'WNG': ('weighted N-gram similarity', float, SIM_DESC), 31 | 'jaro_winkler': ('Jaro-Winkler similarity', float, SIM_DESC), 32 | 'editex': ('Editex similarity', float, SIM_DESC), 33 | 'levenshtein': ('Levenshtein similarity', float, SIM_DESC), 34 | 'mra': ('MRA similarity', float, SIM_DESC), 35 | 'hamming': ('Hamming similarity', float, SIM_DESC), 36 | 'CER': ('character error rate', float, 'From 0.0 (no wrong characters) to 100.0+ (total miss)'), 37 | 'WER': ('word error rate', float, 'From 0.0 (no different words) to 100.0+ (total miss)') 38 | } 39 | 40 | 41 | def fail(message, code=1): 42 | logging.fatal(message) 43 | exit(code) 44 | 45 | 46 | def read_script(script_path): 47 | tc = TextCleaner(alphabet, 48 | dashes_to_ws=not args.text_keep_dashes, 49 | normalize_space=not args.text_keep_ws, 50 | to_lower=not args.text_keep_casing) 51 | with open(script_path, 'r', encoding='utf-8') as script_file: 52 | content = script_file.read() 53 | if script_path.endswith('.script'): 54 | for phrase in json.loads(content): 55 | tc.add_original_text(phrase['text'], meta=phrase) 56 | elif args.text_meaningful_newlines: 57 | for phrase in content.split('\n'): 58 | tc.add_original_text(phrase) 59 | else: 60 | tc.add_original_text(content) 61 | return tc 62 | 63 | 64 | model = None 65 | 66 | def init_stt(output_graph_path, scorer_path): 67 | global model 68 | model = deepspeech.Model(output_graph_path) 69 | model.enableExternalScorer(scorer_path) 70 | logging.debug('Process {}: Loaded models'.format(os.getpid())) 71 | 72 | 73 | def stt(sample): 74 | time_start, time_end, audio = sample 75 | logging.debug('Process {}: Transcribing...'.format(os.getpid())) 76 | transcript = model.stt(audio) 77 |
logging.debug('Process {}: {}'.format(os.getpid(), transcript)) 78 | return time_start, time_end, ' '.join(transcript.split()) 79 | 80 | 81 | def align(triple): 82 | tlog, script, aligned = triple 83 | 84 | logging.debug("Loading script from %s..." % script) 85 | tc = read_script(script) 86 | search = FuzzySearch(tc.clean_text, 87 | max_candidates=args.align_max_candidates, 88 | candidate_threshold=args.align_candidate_threshold, 89 | match_score=args.align_match_score, 90 | mismatch_score=args.align_mismatch_score, 91 | gap_score=args.align_gap_score) 92 | 93 | logging.debug("Loading transcription log from %s..." % tlog) 94 | with open(tlog, 'r', encoding='utf-8') as transcription_log_file: 95 | fragments = json.load(transcription_log_file) 96 | end_fragments = (args.start + args.num_samples) if args.num_samples else len(fragments) 97 | fragments = fragments[args.start:end_fragments] 98 | for index, fragment in enumerate(fragments): 99 | meta = {} 100 | for key, value in list(fragment.items()): 101 | if key not in ['start', 'end', 'transcript']: 102 | meta[key] = value 103 | del fragment[key] 104 | fragment['meta'] = meta 105 | fragment['index'] = index 106 | fragment['transcript'] = fragment['transcript'].strip() 107 | 108 | reasons = Counter() 109 | 110 | def skip(index, reason): 111 | logging.info('Fragment {}: {}'.format(index, reason)) 112 | reasons[reason] += 1 113 | 114 | def split_match(fragments, start=0, end=-1): 115 | n = len(fragments) 116 | if n < 1: 117 | return 118 | elif n == 1: 119 | weighted_fragments = [(0, fragments[0])] 120 | else: 121 | # so we later know the original index of each fragment 122 | weighted_fragments = enumerate(fragments) 123 | # assigns high values to long statements near the center of the list 124 | weighted_fragments = enweight(weighted_fragments) 125 | weighted_fragments = map(lambda fw: (fw[0], (1 - fw[1]) * len(fw[0][1]['transcript'])), weighted_fragments) 126 | # fragments with highest weights first 127 | weighted_fragments = sorted(weighted_fragments, key=lambda fw: fw[1], reverse=True) 128 | # strip weights 129 | weighted_fragments = list(map(lambda fw: fw[0], weighted_fragments)) 130 | for index, fragment in weighted_fragments: 131 | match = search.find_best(fragment['transcript'], start=start, end=end) 132 | match_start, match_end, sws_score, match_substitutions = match 133 | if sws_score > (n - 1) / (2 * n): 134 | fragment['match-start'] = match_start 135 | fragment['match-end'] = match_end 136 | fragment['sws'] = sws_score 137 | fragment['substitutions'] = match_substitutions 138 | for f in split_match(fragments[0:index], start=start, end=match_start): 139 | yield f 140 | yield fragment 141 | for f in split_match(fragments[index + 1:], start=match_end, end=end): 142 | yield f 143 | return 144 | for _, _ in weighted_fragments: 145 | yield None 146 | 147 | matched_fragments = split_match(fragments) 148 | matched_fragments = list(filter(lambda f: f is not None, matched_fragments)) 149 | 150 | similarity_algos = {} 151 | 152 | def phrase_similarity(algo, a, b): 153 | if algo in similarity_algos: 154 | return similarity_algos[algo](a, b) 155 | algo_impl = lambda aa, bb: None 156 | if algo.lower() == 'wng': 157 | algo_impl = similarity_algos[algo] = lambda aa, bb: similarity( 158 | aa, 159 | bb, 160 | direction=1, 161 | min_ngram_size=args.align_wng_min_size, 162 | max_ngram_size=args.align_wng_max_size, 163 | size_factor=args.align_wng_size_factor, 164 | position_factor=args.align_wng_position_factor) 165 | elif algo in ALGORITHMS: 166 | 
algo_impl = similarity_algos[algo] = getattr(textdistance, algo).normalized_similarity 167 | else: 168 | logging.fatal('Unknown similarity metric "{}"'.format(algo)) 169 | exit(1) 170 | return algo_impl(a, b) 171 | 172 | def get_similarities(a, b, n, gap_text, gap_meta, direction): 173 | if direction < 0: 174 | a, b, gap_text, gap_meta = a[::-1], b[::-1], gap_text[::-1], gap_meta[::-1] 175 | similarities = list(map( 176 | lambda i: (args.align_word_snap_factor if gap_text[i + 1] == ' ' else 1) * 177 | (args.align_phrase_snap_factor if gap_meta[i + 1] is None else 1) * 178 | (phrase_similarity(args.align_similarity_algo, a, b + gap_text[1:i + 1])), 179 | range(n))) 180 | best = max((v, i) for i, v in enumerate(similarities))[1] if n > 0 else 0 181 | return best, similarities 182 | 183 | for index in range(len(matched_fragments) + 1): 184 | if index > 0: 185 | a = matched_fragments[index - 1] 186 | a_start, a_end = a['match-start'], a['match-end'] 187 | a_len = a_end - a_start 188 | a_stretch = int(a_len * args.align_stretch_fraction) 189 | a_shrink = int(a_len * args.align_shrink_fraction) 190 | a_end = a_end - a_shrink 191 | a_ext = a_shrink + a_stretch 192 | else: 193 | a = None 194 | a_start = a_end = 0 195 | if index < len(matched_fragments): 196 | b = matched_fragments[index] 197 | b_start, b_end = b['match-start'], b['match-end'] 198 | b_len = b_end - b_start 199 | b_stretch = int(b_len * args.align_stretch_fraction) 200 | b_shrink = int(b_len * args.align_shrink_fraction) 201 | b_start = b_start + b_shrink 202 | b_ext = b_shrink + b_stretch 203 | else: 204 | b = None 205 | b_start = b_end = len(search.text) 206 | 207 | assert a_end <= b_start 208 | assert a_start <= a_end 209 | assert b_start <= b_end 210 | if a_end == b_start or a_start == a_end or b_start == b_end: 211 | continue 212 | gap_text = tc.clean_text[a_end - 1:b_start + 1] 213 | gap_meta = tc.meta[a_end - 1:b_start + 1] 214 | 215 | if a: 216 | a_best_index, a_similarities = get_similarities(a['transcript'], 217 | tc.clean_text[a_start:a_end], 218 | min(len(gap_text) - 1, a_ext), 219 | gap_text, 220 | gap_meta, 221 | 1) 222 | a_best_end = a_best_index + a_end 223 | if b: 224 | b_best_index, b_similarities = get_similarities(b['transcript'], 225 | tc.clean_text[b_start:b_end], 226 | min(len(gap_text) - 1, b_ext), 227 | gap_text, 228 | gap_meta, 229 | -1) 230 | b_best_start = b_start - b_best_index 231 | 232 | if a and b and a_best_end > b_best_start: 233 | overlap_start = b_start - len(b_similarities) 234 | a_similarities = a_similarities[overlap_start - a_end:] 235 | b_similarities = b_similarities[:len(a_similarities)] 236 | best_index = max((sum(v), i) for i, v in enumerate(zip(a_similarities, b_similarities)))[1] 237 | a_best_end = b_best_start = overlap_start + best_index 238 | 239 | if a: 240 | a['match-end'] = a_best_end 241 | if b: 242 | b['match-start'] = b_best_start 243 | 244 | def apply_number(number_key, index, fragment, show, get_value): 245 | kl = number_key.lower() 246 | should_output = getattr(args, 'output_' + kl) 247 | min_val, max_val = getattr(args, 'output_min_' + kl), getattr(args, 'output_max_' + kl) 248 | if kl.endswith('len') and min_val is None: 249 | min_val = 1 250 | if should_output or min_val or max_val: 251 | val = get_value() 252 | if not kl.endswith('len'): 253 | show.insert(0, '{}: {:.2f}'.format(number_key, val)) 254 | if should_output: 255 | fragment[kl] = val 256 | reason_base = '{} ({})'.format(NAMED_NUMBERS[number_key][0], number_key) 257 | reason = None 258 | if min_val and val < 
min_val: 259 | reason = reason_base + ' too low' 260 | elif max_val and val > max_val: 261 | reason = reason_base + ' too high' 262 | if reason: 263 | skip(index, reason) 264 | return True 265 | return False 266 | 267 | substitutions = Counter() 268 | result_fragments = [] 269 | for fragment in matched_fragments: 270 | index = fragment['index'] 271 | time_start = fragment['start'] 272 | time_end = fragment['end'] 273 | fragment_transcript = fragment['transcript'] 274 | result_fragment = { 275 | 'start': time_start, 276 | 'end': time_end 277 | } 278 | sample_numbers = [] 279 | 280 | if apply_number('tlen', index, result_fragment, sample_numbers, lambda: len(fragment_transcript)): 281 | continue 282 | result_fragment['transcript'] = fragment_transcript 283 | 284 | if 'match-start' not in fragment or 'match-end' not in fragment: 285 | skip(index, 'No match for transcript') 286 | continue 287 | match_start, match_end = fragment['match-start'], fragment['match-end'] 288 | if match_end - match_start <= 0: 289 | skip(index, 'Empty match for transcript') 290 | continue 291 | original_start = tc.get_original_offset(match_start) 292 | original_end = tc.get_original_offset(match_end) 293 | result_fragment['text-start'] = original_start 294 | result_fragment['text-end'] = original_end 295 | 296 | meta_dict = {} 297 | for meta in list(tc.collect_meta(match_start, match_end)) + [fragment['meta']]: 298 | for key, value in meta.items(): 299 | if key == 'text': 300 | continue 301 | if key in meta_dict: 302 | values = meta_dict[key] 303 | else: 304 | values = meta_dict[key] = [] 305 | if value not in values: 306 | values.append(value) 307 | result_fragment['meta'] = meta_dict 308 | 309 | result_fragment['aligned-raw'] = tc.original_text[original_start:original_end] 310 | 311 | fragment_matched = tc.clean_text[match_start:match_end] 312 | if apply_number('mlen', index, result_fragment, sample_numbers, lambda: len(fragment_matched)): 313 | continue 314 | result_fragment['aligned'] = fragment_matched 315 | 316 | if apply_number('SWS', index, result_fragment, sample_numbers, lambda: 100 * fragment['sws']): 317 | continue 318 | 319 | should_skip = False 320 | for algo in ALGORITHMS: 321 | should_skip = should_skip or apply_number(algo, index, result_fragment, sample_numbers, 322 | lambda: 100 * phrase_similarity(algo, 323 | fragment_matched, 324 | fragment_transcript)) 325 | if should_skip: 326 | continue 327 | 328 | if apply_number('CER', index, result_fragment, sample_numbers, 329 | lambda: 100 * levenshtein(fragment_transcript, fragment_matched) / 330 | len(fragment_matched)): 331 | continue 332 | 333 | if apply_number('WER', index, result_fragment, sample_numbers, 334 | lambda: 100 * levenshtein(fragment_transcript.split(), fragment_matched.split()) / 335 | len(fragment_matched.split())): 336 | continue 337 | 338 | substitutions += fragment['substitutions'] 339 | 340 | result_fragments.append(result_fragment) 341 | logging.debug('Fragment %d aligned with %s' % (index, ' '.join(sample_numbers))) 342 | logging.debug('- T: ' + args.text_context * ' ' + '"%s"' % fragment_transcript) 343 | logging.debug('- O: %s|%s|%s' % ( 344 | tc.clean_text[match_start - args.text_context:match_start], 345 | fragment_matched, 346 | tc.clean_text[match_end:match_end + args.text_context])) 347 | if args.play: 348 | subprocess.check_call(['play', 349 | '--no-show-progress', 350 | args.audio, 351 | 'trim', 352 | str(time_start / 1000.0), 353 | '=' + str(time_end / 1000.0)]) 354 | with open(aligned, 'w', encoding='utf-8') as 
result_file: 355 | result_file.write(json.dumps(result_fragments, indent=4 if args.output_pretty else None, ensure_ascii=False)) 356 | return aligned, len(result_fragments), len(fragments) - len(result_fragments), reasons 357 | 358 | 359 | def main(): 360 | # Debug helpers 361 | logging.basicConfig() 362 | logging.root.setLevel(args.loglevel if args.loglevel else 20) 363 | 364 | def progress(it=None, desc='Processing', total=None): 365 | logging.info(desc) 366 | return it if args.no_progress else log_progress(it, interval=args.progress_interval, total=total) 367 | 368 | def resolve(base_path, spec_path): 369 | if spec_path is None: 370 | return None 371 | if not path.isabs(spec_path): 372 | spec_path = path.join(base_path, spec_path) 373 | return spec_path 374 | 375 | def exists(file_path): 376 | if file_path is None: 377 | return False 378 | return os.path.isfile(file_path) 379 | 380 | to_prepare = [] 381 | 382 | def enqueue_or_fail(audio, tlog, script, aligned, prefix=''): 383 | if exists(aligned) and not args.force: 384 | fail(prefix + 'Alignment file "{}" already exists - use --force to overwrite'.format(aligned)) 385 | if tlog is None: 386 | if args.ignore_missing: 387 | return 388 | fail(prefix + 'Missing transcription log path') 389 | if not exists(audio) and not exists(tlog): 390 | if args.ignore_missing: 391 | return 392 | fail(prefix + 'Both audio file "{}" and transcription log "{}" are missing'.format(audio, tlog)) 393 | if not exists(script): 394 | if args.ignore_missing: 395 | return 396 | fail(prefix + 'Missing script "{}"'.format(script)) 397 | to_prepare.append((audio, tlog, script, aligned)) 398 | 399 | if (args.audio or args.tlog) and args.script and args.aligned and not args.catalog: 400 | enqueue_or_fail(args.audio, args.tlog, args.script, args.aligned) 401 | elif args.catalog: 402 | if not exists(args.catalog): 403 | fail('Unable to load catalog file "{}"'.format(args.catalog)) 404 | catalog = path.abspath(args.catalog) 405 | catalog_dir = path.dirname(catalog) 406 | with open(catalog, 'r', encoding='utf-8') as catalog_file: 407 | catalog_entries = json.load(catalog_file) 408 | for entry in progress(catalog_entries, desc='Reading catalog'): 409 | enqueue_or_fail(resolve(catalog_dir, entry['audio']), 410 | resolve(catalog_dir, entry['tlog']), 411 | resolve(catalog_dir, entry['script']), 412 | resolve(catalog_dir, entry['aligned']), 413 | prefix='Problem loading catalog "{}" - '.format(catalog)) 414 | else: 415 | fail('You have to specify either a combination of "--audio/--tlog,--script,--aligned" or "--catalog"') 416 | 417 | logging.debug('Start') 418 | 419 | to_align = [] 420 | output_graph_path = None 421 | for audio_path, tlog_path, script_path, aligned_path in to_prepare: 422 | if not exists(tlog_path): 423 | generated_scorer = False 424 | if output_graph_path is None: 425 | logging.debug('Looking for model files in "{}"...'.format(model_dir)) 426 | output_graph_path = glob(model_dir + "/*.pbmm")[0] 427 | lang_scorer_path = glob(model_dir + "/*.scorer")[0] 428 | kenlm_path = 'dependencies/kenlm/build/bin' 429 | if not path.exists(kenlm_path): 430 | kenlm_path = None 431 | deepspeech_path = 'dependencies/deepspeech' 432 | if not path.exists(deepspeech_path): 433 | deepspeech_path = None 434 | if kenlm_path and deepspeech_path and not args.stt_no_own_lm: 435 | tc = read_script(script_path) 436 | if not tc.clean_text.strip(): 437 | logging.error('Cleaned transcript is empty for {}'.format(path.basename(script_path))) 438 | continue
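# Summary of the per-document scorer generation below: the cleaned script text
# is written to a temporary file, convert_and_filter_topk() keeps its 500000
# most frequent words, build_lm() builds a pruned 5-gram KenLM model from them,
# and create_bundle() packages lm.binary plus the vocabulary into a .scorer
# (the two float constants are presumably the scorer's default LM alpha/beta
# weights); all intermediate files are removed again afterwards.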
439 | clean_text_path = script_path + '.clean' 440 | with open(clean_text_path, 'w', encoding='utf-8') as clean_text_file: 441 | clean_text_file.write(tc.clean_text) 442 | 443 | scorer_path = script_path + '.scorer' 444 | if not path.exists(scorer_path): 445 | # Generate LM 446 | data_lower, vocab_str = convert_and_filter_topk(scorer_path, clean_text_path, 500000) 447 | build_lm(scorer_path, kenlm_path, 5, '85%', '0|0|1', True, 255, 8, 'trie', data_lower, vocab_str) 448 | os.remove(scorer_path + '.' + 'lower.txt.gz') 449 | os.remove(scorer_path + '.' + 'lm.arpa') 450 | os.remove(scorer_path + '.' + 'lm_filtered.arpa') 451 | os.remove(clean_text_path) 452 | 453 | # Generate scorer 454 | create_bundle(alphabet_path, scorer_path + '.' + 'lm.binary', scorer_path + '.' + 'vocab-500000.txt', scorer_path, False, 0.931289039105002, 1.1834137581510284) 455 | os.remove(scorer_path + '.' + 'lm.binary') 456 | os.remove(scorer_path + '.' + 'vocab-500000.txt') 457 | 458 | generated_scorer = True 459 | else: 460 | scorer_path = lang_scorer_path 461 | 462 | logging.debug('Loading acoustic model from "{}", alphabet from "{}" and scorer from "{}"...' 463 | .format(output_graph_path, alphabet_path, scorer_path)) 464 | 465 | # Run VAD on the input file 466 | logging.debug('Transcribing VAD segments...') 467 | frames = read_frames_from_file(audio_path, model_format, args.audio_vad_frame_length) 468 | segments = vad_split(frames, 469 | model_format, 470 | num_padding_frames=args.audio_vad_padding, 471 | threshold=args.audio_vad_threshold, 472 | aggressiveness=args.audio_vad_aggressiveness) 473 | 474 | def pre_filter(): 475 | for i, segment in enumerate(segments): 476 | segment_buffer, time_start, time_end = segment 477 | time_length = time_end - time_start 478 | if args.stt_min_duration and time_length < args.stt_min_duration: 479 | logging.info('Fragment {}: Audio too short for STT'.format(i)) 480 | continue 481 | if args.stt_max_duration and time_length > args.stt_max_duration: 482 | logging.info('Fragment {}: Audio too long for STT'.format(i)) 483 | continue 484 | yield (time_start, time_end, np.frombuffer(segment_buffer, dtype=np.int16)) 485 | 486 | samples = list(progress(pre_filter(), desc='VAD splitting')) 487 | 488 | pool = multiprocessing.Pool(initializer=init_stt, 489 | initargs=(output_graph_path, scorer_path), 490 | processes=args.stt_workers) 491 | transcripts = list(progress(pool.imap(stt, samples), desc='Transcribing', total=len(samples))) 492 | 493 | fragments = [] 494 | for time_start, time_end, segment_transcript in transcripts: 495 | if segment_transcript is None: 496 | continue 497 | fragments.append({ 498 | 'start': time_start, 499 | 'end': time_end, 500 | 'transcript': segment_transcript 501 | }) 502 | logging.debug('Excluded {} empty transcripts'.format(len(transcripts) - len(fragments))) 503 | 504 | logging.debug('Writing transcription log to file "{}"...'.format(tlog_path)) 505 | with open(tlog_path, 'w', encoding='utf-8') as tlog_file: 506 | tlog_file.write(json.dumps(fragments, indent=4 if args.output_pretty else None, ensure_ascii=False)) 507 | 508 | # Remove scorer if generated 509 | if generated_scorer: 510 | os.remove(scorer_path) 511 | if not path.isfile(tlog_path): 512 | fail('Problem loading transcript from "{}"'.format(tlog_path)) 513 | to_align.append((tlog_path, script_path, aligned_path)) 514 | 515 | total_fragments = 0 516 | dropped_fragments = 0 517 | reasons = Counter() 518 | 519 | index = 0 520 | pool = multiprocessing.Pool(processes=args.align_workers) 521 | for aligned_file,
file_total_fragments, file_dropped_fragments, file_reasons in \ 522 | progress(pool.imap_unordered(align, to_align), desc='Aligning', total=len(to_align)): 523 | if args.no_progress: 524 | index += 1 525 | logging.info('Aligned file {} of {} - wrote results to "{}"'.format(index, len(to_align), aligned_file)) 526 | total_fragments += file_total_fragments 527 | dropped_fragments += file_dropped_fragments 528 | reasons += file_reasons 529 | 530 | logging.info('Aligned {} fragments'.format(total_fragments)) 531 | if total_fragments > 0 and dropped_fragments > 0: 532 | logging.info('Dropped {} fragments {:0.2f}%:'.format(dropped_fragments, 533 | dropped_fragments * 100.0 / total_fragments)) 534 | for key, number in reasons.most_common(): 535 | logging.info(' - {}: {}'.format(key, number)) 536 | 537 | 538 | def parse_args(): 539 | parser = argparse.ArgumentParser(description='Force align speech data with a transcript.') 540 | 541 | parser.add_argument('--audio', type=str, 542 | help='Path to speech audio file') 543 | parser.add_argument('--tlog', type=str, 544 | help='Path to STT transcription log (.tlog)') 545 | parser.add_argument('--script', type=str, 546 | help='Path to original transcript (plain text or .script file)') 547 | parser.add_argument('--catalog', type=str, 548 | help='Path to a catalog file with paths to transcription log or audio, original script and ' 549 | '(target) alignment files') 550 | parser.add_argument('--aligned', type=str, 551 | help='Alignment result file (.aligned)') 552 | parser.add_argument('--force', action="store_true", 553 | help='Overwrite existing files') 554 | parser.add_argument('--ignore-missing', action="store_true", 555 | help='Ignores catalog entries with missing paths') 556 | parser.add_argument('--loglevel', type=int, required=False, default=20, 557 | help='Log level (between 0 and 50) - default: 20') 558 | parser.add_argument('--no-progress', action="store_true", 559 | help='Prevents showing progress indication') 560 | parser.add_argument('--progress-interval', type=float, default=1.0, 561 | help='Progress indication interval in seconds') 562 | parser.add_argument('--play', action="store_true", 563 | help='Play audio fragments as they are matched using SoX audio tool') 564 | parser.add_argument('--text-context', type=int, required=False, default=10, 565 | help='Size of textual context for logged statements - default: 10') 566 | parser.add_argument('--start', type=int, required=False, default=0, 567 | help='Start alignment process at given offset of transcribed fragments') 568 | parser.add_argument('--num-samples', type=int, required=False, 569 | help='Number of fragments to align') 570 | parser.add_argument('--alphabet', required=False, 571 | help='Path to an alphabet file (overriding the one from --stt-model-dir)') 572 | 573 | audio_group = parser.add_argument_group(title='Audio pre-processing options') 574 | audio_group.add_argument('--audio-vad-aggressiveness', type=int, choices=range(4), default=3, 575 | help='Aggressiveness of voice activity detection in a frame (default: 3)') 576 | audio_group.add_argument('--audio-vad-padding', type=int, default=10, 577 | help='Number of padding audio frames in VAD ring-buffer') 578 | audio_group.add_argument('--audio-vad-threshold', type=float, default=0.5, 579 | help='VAD ring-buffer threshold for voiced frames ' 580 | '(e.g. 
0.5 -> 50%% of the ring-buffer frames have to be voiced ' 581 | 'for triggering a split)') 582 | audio_group.add_argument('--audio-vad-frame-length', choices=[10, 20, 30], default=30, 583 | help='VAD audio frame length in ms (10, 20 or 30)') 584 | 585 | stt_group = parser.add_argument_group(title='STT options') 586 | stt_group.add_argument('--stt-model-rate', type=int, default=DEFAULT_RATE, 587 | help='Supported sample rate of the acoustic model') 588 | stt_group.add_argument('--stt-model-dir', required=False, 589 | help='Path to a directory with output_graph, scorer and (optional) alphabet file ' + 590 | '(default: "models/en")') 591 | stt_group.add_argument('--stt-no-own-lm', action="store_true", 592 | help='Deactivates creation of individual language models per document. ' + 593 | 'Uses the one from the model dir instead.') 594 | stt_group.add_argument('--stt-workers', type=int, required=False, default=1, 595 | help='Number of parallel STT workers - should be 1 for GPU based DeepSpeech') 596 | stt_group.add_argument('--stt-min-duration', type=int, required=False, default=100, 597 | help='Minimum speech fragment duration in milliseconds to transcribe (default: 100)') 598 | stt_group.add_argument('--stt-max-duration', type=int, required=False, 599 | help='Maximum speech fragment duration in milliseconds to transcribe (default: no limit)') 600 | 601 | text_group = parser.add_argument_group(title='Text pre-processing options') 602 | text_group.add_argument('--text-meaningful-newlines', action="store_true", 603 | help='Newlines from plain text file separate phrases/speakers. ' 604 | '(see --align-phrase-snap-factor)') 605 | text_group.add_argument('--text-keep-dashes', action="store_true", 606 | help='No replacing of dashes with spaces. Whether they are kept at all depends on the alphabet.') 607 | text_group.add_argument('--text-keep-ws', action="store_true", 608 | help='No normalization of whitespace. Keep it as it is.') 609 | text_group.add_argument('--text-keep-casing', action="store_true", 610 | help='No lower-casing of characters. Keep them as they are.') 611 |
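# Note: the three Smith-Waterman scores below parameterize FuzzySearch for the
# coarse (global) matching phase in align(); the snap factors and wng options
# only affect the subsequent fine-alignment of fragment boundaries.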
612 | align_group = parser.add_argument_group(title='Alignment algorithm options') 613 | align_group.add_argument('--align-workers', type=int, required=False, 614 | help='Number of parallel alignment workers - defaults to number of CPUs') 615 | align_group.add_argument('--align-max-candidates', type=int, required=False, default=10, 616 | help='How many global 3gram match candidates are tested at max (default: 10)') 617 | align_group.add_argument('--align-candidate-threshold', type=float, required=False, default=0.92, 618 | help='Factor for how many 3grams the next candidate should have at least ' + 619 | 'compared to its predecessor (default: 0.92)') 620 | align_group.add_argument('--align-match-score', type=int, required=False, default=100, 621 | help='Matching score for Smith-Waterman alignment (default: 100)') 622 | align_group.add_argument('--align-mismatch-score', type=int, required=False, default=-100, 623 | help='Mismatch score for Smith-Waterman alignment (default: -100)') 624 | align_group.add_argument('--align-gap-score', type=int, required=False, default=-100, 625 | help='Gap score for Smith-Waterman alignment (default: -100)') 626 | align_group.add_argument('--align-shrink-fraction', type=float, required=False, default=0.1, 627 | help='Length fraction by which the fragment may get shrunk during fine alignment') 628 | align_group.add_argument('--align-stretch-fraction', type=float, required=False, default=0.25, 629 | help='Length fraction by which the fragment may get stretched during fine alignment') 630 | align_group.add_argument('--align-word-snap-factor', type=float, required=False, default=1.5, 631 | help='Priority factor for snapping matched texts to word boundaries ' 632 | '(default: 1.5 - slightly snappy)') 633 | align_group.add_argument('--align-phrase-snap-factor', type=float, required=False, default=1.0, 634 | help='Priority factor for snapping matched texts to phrase boundaries ' 635 | '(default: 1.0 - no snapping)') 636 | align_group.add_argument('--align-similarity-algo', type=str, required=False, default='wng', 637 | help='Similarity algorithm during fine-alignment - one of ' 638 | 'wng|editex|levenshtein|mra|hamming|jaro_winkler (default: wng)') 639 | align_group.add_argument('--align-wng-min-size', type=int, required=False, default=1, 640 | help='Minimum N-gram size for weighted N-gram similarity ' 641 | 'during fine-alignment (default: 1)') 642 | align_group.add_argument('--align-wng-max-size', type=int, required=False, default=3, 643 | help='Maximum N-gram size for weighted N-gram similarity ' 644 | 'during fine-alignment (default: 3)') 645 | align_group.add_argument('--align-wng-size-factor', type=float, required=False, default=1, 646 | help='Size weight for weighted N-gram similarity ' 647 | 'during fine-alignment (default: 1)') 648 | align_group.add_argument('--align-wng-position-factor', type=float, required=False, default=2.5, 649 | help='Position weight for weighted N-gram similarity ' 650 | 'during fine-alignment (default: 2.5)') 651 | 652 | output_group = parser.add_argument_group(title='Output options') 653 | output_group.add_argument('--output-pretty', action="store_true", 654 | help='Writes indented JSON output') 655 | 656 | for short in NAMED_NUMBERS.keys(): 657 | long, atype, desc = NAMED_NUMBERS[short] 658 | desc = (' - value range: ' + desc) if desc else '' 659 | output_group.add_argument('--output-' + short.lower(), action="store_true", 660 | help='Writes {} ({}) to output'.format(long, short))
661 | for extreme in ['Min', 'Max']: 662 | output_group.add_argument('--output-' + extreme.lower() + '-' + short.lower(), type=atype, required=False, 663 | help='{}imum {} ({}) the STT transcript of the audio ' 664 | 'has to have when compared with the original text{}' 665 | .format(extreme, long, short, desc)) 666 | 667 | return parser.parse_args() 668 | 669 | 670 | if __name__ == '__main__': 671 | args = parse_args() 672 | model_dir = os.path.expanduser(args.stt_model_dir if args.stt_model_dir else 'models/en') 673 | if args.alphabet is not None: 674 | alphabet_path = args.alphabet 675 | else: 676 | alphabet_path = os.path.join(model_dir, 'alphabet.txt') 677 | if not os.path.isfile(alphabet_path): 678 | fail('Found no alphabet file') 679 | logging.debug('Loading alphabet from "{}"...'.format(alphabet_path)) 680 | alphabet = Alphabet(alphabet_path) 681 | model_format = (args.stt_model_rate, 1, 2) 682 | main() 683 | -------------------------------------------------------------------------------- /align/audio.py: -------------------------------------------------------------------------------- 1 | import os 2 | import io 3 | import sox 4 | import wave 5 | import tempfile 6 | import collections 7 | import numpy as np 8 | 9 | from webrtcvad import Vad 10 | from utils import LimitingPool 11 | 12 | DEFAULT_RATE = 16000 13 | DEFAULT_CHANNELS = 1 14 | DEFAULT_WIDTH = 2 15 | DEFAULT_FORMAT = (DEFAULT_RATE, DEFAULT_CHANNELS, DEFAULT_WIDTH) 16 | 17 | AUDIO_TYPE_NP = 'application/vnd.mozilla.np' 18 | AUDIO_TYPE_PCM = 'application/vnd.mozilla.pcm' 19 | AUDIO_TYPE_WAV = 'audio/wav' 20 | AUDIO_TYPE_OPUS = 'application/vnd.mozilla.opus' 21 | SERIALIZABLE_AUDIO_TYPES = [AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS] 22 | 23 | OPUS_PCM_LEN_SIZE = 4 24 | OPUS_RATE_SIZE = 4 25 | OPUS_CHANNELS_SIZE = 1 26 | OPUS_WIDTH_SIZE = 1 27 | OPUS_CHUNK_LEN_SIZE = 2 28 | 29 | 30 | class Sample: 31 | """Represents in-memory audio data of a certain (convertible) representation. 32 | Attributes: 33 | audio_type (str): See `__init__`. 34 | audio_format (tuple:(int, int, int)): See `__init__`. 35 | audio (obj): Audio data represented as indicated by `audio_type` 36 | duration (float): Audio duration of the sample in seconds 37 | """ 38 | def __init__(self, audio_type, raw_data, audio_format=None): 39 | """ 40 | Creates a Sample from a raw audio representation. 41 | :param audio_type: Audio data representation type 42 | Supported types: 43 | - AUDIO_TYPE_OPUS: Memory file representation (BytesIO) of Opus encoded audio 44 | wrapped by a custom container format (used in SDBs) 45 | - AUDIO_TYPE_WAV: Memory file representation (BytesIO) of a Wave file 46 | - AUDIO_TYPE_PCM: Binary representation (bytearray) of PCM encoded audio data (Wave file without header) 47 | - AUDIO_TYPE_NP: NumPy representation of audio data (np.float32) - typically used for GPU feeding 48 | :param raw_data: Audio data in the form of the provided representation type (see audio_type). 49 | For types AUDIO_TYPE_OPUS or AUDIO_TYPE_WAV data can also be passed as a bytearray. 50 | :param audio_format: Tuple of sample-rate, number of channels and sample-width. 51 | Required in case of audio_type = AUDIO_TYPE_PCM or AUDIO_TYPE_NP, 52 | as this information cannot be derived from raw audio data.
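Example (illustrative sketch only - assumes `pcm_bytes` holds raw 16 kHz mono 16 bit PCM data):
sample = Sample(AUDIO_TYPE_PCM, pcm_bytes, audio_format=DEFAULT_FORMAT)
sample.change_audio_type(AUDIO_TYPE_NP) # in-place conversion, e.g. for model feeding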
53 | """ 54 | self.audio_type = audio_type 55 | self.audio_format = audio_format 56 | if audio_type in SERIALIZABLE_AUDIO_TYPES: 57 | self.audio = raw_data if isinstance(raw_data, io.BytesIO) else io.BytesIO(raw_data) 58 | self.duration = read_duration(audio_type, self.audio) 59 | else: 60 | self.audio = raw_data 61 | if self.audio_format is None: 62 | raise ValueError('For audio type "{}" parameter "audio_format" is mandatory'.format(self.audio_type)) 63 | if audio_type == AUDIO_TYPE_PCM: 64 | self.duration = get_pcm_duration(len(self.audio), self.audio_format) 65 | elif audio_type == AUDIO_TYPE_NP: 66 | self.duration = get_np_duration(len(self.audio), self.audio_format) 67 | else: 68 | raise ValueError('Unsupported audio type: {}'.format(self.audio_type)) 69 | 70 | def change_audio_type(self, new_audio_type): 71 | """ 72 | In-place conversion of audio data into a different representation. 73 | :param new_audio_type: New audio-type - see `__init__`. 74 | Not supported: Converting from AUDIO_TYPE_NP into any other type. 75 | """ 76 | if self.audio_type == new_audio_type: 77 | return 78 | if new_audio_type == AUDIO_TYPE_PCM and self.audio_type in SERIALIZABLE_AUDIO_TYPES: 79 | self.audio_format, audio = read_audio(self.audio_type, self.audio) 80 | self.audio.close() 81 | self.audio = audio 82 | elif new_audio_type == AUDIO_TYPE_NP: 83 | self.change_audio_type(AUDIO_TYPE_PCM) 84 | self.audio = pcm_to_np(self.audio_format, self.audio) 85 | elif new_audio_type in SERIALIZABLE_AUDIO_TYPES: 86 | self.change_audio_type(AUDIO_TYPE_PCM) 87 | audio_bytes = io.BytesIO() 88 | write_audio(new_audio_type, audio_bytes, self.audio_format, self.audio) 89 | audio_bytes.seek(0) 90 | self.audio = audio_bytes 91 | else: 92 | raise RuntimeError('Changing audio representation type from "{}" to "{}" not supported' 93 | .format(self.audio_type, new_audio_type)) 94 | self.audio_type = new_audio_type 95 | 96 | 97 | def change_audio_types(samples, audio_type=AUDIO_TYPE_PCM, processes=None): 98 | def change_audio_type(sample): 99 | sample.change_audio_type(audio_type) 100 | return sample 101 | with LimitingPool(processes=processes) as pool: 102 | for current_sample in pool.map(change_audio_type, samples): 103 | yield current_sample 104 | 105 | 106 | def read_audio_format_from_wav_file(wav_file): 107 | return wav_file.getframerate(), wav_file.getnchannels(), wav_file.getsampwidth() 108 | 109 | 110 | def write_audio_format_to_wav_file(wav_file, audio_format=DEFAULT_FORMAT): 111 | rate, channels, width = audio_format 112 | wav_file.setframerate(rate) 113 | wav_file.setnchannels(channels) 114 | wav_file.setsampwidth(width) 115 | 116 | 117 | def get_num_samples(pcm_buffer_size, audio_format=DEFAULT_FORMAT): 118 | _, channels, width = audio_format 119 | return pcm_buffer_size // (channels * width) 120 | 121 | 122 | def get_pcm_duration(pcm_buffer_size, audio_format=DEFAULT_FORMAT): 123 | return get_num_samples(pcm_buffer_size, audio_format) / audio_format[0] 124 | 125 | 126 | def get_np_duration(np_len, audio_format=DEFAULT_FORMAT): 127 | return np_len / audio_format[0] 128 | 129 | 130 | def convert_audio(src_audio_path, dst_audio_path, file_type=None, audio_format=DEFAULT_FORMAT): 131 | sample_rate, channels, width = audio_format 132 | try: 133 | transformer = sox.Transformer() 134 | transformer.set_output_format(file_type=file_type, rate=sample_rate, channels=channels, bits=width*8) 135 | transformer.build(src_audio_path, dst_audio_path) 136 | except sox.core.SoxError: 137 | return False 138 | return True 139 | 140 | 141 
| def verify_wav_file(wav_path):
142 |     try:
143 |         with wave.open(wav_path, 'r') as wav_file:
144 |             if wav_file.getnframes() > 0:
145 |                 return True
146 |     except Exception:
147 |         return False
148 |     return False
149 |
150 |
151 | def ensure_wav_with_format(src_audio_path, audio_format=DEFAULT_FORMAT, tmp_dir=None):
152 |     if src_audio_path.endswith('.wav'):
153 |         with wave.open(src_audio_path, 'r') as src_audio_file:
154 |             if read_audio_format_from_wav_file(src_audio_file) == audio_format:
155 |                 return src_audio_path, False
156 |     fd, tmp_file_path = tempfile.mkstemp(suffix='.wav', dir=tmp_dir)
157 |     os.close(fd)
158 |     # sox re-opens the temporary file by path, so the mkstemp handle is closed right away
159 |     if convert_audio(src_audio_path, tmp_file_path, file_type='wav', audio_format=audio_format):
160 |         return tmp_file_path, True
161 |     os.remove(tmp_file_path)
162 |     return None, False
163 |
164 |
165 | def extract_audio(audio_file, start, end):
166 |     assert 0 <= start <= end
167 |     rate = audio_file.getframerate()
168 |     audio_file.setpos(int(start * rate))
169 |     return audio_file.readframes(int((end - start) * rate))
170 |
171 |
172 | class AudioFile:
173 |     def __init__(self, audio_path, as_path=False, audio_format=DEFAULT_FORMAT):
174 |         self.audio_path = audio_path
175 |         self.audio_format = audio_format
176 |         self.as_path = as_path
177 |         self.open_file = None
178 |         self.tmp_file_path = None
179 |
180 |     def __enter__(self):
181 |         if self.audio_path.endswith('.wav'):
182 |             self.open_file = wave.open(self.audio_path, 'r')
183 |             if read_audio_format_from_wav_file(self.open_file) == self.audio_format:
184 |                 if self.as_path:
185 |                     self.open_file.close()
186 |                     return self.audio_path
187 |                 return self.open_file
188 |             self.open_file.close()
189 |         fd, self.tmp_file_path = tempfile.mkstemp(suffix='.wav')
190 |         os.close(fd)
191 |         # the conversion below writes to tmp_file_path by path; the descriptor is not needed
192 |         if not convert_audio(self.audio_path, self.tmp_file_path, file_type='wav', audio_format=self.audio_format):
193 |             raise RuntimeError('Unable to convert "{}" to required format'.format(self.audio_path))
194 |         if self.as_path:
195 |             return self.tmp_file_path
196 |         self.open_file = wave.open(self.tmp_file_path, 'r')
197 |         return self.open_file
198 |
199 |     def __exit__(self, *args):
200 |         if self.open_file is not None and not self.as_path:
201 |             self.open_file.close()
202 |         self.open_file = None
203 |         if self.tmp_file_path is not None:
204 |             os.remove(self.tmp_file_path)
205 |             self.tmp_file_path = None
206 |
207 |
208 | def read_frames(wav_file, frame_duration_ms=30, yield_remainder=False):
209 |     audio_format = read_audio_format_from_wav_file(wav_file)
210 |     frame_size = int(audio_format[0] * (frame_duration_ms / 1000.0))
211 |     while True:
212 |         try:
213 |             data = wav_file.readframes(frame_size)
214 |             if not data or (not yield_remainder and get_pcm_duration(len(data), audio_format) * 1000 < frame_duration_ms):
215 |                 break
216 |             yield data
217 |         except EOFError:
218 |             break
219 |
220 |
221 | def read_frames_from_file(audio_path, audio_format=DEFAULT_FORMAT, frame_duration_ms=30, yield_remainder=False):
222 |     with AudioFile(audio_path, audio_format=audio_format) as wav_file:
223 |         for frame in read_frames(wav_file, frame_duration_ms=frame_duration_ms, yield_remainder=yield_remainder):
224 |             yield frame
225 |
226 |
227 | def vad_split(audio_frames,
228 |               audio_format=DEFAULT_FORMAT,
229 |               num_padding_frames=10,
230 |               threshold=0.5,
231 |               aggressiveness=3):
232 |     sample_rate, channels, width = audio_format
233 |     if channels != 1:
234 |         raise ValueError('VAD-splitting requires mono samples')
235 |     if width != 2:
236 |         raise ValueError('VAD-splitting requires 16 bit samples')
237 |     if sample_rate
not in [8000, 16000, 32000, 48000]: 238 | raise ValueError('VAD-splitting only supported for sample rates 8000, 16000, 32000, or 48000') 239 | if aggressiveness not in [0, 1, 2, 3]: 240 | raise ValueError('VAD-splitting aggressiveness mode has to be one of 0, 1, 2, or 3') 241 | ring_buffer = collections.deque(maxlen=num_padding_frames) 242 | triggered = False 243 | vad = Vad(int(aggressiveness)) 244 | voiced_frames = [] 245 | frame_duration_ms = 0 246 | frame_index = 0 247 | for frame_index, frame in enumerate(audio_frames): 248 | frame_duration_ms = get_pcm_duration(len(frame), audio_format) * 1000 249 | if int(frame_duration_ms) not in [10, 20, 30]: 250 | raise ValueError('VAD-splitting only supported for frame durations 10, 20, or 30 ms') 251 | is_speech = vad.is_speech(frame, sample_rate) 252 | if not triggered: 253 | ring_buffer.append((frame, is_speech)) 254 | num_voiced = len([f for f, speech in ring_buffer if speech]) 255 | if num_voiced > threshold * ring_buffer.maxlen: 256 | triggered = True 257 | for f, s in ring_buffer: 258 | voiced_frames.append(f) 259 | ring_buffer.clear() 260 | else: 261 | voiced_frames.append(frame) 262 | ring_buffer.append((frame, is_speech)) 263 | num_unvoiced = len([f for f, speech in ring_buffer if not speech]) 264 | if num_unvoiced > threshold * ring_buffer.maxlen: 265 | triggered = False 266 | yield b''.join(voiced_frames), \ 267 | frame_duration_ms * max(0, frame_index - len(voiced_frames)), \ 268 | frame_duration_ms * frame_index 269 | ring_buffer.clear() 270 | voiced_frames = [] 271 | if len(voiced_frames) > 0: 272 | yield b''.join(voiced_frames), \ 273 | frame_duration_ms * (frame_index - len(voiced_frames)), \ 274 | frame_duration_ms * (frame_index + 1) 275 | 276 | 277 | def pack_number(n, num_bytes): 278 | return n.to_bytes(num_bytes, 'big', signed=False) 279 | 280 | 281 | def unpack_number(data): 282 | return int.from_bytes(data, 'big', signed=False) 283 | 284 | 285 | def get_opus_frame_size(rate): 286 | return 60 * rate // 1000 287 | 288 | 289 | def write_opus(opus_file, audio_format, audio_data): 290 | rate, channels, width = audio_format 291 | frame_size = get_opus_frame_size(rate) 292 | import opuslib 293 | encoder = opuslib.Encoder(rate, channels, opuslib.APPLICATION_AUDIO) 294 | chunk_size = frame_size * channels * width 295 | opus_file.write(pack_number(len(audio_data), OPUS_PCM_LEN_SIZE)) 296 | opus_file.write(pack_number(rate, OPUS_RATE_SIZE)) 297 | opus_file.write(pack_number(channels, OPUS_CHANNELS_SIZE)) 298 | opus_file.write(pack_number(width, OPUS_WIDTH_SIZE)) 299 | for i in range(0, len(audio_data), chunk_size): 300 | chunk = audio_data[i:i + chunk_size] 301 | # Preventing non-deterministic encoding results from uninitialized remainder of the encoder buffer 302 | if len(chunk) < chunk_size: 303 | chunk = chunk + bytearray(chunk_size - len(chunk)) 304 | encoded = encoder.encode(chunk, frame_size) 305 | opus_file.write(pack_number(len(encoded), OPUS_CHUNK_LEN_SIZE)) 306 | opus_file.write(encoded) 307 | 308 | 309 | def read_opus_header(opus_file): 310 | opus_file.seek(0) 311 | pcm_buffer_size = unpack_number(opus_file.read(OPUS_PCM_LEN_SIZE)) 312 | rate = unpack_number(opus_file.read(OPUS_RATE_SIZE)) 313 | channels = unpack_number(opus_file.read(OPUS_CHANNELS_SIZE)) 314 | width = unpack_number(opus_file.read(OPUS_WIDTH_SIZE)) 315 | return pcm_buffer_size, (rate, channels, width) 316 | 317 | 318 | def read_opus(opus_file): 319 | pcm_buffer_size, audio_format = read_opus_header(opus_file) 320 | rate, channels, _ = audio_format 321 | 
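# Usage sketch for the VAD splitter above, assuming a 16 kHz mono 16-bit WAV
# at a hypothetical path; vad_split() yields (pcm_bytes, start_ms, end_ms):
frames = read_frames_from_file('recording.wav', frame_duration_ms=30)
for segment, start_ms, end_ms in vad_split(frames, aggressiveness=3):
    print('voiced segment {:.0f}-{:.0f} ms ({} bytes)'.format(start_ms, end_ms, len(segment)))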
frame_size = get_opus_frame_size(rate)
322 |     import opuslib
323 |     decoder = opuslib.Decoder(rate, channels)
324 |     audio_data = bytearray()
325 |     while len(audio_data) < pcm_buffer_size:
326 |         chunk_len = unpack_number(opus_file.read(OPUS_CHUNK_LEN_SIZE))
327 |         chunk = opus_file.read(chunk_len)
328 |         decoded = decoder.decode(chunk, frame_size)
329 |         audio_data.extend(decoded)
330 |     audio_data = audio_data[:pcm_buffer_size]
331 |     return audio_format, audio_data
332 |
333 |
334 | def write_wav(wav_file, audio_format, pcm_data):
335 |     with wave.open(wav_file, 'wb') as wav_file_writer:
336 |         write_audio_format_to_wav_file(wav_file_writer, audio_format=audio_format)
337 |         wav_file_writer.writeframes(pcm_data)
338 |
339 |
340 | def read_wav(wav_file):
341 |     wav_file.seek(0)
342 |     with wave.open(wav_file, 'rb') as wav_file_reader:
343 |         audio_format = read_audio_format_from_wav_file(wav_file_reader)
344 |         pcm_data = wav_file_reader.readframes(wav_file_reader.getnframes())
345 |     # wav_file is a caller-owned file-like object - it must not be closed here
346 |     return audio_format, pcm_data
347 |
348 |
349 | def read_audio(audio_type, audio_file):
350 |     if audio_type == AUDIO_TYPE_WAV:
351 |         return read_wav(audio_file)
352 |     if audio_type == AUDIO_TYPE_OPUS:
353 |         return read_opus(audio_file)
354 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
355 |
356 |
357 | def write_audio(audio_type, audio_file, audio_format, pcm_data):
358 |     if audio_type == AUDIO_TYPE_WAV:
359 |         return write_wav(audio_file, audio_format, pcm_data)
360 |     if audio_type == AUDIO_TYPE_OPUS:
361 |         return write_opus(audio_file, audio_format, pcm_data)
362 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
363 |
364 |
365 | def read_wav_duration(wav_file):
366 |     wav_file.seek(0)
367 |     with wave.open(wav_file, 'rb') as wav_file_reader:
368 |         return wav_file_reader.getnframes() / wav_file_reader.getframerate()
369 |
370 |
371 | def read_opus_duration(opus_file):
372 |     pcm_buffer_size, audio_format = read_opus_header(opus_file)
373 |     return get_pcm_duration(pcm_buffer_size, audio_format)
374 |
375 |
376 | def read_duration(audio_type, audio_file):
377 |     if audio_type == AUDIO_TYPE_WAV:
378 |         return read_wav_duration(audio_file)
379 |     if audio_type == AUDIO_TYPE_OPUS:
380 |         return read_opus_duration(audio_file)
381 |     raise ValueError('Unsupported audio type: {}'.format(audio_type))
382 |
383 |
384 | def pcm_to_np(audio_format, pcm_data):
385 |     _, channels, width = audio_format
386 |     if width not in [1, 2, 4]:
387 |         raise ValueError('Unsupported sample width: {}'.format(width))
388 |     dtype = [None, np.int8, np.int16, None, np.int32][width]
389 |     samples = np.frombuffer(pcm_data, dtype=dtype)
390 |     samples = samples[::channels]  # limited to mono for now
391 |     samples = samples.astype(np.float32) / np.iinfo(dtype).max
392 |     return np.expand_dims(samples, axis=1)
393 |
--------------------------------------------------------------------------------
/align/catalog_tool.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | Tool for combining and converting paths within catalog files
4 | """
5 | import sys
6 | import json
7 | import argparse
8 |
9 | from glob import glob
10 | from pathlib import Path
11 |
12 |
13 | def fail(message):
14 |     print(message)
15 |     sys.exit(1)
16 |
17 |
18 | def build_catalog():
19 |     catalog_paths = []
20 |     for source_glob in CLI_ARGS.sources:
21 |         catalog_paths.extend(glob(source_glob))
22 |     items = []
23 |     for catalog_original_path in catalog_paths:
24 |         catalog_path
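# Sketch for pcm_to_np() further above: 16-bit PCM becomes float32 roughly in
# [-1, 1] with shape (num_samples, 1); numpy is assumed imported as np, as
# elsewhere in audio.py:
pcm = np.array([0, 16384, -32768], dtype=np.int16).tobytes()
print(pcm_to_np((16000, 1, 2), pcm))   # approximately [[0.0], [0.5], [-1.0]], shape (3, 1)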
= Path(catalog_original_path).absolute() 25 | print('Loading catalog "{}"'.format(str(catalog_original_path))) 26 | if not catalog_path.is_file(): 27 | fail('Unable to find catalog file "{}"'.format(str(catalog_path))) 28 | with open(catalog_path, 'r', encoding='utf-8') as catalog_file: 29 | catalog_items = json.load(catalog_file) 30 | base_path = catalog_path.parent.absolute() 31 | for item in catalog_items: 32 | new_item = {} 33 | for entry, entry_original_path in item.items(): 34 | entry_path = Path(entry_original_path) 35 | entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path).absolute() 36 | if ((len(CLI_ARGS.check) == 1 and CLI_ARGS.check[0] == 'all') 37 | or entry in CLI_ARGS.check) and not entry_path.is_file(): 38 | note = 'Catalog "{}" - Missing file for "{}" ("{}")'.format( 39 | str(catalog_original_path), entry, str(entry_original_path)) 40 | if CLI_ARGS.on_miss == 'fail': 41 | fail(note + ' - aborting') 42 | if CLI_ARGS.on_miss == 'ignore': 43 | print(note + ' - keeping it as it is') 44 | new_item[entry] = str(entry_path) 45 | elif CLI_ARGS.on_miss == 'drop': 46 | print(note + ' - dropping catalog item') 47 | new_item = None 48 | break 49 | else: 50 | print(note + ' - removing entry from item') 51 | else: 52 | new_item[entry] = str(entry_path) 53 | if CLI_ARGS.output is not None and new_item is not None and len(new_item.keys()) > 0: 54 | items.append(new_item) 55 | if CLI_ARGS.output is not None: 56 | catalog_path = Path(CLI_ARGS.output).absolute() 57 | print('Writing catalog "{}"'.format(str(CLI_ARGS.output))) 58 | if CLI_ARGS.make_relative: 59 | base_path = catalog_path.parent 60 | for item in items: 61 | for entry in item.keys(): 62 | item[entry] = str(Path(item[entry]).relative_to(base_path)) 63 | if CLI_ARGS.order_by is not None: 64 | items.sort(key=lambda i: i[CLI_ARGS.order_by] if CLI_ARGS.order_by in i else '') 65 | with open(catalog_path, 'w', encoding='utf-8') as catalog_file: 66 | json.dump(items, catalog_file, indent=2) 67 | 68 | 69 | def handle_args(): 70 | parser = argparse.ArgumentParser(description='Tool for combining catalog files and/or ordering, checking and ' 71 | 'converting paths within catalog files') 72 | parser.add_argument('--output', help='Write collected catalog items to this new catalog file') 73 | parser.add_argument('--make-relative', action='store_true', 74 | help='Make all path entries of all items relative to new catalog file\'s parent directory') 75 | parser.add_argument('--check', 76 | help='Comma separated list of path entries to check for existence ' 77 | '("all" for checking every entry, default: no checks)') 78 | parser.add_argument('--on-miss', default='fail', choices=['fail', 'drop', 'remove', 'ignore'], 79 | help='What to do if a path is not existing: ' 80 | '"fail" (exit program), ' 81 | '"drop" (drop catalog item) or ' 82 | '"remove" (remove path entry from catalog item) or ' 83 | '"ignore" (keep it as it is)') 84 | parser.add_argument('--order-by', help='Path entry used for sorting items in target catalog') 85 | parser.add_argument('sources', nargs='+', help='Source catalog files (supporting wildcards)') 86 | return parser.parse_args() 87 | 88 | 89 | if __name__ == "__main__": 90 | CLI_ARGS = handle_args() 91 | CLI_ARGS.check = [] if CLI_ARGS.check is None else CLI_ARGS.check.split(',') 92 | build_catalog() 93 | -------------------------------------------------------------------------------- /align/export.py: -------------------------------------------------------------------------------- 1 | import os 2 
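# Illustrative input for the tool above (hypothetical paths): a catalog is a
# JSON list of items mapping entry names to files, relative paths being
# resolved against the catalog file's own directory:
#
#   [{"audio": "audio/part1.mp3", "aligned": "aligned/part1.aligned"},
#    {"audio": "/data/part2.wav", "aligned": "/data/part2.aligned"}]
#
# A typical combining run could then look like:
#   python catalog_tool.py --check all --on-miss drop --order-by audio \
#       --output data/all.catalog data/*.catalog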
| import io 3 | import sys 4 | import csv 5 | import math 6 | import time 7 | import json 8 | import wave 9 | import pickle 10 | import random 11 | import tarfile 12 | import logging 13 | import argparse 14 | import statistics 15 | import os.path as path 16 | 17 | from datetime import timedelta 18 | from collections import Counter 19 | from multiprocessing import Pool 20 | from audio import AUDIO_TYPE_PCM, AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS,\ 21 | ensure_wav_with_format, extract_audio, change_audio_types, write_audio_format_to_wav_file, verify_wav_file 22 | from sample_collections import SortingSDBWriter, LabeledSample 23 | from utils import parse_file_size, log_progress 24 | 25 | UNKNOWN = '' 26 | AUDIO_TYPE_LOOKUP = { 27 | 'wav': AUDIO_TYPE_WAV, 28 | 'opus': AUDIO_TYPE_OPUS 29 | } 30 | SET_NAMES = ['train', 'dev', 'test'] 31 | 32 | 33 | class Fragment: 34 | def __init__(self, catalog_index, alignment_index, quality=0, duration=0): 35 | self.catalog_index = catalog_index 36 | self.alignment_index = alignment_index 37 | self.quality = quality 38 | self.duration = duration 39 | self.meta = {} 40 | self.partition = 'other' 41 | self.list_name = 'other' 42 | 43 | 44 | def progress(it=None, desc=None, total=None): 45 | if desc is not None: 46 | logging.info(desc) 47 | return it if CLI_ARGS.no_progress else log_progress(it, interval=CLI_ARGS.progress_interval, total=total) 48 | 49 | 50 | def fail(message, code=1): 51 | logging.fatal(message) 52 | exit(code) 53 | 54 | 55 | def check_path(target_path, fs_type='file'): 56 | if not (path.isfile(target_path) if fs_type == 'file' else path.isdir(target_path)): 57 | fail('{} not existing: "{}"'.format(fs_type[0].upper() + fs_type[1:], target_path)) 58 | return path.abspath(target_path) 59 | 60 | 61 | def make_absolute(base_path, spec_path): 62 | if not path.isabs(spec_path): 63 | spec_path = path.join(base_path, spec_path) 64 | spec_path = path.abspath(spec_path) 65 | return spec_path if path.isfile(spec_path) else None 66 | 67 | 68 | def engroup(lst, get_key): 69 | groups = {} 70 | for obj in lst: 71 | key = get_key(obj) 72 | if key in groups: 73 | groups[key].append(obj) 74 | else: 75 | groups[key] = [obj] 76 | return groups 77 | 78 | 79 | def get_sample_size(population_size): 80 | margin_of_error = 0.01 81 | fraction_picking = 0.50 82 | z_score = 2.58 # Corresponds to confidence level 99% 83 | numerator = (z_score ** 2 * fraction_picking * (1 - fraction_picking)) / ( 84 | margin_of_error ** 2 85 | ) 86 | sample_size = 0 87 | for train_size in range(population_size, 0, -1): 88 | denominator = 1 + (z_score ** 2 * fraction_picking * (1 - fraction_picking)) / ( 89 | margin_of_error ** 2 * train_size 90 | ) 91 | sample_size = int(numerator / denominator) 92 | if 2 * sample_size + train_size <= population_size: 93 | break 94 | return sample_size 95 | 96 | 97 | def load_catalog(): 98 | catalog_entries = [] 99 | if CLI_ARGS.audio: 100 | if CLI_ARGS.aligned: 101 | catalog_entries.append((check_path(CLI_ARGS.audio), check_path(CLI_ARGS.aligned))) 102 | else: 103 | fail('If you specify "--audio", you also have to specify "--aligned"') 104 | elif CLI_ARGS.aligned: 105 | fail('If you specify "--aligned", you also have to specify "--audio"') 106 | elif CLI_ARGS.catalog: 107 | catalog = check_path(CLI_ARGS.catalog) 108 | catalog_dir = path.dirname(catalog) 109 | with open(catalog, 'r', encoding='utf-8') as catalog_file: 110 | catalog_file_entries = json.load(catalog_file) 111 | for entry in progress(catalog_file_entries, desc='Reading catalog'): 112 | audio = 
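# The loop in get_sample_size() above amounts to Cochran's sample size formula
# with finite population correction: with z = 2.58 (99% confidence), p = 0.5
# and margin of error e = 0.01, it picks the largest train_size for which
# train_size + 2 * n(train_size) still fits into the population, where
#
#   n(t) = (z**2 * p * (1 - p) / e**2) / (1 + z**2 * p * (1 - p) / (e**2 * t))
#
# so the dev and test sets each get n samples and the rest goes to training.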
make_absolute(catalog_dir, entry['audio']) 113 | aligned = make_absolute(catalog_dir, entry['aligned']) 114 | if audio is None or aligned is None: 115 | if CLI_ARGS.ignore_missing: 116 | continue 117 | if audio is None: 118 | fail('Problem loading catalog "{}": Missing referenced audio file "{}"' 119 | .format(CLI_ARGS.catalog, entry['audio'])) 120 | if aligned is None: 121 | fail('Problem loading catalog "{}": Missing referenced alignment file "{}"' 122 | .format(CLI_ARGS.catalog, entry['aligned'])) 123 | catalog_entries.append((audio, aligned)) 124 | else: 125 | fail('You have to either specify "--audio" and "--aligned" or "--catalog"') 126 | return catalog_entries 127 | 128 | 129 | def load_fragments(catalog_entries): 130 | def get_meta_list(ae, mf): 131 | if 'meta' in ae: 132 | meta_fields = ae['meta'] 133 | if isinstance(meta_fields, dict) and mf in meta_fields: 134 | mf = meta_fields[mf] 135 | return mf if isinstance(mf, list) else [mf] 136 | return [] 137 | 138 | required_metas = {} 139 | if CLI_ARGS.debias is not None: 140 | for debias_meta_field in CLI_ARGS.debias: 141 | required_metas[debias_meta_field] = True 142 | if CLI_ARGS.split and CLI_ARGS.split_field is not None: 143 | required_metas[CLI_ARGS.split_field] = True 144 | 145 | fragments = [] 146 | reasons = Counter() 147 | for catalog_index, catalog_entry in enumerate(progress(catalog_entries, desc='Loading alignments')): 148 | audio_path, aligned_path = catalog_entry 149 | with open(aligned_path, 'r', encoding='utf-8') as aligned_file: 150 | aligned = json.load(aligned_file) 151 | for alignment_index, alignment in enumerate(aligned): 152 | quality = eval(CLI_ARGS.criteria, {'math': math}, alignment) 153 | alignment['quality'] = quality 154 | if eval(CLI_ARGS.filter, {'math': math}, alignment): 155 | reasons['Filter'] += 1 156 | continue 157 | meta = {} 158 | keep = True 159 | for meta_field in required_metas.keys(): 160 | meta_list = get_meta_list(alignment, meta_field) 161 | if CLI_ARGS.split and CLI_ARGS.split_field == meta_field: 162 | if CLI_ARGS.split_drop_multiple and len(meta_list) > 1: 163 | reasons['Split drop multiple'] += 1 164 | keep = False 165 | break 166 | elif CLI_ARGS.split_drop_unknown and len(meta_list) == 0: 167 | reasons['Split drop unknown'] += 1 168 | keep = False 169 | break 170 | meta[meta_field] = meta_list[0] if meta_list else UNKNOWN 171 | if keep: 172 | duration = alignment['end'] - alignment['start'] 173 | fragment = Fragment(catalog_index, alignment_index, quality=quality, duration=duration) 174 | fragment.meta = meta 175 | for minimum, partition_name in CLI_ARGS.partition: 176 | if quality >= minimum: 177 | fragment.partition = partition_name 178 | break 179 | fragments.append(fragment) 180 | 181 | if len(fragments) == 0: 182 | fail('No samples left for export') 183 | 184 | if len(reasons.keys()) > 0: 185 | logging.info('Excluded number of samples (for each reason):') 186 | for reason, count in reasons.most_common(): 187 | logging.info(' - "{}": {}'.format(reason, count)) 188 | return fragments 189 | 190 | 191 | def debias(fragments): 192 | if CLI_ARGS.debias is not None: 193 | for debias in CLI_ARGS.debias: 194 | grouped = engroup(fragments, lambda f: f.meta[debias]) 195 | if UNKNOWN in grouped: 196 | fragments = grouped[UNKNOWN] 197 | del grouped[UNKNOWN] 198 | else: 199 | fragments = [] 200 | counts = list(map(lambda f: len(f), grouped.values())) 201 | mean = statistics.mean(counts) 202 | sigma = statistics.pstdev(counts, mu=mean) 203 | cap = int(mean + CLI_ARGS.debias_sigma_factor * 
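# Note on the eval() calls above: --criteria and --filter are trusted Python
# expressions evaluated once per alignment entry, with the entry's JSON fields
# as local names and math as the only global. Hypothetical examples (available
# field names depend on what the aligner wrote; start/end are milliseconds):
#   --criteria 'sws'                 # e.g. use a Smith-Waterman score as quality
#   --filter 'end - start < 1000'    # drop fragments shorter than one second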
sigma) 204 | counter = Counter() 205 | for group, group_fragments in progress(grouped.items(), desc='De-biasing "{}"'.format(debias)): 206 | if len(group_fragments) > cap: 207 | group_fragments.sort(key=lambda f: f.quality) 208 | counter[group] += len(group_fragments) - cap 209 | group_fragments = group_fragments[-cap:] 210 | fragments.extend(group_fragments) 211 | if len(counter.keys()) > 0: 212 | logging.info('Dropped for de-biasing "{}":'.format(debias)) 213 | for group, count in counter.most_common(): 214 | logging.info(' - "{}": {}'.format(group, count)) 215 | return fragments 216 | 217 | 218 | def parse_set_assignments(): 219 | set_assignments = {} 220 | for set_index, set_name in enumerate(SET_NAMES): 221 | attr_name = 'assign_' + set_name 222 | if hasattr(CLI_ARGS, attr_name): 223 | set_entities = getattr(CLI_ARGS, attr_name) 224 | if set_entities is not None: 225 | for entity_id in filter(None, str(set_entities).split(',')): 226 | if entity_id in set_assignments: 227 | fail('Unable to assign entity "{}" to set "{}", as it is already assigned to set "{}"' 228 | .format(entity_id, set_name, SET_NAMES[set_assignments[entity_id]])) 229 | set_assignments[entity_id] = set_index 230 | return set_assignments 231 | 232 | 233 | def check_targets(): 234 | if CLI_ARGS.target_dir is not None and CLI_ARGS.target_tar is not None: 235 | fail('Only one allowed: --target-dir or --target-tar') 236 | elif CLI_ARGS.target_dir is not None: 237 | CLI_ARGS.target_dir = check_path(CLI_ARGS.target_dir, fs_type='directory') 238 | elif CLI_ARGS.target_tar is not None: 239 | if CLI_ARGS.sdb: 240 | fail('Option --sdb not supported for --target-tar output. Use --target-dir instead.') 241 | CLI_ARGS.target_tar = path.abspath(CLI_ARGS.target_tar) 242 | if path.isfile(CLI_ARGS.target_tar): 243 | if not CLI_ARGS.force: 244 | fail('Target tar-file already existing - use --force to overwrite') 245 | elif path.exists(CLI_ARGS.target_tar): 246 | fail('Target tar-file path is existing, but not a file') 247 | elif not path.isdir(path.dirname(CLI_ARGS.target_tar)): 248 | fail('Unable to write tar-file: Path not existing') 249 | else: 250 | fail('Either --target-dir or --target-tar has to be provided') 251 | 252 | 253 | def split(fragments, set_assignments): 254 | lists = [] 255 | 256 | def assign_fragments(frags, name): 257 | lists.append(name) 258 | duration = 0 259 | for f in frags: 260 | f.list_name = name 261 | duration += f.duration 262 | logging.info('Built set "{}" (samples: {}, duration: {})' 263 | .format(name, len(frags), timedelta(milliseconds=duration))) 264 | 265 | if CLI_ARGS.split_seed is not None: 266 | random.seed(CLI_ARGS.split_seed) 267 | 268 | if CLI_ARGS.split and CLI_ARGS.split_field is not None: 269 | fragments = list(fragments) 270 | metas = engroup(fragments, lambda f: f.meta[CLI_ARGS.split_field]).items() 271 | metas = sorted(metas, key=lambda meta_frags: len(meta_frags[1])) 272 | metas = list(map(lambda meta_frags: meta_frags[0], metas)) 273 | partitions = engroup(fragments, lambda f: f.partition) 274 | partitions = list(map(lambda part_frags: (part_frags[0], 275 | get_sample_size(len(part_frags[1])), 276 | engroup(part_frags[1], lambda f: f.meta[CLI_ARGS.split_field]), 277 | [[], [], []]), 278 | partitions.items())) 279 | remaining_metas = [] 280 | for meta in metas: 281 | if meta in set_assignments: 282 | set_index = set_assignments[meta] 283 | for _, _, partition_portions, sample_sets in partitions: 284 | if meta in partition_portions: 285 | 
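# Worked example for the de-biasing cap above: group sizes [100, 100, 400]
# have mean 200 and population standard deviation sqrt(20000) ~ 141.4, so the
# default --debias-sigma-factor of 3.0 yields cap = int(200 + 3 * 141.4) = 624
# and nothing is trimmed; a factor of 1.0 yields cap = 341, dropping the 59
# lowest-quality samples from the largest group.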
sample_sets[set_index].extend(partition_portions[meta]) 286 | del partition_portions[meta] 287 | else: 288 | remaining_metas.append(meta) 289 | metas = remaining_metas 290 | for _, sample_size, _, sample_sets in partitions: 291 | while len(metas) > 0 and (len(sample_sets[1]) < sample_size or len(sample_sets[2]) < sample_size): 292 | for sample_set_index in [1, 2]: 293 | if len(metas) > 0 and sample_size > len(sample_sets[sample_set_index]): 294 | meta = metas.pop(0) 295 | for _, _, partition_portions, other_sample_sets in partitions: 296 | if meta in partition_portions: 297 | other_sample_sets[sample_set_index].extend(partition_portions[meta]) 298 | del partition_portions[meta] 299 | for partition, sample_size, partition_portions, sample_sets in partitions: 300 | for portion in partition_portions.values(): 301 | sample_sets[0].extend(portion) 302 | for set_index, set_name in enumerate(SET_NAMES): 303 | assign_fragments(sample_sets[set_index], partition + '-' + set_name) 304 | else: 305 | partitions = engroup(fragments, lambda f: f.partition) 306 | for partition, partition_fragments in partitions.items(): 307 | if CLI_ARGS.split: 308 | sample_size = get_sample_size(len(partition_fragments)) 309 | random.shuffle(partition_fragments) 310 | test_set = partition_fragments[:sample_size] 311 | partition_fragments = partition_fragments[sample_size:] 312 | dev_set = partition_fragments[:sample_size] 313 | train_set = partition_fragments[sample_size:] 314 | sample_sets = [train_set, dev_set, test_set] 315 | for set_index, set_name in enumerate(SET_NAMES): 316 | assign_fragments(sample_sets[set_index], partition + '-' + set_name) 317 | else: 318 | assign_fragments(partition_fragments, partition) 319 | return lists 320 | 321 | 322 | def check_overwrite(lists): 323 | if CLI_ARGS.target_dir is not None and not CLI_ARGS.force: 324 | for name in lists: 325 | suffixes = ['.meta'] + (['.sdb', '.sdb.tmp'] if CLI_ARGS.sdb else ['', '.csv']) 326 | for s in suffixes: 327 | p = path.join(CLI_ARGS.target_dir, name + s) 328 | if path.exists(p): 329 | fail('"{}" already existing - use --force to ignore'.format(p)) 330 | 331 | 332 | def parse_args(): 333 | parser = argparse.ArgumentParser(description='Export aligned speech samples.') 334 | 335 | parser.add_argument('--plan', type=str, 336 | help='Export plan (preparation-cache) to load and/or store') 337 | parser.add_argument('--audio', type=str, 338 | help='Take audio file as input (requires "--aligned ")') 339 | parser.add_argument('--aligned', type=str, 340 | help='Take alignment file ("<...>.aligned") as input (requires "--audio ")') 341 | parser.add_argument('--catalog', type=str, 342 | help='Take alignment and audio file references of provided catalog ("<...>.catalog") as input') 343 | parser.add_argument('--ignore-missing', action="store_true", 344 | help='Ignores catalog entries with missing files') 345 | 346 | parser.add_argument('--filter', type=str, default='False', 347 | help='Python expression that computes a boolean value from sample data fields. ' 348 | 'If the result is True, the sample will be dropped.') 349 | 350 | parser.add_argument('--criteria', type=str, default='100', 351 | help='Python expression that computes a number as quality indicator from sample data fields.') 352 | 353 | parser.add_argument('--debias', type=str, action='append', 354 | help='Sample meta field to group samples for debiasing (e.g. "speaker"). 
'
355 |                              'Group sizes will be capped according to --debias-sigma-factor')
356 |     parser.add_argument('--debias-sigma-factor', type=float, default=3.0,
357 |                         help='Standard deviation (sigma) factor that will determine '
358 |                              'the maximum number of samples per group (see --debias).')
359 |
360 |     parser.add_argument('--partition', type=str, action='append',
361 |                         help='Expression of the form "<number>:<partition>" where all samples with a quality indicator '
362 |                              '(--criteria) above or equal the given number and below the next bigger one are assigned '
363 |                              'to the specified partition. Samples below the lowest partition criteria are assigned to '
364 |                              'partition "other".')
365 |
366 |     parser.add_argument('--split', action="store_true",
367 |                         help='Split each partition except "other" into train/dev/test sub-sets.')
368 |     parser.add_argument('--split-field', type=str,
369 |                         help='Sample meta field that should be used for splitting (e.g. "speaker")')
370 |     parser.add_argument('--split-drop-multiple', action="store_true",
371 |                         help='Drop all samples with multiple --split-field assignments.')
372 |     parser.add_argument('--split-drop-unknown', action="store_true",
373 |                         help='Drop all samples with no --split-field assignment.')
374 |     for sub_set in SET_NAMES:
375 |         parser.add_argument('--assign-' + sub_set,
376 |                             help='Comma separated list of --split-field values that are to be assigned to sub-set '
377 |                                  '"{}"'.format(sub_set))
378 |     parser.add_argument('--split-seed', type=int,
379 |                         help='Random seed for set splitting')
380 |
381 |     parser.add_argument('--target-dir', type=str, required=False,
382 |                         help='Existing target directory for storing generated sets (files and directories)')
383 |     parser.add_argument('--target-tar', type=str, required=False,
384 |                         help='Target tar-file for storing generated sets (files and directories)')
385 |     parser.add_argument('--sdb', action="store_true",
386 |                         help='Writes Sample DBs instead of CSV and .wav files (requires --target-dir)')
387 |     parser.add_argument('--sdb-bucket-size', default='1GB',
388 |                         help='Memory bucket size for external sorting of SDBs')
389 |     parser.add_argument('--sdb-workers', type=int, default=None,
390 |                         help='Number of SDB encoding workers')
391 |     parser.add_argument('--sdb-buffered-samples', type=int, default=None,
392 |                         help='Number of samples per bucket buffer during finalization')
393 |     parser.add_argument('--sdb-audio-type', default='opus', choices=AUDIO_TYPE_LOOKUP.keys(),
394 |                         help='Audio representation inside target SDBs')
395 |     parser.add_argument('--tmp-dir', type=str, default=None,
396 |                         help='Directory for temporary files - defaults to system one')
397 |     parser.add_argument('--buffer', default='1MB',
398 |                         help='Buffer size for writing files (default: 1MB)')
399 |     parser.add_argument('--force', action="store_true",
400 |                         help='Overwrite existing files')
401 |     parser.add_argument('--skip-damaged', action="store_true",
402 |                         help='Skip damaged audio files and their samples instead of failing the export')
403 |     parser.add_argument('--no-meta', action="store_true",
404 |                         help='No writing of meta data files')
405 |     parser.add_argument('--rate', type=int, default=16000,
406 |                         help='Export wav-files with this sample rate')
407 |     parser.add_argument('--channels', type=int, default=1,
408 |                         help='Export wav-files with this number of channels')
409 |     parser.add_argument('--width', type=int, default=2,
410 |                         help='Export wav-files with this sample width (bytes)')
411 |
412 |     parser.add_argument('--workers', type=int, default=2,
413 |                         help='Number of workers for loading and re-sampling audio files. Default: 2')
414 |     parser.add_argument('--dry-run', action="store_true",
415 |                         help='Simulates export without writing or creating any file or directory')
416 |     parser.add_argument('--dry-run-fast', action="store_true",
417 |                         help='Simulates export without writing or creating any file or directory. '
418 |                              'In contrast to --dry-run this faster simulation will not load samples.')
419 |     parser.add_argument('--loglevel', type=int, default=20,
420 |                         help='Log level (between 0 and 50) - default: 20')
421 |     parser.add_argument('--no-progress', action="store_true",
422 |                         help='Prevents showing progress indication')
423 |     parser.add_argument('--progress-interval', type=float, default=1.0,
424 |                         help='Progress indication interval in seconds')
425 |
426 |     args = parser.parse_args()
427 |
428 |     args.buffer = parse_file_size(args.buffer)
429 |     args.sdb_bucket_size = parse_file_size(args.sdb_bucket_size)
430 |     args.dry_run = args.dry_run or args.dry_run_fast
431 |     partition_specs = []
432 |     if args.partition is not None:
433 |         for partition_expr in args.partition:
434 |             parts = partition_expr.split(':')
435 |             if len(parts) != 2:
436 |                 fail('Wrong partition specification: "{}"'.format(partition_expr))
437 |             partition_specs.append((float(parts[0]), str(parts[1])))
438 |     partition_specs.sort(key=lambda p: p[0], reverse=True)
439 |     args.partition = partition_specs
440 |     return args
441 |
442 |
443 | def load_sample(entry):
444 |     catalog_index, catalog_entry = entry
445 |     audio_path, aligned_path = catalog_entry
446 |     with open(aligned_path, 'r', encoding='utf-8') as aligned_file:
447 |         aligned = json.load(aligned_file)
448 |     tries = 2
449 |     while tries > 0:
450 |         wav_path, wav_is_temp = ensure_wav_with_format(audio_path, audio_format, tmp_dir=CLI_ARGS.tmp_dir)
451 |         # wav_path is None if the conversion failed; ensure_wav_with_format
452 |         # already removed its temporary file in that case
453 |         if wav_path is not None:
454 |             if verify_wav_file(wav_path):
455 |                 return catalog_index, wav_path, wav_is_temp, aligned
456 |             if wav_is_temp:
457 |                 os.remove(wav_path)
458 |         logging.warn('Problem converting "{}" into required format - retrying...'.format(audio_path))
459 |         time.sleep(10)
460 |         tries -= 1
461 |     return catalog_index, None, False, aligned
462 |
463 |
464 | def load_sample_dry(entry):
465 |     catalog_index, catalog_entry = entry
466 |     audio_path, aligned_path = catalog_entry
467 |     if path.isfile(audio_path):
468 |         logging.info('Would load file "{}"'.format(audio_path))
469 |     else:
470 |         fail('Audio file not found: "{}"'.format(audio_path))
471 |     if path.isfile(aligned_path):
472 |         logging.info('Would load file "{}"'.format(aligned_path))
473 |     else:
474 |         fail('Alignment file not found: "{}"'.format(aligned_path))
475 |     return catalog_index, '', False, []
476 |
477 |
478 | def load_samples(catalog_entries, fragments):
479 |     catalog_index_wise = engroup(fragments, lambda f: f.catalog_index)
480 |     pool = Pool(CLI_ARGS.workers)
481 |     ls = load_sample_dry if CLI_ARGS.dry_run_fast else load_sample
482 |     indexed_entries = map(lambda ci: (ci, catalog_entries[ci]), catalog_index_wise.keys())
483 |     for catalog_index, wav_path, wav_is_temp, aligned in pool.imap_unordered(ls, indexed_entries):
484 |         if wav_path is None:
485 |             src_audio_path = catalog_entries[catalog_index][0]
486 |             message = 'Unable to convert audio file "{}" to required format - skipping'.format(src_audio_path)
487 |             if CLI_ARGS.skip_damaged:
488 |                 logging.warn(message)
489 |                 continue
490 |             else:
491 |                 raise RuntimeError(message)
492 |         file_fragments = catalog_index_wise[catalog_index]
493 |         file_fragments.sort(key=lambda f: f.alignment_index)
494 |         if CLI_ARGS.dry_run_fast:
495 |             for fragment in file_fragments:
496 |                 yield b'', fragment, ''
497 |         else:
498 |             with wave.open(wav_path, 'rb') as source_wav_file:
499 |                 wav_duration = source_wav_file.getnframes() / source_wav_file.getframerate() * 1000
500 |                 for fragment in file_fragments:
501 |                     aligned_entry = aligned[fragment.alignment_index]
502 |                     try:
503 |                         start, end = aligned_entry['start'], aligned_entry['end']
504 |                         assert start < end <= wav_duration
505 |                         fragment_audio = extract_audio(source_wav_file, start / 1000.0, end / 1000.0)
506 |                     except Exception as ae:
507 |                         message = 'Problem extracting audio for alignment entry {} in catalog entry {}'\
508 |                             .format(fragment.alignment_index, fragment.catalog_index)
509 |                         if CLI_ARGS.skip_damaged:
510 |                             logging.warn(message)
511 |                             break
512 |                         else:
513 |                             raise RuntimeError(message) from ae
514 |                     yield fragment_audio, fragment, aligned_entry['aligned']
515 |         if wav_is_temp:
516 |             os.remove(wav_path)
517 |
518 |
519 | def write_meta(file, catalog_entries, id_plus_fragment_iter, total=None):
520 |     writer = csv.writer(file)
521 |     writer.writerow(['sample', 'split_entity', 'catalog_index', 'source_audio_file', 'aligned_file', 'alignment_index'])
522 |     has_split_entity = CLI_ARGS.split and CLI_ARGS.split_field is not None
523 |     for sample_id, fragment in progress(id_plus_fragment_iter, total=total):
524 |         split_entity = fragment.meta[CLI_ARGS.split_field] if has_split_entity else ''
525 |         source_audio_file, aligned_file = catalog_entries[fragment.catalog_index]
526 |         writer.writerow([sample_id,
527 |                          split_entity,
528 |                          fragment.catalog_index,
529 |                          source_audio_file,
530 |                          aligned_file,
531 |                          fragment.alignment_index])
532 |
533 |
534 | def write_csvs_and_samples(catalog_entries, lists, fragments):
535 |     created_directories = {}
536 |     tar = None
537 |     if CLI_ARGS.target_tar is not None:
538 |         if CLI_ARGS.dry_run:
539 |             logging.info('Would create tar-file "{}"'.format(CLI_ARGS.target_tar))
540 |         else:
541 |             base_tar = open(CLI_ARGS.target_tar, 'wb', buffering=CLI_ARGS.buffer)
542 |             tar = tarfile.open(fileobj=base_tar, mode='w')
543 |
544 |     class TargetFile:
545 |         def __init__(self, data_path, mode):
546 |             self.data_path = data_path
547 |             self.mode = mode
548 |             self.open_file = None
549 |
550 |         def __enter__(self):
551 |             parts = self.data_path.split('/')
552 |             dirs = ([CLI_ARGS.target_dir] if CLI_ARGS.target_dir is not None else []) + parts[:-1]
553 |             for i in range(1, len(dirs)):
554 |                 vp = '/'.join(dirs[:i + 1])
555 |                 if vp not in created_directories:
556 |                     if tar is None:
557 |                         dir_path = path.join(*dirs[:i + 1])
558 |                         if not path.isdir(dir_path):
559 |                             if CLI_ARGS.dry_run:
560 |                                 logging.info('Would create directory "{}"'.format(dir_path))
561 |                             else:
562 |                                 os.mkdir(dir_path)
563 |                     else:
564 |                         tdir = tarfile.TarInfo(vp)
565 |                         tdir.type = tarfile.DIRTYPE
566 |                         tar.addfile(tdir)
567 |                     created_directories[vp] = True
568 |             if CLI_ARGS.target_tar is None:
569 |                 file_path = path.join(CLI_ARGS.target_dir, *self.data_path.split('/'))
570 |                 if CLI_ARGS.dry_run:
571 |                     logging.info('Would write file "{}"'.format(file_path))
572 |                     self.open_file = io.BytesIO() if 'b' in self.mode else io.StringIO()
573 |                 else:
574 |                     self.open_file = open(file_path, self.mode)
575 |             else:
576 |                 self.open_file = io.BytesIO() if 'b' in self.mode else io.StringIO()
577 |             return self.open_file
578 |
579 |         def __exit__(self, *args):
580 |             if tar is not None:
581 |                 if
isinstance(self.open_file, io.StringIO): 582 | sfile = self.open_file 583 | sfile.seek(0) 584 | self.open_file = io.BytesIO(sfile.read().encode('utf8')) 585 | self.open_file.seek(0, 2) 586 | sfile.close() 587 | tfile = tarfile.TarInfo(self.data_path) 588 | tfile.size = self.open_file.tell() 589 | self.open_file.seek(0) 590 | tar.addfile(tfile, self.open_file) 591 | tar.members = [] 592 | if self.open_file is not None: 593 | self.open_file.close() 594 | 595 | group_lists = {} 596 | for list_name in lists: 597 | group_lists[list_name] = [] 598 | 599 | for pcm_data, fragment, transcript in progress(load_samples(catalog_entries, fragments), 600 | desc='Exporting samples', total=len(fragments)): 601 | group_list = group_lists[fragment.list_name] 602 | sample_path = '{}/sample-{:010d}.wav'.format(fragment.list_name, len(group_list)) 603 | with TargetFile(sample_path, "wb") as base_wav_file: 604 | with wave.open(base_wav_file, 'wb') as wav_file: 605 | write_audio_format_to_wav_file(wav_file) 606 | wav_file.writeframes(pcm_data) 607 | file_size = base_wav_file.tell() 608 | group_list.append((sample_path, file_size, fragment, transcript)) 609 | 610 | for list_name, group_list in group_lists.items(): 611 | csv_filename = list_name + '.csv' 612 | logging.info('Writing "{}"'.format(csv_filename)) 613 | with TargetFile(csv_filename, 'w') as csv_file: 614 | writer = csv.writer(csv_file) 615 | writer.writerow(['wav_filename', 'wav_filesize', 'transcript']) 616 | for rel_path, file_size, fragment, transcript in progress(group_list): 617 | writer.writerow([rel_path, file_size, transcript]) 618 | if not CLI_ARGS.no_meta: 619 | meta_filename = list_name + '.meta' 620 | logging.info('Writing "{}"'.format(meta_filename)) 621 | with TargetFile(meta_filename, 'w') as meta_file: 622 | path_fragment_list = map(lambda gi: (gi[0], gi[2]), group_list) 623 | write_meta(meta_file, catalog_entries, path_fragment_list, total=len(group_list)) 624 | 625 | if tar is not None: 626 | tar.close() 627 | 628 | 629 | def write_sdbs(catalog_entries, lists, fragments): 630 | audio_type = AUDIO_TYPE_LOOKUP[CLI_ARGS.sdb_audio_type] 631 | sdbs = {} 632 | for list_name in lists: 633 | sdb_path = os.path.join(CLI_ARGS.target_dir, list_name + '.sdb') 634 | if CLI_ARGS.dry_run: 635 | logging.info('Would create SDB "{}"'.format(sdb_path)) 636 | else: 637 | logging.info('Creating SDB "{}"'.format(sdb_path)) 638 | sdbs[list_name] = SortingSDBWriter(sdb_path, 639 | audio_type=audio_type, 640 | buffering=CLI_ARGS.buffer, 641 | cache_size=CLI_ARGS.sdb_bucket_size, 642 | buffered_samples=CLI_ARGS.sdb_buffered_samples) 643 | 644 | def to_samples(): 645 | for pcm_data, fragment, transcript in load_samples(catalog_entries, fragments): 646 | cs = LabeledSample(AUDIO_TYPE_PCM, pcm_data, transcript, audio_format=audio_format) 647 | cs.meta = fragment 648 | yield cs 649 | 650 | samples = change_audio_types(to_samples(), 651 | audio_type=audio_type, 652 | processes=CLI_ARGS.sdb_workers) if not CLI_ARGS.dry_run_fast else to_samples() 653 | set_counter = Counter() 654 | for sample in progress(samples, desc='Exporting samples', total=len(fragments)): 655 | list_name = sample.meta.list_name 656 | if not CLI_ARGS.dry_run: 657 | set_counter[list_name] += 1 658 | sdb = sdbs[list_name] 659 | sdb.add(sample) 660 | for list_name, sdb in sdbs.items(): 661 | meta_path = os.path.join(CLI_ARGS.target_dir, list_name + '.meta') 662 | if CLI_ARGS.dry_run: 663 | if not CLI_ARGS.no_meta: 664 | logging.info('Would write meta file "{}"'.format(meta_path)) 665 | else: 666 | 
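# The per-set CSV files written above follow the three-column DeepSpeech
# import format; a hypothetical row:
#
#   wav_filename,wav_filesize,transcript
#   train/sample-0000000042.wav,63404,hello world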
sdb_path = os.path.join(CLI_ARGS.target_dir, list_name + '.sdb') 667 | for _ in progress(sdb.finalize(), desc='Finalizing "{}"'.format(sdb_path), total=set_counter[list_name]): 668 | pass 669 | if not CLI_ARGS.no_meta: 670 | logging.info('Writing "{}"'.format(meta_path)) 671 | with open(meta_path, 'w') as meta_file: 672 | write_meta(meta_file, catalog_entries, enumerate(sdb.meta_list), total=len(sdb.meta_list)) 673 | 674 | 675 | def load_plan(): 676 | if CLI_ARGS.plan is not None and os.path.isfile(CLI_ARGS.plan): 677 | try: 678 | logging.info('Loading export-plan from "{}"'.format(CLI_ARGS.plan)) 679 | with open(CLI_ARGS.plan, 'rb') as plan_file: 680 | catalog_entries, lists, fragments = pickle.load(plan_file) 681 | return True, catalog_entries, lists, fragments 682 | except pickle.PickleError: 683 | logging.warn('Unable to load export-plan "{}" - rebuilding'.format(CLI_ARGS.plan)) 684 | os.remove(CLI_ARGS.plan) 685 | return False, None, None, None 686 | 687 | 688 | def save_plan(catalog_entries, lists, fragments): 689 | if CLI_ARGS.plan is not None: 690 | logging.info('Saving export-plan to "{}"'.format(CLI_ARGS.plan)) 691 | with open(CLI_ARGS.plan, 'wb') as plan_file: 692 | pickle.dump((catalog_entries, lists, fragments), plan_file) 693 | 694 | 695 | def main(): 696 | check_targets() 697 | has_plan, catalog_entries, lists, fragments = load_plan() 698 | if not has_plan: 699 | set_assignments = parse_set_assignments() 700 | catalog_entries = load_catalog() 701 | fragments = load_fragments(catalog_entries) 702 | fragments = debias(fragments) 703 | lists = split(fragments, set_assignments) 704 | save_plan(catalog_entries, lists, fragments) 705 | check_overwrite(lists) 706 | if CLI_ARGS.sdb: 707 | write_sdbs(catalog_entries, lists, fragments) 708 | else: 709 | write_csvs_and_samples(catalog_entries, lists, fragments) 710 | 711 | 712 | if __name__ == '__main__': 713 | CLI_ARGS = parse_args() 714 | audio_format = (CLI_ARGS.rate, CLI_ARGS.channels, CLI_ARGS.width) 715 | logging.basicConfig(stream=sys.stderr, level=CLI_ARGS.loglevel) 716 | logging.getLogger('sox').setLevel(logging.ERROR) 717 | main() 718 | -------------------------------------------------------------------------------- /align/generate_lm.py: -------------------------------------------------------------------------------- 1 | import gzip 2 | import io 3 | import os 4 | import subprocess 5 | from collections import Counter 6 | 7 | def convert_and_filter_topk(output_dir, input_txt, top_k): 8 | """ Convert to lowercase, count word occurrences and save top-k words to a file """ 9 | 10 | counter = Counter() 11 | data_lower = output_dir + "." 
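# Note on the plan functions above: the --plan file is a plain pickle of
# (catalog_entries, lists, fragments), so a re-run with the same plan skips
# catalog loading, filtering, de-biasing and splitting and goes straight to
# writing output.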
+ "lower.txt.gz" 12 | 13 | print("\nConverting to lowercase and counting word occurrences ...") 14 | with io.TextIOWrapper( 15 | io.BufferedWriter(gzip.open(data_lower, "w+")), encoding="utf-8" 16 | ) as file_out: 17 | 18 | # Open the input file either from input.txt or input.txt.gz 19 | _, file_extension = os.path.splitext(input_txt) 20 | if file_extension == ".gz": 21 | file_in = io.TextIOWrapper( 22 | io.BufferedReader(gzip.open(input_txt)), encoding="utf-8" 23 | ) 24 | else: 25 | file_in = open(input_txt, encoding="utf-8") 26 | 27 | for line in file_in: 28 | line_lower = line.lower() 29 | counter.update(line_lower.split()) 30 | file_out.write(line_lower) 31 | 32 | file_in.close() 33 | 34 | # Save top-k words 35 | print("\nSaving top {} words ...".format(top_k)) 36 | top_counter = counter.most_common(top_k) 37 | vocab_str = "\n".join(word for word, count in top_counter) 38 | vocab_path = "vocab-{}.txt".format(top_k) 39 | vocab_path = output_dir + "." + vocab_path 40 | with open(vocab_path, "w+", encoding="utf-8") as file: 41 | file.write(vocab_str) 42 | 43 | print("\nCalculating word statistics ...") 44 | total_words = sum(counter.values()) 45 | print(" Your text file has {} words in total".format(total_words)) 46 | print(" It has {} unique words".format(len(counter))) 47 | top_words_sum = sum(count for word, count in top_counter) 48 | word_fraction = (top_words_sum / total_words) * 100 49 | print( 50 | " Your top-{} words are {:.4f} percent of all words".format( 51 | top_k, word_fraction 52 | ) 53 | ) 54 | print(' Your most common word "{}" occurred {} times'.format(*top_counter[0])) 55 | last_word, last_count = top_counter[-1] 56 | print( 57 | ' The least common word in your top-k is "{}" with {} times'.format( 58 | last_word, last_count 59 | ) 60 | ) 61 | for i, (w, c) in enumerate(reversed(top_counter)): 62 | if c > last_count: 63 | print( 64 | ' The first word with {} occurrences is "{}" at place {}'.format( 65 | c, w, len(top_counter) - 1 - i 66 | ) 67 | ) 68 | break 69 | 70 | return data_lower, vocab_str 71 | 72 | 73 | def build_lm(output_dir, kenlm_bins, arpa_order, max_arpa_memory, arpa_prune, discount_fallback, binary_a_bits, binary_q_bits, binary_type, data_lower, vocab_str): 74 | print("\nCreating ARPA file ...") 75 | lm_path = output_dir + "." + "lm.arpa" 76 | subargs = [ 77 | os.path.join(kenlm_bins, "lmplz"), 78 | "--order", 79 | str(arpa_order), 80 | "--temp_prefix", 81 | output_dir, 82 | "--memory", 83 | max_arpa_memory, 84 | "--text", 85 | data_lower, 86 | "--arpa", 87 | lm_path, 88 | "--prune", 89 | *arpa_prune.split("|"), 90 | ] 91 | if discount_fallback: 92 | subargs += ["--discount_fallback"] 93 | subprocess.check_call(subargs) 94 | 95 | # Filter LM using vocabulary of top-k words 96 | print("\nFiltering ARPA file using vocabulary of top-k words ...") 97 | filtered_path = output_dir + "." + "lm_filtered.arpa" 98 | subprocess.run( 99 | [ 100 | os.path.join(kenlm_bins, "filter"), 101 | "single", 102 | "model:{}".format(lm_path), 103 | filtered_path, 104 | ], 105 | input=vocab_str.encode("utf-8"), 106 | check=True, 107 | ) 108 | 109 | # Quantize and produce trie binary. 110 | print("\nBuilding lm.binary ...") 111 | binary_path = output_dir + "." 
+ "lm.binary" 112 | subprocess.check_call( 113 | [ 114 | os.path.join(kenlm_bins, "build_binary"), 115 | "-s", 116 | "-a", 117 | str(binary_a_bits), 118 | "-q", 119 | str(binary_q_bits), 120 | "-v", 121 | binary_type, 122 | filtered_path, 123 | binary_path, 124 | ] 125 | ) -------------------------------------------------------------------------------- /align/generate_package.py: -------------------------------------------------------------------------------- 1 | import shutil 2 | import struct 3 | from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet 4 | 5 | 6 | class Alphabet(object): 7 | def __init__(self, config_file): 8 | self._config_file = config_file 9 | self._label_to_str = {} 10 | self._str_to_label = {} 11 | self._size = 0 12 | if config_file: 13 | with open(config_file, 'r', encoding='utf-8') as fin: 14 | for line in fin: 15 | if line[0:2] == '\\#': 16 | line = '#\n' 17 | elif line[0] == '#': 18 | continue 19 | self._label_to_str[self._size] = line[:-1] # remove the line ending 20 | self._str_to_label[line[:-1]] = self._size 21 | self._size += 1 22 | 23 | def serialize(self): 24 | # Serialization format is a sequence of (key, value) pairs, where key is 25 | # a uint16_t and value is a uint16_t length followed by `length` UTF-8 26 | # encoded bytes with the label. 27 | res = bytearray() 28 | 29 | # We start by writing the number of pairs in the buffer as uint16_t. 30 | res += struct.pack(' self.cache_size: 173 | self.finish_bucket() 174 | sample.change_audio_type(self.audio_type) 175 | sample.sample_id = '#unsorted:{}'.format(len(self.bucket)) 176 | self.meta_dict[sample.sample_id] = sample.meta 177 | self.bucket.append(sample) 178 | self.bucket_size += len(sample.audio.getbuffer()) 179 | return sample.sample_id 180 | 181 | def finalize(self): 182 | if self.tmp_sdb is None: 183 | return 184 | self.finish_bucket() 185 | num_samples = len(self.tmp_sdb) 186 | self.tmp_sdb.close() 187 | self.tmp_sdb = None 188 | if self.buffered_samples is None: 189 | avg_sample_size = self.overall_size / max(1, num_samples) 190 | max_cached_samples = self.cache_size / max(1, avg_sample_size) 191 | buffer_size = max(1, int(max_cached_samples / max(1, len(self.buckets)))) 192 | else: 193 | buffer_size = self.buffered_samples 194 | sdb_reader = SDB(self.tmp_sdb_filename, buffering=self.buffering, id_prefix='#pre-sorted') 195 | 196 | def buffered_view(bucket): 197 | start, end = bucket 198 | buffer = [] 199 | current_offset = start 200 | while current_offset < end: 201 | while len(buffer) < buffer_size and current_offset < end: 202 | buffer.insert(0, sdb_reader[current_offset]) 203 | current_offset += 1 204 | while len(buffer) > 0: 205 | yield buffer.pop(-1) 206 | 207 | bucket_views = list(map(buffered_view, self.buckets)) 208 | interleaved = heapq.merge(*bucket_views, key=lambda s: s.duration) 209 | with DirectSDBWriter(self.sdb_filename, 210 | buffering=self.buffering, 211 | audio_type=self.audio_type, 212 | id_prefix=self.id_prefix) as sdb_writer: 213 | for index, sample in enumerate(interleaved): 214 | old_id = sample.sample_id 215 | sdb_writer.add(sample) 216 | self.meta_list.append(self.meta_dict[old_id]) 217 | del self.meta_dict[old_id] 218 | yield index / num_samples 219 | sdb_reader.close() 220 | os.unlink(self.tmp_sdb_filename) 221 | 222 | def close(self): 223 | for _ in self.finalize(): 224 | pass 225 | 226 | def __exit__(self, exc_type, exc_val, exc_tb): 227 | self.close() 228 | 229 | 230 | class SDB: # pylint: disable=too-many-instance-attributes 231 | """Sample collection reader 
for reading a Sample DB (SDB) file""" 232 | def __init__(self, sdb_filename, buffering=BUFFER_SIZE, id_prefix=None): 233 | self.sdb_filename = sdb_filename 234 | self.id_prefix = sdb_filename if id_prefix is None else id_prefix 235 | self.sdb_file = open(sdb_filename, 'rb', buffering=buffering) 236 | self.offsets = [] 237 | if self.sdb_file.read(len(MAGIC)) != MAGIC: 238 | raise RuntimeError('No Sample Database') 239 | meta_chunk_len = self.read_big_int() 240 | self.meta = json.loads(self.sdb_file.read(meta_chunk_len).decode()) 241 | if SCHEMA_KEY not in self.meta: 242 | raise RuntimeError('Missing schema') 243 | self.schema = self.meta[SCHEMA_KEY] 244 | 245 | speech_columns = self.find_columns(content=CONTENT_TYPE_SPEECH, mime_type=SERIALIZABLE_AUDIO_TYPES) 246 | if not speech_columns: 247 | raise RuntimeError('No speech data (missing in schema)') 248 | self.speech_index = speech_columns[0] 249 | self.audio_type = self.schema[self.speech_index][MIME_TYPE_KEY] 250 | 251 | transcript_columns = self.find_columns(content=CONTENT_TYPE_TRANSCRIPT, mime_type=MIME_TYPE_TEXT) 252 | if not transcript_columns: 253 | raise RuntimeError('No transcript data (missing in schema)') 254 | self.transcript_index = transcript_columns[0] 255 | 256 | sample_chunk_len = self.read_big_int() 257 | self.sdb_file.seek(sample_chunk_len + BIGINT_SIZE, 1) 258 | num_samples = self.read_big_int() 259 | for _ in range(num_samples): 260 | self.offsets.append(self.read_big_int()) 261 | 262 | def read_int(self): 263 | return int.from_bytes(self.sdb_file.read(INT_SIZE), BIG_ENDIAN) 264 | 265 | def read_big_int(self): 266 | return int.from_bytes(self.sdb_file.read(BIGINT_SIZE), BIG_ENDIAN) 267 | 268 | def find_columns(self, content=None, mime_type=None): 269 | criteria = [] 270 | if content is not None: 271 | criteria.append((CONTENT_KEY, content)) 272 | if mime_type is not None: 273 | criteria.append((MIME_TYPE_KEY, mime_type)) 274 | if len(criteria) == 0: 275 | raise ValueError('At least one of "content" or "mime-type" has to be provided') 276 | matches = [] 277 | for index, column in enumerate(self.schema): 278 | matched = 0 279 | for field, value in criteria: 280 | if column[field] == value or (isinstance(value, list) and column[field] in value): 281 | matched += 1 282 | if matched == len(criteria): 283 | matches.append(index) 284 | return matches 285 | 286 | def read_row(self, row_index, *columns): 287 | columns = list(columns) 288 | column_data = [None] * len(columns) 289 | found = 0 290 | if not 0 <= row_index < len(self.offsets): 291 | raise ValueError('Wrong sample index: {} - has to be between 0 and {}' 292 | .format(row_index, len(self.offsets) - 1)) 293 | self.sdb_file.seek(self.offsets[row_index] + INT_SIZE) 294 | for index in range(len(self.schema)): 295 | chunk_len = self.read_int() 296 | if index in columns: 297 | column_data[columns.index(index)] = self.sdb_file.read(chunk_len) 298 | found += 1 299 | if found == len(columns): 300 | return tuple(column_data) 301 | else: 302 | self.sdb_file.seek(chunk_len, 1) 303 | return tuple(column_data) 304 | 305 | def __getitem__(self, i): 306 | audio_data, transcript = self.read_row(i, self.speech_index, self.transcript_index) 307 | transcript = transcript.decode() 308 | sample_id = '{}:{}'.format(self.id_prefix, i) 309 | return LabeledSample(self.audio_type, audio_data, transcript, sample_id=sample_id) 310 | 311 | def __iter__(self): 312 | for i in range(len(self.offsets)): 313 | yield self[i] 314 | 315 | def __len__(self): 316 | return len(self.offsets) 317 | 318 | def 
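# Reading sketch (hypothetical path): SDB objects support len(), indexing and
# iteration; sample ids default to '<filename>:<index>':
sdb = SDB('data/export/train.sdb')
print(len(sdb), 'samples, audio type', sdb.audio_type)
sample = sdb[0]            # a LabeledSample
print(sample.duration, sample.transcript)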
close(self): 319 | if self.sdb_file is not None: 320 | self.sdb_file.close() 321 | 322 | def __del__(self): 323 | self.close() 324 | 325 | 326 | class CSV: 327 | """Sample collection reader for reading a DeepSpeech CSV file""" 328 | def __init__(self, csv_filename): 329 | self.csv_filename = csv_filename 330 | self.rows = [] 331 | csv_dir = Path(csv_filename).parent 332 | with open(csv_filename, 'r') as csv_file: 333 | reader = csv.DictReader(csv_file) 334 | for row in reader: 335 | wav_filename = Path(row['wav_filename']) 336 | if not wav_filename.is_absolute(): 337 | wav_filename = csv_dir / wav_filename 338 | self.rows.append((str(wav_filename), int(row['wav_filesize']), row['transcript'])) 339 | self.rows.sort(key=lambda r: r[1]) 340 | 341 | def __getitem__(self, i): 342 | wav_filename, _, transcript = self.rows[i] 343 | with open(wav_filename, 'rb') as wav_file: 344 | return LabeledSample(AUDIO_TYPE_WAV, wav_file.read(), transcript, sample_id=wav_filename) 345 | 346 | def __iter__(self): 347 | for i in range(len(self.rows)): 348 | yield self[i] 349 | 350 | def __len__(self): 351 | return len(self.rows) 352 | 353 | 354 | def samples_from_file(filename, buffering=BUFFER_SIZE): 355 | """Retrieves the right sample collection reader from a filename""" 356 | ext = os.path.splitext(filename)[1].lower() 357 | if ext == '.sdb': 358 | return SDB(filename, buffering=buffering) 359 | if ext == '.csv': 360 | return CSV(filename) 361 | raise ValueError('Unknown file type: "{}"'.format(ext)) 362 | 363 | 364 | def samples_from_files(filenames, buffering=BUFFER_SIZE): 365 | """Retrieves a (potentially interleaving) sample collection reader from a list of filenames""" 366 | if len(filenames) == 0: 367 | raise ValueError('No files') 368 | if len(filenames) == 1: 369 | return samples_from_file(filenames[0], buffering=buffering) 370 | cols = list(map(partial(samples_from_file, buffering=buffering), filenames)) 371 | return Interleaved(*cols, key=lambda s: s.duration) 372 | -------------------------------------------------------------------------------- /align/sdb_tool.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ''' 3 | Builds Sample Databases (.sdb files) 4 | Use "python3 sdb_tool.py -h" for help 5 | ''' 6 | from __future__ import absolute_import, division, print_function 7 | 8 | # Make sure we can import stuff from util/ 9 | # This script needs to be run from the root of the DeepSpeech repository 10 | import os 11 | import sys 12 | sys.path.insert(1, os.path.join(sys.path[0], '..')) 13 | 14 | import argparse 15 | 16 | from utils import parse_file_size, log_progress 17 | from audio import change_audio_types, AUDIO_TYPE_WAV, AUDIO_TYPE_OPUS 18 | from sample_collections import samples_from_files, DirectSDBWriter, SortingSDBWriter 19 | 20 | AUDIO_TYPE_LOOKUP = { 21 | 'wav': AUDIO_TYPE_WAV, 22 | 'opus': AUDIO_TYPE_OPUS 23 | } 24 | 25 | 26 | def progress(it=None, desc='Processing', total=None): 27 | print(desc, file=sys.stderr, flush=True) 28 | return it if CLI_ARGS.no_progress else log_progress(it, interval=CLI_ARGS.progress_interval, total=total) 29 | 30 | 31 | def add_samples(sdb_writer): 32 | samples = samples_from_files(CLI_ARGS.sources) 33 | for sample in progress(change_audio_types(samples, audio_type=sdb_writer.audio_type, processes=CLI_ARGS.workers), 34 | total=len(samples), 35 | desc='Writing "{}"...'.format(CLI_ARGS.target)): 36 | sdb_writer.add(sample) 37 | 38 | 39 | def build_sdb(): 40 | audio_type = 
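# Mixing sketch (hypothetical paths): samples_from_files() merges several
# collections into one duration-ordered stream via Interleaved, keyed on
# sample duration and assuming each source is itself duration-ordered:
for sample in samples_from_files(['dev.sdb', 'extra.csv']):
    print(sample.sample_id, sample.duration)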
AUDIO_TYPE_LOOKUP[CLI_ARGS.audio_type] 41 | if CLI_ARGS.sort: 42 | with SortingSDBWriter(CLI_ARGS.target, 43 | tmp_sdb_filename=CLI_ARGS.sort_tmp_file, 44 | cache_size=parse_file_size(CLI_ARGS.sort_cache_size), 45 | audio_type=audio_type) as sdb_writer: 46 | add_samples(sdb_writer) 47 | else: 48 | with DirectSDBWriter(CLI_ARGS.target, audio_type=audio_type) as sdb_writer: 49 | add_samples(sdb_writer) 50 | 51 | 52 | def handle_args(): 53 | parser = argparse.ArgumentParser(description='Tool for building Sample Databases (SDB files) ' 54 | 'from DeepSpeech CSV files and other SDB files') 55 | parser.add_argument('--workers', type=int, default=None, help='Number of encoding SDB workers') 56 | parser.add_argument('--audio-type', default='opus', choices=AUDIO_TYPE_LOOKUP.keys(), 57 | help='Audio representation inside target SDB') 58 | parser.add_argument('--sort', action='store_true', help='Force sample sorting by durations ' 59 | '(assumes SDB sources unsorted)') 60 | parser.add_argument('--sort-tmp-file', default=None, help='Overrides default tmp_file (target + ".tmp") ' 61 | 'for sorting through --sort option') 62 | parser.add_argument('--sort-cache-size', default='1GB', help='Cache (bucket) size for binary audio data ' 63 | 'for sorting through --sort option') 64 | parser.add_argument('--no-progress', action="store_true", help='Prevents showing progress indication') 65 | parser.add_argument('--progress-interval', type=float, default=1.0, help='Progress indication interval in seconds') 66 | parser.add_argument('sources', nargs='+', help='Source CSV and/or SDB files') 67 | parser.add_argument('target', help='SDB file to create') 68 | return parser.parse_args() 69 | 70 | 71 | if __name__ == "__main__": 72 | CLI_ARGS = handle_args() 73 | build_sdb() 74 | -------------------------------------------------------------------------------- /align/search.py: -------------------------------------------------------------------------------- 1 | from collections import Counter 2 | from text import ngrams, similarity 3 | 4 | 5 | class FuzzySearch(object): 6 | def __init__(self, 7 | text, 8 | max_candidates=10, 9 | candidate_threshold=0.92, 10 | match_score=100, 11 | mismatch_score=-100, 12 | gap_score=-100, 13 | char_similarities=None): 14 | self.text = text 15 | self.max_candidates = max_candidates 16 | self.candidate_threshold = candidate_threshold 17 | self.match_score = match_score 18 | self.mismatch_score = mismatch_score 19 | self.gap_score = gap_score 20 | self.char_similarities = char_similarities 21 | self.ngrams = {} 22 | for i, ngram in enumerate(ngrams(' ' + text + ' ', 3)): 23 | if ngram in self.ngrams: 24 | ngram_bucket = self.ngrams[ngram] 25 | else: 26 | ngram_bucket = self.ngrams[ngram] = [] 27 | ngram_bucket.append(i) 28 | 29 | @staticmethod 30 | def char_pair(a, b): 31 | if a > b: 32 | a, b = b, a 33 | return '' + a + b 34 | 35 | def char_similarity(self, a, b): 36 | key = FuzzySearch.char_pair(a, b) 37 | if self.char_similarities and key in self.char_similarities: 38 | return self.char_similarities[key] 39 | return self.match_score if a == b else self.mismatch_score 40 | 41 | def sw_align(self, a, start, end): 42 | b = self.text[start:end] 43 | n, m = len(a), len(b) 44 | # building scoring matrix 45 | f = [[0]] * (n + 1) 46 | for i in range(0, n + 1): 47 | f[i] = [0] * (m + 1) 48 | for i in range(1, n + 1): 49 | f[i][0] = self.gap_score * i 50 | for j in range(1, m + 1): 51 | f[0][j] = self.gap_score * j 52 | max_score = 0 53 | start_i, start_j = 0, 0 54 | for i in range(1, n + 1): 55 | 
for j in range(1, m + 1): 56 | match = f[i - 1][j - 1] + self.char_similarity(a[i - 1], b[j - 1]) 57 | insert = f[i][j - 1] + self.gap_score 58 | delete = f[i - 1][j] + self.gap_score 59 | score = max(0, match, insert, delete) 60 | f[i][j] = score 61 | if score > max_score: 62 | max_score = score 63 | start_i, start_j = i, j 64 | # backtracking 65 | substitutions = Counter() 66 | i, j = start_i, start_j 67 | while (j > 0 or i > 0) and f[i][j] != 0: 68 | if i > 0 and j > 0 and f[i][j] == (f[i - 1][j - 1] + self.char_similarity(a[i - 1], b[j - 1])): 69 | substitutions[FuzzySearch.char_pair(a[i - 1], b[j - 1])] += 1 70 | i, j = i - 1, j - 1 71 | elif i > 0 and f[i][j] == (f[i - 1][j] + self.gap_score): 72 | i -= 1 73 | elif j > 0 and f[i][j] == (f[i][j - 1] + self.gap_score): 74 | j -= 1 75 | else: 76 | raise Exception('Smith–Waterman failure') 77 | align_start = max(start, start + j - 1) 78 | align_end = min(end, start + start_j) 79 | score = f[start_i][start_j] / (self.match_score * max(align_end - align_start, n)) 80 | return align_start, align_end, score, substitutions 81 | 82 | def find_best(self, look_for, start=0, end=-1): 83 | end = len(self.text) if end < 0 else end 84 | if end - start < 2 * len(look_for): 85 | return self.sw_align(look_for, start, end) 86 | window_size = len(look_for) 87 | windows = {} 88 | for i, ngram in enumerate(ngrams(' ' + look_for + ' ', 3)): 89 | if ngram in self.ngrams: 90 | ngram_bucket = self.ngrams[ngram] 91 | for occurrence in ngram_bucket: 92 | if occurrence < start or occurrence > end: 93 | continue 94 | window = occurrence // window_size 95 | windows[window] = (windows[window] + 1) if window in windows else 1 96 | candidate_windows = sorted(windows.keys(), key=lambda w: windows[w], reverse=True) 97 | best = (-1, -1, 0, None) 98 | last_window_grams = 0.1 99 | for window in candidate_windows[:self.max_candidates]: 100 | ngram_factor = (windows[window] / last_window_grams) 101 | if ngram_factor < self.candidate_threshold: 102 | break 103 | last_window_grams = windows[window] 104 | interval_start = max(start, int((window - 1) * window_size)) 105 | interval_end = min(end, int((window + 2) * window_size)) 106 | search_result = self.sw_align(look_for, interval_start, interval_end) 107 | if search_result[2] > best[2]: 108 | best = search_result 109 | return best 110 | -------------------------------------------------------------------------------- /align/stats.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import argparse 5 | from os import path 6 | from pickle import load, dump 7 | from collections import Counter 8 | from datetime import timedelta 9 | from utils import log_progress 10 | 11 | 12 | def fail(message, code=1): 13 | print(message) 14 | exit(code) 15 | 16 | 17 | class AlignmentStatistics: 18 | def __init__(self): 19 | self.top = 100 20 | self.stat_ids = ['wng', 'sws', 'wer', 'cer', 'jaro_winkler', 'editex', 'levenshtein', 'mra', 'hamming'] 21 | self.stats = {} 22 | self.stats_duration = {} 23 | for stat_id in self.stat_ids: 24 | self.stats[stat_id] = Counter() 25 | self.stats_duration[stat_id] = Counter() 26 | 27 | self.total_files = 0 28 | self.total_utterances = 0 29 | self.total_duration = 0 30 | self.total_length = 0 31 | 32 | self.durations = Counter() 33 | self.lengths = Counter() 34 | 35 | self.meta_counters = {} 36 | 37 | @staticmethod 38 | def progress(lst, desc='Processing', total=None): 39 | return lst 40 | 41 | def load_aligned(self, aligned_path): 
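        # Parse one "<name>.aligned" JSON file and accumulate duration, length,
        # per-metric and meta statistics for each utterance it contains.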
42 | self.total_files += 1 43 | with open(aligned_path, 'r') as aligned_file: 44 | utterances = json.loads(aligned_file.read()) 45 | for utterance in utterances: 46 | self.total_utterances += 1 47 | duration = utterance['end'] - utterance['start'] 48 | self.durations[int(duration / 1000)] += 1 49 | self.total_duration += duration 50 | length = utterance['text-end'] - utterance['text-start'] 51 | self.lengths[length] += 1 52 | self.total_length += length 53 | for stat_id in self.stat_ids: 54 | if stat_id in utterance: 55 | self.stats[stat_id][int(utterance[stat_id])] += 1 56 | self.stats_duration[stat_id][int(utterance[stat_id])] += duration 57 | if 'meta' in utterance: 58 | for meta_type, instances in utterance['meta'].items(): 59 | if meta_type not in self.meta_counters: 60 | self.meta_counters[meta_type] = Counter() 61 | for instance in instances: 62 | self.meta_counters[meta_type][instance] += 1 63 | 64 | def load_catalog(self, catalog_path, ignore_missing=True): 65 | catalog = path.abspath(catalog_path) 66 | catalog_dir = path.dirname(catalog) 67 | with open(catalog, 'r') as catalog_file: 68 | catalog_entries = json.load(catalog_file) 69 | for entry in AlignmentStatistics.progress(catalog_entries, desc='Reading catalog'): 70 | aligned_path = entry['aligned'] 71 | if not path.isabs(aligned_path): 72 | aligned_path = path.join(catalog_dir, aligned_path) 73 | if path.isfile(aligned_path): 74 | self.load_aligned(aligned_path) 75 | else: 76 | if ignore_missing: 77 | continue 78 | else: 79 | fail('Problem loading catalog "{}": Missing referenced alignment file "{}"' 80 | .format(catalog_path, aligned_path)) 81 | 82 | def print_stats(self): 83 | print('Total number of files: {:,}'.format(self.total_files)) 84 | print('') 85 | print('Total number of utterances: {:,}'.format(self.total_utterances)) 86 | print('') 87 | print('Total aligned utterance character length: {:,}'.format(self.total_length)) 88 | print('') 89 | print('Total utterance duration: {} ({:,} hours)'.format( 90 | timedelta(milliseconds=self.total_duration), 91 | int(self.total_duration / (1000 * 60 * 60)))) 92 | print('') 93 | 94 | for meta_type, counter in self.meta_counters.items(): 95 | print('Overall number of instances of meta type "{}": {:,}'.format(meta_type, len(counter.keys()))) 96 | print('') 97 | print('{} most frequent "{}" instances:'.format(self.top, meta_type)) 98 | for value, count in counter.most_common(self.top): 99 | print(value.ljust(20) + '{:,}'.format(count).rjust(12)) 100 | 101 | for stat_id in self.stat_ids: 102 | counter = self.stats_duration[stat_id] 103 | if len(counter) == 0: 104 | continue 105 | print('') 106 | print(stat_id.upper() + ' (hours):') 107 | above = 0 108 | for value in sorted(counter): 109 | count = counter[value] / (60 * 60 * 1000) 110 | if value <= 100: 111 | print(str(value).ljust(10) + '{:12.2f}'.format(count).rjust(12)) 112 | else: 113 | above += count 114 | if above > 0: 115 | print('100+'.ljust(10) + '{:12.2f}'.format(above).rjust(12)) 116 | 117 | 118 | def main(args): 119 | parser = argparse.ArgumentParser(description='Print statistics about aligned speech samples.') 120 | 121 | parser.add_argument('--cache', type=str, 122 | help='Use provided file as statistics cache (if existing, all other input options are ignored)') 123 | parser.add_argument('--aligned', type=str, action='append', 124 | help='Read alignment file ("<...>.aligned") as input') 125 | parser.add_argument('--catalog', type=str, action='append', 126 | help='Read alignment references of provided catalog ("<...>.catalog") as input') 127 | 
parser.add_argument('--no-progress', action='store_true', 128 | help='Prevents showing progress bars') 129 | parser.add_argument('--progress-interval', type=float, default=1.0, 130 | help='Progress indication interval in seconds') 131 | 132 | args = parser.parse_args() 133 | 134 | def progress(it=None, desc='Processing', total=None): 135 | print(desc) 136 | return it if args.no_progress else log_progress(it, interval=args.progress_interval, total=total) 137 | AlignmentStatistics.progress = progress 138 | 139 | if args.cache is not None and path.exists(args.cache): 140 | with open(args.cache, 'rb') as stats_file: 141 | stats = load(stats_file) 142 | else: 143 | stats = AlignmentStatistics() 144 | if args.catalog is not None: 145 | for catalog_path in args.catalog: 146 | stats.load_catalog(catalog_path, ignore_missing=True) 147 | if args.aligned is not None: 148 | for aligned_path in args.aligned: 149 | stats.load_aligned(aligned_path) 150 | if args.cache is not None: 151 | with open(args.cache, 'wb') as stats_file: 152 | dump(stats, stats_file) 153 | 154 | stats.print_stats() 155 | 156 | 157 | if __name__ == '__main__': 158 | main(sys.argv[1:]) 159 | os.system('stty sane') 160 | -------------------------------------------------------------------------------- /align/text.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import codecs 4 | from six.moves import range 5 | from collections import Counter 6 | from utils import enweight 7 | 8 | 9 | class Alphabet(object): 10 | def __init__(self, config_file): 11 | self._config_file = config_file 12 | self._label_to_str = [] 13 | self._str_to_label = {} 14 | self._size = 0 15 | with codecs.open(config_file, 'r', 'utf-8') as fin: 16 | for line in fin: 17 | if line[0:2] == '\\#': 18 | line = '#\n' 19 | elif line[0] == '#': 20 | continue 21 | self._label_to_str += line[:-1] # remove the line ending 22 | self._str_to_label[line[:-1]] = self._size 23 | self._size += 1 24 | 25 | def string_from_label(self, label): 26 | return self._label_to_str[label] 27 | 28 | def has_label(self, string): 29 | return string in self._str_to_label 30 | 31 | def label_from_string(self, string): 32 | try: 33 | return self._str_to_label[string] 34 | except KeyError as e: 35 | raise KeyError( 36 | '''ERROR: Your transcripts contain characters which do not occur in data/alphabet.txt! 
Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.''' 37 | ).with_traceback(e.__traceback__) 38 | 39 | def decode(self, labels): 40 | res = '' 41 | for label in labels: 42 | res += self.string_from_label(label) 43 | return res 44 | 45 | def size(self): 46 | return self._size 47 | 48 | def config_file(self): 49 | return self._config_file 50 | 51 | 52 | class TextCleaner(object): 53 | def __init__(self, alphabet, to_lower=True, normalize_space=True, dashes_to_ws=True): 54 | self.alphabet = alphabet 55 | self.to_lower = to_lower 56 | self.normalize_space = normalize_space 57 | self.dashes_to_ws = dashes_to_ws 58 | self.original_text = '' 59 | self.clean_text = '' 60 | self.positions = [] 61 | self.meta = [] 62 | 63 | def add_original_text(self, original_text, meta=None): 64 | if len(self.positions) > 0: 65 | self.clean_text += ' ' 66 | self.original_text += ' ' 67 | self.positions.append(len(self.original_text) - 1) 68 | self.meta.append(None) 69 | ws = True 70 | else: 71 | ws = False 72 | cleaned = [] 73 | prepared_text = original_text.lower() if self.to_lower else original_text 74 | for position, c in enumerate(prepared_text): 75 | if self.dashes_to_ws and c == '-' and not self.alphabet.has_label('-'): 76 | c = ' ' 77 | if self.normalize_space and c.isspace(): 78 | if ws: 79 | continue 80 | else: 81 | ws = True 82 | c = ' ' 83 | if not self.alphabet.has_label(c): 84 | continue 85 | if not c.isspace(): 86 | ws = False 87 | cleaned.append(c) 88 | self.positions.append(len(self.original_text) + position) 89 | self.meta.append(meta) 90 | self.original_text += original_text 91 | self.clean_text += ''.join(cleaned) 92 | 93 | def get_original_offset(self, clean_offset): 94 | if clean_offset == len(self.positions): 95 | return self.positions[-1] + 1 96 | return self.positions[clean_offset] 97 | 98 | def collect_meta(self, from_clean_offset, to_clean_offset=None): 99 | if to_clean_offset is None: 100 | return self.meta[from_clean_offset] 101 | metas = [] 102 | for meta in self.meta[from_clean_offset:to_clean_offset + 1]: 103 | if meta is not None and meta not in metas: 104 | metas.append(meta) 105 | return metas 106 | 107 | 108 | class TextRange(object): 109 | def __init__(self, document, start, end): 110 | self.document = document 111 | self.start = start 112 | self.end = end 113 | 114 | @staticmethod 115 | def token_at(text, position): 116 | start = len(text) 117 | end = 0 118 | for step in [-1, 1]: 119 | pos = position 120 | while 0 <= pos < len(text) and not text[pos].isspace(): 121 | if pos < start: 122 | start = pos 123 | if pos > end: 124 | end = pos 125 | pos += step 126 | return TextRange(text, start, end + 1) if start <= end else TextRange(text, position, position) 127 | 128 | def neighbour_token(self, direction): 129 | return TextRange.token_at(self.document, self.start - 2 if direction < 0 else self.end + 1) 130 | 131 | def next_token(self): 132 | return self.neighbour_token(1) 133 | 134 | def prev_token(self): 135 | return self.neighbour_token(-1) 136 | 137 | def get_text(self): 138 | return self.document[self.start:self.end] 139 | 140 | def __add__(self, other): 141 | if not self.document == other.document: 142 | raise Exception("Unable to add token from other string") 143 | return TextRange(self.document, min(self.start, other.start), max(self.end, other.end)) 144 | 145 | def __eq__(self, other): 146 | return self.document == other.document and self.start == other.start and self.end == 
other.end 147 | 148 | def __len__(self): 149 | return self.end-self.start 150 | 151 | 152 | def ngrams(s, size): 153 | """ 154 | Lists all appearances of all N-grams of a string from left to right. 155 | :param s: String to decompose 156 | :param size: N-gram size 157 | :return: Produces strings representing all N-grams 158 | """ 159 | window = len(s) - size 160 | if window < 1 or size < 1: 161 | if window == 0: 162 | yield s 163 | return 164 | for i in range(0, window + 1): 165 | yield s[i:i + size] 166 | 167 | 168 | def weighted_ngrams(s, size, direction=0): 169 | """ 170 | Lists all appearances of all N-grams of a string from left to right together with a positional weight value. 171 | The positional weight progresses quadratically. 172 | :param s: String to decompose 173 | :param size: N-gram size 174 | :param direction: Order of assigning positional weights to N-grams: 175 | direction < 0: Weight of first N-gram is 1.0 and of last one 0.0 176 | direction > 0: Weight of first N-gram is 0.0 and of last one 1.0 177 | direction == 0: Weight of center N-gram(s) near or equal 0, weight of first and last N-gram 1.0 178 | :return: Produces (string, float) tuples representing the N-gram along with its assigned positional weight value 179 | """ 180 | return enweight(ngrams(s, size), direction=direction) 181 | 182 | 183 | def similarity(a, b, direction=0, min_ngram_size=1, max_ngram_size=3, size_factor=1, position_factor=1): 184 | """ 185 | Computes similarity value of two strings ranging from 0.0 (completely different) to 1.0 (completely equal). 186 | Counts intersection of weighted N-gram sets of both strings. 187 | :param a: String to compare 188 | :param b: String to compare 189 | :param direction: Order of equality importance: 190 | direction < 0: Left ends of strings more important to be similar 191 | direction > 0: Right ends of strings more important to be similar 192 | direction == 0: Left and right ends more important to be similar than center parts 193 | :param min_ngram_size: Minimum N-gram size to take into account 194 | :param max_ngram_size: Maximum N-gram size to take into account 195 | :param size_factor: Importance factor of the N-gram size (compared to the positional importance). 196 | :param position_factor: Importance factor of the N-gram position (compared to the size importance) 197 | :return: Number between 0.0 (completely different) and 1.0 (completely equal) 198 | """ 199 | if len(a) < len(b): 200 | a, b = b, a 201 | ca, cb = Counter(), Counter() 202 | for s, c in [(a, ca), (b, cb)]: 203 | for size in range(min_ngram_size, max_ngram_size + 1): 204 | for ng, position_weight in weighted_ngrams(s, size, direction=direction): 205 | c[ng] += size * size_factor + position_weight * position_weight * position_factor 206 | score = 0 207 | for key in set(ca.keys()) & set(cb.keys()): 208 | score += min(ca[key], cb[key]) 209 | return score / sum(ca.values()) 210 | 211 | 212 | # The following code is from: http://hetland.org/coding/python/levenshtein.py 213 | 214 | # This is a straightforward implementation of a well-known algorithm, and thus 215 | # probably shouldn't be covered by copyright to begin with. But in case it is, 216 | # the author (Magnus Lie Hetland) has, to the extent possible under law, 217 | # dedicated all copyright and related and neighboring rights to this software 218 | # to the public domain worldwide, by distributing it under the CC0 license, 219 | # version 1.0. This software is distributed without any warranty. 
For more 220 | # information, see <http://creativecommons.org/publicdomain/zero/1.0/> 221 | 222 | def levenshtein(a, b): 223 | """ 224 | Calculates the Levenshtein distance between a and b. 225 | """ 226 | n, m = len(a), len(b) 227 | if n > m: 228 | # Make sure n <= m, to use O(min(n,m)) space 229 | a, b = b, a 230 | n, m = m, n 231 | 232 | current = list(range(n+1)) 233 | for i in range(1, m+1): 234 | previous, current = current, [i]+[0]*n 235 | for j in range(1, n+1): 236 | add, delete = previous[j]+1, current[j-1]+1 237 | change = previous[j-1] 238 | if a[j-1] != b[i-1]: 239 | change = change + 1 240 | current[j] = min(add, delete, change) 241 | 242 | return current[n] 243 | -------------------------------------------------------------------------------- /align/utils.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import sys 4 | import time 5 | import heapq 6 | 7 | from multiprocessing.dummy import Pool as ThreadPool 8 | 9 | KILO = 1024 10 | KILOBYTE = 1 * KILO 11 | MEGABYTE = KILO * KILOBYTE 12 | GIGABYTE = KILO * MEGABYTE 13 | TERABYTE = KILO * GIGABYTE 14 | SIZE_PREFIX_LOOKUP = {'k': KILOBYTE, 'm': MEGABYTE, 'g': GIGABYTE, 't': TERABYTE} 15 | 16 | 17 | def parse_file_size(file_size): 18 | file_size = file_size.lower().strip() 19 | if len(file_size) == 0: 20 | return 0 21 | n = int(keep_only_digits(file_size)) 22 | if file_size[-1] == 'b': 23 | file_size = file_size[:-1] 24 | e = file_size[-1] 25 | return SIZE_PREFIX_LOOKUP[e] * n if e in SIZE_PREFIX_LOOKUP else n 26 | 27 | 28 | def keep_only_digits(txt): 29 | return ''.join(filter(str.isdigit, txt)) 30 | 31 | 32 | def secs_to_hours(secs): 33 | hours, remainder = divmod(secs, 3600) 34 | minutes, seconds = divmod(remainder, 60) 35 | return '%02d:%02d:%02d' % (hours, minutes, seconds) 36 | 37 | 38 | def log_progress(it, total=None, interval=60.0, step=None, entity='it', file=sys.stderr): 39 | if total is None and hasattr(it, '__len__'): 40 | total = len(it) 41 | if total is None: 42 | line_format = ' {:8d} (elapsed: {}, speed: {:.2f} {}/{})' 43 | else: 44 | line_format = ' {:' + str(len(str(total))) + 'd} of {} : {:6.2f}% (elapsed: {}, speed: {:.2f} {}/{}, ETA: {})' 45 | 46 | overall_start = time.time() 47 | interval_start = overall_start 48 | interval_steps = 0 49 | 50 | def print_interval(steps, time_now): 51 | elapsed = time_now - overall_start 52 | elapsed_str = secs_to_hours(elapsed) 53 | speed_unit = 's' 54 | interval_duration = time_now - interval_start 55 | print_speed = speed = interval_steps / (0.001 if interval_duration == 0.0 else interval_duration) 56 | if print_speed < 0.1: 57 | print_speed = print_speed * 60 58 | speed_unit = 'm' 59 | if print_speed < 1: 60 | print_speed = print_speed * 60 61 | speed_unit = 'h' 62 | elif print_speed > 1000: 63 | print_speed = print_speed / 1000.0 64 | speed_unit = 'ms' 65 | if total is None: 66 | line = line_format.format(global_step, elapsed_str, print_speed, entity, speed_unit) 67 | else: 68 | percent = global_step * 100.0 / total 69 | eta = secs_to_hours(((total - global_step) / speed) if speed > 0 else 0) 70 | line = line_format.format(global_step, total, percent, elapsed_str, print_speed, entity, speed_unit, eta) 71 | print(line, file=file, flush=True) 72 | 73 | for global_step, obj in enumerate(it, 1): 74 | interval_steps += 1 75 | yield obj 76 | t = time.time() 77 | if (step is None and t - interval_start > interval) or (step is not None and interval_steps >= step): 78 | print_interval(interval_steps, t) 79 | interval_steps = 0 80 | interval_start = t 81 | if interval_steps > 0:
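        # Flush one final progress line for the remainder that did not fill a
        # whole interval/step before the iterable was exhausted.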
82 | print_interval(interval_steps, time.time()) 83 | 84 | 85 | def circulate(items, center=None): 86 | count = len(list(items)) 87 | if count > 0: 88 | if center is None: 89 | center = count // 2 90 | center = min(max(center, 0), count - 1) 91 | yield center, items[center] 92 | for i in range(1, count): 93 | 94 | if center + i < count: 95 | yield center + i, items[center + i] 96 | if center - i >= 0: 97 | yield center - i, items[center - i] 98 | 99 | 100 | def by_len(items): 101 | indexed = list(enumerate(items)) 102 | return sorted(indexed, key=lambda e: len(e[1]), reverse=True) 103 | 104 | 105 | def enweight(items, direction=0): 106 | """ 107 | Enumerates all entries together with a positional weight value. 108 | The positional weight progresses quadratically. 109 | :param items: Items to enumerate 110 | :param direction: Order of assigning positional weights to N-grams: 111 | direction < 0: Weight of first N-gram is 1.0 and of last one 0.0 112 | direction > 0: Weight of first N-gram is 0.0 and of last one 1.0 113 | direction == 0: Weight of center N-gram(s) near or equal 0, weight of first and last N-gram 1.0 114 | :return: Produces (object, float) tuples representing the enumerated item 115 | along with its assigned positional weight value 116 | """ 117 | items = list(items) 118 | direction = -1 if direction < 0 else (1 if direction > 0 else 0) 119 | n = len(items) - 1 120 | if n < 1: 121 | if n == 0: 122 | yield items[0], 1 123 | return  # PEP 479: use a bare return, not StopIteration, to end a generator 124 | for i, item in enumerate(items): 125 | c = (i + n * (direction - 1) / 2) / n 126 | yield item, c * c * (4 - abs(direction) * 3) 127 | 128 | 129 | def greedy_minimum_search(a, b, compute, result_a=None, result_b=None): 130 | if a > b: 131 | a, b = b, a 132 | result_a, result_b = result_b, result_a 133 | if a == b: 134 | return result_a or result_b or compute(a) 135 | result_a = result_a or compute(a) 136 | result_b = result_b or compute(b) 137 | if b == a+1: 138 | return result_a if result_a[0] < result_b[0] else result_b 139 | c = (a+b) // 2 140 | if result_a[0] < result_b[0]: 141 | return greedy_minimum_search(a, c, compute, result_a=result_a) 142 | else: 143 | return greedy_minimum_search(c, b, compute, result_b=result_b) 144 | 145 | 146 | class Interleaved: 147 | """Collection that lazily combines sorted collections in an interleaving fashion. 148 | During iteration the next smallest element from all the sorted collections is always picked. 149 | The collections must support iter() and len().""" 150 | def __init__(self, *iterables, key=lambda obj: obj): 151 | self.iterables = iterables 152 | self.key = key 153 | self.len = sum(map(len, iterables)) 154 | 155 | def __iter__(self): 156 | return heapq.merge(*self.iterables, key=self.key) 157 | 158 | def __len__(self): 159 | return self.len 160 | 161 | 162 | class LimitingPool: 163 | """Limits unbounded ahead-processing of multiprocessing.Pool's imap method 164 | before items get consumed by the iteration caller.
165 | This prevents OOM issues in situations where items represent larger memory allocations.""" 166 | def __init__(self, processes=None, limit_factor=2, sleeping_for=0.1): 167 | self.processes = os.cpu_count() if processes is None else processes 168 | self.pool = ThreadPool(processes=processes) 169 | self.sleeping_for = sleeping_for 170 | self.max_ahead = self.processes * limit_factor 171 | self.processed = 0 172 | 173 | def __enter__(self): 174 | return self 175 | 176 | def limit(self, it): 177 | for obj in it: 178 | while self.processed >= self.max_ahead: 179 | time.sleep(self.sleeping_for) 180 | self.processed += 1 181 | yield obj 182 | 183 | def map(self, fun, it): 184 | for obj in self.pool.imap(fun, self.limit(it)): 185 | self.processed -= 1 186 | yield obj 187 | 188 | def __exit__(self, exc_type, exc_value, traceback): 189 | self.pool.close() 190 | -------------------------------------------------------------------------------- /bin/align.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/align.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/catalog_tool.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/catalog_tool.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/createenv.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | python3 -m venv venv 4 | source venv/bin/activate 5 | pip install -r requirements.txt -------------------------------------------------------------------------------- /bin/export.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/export.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/getmodel.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | version="0.7.1" 4 | dir="deepspeech-${version}-models" 5 | am="${dir}.pbmm" 6 | scorer="${dir}.scorer" 7 | 8 | mkdir -p models/en 9 | cd models/en 10 | 11 | if [[ ! -f $am ]] ; then 12 | wget "https://github.com/mozilla/DeepSpeech/releases/download/v${version}/${am}" 13 | fi 14 | 15 | if [[ ! -f $scorer ]] ; then 16 | wget "https://github.com/mozilla/DeepSpeech/releases/download/v${version}/${scorer}" 17 | fi 18 | 19 | if [[ ! -f "alphabet.txt" ]] ; then 20 | wget "https://raw.githubusercontent.com/mozilla/DeepSpeech/master/data/alphabet.txt" 21 | fi 22 | -------------------------------------------------------------------------------- /bin/gettestdata.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | bin=`pwd`/bin 4 | 5 | mkpush () { 6 | mkdir -p data/$1 7 | pushd data/$1 8 | $1 9 | popd 10 | } 11 | 12 | cwget () { 13 | url=$1 14 | file="${url##*/}" 15 | if [ ! 
-f $file ]; then 16 | wget $url 17 | fi 18 | } 19 | 20 | test1 () { 21 | cwget https://ia802607.us.archive.org/14/items/artfiction00jamegoog/artfiction00jamegoog_djvu.txt 22 | cp artfiction00jamegoog_djvu.txt transcript.txt 23 | cwget http://www.archive.org/download/art_of_fiction_jvw_librivox/art_of_fiction_jvw_librivox_64kb_mp3.zip 24 | unzip -o art_of_fiction_jvw_librivox_64kb_mp3.zip 25 | if [ ! -f "audio.wav" ]; then 26 | cat *.mp3 >joined.mp3 27 | ffmpeg -y -i joined.mp3 -ar 16000 -ac 1 audio.wav 28 | fi 29 | } 30 | 31 | test2 () { 32 | cwget https://www.ibiblio.org/xml/examples/shakespeare/as_you.xml 33 | python "$bin/play2script.py" script as_you.xml transcript.script 34 | python "$bin/play2script.py" lines as_you.xml transcript-lines.txt 35 | python "$bin/play2script.py" plain as_you.xml transcript-plain.txt 36 | cwget http://www.archive.org/download/as_you_like_it_0902_librivox/as_you_like_it_0902_librivox_64kb_mp3.zip 37 | unzip -o as_you_like_it_0902_librivox_64kb_mp3.zip 38 | if [ ! -f "audio.wav" ]; then 39 | cat *.mp3 >joined.mp3 40 | ffmpeg -y -i joined.mp3 -ar 16000 -ac 1 audio.wav 41 | fi 42 | } 43 | 44 | mkpush test1 45 | mkpush test2 46 | -------------------------------------------------------------------------------- /bin/lm-dependencies.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | basedir="$(pwd)" 4 | 5 | mkdir -p dependencies 6 | pushd dependencies 7 | 8 | 9 | wget -N https://kheafield.com/code/kenlm.tar.gz 10 | tar -xzvf kenlm.tar.gz 11 | pushd kenlm 12 | 13 | mkdir -p build 14 | pushd build 15 | cmake .. 16 | make -j 4 17 | popd 18 | 19 | popd 20 | 21 | 22 | source $basedir/venv/bin/activate 23 | mkdir -p deepspeech 24 | pushd deepspeech 25 | python $basedir/bin/taskcluster.py --target . 
--branch v0.7.1 -------------------------------------------------------------------------------- /bin/meta.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/meta.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/play2script.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | from xml.dom import minidom 4 | 5 | 6 | def fail(): 7 | print('Usage: play2script.py (script|plain|lines) ') 8 | exit(1) 9 | 10 | 11 | def get_text(elements): 12 | return ' '.join(map(lambda element: ' '.join(t.nodeValue.strip() for t in 13 | element.childNodes if t.nodeType == t.TEXT_NODE), 14 | elements)) 15 | 16 | 17 | def main(args): 18 | if len(args) != 3: 19 | fail() 20 | dom = minidom.parse(args[1]) 21 | if args[0] == 'script': 22 | script = [] 23 | for speech in dom.getElementsByTagName('SPEECH'): 24 | speaker = get_text(speech.getElementsByTagName('SPEAKER')) 25 | speaker = ' '.join(map(lambda p: p[0] + p[1:].lower(), speaker.split(' '))) 26 | text = get_text(speech.getElementsByTagName('LINE')) 27 | script.append({ 28 | 'speaker': speaker, 29 | 'text': text 30 | }) 31 | with open(args[2], 'w') as script_file: 32 | script_file.write(json.dumps(script)) 33 | elif args[0] in ['plain', 'lines']: 34 | with open(args[2], 'w') as script_file: 35 | for speech in dom.getElementsByTagName('SPEECH'): 36 | text = get_text(speech.getElementsByTagName('LINE')) 37 | script_file.write(text + (' ' if args[0] == 'plain' else '\n')) 38 | else: 39 | print('Unknown output specifier') 40 | fail() 41 | 42 | 43 | if __name__ == '__main__': 44 | main(sys.argv[1:]) 45 | -------------------------------------------------------------------------------- /bin/sdb_tool.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/sdb_tool.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/statistics.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | approot=$(cd "$(dirname "$(dirname "$0")")" && pwd) 3 | source "$approot/venv/bin/activate" 4 | python "$approot/align/stats.py" "$@" 5 | -------------------------------------------------------------------------------- /bin/taskcluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | from __future__ import print_function, absolute_import, division 4 | 5 | import argparse 6 | import platform 7 | import subprocess 8 | import sys 9 | import os 10 | import errno 11 | import stat 12 | 13 | import six.moves.urllib as urllib 14 | 15 | from pkg_resources import parse_version 16 | 17 | 18 | DEFAULT_SCHEMES = { 19 | 'deepspeech': 'https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.deepspeech.native_client.%(branch_name)s.%(arch_string)s/artifacts/public/%(artifact_name)s', 20 | 'tensorflow': 'https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.%(branch_name)s.%(arch_string)s/artifacts/public/%(artifact_name)s' 21 | } 22 | 23 | TASKCLUSTER_SCHEME = 
os.getenv('TASKCLUSTER_SCHEME', DEFAULT_SCHEMES['deepspeech']) 24 | 25 | def get_tc_url(arch_string, artifact_name='native_client.tar.xz', branch_name='master'): 26 | assert arch_string is not None 27 | assert artifact_name is not None 28 | assert artifact_name 29 | assert branch_name is not None 30 | assert branch_name 31 | 32 | return TASKCLUSTER_SCHEME % { 'arch_string': arch_string, 'artifact_name': artifact_name, 'branch_name': branch_name} 33 | 34 | def maybe_download_tc(target_dir, tc_url, progress=True): 35 | def report_progress(count, block_size, total_size): 36 | percent = (count * block_size * 100) // total_size 37 | sys.stdout.write("\rDownloading: %d%%" % percent) 38 | sys.stdout.flush() 39 | 40 | if percent >= 100: 41 | print('\n') 42 | 43 | assert target_dir is not None 44 | 45 | target_dir = os.path.abspath(target_dir) 46 | try: 47 | os.makedirs(target_dir) 48 | except OSError as e: 49 | if e.errno != errno.EEXIST: 50 | raise e 51 | assert os.path.isdir(os.path.dirname(target_dir)) 52 | 53 | tc_filename = os.path.basename(tc_url) 54 | target_file = os.path.join(target_dir, tc_filename) 55 | if not os.path.isfile(target_file): 56 | print('Downloading %s ...' % tc_url) 57 | urllib.request.urlretrieve(tc_url, target_file, reporthook=(report_progress if progress else None)) 58 | else: 59 | print('File already exists: %s' % target_file) 60 | 61 | return target_file 62 | 63 | def maybe_download_tc_bin(**kwargs): 64 | final_file = maybe_download_tc(kwargs['target_dir'], kwargs['tc_url'], kwargs['progress']) 65 | final_stat = os.stat(final_file) 66 | os.chmod(final_file, final_stat.st_mode | stat.S_IEXEC) 67 | 68 | def read(fname): 69 | return open(os.path.join(os.path.dirname(__file__), fname)).read() 70 | 71 | def main(): 72 | parser = argparse.ArgumentParser(description='Tooling to ease downloading of components from TaskCluster.') 73 | parser.add_argument('--target', required=False, 74 | help='Where to put the native client binary files') 75 | parser.add_argument('--arch', required=False, 76 | help='Which architecture to download binaries for. "arm" for ARM 7 (32-bit), "arm64" for ARM64, "gpu" for CUDA enabled x86_64 binaries, "cpu" for CPU-only x86_64 binaries, "osx" for CPU-only x86_64 OSX binaries. Optional ("cpu" by default)') 77 | parser.add_argument('--artifact', required=False, 78 | default='native_client.tar.xz', 79 | help='Name of the artifact to download. Defaults to "native_client.tar.xz"') 80 | parser.add_argument('--source', required=False, default=None, 81 | help='Name of the TaskCluster scheme to use.') 82 | parser.add_argument('--branch', required=False, 83 | help='Branch name to use. 
Defaulting to current content of VERSION file.') 84 | parser.add_argument('--decoder', action='store_true', 85 | help='Get URL to ds_ctcdecoder Python package.') 86 | 87 | args = parser.parse_args() 88 | 89 | if not args.target and not args.decoder: 90 | print('Pass either --target or --decoder.') 91 | exit(1) 92 | 93 | is_arm = 'arm' in platform.machine() 94 | is_mac = 'darwin' in sys.platform 95 | is_64bit = sys.maxsize > (2**31 - 1) 96 | is_ucs2 = sys.maxunicode < 0x10ffff 97 | 98 | if not args.arch: 99 | if is_arm: 100 | args.arch = 'arm64' if is_64bit else 'arm' 101 | elif is_mac: 102 | args.arch = 'osx' 103 | else: 104 | args.arch = 'cpu' 105 | 106 | if not args.branch: 107 | version_string = read('../VERSION').strip() 108 | ds_version = parse_version(version_string) 109 | args.branch = "v{}".format(version_string) 110 | else: 111 | ds_version = args.branch.lstrip('v') 112 | 113 | if args.decoder: 114 | plat = platform.system().lower() 115 | arch = platform.machine() 116 | 117 | if plat == 'linux' and arch == 'x86_64': 118 | plat = 'manylinux1' 119 | 120 | if plat == 'darwin': 121 | plat = 'macosx_10_10' 122 | 123 | m_or_mu = 'mu' if is_ucs2 else 'm' 124 | pyver = ''.join(map(str, sys.version_info[0:2])) 125 | 126 | artifact = "ds_ctcdecoder-{ds_version}-cp{pyver}-cp{pyver}{m_or_mu}-{platform}_{arch}.whl".format( 127 | ds_version=ds_version, 128 | pyver=pyver, 129 | m_or_mu=m_or_mu, 130 | platform=plat, 131 | arch=arch 132 | ) 133 | 134 | ctc_arch = args.arch + '-ctc' 135 | 136 | print(get_tc_url(ctc_arch, artifact, args.branch)) 137 | exit(0) 138 | 139 | if args.source is not None: 140 | if args.source in DEFAULT_SCHEMES: 141 | global TASKCLUSTER_SCHEME 142 | TASKCLUSTER_SCHEME = DEFAULT_SCHEMES[args.source] 143 | else: 144 | print('No such scheme: %s' % args.source) 145 | exit(1) 146 | 147 | maybe_download_tc(target_dir=args.target, tc_url=get_tc_url(args.arch, args.artifact, args.branch)) 148 | 149 | if '.tar.' 
in args.artifact: 150 | subprocess.check_call(['tar', 'xvf', os.path.join(args.target, args.artifact), '-C', args.target]) 151 | 152 | if __name__ == '__main__': 153 | main() 154 | -------------------------------------------------------------------------------- /data/all-wav.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/audio.wav", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | }, 8 | { 9 | "audio": "test2/audio.wav", 10 | "tlog": "test2/joined.tlog", 11 | "script": "test2/transcript.script", 12 | "aligned": "test2/aligned.json" 13 | } 14 | ] 15 | -------------------------------------------------------------------------------- /data/all.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/joined.mp3", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | }, 8 | { 9 | "audio": "test2/joined.mp3", 10 | "tlog": "test2/joined.tlog", 11 | "script": "test2/transcript.script", 12 | "aligned": "test2/aligned.json" 13 | } 14 | ] 15 | -------------------------------------------------------------------------------- /data/test1.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test1/joined.mp3", 4 | "tlog": "test1/joined.tlog", 5 | "script": "test1/transcript.txt", 6 | "aligned": "test1/aligned.json" 7 | } 8 | ] 9 | -------------------------------------------------------------------------------- /data/test2.catalog: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "audio": "test2/joined.mp3", 4 | "tlog": "test2/joined.tlog", 5 | "script": "test2/transcript.script", 6 | "aligned": "test2/aligned.json" 7 | } 8 | ] 9 | -------------------------------------------------------------------------------- /doc/algo.md: -------------------------------------------------------------------------------- 1 | ## Alignment algorithm and its parameters 2 | 3 | ### Step 1 - Splitting audio 4 | 5 | A voice activity detector (at the moment this is `webrtcvad`) is used 6 | to split the provided audio data into voice fragments. 7 | These fragments are essentially streams of continuous speech without any longer pauses 8 | (e.g. sentences). 9 | 10 | `--audio-vad-aggressiveness <level>` can be used to influence the length of the 11 | resulting fragments. 12 | 13 | ### Step 2 - Preparation of original text 14 | 15 | STT transcripts are typically provided in a normalized textual form with 16 | - no casing, 17 | - no punctuation and 18 | - normalized whitespace (single spaces only). 19 | 20 | To be able to align STT transcripts with the original text, it is necessary 21 | to internally convert the original text into the same form. 22 | 23 | This happens in two steps: 24 | 1. Normalization of whitespace, lower-casing all text and 25 | replacing some characters with spaces (e.g. dashes) 26 | 2. Removal of all characters that are not in the language's alphabet 27 | (see DeepSpeech model data) 28 | 29 | Be aware: *This conversion happens on a purely textual basis and will not remove unspoken content 30 | like markup/markdown tags or artifacts. This should be done beforehand. 31 | Reducing the difference between spoken and original text will improve alignment quality and speed.*
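For illustration, here is a stripped-down sketch of this cleaning (the actual implementation is the `TextCleaner` class in `align/text.py`, which additionally keeps a mapping from cleaned-text offsets back to original-text offsets; the function below is a simplified stand-in):

```python
# Simplified sketch of the step 1 + 2 text preparation - see align/text.py
# (TextCleaner) for the real implementation, which also tracks offsets.
def clean_text(text, alphabet, to_lower=True, dashes_to_ws=True):
    text = text.lower() if to_lower else text
    cleaned, ws = [], True  # ws: last emitted character was whitespace
    for c in text:
        if dashes_to_ws and c == '-' and '-' not in alphabet:
            c = ' '  # step 1: replace dashes with spaces
        if c.isspace():
            if ws:
                continue  # step 1: collapse whitespace runs to single spaces
            ws, c = True, ' '
        elif c in alphabet:
            ws = False
        else:
            continue  # step 2: drop characters outside the alphabet
        cleaned.append(c)
    return ''.join(cleaned).strip()

# e.g. clean_text('Hello -- World!', set('abcdefghijklmnopqrstuvwxyz '))
# -> 'hello world'
```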
32 | 33 | In the very unlikely situation that you have to change the default behavior (of step 1), 34 | there are some switches: 35 | 36 | `--text-keep-dashes` will prevent substitution of dashes with spaces. 37 | 38 | `--text-keep-ws` will keep whitespace untouched. 39 | 40 | `--text-keep-casing` will keep character casing as provided. 41 | 42 | ### Step 3 (optional) - Generating a document-specific language model 43 | 44 | If the [dependencies](lm.md) for 45 | individual language model generation are installed, such a per-document model will 46 | now be generated by default. 47 | 48 | Assuming your text document is named `original.txt`, these files will be generated: 49 | - `original.txt.clean` - cleaned version of the original text 50 | - `original.txt.arpa` - text file with probabilities in ARPA format 51 | - `original.txt.lm` - binary representation of the former one 52 | - `original.txt.trie` - prefix-tree optimized for probability lookup 53 | 54 | `--stt-no-own-lm` deactivates creation of individual language models per document and 55 | uses the one from the model directory instead. 56 | 57 | ### Step 4 - Transcription of voice fragments through STT 58 | 59 | After VAD splitting, the resulting fragments are transcribed into textual phrases. 60 | This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT. 61 | 62 | As this can take a long time, all resulting phrases are - together with their 63 | timestamps - saved as JSON into a transcription log file 64 | (the `audio` parameter path with suffix `.tlog` instead of `.wav`). 65 | Subsequent calls will look for that file and - if found - 66 | load it and skip the transcription phase. 67 | 68 | `--stt-model-dir <dir>` points DeepSpeech to the language-specific model data directory. 69 | It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it. 70 | 71 | ### Step 5 - Rough alignment 72 | 73 | The actual text alignment is based on a recursive divide-and-conquer approach: 74 | 75 | 1. Construct an ordered list of all phrases in the current interval 76 | (at the beginning this is the list of all phrases that are to be aligned), 77 | where long phrases close to the middle of the interval come first. 78 | 2. Iterate through the list and compute the best Smith-Waterman alignment 79 | (see the following sub-sections) with the document's original text... 80 | 3. ...until there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth 81 | dependent threshold (in most cases this should already be the first phrase). 82 | 4. Recursively continue with step 1 for the sub-intervals and original text ranges 83 | to the left and right of the phrase and its aligned text range within the original text. 84 | 5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum 85 | Smith-Waterman score on their recursion level (a simplified sketch of this recursion follows the list of advantages below). 86 | 87 | This approach assumes that all phrases were spoken in the same order as they appear in the 88 | original transcript. It has the following advantages compared to individual 89 | global phrase matching: 90 | 91 | - Long non-matching chunks of spoken text or the original transcript will automatically and 92 | cleanly get ignored. 93 | - Short phrases (with the risk of matching more than once per document) will automatically 94 | get aligned to their intended locations by longer ones which "squeeze" them in. 95 | - Smith-Waterman score thresholds can be kept lower 96 | (and thus better match lower-quality STT transcripts), as there is a lower chance for 97 | - long sequences to match at a wrong location and for 98 | - shorter sequences to match at a wrong location within their shortened intervals 99 | (as they get matched later and deeper in the recursion tree).
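The sketch below illustrates the recursion; `sw_align` and `threshold` are assumed callables standing in for the Smith-Waterman search and the depth-dependent threshold described above (this is an illustration, not the actual code in `align/align.py`):

```python
# Hypothetical sketch of the step 5 recursion - helper callables are assumed.
def align_interval(phrases, text_start, text_end, sw_align, threshold, depth=0):
    if not phrases:
        return []
    mid = (len(phrases) - 1) / 2
    # Simple stand-in for "long phrases close to the middle come first".
    order = sorted(range(len(phrases)), key=lambda i: (-len(phrases[i]), abs(i - mid)))
    for i in order:
        start, end, score = sw_align(phrases[i], text_start, text_end)
        if score > threshold(depth):  # (low) recursion-depth dependent threshold
            left = align_interval(phrases[:i], text_start, start, sw_align, threshold, depth + 1)
            right = align_interval(phrases[i + 1:], end, text_end, sw_align, threshold, depth + 1)
            return left + [(phrases[i], start, end, score)] + right
    return []  # nothing in this interval aligned well enough
```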
100 | 101 | #### Smith-Waterman candidate selection 102 | 103 | Finding the best match of a given phrase within the original (potentially long) transcript 104 | using vanilla Smith-Waterman is not feasible. 105 | 106 | So this tool follows a two-phase approach where the first goal is to get a list of alignment 107 | candidates. As the first step, the original text is virtually partitioned into windows of the 108 | same length as the search pattern. These windows are ordered descending by the number of 3-grams 109 | they share with the pattern. 110 | Best alignment candidates are then taken from the beginning of this ordered list. 111 | 112 | `--align-max-candidates <n>` sets the maximum number of candidate windows 113 | taken from the beginning of the list for further alignment. 114 | 115 | Multiplied by the number of 3-grams of the predecessor window, `--align-candidate-threshold <fraction>` 116 | gives the minimum number of 3-grams the next candidate window has to share with the pattern to also be 117 | considered a candidate. 118 | 119 | #### Smith-Waterman alignment 120 | 121 | For each candidate, the best possible alignment is computed using the 122 | [Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm 123 | within an extended interval of one window-size around the candidate window. 124 | 125 | `--align-match-score <score>` is the score per correctly matched character. Default: 100 126 | 127 | `--align-mismatch-score <score>` is the score per non-matching (exchanged) character. Default: -100 128 | 129 | `--align-gap-score <score>` is the score per character gap (removing 1 character from pattern or original). Default: -100 130 | 131 | The overall best score for the best match is normalized to a maximum of about 100 by dividing 132 | it by the maximum character count of either the match or the pattern. 133 | 134 | ### Step 6 - Gap alignment 135 | 136 | After recursive matching of fragments there are potential text leftovers between aligned original 137 | texts. 138 | 139 | Some examples: 140 | - Often: Missing (and therefore unaligned) STT transcripts of word endings (e.g. English past tense endings _-d_ and _-ed_) 141 | on phrase endings to the left of the gap 142 | - Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual 143 | alignments are now left unaligned in the gap 144 | - Big unmatched chunks of text, like 145 | - Preface, text summaries or any other kind of meta information 146 | - Copyright headers/footers 147 | - Table of contents 148 | - Chapter headers (if not spoken as they appear) 149 | - Captions of figures 150 | - Contents of tables 151 | - Line headers like character names in drama scripts 152 | - Depending on the (pre-processing) quality: OCR leftovers like 153 | - page headers 154 | - page numbers 155 | - reader's notes 156 | 157 | The basic challenge here is to figure out whether all or some of the gap text should be used to extend 158 | the phrase to the left and/or to the right of the gap. 159 | 160 | As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result, 161 | its score cannot be used for further fine-tuning.
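As a rough, hypothetical illustration of this extension search (not the tool's actual implementation), assuming a `similarity(a, b)` function like the n-gram based one in `align/text.py`:

```python
# Illustrative only: score every possible extension of an aligned text range
# into the gap and keep the best one. `similarity(a, b)` is assumed to return
# a value between 0.0 (completely different) and 1.0 (completely equal).
def best_extension(transcript, aligned_text, gap_text, stretch_factor, similarity):
    max_extra = min(int(len(aligned_text) * stretch_factor), len(gap_text))
    best_n, best_score = 0, similarity(transcript, aligned_text)
    for n in range(1, max_extra + 1):
        score = similarity(transcript, aligned_text + gap_text[:n])
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score  # number of gap characters to absorb
```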
The tool therefore provides a collection of 162 | so-called [text-distance metrics](metrics.md) to pick from via the `--align-similarity-algo <algorithm>` 163 | parameter. 164 | 165 | Using the selected distance metric, the gap alignment is done by looking for the best-scoring 166 | extension of the left and right phrases up to their maximum extension. 167 | 168 | `--align-stretch-factor <fraction>` is the maximum fraction of its text length by which a phrase 169 | may get stretched. 170 | 171 | For many languages it is worth putting some emphasis on matching to word boundaries 172 | (that is, whitespace-separated sub-sequences). 173 | 174 | `--align-snap-factor <factor>` controls the snappiness to word boundaries. 175 | 176 | If the best-scoring extensions overlap, the best-scoring sum of non-overlapping 177 | (but touching) extensions wins. 178 | 179 | ### Step 7 - Selection, filtering and output 180 | 181 | Finally the best alignment of all candidate windows is selected as the winner. 182 | It then has to survive a series of filters to get into the result file. 183 | 184 | For each text-distance metric there are two filter parameters: 185 | 186 | `--output-min-<METRIC-ID> <value>` only keeps utterances having the provided minimum value for the 187 | metric with id `METRIC-ID` 188 | 189 | `--output-max-<METRIC-ID> <value>` only keeps utterances having the provided maximum value for the 190 | metric with id `METRIC-ID` 191 | 192 | For each text-distance metric there is also the option to have it added to each utterance's entry: 193 | 194 | `--output-<METRIC-ID>` adds the computed value for `METRIC-ID` to the utterance's array entry 195 | 196 | Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%, 197 | where numbers >1.0 are theoretically possible). 198 | -------------------------------------------------------------------------------- /doc/export.md: -------------------------------------------------------------------------------- 1 | ## Export 2 | 3 | After files have been successfully aligned, one will likely want to export the aligned utterances 4 | as machine-learning training samples. 5 | 6 | This is where the export tool `bin/export.sh` comes in. 7 | 8 | ### Step 1 - Reading the input 9 | 10 | The exporter takes either a single audio file (`--audio