├── .gitignore
├── LICENSE
├── README.md
├── assets
├── whisply.png
└── whisply_app.png
├── examples
├── example_config.json
└── post_correction.yaml
├── pyproject.toml
├── setup.cfg
└── whisply
├── __init__.py
├── app.py
├── app_helpers.py
├── cli.py
├── diarize_utils.py
├── download_utils.py
├── little_helper.py
├── models.py
├── output_templates.py
├── output_utils.py
├── post_correction.py
└── transcription.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /dev
2 | /downloads*
3 | /transcriptions
4 | /files*
5 | /legacy
6 | /venv*
7 | .vscode
8 | .DS_Store
9 | *.log
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2025 Universitätsbibliothek Mannheim
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19 | THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # whisply
2 |
3 | [![PyPI version](https://badge.fury.io/py/whisply.svg)](https://badge.fury.io/py/whisply)
4 |
5 | ![whisply](assets/whisply.png)
6 |
7 | *Transcribe, translate, annotate and subtitle audio and video files with OpenAI's [Whisper](https://github.com/openai/whisper) ... fast!*
8 |
9 | `whisply` combines [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) to offer an easy-to-use solution for batch processing files on Windows, Linux and Mac. It also enables word-level speaker annotation by integrating [whisperX](https://github.com/m-bain/whisperX) and [pyannote](https://github.com/pyannote/pyannote-audio).
10 |
11 | ## Table of contents
12 |
13 | - [Features](#features)
14 | - [Requirements](#requirements)
15 | - [Installation](#installation)
16 | - [Install `ffmpeg`](#install-ffmpeg)
17 | - [Installation with `pip`](#installation-with-pip)
18 | - [Installation from `source`](#installation-from-source)
19 |   - [Nvidia GPU fix for Linux users (May 2025)](#nvidia-gpu-fix-for-linux-users-may-2025)
20 | - [Usage](#usage)
21 | - [CLI](#cli)
22 | - [App](#app)
23 | - [Speaker annotation and diarization](#speaker-annotation-and-diarization)
24 | - [Requirements](#requirements-1)
25 | - [How speaker annotation works](#how-speaker-annotation-works)
26 | - [Post correction](#post-correction)
27 | - [Batch processing](#batch-processing)
28 | - [Using config files for batch processing](#using-config-files-for-batch-processing)
29 | - [Citation](#citation)
30 |
31 | ## Features
32 |
33 | * 🚴‍♂️ **Performance**: `whisply` selects the fastest Whisper implementation based on your hardware:
34 |   * CPU/GPU (Nvidia CUDA): `faster-whisper` or `whisperX`
35 | * MPS (Apple M1-M4): `insanely-fast-whisper`
36 |
37 | * ⏩ **large-v3-turbo Ready**: Support for [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) on all devices. **Note**: Subtitling and annotations on CPU/GPU use `whisperX` for accurate timestamps, but `whisper-large-v3-turbo` isn’t currently available for `whisperX`.
38 |
39 | * ✅ **Auto Device Selection**: `whisply` automatically chooses `faster-whisper` (CPU) or `insanely-fast-whisper` (MPS, Nvidia GPUs) for transcription and translation unless a specific `--device` option is passed.
40 |
41 | * 🗣️ **Word-level Annotations**: Enabling `--subtitle` or `--annotate` uses `whisperX` or `insanely-fast-whisper` for word segmentation and speaker annotations. `whisply` approximates missing timestamps for numeric words.
42 |
43 | * 💬 **Customizable Subtitles**: Specify words per subtitle block (e.g., "5") to generate `.srt` and `.webvtt` files with fixed word counts and timestamps.
44 |
45 | * 🧺 **Batch Processing**: Handle single files, folders, URLs, or lists via `.list` documents. See the [Batch processing](#batch-processing) section for details.
46 |
47 | * 👩‍💻 **CLI / App**: `whisply` can be run directly from the CLI or as an app with a graphical user interface (GUI).
48 |
49 | * ⚙️ **Export Formats**: Supports `.json`, `.txt`, `.txt (annotated)`, `.srt`, `.webvtt`, `.vtt`, `.rttm` and `.html` (compatible with [noScribe's editor](https://github.com/kaixxx/noScribe)).
50 |
51 | ## Requirements
52 |
53 | * [FFmpeg](https://ffmpeg.org/)
54 | * Python >= 3.10
55 | * GPU processing requires:
56 | * Nvidia GPU (CUDA: cuBLAS and cuDNN for CUDA 12)
57 | * Apple Metal Performance Shaders (MPS) (Mac M1-M4)
58 | * Speaker annotation requires a [HuggingFace Access Token](https://huggingface.co/docs/hub/security-tokens)
59 |
60 | ## Installation
61 |
62 | ### Install `ffmpeg`
63 |
64 | ```shell
65 | # --- macOS ---
66 | brew install ffmpeg
67 |
68 | # --- Linux ---
69 | sudo apt-get update
70 | sudo apt-get install ffmpeg
71 |
72 | # --- Windows ---
73 | winget install Gyan.FFmpeg
74 | ```
75 |
76 | For more information you can visit the [FFmpeg website](https://ffmpeg.org/download.html).
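
You can verify the installation with:

```shell
ffmpeg -version
```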
77 |
78 | ### Installation with `pip`
79 |
80 | 1. Create a Python virtual environment
81 |
82 | ```shell
83 | python3 -m venv venv
84 | ```
85 |
86 | 2. Activate the environment
87 |
88 | ```shell
89 | # --- Linux & macOS ---
90 | source venv/bin/activate
91 |
92 | # --- Windows ---
93 | venv\Scripts\activate
94 | ```
95 |
96 | 3. Install whisply
97 |
98 | ```shell
99 | pip install whisply
100 | ```
101 |
102 | ### Installation from `source`
103 |
104 | 1. Clone this repository
105 |
106 | ```shell
107 | git clone https://github.com/tsmdt/whisply.git
108 | ```
109 |
110 | 2. Change to project folder
111 |
112 | ```shell
113 | cd whisply
114 | ```
115 |
116 | 3. Create a Python virtual environment
117 |
118 | ```shell
119 | python3 -m venv venv
120 | ```
121 |
122 | 4. Activate the Python virtual environment
123 |
124 | ```shell
125 | # --- Linux & macOS ---
126 | source venv/bin/activate
127 |
128 | # --- Windows ---
129 | venv\Scripts\activate
130 | ```
131 |
132 | 5. Install whisply
133 |
134 | ```shell
135 | pip install .
136 | ```
137 |
138 | ### Nvidia GPU fix for Linux users (May 2025)
139 |
140 | **Could not load library libcudnn_ops.so.9**
141 |
142 | If you use `whisply` with an Nvidia GPU and encounter the following error:
143 | ```shell
144 | Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}
145 | ```
146 |
147 | Use the following steps to fix the issue:
148 |
149 | 1. In your activated Python environment, run `pip list` and check that `torch==2.7.0` and `torchaudio==2.7.0` are installed.
150 | 2. If they are, run `pip install ctranslate2==4.5.0`.
151 | 3. Export the following environment variable in your shell:
152 |
153 | ```shell
154 | export LD_LIBRARY_PATH="$(python - <<'PY'
155 | import importlib.util, pathlib, sys
156 | spec = importlib.util.find_spec('nvidia.cudnn')
157 | print(pathlib.Path(spec.origin).parent / 'lib')
158 | PY
159 | ):${LD_LIBRARY_PATH}"
160 | ```
161 |
162 | 4. To make the change permanent, run this bash command while your Python environment is activated:
163 |
164 | ```shell
165 | printf '\n# --- add cuDNN wheel dir ---\nexport LD_LIBRARY_PATH="$(python - <<'"'"'PY'"'"'\nimport importlib.util, pathlib, sys\nprint(pathlib.Path(importlib.util.find_spec("nvidia.cudnn").origin).parent / "lib")\nPY\n):${LD_LIBRARY_PATH}"\n' >> "$VIRTUAL_ENV/bin/activate"
166 | ```
167 |
168 | Finally, deactivate the environment and reactivate it to apply the changes.
169 |
170 | Find additional information on the [faster-whisper GitHub page](https://github.com/SYSTRAN/faster-whisper).
171 |
172 |
173 | ## Usage
174 |
175 | ### CLI
176 |
177 | ```shell
178 | $ whisply
179 |
180 | Usage: whisply [OPTIONS]
181 |
182 | WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!
183 |
184 | ╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
185 | │ --files -f TEXT Path to file, folder, URL or .list to process. [default: None] │
186 | │ --output_dir -o DIRECTORY Folder where transcripts should be saved. [default: transcriptions] │
187 | │ --device -d [auto|cpu|gpu|mps] Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4). │
188 | │ [default: auto] │
189 | │ --model -m TEXT Whisper model to use (List models via --list_models). [default: large-v3-turbo] │
190 | │ --lang -l TEXT Language of provided file(s) ("en", "de") (Default: auto-detection). │
191 | │ [default: None] │
192 | │ --annotate -a Enable speaker annotation (Saves .rttm | Default: False). │
193 | │ --num_speakers -num INTEGER Number of speakers to annotate (Default: auto-detection). [default: None] │
194 | │ --hf_token -hf TEXT HuggingFace Access token required for speaker annotation. [default: None] │
195 | │ --subtitle -s Create subtitles (Saves .srt, .vtt and .webvtt | Default: False). │
196 | │ --sub_length INTEGER Subtitle segment length in words. [default: 5] │
197 | │ --translate -t Translate transcription to English (Default: False). │
198 | │ --export -e [all|json|txt|rttm|vtt|webvtt|srt|html] Choose the export format. [default: all] │
199 | │ --verbose -v Print text chunks during transcription (Default: False). │
200 | │ --del_originals -del Delete original input files after file conversion. (Default: False) │
201 | │ --config PATH Path to configuration file. [default: None] │
202 | │ --post_correction -post PATH Path to YAML file for post-correction. [default: None] │
203 | │ --launch_app -app Launch the web app instead of running standard CLI commands. │
204 | │ --list_models List available models. │
205 | │ --install-completion Install completion for the current shell. │
206 | │ --show-completion Show completion for the current shell, to copy it or customize the installation. │
207 | │ --help Show this message and exit. │
208 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
209 | ```
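
A typical invocation that transcribes a folder of German recordings, creates subtitles and exports all formats could look like this (the paths are placeholders):

```shell
whisply --files ./recordings \
        --lang de \
        --model large-v3-turbo \
        --subtitle --sub_length 7 \
        --export all \
        --output_dir ./transcriptions
```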
210 |
211 | ### App
212 |
213 | Instead of running `whisply` from the CLI, you can start the web app:
214 |
215 | ```shell
216 | $ whisply --launch_app
217 | ```
218 |
219 | or:
220 |
221 | ```shell
222 | $ whisply -app
223 | ```
224 |
225 | Open the local URL in your browser after starting the app (**Note**: The URL might differ from system to system):
226 |
227 | ```shell
228 | * Running on local URL: http://127.0.0.1:7860
229 | ```
230 |
231 |
232 | ![whisply app](assets/whisply_app.png)
233 |
234 |
235 |
236 | ### Speaker annotation and diarization
237 |
238 | #### Requirements
239 |
240 | In order to annotate speakers using `--annotate` you need to provide a valid [HuggingFace](https://huggingface.co) access token using the `--hf_token` option. Additionally, you must accept the user conditions for both the `pyannote` segmentation model (version 3.0) and the `pyannote` speaker diarization model (version 3.1).
241 |
242 | For detailed instructions, refer to the *Requirements* section on the [pyannote model page on HuggingFace](https://huggingface.co/pyannote/speaker-diarization-3.1#requirements) and make sure that you complete steps *"2. Accept pyannote/segmentation-3.0 user conditions"*, *"3. Accept pyannote/speaker-diarization-3.1 user conditions"* and *"4. Create access token at hf.co/settings/tokens"*.
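
With the token in place, speaker annotation can be enabled like this (the file name and token are placeholders; the token can also be supplied via the `HF_TOKEN` environment variable):

```shell
whisply --files interview.mp4 --annotate --num_speakers 2 --hf_token "hf_..."
```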
243 |
244 | #### How speaker annotation works
245 |
246 | `whisply` uses [whisperX](https://github.com/m-bain/whisperX) for speaker diarization and annotation. Unlike the standard `Whisper` implementation, which returns chunk-level timestamps, `whisperX` can return word-level timestamps and annotate speakers word by word, which yields much more precise annotations.
247 |
248 | Out of the box `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"): `whisply` fixes those instances through timestamp approximation. Other known limitations of `whisperX` include:
249 |
250 | * inaccurate speaker diarization if multiple speakers talk at the same time
251 | * to provide word-level timestamps and annotations, `whisperX` uses language-specific alignment models; out of the box `whisperX` supports these languages: `en, fr, de, es, it, ja, zh, nl, uk, pt`.
252 |
253 | Refer to the [whisperX GitHub page](https://github.com/m-bain/whisperX) for more information.
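
To illustrate the approximation idea, here is a minimal sketch (not whisply's actual implementation) that fills a missing word timestamp by interpolating between the neighbouring words:

```python
def approximate_timestamps(words: list[dict]) -> list[dict]:
    """
    Fill missing word timestamps by interpolating between neighbours.

    Each word dict is assumed to look like
    {"word": "2024", "start": float | None, "end": float | None}.
    Simplified sketch, not whisply's internal logic.
    """
    for i, word in enumerate(words):
        if word.get("start") is None or word.get("end") is None:
            # Bound the gap by the previous word's end ...
            prev_end = words[i - 1]["end"] if i > 0 else 0.0
            # ... and the next word that still carries a start timestamp
            next_start = next(
                (w["start"] for w in words[i + 1:] if w.get("start") is not None),
                prev_end,
            )
            word["start"] = prev_end
            word["end"] = max(next_start, prev_end)
    return words
```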
254 |
255 | ### Post correction
256 |
257 | The `--post_correction` option allows you to correct recurring transcription errors in your files. It takes a `.yaml` file with the following structure as its argument:
258 |
259 | ```yaml
260 | # Single word corrections
261 | Gardamer: Gadamer
262 |
263 | # Pattern-based corrections
264 | patterns:
265 | - pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung'
266 | replacement: 'Klaus Tschira Stiftung'
267 | ```
268 |
269 | - **Single word corrections**: matches single words → `wrong word`: `correct word`
270 | - **Pattern-based corrections**: matches patterns → `(Cira|Cyra|Tira)` will match `Klaus-Cira-Stiftung`, `Klaus-Cyra-Stiftung` and `Klaus-Tira-Stiftung` and replace each of them with `Klaus Tschira Stiftung`
271 |
272 | Post correction will be applied to **all** export file formats you choose.
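
As an illustration, the corrections from such a file could be applied to a plain transcript roughly like this (a minimal sketch using `PyYAML` and `re`, not whisply's internal `post_correction` API):

```python
import re
import yaml  # PyYAML

def apply_corrections(text: str, yaml_path: str) -> str:
    """Apply single word and pattern-based corrections from a YAML file (sketch)."""
    with open(yaml_path, encoding="utf-8") as f:
        corrections = yaml.safe_load(f) or {}

    # Pattern-based corrections live under the "patterns" key
    patterns = corrections.pop("patterns", []) or []

    # Single word corrections: replace whole words only
    for wrong, right in corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)

    # Pattern-based corrections: the pattern is treated as a regular expression
    for entry in patterns:
        text = re.sub(entry["pattern"], entry["replacement"], text)

    return text
```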
273 |
274 | ### Batch processing
275 |
276 | Instead of providing a file, folder or URL via the `--files` option, you can pass a `.list` file containing a mix of files, folders and URLs for processing.
277 |
278 | Example:
279 |
280 | ```shell
281 | $ cat my_files.list
282 |
283 | video_01.mp4
284 | video_02.mp4
285 | ./my_files/
286 | https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
287 | ```
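
Every entry in the list is treated either as a URL (downloaded first via `yt-dlp`) or as a local file or folder. Below is a simplified sketch of that dispatch; whisply's own logic lives in `little_helper.FilePathProcessor`:

```python
from pathlib import Path

import validators  # also used by whisply to tell URLs from local paths

def read_list_file(list_path: str) -> tuple[list[str], list[Path]]:
    """Split a .list file into URLs and local paths (simplified sketch)."""
    urls: list[str] = []
    paths: list[Path] = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip empty lines
        if validators.url(entry):
            urls.append(entry)         # downloaded before transcription
        else:
            paths.append(Path(entry))  # file or folder, expanded later
    return urls, paths
```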
288 |
289 | #### Using config files for batch processing
290 |
291 | You can provide a `.json` config file via the `--config` option, which makes batch processing easy. An example config looks like this:
292 |
293 | ```jsonc
294 | {
295 | "files": "./files/my_files.list", # Path to your files
296 | "output_dir": "./transcriptions", # Output folder where transcriptions are saved
297 | "device": "auto", # AUTO, GPU, MPS or CPU
298 | "model": "large-v3-turbo", # Whisper model to use
299 | "lang": null, # Null for auto-detection or language codes ("en", "de", ...)
300 | "annotate": false, # Annotate speakers
301 | "num_speakers": null, # Number of speakers of the input file (null: auto-detection)
302 | "hf_token": "HuggingFace Access Token", # Your HuggingFace Access Token (needed for annotations)
303 | "subtitle": false, # Subtitle file(s)
304 | "sub_length": 10, # Length of each subtitle block in number of words
305 | "translate": false, # Translate to English
306 | "export": "txt", # Export .txts only
307 |   "verbose": false,                       # Print transcription segments while processing
308 | "del_originals": false, # Delete original input files after file conversion
309 | "post_correction": "my_corrections.yaml" # Apply post correction with specified patterns in .yaml
310 | }
311 | ```
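
Run `whisply` with the config file like this:

```shell
whisply --config ./examples/example_config.json
```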
312 |
313 | ## Citation
314 |
315 | ```bibtex
316 | @software{whisply,
317 |   author = {Schmidt, Thomas and Shigapov, Renat},
318 | title = {Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!},
319 | year = {2025},
320 | url = {https://github.com/tsmdt/whisply},
321 | note = {Accessed: 2025-05-29}
322 | }
323 | ```
324 |
--------------------------------------------------------------------------------
/assets/whisply.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/assets/whisply.png
--------------------------------------------------------------------------------
/assets/whisply_app.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/assets/whisply_app.png
--------------------------------------------------------------------------------
/examples/example_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "files": "./files/my_files.list",
3 | "output_dir": "./transcriptions",
4 | "device": "auto",
5 | "model": "large-v3-turbo",
6 | "lang": null,
7 | "annotate": false,
8 | "hf_token": "hf_.......",
9 | "translate": false,
10 | "subtitle": false,
11 | "sub_length": 10,
12 | "export": "txt",
13 | "verbose": false,
14 | "del_originals": false,
15 | "post_correction": "my_corrections.yaml"
16 | }
17 |
--------------------------------------------------------------------------------
/examples/post_correction.yaml:
--------------------------------------------------------------------------------
1 | # Single word corrections
2 | Gardamer: Gadamer
3 |
4 | # Pattern-based corrections
5 | patterns:
6 | - pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung'
7 | replacement: 'Klaus Tschira Stiftung'
8 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools>=61.0"]
3 | build-backend = "setuptools.build_meta"
4 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | name = whisply
3 | version = 0.10.5
4 | description = "Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!"
5 | long_description = file: README.md
6 | long_description_content_type = text/markdown
7 | author = Thomas Schmidt, Renat Shigapov
8 | license = MIT
9 | url = https://github.com/tsmdt/whisply
10 |
11 | [options]
12 | packages = find:
13 | python_requires = >=3.10
14 | install_requires =
15 | click<8.2.0
16 | typer==0.15.3
17 | numpy==2.0.2
18 | faster-whisper==1.1.1
19 | ffmpeg-python==0.2.0
20 | optimum==1.24.0
21 | pyannote.audio==3.3.2
22 | whisperx==3.3.4
23 | rich==13.7.1
24 | torch==2.7.0
25 | torchaudio==2.7.0
26 | transformers==4.50.0
27 | validators==0.28.1
28 | yt-dlp==2025.4.30
29 | gradio==5.29.0
30 |
31 | [options.entry_points]
32 | console_scripts =
33 | whisply = whisply.cli:run
--------------------------------------------------------------------------------
/whisply/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/whisply/__init__.py
--------------------------------------------------------------------------------
/whisply/app.py:
--------------------------------------------------------------------------------
1 | import gradio as gr
2 | import os
3 | import shutil
4 | import zipfile
5 | import tempfile
6 |
7 | from datetime import datetime
8 | from pathlib import Path
9 | from whisply import output_utils
10 | from whisply.app_helpers import CSS, LANGUAGES
11 |
12 |
13 | def zip_files(file_paths: list[str]) -> str | None:
14 | """
15 | Given a list of file paths, zip them up into a single archive
16 | and return the path to that archive.
17 | """
18 | if not file_paths:
19 | return None
20 |
21 | with tempfile.NamedTemporaryFile(delete=False, suffix=".zip") as tmp:
22 | zip_path = tmp.name
23 |
24 | with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
25 | for fpath in file_paths:
26 | arcname = os.path.basename(fpath)
27 | zipf.write(fpath, arcname=arcname)
28 |
29 | return zip_path
30 |
31 |
32 | def create_gradio_interface():
33 | """
34 | Main gradio interface.
35 | """
36 | def get_device() -> str:
37 | """
38 | Determine the computation device based on user preference and
39 | availability.
40 | """
41 | import torch
42 |
43 | if torch.cuda.is_available():
44 | device = 'cuda:0'
45 | elif torch.backends.mps.is_available():
46 | device = 'mps'
47 | else:
48 | device = 'cpu'
49 | return device
50 |
51 | def transcribe(file, model, device, language, options, hf_token, sub_length):
52 | from whisply.transcription import TranscriptionHandler
53 | from whisply import little_helper, models
54 |
55 | if not options:
56 | options = []
57 | annotate = 'Annotate Speakers' in options
58 | translate = 'Translate to English' in options
59 | subtitle = 'Generate Subtitles' in options
60 |
61 | if (annotate or subtitle) and not hf_token:
62 | hf_token = os.getenv('HF_TOKEN')
63 | if not hf_token:
64 | yield 'A HuggingFace Access Token is required for annotation or subtitling: https://huggingface.co/docs/hub/security-tokens', None
65 | return
66 |
67 | if file is None:
68 | yield "Please upload a file.", None
69 | return
70 |
71 | # If file is not a list, make it a list
72 | if not isinstance(file, list):
73 | file = [file]
74 |
75 | # Start the progress bar
76 | progress = gr.Progress()
77 | progress(0)
78 |
79 | try:
80 | # Total steps calculation
81 | steps_per_file = 5 # Number of steps per file
82 | total_steps = steps_per_file * len(file)
83 | current_step = 0
84 |
85 | # Save the uploaded file to a temporary directory
86 | temp_dir = './app_uploads'
87 | os.makedirs(temp_dir, exist_ok=True)
88 |
89 | temp_file_paths = []
90 | for uploaded_file in file:
91 | # Get the base name of the file to avoid issues with absolute paths
92 | temp_file_name = os.path.basename(uploaded_file.name)
93 | temp_file_path = os.path.join(temp_dir, temp_file_name)
94 |
95 | # Copy the file from Gradio's temp directory to our local directory
96 | shutil.copyfile(uploaded_file.name, temp_file_path)
97 | temp_file_paths.append(temp_file_path)
98 |
99 | # Adjust the device based on user selection
100 | if device == 'auto':
101 | device_selected = get_device()
102 | elif device == 'gpu':
103 | import torch
104 | if torch.cuda.is_available():
105 | device_selected = 'cuda:0'
106 | else:
107 | print("→ CUDA is not available. Falling back to auto device selection.")
108 | device_selected = get_device()
109 | else:
110 | device_selected = device
111 |
112 | # Handle export formats
113 | export_formats_map = {
114 | 'standard': ['json', 'txt'],
115 | 'annotate': ['rttm', 'txt', 'json', 'html'],
116 | 'subtitle': ['vtt', 'webvtt', 'srt', 'txt', 'json'],
117 | 'translate': ['txt', 'json']
118 | }
119 |
120 | export_formats_list = set(export_formats_map['standard'])
121 |
122 | if annotate:
123 | export_formats_list.update(export_formats_map['annotate'])
124 | if subtitle:
125 | export_formats_list.update(export_formats_map['subtitle'])
126 | if translate:
127 | export_formats_list.update(export_formats_map['translate'])
128 |
129 | export_formats_list = list(export_formats_list)
130 |
131 | # Create an instance of TranscriptionHandler with the provided parameters
132 | handler = TranscriptionHandler(
133 | base_dir='./app_transcriptions',
134 | model=model,
135 | device=device_selected,
136 | file_language=None if language == 'auto' else language,
137 | annotate=annotate,
138 | translate=translate,
139 | subtitle=subtitle,
140 | sub_length=int(sub_length) if subtitle else 5,
141 | hf_token=hf_token,
142 | verbose=False,
143 | export_formats=export_formats_list
144 | )
145 |
146 | # Initialize processed_files list
147 | handler.processed_files = []
148 | for idx, filepath in enumerate(temp_file_paths):
149 | filepath = Path(filepath)
150 |
151 | # Update progress
152 | current_step += 1
153 | progress(current_step / total_steps)
154 |
155 | # Create and set output_dir and output_filepath
156 | handler.output_dir = little_helper.set_output_dir(filepath, handler.base_dir)
157 | output_filepath = handler.output_dir / filepath.stem
158 |
159 | # Convert file format
160 | filepath, audio_array = little_helper.check_file_format(
161 | filepath=filepath,
162 | del_originals=False
163 | )
164 |
165 | # Update progress
166 | current_step += 1
167 | progress(current_step / total_steps)
168 |
169 | # Detect file language
170 | if not handler.file_language:
171 | handler.detect_language(filepath, audio_array)
172 |
173 | # Update progress
174 | current_step += 1
175 | progress(current_step / total_steps)
176 |
177 | # Transcription and speaker annotation
178 | if handler.device == 'mps':
179 | handler.model = models.set_supported_model(
180 | handler.model_provided,
181 | implementation='insane-whisper',
182 | translation=handler.translate
183 | )
184 | print(f'→ Using {handler.device.upper()} and 🚅 Insanely-Fast-Whisper with model "{handler.model}"')
185 | result_data = handler.transcribe_with_insane_whisper(filepath)
186 |
187 | elif handler.device in ['cpu', 'cuda:0']:
188 | if handler.annotate or handler.subtitle:
189 | handler.model = models.set_supported_model(
190 | handler.model_provided,
191 | implementation='whisperx',
192 | translation=handler.translate
193 | )
194 | print(f'→ Using {handler.device.upper()} and whisper🆇 with model "{handler.model}"')
195 | result_data = handler.transcribe_with_whisperx(filepath)
196 | else:
197 | handler.model = models.set_supported_model(
198 | handler.model_provided,
199 | implementation='faster-whisper',
200 | translation=handler.translate
201 | )
202 | print(f'→ Using {handler.device.upper()} and 🏃♀️➡️ Faster-Whisper with model "{handler.model}"')
203 | result_data = handler.transcribe_with_faster_whisper(filepath)
204 |
205 | # Update progress
206 | current_step += 1
207 | progress(current_step / total_steps)
208 |
209 | result = {
210 | 'id': f'file_00{idx + 1}',
211 | 'created': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
212 | 'input_filepath': str(Path(filepath).absolute()),
213 | 'output_filepath': str(Path(output_filepath).absolute()),
214 | 'written_files': None,
215 | 'device': handler.device,
216 | 'model': handler.model,
217 | 'transcription': result_data['transcription']['transcriptions'],
218 | }
219 |
220 | # Save results
221 | result['written_files'] = output_utils.OutputWriter().save_results(
222 | result=result,
223 | export_formats=handler.export_formats
224 | )
225 |
226 | # Update progress
227 | current_step += 1
228 | progress(current_step / total_steps)
229 |
230 | handler.processed_files.append(result)
231 |
232 | if not handler.file_language_provided:
233 | handler.file_language = None
234 |
235 | except Exception as e:
236 | print(f"→ Error during transcription: {e}")
237 | yield f"Transcription Error: {e}", None
238 |
239 | finally:
240 | progress(100)
241 |
242 | # Get the transcription results
243 | if handler and handler.processed_files:
244 | output_files_set = set()
245 | for processed_file in handler.processed_files:
246 | # Collect the paths of the generated files directly
247 | output_files = processed_file.get('written_files', [])
248 | output_files_set.update(output_files)
249 |
250 | output_files = sorted(list(output_files_set))
251 |
252 | yield output_files, output_files, gr.update(visible=True)
253 | else:
254 | yield "Transcription Error."
255 |
256 | def toggle_visibility(options):
257 | """
258 | Updates the visibility of conditional components based on selected options.
259 | """
260 | show_access_token = 'Annotate Speakers' in options
261 | show_subtitle_length = 'Generate Subtitles' in options
262 | return gr.update(visible=show_access_token), gr.update(visible=show_subtitle_length)
263 |
264 | # Theme
265 | theme = gr.themes.Citrus(
266 | primary_hue="emerald",
267 | neutral_hue="slate",
268 | spacing_size=gr.themes.sizes.spacing_sm,
269 | text_size="md",
270 | radius_size="sm",
271 | font=[gr.themes.GoogleFont('Open Sans', 'Roboto'), 'ui-sans-serif', 'system-ui', 'sans-serif'],
272 | font_mono=['Roboto Mono', 'ui-monospace', 'Consolas', 'monospace'],
273 | )
274 |
275 | # Build the Gradio Blocks interface
276 | with gr.Blocks(theme=theme, css=CSS) as app:
277 | gr.Markdown("# whisply 💬")
278 | gr.Markdown("""
279 | Transcribe, translate, annotate, and subtitle audio and video files with \
280 | OpenAI's Whisper ... fast!
281 | """)
282 |
283 | # File Upload and Model Selection
284 | with gr.Row():
285 | with gr.Column():
286 | uploaded_files = gr.File(label="Upload File(s)", file_count='multiple')
287 | with gr.Row():
288 | model_dropdown = gr.Dropdown(
289 | choices=[
290 | 'tiny',
291 | 'tiny-en',
292 | 'base',
293 | 'base-en',
294 | 'small',
295 | 'small-en',
296 | 'distil-small-en',
297 | 'medium',
298 | 'medium-en',
299 | 'distil-medium-en',
300 | 'large',
301 | 'large-v2',
302 | 'distil-large-v2',
303 | 'large-v3',
304 | 'distil-large-v3',
305 | 'large-v3-turbo'],
306 | label="Model",
307 | value='large-v3-turbo',
308 | info='Whisper model for the transcription.'
309 | )
310 | language_dropdown = gr.Dropdown(
311 | choices=sorted(LANGUAGES.keys()),
312 | label="Language",
313 | value='auto',
314 | info="**auto** = auto-detection"
315 | )
316 | with gr.Row():
317 | device_radio = gr.Radio(
318 | choices=['auto', 'cpu', 'gpu', 'mps'],
319 | label="Device",
320 | value='auto',
321 | info="**auto** = auto-detection | **gpu** = Nvidia GPUs | **mps** = Mac M1-M4"
322 | )
323 | with gr.Row():
324 | options_checkbox = gr.CheckboxGroup(
325 | choices=['Annotate Speakers', 'Translate to English', 'Generate Subtitles'],
326 | label="Options",
327 | value=[]
328 | )
329 | with gr.Row():
330 | access_token_text = gr.Text(
331 | label='HuggingFace Access Token (for annotation and subtitling)',
332 | info="Refer to **README.md** to set up the Access Token correctly.",
333 | value=None,
334 | lines=1,
335 | max_lines=1,
336 | visible=False
337 | )
338 | with gr.Row():
339 | subtitle_length_number = gr.Number(
340 | label="Subtitle Length (words)",
341 | value=5,
342 | info="""Subtitle segment length in words. \
343 | (Example: "10" will result in subtitles where each subtitle block has \
344 | exactly 10 words)""",
345 | visible=False
346 | )
347 |
348 | # Event Handler to Toggle Visibility
349 | options_checkbox.change(
350 | toggle_visibility,
351 | inputs=options_checkbox,
352 | outputs=[access_token_text, subtitle_length_number]
353 | )
354 |
355 | with gr.Column():
356 | outputs = gr.Files(label="Transcriptions")
357 | transcribed_files_state = gr.State([])
358 | transcribe_button = gr.Button("Transcribe")
359 |
360 | # Download button
361 | download_all_button = gr.Button(
362 | value="Download Transcripts",
363 | visible=False
364 | )
365 | download_all_button.click(
366 | zip_files,
367 | outputs=[outputs],
368 | inputs=[transcribed_files_state],
369 | )
370 |
371 | # Transcribe button
372 | transcribe_button.click(
373 | transcribe,
374 | inputs=[
375 | uploaded_files,
376 | model_dropdown,
377 | device_radio,
378 | language_dropdown,
379 | options_checkbox,
380 | access_token_text,
381 | subtitle_length_number
382 | ],
383 | outputs=[
384 | outputs,
385 | transcribed_files_state,
386 | download_all_button # "visible=True" after transcription
387 | ]
388 | )
389 |
390 | app.queue()
391 |
392 | return app
393 |
394 | def main():
395 | interface = create_gradio_interface()
396 | interface.launch()
397 |
398 | if __name__ == "__main__":
399 | main()
400 |
--------------------------------------------------------------------------------
/whisply/app_helpers.py:
--------------------------------------------------------------------------------
1 | CSS = """
2 | h1 {
3 | font-size: 36px;
4 | font-weight: 800;
5 | }
6 |
7 | .svelte-1ed2p3z {
8 | text-align: center;
9 | }
10 | """
11 |
12 | LANGUAGES = {
13 | "auto": "auto-detection",
14 | "en": "english",
15 | "zh": "chinese",
16 | "de": "german",
17 | "es": "spanish",
18 | "ru": "russian",
19 | "ko": "korean",
20 | "fr": "french",
21 | "ja": "japanese",
22 | "pt": "portuguese",
23 | "tr": "turkish",
24 | "pl": "polish",
25 | "ca": "catalan",
26 | "nl": "dutch",
27 | "ar": "arabic",
28 | "sv": "swedish",
29 | "it": "italian",
30 | "id": "indonesian",
31 | "hi": "hindi",
32 | "fi": "finnish",
33 | "vi": "vietnamese",
34 | "he": "hebrew",
35 | "uk": "ukrainian",
36 | "el": "greek",
37 | "ms": "malay",
38 | "cs": "czech",
39 | "ro": "romanian",
40 | "da": "danish",
41 | "hu": "hungarian",
42 | "ta": "tamil",
43 | "no": "norwegian",
44 | "th": "thai",
45 | "ur": "urdu",
46 | "hr": "croatian",
47 | "bg": "bulgarian",
48 | "lt": "lithuanian",
49 | "la": "latin",
50 | "mi": "maori",
51 | "ml": "malayalam",
52 | "cy": "welsh",
53 | "sk": "slovak",
54 | "te": "telugu",
55 | "fa": "persian",
56 | "lv": "latvian",
57 | "bn": "bengali",
58 | "sr": "serbian",
59 | "az": "azerbaijani",
60 | "sl": "slovenian",
61 | "kn": "kannada",
62 | "et": "estonian",
63 | "mk": "macedonian",
64 | "br": "breton",
65 | "eu": "basque",
66 | "is": "icelandic",
67 | "hy": "armenian",
68 | "ne": "nepali",
69 | "mn": "mongolian",
70 | "bs": "bosnian",
71 | "kk": "kazakh",
72 | "sq": "albanian",
73 | "sw": "swahili",
74 | "gl": "galician",
75 | "mr": "marathi",
76 | "pa": "punjabi",
77 | "si": "sinhala",
78 | "km": "khmer",
79 | "sn": "shona",
80 | "yo": "yoruba",
81 | "so": "somali",
82 | "af": "afrikaans",
83 | "oc": "occitan",
84 | "ka": "georgian",
85 | "be": "belarusian",
86 | "tg": "tajik",
87 | "sd": "sindhi",
88 | "gu": "gujarati",
89 | "am": "amharic",
90 | "yi": "yiddish",
91 | "lo": "lao",
92 | "uz": "uzbek",
93 | "fo": "faroese",
94 | "ht": "haitian creole",
95 | "ps": "pashto",
96 | "tk": "turkmen",
97 | "nn": "nynorsk",
98 | "mt": "maltese",
99 | "sa": "sanskrit",
100 | "lb": "luxembourgish",
101 | "my": "myanmar",
102 | "bo": "tibetan",
103 | "tl": "tagalog",
104 | "mg": "malagasy",
105 | "as": "assamese",
106 | "tt": "tatar",
107 | "haw": "hawaiian",
108 | "ln": "lingala",
109 | "ha": "hausa",
110 | "ba": "bashkir",
111 | "jw": "javanese",
112 | "su": "sundanese",
113 | "yue": "cantonese",
114 | }
--------------------------------------------------------------------------------
/whisply/cli.py:
--------------------------------------------------------------------------------
1 | import os
2 | import typer
3 | import warnings
4 | from pathlib import Path
5 | from typing import Optional, List
6 | from rich import print
7 | from whisply import output_utils
8 | from whisply import post_correction as post
9 | from whisply.output_utils import ExportFormats
10 | from whisply.little_helper import DeviceChoice
11 |
12 | warnings.filterwarnings("ignore", category=FutureWarning)
13 | warnings.filterwarnings("ignore", category=UserWarning)
14 |
15 | cli_app = typer.Typer()
16 |
17 | @cli_app.command(no_args_is_help=True)
18 | def main(
19 | files: Optional[List[str]] = typer.Option(
20 | None,
21 | "--files",
22 | "-f",
23 | help="Path to file, folder, URL or .list to process.",
24 | ),
25 | output_dir: Path = typer.Option(
26 | Path("./transcriptions"),
27 | "--output_dir",
28 | "-o",
29 | file_okay=False,
30 | dir_okay=True,
31 | writable=True,
32 | readable=True,
33 | resolve_path=True,
34 | help="Folder where transcripts should be saved.",
35 | ),
36 | device: DeviceChoice = typer.Option(
37 | DeviceChoice.AUTO,
38 | "--device",
39 | "-d",
40 | help="Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4).",
41 | ),
42 | model: str = typer.Option(
43 | "large-v3-turbo",
44 | "--model",
45 | "-m",
46 | help='Whisper model to use (List models via --list_models).',
47 | ),
48 | lang: Optional[str] = typer.Option(
49 | None,
50 | "--lang",
51 | "-l",
52 | help='Language of provided file(s) ("en", "de") (Default: auto-detection).',
53 | ),
54 | annotate: bool = typer.Option(
55 | False,
56 | "--annotate",
57 | "-a",
58 | help="Enable speaker annotation (Saves .rttm | Default: False).",
59 | ),
60 | num_speakers: Optional[int] = typer.Option(
61 | None,
62 | "--num_speakers",
63 | "-num",
64 | help="Number of speakers to annotate (Default: auto-detection).",
65 | ),
66 | hf_token: Optional[str] = typer.Option(
67 | None,
68 | "--hf_token",
69 | "-hf",
70 | help="HuggingFace Access token required for speaker annotation.",
71 | ),
72 | subtitle: bool = typer.Option(
73 | False,
74 | "--subtitle",
75 | "-s",
76 | help="Create subtitles (Saves .srt, .vtt and .webvtt | Default: False).",
77 | ),
78 | sub_length: int = typer.Option(
79 | 5,
80 | "--sub_length",
81 | help="Subtitle segment length in words."
82 | ),
83 | translate: bool = typer.Option(
84 | False,
85 | "--translate",
86 | "-t",
87 | help="Translate transcription to English (Default: False).",
88 | ),
89 | export_format: ExportFormats = typer.Option(
90 | ExportFormats.ALL,
91 | "--export",
92 | "-e",
93 | help="Choose the export format."
94 | ),
95 | verbose: bool = typer.Option(
96 | False,
97 | "--verbose",
98 | "-v",
99 | help="Print text chunks during transcription (Default: False).",
100 | ),
101 | del_originals: bool = typer.Option(
102 | False,
103 | "--del_originals",
104 | "-del",
105 | help="Delete original input files after file conversion. (Default: False)",
106 | ),
107 | config: Optional[Path] = typer.Option(
108 | None,
109 | "--config",
110 | help="Path to configuration file.",
111 | ),
112 | post_correction: Optional[Path] = typer.Option(
113 | None,
114 | "--post_correction",
115 | "-post",
116 | help="Path to YAML file for post-correction.",
117 | ),
118 | launch_app: bool = typer.Option(
119 | False,
120 | "--launch_app",
121 | "-app",
122 | help="Launch the web app instead of running standard CLI commands.",
123 | ),
124 | list_models: bool = typer.Option(
125 | False,
126 | "--list_models",
127 | help="List available models.",
128 | )
129 | ):
130 | """
131 | WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!
132 | """
133 | from whisply import little_helper, transcription, models
134 |
135 | # Start the gradio web app
136 | if launch_app:
137 | from whisply.app import main as run_gradio_app
138 | run_gradio_app()
139 | raise typer.Exit()
140 |
141 | # Load configuration from config.json if provided
142 | if config:
143 | config_data = little_helper.load_config(config)
144 |         files = (files or Path(config_data.get("files"))) if config_data.get("files") else files
145 | output_dir = Path(config_data.get("output_dir")) if config_data.get("output_dir") else output_dir
146 | device = DeviceChoice(config_data.get("device", device.value))
147 | model = config_data.get("model", model)
148 | lang = config_data.get("lang", lang)
149 | annotate = config_data.get("annotate", annotate)
150 | num_speakers = config_data.get("num_speakers", num_speakers)
151 | translate = config_data.get("translate", translate)
152 | hf_token = config_data.get("hf_token", hf_token)
153 | subtitle = config_data.get("subtitle", subtitle)
154 | sub_length = config_data.get("sub_length", sub_length)
155 | verbose = config_data.get("verbose", verbose)
156 | del_originals = config_data.get("del_originals", del_originals)
157 | post_correction = config_data.get("post_correction", post_correction)
158 |
159 | # Print available models
160 | if list_models:
161 | available_models = "Available models:\n... "
162 | available_models += '\n... '.join(models.WHISPER_MODELS.keys())
163 | print(f"{available_models}")
164 | raise typer.Exit()
165 |
166 | # Check if provided model is available
167 | if not models.ensure_model(model):
168 | msg = f"""→ Model "{model}" is not available.\n→ Available models:\n... """
169 | msg += '\n... '.join(models.WHISPER_MODELS.keys())
170 | print(f"{msg}")
171 | raise typer.Exit()
172 |
173 | # Check for HuggingFace Access Token if speaker annotation is enabled
174 | if annotate and not hf_token:
175 | hf_token = os.getenv('HF_TOKEN')
176 | if not hf_token:
177 | print('→ Please provide a HuggingFace access token (--hf_token / -hf) to enable speaker annotation.')
178 | raise typer.Exit()
179 |
180 | # Determine the computation device
181 | device_str = little_helper.get_device(device=device)
182 |
183 | # Determine the ExportFormats
184 | export_formats = output_utils.determine_export_formats(
185 | export_format,
186 | annotate, subtitle
187 | )
188 |
189 | # Load corrections if post_correction is provided
190 | if post_correction:
191 | corrections = post.load_correction_list(post_correction)
192 |
193 | # Instantiate TranscriptionHandler
194 | service = transcription.TranscriptionHandler(
195 | base_dir=output_dir,
196 | device=device_str,
197 | model=model,
198 | file_language=lang,
199 | annotate=annotate,
200 | num_speakers=num_speakers,
201 | translate=translate,
202 | hf_token=hf_token,
203 | subtitle=subtitle,
204 | sub_length=sub_length,
205 | verbose=verbose,
206 | del_originals=del_originals,
207 | corrections=corrections if post_correction else None,
208 | export_formats=export_formats
209 | )
210 | # Process files
211 | service.process_files(files)
212 |
213 | def run():
214 | cli_app()
215 |
216 | if __name__ == "__main__":
217 | run()
218 |
--------------------------------------------------------------------------------
/whisply/diarize_utils.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import torch
3 | import numpy as np
4 | from torchaudio import functional as F
5 | from pyannote.audio import Pipeline
6 | from transformers.pipelines.audio_utils import ffmpeg_read
7 |
8 | # Code lifted from https://github.com/huggingface/speechbox/blob/main/src/speechbox/diarize.py
9 | # and from https://github.com/m-bain/whisperX/blob/main/whisperx/diarize.py
10 |
11 | def preprocess_inputs(inputs):
12 | if isinstance(inputs, str):
13 | if inputs.startswith("http://") or inputs.startswith("https://"):
14 | # We need to actually check for a real protocol, otherwise it's impossible to use a local file
15 | # like http_huggingface_co.png
16 | inputs = requests.get(inputs).content
17 | else:
18 | with open(inputs, "rb") as f:
19 | inputs = f.read()
20 |
21 | if isinstance(inputs, bytes):
22 | inputs = ffmpeg_read(inputs, 16000)
23 |
24 | if isinstance(inputs, dict):
25 | # Accepting `"array"` which is the key defined in `datasets` for better integration
26 | if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
27 | raise ValueError(
28 | "When passing a dictionary to ASRDiarizePipeline, the dict needs to contain a "
29 | '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, '
30 | "containing the sampling_rate associated with that array"
31 | )
32 |
33 | _inputs = inputs.pop("raw", None)
34 | if _inputs is None:
35 | # Remove path which will not be used from `datasets`.
36 | inputs.pop("path", None)
37 | _inputs = inputs.pop("array", None)
38 | in_sampling_rate = inputs.pop("sampling_rate")
39 | inputs = _inputs
40 | if in_sampling_rate != 16000:
41 | inputs = F.resample(
42 | torch.from_numpy(inputs), in_sampling_rate, 16000
43 | ).numpy()
44 |
45 | if not isinstance(inputs, np.ndarray):
46 | raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
47 | if len(inputs.shape) != 1:
48 | raise ValueError(
49 | "We expect a single channel audio input for ASRDiarizePipeline"
50 | )
51 |
52 | # diarization model expects float32 torch tensor of shape `(channels, seq_len)`
53 | diarizer_inputs = torch.from_numpy(inputs).float()
54 | diarizer_inputs = diarizer_inputs.unsqueeze(0)
55 |
56 | return inputs, diarizer_inputs
57 |
58 | def diarize_audio(
59 | diarizer_inputs,
60 | diarization_pipeline,
61 | num_speakers,
62 | min_speakers,
63 | max_speakers
64 | ) -> dict:
65 | diarization = diarization_pipeline(
66 | {"waveform": diarizer_inputs, "sample_rate": 16000},
67 | num_speakers=num_speakers,
68 | min_speakers=min_speakers,
69 | max_speakers=max_speakers,
70 | )
71 |
72 | segments = []
73 | for segment, track, label in diarization.itertracks(yield_label=True):
74 | segments.append(
75 | {
76 | "segment": {"start": segment.start, "end": segment.end},
77 | "track": track,
78 | "label": label,
79 | }
80 | )
81 |
82 | # diarizer output may contain consecutive segments from the same
83 | # speaker (e.g. {(0 -> 1, speaker_1), (1 -> 1.5, speaker_1), ...})
84 | # we combine these segments to give overall timestamps for each
85 | # speaker's turn (e.g. {(0 -> 1.5, speaker_1), ...})
86 | new_segments = []
87 | prev_segment = cur_segment = segments[0]
88 |
89 | for i in range(1, len(segments)):
90 | cur_segment = segments[i]
91 |
92 | # check if we have changed speaker ("label")
93 | if cur_segment["label"] != prev_segment["label"] and i < len(segments):
94 | # add the start/end times for the super-segment to the new list
95 | new_segments.append(
96 | {
97 | "segment": {
98 | "start": prev_segment["segment"]["start"],
99 | "end": cur_segment["segment"]["start"],
100 | },
101 | "speaker": prev_segment["label"],
102 | }
103 | )
104 | prev_segment = segments[i]
105 |
106 | # add the last segment(s) if there was no speaker change
107 | new_segments.append(
108 | {
109 | "segment": {
110 | "start": prev_segment["segment"]["start"],
111 | "end": cur_segment["segment"]["end"],
112 | },
113 | "speaker": prev_segment["label"],
114 | }
115 | )
116 |
117 | return new_segments
118 |
119 |
120 | def post_process_segments_and_transcripts(
121 | new_segments,
122 | transcript,
123 | group_by_speaker
124 | ) -> list:
125 | segmented_preds = []
126 | transcript_idx = 0
127 | num_chunks = len(transcript)
128 |
129 | # Iterate through each diarization segment and assign transcript chunks
130 | # whose end timestamp falls within the segment
131 | for segment in new_segments:
132 | seg_start = segment["segment"]["start"]
133 | seg_end = segment["segment"]["end"]
134 | segment_chunks = []
135 |
136 | # Collect transcript chunks until the chunk's end timestamp exceeds
137 | # the diarization segment's end
138 | while (transcript_idx < num_chunks
139 | and
140 | transcript[transcript_idx]["timestamp"][1] <= seg_end):
141 | segment_chunks.append(transcript[transcript_idx])
142 | transcript_idx += 1
143 |
144 | # If no transcript chunks were found for this segment, continue
145 | # to next segment
146 | if not segment_chunks:
147 | continue
148 |
149 | if group_by_speaker:
150 | # Combine the text from all transcript chunks within this segment
151 | text = "".join(chunk["text"] for chunk in segment_chunks)
152 | segmented_preds.append({
153 | "speaker": segment["speaker"],
154 | "text": text,
155 | "timestamp": (
156 | segment_chunks[0]["timestamp"][0],
157 | segment_chunks[-1]["timestamp"][1]
158 | )
159 | })
160 | else:
161 | # Assign the speaker label to each transcript chunk in the segment
162 | for chunk in segment_chunks:
163 | chunk_copy = chunk.copy()
164 | chunk_copy["speaker"] = segment["speaker"]
165 | segmented_preds.append(chunk_copy)
166 |
167 | return segmented_preds
168 |
169 | def diarize(outputs, **kwargs):
170 | diarization_pipeline = Pipeline.from_pretrained(
171 | checkpoint_path=kwargs['diarization_model'],
172 | use_auth_token=kwargs['hf_token'],
173 | )
174 | diarization_pipeline.to(torch.device("mps"))
175 |
176 | _, diarizer_inputs = preprocess_inputs(
177 | inputs=kwargs['file_name']
178 | )
179 |
180 | segments = diarize_audio(
181 | diarizer_inputs,
182 | diarization_pipeline,
183 | kwargs['num_speakers'],
184 | kwargs['min_speakers'],
185 | kwargs['max_speakers']
186 | )
187 |
188 | return post_process_segments_and_transcripts(
189 | segments, outputs["chunks"], group_by_speaker=False
190 | )
191 |
--------------------------------------------------------------------------------
/whisply/download_utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import logging
3 | import yt_dlp as url_downloader
4 |
5 | from pathlib import Path
6 | from datetime import datetime
7 |
8 | from whisply import little_helper
9 |
10 |
11 | # Set logging configuration
12 | logger = logging.getLogger('download_utils')
13 | logger.setLevel(logging.DEBUG)
14 |
15 |
16 | def download_url(url: str, downloads_dir: Path) -> Path:
17 | """
18 | Downloads a media file from a specified URL, typically a YouTube URL, extracting audio
19 | in WAV format, and then renames the file based on the media title.
20 |
21 | The function first ensures the downloads directory exists, then initiates a download
22 |     using yt-dlp with specific options set for audio quality and format. After the
23 | download, it extracts the video's information without downloading it again to
24 | rename the file more meaningfully based on the video title. Special characters
25 | in the title are replaced with underscores, and unnecessary leading or trailing
26 | underscores are removed.
27 |
28 | Args:
29 | url (str): The URL of the video to download.
30 | downloads_dir (Path): The directory path where the downloaded file should be stored.
31 |
32 | Returns:
33 | Path: A path object pointing to the renamed downloaded file. If there is an error
34 | during the download or file processing, returns None.
35 |
36 | Raises:
37 | Exception: Outputs an error message to the console if the download fails.
38 |
39 | Examples:
40 | >>> download_url("https://www.youtube.com/watch?v=example", Path("/downloads"))
41 | Path('/downloads/example.wav')
42 | """
43 | little_helper.ensure_dir(downloads_dir)
44 |
45 | temp_filename = f"temp_{datetime.now().strftime('%Y%m%d_%H_%M_%S')}"
46 | options = {
47 | 'format': 'bestaudio/best',
48 | 'postprocessors': [{'key': 'FFmpegExtractAudio',
49 | 'preferredcodec': 'wav',
50 | 'preferredquality': '192'}],
51 | 'outtmpl': f'{downloads_dir}/{temp_filename}.%(ext)s'
52 | }
53 | try:
54 | with url_downloader.YoutubeDL(options) as ydl:
55 | logger.debug(f"Downloading {url}")
56 |
57 | # Download url
58 | ydl.download([url])
59 | video_info = ydl.extract_info(url, download=False)
60 | downloaded_file = list(downloads_dir.glob(f'{temp_filename}*'))[0]
61 | logger.debug(f"Download complete for {downloaded_file}")
62 |
63 | # Normalize title
64 | new_filename = re.sub(r'\W+', '_', video_info.get('title', 'downloaded_video'))
65 |
66 |             # Remove leading and trailing underscores
67 | if new_filename.startswith('_'):
68 | new_filename = new_filename[1:]
69 | if new_filename.endswith('_'):
70 | new_filename = new_filename[:-1]
71 |
72 | # Rename the file
73 | renamed_file = downloaded_file.rename(f"{downloads_dir}/{new_filename}{downloaded_file.suffix}")
74 | logger.debug(f"Renamed downloaded file to {renamed_file}")
75 | return Path(renamed_file)
76 |
77 | except Exception as e:
78 | print(f'Error downloading {url}: {e}')
79 | return None
80 |
--------------------------------------------------------------------------------
/whisply/little_helper.py:
--------------------------------------------------------------------------------
1 | import re
2 | import os
3 | import json
4 | import logging
5 | import ffmpeg
6 | import validators
7 | import numpy as np
8 |
9 | from enum import Enum
10 | from pathlib import Path
11 | from typing import Callable, Any, List
12 | from rich import print
13 | from rich.progress import Progress, TimeElapsedColumn, TextColumn, SpinnerColumn
14 | from whisply import download_utils
15 |
16 | # Set logging configuration
17 | logger = logging.getLogger('little_helper')
18 | logger.setLevel(logging.INFO)
19 |
20 |
21 | class DeviceChoice(str, Enum):
22 | AUTO = 'auto'
23 | CPU = 'cpu'
24 | GPU = 'gpu'
25 | MPS = 'mps'
26 |
27 |
28 | def get_device(device: DeviceChoice = DeviceChoice.AUTO) -> str:
29 | """
30 | Determine the computation device based on user preference and
31 | availability.
32 | """
33 | import torch
34 |
35 | if device == DeviceChoice.AUTO:
36 | if torch.cuda.is_available():
37 | device = 'cuda:0'
38 | elif torch.backends.mps.is_available():
39 | device = 'mps'
40 | else:
41 | device = 'cpu'
42 | elif device == DeviceChoice.GPU:
43 | if torch.cuda.is_available():
44 | device = 'cuda:0'
45 | else:
46 | print(f"[blue1]→ NVIDIA GPU not available. Using CPU.")
47 | device = 'cpu'
48 | elif device == DeviceChoice.MPS:
49 | if torch.backends.mps.is_available():
50 | device = 'mps'
51 | else:
52 | print(f"[blue1]→ MPS not available. Using CPU.")
53 | device = 'cpu'
54 | elif device == DeviceChoice.CPU:
55 | device = 'cpu'
56 | else:
57 | device = 'cpu'
58 | return device
59 |
60 |
61 | class FilePathProcessor:
62 | """
63 | Utility class for validating various filepaths.
64 | """
65 | def __init__(self, file_formats: List[str]):
66 | self.file_formats = [fmt.lower() for fmt in file_formats]
67 | self.filepaths: List[Path] = []
68 |
69 | def get_filepaths(self, filepath: str):
70 | """
71 | Processes the provided filepath which can be a URL, a single file, a directory,
72 | or a .list file containing multiple paths/URLs. It validates each input, downloads
73 | URLs if necessary, and accumulates valid file paths for further processing.
74 | """
75 | path = Path(filepath).expanduser().resolve()
76 |
77 | try:
78 | # Handle URL
79 | if validators.url(filepath):
80 | logging.info(f"Processing URL: {filepath}")
81 | downloaded_path = download_utils.download_url(
82 | filepath,
83 | downloads_dir=Path('./downloads')
84 | )
85 | if downloaded_path:
86 | self.filepaths.append(downloaded_path)
87 | else:
88 | logging.error(f"Failed to download URL: {filepath}")
89 | print(f"→ Failed to download URL: {filepath}")
90 | return
91 |
92 | # Handle .list file
93 | elif path.suffix.lower() == '.list':
94 | if not path.is_file():
95 | logging.error(f'The .list file "{path}" does not exist or is not a file.')
96 | print(f'→ The .list file "{path}" does not exist or is not a file.')
97 | return
98 |
99 | logging.info(f"Processing .list file: {path}")
100 | with path.open('r', encoding='utf-8') as file:
101 | lpaths = set()
102 | for line in file:
103 | lpath = line.strip()
104 | if not lpath:
105 | continue
106 | lpaths.add(lpath)
107 |
108 | for lpath in lpaths:
109 | if validators.url(lpath):
110 | downloaded_path = download_utils.download_url(
111 | lpath,
112 | downloads_dir=Path('./downloads')
113 | )
114 | if downloaded_path:
115 | self.filepaths.append(downloaded_path)
116 | else:
117 | print(f'→ Failed to download URL: {lpath}')
118 | else:
119 | self._process_path(lpath)
120 | return
121 |
122 | # Handle single file or directory
123 | else:
124 | self._process_path(path)
125 |
126 | except Exception as e:
127 | logging.exception(f"An unexpected error occurred while processing '{filepath}': {e}")
128 | return
129 |
130 | # Remove duplicates by converting to a set of resolved absolute paths
131 | unique_filepaths = set(p.resolve() for p in self.filepaths)
132 | self.filepaths = list(unique_filepaths)
133 |
134 | # Filter out files that have already been converted
135 | self._filter_converted_files()
136 |
137 | # Final check to ensure there are files to process
138 | if not self.filepaths:
139 | logging.warning(f'No valid files found for processing. Please check the provided path: "{filepath}".')
140 | print(f'→ No valid files found for processing. Please check the provided path: "{filepath}".')
141 | else:
142 | logging.info(f"Total valid files to process: {len(self.filepaths)}")
143 |
144 | def _process_path(self, path_input: str | Path):
145 | """
146 | Processes a single path input, which can be a file or a directory.
147 | """
148 | path = Path(path_input).expanduser().resolve()
149 |
150 | if path.is_file():
151 | if path.suffix.lower() in self.file_formats:
152 | logging.info(f"Adding file: {path}")
153 | normalized_path = self._normalize_filepath(path)
154 | self.filepaths.append(normalized_path)
155 | else:
156 | logging.warning(f'File "{path}" has unsupported format and will be skipped.')
157 | print(f'→ File "{path}" has unsupported format and will be skipped.')
158 | elif path.is_dir():
159 | logging.info(f"Processing directory: {path}")
160 | for file_format in self.file_formats:
161 | for file in path.rglob(f'*{file_format}'):
162 | if file.is_file():
163 | logging.debug(f"Found file: {file}")
164 | normalized_path = self._normalize_filepath(file)
165 | self.filepaths.append(normalized_path)
166 | else:
167 | logging.error(f'Path "{path}" does not exist or is not accessible.')
168 | print(f'→ Path "{path}" does not exist or is not accessible.')
169 |
170 | def _normalize_filepath(self, filepath: Path) -> Path:
171 | """
172 |     Normalizes the filepath by replacing non-word characters with underscores,
173 |     collapsing repeated underscores, stripping leading/trailing underscores, and renaming the file on disk accordingly.
174 | """
175 | new_filename = re.sub(r'\W+', '_', filepath.stem)
176 | new_filename = new_filename.strip('_')
177 | new_filename = re.sub(r'_+', '_', new_filename)
178 |
179 | suffix = filepath.suffix.lower()
180 |
181 | # Construct the new path
182 | new_path = filepath.parent / f"{new_filename}{suffix}"
183 |
184 | # Rename the file
185 | filepath.rename(new_path)
186 |
187 | return new_path.resolve()
188 |
189 | def _filter_converted_files(self):
190 | """
191 | Removes files that have already been converted to avoid redundant processing.
192 | """
193 | converted_suffix = '_converted.wav'
194 | original_filepaths = []
195 | converted_filepaths = set()
196 |
197 | for fp in self.filepaths:
198 | if fp.name.endswith(converted_suffix):
199 | converted_filepaths.add(fp)
200 | else:
201 | original_filepaths.append(fp)
202 |
203 | # Remove originals if their converted version exists
204 | filtered_filepaths = [
205 | fp for fp in original_filepaths
206 | if not (fp.with_name(fp.stem + converted_suffix) in converted_filepaths)
207 | ]
208 |
209 |         # Extend the filtered paths with the converted paths
210 | filtered_filepaths.extend(converted_filepaths)
211 |
212 | removed_count = len(self.filepaths) - len(filtered_filepaths)
213 | if removed_count > 0:
214 | logging.info(f"Removed {removed_count} files already converted.")
215 | self.filepaths = filtered_filepaths
216 |
217 | def ensure_dir(dir: Path) -> Path:
218 | if not dir.exists():
219 | dir.mkdir(parents=True)
220 | return dir
221 |
222 | def set_output_dir(filepath: Path, base_dir: Path) -> Path:
223 | output_dir = base_dir / filepath.stem
224 | ensure_dir(output_dir)
225 | return output_dir
226 |
227 | def return_valid_fileformats() -> list[str]:
228 | return [
229 | '.mp3',
230 | '.wav',
231 | '.m4a',
232 | '.aac',
233 | '.flac',
234 | '.ogg',
235 | '.mkv',
236 | '.mov',
237 | '.mp4',
238 | '.avi',
239 | '.mpeg',
240 | '.vob'
241 | ]
242 |
243 | def load_audio_ffmpeg(filepath: str) -> np.ndarray:
244 | try:
245 | out, _ = (
246 | ffmpeg
247 | .input(filepath)
248 | .output('pipe:', format='f32le', acodec='pcm_f32le', ac=1, ar='16000')
249 | .run(capture_stdout=True, capture_stderr=True)
250 | )
251 | except ffmpeg.Error as e:
252 | raise RuntimeError(f"Error loading audio with ffmpeg: {e.stderr.decode()}") from e
253 | return np.frombuffer(out, np.float32)
254 |
255 | def check_file_format(
256 | filepath: Path,
257 | del_originals: bool = True
258 | ) -> tuple[Path, np.ndarray]:
259 | """
260 | Checks the format of an audio file and converts it if it doesn't meet specified criteria.
261 | Then, loads the audio into a 1D NumPy array.
262 |
263 | The function uses `ffmpeg` to probe the metadata of an audio file at the given `filepath`.
264 | It checks if the audio stream meets the following criteria:
265 | - Codec name: 'pcm_s16le'
266 | - Sample rate: 16000 Hz
267 | - Number of channels: 1 (mono)
268 |
269 | If the audio stream does not meet these criteria, the function attempts to convert the file
270 | to meet the required format and saves the converted file with a '_converted.wav' suffix in the same directory.
271 |     After a successful conversion, it deletes the original file if del_originals is True.
272 |
273 | Finally, it loads the audio (original or converted) as a 1D NumPy array and returns it.
274 |
275 | Args:
276 |         filepath (Path): The path to the audio file to be checked and potentially converted.
277 |         del_originals (bool): If True (default), delete the original file after a successful conversion.
278 | Returns:
279 | filepath (Path): filepath of the checked and / or converted audio file.
280 | np.ndarray: 1D NumPy array of the audio data.
281 | """
282 | # Define the converted file path
283 | new_filepath = filepath.with_name(f"{filepath.stem}_converted.wav")
284 |
285 | converted = False
286 |
287 | if new_filepath.exists():
288 | target_filepath = new_filepath
289 | converted = True
290 | else:
291 | try:
292 | # Probe the audio file for stream information
293 | probe = ffmpeg.probe(str(filepath))
294 | audio_streams = [stream for stream in probe['streams'] if stream['codec_type'] == 'audio']
295 |
296 | if not audio_streams:
297 | raise ValueError(f"→ No audio stream found for {filepath}. Please check if the file you have provided contains audio content.")
298 |
299 | audio_stream = audio_streams[0]
300 | codec_name = audio_stream.get('codec_name')
301 | sample_rate = int(audio_stream.get('sample_rate', 0))
302 | channels = int(audio_stream.get('channels', 0))
303 |
304 | # Check if the audio stream meets the criteria
305 | if codec_name != 'pcm_s16le' or sample_rate != 16000 or channels != 1:
306 | try:
307 | # Convert the file and show progress
308 | run_with_progress(
309 | description=(
310 | f"[orchid]→ Converting file to .wav: {filepath.name}"
311 | ),
312 | task=lambda: convert_file_format(
313 | old_filepath=filepath,
314 | new_filepath=new_filepath
315 | )
316 | )
317 | target_filepath = new_filepath
318 | converted = True
319 | except Exception as e:
320 | raise RuntimeError(
321 | f"→ An error occurred while converting {filepath}: {e}"
322 | )
323 | else:
324 | # If already in correct format, use the original file
325 | target_filepath = filepath
326 |
327 | except ffmpeg.Error as e:
328 | print(f"→ Error running ffprobe: {e}")
329 |         print("→ You may have provided an unsupported file type. "
330 |               "Please check 'whisply --list_formats' for all supported formats.")
331 |
332 | try:
333 | audio_array = load_audio_ffmpeg(str(target_filepath))
334 | except Exception as e:
335 | raise RuntimeError(f"Failed to load audio from {target_filepath}: {e}") from e
336 |
337 |     # If a conversion occurred and del_originals is set, delete the original file
338 | if (converted and del_originals) and target_filepath != filepath:
339 | try:
340 | os.remove(filepath)
341 | except OSError as e:
342 | print(f"Warning: {e}")
343 |
344 | return Path(target_filepath), audio_array
345 |
346 | def convert_file_format(old_filepath: str, new_filepath: str):
347 | """
348 |     Converts an audio or video file to a 16 kHz mono PCM WAV file using the ffmpeg library.
349 | """
350 | (
351 | ffmpeg
352 | .input(str(old_filepath))
353 | .output(str(new_filepath),
354 | acodec='pcm_s16le', # Audio codec: PCM signed 16-bit little-endian
355 | ar='16000', # Sampling rate 16 KHz
356 | ac=1) # Mono channel
357 | .run(quiet=True,
358 | overwrite_output=True)
359 | )
360 |
361 | def load_config(config: str | Path) -> dict:
362 | with open(config, 'r', encoding='utf-8') as file:
363 | return json.load(file)
364 |
365 | def format_time(seconds, delimiter=',') -> str:
366 | """
367 |     Converts a duration in seconds to an 'HH:MM:SS<delimiter>mmm' timestamp (',' for SRT, '.' for WebVTT).
368 | """
369 | h = int(seconds // 3600)
370 | m = int((seconds % 3600) // 60)
371 | s = int(seconds % 60)
372 | ms = int((seconds - int(seconds)) * 1000)
373 |
374 | return f"{h:02}:{m:02}:{s:02}{delimiter}{ms:03}"
375 |
376 | def run_with_progress(description: str, task: Callable[[], Any]) -> Any:
377 | """
378 | Helper function to run a task with a progress bar.
379 | """
380 | with Progress(
381 | SpinnerColumn(),
382 | TimeElapsedColumn(),
383 | TextColumn("[progress.description]{task.description}")
384 | ) as progress:
385 | progress.add_task(description, total=None)
386 | return task()
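
For orientation, a minimal usage sketch of the helpers above, assuming `whisply` and `ffmpeg` are installed and that a hypothetical `sample.mp4` exists; `check_file_format` normalises the audio to 16 kHz mono PCM and returns both the resulting path and the raw samples.

```python
from pathlib import Path
from whisply import little_helper

# Hypothetical input file; any suffix from return_valid_fileformats() works.
wav_path, audio = little_helper.check_file_format(
    Path("sample.mp4"), del_originals=False
)
print(wav_path)                      # e.g. sample_converted.wav
print(audio.dtype, audio.shape)      # float32 mono samples at 16 kHz
print(little_helper.format_time(len(audio) / 16000))  # duration, e.g. 00:03:12,480
```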
--------------------------------------------------------------------------------
/whisply/models.py:
--------------------------------------------------------------------------------
1 | from rich import print
2 |
3 | WHISPER_MODELS = {
4 | # Models for faster-whisper / whisperx: https://huggingface.co/Systran
5 | # Models for insanely-fast-whisper: https://huggingface.co/openai
6 | 'tiny': {
7 | 'faster-whisper': 'tiny',
8 | 'insane-whisper': 'openai/whisper-tiny',
9 | 'whisperx': 'tiny',
10 | 'translation': True
11 | },
12 |     'tiny-en': {
13 | 'faster-whisper': 'tiny.en',
14 | 'insane-whisper': 'openai/whisper-tiny.en',
15 | 'whisperx': 'tiny.en',
16 | 'translation': False
17 | },
18 | 'base': {
19 | 'faster-whisper': 'base',
20 | 'insane-whisper': 'openai/whisper-base',
21 | 'whisperx': 'base',
22 | 'translation': True
23 | },
24 | 'base-en': {
25 | 'faster-whisper': 'base.en',
26 | 'insane-whisper': 'openai/whisper-base.en',
27 | 'whisperx': 'base.en',
28 | 'translation': False
29 | },
30 | 'small': {
31 | 'faster-whisper': 'small',
32 | 'insane-whisper': 'openai/whisper-small',
33 | 'whisperx': 'small',
34 | 'translation': True
35 | },
36 | 'small-en': {
37 | 'faster-whisper': 'small.en',
38 | 'insane-whisper': 'openai/whisper-small.en',
39 | 'whisperx': 'small.en',
40 | 'translation': False
41 | },
42 | 'distil-small-en': {
43 | 'faster-whisper': 'distil-small.en',
44 | 'insane-whisper': 'distil-whisper/distil-small.en',
45 | 'whisperx': None,
46 | 'translation': False
47 | },
48 | 'medium': {
49 | 'faster-whisper': 'medium',
50 | 'insane-whisper': 'openai/whisper-medium',
51 | 'whisperx': 'medium',
52 | 'translation': True
53 | },
54 | 'medium-en': {
55 | 'faster-whisper': 'medium.en',
56 | 'insane-whisper': 'openai/whisper-medium.en',
57 | 'whisperx': 'medium.en',
58 | 'translation': False
59 | },
60 | 'distil-medium-en': {
61 | 'faster-whisper': 'distil-medium.en',
62 | 'insane-whisper': 'distil-whisper/distil-medium.en',
63 | 'whisperx': None,
64 | 'translation': False
65 | },
66 | 'large': {
67 | 'faster-whisper': 'large',
68 | 'insane-whisper': 'openai/whisper-large',
69 | 'whisperx': 'large',
70 | 'translation': True
71 | },
72 | 'large-v2': {
73 | 'faster-whisper': 'large-v2',
74 | 'insane-whisper': 'openai/whisper-large-v2',
75 | 'whisperx': 'large-v2',
76 | 'translation': True
77 | },
78 | 'distil-large-v2': {
79 | 'faster-whisper': 'distil-large-v2',
80 | 'insane-whisper': 'distil-whisper/distil-large-v2',
81 | 'whisperx': None,
82 | 'translation': True
83 | },
84 | 'large-v3': {
85 | 'faster-whisper': 'large-v3',
86 | 'insane-whisper': 'openai/whisper-large-v3',
87 | 'whisperx': 'large-v3',
88 | 'translation': True
89 | },
90 | 'distil-large-v3': {
91 | 'faster-whisper': 'distil-large-v3',
92 | 'insane-whisper': 'distil-whisper/distil-large-v3',
93 | 'whisperx': 'distil-large-v3',
94 | 'translation': True
95 | },
96 | 'large-v3-turbo': {
97 | 'faster-whisper': 'deepdml/faster-whisper-large-v3-turbo-ct2',
98 | 'insane-whisper': 'openai/whisper-large-v3-turbo',
99 | 'whisperx': None,
100 | 'translation': False
101 | },
102 | }
103 |
104 | def ensure_model(model: str) -> bool:
105 | return model in WHISPER_MODELS
106 |
107 | def is_model_supported(model: str, implementation: str, translation: bool) -> bool:
108 | model_info = WHISPER_MODELS.get(model)
109 | if not model_info:
110 | return False
111 | if model_info.get(implementation) is None:
112 | return False
113 | if translation and not model_info.get("translation", False):
114 | return False
115 | return True
116 |
117 | def set_supported_model(model: str, implementation: str, translation: bool) -> str:
118 | if not is_model_supported(model, implementation, translation):
119 | default_model = "large-v2"
120 | print(f'[blue1]→ Model "{model}" is not available for this task/implementation → Using default model "{default_model}".')
121 | return WHISPER_MODELS.get(default_model)[implementation]
122 | return WHISPER_MODELS.get(model)[implementation]
123 |
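
A small sketch of how the lookup helpers above resolve a model name per backend; for instance, 'large-v3-turbo' has no whisperX entry in the table, so `set_supported_model` falls back to the default.

```python
from whisply import models

print(models.ensure_model("large-v3-turbo"))                                       # True
print(models.is_model_supported("large-v3-turbo", "whisperx", translation=False))  # False
print(models.set_supported_model("large-v3-turbo", "whisperx", translation=False))
# prints the fallback notice and returns 'large-v2'
print(models.set_supported_model("base", "insane-whisper", translation=True))
# 'openai/whisper-base'
```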
--------------------------------------------------------------------------------
/whisply/output_templates.py:
--------------------------------------------------------------------------------
1 | NOSCRIBE_HTML_TEMPLATE = """
2 | <!DOCTYPE html>
3 | <html>
4 | <head>
5 | <meta charset="utf-8"/>
6 | </head>
7 | <body>
8 | <p>{transcription}</p>
9 | <p>Transcribed with whisply for noScribe's Editor.</p>
10 | <p>Audio source: {audio_filepath}</p>
11 | {body_content}
12 | </body>
13 | </html>
14 | """
--------------------------------------------------------------------------------
/whisply/output_utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | import logging
4 | import typer
5 |
6 | from enum import Enum
7 | from pathlib import Path
8 | from typing import List, Dict, Tuple
9 | from rich import print
10 | from whisply import little_helper
11 | from whisply.post_correction import Corrections
12 |
13 | # Set logging configuration
14 | logger = logging.getLogger('little_helper')
15 | logger.setLevel(logging.INFO)
16 |
17 |
18 | class ExportFormats(str, Enum):
19 | ALL = 'all'
20 | JSON = 'json'
21 | TXT = 'txt'
22 | RTTM = 'rttm'
23 | VTT = 'vtt'
24 | WEBVTT = 'webvtt'
25 | SRT = 'srt'
26 | HTML = 'html'
27 |
28 |
29 | def determine_export_formats(
30 | export_format: ExportFormats,
31 | annotate: bool,
32 | subtitle: bool
33 | ) -> List[str]:
34 | """
35 | Determine the export formats based on user options and availability.
36 |
37 | Returns a list of export format strings to be used.
38 | """
39 | available_formats = set()
40 | if export_format == ExportFormats.ALL:
41 | available_formats.add(ExportFormats.JSON.value)
42 | available_formats.add(ExportFormats.TXT.value)
43 | if annotate:
44 | available_formats.add(ExportFormats.RTTM.value)
45 | available_formats.add(ExportFormats.HTML.value)
46 | if subtitle:
47 | available_formats.add(ExportFormats.WEBVTT.value)
48 | available_formats.add(ExportFormats.VTT.value)
49 | available_formats.add(ExportFormats.SRT.value)
50 | else:
51 | if export_format in (ExportFormats.JSON, ExportFormats.TXT):
52 | available_formats.add(export_format.value)
53 | elif export_format in (ExportFormats.RTTM, ExportFormats.HTML):
54 | if annotate:
55 | available_formats.add(export_format.value)
56 | else:
57 |                 print(f"→ {export_format.value.upper()} export format requires annotate option to be True.")
58 | raise typer.Exit()
59 | elif export_format in (
60 | ExportFormats.VTT,
61 | ExportFormats.SRT,
62 | ExportFormats.WEBVTT
63 | ):
64 | if subtitle:
65 | available_formats.add(export_format.value)
66 | else:
67 | print(f"→ {export_format.value.upper()} export format requires subtitle option to be True.")
68 | raise typer.Exit()
69 | else:
70 | print(f"→ Unknown export format: {export_format.value}")
71 | raise typer.Exit()
72 |
73 | return list(available_formats)
74 |
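
As a quick illustration of the branching above (assuming the package is importable), 'all' expands according to the enabled options, while a single format is only allowed when its prerequisite flag is set:

```python
from whisply.output_utils import ExportFormats, determine_export_formats

print(determine_export_formats(ExportFormats.ALL, annotate=False, subtitle=True))
# ['json', 'txt', 'srt', 'vtt', 'webvtt'] (set-based, so order may vary)
print(determine_export_formats(ExportFormats.SRT, annotate=False, subtitle=True))
# ['srt']
# ExportFormats.RTTM with annotate=False prints a notice and raises typer.Exit.
```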
75 |
76 | class OutputWriter:
77 | """
78 | Class for writing various output formats to disk.
79 | """
80 | def __init__(
81 | self,
82 | corrections: Corrections = None
83 | ):
84 | self.cwd = Path.cwd()
85 | self.corrections = corrections
86 | self.compiled_simple_patterns = (
87 | self._compile_simple_patterns() if self.corrections else {}
88 | )
89 | self.compiled_regex_patterns = (
90 | self._compile_regex_patterns() if self.corrections else []
91 | )
92 |
93 | def _compile_simple_patterns(self) -> List[Tuple[re.Pattern, str]]:
94 | """
95 | Pre-compile regex patterns for simple word corrections.
96 | Returns a list of tuples containing compiled patterns and their replacements.
97 | """
98 | patterns = []
99 | for wrong, correct in self.corrections.simple.items():
100 | # Wrap simple corrections with word boundaries
101 | pattern = re.compile(
102 | r'\b{}\b'.format(re.escape(wrong)), flags=re.IGNORECASE
103 | )
104 | patterns.append((pattern, correct))
105 | logger.debug(
106 | f"Compiled simple pattern: '\\b{wrong}\\b' → '{correct}'"
107 | )
108 | return patterns
109 |
110 | def _compile_regex_patterns(self) -> List[Tuple[re.Pattern, str]]:
111 | """
112 | Pre-compile regex patterns for pattern-based corrections.
113 | Returns a list of tuples containing compiled regex patterns and their replacements.
114 | """
115 | patterns = []
116 | for entry in self.corrections.patterns:
117 | original_pattern = entry['pattern']
118 | replacement = entry['replacement']
119 |
120 | # Wrap patterns with word boundaries and non-capturing group
121 | new_pattern = r'\b(?:' + original_pattern + r')\b'
122 | regex = re.compile(new_pattern, flags=re.IGNORECASE)
123 |
124 | patterns.append((regex, replacement))
125 | logger.debug(
126 | f"Compiled pattern-based regex: '{new_pattern}' → '{replacement}'"
127 | )
128 | return patterns
129 |
130 | def correct_transcription(self, transcription: str) -> str:
131 | """
132 | Apply both simple and pattern-based corrections to the transcription.
133 | """
134 | # Apply simple corrections
135 | for pattern, correct in self.compiled_simple_patterns:
136 | transcription = pattern.sub(
137 | lambda m: self.replace_match(m, correct), transcription
138 | )
139 |
140 | # Apply pattern-based corrections
141 | for regex, replacement in self.compiled_regex_patterns:
142 | transcription = regex.sub(replacement, transcription)
143 |
144 | return transcription
145 |
146 | @staticmethod
147 | def replace_match(match, correct: str) -> str:
148 | """
149 | Replace the matched word while preserving the original casing.
150 | """
151 | word = match.group()
152 | if word.isupper():
153 | return correct.upper()
154 | elif word[0].isupper():
155 | return correct.capitalize()
156 | else:
157 | return correct
158 |
159 | def _save_file(
160 | self,
161 | content: str,
162 | filepath: Path,
163 | description: str,
164 | log_message: str
165 | ) -> None:
166 | """
167 | Generic method to save content to a file.
168 | """
169 | with open(filepath, 'w', encoding='utf-8') as file:
170 | file.write(content)
171 | print(f'[blue1]→ Saved {description}: [bold]{filepath.relative_to(self.cwd)}')
172 | logger.info(f'{log_message} {filepath}')
173 |
174 | def save_json(
175 | self,
176 | result: dict,
177 | filepath: Path
178 | ) -> None:
179 | with open(filepath, 'w', encoding='utf-8') as fout:
180 | json.dump(result, fout, indent=4)
181 | print(f'[blue1]→ Saved .json: [bold]{filepath.relative_to(self.cwd)}')
182 | logger.info(f"Saved .json to {filepath}")
183 |
184 | def save_txt(
185 | self,
186 | transcription: Dict[str, str],
187 | filepath: Path
188 | ) -> None:
189 | """
190 | Save the transcription as a TXT file after applying corrections.
191 | """
192 | original_text = transcription.get('text', '').strip()
193 | corrected_text = self.correct_transcription(original_text)
194 | self._save_file(
195 | content=corrected_text,
196 | filepath=filepath,
197 | description='.txt',
198 | log_message='Saved .txt transcript to'
199 | )
200 |
201 | def save_txt_with_speaker_annotation(
202 | self,
203 | annotated_text: str,
204 | filepath: Path
205 | ) -> None:
206 | """
207 | Save the annotated transcription as a TXT file after applying corrections.
208 | """
209 | corrected_annotated_text = self.correct_transcription(annotated_text)
210 | self._save_file(
211 | content=corrected_annotated_text,
212 | filepath=filepath,
213 | description='.txt with speaker annotation',
214 | log_message='Saved .txt transcription with speaker annotation →'
215 | )
216 |
217 | def save_subtitles(
218 | self,
219 | text: str,
220 | type: str,
221 | filepath: Path
222 | ) -> None:
223 | """
224 | Save subtitles in the specified format after applying corrections.
225 | """
226 | corrected_text = self.correct_transcription(text)
227 | description = f'.{type} subtitles'
228 | log_message = f'Saved .{type} subtitles →'
229 | self._save_file(
230 | content=corrected_text,
231 | filepath=filepath,
232 | description=description,
233 | log_message=log_message
234 | )
235 |
236 | def save_rttm_annotations(
237 | self,
238 | rttm: str,
239 | filepath: Path
240 | ) -> None:
241 | self._save_file(
242 | content=rttm,
243 | filepath=filepath,
244 | description='.rttm annotations',
245 | log_message='Saved .rttm annotations →'
246 | )
247 |
248 | def save_html(
249 | self,
250 | html: str,
251 | filepath: Path
252 | ) -> None:
253 | self._save_file(
254 | content=html,
255 | filepath=filepath,
256 | description='.html (noScribe compatible)',
257 | log_message='Saved .html annotations →'
258 | )
259 |
260 | def save_results(
261 | self,
262 | result: dict,
263 | export_formats: List[str]
264 | ) -> List[Path]:
265 | """
266 | Write various output formats to disk based on the specified export formats.
267 | """
268 | output_filepath = Path(result['output_filepath'])
269 | written_filepaths = []
270 |
271 | # Apply corrections if they are provided
272 | if self.corrections and (
273 | self.corrections.simple or self.corrections.patterns
274 | ):
275 | for language, transcription in result.get('transcription', {}).items():
276 | # Correct the main transcription text
277 | original_text = transcription.get('text', '').strip()
278 | corrected_text = self.correct_transcription(original_text)
279 | result['transcription'][language]['text'] = corrected_text
280 |
281 | # Correct chunks and word dicts
282 |             chunks = transcription.get('chunks', [])
283 | for c in chunks:
284 | # Text chunk
285 | c['text'] = self.correct_transcription(c['text'])
286 | # Words
287 |                 words = c.get('words', [])
288 | for w in words:
289 | w['word'] = self.correct_transcription(w['word'])
290 |
291 | # Correct speaker annotations if present
292 | if 'text_with_speaker_annotation' in transcription:
293 | original_annotated = transcription['text_with_speaker_annotation']
294 | corrected_annotated = self.correct_transcription(original_annotated)
295 | result['transcription'][language]['text_with_speaker_annotation'] = corrected_annotated
296 |
297 | # Now, transcription_items reflects the corrected transcriptions
298 | transcription_items = result.get('transcription', {}).items()
299 |
300 | # Write .txt
301 | if 'txt' in export_formats:
302 | for language, transcription in transcription_items:
303 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.txt"
304 | self.save_txt(
305 | transcription,
306 | filepath=fout
307 | )
308 | written_filepaths.append(str(fout))
309 |
310 | # Write subtitles (.srt, .vtt and .webvtt)
311 | subtitle_formats = {'srt', 'vtt', 'webvtt'}
312 | if subtitle_formats.intersection(export_formats):
313 | for language, transcription in transcription_items:
314 | # .srt subtitles
315 | if 'srt' in export_formats:
316 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.srt"
317 | srt_text = create_subtitles(
318 | transcription,
319 | type='srt'
320 | )
321 | self.save_subtitles(srt_text, type='srt', filepath=fout)
322 | written_filepaths.append(str(fout))
323 |
324 | # .vtt / .webvtt subtitles
325 | if 'vtt' in export_formats or 'webvtt' in export_formats:
326 | for subtitle_type in ['webvtt', 'vtt']:
327 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.{subtitle_type}"
328 | vtt_text = create_subtitles(
329 | transcription,
330 | type=subtitle_type,
331 | result=result
332 | )
333 | self.save_subtitles(
334 | vtt_text,
335 | type=subtitle_type,
336 | filepath=fout
337 | )
338 | written_filepaths.append(str(fout))
339 |
340 | # Write annotated .txt with speaker annotations
341 | has_speaker_annotation = any(
342 | 'text_with_speaker_annotation' in transcription
343 | for transcription in result['transcription'].values()
344 | )
345 |
346 | if 'txt' in export_formats and has_speaker_annotation:
347 | for language, transcription in transcription_items:
348 | if 'text_with_speaker_annotation' in transcription:
349 | fout = output_filepath.parent / f"{output_filepath.name}_{language}_annotated.txt"
350 | self.save_txt_with_speaker_annotation(
351 | annotated_text=transcription['text_with_speaker_annotation'],
352 | filepath=fout
353 | )
354 | written_filepaths.append(str(fout))
355 |
356 | # Write .rttm
357 | if 'rttm' in export_formats:
358 | # Create .rttm annotations
359 | rttm_dict = dict_to_rttm(result)
360 |
361 | for language, rttm_annotation in rttm_dict.items():
362 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.rttm"
363 | self.save_rttm_annotations(
364 | rttm=rttm_annotation,
365 | filepath=fout
366 | )
367 | written_filepaths.append(str(fout))
368 |
369 | # Write .json
370 | if 'json' in export_formats:
371 | fout = output_filepath.with_suffix('.json')
372 | written_filepaths.append(str(fout))
373 | result['written_files'] = written_filepaths
374 | self.save_json(result, filepath=fout)
375 |
376 | # Write noScribe compatible .html
377 | if 'html' in export_formats:
378 | audio_filepath = Path(result['input_filepath'])
379 | for language, transcription in transcription_items:
380 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.html"
381 | html = create_html(
382 | transcription,
383 | audio_filepath=audio_filepath,
384 | output_filepath=output_filepath
385 | )
386 | self.save_html(html, filepath=fout)
387 | written_filepaths.append(str(fout))
388 |
389 | return written_filepaths
390 |
391 | def create_subtitles(
392 | transcription_dict: dict,
393 | type: str = 'srt',
394 | result: dict = None
395 | ) -> str:
396 | """
397 | Converts a transcription dictionary into subtitle format (.srt or .webvtt).
398 |
399 | Args:
400 | transcription_dict (dict): Dictionary containing transcription data
401 | with 'chunks'.
402 | sub_length (int, optional): Maximum duration in seconds for each
403 | subtitle block.
404 | type (str, optional): Subtitle format, either 'srt' or 'webvtt'.
405 | Default is 'srt'.
406 |
407 | Returns:
408 | str: Formatted subtitle text in the specified format.
409 | """
410 | subtitle_text = ''
411 | seg_id = 0
412 |
413 | for chunk in transcription_dict['chunks']:
414 | start_time = chunk['timestamp'][0]
415 | end_time = chunk['timestamp'][1]
416 | text = chunk['text'].replace('’', '\'')
417 |
418 | # Create .srt subtitles
419 | if type == 'srt':
420 | start_time_str = little_helper.format_time(
421 | start_time,
422 | delimiter=','
423 | )
424 | end_time_str = little_helper.format_time(
425 | end_time,
426 | delimiter=','
427 | )
428 | seg_id += 1
429 | subtitle_text += f"""{seg_id}\n{start_time_str} --> {end_time_str}\n{text.strip()}\n\n"""
430 |
431 | # Create .webvtt subtitles
432 | elif type in ['webvtt', 'vtt']:
433 | start_time_str = little_helper.format_time(
434 | start_time,
435 | delimiter='.'
436 | )
437 | end_time_str = little_helper.format_time(
438 | end_time,
439 | delimiter='.'
440 | )
441 |
442 | if seg_id == 0:
443 | subtitle_text += f"WEBVTT {Path(result['output_filepath']).stem}\n\n"
444 |
445 | if type == 'vtt':
446 | subtitle_text += 'NOTE transcribed with whisply\n\n'
447 | subtitle_text += f"NOTE media: {Path(result['input_filepath']).absolute()}\n\n"
448 |
449 | seg_id += 1
450 | subtitle_text += f"""{seg_id}\n{start_time_str} --> {end_time_str}\n{text.strip()}\n\n"""
451 |
452 | return subtitle_text
453 |
454 | def dict_to_rttm(result: dict) -> dict:
455 | """
456 | Converts a transcription dictionary to RTTM file format.
457 | """
458 | file_id = result.get('input_filepath', 'unknown_file')
459 | file_id = Path(file_id).stem
460 | rttm_dict = {}
461 |
462 | # Iterate over each available language
463 | for lang, transcription in result.get('transcription', {}).items():
464 | lines = []
465 | current_speaker = None
466 | speaker_start_time = None
467 | speaker_end_time = None
468 |
469 | chunks = transcription.get('chunks', [])
470 |
471 | # Collect all words from chunks
472 | all_words = []
473 | for chunk in chunks:
474 | words = chunk.get('words', [])
475 | all_words.extend(words)
476 |
477 | # Sort all words by their start time
478 | all_words.sort(key=lambda w: w.get('start', 0.0))
479 |
480 | for word_info in all_words:
481 | speaker = word_info.get('speaker', 'SPEAKER_00')
482 | word_start = word_info.get('start', 0.0)
483 | word_end = word_info.get('end', word_start)
484 |
485 | if speaker != current_speaker:
486 | # If there is a previous speaker segment, write it to the RTTM
487 | if current_speaker is not None:
488 | duration = speaker_end_time - speaker_start_time
489 | rttm_line = (
490 | f"SPEAKER {file_id} 1 {speaker_start_time:.3f} {duration:.3f} "
491 |                         f"<NA> <NA> {current_speaker} <NA> <NA>"
492 | )
493 | lines.append(rttm_line)
494 |
495 | # Start a new speaker segment
496 | current_speaker = speaker
497 | speaker_start_time = word_start
498 | speaker_end_time = word_end
499 | else:
500 | # Extend the current speaker segment
501 | speaker_end_time = max(speaker_end_time, word_end)
502 |
503 | # Write the last speaker segment to the RTTM
504 | if current_speaker is not None:
505 | duration = speaker_end_time - speaker_start_time
506 | rttm_line = (
507 | f"SPEAKER {file_id} 1 {speaker_start_time:.3f} {duration:.3f} "
508 |                         f"<NA> <NA> {current_speaker} <NA> <NA>"
509 | )
510 | lines.append(rttm_line)
511 |
512 | rttm_content = "\n".join(lines)
513 | rttm_dict[lang] = rttm_content
514 |
515 | return rttm_dict
516 |
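
A minimal sketch (hypothetical word timings and speakers) of the RTTM lines produced above; each speaker turn becomes one `SPEAKER` record in the standard ten-field layout:

```python
from whisply.output_utils import dict_to_rttm

result = {
    'input_filepath': 'interview.wav',   # hypothetical file name
    'transcription': {'en': {'chunks': [{
        'words': [
            {'word': 'Hello', 'start': 0.0, 'end': 0.4, 'speaker': 'SPEAKER_00'},
            {'word': 'Hi',    'start': 0.5, 'end': 0.7, 'speaker': 'SPEAKER_01'},
        ]
    }]}}
}
print(dict_to_rttm(result)['en'])
# SPEAKER interview 1 0.000 0.400 <NA> <NA> SPEAKER_00 <NA> <NA>
# SPEAKER interview 1 0.500 0.200 <NA> <NA> SPEAKER_01 <NA> <NA>
```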
517 | def create_html(
518 | transcription: dict,
519 | audio_filepath: Path,
520 | output_filepath: Path
521 | ) -> str:
522 |     """
523 |     Builds an HTML document that is compatible with noScribe's editor:
524 | https://github.com/kaixxx/noScribe
525 | """
526 | from whisply.output_templates import NOSCRIBE_HTML_TEMPLATE
527 |
528 | # Helper function to parse a transcript line
529 | def parse_line(line: str):
530 | pattern = r'\[(.*?)\]\s+\[(.*?)\]\s+(.*)'
531 | match = re.match(pattern, line)
532 | if match:
533 | timestamp, speaker, text = match.groups()
534 | return timestamp, speaker, text
535 | return None, None, line
536 |
537 | # Helper function to convert a timestamp to milliseconds
538 | def convert_timestamp_to_ms(timestamp: str) -> int:
539 | h, m, s = timestamp.split(':')
540 | s = s.split('.')[0]
541 | return (int(h) * 3600 + int(m) * 60 + int(s)) * 1000
542 |
543 | # Use the annotated txt as input
544 | input_text = transcription.get('text_with_speaker_annotation', '')
545 |
546 | # Replace placeholders in the HTML template
547 | html_template = NOSCRIBE_HTML_TEMPLATE.replace('{transcription}', str(output_filepath))
548 | html_template = html_template.replace('{audio_filepath}', str(audio_filepath))
549 |
550 | # Process transcript lines to build HTML content
551 | body_content = ""
552 | transcript_lines = input_text.strip().splitlines()
553 | num_lines = len(transcript_lines)
554 |
555 | for idx, line in enumerate(transcript_lines):
556 | segment_start, speaker, text = parse_line(line)
557 |
558 | # If there's a following line or a single line, generate detailed HTML with an anchor
559 | if num_lines == 1 or idx < num_lines - 1:
560 | # For a single line, override segment_start with a default value
561 | if num_lines == 1:
562 | segment_start = '00:00:00'
563 | segment_end, _, _ = parse_line(line)
564 | else:
565 | segment_end, _, _ = parse_line(transcript_lines[idx + 1])
566 |
567 | # Convert timestamps to ms
568 | start = convert_timestamp_to_ms(segment_start)
569 | end = convert_timestamp_to_ms(segment_end)
570 |
571 | # Set speaker labels
572 | if 'UNKNOWN' in speaker:
573 | speaker_label = 'SXX'
574 | else:
575 | speaker_number = re.findall(r'\d+', speaker)[0]
576 | speaker_label = f'S{speaker_number}'
577 |
578 | # Build the anchor tag and HTML segment
579 | anchor = f"ts_{start}_{end}_{speaker_label}"
580 |                 segment_html = (f'<p><a name="{anchor}">'
581 |                                 f'[{segment_start}] {speaker}: {text}'
582 |                                 f'</a></p>\n')
583 | else:
584 |                 segment_html = f'<p>{text}</p>\n'
585 |
586 | body_content += segment_html
587 |
588 | html_output = html_template.replace('{body_content}', body_content)
589 |
590 | return html_output
591 |
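
To round off this module, a short sketch of the correction pass used by `OutputWriter` before anything is written to disk; note that simple corrections preserve the original casing, while pattern-based replacements are inserted verbatim:

```python
from whisply.output_utils import OutputWriter
from whisply.post_correction import Corrections

corrections = Corrections(
    simple={'wisper': 'whisper'},
    patterns=[{'pattern': r'colou?r', 'replacement': 'color'}],
)
writer = OutputWriter(corrections=corrections)
print(writer.correct_transcription("Wisper handles colour and COLOUR."))
# Whisper handles color and color.
```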
--------------------------------------------------------------------------------
/whisply/post_correction.py:
--------------------------------------------------------------------------------
1 | import yaml
2 | from pathlib import Path
3 | from dataclasses import dataclass, field
4 | from typing import Dict, List
5 |
6 | @dataclass
7 | class Corrections:
8 | """
9 | A dataclass to encapsulate both simple and pattern-based corrections.
10 | """
11 | simple: Dict[str, str] = field(default_factory=dict)
12 | patterns: List[Dict[str, str]] = field(default_factory=list)
13 |
14 |
15 | def load_correction_list(filepath: str | Path) -> Corrections:
16 | """
17 | Load the correction dictionary and patterns from a YAML file.
18 |
19 | :param filepath: Path to the YAML correction file.
20 | :return: Corrections object containing simple and pattern-based
21 | corrections.
22 | """
23 | try:
24 |         with open(filepath, 'r', encoding='utf-8') as file:
25 | data = yaml.safe_load(file)
26 |
27 | if not isinstance(data, dict):
28 | raise ValueError("→ Correction file must contain a YAML dictionary.")
29 |
30 | # Extract simple corrections
31 | simple_corrections = {k: v for k, v in data.items() if k != 'patterns'}
32 |
33 | # Extract pattern-based corrections
34 | pattern_corrections = data.get('patterns', [])
35 |
36 | # Validate patterns
37 | for entry in pattern_corrections:
38 | if 'pattern' not in entry or 'replacement' not in entry:
39 |                 raise ValueError("→ Each pattern entry must contain "
40 |                                  "'pattern' and 'replacement' keys.")
41 |
42 | return Corrections(
43 | simple=simple_corrections,
44 | patterns=pattern_corrections
45 | )
46 |
47 | except FileNotFoundError:
48 | print(f"→ Correction file not found: {filepath}")
49 | return Corrections()
50 | except yaml.YAMLError as e:
51 | print(f"→ Error parsing YAML file: {e}")
52 | return Corrections()
53 | except Exception as e:
54 | print(f"→ Unexpected error loading correction list: {e}")
55 | return Corrections()
56 |
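
A minimal, self-contained sketch of the expected YAML layout (the file name and entries here are made up; see examples/post_correction.yaml in the repository for the shipped example):

```python
from pathlib import Path
from whisply.post_correction import load_correction_list

Path("corrections.yaml").write_text(
    "wisper: whisper\n"
    "patterns:\n"
    "  - pattern: 'colou?r'\n"
    "    replacement: 'color'\n",
    encoding="utf-8",
)
corrections = load_correction_list("corrections.yaml")
print(corrections.simple)    # {'wisper': 'whisper'}
print(corrections.patterns)  # [{'pattern': 'colou?r', 'replacement': 'color'}]
```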
--------------------------------------------------------------------------------
/whisply/transcription.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import time
3 | from pathlib import Path
4 | from datetime import datetime
5 | from functools import partial
6 | from rich import print
7 | import whisperx.diarize
8 |
9 | from whisply import little_helper, output_utils, models
10 | from whisply.little_helper import FilePathProcessor
11 | from whisply.post_correction import Corrections
12 |
13 | # Set logging configuration
14 | log_dir = little_helper.ensure_dir(Path('./logs'))
15 | log_filename = f"log_whisply_{datetime.now().strftime('%Y-%m-%d')}.log"
16 | log_file = f"{log_dir}/{log_filename}"
17 |
18 | logging.basicConfig(
19 | filename=log_file,
20 | level=logging.INFO,
21 | format="%(asctime)s %(levelname)s [%(funcName)s]: %(message)s",
22 | )
23 |
24 | class TranscriptionHandler:
25 | """
26 | Handles transcription and diarization of audio/video files using various
27 | Whisper-based models.
28 |
29 | This class leverages different implementations of OpenAI's Whisper models
30 | (whisperX, insanely-fast-whisper, faster-whisper) to transcribe audio and
31 | video files. It supports features like language detection, speaker
32 | diarization, translation, subtitle generation, and exporting transcriptions
33 | in multiple formats. It is capable of processing single files, directories,
34 | URLs, and lists of files, providing flexibility for diverse transcription
35 | needs.
36 |
37 | Args:
38 | base_dir (str, optional): Directory to store transcription outputs.
39 | Defaults to './transcriptions'.
40 | model (str, optional): Whisper model variant to use (e.g., 'large-v2').
41 | Defaults to 'large-v3-turbo'.
42 | device (str, optional): Compute device ('cpu', 'cuda', etc.).
43 | Defaults to 'cpu'.
44 | file_language (str, optional): Language of the input audio.
45 | If not provided, language detection is performed.
46 | annotate (bool, optional): Enable speaker diarization.
47 | Defaults to False.
48 | hf_token (str, optional): Hugging Face token for accessing restricted
49 | models or features.
50 | subtitle (bool, optional): Generate subtitles with word-level timestamps.
51 | Defaults to False.
52 | sub_length (int, optional): Maximum number of words per subtitle chunk.
53 | Required if subtitle is True.
54 | translate (bool, optional): Translate transcription to English if the
55 | original language is different. Defaults to False.
56 | verbose (bool, optional): Enable detailed logging and output.
57 | Defaults to False.
58 | export_formats (str or list, optional): Formats to export transcriptions
59 | (e.g., 'json', 'srt'). Defaults to 'all'.
60 |
61 | Attributes:
62 | base_dir (Path): Directory for storing transcriptions.
63 | device (str): Compute device in use.
64 | file_language (str or None): Detected or specified language of the
65 | audio.
66 | annotate (bool): Indicates if speaker diarization is enabled.
67 | translate (bool): Indicates if translation is enabled.
68 | subtitle (bool): Indicates if subtitle generation is enabled.
69 | verbose (bool): Indicates if verbose mode is active.
70 | export_formats (str or list): Selected formats for exporting
71 | transcriptions.
72 | processed_files (list): List of processed file information and results.
73 |
74 | Methods:
75 | get_filepaths(filepath: str):
76 | Retrieves and validates file paths from various input types.
77 |
78 | detect_language(file: Path, audio_array) -> str:
79 | Detects the language of the given audio file.
80 |
81 | process_files(files: list):
82 | Processes a list of audio files for transcription and diarization.
83 |
84 | transcribe_with_whisperx(filepath: Path) -> dict:
85 | Transcribes an audio file using the whisperX implementation.
86 |
87 | transcribe_with_insane_whisper(filepath: Path) -> dict:
88 | Transcribes an audio file using the insanely-fast-whisper
89 | implementation.
90 |
91 | transcribe_with_faster_whisper(filepath: Path, num_workers: int = 1)
92 | -> dict:
93 | Transcribes an audio file using the faster-whisper implementation.
94 |
95 | adjust_word_chunk_length(result: dict) -> dict:
96 | Splits transcription text into chunks based on a maximum word
97 | count.
98 |
99 | to_transcription_dict(insanely_annotation: list[dict]) -> dict:
100 | Converts speaker-annotated results into a standardized dictionary.
101 |
102 | to_whisperx(transcription_result: dict) -> dict:
103 | Normalizes transcription results to the whisperX format.
104 |
105 | create_text_with_speakers(transcription_dict: dict,
106 | delimiter: str = '.') -> dict:
107 | Inserts speaker labels into the transcription text upon speaker
108 | changes.
109 | """
110 | def __init__(
111 | self,
112 | base_dir='./transcriptions',
113 | model='large-v3-turbo',
114 | device='cpu',
115 | file_language=None,
116 | annotate=False,
117 | num_speakers=None,
118 | hf_token=None,
119 | subtitle=False,
120 | sub_length=None,
121 | translate=False,
122 | verbose=False,
123 | del_originals=False,
124 |         corrections: Corrections = None,
125 | export_formats='all'
126 | ):
127 | self.base_dir = little_helper.ensure_dir(Path(base_dir))
128 | self.file_formats = little_helper.return_valid_fileformats()
129 | self.device = device
130 | self.file_language = file_language
131 | self.file_language_provided = file_language is not None
132 | self.model = None
133 | self.model_provided = model
134 | self.annotate = annotate
135 | self.num_speakers = num_speakers
136 | self.translate = translate
137 | self.hf_token = hf_token
138 | self.subtitle = subtitle
139 | self.sub_length = sub_length
140 | self.verbose = verbose
141 | self.del_originals = del_originals
142 | self.corrections = corrections
143 | self.export_formats = export_formats
144 | self.metadata = self._collect_metadata()
145 | self.filepaths = []
146 | self.output_dir = None
147 | self.processed_files = []
148 |
149 | def _collect_metadata(self):
150 | return {
151 | 'output_dir': str(self.base_dir),
152 | 'file_language': self.file_language,
153 | 'model': self.model_provided,
154 | 'device': self.device,
155 | 'annotate': self.annotate,
156 | 'num_speakers': self.num_speakers,
157 | 'translate': self.translate,
158 | 'subtitle': self.subtitle,
159 | 'sub_length': self.sub_length
160 | }
161 |
162 | def adjust_word_chunk_length(self, result: dict) -> dict:
163 | """
164 | Generates text chunks based on the maximum number of words.
165 |
166 | Parameters:
167 | result (dict): The nested dictionary containing segments
168 | and words.
169 |             The maximum number of words per chunk is taken from
170 |             self.sub_length.
171 |
172 | Returns:
173 | dict: A dictionary containing a list of chunks, each with
174 | 'text', 'timestamp', and 'words'.
175 | """
176 | # Flatten all words from all segments
177 | words = [
178 | word_info
179 | for segment in result.get('segments', [])
180 | for word_info in segment.get('words', [])
181 | ]
182 |
183 | # Split words into chunks of size max_number
184 | def split_into_chunks(lst, n):
185 | """Yield successive n-sized chunks from lst."""
186 | for i in range(0, len(lst), n):
187 | yield lst[i:i + n]
188 |
189 | chunks = []
190 | for word_chunk in split_into_chunks(words, self.sub_length):
191 | chunk_text = ' '.join(word_info['word'] for word_info in word_chunk)
192 | chunk_start = word_chunk[0]['start']
193 | chunk_end = word_chunk[-1]['end']
194 | chunk = {
195 | 'timestamp': [chunk_start, chunk_end],
196 | 'text': chunk_text,
197 | 'words': word_chunk
198 | }
199 | chunks.append(chunk)
200 |
201 | result_temp = {
202 | 'text': ' '.join(chunk['text'].strip() for chunk in chunks),
203 | 'chunks': chunks
204 | }
205 |
206 | return result_temp
207 |
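
A small sketch (hypothetical word timings, assuming whisply and its fairly heavy dependencies are installed) of the regrouping performed here: with `sub_length=3`, twelve aligned words become four chunks of three words each. Instantiating the handler creates the ./transcriptions and ./logs directories as a side effect.

```python
from whisply.transcription import TranscriptionHandler

handler = TranscriptionHandler(sub_length=3)
aligned = {'segments': [{
    'words': [{'word': f'w{i}', 'start': float(i), 'end': i + 0.5} for i in range(12)]
}]}
chunked = handler.adjust_word_chunk_length(aligned)
print(len(chunked['chunks']))             # 4
print(chunked['chunks'][0]['text'])       # w0 w1 w2
print(chunked['chunks'][0]['timestamp'])  # [0.0, 2.5]
```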
208 | def to_transcription_dict(self, insanely_annotation: list[dict]) -> dict:
209 | """
210 | Transform insanely-fast-whisper speaker annotation result to dict.
211 | """
212 | chunks = []
213 | for s in insanely_annotation:
214 | chunk = {
215 | 'text': s['text'],
216 | 'timestamp': (s['timestamp'][0], s['timestamp'][1]),
217 | 'speaker': s['speaker']
218 | }
219 | chunks.append(chunk)
220 |
221 | result = {
222 | 'text': ''.join([s['text'] for s in insanely_annotation]),
223 | 'chunks': chunks
224 | }
225 | return result
226 |
227 | def to_whisperx(self, transcription_result: dict) -> dict:
228 | """
229 | Normalize insanely-fast-whisper transcription result to whisperX dict.
230 | """
231 | words = []
232 | for c in transcription_result['chunks']:
233 | if 'speaker' in c:
234 | word = {
235 | 'word': c['text'].strip(),
236 | 'start': c['timestamp'][0],
237 | 'end': c['timestamp'][1],
238 | 'speaker': c['speaker']
239 | }
240 | else:
241 | word = {
242 | 'word': c['text'].strip(),
243 | 'start': c['timestamp'][0],
244 | 'end': c['timestamp'][1]
245 | }
246 | words.append(word)
247 |
248 | result = {
249 | 'segments': [
250 | {
251 | 'start': transcription_result['chunks'][0]['timestamp'][0],
252 | 'end': transcription_result['chunks'][-1]['timestamp'][1],
253 | 'text': transcription_result['text'].strip(),
254 | 'words': words
255 | }
256 | ]
257 | }
258 | return result
259 |
260 | def create_text_with_speakers(
261 | self,
262 | transcription_dict: dict,
263 | delimiter: str = '.'
264 | ) -> dict:
265 | """
266 | Iterates through all chunks of each language and creates the complete
267 | text with speaker labels inserted when there is a speaker change.
268 |
269 | Args:
270 | transcription_dict (dict): The dictionary containing transcription
271 | data.
272 |
273 | Returns:
274 | dict: A dictionary mapping each language to its formatted text with
275 | speaker labels.
276 | """
277 | transcriptions = transcription_dict.get('transcriptions', {})
278 |
279 | for lang, lang_data in transcriptions.items():
280 | text = ""
281 | current_speaker = None
282 | chunks = lang_data.get('chunks', [])
283 |
284 | for chunk in chunks:
285 | words = chunk.get('words', [])
286 |
287 | for word_info in words:
288 | speaker = word_info.get('speaker')
289 | word = word_info.get('word', '')
290 | start_timestamp = little_helper.format_time(
291 | word_info.get('start'),
292 | delimiter
293 | )
294 |
295 | # Insert speaker label if a speaker change is detected
296 | if speaker != current_speaker:
297 | text += f"\n[{start_timestamp}] [{speaker}] "
298 | current_speaker = speaker
299 |
300 | # Append the word with a space
301 | text += word + " "
302 |
303 | transcription_dict['transcriptions'][lang]['text_with_speaker_annotation'] = text.strip()
304 |
305 | return transcription_dict
306 |
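
For reference, a sketch (again with made-up timings and speakers) of the annotated text this method produces; these `[timestamp] [speaker]` lines are what `create_html` in output_utils.py later parses.

```python
from whisply.transcription import TranscriptionHandler

handler = TranscriptionHandler()
labelled = handler.create_text_with_speakers({'transcriptions': {'en': {'chunks': [{
    'words': [
        {'word': 'Hello', 'start': 0.0, 'speaker': 'SPEAKER_00'},
        {'word': 'there', 'start': 0.4, 'speaker': 'SPEAKER_00'},
        {'word': 'Hi',    'start': 1.2, 'speaker': 'SPEAKER_01'},
    ]
}]}}})
print(labelled['transcriptions']['en']['text_with_speaker_annotation'])
# [00:00:00.000] [SPEAKER_00] Hello there
# [00:00:01.200] [SPEAKER_01] Hi
```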
307 | def transcribe_with_whisperx(self, filepath: Path) -> dict:
308 | """
309 | Transcribe a file with the whisperX implementation that returns word-level
310 |         timestamps and speaker annotation: https://github.com/m-bain/whisperX
311 |
312 | This implementation is used when a specific subtitle length (e.g. 5 words
313 | per individual subtitle) is needed.
314 | """
315 | import torch
316 | import whisperx
317 | import gc
318 |
319 | def empty_cuda_cache(model):
320 |             del model
321 |             gc.collect()
322 |             torch.cuda.empty_cache()
323 |
324 | def fill_missing_timestamps(segments: list) -> list:
325 | """
326 | whisperX does not provide timestamps for words containing only
327 | numbers (e.g. "1.5", "2024" etc.).
328 |
329 | The function fills these missing timestamps by padding the last
330 | known 'end' timestamp for a missing 'start' timestamp or by cutting
331 | the next known 'start' timestamp for a missing 'end' timestamp.
332 | """
333 | padding = 0.05 # in seconds
334 |
335 | for segment in segments:
336 | words = segment['words']
337 | num_words = len(words)
338 |
339 | for i, word in enumerate(words):
340 | # If the 'start' key is missing
341 | if 'start' not in word:
342 | if i > 0 and 'end' in words[i-1]:
343 | word['start'] = round(words[i-1]['end'] + padding, 2)
344 | else:
345 | word['start'] = segment['start']
346 |
347 | # If the 'end' key is missing
348 | if 'end' not in word:
349 | if i < num_words - 1 and 'start' in words[i+1]:
350 | word['end'] = round(words[i+1]['start'] - padding, 2)
351 | elif i == num_words - 1:
352 | word['end'] = round(words[i]['start'] + padding, 2)
353 | segment['end'] = word['end']
354 | else:
355 | word['end'] = round(words[i]['start'] + padding, 2)
356 |
357 | # If 'score' key is missing
358 | if 'score' not in word:
359 | word['score'] = 0.5
360 |
361 | # If 'speaker' key is missing
362 | if self.annotate and 'speaker' not in word:
363 | speaker_assigned = False
364 |
365 | # Case 1: If it's the first word, look forward for the
366 | # next speaker
367 | if i == 0:
368 | for j in range(i + 1, num_words):
369 | if 'speaker' in words[j]:
370 | word['speaker'] = words[j]['speaker']
371 | speaker_assigned = True
372 | break
373 |
374 | # Case 2: If it's the last word, look backward for the
375 | # previous speaker
376 | elif i == num_words - 1:
377 | for j in range(i - 1, -1, -1):
378 | if 'speaker' in words[j]:
379 | word['speaker'] = words[j]['speaker']
380 | speaker_assigned = True
381 | break
382 |
383 | # Case 3: For other words, prefer the previous speaker;
384 | # If not found, look forward
385 | if not speaker_assigned:
386 | # Look backward
387 | for j in range(i - 1, -1, -1):
388 | if 'speaker' in words[j]:
389 | word['speaker'] = words[j]['speaker']
390 | speaker_assigned = True
391 | break
392 |
393 | if not speaker_assigned:
394 | # Look forward
395 | for j in range(i + 1, num_words):
396 | if 'speaker' in words[j]:
397 | word['speaker'] = words[j]['speaker']
398 | speaker_assigned = True
399 | break
400 |
401 | if not speaker_assigned:
402 | # Default speaker if none found
403 | word['speaker'] = 'UNKNOWN'
404 |
405 | return segments
406 |
407 | def whisperx_task(task: str = 'transcribe', language = None):
408 | """
409 | Define a transcription / translation task with whisperX
410 | """
411 | # Set parameters
412 | device = 'cuda' if self.device == 'cuda:0' else 'cpu'
413 |
414 | # Transcribe or Translate
415 | model = whisperx.load_model(
416 | whisper_arch=self.model,
417 | device=device,
418 | compute_type='float16' if self.device == 'cuda:0' else 'int8',
419 | language=self.file_language or None,
420 | asr_options={
421 | "hotwords": None,
422 | "multilingual": False
423 | })
424 | audio = whisperx.load_audio(str(filepath), sr=16000)
425 | result = model.transcribe(
426 | audio,
427 | batch_size=16 if self.device == 'cuda:0' else 8,
428 | task=task
429 | )
430 | model_a, metadata = whisperx.load_align_model(
431 | device=device,
432 | language_code=language
433 | )
434 | result = whisperx.align(
435 | result["segments"],
436 | model_a,
437 | metadata,
438 | audio,
439 | device,
440 | return_char_alignments=False
441 | )
442 |
443 | # Empty CUDA cache
444 | if self.device == 'cuda:0':
445 | empty_cuda_cache(model_a)
446 |
447 | return result
448 |
449 | def whisperx_annotation(transcription_result: dict) -> dict:
450 | # Set parameters
451 | device = 'cuda' if self.device == 'cuda:0' else 'cpu'
452 |
453 | diarize_model = whisperx.diarize.DiarizationPipeline(
454 | use_auth_token=self.hf_token,
455 | device=device
456 | )
457 | diarize_segments = diarize_model(
458 | str(filepath),
459 | max_speakers=self.num_speakers
460 | )
461 | result = whisperx.assign_word_speakers(
462 | diarize_segments,
463 | transcription_result
464 | )
465 |
466 | # Empty CUDA cache
467 | if self.device == 'cuda:0':
468 | empty_cuda_cache(diarize_model)
469 |
470 | return result
471 |
472 | # Start and time transcription
473 | logging.info(f"👨💻 Transcription started with whisper🆇 for {filepath.name}")
474 | t_start = time.time()
475 |
476 | # Run the transcription
477 | transcription_task = partial(
478 | whisperx_task,
479 | task='transcribe',
480 | language=self.file_language
481 | )
482 | transcription_result = little_helper.run_with_progress(
483 | description=f"[cyan]→ Transcribing ({'CUDA' if self.device == 'cuda:0' else 'CPU'}) [bold]{filepath.name}",
484 | task=transcription_task
485 | )
486 |
487 | # Speaker annotation
488 | if self.annotate:
489 | annotation_task = partial(
490 | whisperx_annotation,
491 | transcription_result
492 | )
493 | transcription_result = little_helper.run_with_progress(
494 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}",
495 | task=annotation_task
496 | )
497 |
498 | # Fill in missing timestamps and adjust word chunk length
499 | transcription_result['segments'] = fill_missing_timestamps(
500 | transcription_result['segments']
501 | )
502 | transcription_result = self.adjust_word_chunk_length(
503 | transcription_result
504 | )
505 |
506 | # Create result dict and append transcription to it
507 | result = {'transcriptions': {}}
508 | result['transcriptions'][self.file_language] = transcription_result
509 |
510 | if self.verbose:
511 | print(f"{result['transcriptions'][self.file_language]['text']}")
512 |
513 | # Translation task (to English)
514 | if self.translate and self.file_language != 'en':
515 | translation_task = partial(
516 | whisperx_task,
517 | task='translate',
518 | language='en'
519 | )
520 | translation_result = little_helper.run_with_progress(
521 | description=f"[dark_blue]→ Translating ({'CUDA' if self.device == 'cuda:0' else 'CPU'}) [bold]{filepath.name}",
522 | task=translation_task
523 | )
524 |
525 | # Speaker annotation
526 | if self.annotate:
527 | annotation_task = partial(
528 | whisperx_annotation,
529 | translation_result
530 | )
531 | translation_result = little_helper.run_with_progress(
532 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}",
533 | task=annotation_task
534 | )
535 |
536 | # Fill in missing timestamps and adjust word chunk length
537 | translation_result['segments'] = fill_missing_timestamps(
538 | translation_result['segments']
539 | )
540 | translation_result = self.adjust_word_chunk_length(
541 | translation_result
542 | )
543 | result['transcriptions']['en'] = translation_result
544 |
545 | if self.verbose:
546 | print(f"{result['transcriptions']['en']['text']}")
547 |
548 | # Create full transcription with speaker annotation
549 | if self.annotate:
550 | result = self.create_text_with_speakers(result)
551 |
552 | logging.info(f"👨💻 Transcription completed in {time.time() - t_start:.2f} sec.")
553 |
554 | return {'transcription': result}
555 |
556 | def transcribe_with_insane_whisper(self, filepath: Path) -> dict:
557 | """
558 | Transcribes a file using the 'insanely-fast-whisper' implementation:
559 | https://github.com/Vaibhavs10/insanely-fast-whisper
560 |
561 | This method utilizes the 'insanely-fast-whisper' implementation of
562 | OpenAI Whisper for automatic speech recognition on Mac M1-M4 devices.
563 |
564 | Parameters:
565 | - filepath (Path): The path to the audio file for transcription.
566 |
567 | Returns:
568 | - dict: A dictionary containing the transcription result and, if
569 | speaker detection is enabled, the speaker diarization result.
570 | The transcription result includes the recognized text and
571 | timestamps if available.
572 | """
573 | import torch
574 | from transformers import pipeline
575 | from transformers import logging as hf_logger
576 | from whisply import diarize_utils
577 |
578 | hf_logger.set_verbosity_error()
579 |
580 | def insane_whisper_annotation(transcription_result: dict) -> dict:
581 | # Speaker annotation
582 | annotation_result = diarize_utils.diarize(
583 | transcription_result,
584 | diarization_model='pyannote/speaker-diarization-3.1',
585 | hf_token=self.hf_token,
586 | file_name=str(filepath),
587 | num_speakers=self.num_speakers,
588 | min_speakers=None,
589 | max_speakers=None,
590 | )
591 | # Transform annotation_result to correct dict structure
592 | transcription_result = self.to_transcription_dict(annotation_result)
593 | return transcription_result
594 |
595 | # Start and time transcription
596 | logging.info(
597 | f"👨💻 Transcription started with 🚅 insane-whisper for {filepath.name}"
598 | )
599 | t_start = time.time()
600 |
601 | try:
602 | pipe = pipeline(
603 | 'automatic-speech-recognition',
604 | model = self.model,
605 | torch_dtype = torch.float16,
606 | device = self.device,
607 | model_kwargs = {
608 | 'attn_implementation': 'eager'
609 | }
610 | )
611 |
612 | # Define transcription function
613 | def transcription_task():
614 | transcription_result = pipe(
615 | str(filepath),
616 | batch_size=1,
617 | return_timestamps='word',
618 | generate_kwargs={
619 | 'use_cache': True,
620 | 'return_legacy_cache': False,
621 | 'language': self.file_language,
622 | 'task': "transcribe",
623 | 'forced_decoder_ids': None
624 | }
625 | )
626 | return transcription_result
627 |
628 | # Transcription
629 | transcription_result = little_helper.run_with_progress(
630 | description=f"[cyan]→ Transcribing ({self.device.upper()}) [bold]{filepath.name}",
631 | task=transcription_task
632 | )
633 |
634 | # Speaker annotation
635 | if self.annotate:
636 | transcription_result = little_helper.run_with_progress(
637 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}",
638 | task=partial(
639 | insane_whisper_annotation,
640 | transcription_result
641 | )
642 | )
643 |
644 | # Adjust word chunk length
645 | transcription_result = self.to_whisperx(transcription_result)
646 | transcription_result = self.adjust_word_chunk_length(transcription_result)
647 |
648 | # Build result dict
649 | result = {'transcriptions': {}}
650 | result['transcriptions'] = {
651 | self.file_language: transcription_result
652 | }
653 |
654 | if self.verbose:
655 | print(result['transcriptions'][self.file_language]['text'])
656 |
657 | # Translation
658 | if self.translate and self.file_language != 'en':
659 | def translation_task():
660 | translation_result = pipe(
661 | str(filepath),
662 | batch_size=1,
663 | return_timestamps='word',
664 | generate_kwargs={
665 | 'use_cache': True,
666 | 'return_legacy_cache': False,
667 | 'task': 'translate',
668 | # 'forced_decoder_ids': None
669 | }
670 | )
671 | return translation_result
672 |
673 | # Run the translation task
674 | translation_result = little_helper.run_with_progress(
675 | description=f"[dark_blue]→ Translating ({self.device.upper()}) [bold]{filepath.name}",
676 | task=translation_task
677 | )
678 |
679 | # Speaker annotation
680 | if self.annotate:
681 | translation_result = little_helper.run_with_progress(
682 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}",
683 | task=partial(
684 | insane_whisper_annotation,
685 | translation_result
686 | )
687 | )
688 |
689 | # Adjust word chunk length
690 | translation_result = self.to_whisperx(translation_result)
691 | translation_result = self.adjust_word_chunk_length(translation_result)
692 |
693 | result['transcriptions']['en'] = translation_result
694 |
695 | if self.verbose:
696 | print(result['transcriptions']['en']['text'])
697 |
698 | if self.annotate:
699 | # Create full transcription with speaker annotation
700 | result = self.create_text_with_speakers(result)
701 |
702 | except Exception as e:
703 | print(f'{e}')
704 |
705 | # Stop timing transcription
706 | logging.info(f"👨💻 Transcription completed in {time.time() - t_start:.2f} sec.")
707 |
708 | return {'transcription': result}
709 |
710 | def transcribe_with_faster_whisper(
711 | self,
712 | filepath: Path,
713 | num_workers: int = 1
714 | ) -> dict:
715 |         """
716 |         Transcribes an audio file using the 'faster-whisper' implementation:
717 |         https://github.com/SYSTRAN/faster-whisper
718 | 
719 |         This method uses the 'faster-whisper' implementation of OpenAI
720 |         Whisper for automatic speech recognition. It loads the model into a
721 |         batched inference pipeline and transcribes with word-level
722 |         timestamps. The resulting segments are formatted into timestamped
723 |         chunks and joined into a single text. If translation is enabled and
724 |         the source language is not English, a second 'translate' pass adds
725 |         an English version.
726 | 
727 |         Parameters:
728 |         - filepath (Path): The path to the audio file for transcription.
729 |         - num_workers (int): The number of workers to use for transcription.
730 | 
731 |         Returns:
732 |         - dict: A dictionary with the transcription (and, if enabled, the
733 |             English translation) keyed by language. Each entry includes the
734 |             recognized text and segmented chunks with word-level timestamps.
735 |         """
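    |         # Roughly, the returned structure looks like:
    |         # {'transcription': {'transcriptions': {
    |         #     '<lang>': {'text': '...', 'chunks': [
    |         #         {'timestamp': (0.0, 4.2), 'text': '...', 'words': [...]}]},
    |         #     'en': {...}   # only present when translation is enabled
    |         # }}}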
736 | from faster_whisper import WhisperModel, BatchedInferencePipeline
737 |
738 | # Start and time transcription
739 |         logging.info(f"👨‍💻 Transcription started with 🏃‍♀️➡️ faster-whisper for {filepath.name}")
740 | t_start = time.time()
741 |
742 | # Load model and set parameters
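    |         # On 'mps' and 'cpu' the model runs on the CPU with int8 quantization;
    |         # on CUDA it runs with float16.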
743 | model = BatchedInferencePipeline(
744 | model = WhisperModel(
745 | self.model,
746 | device='cpu' if self.device in ['mps', 'cpu'] else 'cuda',
747 | num_workers=num_workers,
748 | compute_type='int8' if self.device in ['mps', 'cpu'] else 'float16'
749 | ))
750 |
751 | # Define the transcription task
752 | def transcription_task():
753 | segments, _ = model.transcribe(
754 | str(filepath),
755 | beam_size=5,
756 | language=self.file_language,
757 | word_timestamps=True,
758 | batch_size=16
759 | )
760 |
761 | chunks = []
762 | for segment in segments:
763 | seg = {
764 | 'timestamp': (float(f"{segment.start:.2f}"), float(f"{segment.end:.2f}")),
765 | 'text': segment.text.strip(),
766 | 'words': [{
767 | 'word': i.word.strip(),
768 | 'start': float(f"{i.start:.2f}"),
769 | 'end': float(f"{i.end:.2f}"),
770 | 'score': float(f"{i.probability:.2f}")
771 | } for i in segment.words]
772 | }
773 | chunks.append(seg)
774 |
775 |                 # Print the segment text as it is produced if the verbose flag is set
776 | if self.verbose:
777 | print(seg['text'])
778 |
779 | return chunks
780 |
781 | # Add progress bar and run the transcription task
782 | chunks = little_helper.run_with_progress(
783 | description=f"[cyan]→ Transcribing ({self.device.upper()}) [bold]{filepath.name}",
784 | task=transcription_task
785 | )
786 |
787 |         # Create the result dict and add the transcription to it
788 | result = {'transcriptions': {}}
789 | result['transcriptions'][self.file_language] = {
790 | 'text': ' '.join([segment['text'].strip() for segment in chunks]),
791 | 'chunks': chunks
792 | }
793 |
794 | # Translation
795 | if self.translate and self.file_language != 'en':
796 | # Define the translation task
797 | def translation_task():
798 | segments, _ = model.transcribe(
799 | str(filepath),
800 | beam_size=5,
801 | task='translate',
802 |                     language=self.file_language,
803 | word_timestamps=True
804 | )
805 |
806 | translation_chunks = []
807 | for segment in segments:
808 | seg = {
809 | 'timestamp': (float(f"{segment.start:.2f}"), float(f"{segment.end:.2f}")),
810 | 'text': segment.text.strip(),
811 | 'words': [{
812 | 'word': i.word.strip(),
813 | 'start': float(f"{i.start:.2f}"),
814 | 'end': float(f"{i.end:.2f}"),
815 | 'score': float(f"{i.probability:.2f}")
816 | } for i in segment.words]
817 | }
818 | translation_chunks.append(seg)
819 |
820 |                     # Print the segment text as it is produced if the verbose flag is set
821 | if self.verbose:
822 | print(seg['text'])
823 |
824 | return translation_chunks
825 |
826 | # Add progress bar and run the translation task
827 | translation_chunks = little_helper.run_with_progress(
828 | description=f"[dark_blue]→ Translating ({self.device.upper()}) [bold]{filepath.name}",
829 | task=translation_task
830 | )
831 |
832 | # Add translation to result dict
833 | result['transcriptions']['en'] = {
834 | 'text': ' '.join([segment['text'].strip() for segment in translation_chunks]),
835 | 'chunks': translation_chunks
836 | }
837 |
838 | # Stop timing transcription
839 |         logging.info(f"👨‍💻 Transcription completed in {time.time() - t_start:.2f} sec.")
840 |
841 | return {'transcription': result}
842 |
843 | def detect_language(self, filepath, audio_array) -> str:
844 | """
845 | Detects the language of the input file.
846 | """
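    |         # Runs faster-whisper language detection on the decoded audio array
    |         # and stores the detected language in self.file_language.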
847 | from faster_whisper import WhisperModel
848 |
849 | logging.info(f"Detecting language of file: {filepath.name}")
850 |
851 | def run_language_detection():
852 | lang_detection_model = WhisperModel(
853 | models.set_supported_model(
854 | model=self.model_provided,
855 | implementation='faster-whisper',
856 | translation=self.translate
857 | ),
858 | device='cpu' if self.device in ['mps', 'cpu'] else 'cuda',
859 | compute_type='int8' if self.device in ['mps', 'cpu'] else 'float16'
860 | )
861 | lang, score, _ = lang_detection_model.detect_language(audio_array)
862 | return lang, score
863 |
864 | lang, score = little_helper.run_with_progress(
865 | description=f"[dark_goldenrod]→ Detecting language for [bold]{filepath.name}",
866 | task=run_language_detection
867 | )
868 |
869 | self.file_language = lang
870 |
871 | print(f'[blue1]→ Detected language "{lang}" with probability {score:.2f}')
872 | logging.info(f'Detected language → "{lang}" with probability {score:.2f}')
873 |
874 | def process_files(self, files) -> None:
875 | """
876 | Processes a list of audio files for transcription and/or diarization.
877 |
878 | This method logs the processing parameters, extracts filepaths from the
879 | input list, and initializes an empty list for storing results. Each
880 | file is processed based on the compute device specified ('mps', 'cuda:0',
881 | or 'cpu'). Appropriate transcription method is chosen based on the
882 | device. Results, including file ids, paths, transcriptions, and
883 | diarizations, are stored in a dictionary and saved to a designated
884 | output directory. Each result is also appended to `self.processed_files`.
885 |
886 | Parameters:
887 | files (list of str): A list of file paths or file-like objects
888 | representing the audio files to be processed.
889 | """
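    |         # Each processed file roughly yields a result dict of the form:
    |         # {'id': 'file_001', 'created': '<timestamp>', 'input_filepath': '...',
    |         #  'output_filepath': '...', 'written_files': [...], 'device': '...',
    |         #  'model': '...', 'transcription': {'<lang>': {...}, 'en': {...}}}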
890 | logging.info(f"Provided parameters for processing: {self.metadata}")
891 |
892 | # Get filepaths
893 | filepath_handler = FilePathProcessor(self.file_formats)
894 | [filepath_handler.get_filepaths(f) for f in files]
895 | self.filepaths = filepath_handler.filepaths
896 |
897 | # Process filepaths
898 | logging.info(f"Processing files: {self.filepaths}")
899 |
900 | self.processed_files = []
901 | for idx, filepath in enumerate(self.filepaths):
902 |
903 | # Create and set output_dir and output_filepath
904 | self.output_dir = little_helper.set_output_dir(filepath, self.base_dir)
905 | output_filepath = self.output_dir / Path(filepath).stem
906 |
907 | # Convert file format
908 | filepath, audio_array = little_helper.check_file_format(
909 | filepath=filepath,
910 | del_originals=self.del_originals
911 | )
912 |
913 | # Detect file language
914 | if not self.file_language:
915 | self.detect_language(filepath, audio_array)
916 |
917 | logging.info(f"Transcribing file: {filepath.name}")
918 |
919 | # Transcription and speaker annotation
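    |             # Backend selection: 'mps' → insanely-fast-whisper; on 'cpu' or
    |             # 'cuda:0', whisperX is used when annotation or subtitles are
    |             # requested, otherwise faster-whisper handles plain transcription.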
920 | if self.device == 'mps':
921 | self.model = models.set_supported_model(
922 | self.model_provided,
923 | implementation='insane-whisper',
924 | translation=self.translate
925 | )
926 | print(f'[blue1]→ Using {self.device.upper()} and 🚅 Insanely-Fast-Whisper with model "{self.model}"')
927 | result_data = self.transcribe_with_insane_whisper(filepath)
928 |
929 | elif self.device in ['cpu', 'cuda:0']:
930 | if self.annotate or self.subtitle:
931 | # WhisperX for annotation / subtitling
932 | self.model = models.set_supported_model(
933 | self.model_provided,
934 | implementation='whisperx',
935 | translation=self.translate
936 | )
937 | print(f'[blue1]→ Using {self.device.upper()} and whisper🆇 with model "{self.model}"')
938 | result_data = self.transcribe_with_whisperx(filepath)
939 | else:
940 | # Faster-Whisper for raw transcription
941 | self.model = models.set_supported_model(
942 | self.model_provided,
943 | implementation='faster-whisper',
944 | translation=self.translate
945 | )
946 |                     print(f'[blue1]→ Using {self.device.upper()} and 🏃‍♀️➡️ Faster-Whisper with model "{self.model}"')
947 | result_data = self.transcribe_with_faster_whisper(filepath)
948 |
949 | result = {
950 |                 'id': f'file_{idx + 1:03d}',
951 | 'created': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
952 | 'input_filepath': str(filepath.absolute()),
953 | 'output_filepath': str(Path(output_filepath).absolute()),
954 | 'written_files': None,
955 | 'device': self.device,
956 | 'model': self.model,
957 | 'transcription': result_data['transcription']['transcriptions'],
958 | }
959 |
960 | # Save results
961 | result['written_files'] = output_utils.OutputWriter(
962 | corrections=self.corrections
963 | ).save_results(
964 | result=result,
965 | export_formats=self.export_formats
966 | )
967 |
968 | self.processed_files.append(result)
969 |
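    |             # Reset the language when it was not provided by the user so that
    |             # detection runs again for the next file.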
970 | if not self.file_language_provided:
971 | self.file_language = None
972 |
--------------------------------------------------------------------------------