├── .gitignore ├── LICENSE ├── README.md ├── assets ├── whisply.png └── whisply_app.png ├── examples ├── example_config.json └── post_correction.yaml ├── pyproject.toml ├── setup.cfg └── whisply ├── __init__.py ├── app.py ├── app_helpers.py ├── cli.py ├── diarize_utils.py ├── download_utils.py ├── little_helper.py ├── models.py ├── output_templates.py ├── output_utils.py ├── post_correction.py └── transcription.py /.gitignore: -------------------------------------------------------------------------------- 1 | /dev 2 | /downloads* 3 | /transcriptions 4 | /files* 5 | /legacy 6 | /venv* 7 | .vscode 8 | .DS_Store 9 | *.log -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2025 Universitätsbibliothek Mannheim 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # whisply 2 | 3 | [![PyPI version](https://badge.fury.io/py/whisply.svg)](https://badge.fury.io/py/whisply) 4 | 5 | 6 | 7 | *Transcribe, translate, annotate and subtitle audio and video files with OpenAI's [Whisper](https://github.com/openai/whisper) ... fast!* 8 | 9 | `whisply` combines [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) to offer an easy-to-use solution for batch processing files on Windows, Linux and Mac. It also enables word-level speaker annotation by integrating [whisperX](https://github.com/m-bain/whisperX) and [pyannote](https://github.com/pyannote/pyannote-audio). 
10 | 11 | ## Table of contents 12 | 13 | - [Features](#features) 14 | - [Requirements](#requirements) 15 | - [Installation](#installation) 16 | - [Install `ffmpeg`](#install-ffmpeg) 17 | - [Installation with `pip`](#installation-with-pip) 18 | - [Installation from `source`](#installation-from-source) 19 | - [Nvidia GPU fix for Linux users (March 2025)](#nvidia-gpu-fix-for-linux-users-march-2025) 20 | - [Usage](#usage) 21 | - [CLI](#cli) 22 | - [App](#app) 23 | - [Speaker annotation and diarization](#speaker-annotation-and-diarization) 24 | - [Requirements](#requirements-1) 25 | - [How speaker annotation works](#how-speaker-annotation-works) 26 | - [Post correction](#post-correction) 27 | - [Batch processing](#batch-processing) 28 | - [Using config files for batch processing](#using-config-files-for-batch-processing) 29 | - [Citation](#citation) 30 | 31 | ## Features 32 | 33 | * 🚴‍♂️ **Performance**: `whisply` selects the fastest Whisper implementation based on your hardware: 34 | * CPU/GPU (Nvidia CUDA): `fast-whisper` or `whisperX` 35 | * MPS (Apple M1-M4): `insanely-fast-whisper` 36 | 37 | * ⏩ **large-v3-turbo Ready**: Support for [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) on all devices. **Note**: Subtitling and annotations on CPU/GPU use `whisperX` for accurate timestamps, but `whisper-large-v3-turbo` isn’t currently available for `whisperX`. 38 | 39 | * ✅ **Auto Device Selection**: `whisply` automatically chooses `faster-whisper` (CPU) or `insanely-fast-whisper` (MPS, Nvidia GPUs) for transcription and translation unless a specific `--device` option is passed. 40 | 41 | * 🗣️ **Word-level Annotations**: Enabling `--subtitle` or `--annotate` uses `whisperX` or `insanely-fast-whisper` for word segmentation and speaker annotations. `whisply` approximates missing timestamps for numeric words. 42 | 43 | * 💬 **Customizable Subtitles**: Specify words per subtitle block (e.g., "5") to generate `.srt` and `.webvtt` files with fixed word counts and timestamps. 44 | 45 | * 🧺 **Batch Processing**: Handle single files, folders, URLs, or lists via `.list` documents. See the [Batch processing](#batch-processing) section for details. 46 | 47 | * 👩‍💻 **CLI / App**: `whisply` can be run directly from CLI or as an app with a graphical user-interface (GUI). 48 | 49 | * ⚙️ **Export Formats**: Supports `.json`, `.txt`, `.txt (annotated)`, `.srt`, `.webvtt`, `.vtt`, `.rttm` and `.html` (compatible with [noScribe's editor](https://github.com/kaixxx/noScribe)). 50 | 51 | ## Requirements 52 | 53 | * [FFmpeg](https://ffmpeg.org/) 54 | * \>= Python3.10 55 | * GPU processing requires: 56 | * Nvidia GPU (CUDA: cuBLAS and cuDNN for CUDA 12) 57 | * Apple Metal Performance Shaders (MPS) (Mac M1-M4) 58 | * Speaker annotation requires a [HuggingFace Access Token](https://huggingface.co/docs/hub/security-tokens) 59 | 60 | ## Installation 61 | 62 | ### Install `ffmpeg` 63 | 64 | ```shell 65 | # --- macOS --- 66 | brew install ffmpeg 67 | 68 | # --- Linux --- 69 | sudo apt-get update 70 | sudo apt-get install ffmpeg 71 | 72 | # --- Windows --- 73 | winget install Gyan.FFmpeg 74 | ``` 75 | 76 | For more information you can visit the [FFmpeg website](https://ffmpeg.org/download.html). 77 | 78 | ### Installation with `pip` 79 | 80 | 1. Create a Python virtual environment 81 | 82 | ```shell 83 | python3 -m venv venv 84 | ``` 85 | 86 | 2. 
Activate the environment 87 | 88 | ```shell 89 | # --- Linux & macOS --- 90 | source venv/bin/activate 91 | 92 | # --- Windows --- 93 | venv\Scripts\activate 94 | ``` 95 | 96 | 3. Install whisply 97 | 98 | ```shell 99 | pip install whisply 100 | ``` 101 | 102 | ### Installation from `source` 103 | 104 | 1. Clone this repository 105 | 106 | ```shell 107 | git clone https://github.com/tsmdt/whisply.git 108 | ``` 109 | 110 | 2. Change to project folder 111 | 112 | ```shell 113 | cd whisply 114 | ``` 115 | 116 | 3. Create a Python virtual environment 117 | 118 | ```shell 119 | python3 -m venv venv 120 | ``` 121 | 122 | 4. Activate the Python virtual environment 123 | 124 | ```shell 125 | # --- Linux & macOS --- 126 | source venv/bin/activate 127 | 128 | # --- Windows --- 129 | venv\Scripts\activate 130 | ``` 131 | 132 | 5. Install whisply 133 | 134 | ```shell 135 | pip install . 136 | ``` 137 | 138 | ### Nvidia GPU fix (May 2025) 139 |
140 | **Could not load library libcudnn_ops.so.9** 141 |
If you use whisply with an Nvidia GPU and encounter this error:

142 | 143 | ```shell 144 | Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so} 145 | ``` 146 | 147 | Use the following steps to fix the issue: 148 | 149 | 1. In your activated python environment run `pip list` and check that `torch==2.7.0` and `torchaudio==2.7.0` are installed. 150 | 2. If yes, run `pip install ctranslate2==4.5.0`. 151 | 3. Export the following environment variable to your shell: 152 | 153 | ```shell 154 | export LD_LIBRARY_PATH="$(python - <<'PY' 155 | import importlib.util, pathlib, sys 156 | spec = importlib.util.find_spec('nvidia.cudnn') 157 | print(pathlib.Path(spec.origin).parent / 'lib') 158 | PY 159 | ):${LD_LIBRARY_PATH}" 160 | ``` 161 | 162 | 4. To make the change permanent, run this bash command while your python environment is activated: 163 | 164 | ```shell 165 | printf '\n# --- add cuDNN wheel dir ---\nexport LD_LIBRARY_PATH="$(python - <<'"'"'PY'"'"'\nimport importlib.util, pathlib, sys\nprint(pathlib.Path(importlib.util.find_spec("nvidia.cudnn").origin).parent / "lib")\nPY\n):${LD_LIBRARY_PATH}"\n' >> "$VIRTUAL_ENV/bin/activate" 166 | ``` 167 | 168 | Finally, deactivate the environment and reactivate it to apply the changes. 169 | 170 | Find additional information at faster-whisper's GitHub page. 171 |
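As a quick sanity check after reactivating the environment, the snippet below prints the directory that the `export` above resolves and the cuDNN libraries it contains (a minimal sketch; it assumes the `nvidia-cudnn` wheel pulled in by `torch` is present in the active environment):

```python
# Sketch: show the cuDNN wheel directory that LD_LIBRARY_PATH must contain
# and the libcudnn_ops files found there (assumes the nvidia-cudnn wheel
# installed alongside torch is available in the active environment).
import importlib.util
import pathlib

spec = importlib.util.find_spec("nvidia.cudnn")
lib_dir = pathlib.Path(spec.origin).parent / "lib"
print(lib_dir)
print(sorted(p.name for p in lib_dir.glob("libcudnn_ops*")))
```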
172 | 173 | ## Usage 174 | 175 | ### CLI 176 | 177 | ```shell 178 | $ whisply 179 | 180 | Usage: whisply [OPTIONS] 181 | 182 | WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast! 183 | 184 | ╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 185 | │ --files -f TEXT Path to file, folder, URL or .list to process. [default: None] │ 186 | │ --output_dir -o DIRECTORY Folder where transcripts should be saved. [default: transcriptions] │ 187 | │ --device -d [auto|cpu|gpu|mps] Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4). │ 188 | │ [default: auto] │ 189 | │ --model -m TEXT Whisper model to use (List models via --list_models). [default: large-v3-turbo] │ 190 | │ --lang -l TEXT Language of provided file(s) ("en", "de") (Default: auto-detection). │ 191 | │ [default: None] │ 192 | │ --annotate -a Enable speaker annotation (Saves .rttm | Default: False). │ 193 | │ --num_speakers -num INTEGER Number of speakers to annotate (Default: auto-detection). [default: None] │ 194 | │ --hf_token -hf TEXT HuggingFace Access token required for speaker annotation. [default: None] │ 195 | │ --subtitle -s Create subtitles (Saves .srt, .vtt and .webvtt | Default: False). │ 196 | │ --sub_length INTEGER Subtitle segment length in words. [default: 5] │ 197 | │ --translate -t Translate transcription to English (Default: False). │ 198 | │ --export -e [all|json|txt|rttm|vtt|webvtt|srt|html] Choose the export format. [default: all] │ 199 | │ --verbose -v Print text chunks during transcription (Default: False). │ 200 | │ --del_originals -del Delete original input files after file conversion. (Default: False) │ 201 | │ --config PATH Path to configuration file. [default: None] │ 202 | │ --post_correction -post PATH Path to YAML file for post-correction. [default: None] │ 203 | │ --launch_app -app Launch the web app instead of running standard CLI commands. │ 204 | │ --list_models List available models. │ 205 | │ --install-completion Install completion for the current shell. │ 206 | │ --show-completion Show completion for the current shell, to copy it or customize the installation. │ 207 | │ --help Show this message and exit. │ 208 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 209 | ``` 210 | 211 | ### App 212 | 213 | Instead of running `whisply` from the CLI you can start the web app instead: 214 | 215 | ```shell 216 | $ whisply --launch_app 217 | ``` 218 | 219 | or: 220 | 221 | ```shell 222 | $ whisply -app 223 | ``` 224 | 225 | Open the local URL in your browser after starting the app (**Note**: The URL might differ from system to system): 226 | 227 | ```shell 228 | * Running on local URL: http://127.0.0.1:7860 229 | ``` 230 | 231 |
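The app can also be started from a Python session instead of the CLI; a minimal sketch using the `create_gradio_interface()` factory defined in `whisply/app.py` (shown further below):

```python
# Minimal sketch: launch the whisply web app from Python instead of the CLI.
# create_gradio_interface() builds and returns the Gradio Blocks app
# (see whisply/app.py below); launch() serves it on a local URL.
from whisply.app import create_gradio_interface

app = create_gradio_interface()
app.launch()
```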

232 | ![whisply app](assets/whisply_app.png) 233 |

234 | 235 | 236 | ### Speaker annotation and diarization 237 | 238 | #### Requirements 239 | 240 | In order to annotate speakers using `--annotate` you need to provide a valid [HuggingFace](https://huggingface.co) access token using the `--hf_token` option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the `pyannote` segmentation model. 241 | 242 | For detailed instructions, refer to the *Requirements* section on the [pyannote model page on HuggingFace](https://huggingface.co/pyannote/speaker-diarization-3.1#requirements) and make sure that you complete steps *"2. Accept pyannote/segmentation-3.0 user conditions"*, *"3. Accept pyannote/speaker-diarization-3.1 user conditions"* and *"4. Create access token at hf.co/settings/tokens"*. 243 | 244 | #### How speaker annotation works 245 | 246 | `whisply` uses [whisperX](https://github.com/m-bain/whisperX) for speaker diarization and annotation. Instead of returning chunk-level timestamps like the standard `Whisper` implementation `whisperX` is able to return word-level timestamps as well as annotating speakers word by word, thus returning much more precise annotations. 247 | 248 | Out of the box `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"): `whisply` fixes those instances through timestamp approximation. Other known limitations of `whisperX` include: 249 | 250 | * inaccurate speaker diarization if multiple speakers talk at the same time 251 | * to provide word-level timestamps and annotations `whisperX` uses language specific alignment models; out of the box `whisperX` supports these languages: `en, fr, de, es, it, ja, zh, nl, uk, pt`. 252 | 253 | Refer to the [whisperX GitHub page](https://github.com/m-bain/whisperX) for more information. 254 | 255 | ### Post correction 256 | 257 | The `--post_correction` option allows you to correct various transcription errors that you may find in your files. The option takes as argument a `.yaml` file with the following structure: 258 | 259 | ```yaml 260 | # Single word corrections 261 | Gardamer: Gadamer 262 | 263 | # Pattern-based corrections 264 | patterns: 265 | - pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung' 266 | replacement: 'Klaus Tschira Stiftung' 267 | ``` 268 | 269 | - **Single word corrections**: matches single words → `wrong word`: `correct word` 270 | - **Pattern-based corrections**: matches patterns → `(Cira|Cyra|Tira)` will look for `Klaus-Cira-Stiftung`, `Klaus-Cyra-Stiftung` and / or `Klaus-Tira-Stiftung` and replaces it with `Klaus-Tschirra-Stiftung` 271 | 272 | Post correction will be applied to **all** export file formats you choose. 273 | 274 | ### Batch processing 275 | 276 | Instead of providing a file, folder or URL by using the `--files` option you can pass a `.list` with a mix of files, folders and URLs for processing. 277 | 278 | Example: 279 | 280 | ```shell 281 | $ cat my_files.list 282 | 283 | video_01.mp4 284 | video_02.mp4 285 | ./my_files/ 286 | https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo 287 | ``` 288 | 289 | #### Using config files for batch processing 290 | 291 | You can provide a `.json` config file by using the `--config` option which makes batch processing easy. 
An example config looks like this: 292 | 293 | ```markdown 294 | { 295 | "files": "./files/my_files.list", # Path to your files 296 | "output_dir": "./transcriptions", # Output folder where transcriptions are saved 297 | "device": "auto", # AUTO, GPU, MPS or CPU 298 | "model": "large-v3-turbo", # Whisper model to use 299 | "lang": null, # Null for auto-detection or language codes ("en", "de", ...) 300 | "annotate": false, # Annotate speakers 301 | "num_speakers": null, # Number of speakers of the input file (null: auto-detection) 302 | "hf_token": "HuggingFace Access Token", # Your HuggingFace Access Token (needed for annotations) 303 | "subtitle": false, # Subtitle file(s) 304 | "sub_length": 10, # Length of each subtitle block in number of words 305 | "translate": false, # Translate to English 306 | "export": "txt", # Export .txts only 307 | "verbose": false # Print transcription segments while processing 308 | "del_originals": false, # Delete original input files after file conversion 309 | "post_correction": "my_corrections.yaml" # Apply post correction with specified patterns in .yaml 310 | } 311 | ``` 312 | 313 | ## Citation 314 | 315 | ```bibtex 316 | @software{whisply, 317 | author = {Thomas Schmidt, Renat Shigapov}, 318 | title = {Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!}, 319 | year = {2025}, 320 | url = {https://github.com/tsmdt/whisply}, 321 | note = {Accessed: 2025-05-29} 322 | } 323 | ``` 324 | -------------------------------------------------------------------------------- /assets/whisply.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/assets/whisply.png -------------------------------------------------------------------------------- /assets/whisply_app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/assets/whisply_app.png -------------------------------------------------------------------------------- /examples/example_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "files": "./files/my_files.list", 3 | "output_dir": "./transcriptions", 4 | "device": "auto", 5 | "model": "large-v3-turbo", 6 | "lang": null, 7 | "annotate": false, 8 | "hf_token": "hf_.......", 9 | "translate": false, 10 | "subtitle": false, 11 | "sub_length": 10, 12 | "export": "txt", 13 | "verbose": false, 14 | "del_originals": false, 15 | "post_correction": "my_corrections.yaml" 16 | } 17 | -------------------------------------------------------------------------------- /examples/post_correction.yaml: -------------------------------------------------------------------------------- 1 | # Single word corrections 2 | Gardamer: Gadamer 3 | 4 | # Pattern-based corrections 5 | patterns: 6 | - pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung' 7 | replacement: 'Klaus Tschira Stiftung' 8 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools>=61.0"] 3 | build-backend = "setuptools.build_meta" 4 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | name = 
whisply 3 | version = 0.10.5 4 | description = "Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!" 5 | long_description = file: README.md 6 | long_description_content_type = text/markdown 7 | author = Thomas Schmidt, Renat Shigapov 8 | license = MIT 9 | url = https://github.com/tsmdt/whisply 10 | 11 | [options] 12 | packages = find: 13 | python_requires = >=3.10 14 | install_requires = 15 | click<8.2.0 16 | typer==0.15.3 17 | numpy==2.0.2 18 | faster-whisper==1.1.1 19 | ffmpeg-python==0.2.0 20 | optimum==1.24.0 21 | pyannote.audio==3.3.2 22 | whisperx==3.3.4 23 | rich==13.7.1 24 | torch==2.7.0 25 | torchaudio==2.7.0 26 | transformers==4.50.0 27 | validators==0.28.1 28 | yt-dlp==2025.4.30 29 | gradio==5.29.0 30 | 31 | [options.entry_points] 32 | console_scripts = 33 | whisply = whisply.cli:run -------------------------------------------------------------------------------- /whisply/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmdt/whisply/c96865a0942f98b743ef7857b3a0fe2376238175/whisply/__init__.py -------------------------------------------------------------------------------- /whisply/app.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | import os 3 | import shutil 4 | import zipfile 5 | import tempfile 6 | 7 | from datetime import datetime 8 | from pathlib import Path 9 | from whisply import output_utils 10 | from whisply.app_helpers import CSS, LANGUAGES 11 | 12 | 13 | def zip_files(file_paths: list[str]) -> str: 14 | """ 15 | Given a list of file paths, zip them up into a single archive 16 | and return the path to that archive. 17 | """ 18 | if not file_paths: 19 | return None 20 | 21 | with tempfile.NamedTemporaryFile(delete=False, suffix=".zip") as tmp: 22 | zip_path = tmp.name 23 | 24 | with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf: 25 | for fpath in file_paths: 26 | arcname = os.path.basename(fpath) 27 | zipf.write(fpath, arcname=arcname) 28 | 29 | return zip_path 30 | 31 | 32 | def create_gradio_interface(): 33 | """ 34 | Main gradio interface. 35 | """ 36 | def get_device() -> str: 37 | """ 38 | Determine the computation device based on user preference and 39 | availability. 
40 | """ 41 | import torch 42 | 43 | if torch.cuda.is_available(): 44 | device = 'cuda:0' 45 | elif torch.backends.mps.is_available(): 46 | device = 'mps' 47 | else: 48 | device = 'cpu' 49 | return device 50 | 51 | def transcribe(file, model, device, language, options, hf_token, sub_length): 52 | from whisply.transcription import TranscriptionHandler 53 | from whisply import little_helper, models 54 | 55 | if not options: 56 | options = [] 57 | annotate = 'Annotate Speakers' in options 58 | translate = 'Translate to English' in options 59 | subtitle = 'Generate Subtitles' in options 60 | 61 | if (annotate or subtitle) and not hf_token: 62 | hf_token = os.getenv('HF_TOKEN') 63 | if not hf_token: 64 | yield 'A HuggingFace Access Token is required for annotation or subtitling: https://huggingface.co/docs/hub/security-tokens', None 65 | return 66 | 67 | if file is None: 68 | yield "Please upload a file.", None 69 | return 70 | 71 | # If file is not a list, make it a list 72 | if not isinstance(file, list): 73 | file = [file] 74 | 75 | # Start the progress bar 76 | progress = gr.Progress() 77 | progress(0) 78 | 79 | try: 80 | # Total steps calculation 81 | steps_per_file = 5 # Number of steps per file 82 | total_steps = steps_per_file * len(file) 83 | current_step = 0 84 | 85 | # Save the uploaded file to a temporary directory 86 | temp_dir = './app_uploads' 87 | os.makedirs(temp_dir, exist_ok=True) 88 | 89 | temp_file_paths = [] 90 | for uploaded_file in file: 91 | # Get the base name of the file to avoid issues with absolute paths 92 | temp_file_name = os.path.basename(uploaded_file.name) 93 | temp_file_path = os.path.join(temp_dir, temp_file_name) 94 | 95 | # Copy the file from Gradio's temp directory to our local directory 96 | shutil.copyfile(uploaded_file.name, temp_file_path) 97 | temp_file_paths.append(temp_file_path) 98 | 99 | # Adjust the device based on user selection 100 | if device == 'auto': 101 | device_selected = get_device() 102 | elif device == 'gpu': 103 | import torch 104 | if torch.cuda.is_available(): 105 | device_selected = 'cuda:0' 106 | else: 107 | print("→ CUDA is not available. 
Falling back to auto device selection.") 108 | device_selected = get_device() 109 | else: 110 | device_selected = device 111 | 112 | # Handle export formats 113 | export_formats_map = { 114 | 'standard': ['json', 'txt'], 115 | 'annotate': ['rttm', 'txt', 'json', 'html'], 116 | 'subtitle': ['vtt', 'webvtt', 'srt', 'txt', 'json'], 117 | 'translate': ['txt', 'json'] 118 | } 119 | 120 | export_formats_list = set(export_formats_map['standard']) 121 | 122 | if annotate: 123 | export_formats_list.update(export_formats_map['annotate']) 124 | if subtitle: 125 | export_formats_list.update(export_formats_map['subtitle']) 126 | if translate: 127 | export_formats_list.update(export_formats_map['translate']) 128 | 129 | export_formats_list = list(export_formats_list) 130 | 131 | # Create an instance of TranscriptionHandler with the provided parameters 132 | handler = TranscriptionHandler( 133 | base_dir='./app_transcriptions', 134 | model=model, 135 | device=device_selected, 136 | file_language=None if language == 'auto' else language, 137 | annotate=annotate, 138 | translate=translate, 139 | subtitle=subtitle, 140 | sub_length=int(sub_length) if subtitle else 5, 141 | hf_token=hf_token, 142 | verbose=False, 143 | export_formats=export_formats_list 144 | ) 145 | 146 | # Initialize processed_files list 147 | handler.processed_files = [] 148 | for idx, filepath in enumerate(temp_file_paths): 149 | filepath = Path(filepath) 150 | 151 | # Update progress 152 | current_step += 1 153 | progress(current_step / total_steps) 154 | 155 | # Create and set output_dir and output_filepath 156 | handler.output_dir = little_helper.set_output_dir(filepath, handler.base_dir) 157 | output_filepath = handler.output_dir / filepath.stem 158 | 159 | # Convert file format 160 | filepath, audio_array = little_helper.check_file_format( 161 | filepath=filepath, 162 | del_originals=False 163 | ) 164 | 165 | # Update progress 166 | current_step += 1 167 | progress(current_step / total_steps) 168 | 169 | # Detect file language 170 | if not handler.file_language: 171 | handler.detect_language(filepath, audio_array) 172 | 173 | # Update progress 174 | current_step += 1 175 | progress(current_step / total_steps) 176 | 177 | # Transcription and speaker annotation 178 | if handler.device == 'mps': 179 | handler.model = models.set_supported_model( 180 | handler.model_provided, 181 | implementation='insane-whisper', 182 | translation=handler.translate 183 | ) 184 | print(f'→ Using {handler.device.upper()} and 🚅 Insanely-Fast-Whisper with model "{handler.model}"') 185 | result_data = handler.transcribe_with_insane_whisper(filepath) 186 | 187 | elif handler.device in ['cpu', 'cuda:0']: 188 | if handler.annotate or handler.subtitle: 189 | handler.model = models.set_supported_model( 190 | handler.model_provided, 191 | implementation='whisperx', 192 | translation=handler.translate 193 | ) 194 | print(f'→ Using {handler.device.upper()} and whisper🆇 with model "{handler.model}"') 195 | result_data = handler.transcribe_with_whisperx(filepath) 196 | else: 197 | handler.model = models.set_supported_model( 198 | handler.model_provided, 199 | implementation='faster-whisper', 200 | translation=handler.translate 201 | ) 202 | print(f'→ Using {handler.device.upper()} and 🏃‍♀️‍➡️ Faster-Whisper with model "{handler.model}"') 203 | result_data = handler.transcribe_with_faster_whisper(filepath) 204 | 205 | # Update progress 206 | current_step += 1 207 | progress(current_step / total_steps) 208 | 209 | result = { 210 | 'id': f'file_00{idx + 1}', 211 | 
'created': datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 212 | 'input_filepath': str(Path(filepath).absolute()), 213 | 'output_filepath': str(Path(output_filepath).absolute()), 214 | 'written_files': None, 215 | 'device': handler.device, 216 | 'model': handler.model, 217 | 'transcription': result_data['transcription']['transcriptions'], 218 | } 219 | 220 | # Save results 221 | result['written_files'] = output_utils.OutputWriter().save_results( 222 | result=result, 223 | export_formats=handler.export_formats 224 | ) 225 | 226 | # Update progress 227 | current_step += 1 228 | progress(current_step / total_steps) 229 | 230 | handler.processed_files.append(result) 231 | 232 | if not handler.file_language_provided: 233 | handler.file_language = None 234 | 235 | except Exception as e: 236 | print(f"→ Error during transcription: {e}") 237 | yield f"Transcription Error: {e}", None 238 | 239 | finally: 240 | progress(100) 241 | 242 | # Get the transcription results 243 | if handler and handler.processed_files: 244 | output_files_set = set() 245 | for processed_file in handler.processed_files: 246 | # Collect the paths of the generated files directly 247 | output_files = processed_file.get('written_files', []) 248 | output_files_set.update(output_files) 249 | 250 | output_files = sorted(list(output_files_set)) 251 | 252 | yield output_files, output_files, gr.update(visible=True) 253 | else: 254 | yield "Transcription Error." 255 | 256 | def toggle_visibility(options): 257 | """ 258 | Updates the visibility of conditional components based on selected options. 259 | """ 260 | show_access_token = 'Annotate Speakers' in options 261 | show_subtitle_length = 'Generate Subtitles' in options 262 | return gr.update(visible=show_access_token), gr.update(visible=show_subtitle_length) 263 | 264 | # Theme 265 | theme = gr.themes.Citrus( 266 | primary_hue="emerald", 267 | neutral_hue="slate", 268 | spacing_size=gr.themes.sizes.spacing_sm, 269 | text_size="md", 270 | radius_size="sm", 271 | font=[gr.themes.GoogleFont('Open Sans', 'Roboto'), 'ui-sans-serif', 'system-ui', 'sans-serif'], 272 | font_mono=['Roboto Mono', 'ui-monospace', 'Consolas', 'monospace'], 273 | ) 274 | 275 | # Build the Gradio Blocks interface 276 | with gr.Blocks(theme=theme, css=CSS) as app: 277 | gr.Markdown("# whisply 💬") 278 | gr.Markdown(""" 279 | Transcribe, translate, annotate, and subtitle audio and video files with \ 280 | OpenAI's Whisper ... fast! 281 | """) 282 | 283 | # File Upload and Model Selection 284 | with gr.Row(): 285 | with gr.Column(): 286 | uploaded_files = gr.File(label="Upload File(s)", file_count='multiple') 287 | with gr.Row(): 288 | model_dropdown = gr.Dropdown( 289 | choices=[ 290 | 'tiny', 291 | 'tiny-en', 292 | 'base', 293 | 'base-en', 294 | 'small', 295 | 'small-en', 296 | 'distil-small-en', 297 | 'medium', 298 | 'medium-en', 299 | 'distil-medium-en', 300 | 'large', 301 | 'large-v2', 302 | 'distil-large-v2', 303 | 'large-v3', 304 | 'distil-large-v3', 305 | 'large-v3-turbo'], 306 | label="Model", 307 | value='large-v3-turbo', 308 | info='Whisper model for the transcription.' 
309 | ) 310 | language_dropdown = gr.Dropdown( 311 | choices=sorted(LANGUAGES.keys()), 312 | label="Language", 313 | value='auto', 314 | info="**auto** = auto-detection" 315 | ) 316 | with gr.Row(): 317 | device_radio = gr.Radio( 318 | choices=['auto', 'cpu', 'gpu', 'mps'], 319 | label="Device", 320 | value='auto', 321 | info="**auto** = auto-detection | **gpu** = Nvidia GPUs | **mps** = Mac M1-M4" 322 | ) 323 | with gr.Row(): 324 | options_checkbox = gr.CheckboxGroup( 325 | choices=['Annotate Speakers', 'Translate to English', 'Generate Subtitles'], 326 | label="Options", 327 | value=[] 328 | ) 329 | with gr.Row(): 330 | access_token_text = gr.Text( 331 | label='HuggingFace Access Token (for annotation and subtitling)', 332 | info="Refer to **README.md** to set up the Access Token correctly.", 333 | value=None, 334 | lines=1, 335 | max_lines=1, 336 | visible=False 337 | ) 338 | with gr.Row(): 339 | subtitle_length_number = gr.Number( 340 | label="Subtitle Length (words)", 341 | value=5, 342 | info="""Subtitle segment length in words. \ 343 | (Example: "10" will result in subtitles where each subtitle block has \ 344 | exactly 10 words)""", 345 | visible=False 346 | ) 347 | 348 | # Event Handler to Toggle Visibility 349 | options_checkbox.change( 350 | toggle_visibility, 351 | inputs=options_checkbox, 352 | outputs=[access_token_text, subtitle_length_number] 353 | ) 354 | 355 | with gr.Column(): 356 | outputs = gr.Files(label="Transcriptions") 357 | transcribed_files_state = gr.State([]) 358 | transcribe_button = gr.Button("Transcribe") 359 | 360 | # Download button 361 | download_all_button = gr.Button( 362 | value="Download Transcripts", 363 | visible=False 364 | ) 365 | download_all_button.click( 366 | zip_files, 367 | outputs=[outputs], 368 | inputs=[transcribed_files_state], 369 | ) 370 | 371 | # Transcribe button 372 | transcribe_button.click( 373 | transcribe, 374 | inputs=[ 375 | uploaded_files, 376 | model_dropdown, 377 | device_radio, 378 | language_dropdown, 379 | options_checkbox, 380 | access_token_text, 381 | subtitle_length_number 382 | ], 383 | outputs=[ 384 | outputs, 385 | transcribed_files_state, 386 | download_all_button # "visible=True" after transcription 387 | ] 388 | ) 389 | 390 | app.queue() 391 | 392 | return app 393 | 394 | def main(): 395 | interface = create_gradio_interface() 396 | interface.launch() 397 | 398 | if __name__ == "__main__": 399 | main() 400 | -------------------------------------------------------------------------------- /whisply/app_helpers.py: -------------------------------------------------------------------------------- 1 | CSS = """ 2 | h1 { 3 | font-size: 36px; 4 | font-weight: 800; 5 | } 6 | 7 | .svelte-1ed2p3z { 8 | text-align: center; 9 | } 10 | """ 11 | 12 | LANGUAGES = { 13 | "auto": "auto-detection", 14 | "en": "english", 15 | "zh": "chinese", 16 | "de": "german", 17 | "es": "spanish", 18 | "ru": "russian", 19 | "ko": "korean", 20 | "fr": "french", 21 | "ja": "japanese", 22 | "pt": "portuguese", 23 | "tr": "turkish", 24 | "pl": "polish", 25 | "ca": "catalan", 26 | "nl": "dutch", 27 | "ar": "arabic", 28 | "sv": "swedish", 29 | "it": "italian", 30 | "id": "indonesian", 31 | "hi": "hindi", 32 | "fi": "finnish", 33 | "vi": "vietnamese", 34 | "he": "hebrew", 35 | "uk": "ukrainian", 36 | "el": "greek", 37 | "ms": "malay", 38 | "cs": "czech", 39 | "ro": "romanian", 40 | "da": "danish", 41 | "hu": "hungarian", 42 | "ta": "tamil", 43 | "no": "norwegian", 44 | "th": "thai", 45 | "ur": "urdu", 46 | "hr": "croatian", 47 | "bg": "bulgarian", 48 
| "lt": "lithuanian", 49 | "la": "latin", 50 | "mi": "maori", 51 | "ml": "malayalam", 52 | "cy": "welsh", 53 | "sk": "slovak", 54 | "te": "telugu", 55 | "fa": "persian", 56 | "lv": "latvian", 57 | "bn": "bengali", 58 | "sr": "serbian", 59 | "az": "azerbaijani", 60 | "sl": "slovenian", 61 | "kn": "kannada", 62 | "et": "estonian", 63 | "mk": "macedonian", 64 | "br": "breton", 65 | "eu": "basque", 66 | "is": "icelandic", 67 | "hy": "armenian", 68 | "ne": "nepali", 69 | "mn": "mongolian", 70 | "bs": "bosnian", 71 | "kk": "kazakh", 72 | "sq": "albanian", 73 | "sw": "swahili", 74 | "gl": "galician", 75 | "mr": "marathi", 76 | "pa": "punjabi", 77 | "si": "sinhala", 78 | "km": "khmer", 79 | "sn": "shona", 80 | "yo": "yoruba", 81 | "so": "somali", 82 | "af": "afrikaans", 83 | "oc": "occitan", 84 | "ka": "georgian", 85 | "be": "belarusian", 86 | "tg": "tajik", 87 | "sd": "sindhi", 88 | "gu": "gujarati", 89 | "am": "amharic", 90 | "yi": "yiddish", 91 | "lo": "lao", 92 | "uz": "uzbek", 93 | "fo": "faroese", 94 | "ht": "haitian creole", 95 | "ps": "pashto", 96 | "tk": "turkmen", 97 | "nn": "nynorsk", 98 | "mt": "maltese", 99 | "sa": "sanskrit", 100 | "lb": "luxembourgish", 101 | "my": "myanmar", 102 | "bo": "tibetan", 103 | "tl": "tagalog", 104 | "mg": "malagasy", 105 | "as": "assamese", 106 | "tt": "tatar", 107 | "haw": "hawaiian", 108 | "ln": "lingala", 109 | "ha": "hausa", 110 | "ba": "bashkir", 111 | "jw": "javanese", 112 | "su": "sundanese", 113 | "yue": "cantonese", 114 | } -------------------------------------------------------------------------------- /whisply/cli.py: -------------------------------------------------------------------------------- 1 | import os 2 | import typer 3 | import warnings 4 | from pathlib import Path 5 | from typing import Optional, List 6 | from rich import print 7 | from whisply import output_utils 8 | from whisply import post_correction as post 9 | from whisply.output_utils import ExportFormats 10 | from whisply.little_helper import DeviceChoice 11 | 12 | warnings.filterwarnings("ignore", category=FutureWarning) 13 | warnings.filterwarnings("ignore", category=UserWarning) 14 | 15 | cli_app = typer.Typer() 16 | 17 | @cli_app.command(no_args_is_help=True) 18 | def main( 19 | files: Optional[List[str]] = typer.Option( 20 | None, 21 | "--files", 22 | "-f", 23 | help="Path to file, folder, URL or .list to process.", 24 | ), 25 | output_dir: Path = typer.Option( 26 | Path("./transcriptions"), 27 | "--output_dir", 28 | "-o", 29 | file_okay=False, 30 | dir_okay=True, 31 | writable=True, 32 | readable=True, 33 | resolve_path=True, 34 | help="Folder where transcripts should be saved.", 35 | ), 36 | device: DeviceChoice = typer.Option( 37 | DeviceChoice.AUTO, 38 | "--device", 39 | "-d", 40 | help="Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4).", 41 | ), 42 | model: str = typer.Option( 43 | "large-v3-turbo", 44 | "--model", 45 | "-m", 46 | help='Whisper model to use (List models via --list_models).', 47 | ), 48 | lang: Optional[str] = typer.Option( 49 | None, 50 | "--lang", 51 | "-l", 52 | help='Language of provided file(s) ("en", "de") (Default: auto-detection).', 53 | ), 54 | annotate: bool = typer.Option( 55 | False, 56 | "--annotate", 57 | "-a", 58 | help="Enable speaker annotation (Saves .rttm | Default: False).", 59 | ), 60 | num_speakers: Optional[int] = typer.Option( 61 | None, 62 | "--num_speakers", 63 | "-num", 64 | help="Number of speakers to annotate (Default: auto-detection).", 65 | ), 66 | hf_token: Optional[str] = typer.Option( 67 | None, 
68 | "--hf_token", 69 | "-hf", 70 | help="HuggingFace Access token required for speaker annotation.", 71 | ), 72 | subtitle: bool = typer.Option( 73 | False, 74 | "--subtitle", 75 | "-s", 76 | help="Create subtitles (Saves .srt, .vtt and .webvtt | Default: False).", 77 | ), 78 | sub_length: int = typer.Option( 79 | 5, 80 | "--sub_length", 81 | help="Subtitle segment length in words." 82 | ), 83 | translate: bool = typer.Option( 84 | False, 85 | "--translate", 86 | "-t", 87 | help="Translate transcription to English (Default: False).", 88 | ), 89 | export_format: ExportFormats = typer.Option( 90 | ExportFormats.ALL, 91 | "--export", 92 | "-e", 93 | help="Choose the export format." 94 | ), 95 | verbose: bool = typer.Option( 96 | False, 97 | "--verbose", 98 | "-v", 99 | help="Print text chunks during transcription (Default: False).", 100 | ), 101 | del_originals: bool = typer.Option( 102 | False, 103 | "--del_originals", 104 | "-del", 105 | help="Delete original input files after file conversion. (Default: False)", 106 | ), 107 | config: Optional[Path] = typer.Option( 108 | None, 109 | "--config", 110 | help="Path to configuration file.", 111 | ), 112 | post_correction: Optional[Path] = typer.Option( 113 | None, 114 | "--post_correction", 115 | "-post", 116 | help="Path to YAML file for post-correction.", 117 | ), 118 | launch_app: bool = typer.Option( 119 | False, 120 | "--launch_app", 121 | "-app", 122 | help="Launch the web app instead of running standard CLI commands.", 123 | ), 124 | list_models: bool = typer.Option( 125 | False, 126 | "--list_models", 127 | help="List available models.", 128 | ) 129 | ): 130 | """ 131 | WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast! 132 | """ 133 | from whisply import little_helper, transcription, models 134 | 135 | # Start the gradio web app 136 | if launch_app: 137 | from whisply.app import main as run_gradio_app 138 | run_gradio_app() 139 | raise typer.Exit() 140 | 141 | # Load configuration from config.json if provided 142 | if config: 143 | config_data = little_helper.load_config(config) 144 | files = files or Path(config_data.get("files")) if config_data.get("files") else files 145 | output_dir = Path(config_data.get("output_dir")) if config_data.get("output_dir") else output_dir 146 | device = DeviceChoice(config_data.get("device", device.value)) 147 | model = config_data.get("model", model) 148 | lang = config_data.get("lang", lang) 149 | annotate = config_data.get("annotate", annotate) 150 | num_speakers = config_data.get("num_speakers", num_speakers) 151 | translate = config_data.get("translate", translate) 152 | hf_token = config_data.get("hf_token", hf_token) 153 | subtitle = config_data.get("subtitle", subtitle) 154 | sub_length = config_data.get("sub_length", sub_length) 155 | verbose = config_data.get("verbose", verbose) 156 | del_originals = config_data.get("del_originals", del_originals) 157 | post_correction = config_data.get("post_correction", post_correction) 158 | 159 | # Print available models 160 | if list_models: 161 | available_models = "Available models:\n... " 162 | available_models += '\n... '.join(models.WHISPER_MODELS.keys()) 163 | print(f"{available_models}") 164 | raise typer.Exit() 165 | 166 | # Check if provided model is available 167 | if not models.ensure_model(model): 168 | msg = f"""→ Model "{model}" is not available.\n→ Available models:\n... """ 169 | msg += '\n... 
'.join(models.WHISPER_MODELS.keys()) 170 | print(f"{msg}") 171 | raise typer.Exit() 172 | 173 | # Check for HuggingFace Access Token if speaker annotation is enabled 174 | if annotate and not hf_token: 175 | hf_token = os.getenv('HF_TOKEN') 176 | if not hf_token: 177 | print('→ Please provide a HuggingFace access token (--hf_token / -hf) to enable speaker annotation.') 178 | raise typer.Exit() 179 | 180 | # Determine the computation device 181 | device_str = little_helper.get_device(device=device) 182 | 183 | # Determine the ExportFormats 184 | export_formats = output_utils.determine_export_formats( 185 | export_format, 186 | annotate, subtitle 187 | ) 188 | 189 | # Load corrections if post_correction is provided 190 | if post_correction: 191 | corrections = post.load_correction_list(post_correction) 192 | 193 | # Instantiate TranscriptionHandler 194 | service = transcription.TranscriptionHandler( 195 | base_dir=output_dir, 196 | device=device_str, 197 | model=model, 198 | file_language=lang, 199 | annotate=annotate, 200 | num_speakers=num_speakers, 201 | translate=translate, 202 | hf_token=hf_token, 203 | subtitle=subtitle, 204 | sub_length=sub_length, 205 | verbose=verbose, 206 | del_originals=del_originals, 207 | corrections=corrections if post_correction else None, 208 | export_formats=export_formats 209 | ) 210 | # Process files 211 | service.process_files(files) 212 | 213 | def run(): 214 | cli_app() 215 | 216 | if __name__ == "__main__": 217 | run() 218 | -------------------------------------------------------------------------------- /whisply/diarize_utils.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import torch 3 | import numpy as np 4 | from torchaudio import functional as F 5 | from pyannote.audio import Pipeline 6 | from transformers.pipelines.audio_utils import ffmpeg_read 7 | 8 | # Code lifted from https://github.com/huggingface/speechbox/blob/main/src/speechbox/diarize.py 9 | # and from https://github.com/m-bain/whisperX/blob/main/whisperx/diarize.py 10 | 11 | def preprocess_inputs(inputs): 12 | if isinstance(inputs, str): 13 | if inputs.startswith("http://") or inputs.startswith("https://"): 14 | # We need to actually check for a real protocol, otherwise it's impossible to use a local file 15 | # like http_huggingface_co.png 16 | inputs = requests.get(inputs).content 17 | else: 18 | with open(inputs, "rb") as f: 19 | inputs = f.read() 20 | 21 | if isinstance(inputs, bytes): 22 | inputs = ffmpeg_read(inputs, 16000) 23 | 24 | if isinstance(inputs, dict): 25 | # Accepting `"array"` which is the key defined in `datasets` for better integration 26 | if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)): 27 | raise ValueError( 28 | "When passing a dictionary to ASRDiarizePipeline, the dict needs to contain a " 29 | '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, ' 30 | "containing the sampling_rate associated with that array" 31 | ) 32 | 33 | _inputs = inputs.pop("raw", None) 34 | if _inputs is None: 35 | # Remove path which will not be used from `datasets`. 
36 | inputs.pop("path", None) 37 | _inputs = inputs.pop("array", None) 38 | in_sampling_rate = inputs.pop("sampling_rate") 39 | inputs = _inputs 40 | if in_sampling_rate != 16000: 41 | inputs = F.resample( 42 | torch.from_numpy(inputs), in_sampling_rate, 16000 43 | ).numpy() 44 | 45 | if not isinstance(inputs, np.ndarray): 46 | raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`") 47 | if len(inputs.shape) != 1: 48 | raise ValueError( 49 | "We expect a single channel audio input for ASRDiarizePipeline" 50 | ) 51 | 52 | # diarization model expects float32 torch tensor of shape `(channels, seq_len)` 53 | diarizer_inputs = torch.from_numpy(inputs).float() 54 | diarizer_inputs = diarizer_inputs.unsqueeze(0) 55 | 56 | return inputs, diarizer_inputs 57 | 58 | def diarize_audio( 59 | diarizer_inputs, 60 | diarization_pipeline, 61 | num_speakers, 62 | min_speakers, 63 | max_speakers 64 | ) -> dict: 65 | diarization = diarization_pipeline( 66 | {"waveform": diarizer_inputs, "sample_rate": 16000}, 67 | num_speakers=num_speakers, 68 | min_speakers=min_speakers, 69 | max_speakers=max_speakers, 70 | ) 71 | 72 | segments = [] 73 | for segment, track, label in diarization.itertracks(yield_label=True): 74 | segments.append( 75 | { 76 | "segment": {"start": segment.start, "end": segment.end}, 77 | "track": track, 78 | "label": label, 79 | } 80 | ) 81 | 82 | # diarizer output may contain consecutive segments from the same 83 | # speaker (e.g. {(0 -> 1, speaker_1), (1 -> 1.5, speaker_1), ...}) 84 | # we combine these segments to give overall timestamps for each 85 | # speaker's turn (e.g. {(0 -> 1.5, speaker_1), ...}) 86 | new_segments = [] 87 | prev_segment = cur_segment = segments[0] 88 | 89 | for i in range(1, len(segments)): 90 | cur_segment = segments[i] 91 | 92 | # check if we have changed speaker ("label") 93 | if cur_segment["label"] != prev_segment["label"] and i < len(segments): 94 | # add the start/end times for the super-segment to the new list 95 | new_segments.append( 96 | { 97 | "segment": { 98 | "start": prev_segment["segment"]["start"], 99 | "end": cur_segment["segment"]["start"], 100 | }, 101 | "speaker": prev_segment["label"], 102 | } 103 | ) 104 | prev_segment = segments[i] 105 | 106 | # add the last segment(s) if there was no speaker change 107 | new_segments.append( 108 | { 109 | "segment": { 110 | "start": prev_segment["segment"]["start"], 111 | "end": cur_segment["segment"]["end"], 112 | }, 113 | "speaker": prev_segment["label"], 114 | } 115 | ) 116 | 117 | return new_segments 118 | 119 | 120 | def post_process_segments_and_transcripts( 121 | new_segments, 122 | transcript, 123 | group_by_speaker 124 | ) -> list: 125 | segmented_preds = [] 126 | transcript_idx = 0 127 | num_chunks = len(transcript) 128 | 129 | # Iterate through each diarization segment and assign transcript chunks 130 | # whose end timestamp falls within the segment 131 | for segment in new_segments: 132 | seg_start = segment["segment"]["start"] 133 | seg_end = segment["segment"]["end"] 134 | segment_chunks = [] 135 | 136 | # Collect transcript chunks until the chunk's end timestamp exceeds 137 | # the diarization segment's end 138 | while (transcript_idx < num_chunks 139 | and 140 | transcript[transcript_idx]["timestamp"][1] <= seg_end): 141 | segment_chunks.append(transcript[transcript_idx]) 142 | transcript_idx += 1 143 | 144 | # If no transcript chunks were found for this segment, continue 145 | # to next segment 146 | if not segment_chunks: 147 | continue 148 | 149 | if group_by_speaker: 
150 | # Combine the text from all transcript chunks within this segment 151 | text = "".join(chunk["text"] for chunk in segment_chunks) 152 | segmented_preds.append({ 153 | "speaker": segment["speaker"], 154 | "text": text, 155 | "timestamp": ( 156 | segment_chunks[0]["timestamp"][0], 157 | segment_chunks[-1]["timestamp"][1] 158 | ) 159 | }) 160 | else: 161 | # Assign the speaker label to each transcript chunk in the segment 162 | for chunk in segment_chunks: 163 | chunk_copy = chunk.copy() 164 | chunk_copy["speaker"] = segment["speaker"] 165 | segmented_preds.append(chunk_copy) 166 | 167 | return segmented_preds 168 | 169 | def diarize(outputs, **kwargs): 170 | diarization_pipeline = Pipeline.from_pretrained( 171 | checkpoint_path=kwargs['diarization_model'], 172 | use_auth_token=kwargs['hf_token'], 173 | ) 174 | diarization_pipeline.to(torch.device("mps")) 175 | 176 | _, diarizer_inputs = preprocess_inputs( 177 | inputs=kwargs['file_name'] 178 | ) 179 | 180 | segments = diarize_audio( 181 | diarizer_inputs, 182 | diarization_pipeline, 183 | kwargs['num_speakers'], 184 | kwargs['min_speakers'], 185 | kwargs['max_speakers'] 186 | ) 187 | 188 | return post_process_segments_and_transcripts( 189 | segments, outputs["chunks"], group_by_speaker=False 190 | ) 191 | -------------------------------------------------------------------------------- /whisply/download_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import logging 3 | import yt_dlp as url_downloader 4 | 5 | from pathlib import Path 6 | from datetime import datetime 7 | 8 | from whisply import little_helper 9 | 10 | 11 | # Set logging configuration 12 | logger = logging.getLogger('download_utils') 13 | logger.setLevel(logging.DEBUG) 14 | 15 | 16 | def download_url(url: str, downloads_dir: Path) -> Path: 17 | """ 18 | Downloads a media file from a specified URL, typically a YouTube URL, extracting audio 19 | in WAV format, and then renames the file based on the media title. 20 | 21 | The function first ensures the downloads directory exists, then initiates a download 22 | using youtube-dl with specific options set for audio quality and format. After the 23 | download, it extracts the video's information without downloading it again to 24 | rename the file more meaningfully based on the video title. Special characters 25 | in the title are replaced with underscores, and unnecessary leading or trailing 26 | underscores are removed. 27 | 28 | Args: 29 | url (str): The URL of the video to download. 30 | downloads_dir (Path): The directory path where the downloaded file should be stored. 31 | 32 | Returns: 33 | Path: A path object pointing to the renamed downloaded file. If there is an error 34 | during the download or file processing, returns None. 35 | 36 | Raises: 37 | Exception: Outputs an error message to the console if the download fails. 
38 | 39 | Examples: 40 | >>> download_url("https://www.youtube.com/watch?v=example", Path("/downloads")) 41 | Path('/downloads/example.wav') 42 | """ 43 | little_helper.ensure_dir(downloads_dir) 44 | 45 | temp_filename = f"temp_{datetime.now().strftime('%Y%m%d_%H_%M_%S')}" 46 | options = { 47 | 'format': 'bestaudio/best', 48 | 'postprocessors': [{'key': 'FFmpegExtractAudio', 49 | 'preferredcodec': 'wav', 50 | 'preferredquality': '192'}], 51 | 'outtmpl': f'{downloads_dir}/{temp_filename}.%(ext)s' 52 | } 53 | try: 54 | with url_downloader.YoutubeDL(options) as ydl: 55 | logger.debug(f"Downloading {url}") 56 | 57 | # Download url 58 | ydl.download([url]) 59 | video_info = ydl.extract_info(url, download=False) 60 | downloaded_file = list(downloads_dir.glob(f'{temp_filename}*'))[0] 61 | logger.debug(f"Download complete for {downloaded_file}") 62 | 63 | # Normalize title 64 | new_filename = re.sub(r'\W+', '_', video_info.get('title', 'downloaded_video')) 65 | 66 | # Remove trailing underscores 67 | if new_filename.startswith('_'): 68 | new_filename = new_filename[1:] 69 | if new_filename.endswith('_'): 70 | new_filename = new_filename[:-1] 71 | 72 | # Rename the file 73 | renamed_file = downloaded_file.rename(f"{downloads_dir}/{new_filename}{downloaded_file.suffix}") 74 | logger.debug(f"Renamed downloaded file to {renamed_file}") 75 | return Path(renamed_file) 76 | 77 | except Exception as e: 78 | print(f'Error downloading {url}: {e}') 79 | return None 80 | -------------------------------------------------------------------------------- /whisply/little_helper.py: -------------------------------------------------------------------------------- 1 | import re 2 | import os 3 | import json 4 | import logging 5 | import ffmpeg 6 | import validators 7 | import numpy as np 8 | 9 | from enum import Enum 10 | from pathlib import Path 11 | from typing import Callable, Any, List 12 | from rich import print 13 | from rich.progress import Progress, TimeElapsedColumn, TextColumn, SpinnerColumn 14 | from whisply import download_utils 15 | 16 | # Set logging configuration 17 | logger = logging.getLogger('little_helper') 18 | logger.setLevel(logging.INFO) 19 | 20 | 21 | class DeviceChoice(str, Enum): 22 | AUTO = 'auto' 23 | CPU = 'cpu' 24 | GPU = 'gpu' 25 | MPS = 'mps' 26 | 27 | 28 | def get_device(device: DeviceChoice = DeviceChoice.AUTO) -> str: 29 | """ 30 | Determine the computation device based on user preference and 31 | availability. 32 | """ 33 | import torch 34 | 35 | if device == DeviceChoice.AUTO: 36 | if torch.cuda.is_available(): 37 | device = 'cuda:0' 38 | elif torch.backends.mps.is_available(): 39 | device = 'mps' 40 | else: 41 | device = 'cpu' 42 | elif device == DeviceChoice.GPU: 43 | if torch.cuda.is_available(): 44 | device = 'cuda:0' 45 | else: 46 | print(f"[blue1]→ NVIDIA GPU not available. Using CPU.") 47 | device = 'cpu' 48 | elif device == DeviceChoice.MPS: 49 | if torch.backends.mps.is_available(): 50 | device = 'mps' 51 | else: 52 | print(f"[blue1]→ MPS not available. Using CPU.") 53 | device = 'cpu' 54 | elif device == DeviceChoice.CPU: 55 | device = 'cpu' 56 | else: 57 | device = 'cpu' 58 | return device 59 | 60 | 61 | class FilePathProcessor: 62 | """ 63 | Utility class for validating various filepaths. 
64 | """ 65 | def __init__(self, file_formats: List[str]): 66 | self.file_formats = [fmt.lower() for fmt in file_formats] 67 | self.filepaths: List[Path] = [] 68 | 69 | def get_filepaths(self, filepath: str): 70 | """ 71 | Processes the provided filepath which can be a URL, a single file, a directory, 72 | or a .list file containing multiple paths/URLs. It validates each input, downloads 73 | URLs if necessary, and accumulates valid file paths for further processing. 74 | """ 75 | path = Path(filepath).expanduser().resolve() 76 | 77 | try: 78 | # Handle URL 79 | if validators.url(filepath): 80 | logging.info(f"Processing URL: {filepath}") 81 | downloaded_path = download_utils.download_url( 82 | filepath, 83 | downloads_dir=Path('./downloads') 84 | ) 85 | if downloaded_path: 86 | self.filepaths.append(downloaded_path) 87 | else: 88 | logging.error(f"Failed to download URL: {filepath}") 89 | print(f"→ Failed to download URL: {filepath}") 90 | return 91 | 92 | # Handle .list file 93 | elif path.suffix.lower() == '.list': 94 | if not path.is_file(): 95 | logging.error(f'The .list file "{path}" does not exist or is not a file.') 96 | print(f'→ The .list file "{path}" does not exist or is not a file.') 97 | return 98 | 99 | logging.info(f"Processing .list file: {path}") 100 | with path.open('r', encoding='utf-8') as file: 101 | lpaths = set() 102 | for line in file: 103 | lpath = line.strip() 104 | if not lpath: 105 | continue 106 | lpaths.add(lpath) 107 | 108 | for lpath in lpaths: 109 | if validators.url(lpath): 110 | downloaded_path = download_utils.download_url( 111 | lpath, 112 | downloads_dir=Path('./downloads') 113 | ) 114 | if downloaded_path: 115 | self.filepaths.append(downloaded_path) 116 | else: 117 | print(f'→ Failed to download URL: {lpath}') 118 | else: 119 | self._process_path(lpath) 120 | return 121 | 122 | # Handle single file or directory 123 | else: 124 | self._process_path(path) 125 | 126 | except Exception as e: 127 | logging.exception(f"An unexpected error occurred while processing '{filepath}': {e}") 128 | return 129 | 130 | # Remove duplicates by converting to a set of resolved absolute paths 131 | unique_filepaths = set(p.resolve() for p in self.filepaths) 132 | self.filepaths = list(unique_filepaths) 133 | 134 | # Filter out files that have already been converted 135 | self._filter_converted_files() 136 | 137 | # Final check to ensure there are files to process 138 | if not self.filepaths: 139 | logging.warning(f'No valid files found for processing. Please check the provided path: "{filepath}".') 140 | print(f'→ No valid files found for processing. Please check the provided path: "{filepath}".') 141 | else: 142 | logging.info(f"Total valid files to process: {len(self.filepaths)}") 143 | 144 | def _process_path(self, path_input: str | Path): 145 | """ 146 | Processes a single path input, which can be a file or a directory. 
147 | """ 148 | path = Path(path_input).expanduser().resolve() 149 | 150 | if path.is_file(): 151 | if path.suffix.lower() in self.file_formats: 152 | logging.info(f"Adding file: {path}") 153 | normalized_path = self._normalize_filepath(path) 154 | self.filepaths.append(normalized_path) 155 | else: 156 | logging.warning(f'File "{path}" has unsupported format and will be skipped.') 157 | print(f'→ File "{path}" has unsupported format and will be skipped.') 158 | elif path.is_dir(): 159 | logging.info(f"Processing directory: {path}") 160 | for file_format in self.file_formats: 161 | for file in path.rglob(f'*{file_format}'): 162 | if file.is_file(): 163 | logging.debug(f"Found file: {file}") 164 | normalized_path = self._normalize_filepath(file) 165 | self.filepaths.append(normalized_path) 166 | else: 167 | logging.error(f'Path "{path}" does not exist or is not accessible.') 168 | print(f'→ Path "{path}" does not exist or is not accessible.') 169 | 170 | def _normalize_filepath(self, filepath: Path) -> Path: 171 | """ 172 | Normalizes the filepath by replacing non-word characters with underscores, 173 | collapsing multiple underscores into one, and removing leading/trailing underscores. 174 | """ 175 | new_filename = re.sub(r'\W+', '_', filepath.stem) 176 | new_filename = new_filename.strip('_') 177 | new_filename = re.sub(r'_+', '_', new_filename) 178 | 179 | suffix = filepath.suffix.lower() 180 | 181 | # Construct the new path 182 | new_path = filepath.parent / f"{new_filename}{suffix}" 183 | 184 | # Rename the file 185 | filepath.rename(new_path) 186 | 187 | return new_path.resolve() 188 | 189 | def _filter_converted_files(self): 190 | """ 191 | Removes files that have already been converted to avoid redundant processing. 192 | """ 193 | converted_suffix = '_converted.wav' 194 | original_filepaths = [] 195 | converted_filepaths = set() 196 | 197 | for fp in self.filepaths: 198 | if fp.name.endswith(converted_suffix): 199 | converted_filepaths.add(fp) 200 | else: 201 | original_filepaths.append(fp) 202 | 203 | # Remove originals if their converted version exists 204 | filtered_filepaths = [ 205 | fp for fp in original_filepaths 206 | if not (fp.with_name(fp.stem + converted_suffix) in converted_filepaths) 207 | ] 208 | 209 | # Extened filtered paths with converted paths 210 | filtered_filepaths.extend(converted_filepaths) 211 | 212 | removed_count = len(self.filepaths) - len(filtered_filepaths) 213 | if removed_count > 0: 214 | logging.info(f"Removed {removed_count} files already converted.") 215 | self.filepaths = filtered_filepaths 216 | 217 | def ensure_dir(dir: Path) -> None: 218 | if not dir.exists(): 219 | dir.mkdir(parents=True) 220 | return dir 221 | 222 | def set_output_dir(filepath: Path, base_dir: Path) -> None: 223 | output_dir = base_dir / filepath.stem 224 | ensure_dir(output_dir) 225 | return output_dir 226 | 227 | def return_valid_fileformats() -> list[str]: 228 | return [ 229 | '.mp3', 230 | '.wav', 231 | '.m4a', 232 | '.aac', 233 | '.flac', 234 | '.ogg', 235 | '.mkv', 236 | '.mov', 237 | '.mp4', 238 | '.avi', 239 | '.mpeg', 240 | '.vob' 241 | ] 242 | 243 | def load_audio_ffmpeg(filepath: str) -> np.ndarray: 244 | try: 245 | out, _ = ( 246 | ffmpeg 247 | .input(filepath) 248 | .output('pipe:', format='f32le', acodec='pcm_f32le', ac=1, ar='16000') 249 | .run(capture_stdout=True, capture_stderr=True) 250 | ) 251 | except ffmpeg.Error as e: 252 | raise RuntimeError(f"Error loading audio with ffmpeg: {e.stderr.decode()}") from e 253 | return np.frombuffer(out, np.float32) 254 
| 255 | def check_file_format( 256 | filepath: Path, 257 | del_originals: bool = True 258 | ) -> tuple[Path, np.ndarray]: 259 | """ 260 | Checks the format of an audio file and converts it if it doesn't meet specified criteria. 261 | Then, loads the audio into a 1D NumPy array. 262 | 263 | The function uses `ffmpeg` to probe the metadata of an audio file at the given `filepath`. 264 | It checks if the audio stream meets the following criteria: 265 | - Codec name: 'pcm_s16le' 266 | - Sample rate: 16000 Hz 267 | - Number of channels: 1 (mono) 268 | 269 | If the audio stream does not meet these criteria, the function attempts to convert the file 270 | to meet the required format and saves the converted file with a '_converted.wav' suffix in the same directory. 271 | After successful conversion, it deletes the original file. 272 | 273 | Finally, it loads the audio (original or converted) as a 1D NumPy array and returns it. 274 | 275 | Args: 276 | filepath (Path): The path to the audio file to be checked and potentially converted. 277 | 278 | Returns: 279 | filepath (Path): filepath of the checked and / or converted audio file. 280 | np.ndarray: 1D NumPy array of the audio data. 281 | """ 282 | # Define the converted file path 283 | new_filepath = filepath.with_name(f"{filepath.stem}_converted.wav") 284 | 285 | converted = False 286 | 287 | if new_filepath.exists(): 288 | target_filepath = new_filepath 289 | converted = True 290 | else: 291 | try: 292 | # Probe the audio file for stream information 293 | probe = ffmpeg.probe(str(filepath)) 294 | audio_streams = [stream for stream in probe['streams'] if stream['codec_type'] == 'audio'] 295 | 296 | if not audio_streams: 297 | raise ValueError(f"→ No audio stream found for {filepath}. Please check if the file you have provided contains audio content.") 298 | 299 | audio_stream = audio_streams[0] 300 | codec_name = audio_stream.get('codec_name') 301 | sample_rate = int(audio_stream.get('sample_rate', 0)) 302 | channels = int(audio_stream.get('channels', 0)) 303 | 304 | # Check if the audio stream meets the criteria 305 | if codec_name != 'pcm_s16le' or sample_rate != 16000 or channels != 1: 306 | try: 307 | # Convert the file and show progress 308 | run_with_progress( 309 | description=( 310 | f"[orchid]→ Converting file to .wav: {filepath.name}" 311 | ), 312 | task=lambda: convert_file_format( 313 | old_filepath=filepath, 314 | new_filepath=new_filepath 315 | ) 316 | ) 317 | target_filepath = new_filepath 318 | converted = True 319 | except Exception as e: 320 | raise RuntimeError( 321 | f"→ An error occurred while converting {filepath}: {e}" 322 | ) 323 | else: 324 | # If already in correct format, use the original file 325 | target_filepath = filepath 326 | 327 | except ffmpeg.Error as e: 328 | print(f"→ Error running ffprobe: {e}") 329 | print(f"→ You may have provided an unsupported file type.\ 330 | Please check 'whisply --list_formats' for all supported formats.") 331 | 332 | try: 333 | audio_array = load_audio_ffmpeg(str(target_filepath)) 334 | except Exception as e: 335 | raise RuntimeError(f"Failed to load audio from {target_filepath}: {e}") from e 336 | 337 | # If conversion occurred delete the original file if del_originals 338 | if (converted and del_originals) and target_filepath != filepath: 339 | try: 340 | os.remove(filepath) 341 | except OSError as e: 342 | print(f"Warning: {e}") 343 | 344 | return Path(target_filepath), audio_array 345 | 346 | def convert_file_format(old_filepath: str, new_filepath: str): 347 | """ 348 | Converts 
a video file into an audio file in WAV format using the ffmpeg library. 349 | """ 350 | ( 351 | ffmpeg 352 | .input(str(old_filepath)) 353 | .output(str(new_filepath), 354 | acodec='pcm_s16le', # Audio codec: PCM signed 16-bit little-endian 355 | ar='16000', # Sampling rate 16 KHz 356 | ac=1) # Mono channel 357 | .run(quiet=True, 358 | overwrite_output=True) 359 | ) 360 | 361 | def load_config(config: json) -> dict: 362 | with open(config, 'r', encoding='utf-8') as file: 363 | return json.load(file) 364 | 365 | def format_time(seconds, delimiter=',') -> str: 366 | """ 367 | Function for time conversion. 368 | """ 369 | h = int(seconds // 3600) 370 | m = int((seconds % 3600) // 60) 371 | s = int(seconds % 60) 372 | ms = int((seconds - int(seconds)) * 1000) 373 | 374 | return f"{h:02}:{m:02}:{s:02}{delimiter}{ms:03}" 375 | 376 | def run_with_progress(description: str, task: Callable[[], Any]) -> Any: 377 | """ 378 | Helper function to run a task with a progress bar. 379 | """ 380 | with Progress( 381 | SpinnerColumn(), 382 | TimeElapsedColumn(), 383 | TextColumn("[progress.description]{task.description}") 384 | ) as progress: 385 | progress.add_task(description, total=None) 386 | return task() -------------------------------------------------------------------------------- /whisply/models.py: -------------------------------------------------------------------------------- 1 | from rich import print 2 | 3 | WHISPER_MODELS = { 4 | # Models for faster-whisper / whisperx: https://huggingface.co/Systran 5 | # Models for insanely-fast-whisper: https://huggingface.co/openai 6 | 'tiny': { 7 | 'faster-whisper': 'tiny', 8 | 'insane-whisper': 'openai/whisper-tiny', 9 | 'whisperx': 'tiny', 10 | 'translation': True 11 | }, 12 | 'tine-en': { 13 | 'faster-whisper': 'tiny.en', 14 | 'insane-whisper': 'openai/whisper-tiny.en', 15 | 'whisperx': 'tiny.en', 16 | 'translation': False 17 | }, 18 | 'base': { 19 | 'faster-whisper': 'base', 20 | 'insane-whisper': 'openai/whisper-base', 21 | 'whisperx': 'base', 22 | 'translation': True 23 | }, 24 | 'base-en': { 25 | 'faster-whisper': 'base.en', 26 | 'insane-whisper': 'openai/whisper-base.en', 27 | 'whisperx': 'base.en', 28 | 'translation': False 29 | }, 30 | 'small': { 31 | 'faster-whisper': 'small', 32 | 'insane-whisper': 'openai/whisper-small', 33 | 'whisperx': 'small', 34 | 'translation': True 35 | }, 36 | 'small-en': { 37 | 'faster-whisper': 'small.en', 38 | 'insane-whisper': 'openai/whisper-small.en', 39 | 'whisperx': 'small.en', 40 | 'translation': False 41 | }, 42 | 'distil-small-en': { 43 | 'faster-whisper': 'distil-small.en', 44 | 'insane-whisper': 'distil-whisper/distil-small.en', 45 | 'whisperx': None, 46 | 'translation': False 47 | }, 48 | 'medium': { 49 | 'faster-whisper': 'medium', 50 | 'insane-whisper': 'openai/whisper-medium', 51 | 'whisperx': 'medium', 52 | 'translation': True 53 | }, 54 | 'medium-en': { 55 | 'faster-whisper': 'medium.en', 56 | 'insane-whisper': 'openai/whisper-medium.en', 57 | 'whisperx': 'medium.en', 58 | 'translation': False 59 | }, 60 | 'distil-medium-en': { 61 | 'faster-whisper': 'distil-medium.en', 62 | 'insane-whisper': 'distil-whisper/distil-medium.en', 63 | 'whisperx': None, 64 | 'translation': False 65 | }, 66 | 'large': { 67 | 'faster-whisper': 'large', 68 | 'insane-whisper': 'openai/whisper-large', 69 | 'whisperx': 'large', 70 | 'translation': True 71 | }, 72 | 'large-v2': { 73 | 'faster-whisper': 'large-v2', 74 | 'insane-whisper': 'openai/whisper-large-v2', 75 | 'whisperx': 'large-v2', 76 | 'translation': True 77 | }, 78 
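    # Illustrative lookups against this registry (examples only; whisply
    # resolves models through set_supported_model() further below):
    #   WHISPER_MODELS['large-v3']['whisperx']           -> 'large-v3'
    #   WHISPER_MODELS['large-v3-turbo']['whisperx']     -> None  (no whisperX variant)
    #   WHISPER_MODELS['large-v3-turbo']['translation']  -> False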
| 'distil-large-v2': { 79 | 'faster-whisper': 'distil-large-v2', 80 | 'insane-whisper': 'distil-whisper/distil-large-v2', 81 | 'whisperx': None, 82 | 'translation': True 83 | }, 84 | 'large-v3': { 85 | 'faster-whisper': 'large-v3', 86 | 'insane-whisper': 'openai/whisper-large-v3', 87 | 'whisperx': 'large-v3', 88 | 'translation': True 89 | }, 90 | 'distil-large-v3': { 91 | 'faster-whisper': 'distil-large-v3', 92 | 'insane-whisper': 'distil-whisper/distil-large-v3', 93 | 'whisperx': 'distil-large-v3', 94 | 'translation': True 95 | }, 96 | 'large-v3-turbo': { 97 | 'faster-whisper': 'deepdml/faster-whisper-large-v3-turbo-ct2', 98 | 'insane-whisper': 'openai/whisper-large-v3-turbo', 99 | 'whisperx': None, 100 | 'translation': False 101 | }, 102 | } 103 | 104 | def ensure_model(model: str) -> bool: 105 | return model in WHISPER_MODELS 106 | 107 | def is_model_supported(model: str, implementation: str, translation: bool) -> bool: 108 | model_info = WHISPER_MODELS.get(model) 109 | if not model_info: 110 | return False 111 | if model_info.get(implementation) is None: 112 | return False 113 | if translation and not model_info.get("translation", False): 114 | return False 115 | return True 116 | 117 | def set_supported_model(model: str, implementation: str, translation: bool) -> str: 118 | if not is_model_supported(model, implementation, translation): 119 | default_model = "large-v2" 120 | print(f'[blue1]→ Model "{model}" is not available for this task/implementation → Using default model "{default_model}".') 121 | return WHISPER_MODELS.get(default_model)[implementation] 122 | return WHISPER_MODELS.get(model)[implementation] 123 | -------------------------------------------------------------------------------- /whisply/output_templates.py: -------------------------------------------------------------------------------- 1 | NOSCRIBE_HTML_TEMPLATE = """ 2 | 3 | 4 | 5 | 12 | 13 | 14 | 15 |
16 |

{transcription}

17 |

Transcribed with whisply for noScribe's Editor.
Audio source: {audio_filepath}

18 | {body_content} 19 |
20 | 21 | 22 | """ -------------------------------------------------------------------------------- /whisply/output_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import logging 4 | import typer 5 | 6 | from enum import Enum 7 | from pathlib import Path 8 | from typing import List, Dict, Tuple 9 | from rich import print 10 | from whisply import little_helper 11 | from whisply.post_correction import Corrections 12 | 13 | # Set logging configuration 14 | logger = logging.getLogger('little_helper') 15 | logger.setLevel(logging.INFO) 16 | 17 | 18 | class ExportFormats(str, Enum): 19 | ALL = 'all' 20 | JSON = 'json' 21 | TXT = 'txt' 22 | RTTM = 'rttm' 23 | VTT = 'vtt' 24 | WEBVTT = 'webvtt' 25 | SRT = 'srt' 26 | HTML = 'html' 27 | 28 | 29 | def determine_export_formats( 30 | export_format: ExportFormats, 31 | annotate: bool, 32 | subtitle: bool 33 | ) -> List[str]: 34 | """ 35 | Determine the export formats based on user options and availability. 36 | 37 | Returns a list of export format strings to be used. 38 | """ 39 | available_formats = set() 40 | if export_format == ExportFormats.ALL: 41 | available_formats.add(ExportFormats.JSON.value) 42 | available_formats.add(ExportFormats.TXT.value) 43 | if annotate: 44 | available_formats.add(ExportFormats.RTTM.value) 45 | available_formats.add(ExportFormats.HTML.value) 46 | if subtitle: 47 | available_formats.add(ExportFormats.WEBVTT.value) 48 | available_formats.add(ExportFormats.VTT.value) 49 | available_formats.add(ExportFormats.SRT.value) 50 | else: 51 | if export_format in (ExportFormats.JSON, ExportFormats.TXT): 52 | available_formats.add(export_format.value) 53 | elif export_format in (ExportFormats.RTTM, ExportFormats.HTML): 54 | if annotate: 55 | available_formats.add(export_format.value) 56 | else: 57 | print("→ RTTM export format requires annotate option to be True.") 58 | raise typer.Exit() 59 | elif export_format in ( 60 | ExportFormats.VTT, 61 | ExportFormats.SRT, 62 | ExportFormats.WEBVTT 63 | ): 64 | if subtitle: 65 | available_formats.add(export_format.value) 66 | else: 67 | print(f"→ {export_format.value.upper()} export format requires subtitle option to be True.") 68 | raise typer.Exit() 69 | else: 70 | print(f"→ Unknown export format: {export_format.value}") 71 | raise typer.Exit() 72 | 73 | return list(available_formats) 74 | 75 | 76 | class OutputWriter: 77 | """ 78 | Class for writing various output formats to disk. 79 | """ 80 | def __init__( 81 | self, 82 | corrections: Corrections = None 83 | ): 84 | self.cwd = Path.cwd() 85 | self.corrections = corrections 86 | self.compiled_simple_patterns = ( 87 | self._compile_simple_patterns() if self.corrections else {} 88 | ) 89 | self.compiled_regex_patterns = ( 90 | self._compile_regex_patterns() if self.corrections else [] 91 | ) 92 | 93 | def _compile_simple_patterns(self) -> List[Tuple[re.Pattern, str]]: 94 | """ 95 | Pre-compile regex patterns for simple word corrections. 96 | Returns a list of tuples containing compiled patterns and their replacements. 
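        Illustrative example (assuming corrections.simple == {'pyanote': 'pyannote'}):
        the compiled pair is

            (re.compile(r'\bpyanote\b', re.IGNORECASE), 'pyannote')

        and correct_transcription('We use Pyanote.') then returns
        'We use Pyannote.', since replace_match() preserves the original casing.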
97 | """ 98 | patterns = [] 99 | for wrong, correct in self.corrections.simple.items(): 100 | # Wrap simple corrections with word boundaries 101 | pattern = re.compile( 102 | r'\b{}\b'.format(re.escape(wrong)), flags=re.IGNORECASE 103 | ) 104 | patterns.append((pattern, correct)) 105 | logger.debug( 106 | f"Compiled simple pattern: '\\b{wrong}\\b' → '{correct}'" 107 | ) 108 | return patterns 109 | 110 | def _compile_regex_patterns(self) -> List[Tuple[re.Pattern, str]]: 111 | """ 112 | Pre-compile regex patterns for pattern-based corrections. 113 | Returns a list of tuples containing compiled regex patterns and their replacements. 114 | """ 115 | patterns = [] 116 | for entry in self.corrections.patterns: 117 | original_pattern = entry['pattern'] 118 | replacement = entry['replacement'] 119 | 120 | # Wrap patterns with word boundaries and non-capturing group 121 | new_pattern = r'\b(?:' + original_pattern + r')\b' 122 | regex = re.compile(new_pattern, flags=re.IGNORECASE) 123 | 124 | patterns.append((regex, replacement)) 125 | logger.debug( 126 | f"Compiled pattern-based regex: '{new_pattern}' → '{replacement}'" 127 | ) 128 | return patterns 129 | 130 | def correct_transcription(self, transcription: str) -> str: 131 | """ 132 | Apply both simple and pattern-based corrections to the transcription. 133 | """ 134 | # Apply simple corrections 135 | for pattern, correct in self.compiled_simple_patterns: 136 | transcription = pattern.sub( 137 | lambda m: self.replace_match(m, correct), transcription 138 | ) 139 | 140 | # Apply pattern-based corrections 141 | for regex, replacement in self.compiled_regex_patterns: 142 | transcription = regex.sub(replacement, transcription) 143 | 144 | return transcription 145 | 146 | @staticmethod 147 | def replace_match(match, correct: str) -> str: 148 | """ 149 | Replace the matched word while preserving the original casing. 150 | """ 151 | word = match.group() 152 | if word.isupper(): 153 | return correct.upper() 154 | elif word[0].isupper(): 155 | return correct.capitalize() 156 | else: 157 | return correct 158 | 159 | def _save_file( 160 | self, 161 | content: str, 162 | filepath: Path, 163 | description: str, 164 | log_message: str 165 | ) -> None: 166 | """ 167 | Generic method to save content to a file. 168 | """ 169 | with open(filepath, 'w', encoding='utf-8') as file: 170 | file.write(content) 171 | print(f'[blue1]→ Saved {description}: [bold]{filepath.relative_to(self.cwd)}') 172 | logger.info(f'{log_message} {filepath}') 173 | 174 | def save_json( 175 | self, 176 | result: dict, 177 | filepath: Path 178 | ) -> None: 179 | with open(filepath, 'w', encoding='utf-8') as fout: 180 | json.dump(result, fout, indent=4) 181 | print(f'[blue1]→ Saved .json: [bold]{filepath.relative_to(self.cwd)}') 182 | logger.info(f"Saved .json to {filepath}") 183 | 184 | def save_txt( 185 | self, 186 | transcription: Dict[str, str], 187 | filepath: Path 188 | ) -> None: 189 | """ 190 | Save the transcription as a TXT file after applying corrections. 191 | """ 192 | original_text = transcription.get('text', '').strip() 193 | corrected_text = self.correct_transcription(original_text) 194 | self._save_file( 195 | content=corrected_text, 196 | filepath=filepath, 197 | description='.txt', 198 | log_message='Saved .txt transcript to' 199 | ) 200 | 201 | def save_txt_with_speaker_annotation( 202 | self, 203 | annotated_text: str, 204 | filepath: Path 205 | ) -> None: 206 | """ 207 | Save the annotated transcription as a TXT file after applying corrections. 
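        The annotated text is expected in the format produced by
        TranscriptionHandler.create_text_with_speakers(), e.g. (illustrative):

            [00:00:00.000] [SPEAKER_00] Hello and welcome.
            [00:00:02.480] [SPEAKER_01] Thanks for having me.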
208 | """ 209 | corrected_annotated_text = self.correct_transcription(annotated_text) 210 | self._save_file( 211 | content=corrected_annotated_text, 212 | filepath=filepath, 213 | description='.txt with speaker annotation', 214 | log_message='Saved .txt transcription with speaker annotation →' 215 | ) 216 | 217 | def save_subtitles( 218 | self, 219 | text: str, 220 | type: str, 221 | filepath: Path 222 | ) -> None: 223 | """ 224 | Save subtitles in the specified format after applying corrections. 225 | """ 226 | corrected_text = self.correct_transcription(text) 227 | description = f'.{type} subtitles' 228 | log_message = f'Saved .{type} subtitles →' 229 | self._save_file( 230 | content=corrected_text, 231 | filepath=filepath, 232 | description=description, 233 | log_message=log_message 234 | ) 235 | 236 | def save_rttm_annotations( 237 | self, 238 | rttm: str, 239 | filepath: Path 240 | ) -> None: 241 | self._save_file( 242 | content=rttm, 243 | filepath=filepath, 244 | description='.rttm annotations', 245 | log_message='Saved .rttm annotations →' 246 | ) 247 | 248 | def save_html( 249 | self, 250 | html: str, 251 | filepath: Path 252 | ) -> None: 253 | self._save_file( 254 | content=html, 255 | filepath=filepath, 256 | description='.html (noScribe compatible)', 257 | log_message='Saved .html annotations →' 258 | ) 259 | 260 | def save_results( 261 | self, 262 | result: dict, 263 | export_formats: List[str] 264 | ) -> List[Path]: 265 | """ 266 | Write various output formats to disk based on the specified export formats. 267 | """ 268 | output_filepath = Path(result['output_filepath']) 269 | written_filepaths = [] 270 | 271 | # Apply corrections if they are provided 272 | if self.corrections and ( 273 | self.corrections.simple or self.corrections.patterns 274 | ): 275 | for language, transcription in result.get('transcription', {}).items(): 276 | # Correct the main transcription text 277 | original_text = transcription.get('text', '').strip() 278 | corrected_text = self.correct_transcription(original_text) 279 | result['transcription'][language]['text'] = corrected_text 280 | 281 | # Correct chunks and word dicts 282 | chunks = transcription.get('chunks', '') 283 | for c in chunks: 284 | # Text chunk 285 | c['text'] = self.correct_transcription(c['text']) 286 | # Words 287 | words = c.get('words', '') 288 | for w in words: 289 | w['word'] = self.correct_transcription(w['word']) 290 | 291 | # Correct speaker annotations if present 292 | if 'text_with_speaker_annotation' in transcription: 293 | original_annotated = transcription['text_with_speaker_annotation'] 294 | corrected_annotated = self.correct_transcription(original_annotated) 295 | result['transcription'][language]['text_with_speaker_annotation'] = corrected_annotated 296 | 297 | # Now, transcription_items reflects the corrected transcriptions 298 | transcription_items = result.get('transcription', {}).items() 299 | 300 | # Write .txt 301 | if 'txt' in export_formats: 302 | for language, transcription in transcription_items: 303 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.txt" 304 | self.save_txt( 305 | transcription, 306 | filepath=fout 307 | ) 308 | written_filepaths.append(str(fout)) 309 | 310 | # Write subtitles (.srt, .vtt and .webvtt) 311 | subtitle_formats = {'srt', 'vtt', 'webvtt'} 312 | if subtitle_formats.intersection(export_formats): 313 | for language, transcription in transcription_items: 314 | # .srt subtitles 315 | if 'srt' in export_formats: 316 | fout = output_filepath.parent / 
f"{output_filepath.name}_{language}.srt" 317 | srt_text = create_subtitles( 318 | transcription, 319 | type='srt' 320 | ) 321 | self.save_subtitles(srt_text, type='srt', filepath=fout) 322 | written_filepaths.append(str(fout)) 323 | 324 | # .vtt / .webvtt subtitles 325 | if 'vtt' in export_formats or 'webvtt' in export_formats: 326 | for subtitle_type in ['webvtt', 'vtt']: 327 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.{subtitle_type}" 328 | vtt_text = create_subtitles( 329 | transcription, 330 | type=subtitle_type, 331 | result=result 332 | ) 333 | self.save_subtitles( 334 | vtt_text, 335 | type=subtitle_type, 336 | filepath=fout 337 | ) 338 | written_filepaths.append(str(fout)) 339 | 340 | # Write annotated .txt with speaker annotations 341 | has_speaker_annotation = any( 342 | 'text_with_speaker_annotation' in transcription 343 | for transcription in result['transcription'].values() 344 | ) 345 | 346 | if 'txt' in export_formats and has_speaker_annotation: 347 | for language, transcription in transcription_items: 348 | if 'text_with_speaker_annotation' in transcription: 349 | fout = output_filepath.parent / f"{output_filepath.name}_{language}_annotated.txt" 350 | self.save_txt_with_speaker_annotation( 351 | annotated_text=transcription['text_with_speaker_annotation'], 352 | filepath=fout 353 | ) 354 | written_filepaths.append(str(fout)) 355 | 356 | # Write .rttm 357 | if 'rttm' in export_formats: 358 | # Create .rttm annotations 359 | rttm_dict = dict_to_rttm(result) 360 | 361 | for language, rttm_annotation in rttm_dict.items(): 362 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.rttm" 363 | self.save_rttm_annotations( 364 | rttm=rttm_annotation, 365 | filepath=fout 366 | ) 367 | written_filepaths.append(str(fout)) 368 | 369 | # Write .json 370 | if 'json' in export_formats: 371 | fout = output_filepath.with_suffix('.json') 372 | written_filepaths.append(str(fout)) 373 | result['written_files'] = written_filepaths 374 | self.save_json(result, filepath=fout) 375 | 376 | # Write noScribe compatible .html 377 | if 'html' in export_formats: 378 | audio_filepath = Path(result['input_filepath']) 379 | for language, transcription in transcription_items: 380 | fout = output_filepath.parent / f"{output_filepath.name}_{language}.html" 381 | html = create_html( 382 | transcription, 383 | audio_filepath=audio_filepath, 384 | output_filepath=output_filepath 385 | ) 386 | self.save_html(html, filepath=fout) 387 | written_filepaths.append(str(fout)) 388 | 389 | return written_filepaths 390 | 391 | def create_subtitles( 392 | transcription_dict: dict, 393 | type: str = 'srt', 394 | result: dict = None 395 | ) -> str: 396 | """ 397 | Converts a transcription dictionary into subtitle format (.srt or .webvtt). 398 | 399 | Args: 400 | transcription_dict (dict): Dictionary containing transcription data 401 | with 'chunks'. 402 | sub_length (int, optional): Maximum duration in seconds for each 403 | subtitle block. 404 | type (str, optional): Subtitle format, either 'srt' or 'webvtt'. 405 | Default is 'srt'. 406 | 407 | Returns: 408 | str: Formatted subtitle text in the specified format. 
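    Illustrative example: a single chunk
    {'timestamp': (0.0, 2.5), 'text': 'Hello there'} renders with type='srt' as

        1
        00:00:00,000 --> 00:00:02,500
        Hello there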
409 | """ 410 | subtitle_text = '' 411 | seg_id = 0 412 | 413 | for chunk in transcription_dict['chunks']: 414 | start_time = chunk['timestamp'][0] 415 | end_time = chunk['timestamp'][1] 416 | text = chunk['text'].replace('’', '\'') 417 | 418 | # Create .srt subtitles 419 | if type == 'srt': 420 | start_time_str = little_helper.format_time( 421 | start_time, 422 | delimiter=',' 423 | ) 424 | end_time_str = little_helper.format_time( 425 | end_time, 426 | delimiter=',' 427 | ) 428 | seg_id += 1 429 | subtitle_text += f"""{seg_id}\n{start_time_str} --> {end_time_str}\n{text.strip()}\n\n""" 430 | 431 | # Create .webvtt subtitles 432 | elif type in ['webvtt', 'vtt']: 433 | start_time_str = little_helper.format_time( 434 | start_time, 435 | delimiter='.' 436 | ) 437 | end_time_str = little_helper.format_time( 438 | end_time, 439 | delimiter='.' 440 | ) 441 | 442 | if seg_id == 0: 443 | subtitle_text += f"WEBVTT {Path(result['output_filepath']).stem}\n\n" 444 | 445 | if type == 'vtt': 446 | subtitle_text += 'NOTE transcribed with whisply\n\n' 447 | subtitle_text += f"NOTE media: {Path(result['input_filepath']).absolute()}\n\n" 448 | 449 | seg_id += 1 450 | subtitle_text += f"""{seg_id}\n{start_time_str} --> {end_time_str}\n{text.strip()}\n\n""" 451 | 452 | return subtitle_text 453 | 454 | def dict_to_rttm(result: dict) -> dict: 455 | """ 456 | Converts a transcription dictionary to RTTM file format. 457 | """ 458 | file_id = result.get('input_filepath', 'unknown_file') 459 | file_id = Path(file_id).stem 460 | rttm_dict = {} 461 | 462 | # Iterate over each available language 463 | for lang, transcription in result.get('transcription', {}).items(): 464 | lines = [] 465 | current_speaker = None 466 | speaker_start_time = None 467 | speaker_end_time = None 468 | 469 | chunks = transcription.get('chunks', []) 470 | 471 | # Collect all words from chunks 472 | all_words = [] 473 | for chunk in chunks: 474 | words = chunk.get('words', []) 475 | all_words.extend(words) 476 | 477 | # Sort all words by their start time 478 | all_words.sort(key=lambda w: w.get('start', 0.0)) 479 | 480 | for word_info in all_words: 481 | speaker = word_info.get('speaker', 'SPEAKER_00') 482 | word_start = word_info.get('start', 0.0) 483 | word_end = word_info.get('end', word_start) 484 | 485 | if speaker != current_speaker: 486 | # If there is a previous speaker segment, write it to the RTTM 487 | if current_speaker is not None: 488 | duration = speaker_end_time - speaker_start_time 489 | rttm_line = ( 490 | f"SPEAKER {file_id} 1 {speaker_start_time:.3f} {duration:.3f} " 491 | f" {current_speaker} " 492 | ) 493 | lines.append(rttm_line) 494 | 495 | # Start a new speaker segment 496 | current_speaker = speaker 497 | speaker_start_time = word_start 498 | speaker_end_time = word_end 499 | else: 500 | # Extend the current speaker segment 501 | speaker_end_time = max(speaker_end_time, word_end) 502 | 503 | # Write the last speaker segment to the RTTM 504 | if current_speaker is not None: 505 | duration = speaker_end_time - speaker_start_time 506 | rttm_line = ( 507 | f"SPEAKER {file_id} 1 {speaker_start_time:.3f} {duration:.3f} " 508 | f" {current_speaker} " 509 | ) 510 | lines.append(rttm_line) 511 | 512 | rttm_content = "\n".join(lines) 513 | rttm_dict[lang] = rttm_content 514 | 515 | return rttm_dict 516 | 517 | def create_html( 518 | transcription: dict, 519 | audio_filepath: Path, 520 | output_filepath: Path 521 | ) -> None: 522 | """ 523 | Save an HTML file that is compatible with noScribe's editor: 524 | 
https://github.com/kaixxx/noScribe 525 | """ 526 | from whisply.output_templates import NOSCRIBE_HTML_TEMPLATE 527 | 528 | # Helper function to parse a transcript line 529 | def parse_line(line: str): 530 | pattern = r'\[(.*?)\]\s+\[(.*?)\]\s+(.*)' 531 | match = re.match(pattern, line) 532 | if match: 533 | timestamp, speaker, text = match.groups() 534 | return timestamp, speaker, text 535 | return None, None, line 536 | 537 | # Helper function to convert a timestamp to milliseconds 538 | def convert_timestamp_to_ms(timestamp: str) -> int: 539 | h, m, s = timestamp.split(':') 540 | s = s.split('.')[0] 541 | return (int(h) * 3600 + int(m) * 60 + int(s)) * 1000 542 | 543 | # Use the annotated txt as input 544 | input_text = transcription.get('text_with_speaker_annotation', '') 545 | 546 | # Replace placeholders in the HTML template 547 | html_template = NOSCRIBE_HTML_TEMPLATE.replace('{transcription}', str(output_filepath)) 548 | html_template = html_template.replace('{audio_filepath}', str(audio_filepath)) 549 | 550 | # Process transcript lines to build HTML content 551 | body_content = "" 552 | transcript_lines = input_text.strip().splitlines() 553 | num_lines = len(transcript_lines) 554 | 555 | for idx, line in enumerate(transcript_lines): 556 | segment_start, speaker, text = parse_line(line) 557 | 558 | # If there's a following line or a single line, generate detailed HTML with an anchor 559 | if num_lines == 1 or idx < num_lines - 1: 560 | # For a single line, override segment_start with a default value 561 | if num_lines == 1: 562 | segment_start = '00:00:00' 563 | segment_end, _, _ = parse_line(line) 564 | else: 565 | segment_end, _, _ = parse_line(transcript_lines[idx + 1]) 566 | 567 | # Convert timestamps to ms 568 | start = convert_timestamp_to_ms(segment_start) 569 | end = convert_timestamp_to_ms(segment_end) 570 | 571 | # Set speaker labels 572 | if 'UNKNOWN' in speaker: 573 | speaker_label = 'SXX' 574 | else: 575 | speaker_number = re.findall(r'\d+', speaker)[0] 576 | speaker_label = f'S{speaker_number}' 577 | 578 | # Build the anchor tag and HTML segment 579 | anchor = f"ts_{start}_{end}_{speaker_label}" 580 | segment_html = (f'

' 581 | f'[{segment_start}] {speaker}: {text}' 582 | f'

\n') 583 | else: 584 | segment_html = f'

{text}

\n' 585 | 586 | body_content += segment_html 587 | 588 | html_output = html_template.replace('{body_content}', body_content) 589 | 590 | return html_output 591 | -------------------------------------------------------------------------------- /whisply/post_correction.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | from pathlib import Path 3 | from dataclasses import dataclass, field 4 | from typing import Dict, List 5 | 6 | @dataclass 7 | class Corrections: 8 | """ 9 | A dataclass to encapsulate both simple and pattern-based corrections. 10 | """ 11 | simple: Dict[str, str] = field(default_factory=dict) 12 | patterns: List[Dict[str, str]] = field(default_factory=list) 13 | 14 | 15 | def load_correction_list(filepath: str | Path) -> Corrections: 16 | """ 17 | Load the correction dictionary and patterns from a YAML file. 18 | 19 | :param filepath: Path to the YAML correction file. 20 | :return: Corrections object containing simple and pattern-based 21 | corrections. 22 | """ 23 | try: 24 | with open(filepath, 'r') as file: 25 | data = yaml.safe_load(file) 26 | 27 | if not isinstance(data, dict): 28 | raise ValueError("→ Correction file must contain a YAML dictionary.") 29 | 30 | # Extract simple corrections 31 | simple_corrections = {k: v for k, v in data.items() if k != 'patterns'} 32 | 33 | # Extract pattern-based corrections 34 | pattern_corrections = data.get('patterns', []) 35 | 36 | # Validate patterns 37 | for entry in pattern_corrections: 38 | if 'pattern' not in entry or 'replacement' not in entry: 39 | raise ValueError("→ Each pattern entry must contain 'pattern' \ 40 | and 'replacement' keys.") 41 | 42 | return Corrections( 43 | simple=simple_corrections, 44 | patterns=pattern_corrections 45 | ) 46 | 47 | except FileNotFoundError: 48 | print(f"→ Correction file not found: {filepath}") 49 | return Corrections() 50 | except yaml.YAMLError as e: 51 | print(f"→ Error parsing YAML file: {e}") 52 | return Corrections() 53 | except Exception as e: 54 | print(f"→ Unexpected error loading correction list: {e}") 55 | return Corrections() 56 | -------------------------------------------------------------------------------- /whisply/transcription.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import time 3 | from pathlib import Path 4 | from datetime import datetime 5 | from functools import partial 6 | from rich import print 7 | import whisperx.diarize 8 | 9 | from whisply import little_helper, output_utils, models 10 | from whisply.little_helper import FilePathProcessor 11 | from whisply.post_correction import Corrections 12 | 13 | # Set logging configuration 14 | log_dir = little_helper.ensure_dir(Path('./logs')) 15 | log_filename = f"log_whisply_{datetime.now().strftime('%Y-%m-%d')}.log" 16 | log_file = f"{log_dir}/{log_filename}" 17 | 18 | logging.basicConfig( 19 | filename=log_file, 20 | level=logging.INFO, 21 | format="%(asctime)s %(levelname)s [%(funcName)s]: %(message)s", 22 | ) 23 | 24 | class TranscriptionHandler: 25 | """ 26 | Handles transcription and diarization of audio/video files using various 27 | Whisper-based models. 28 | 29 | This class leverages different implementations of OpenAI's Whisper models 30 | (whisperX, insanely-fast-whisper, faster-whisper) to transcribe audio and 31 | video files. It supports features like language detection, speaker 32 | diarization, translation, subtitle generation, and exporting transcriptions 33 | in multiple formats. 
It is capable of processing single files, directories, 34 | URLs, and lists of files, providing flexibility for diverse transcription 35 | needs. 36 | 37 | Args: 38 | base_dir (str, optional): Directory to store transcription outputs. 39 | Defaults to './transcriptions'. 40 | model (str, optional): Whisper model variant to use (e.g., 'large-v2'). 41 | Defaults to 'large-v3-turbo'. 42 | device (str, optional): Compute device ('cpu', 'cuda', etc.). 43 | Defaults to 'cpu'. 44 | file_language (str, optional): Language of the input audio. 45 | If not provided, language detection is performed. 46 | annotate (bool, optional): Enable speaker diarization. 47 | Defaults to False. 48 | hf_token (str, optional): Hugging Face token for accessing restricted 49 | models or features. 50 | subtitle (bool, optional): Generate subtitles with word-level timestamps. 51 | Defaults to False. 52 | sub_length (int, optional): Maximum number of words per subtitle chunk. 53 | Required if subtitle is True. 54 | translate (bool, optional): Translate transcription to English if the 55 | original language is different. Defaults to False. 56 | verbose (bool, optional): Enable detailed logging and output. 57 | Defaults to False. 58 | export_formats (str or list, optional): Formats to export transcriptions 59 | (e.g., 'json', 'srt'). Defaults to 'all'. 60 | 61 | Attributes: 62 | base_dir (Path): Directory for storing transcriptions. 63 | device (str): Compute device in use. 64 | file_language (str or None): Detected or specified language of the 65 | audio. 66 | annotate (bool): Indicates if speaker diarization is enabled. 67 | translate (bool): Indicates if translation is enabled. 68 | subtitle (bool): Indicates if subtitle generation is enabled. 69 | verbose (bool): Indicates if verbose mode is active. 70 | export_formats (str or list): Selected formats for exporting 71 | transcriptions. 72 | processed_files (list): List of processed file information and results. 73 | 74 | Methods: 75 | get_filepaths(filepath: str): 76 | Retrieves and validates file paths from various input types. 77 | 78 | detect_language(file: Path, audio_array) -> str: 79 | Detects the language of the given audio file. 80 | 81 | process_files(files: list): 82 | Processes a list of audio files for transcription and diarization. 83 | 84 | transcribe_with_whisperx(filepath: Path) -> dict: 85 | Transcribes an audio file using the whisperX implementation. 86 | 87 | transcribe_with_insane_whisper(filepath: Path) -> dict: 88 | Transcribes an audio file using the insanely-fast-whisper 89 | implementation. 90 | 91 | transcribe_with_faster_whisper(filepath: Path, num_workers: int = 1) 92 | -> dict: 93 | Transcribes an audio file using the faster-whisper implementation. 94 | 95 | adjust_word_chunk_length(result: dict) -> dict: 96 | Splits transcription text into chunks based on a maximum word 97 | count. 98 | 99 | to_transcription_dict(insanely_annotation: list[dict]) -> dict: 100 | Converts speaker-annotated results into a standardized dictionary. 101 | 102 | to_whisperx(transcription_result: dict) -> dict: 103 | Normalizes transcription results to the whisperX format. 104 | 105 | create_text_with_speakers(transcription_dict: dict, 106 | delimiter: str = '.') -> dict: 107 | Inserts speaker labels into the transcription text upon speaker 108 | changes. 
109 | """ 110 | def __init__( 111 | self, 112 | base_dir='./transcriptions', 113 | model='large-v3-turbo', 114 | device='cpu', 115 | file_language=None, 116 | annotate=False, 117 | num_speakers=None, 118 | hf_token=None, 119 | subtitle=False, 120 | sub_length=None, 121 | translate=False, 122 | verbose=False, 123 | del_originals=False, 124 | corrections=Corrections, 125 | export_formats='all' 126 | ): 127 | self.base_dir = little_helper.ensure_dir(Path(base_dir)) 128 | self.file_formats = little_helper.return_valid_fileformats() 129 | self.device = device 130 | self.file_language = file_language 131 | self.file_language_provided = file_language is not None 132 | self.model = None 133 | self.model_provided = model 134 | self.annotate = annotate 135 | self.num_speakers = num_speakers 136 | self.translate = translate 137 | self.hf_token = hf_token 138 | self.subtitle = subtitle 139 | self.sub_length = sub_length 140 | self.verbose = verbose 141 | self.del_originals = del_originals 142 | self.corrections = corrections 143 | self.export_formats = export_formats 144 | self.metadata = self._collect_metadata() 145 | self.filepaths = [] 146 | self.output_dir = None 147 | self.processed_files = [] 148 | 149 | def _collect_metadata(self): 150 | return { 151 | 'output_dir': str(self.base_dir), 152 | 'file_language': self.file_language, 153 | 'model': self.model_provided, 154 | 'device': self.device, 155 | 'annotate': self.annotate, 156 | 'num_speakers': self.num_speakers, 157 | 'translate': self.translate, 158 | 'subtitle': self.subtitle, 159 | 'sub_length': self.sub_length 160 | } 161 | 162 | def adjust_word_chunk_length(self, result: dict) -> dict: 163 | """ 164 | Generates text chunks based on the maximum number of words. 165 | 166 | Parameters: 167 | result (dict): The nested dictionary containing segments 168 | and words. 169 | max_number (int): The maximum number of words per chunk. 170 | Default is 6. 171 | 172 | Returns: 173 | dict: A dictionary containing a list of chunks, each with 174 | 'text', 'timestamp', and 'words'. 175 | """ 176 | # Flatten all words from all segments 177 | words = [ 178 | word_info 179 | for segment in result.get('segments', []) 180 | for word_info in segment.get('words', []) 181 | ] 182 | 183 | # Split words into chunks of size max_number 184 | def split_into_chunks(lst, n): 185 | """Yield successive n-sized chunks from lst.""" 186 | for i in range(0, len(lst), n): 187 | yield lst[i:i + n] 188 | 189 | chunks = [] 190 | for word_chunk in split_into_chunks(words, self.sub_length): 191 | chunk_text = ' '.join(word_info['word'] for word_info in word_chunk) 192 | chunk_start = word_chunk[0]['start'] 193 | chunk_end = word_chunk[-1]['end'] 194 | chunk = { 195 | 'timestamp': [chunk_start, chunk_end], 196 | 'text': chunk_text, 197 | 'words': word_chunk 198 | } 199 | chunks.append(chunk) 200 | 201 | result_temp = { 202 | 'text': ' '.join(chunk['text'].strip() for chunk in chunks), 203 | 'chunks': chunks 204 | } 205 | 206 | return result_temp 207 | 208 | def to_transcription_dict(self, insanely_annotation: list[dict]) -> dict: 209 | """ 210 | Transform insanely-fast-whisper speaker annotation result to dict. 
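        Illustrative example: an input of

            [{'text': ' Hello', 'timestamp': (0.0, 0.8), 'speaker': 'SPEAKER_00'}]

        becomes

            {'text': ' Hello',
             'chunks': [{'text': ' Hello',
                         'timestamp': (0.0, 0.8),
                         'speaker': 'SPEAKER_00'}]}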
211 | """ 212 | chunks = [] 213 | for s in insanely_annotation: 214 | chunk = { 215 | 'text': s['text'], 216 | 'timestamp': (s['timestamp'][0], s['timestamp'][1]), 217 | 'speaker': s['speaker'] 218 | } 219 | chunks.append(chunk) 220 | 221 | result = { 222 | 'text': ''.join([s['text'] for s in insanely_annotation]), 223 | 'chunks': chunks 224 | } 225 | return result 226 | 227 | def to_whisperx(self, transcription_result: dict) -> dict: 228 | """ 229 | Normalize insanely-fast-whisper transcription result to whisperX dict. 230 | """ 231 | words = [] 232 | for c in transcription_result['chunks']: 233 | if 'speaker' in c: 234 | word = { 235 | 'word': c['text'].strip(), 236 | 'start': c['timestamp'][0], 237 | 'end': c['timestamp'][1], 238 | 'speaker': c['speaker'] 239 | } 240 | else: 241 | word = { 242 | 'word': c['text'].strip(), 243 | 'start': c['timestamp'][0], 244 | 'end': c['timestamp'][1] 245 | } 246 | words.append(word) 247 | 248 | result = { 249 | 'segments': [ 250 | { 251 | 'start': transcription_result['chunks'][0]['timestamp'][0], 252 | 'end': transcription_result['chunks'][-1]['timestamp'][1], 253 | 'text': transcription_result['text'].strip(), 254 | 'words': words 255 | } 256 | ] 257 | } 258 | return result 259 | 260 | def create_text_with_speakers( 261 | self, 262 | transcription_dict: dict, 263 | delimiter: str = '.' 264 | ) -> dict: 265 | """ 266 | Iterates through all chunks of each language and creates the complete 267 | text with speaker labels inserted when there is a speaker change. 268 | 269 | Args: 270 | transcription_dict (dict): The dictionary containing transcription 271 | data. 272 | 273 | Returns: 274 | dict: A dictionary mapping each language to its formatted text with 275 | speaker labels. 276 | """ 277 | transcriptions = transcription_dict.get('transcriptions', {}) 278 | 279 | for lang, lang_data in transcriptions.items(): 280 | text = "" 281 | current_speaker = None 282 | chunks = lang_data.get('chunks', []) 283 | 284 | for chunk in chunks: 285 | words = chunk.get('words', []) 286 | 287 | for word_info in words: 288 | speaker = word_info.get('speaker') 289 | word = word_info.get('word', '') 290 | start_timestamp = little_helper.format_time( 291 | word_info.get('start'), 292 | delimiter 293 | ) 294 | 295 | # Insert speaker label if a speaker change is detected 296 | if speaker != current_speaker: 297 | text += f"\n[{start_timestamp}] [{speaker}] " 298 | current_speaker = speaker 299 | 300 | # Append the word with a space 301 | text += word + " " 302 | 303 | transcription_dict['transcriptions'][lang]['text_with_speaker_annotation'] = text.strip() 304 | 305 | return transcription_dict 306 | 307 | def transcribe_with_whisperx(self, filepath: Path) -> dict: 308 | """ 309 | Transcribe a file with the whisperX implementation that returns word-level 310 | timestamps and speaker annotation:https://github.com/m-bain/whisperX 311 | 312 | This implementation is used when a specific subtitle length (e.g. 5 words 313 | per individual subtitle) is needed. 314 | """ 315 | import torch 316 | import whisperx 317 | import gc 318 | 319 | def empty_cuda_cache(model): 320 | gc.collect() 321 | torch.cuda.empty_cache() 322 | del model 323 | 324 | def fill_missing_timestamps(segments: list) -> list: 325 | """ 326 | whisperX does not provide timestamps for words containing only 327 | numbers (e.g. "1.5", "2024" etc.). 
328 | 329 | The function fills these missing timestamps by padding the last 330 | known 'end' timestamp for a missing 'start' timestamp or by cutting 331 | the next known 'start' timestamp for a missing 'end' timestamp. 332 | """ 333 | padding = 0.05 # in seconds 334 | 335 | for segment in segments: 336 | words = segment['words'] 337 | num_words = len(words) 338 | 339 | for i, word in enumerate(words): 340 | # If the 'start' key is missing 341 | if 'start' not in word: 342 | if i > 0 and 'end' in words[i-1]: 343 | word['start'] = round(words[i-1]['end'] + padding, 2) 344 | else: 345 | word['start'] = segment['start'] 346 | 347 | # If the 'end' key is missing 348 | if 'end' not in word: 349 | if i < num_words - 1 and 'start' in words[i+1]: 350 | word['end'] = round(words[i+1]['start'] - padding, 2) 351 | elif i == num_words - 1: 352 | word['end'] = round(words[i]['start'] + padding, 2) 353 | segment['end'] = word['end'] 354 | else: 355 | word['end'] = round(words[i]['start'] + padding, 2) 356 | 357 | # If 'score' key is missing 358 | if 'score' not in word: 359 | word['score'] = 0.5 360 | 361 | # If 'speaker' key is missing 362 | if self.annotate and 'speaker' not in word: 363 | speaker_assigned = False 364 | 365 | # Case 1: If it's the first word, look forward for the 366 | # next speaker 367 | if i == 0: 368 | for j in range(i + 1, num_words): 369 | if 'speaker' in words[j]: 370 | word['speaker'] = words[j]['speaker'] 371 | speaker_assigned = True 372 | break 373 | 374 | # Case 2: If it's the last word, look backward for the 375 | # previous speaker 376 | elif i == num_words - 1: 377 | for j in range(i - 1, -1, -1): 378 | if 'speaker' in words[j]: 379 | word['speaker'] = words[j]['speaker'] 380 | speaker_assigned = True 381 | break 382 | 383 | # Case 3: For other words, prefer the previous speaker; 384 | # If not found, look forward 385 | if not speaker_assigned: 386 | # Look backward 387 | for j in range(i - 1, -1, -1): 388 | if 'speaker' in words[j]: 389 | word['speaker'] = words[j]['speaker'] 390 | speaker_assigned = True 391 | break 392 | 393 | if not speaker_assigned: 394 | # Look forward 395 | for j in range(i + 1, num_words): 396 | if 'speaker' in words[j]: 397 | word['speaker'] = words[j]['speaker'] 398 | speaker_assigned = True 399 | break 400 | 401 | if not speaker_assigned: 402 | # Default speaker if none found 403 | word['speaker'] = 'UNKNOWN' 404 | 405 | return segments 406 | 407 | def whisperx_task(task: str = 'transcribe', language = None): 408 | """ 409 | Define a transcription / translation task with whisperX 410 | """ 411 | # Set parameters 412 | device = 'cuda' if self.device == 'cuda:0' else 'cpu' 413 | 414 | # Transcribe or Translate 415 | model = whisperx.load_model( 416 | whisper_arch=self.model, 417 | device=device, 418 | compute_type='float16' if self.device == 'cuda:0' else 'int8', 419 | language=self.file_language or None, 420 | asr_options={ 421 | "hotwords": None, 422 | "multilingual": False 423 | }) 424 | audio = whisperx.load_audio(str(filepath), sr=16000) 425 | result = model.transcribe( 426 | audio, 427 | batch_size=16 if self.device == 'cuda:0' else 8, 428 | task=task 429 | ) 430 | model_a, metadata = whisperx.load_align_model( 431 | device=device, 432 | language_code=language 433 | ) 434 | result = whisperx.align( 435 | result["segments"], 436 | model_a, 437 | metadata, 438 | audio, 439 | device, 440 | return_char_alignments=False 441 | ) 442 | 443 | # Empty CUDA cache 444 | if self.device == 'cuda:0': 445 | empty_cuda_cache(model_a) 446 | 447 | return 
result 448 | 449 | def whisperx_annotation(transcription_result: dict) -> dict: 450 | # Set parameters 451 | device = 'cuda' if self.device == 'cuda:0' else 'cpu' 452 | 453 | diarize_model = whisperx.diarize.DiarizationPipeline( 454 | use_auth_token=self.hf_token, 455 | device=device 456 | ) 457 | diarize_segments = diarize_model( 458 | str(filepath), 459 | max_speakers=self.num_speakers 460 | ) 461 | result = whisperx.assign_word_speakers( 462 | diarize_segments, 463 | transcription_result 464 | ) 465 | 466 | # Empty CUDA cache 467 | if self.device == 'cuda:0': 468 | empty_cuda_cache(diarize_model) 469 | 470 | return result 471 | 472 | # Start and time transcription 473 | logging.info(f"👨‍💻 Transcription started with whisper🆇 for {filepath.name}") 474 | t_start = time.time() 475 | 476 | # Run the transcription 477 | transcription_task = partial( 478 | whisperx_task, 479 | task='transcribe', 480 | language=self.file_language 481 | ) 482 | transcription_result = little_helper.run_with_progress( 483 | description=f"[cyan]→ Transcribing ({'CUDA' if self.device == 'cuda:0' else 'CPU'}) [bold]{filepath.name}", 484 | task=transcription_task 485 | ) 486 | 487 | # Speaker annotation 488 | if self.annotate: 489 | annotation_task = partial( 490 | whisperx_annotation, 491 | transcription_result 492 | ) 493 | transcription_result = little_helper.run_with_progress( 494 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}", 495 | task=annotation_task 496 | ) 497 | 498 | # Fill in missing timestamps and adjust word chunk length 499 | transcription_result['segments'] = fill_missing_timestamps( 500 | transcription_result['segments'] 501 | ) 502 | transcription_result = self.adjust_word_chunk_length( 503 | transcription_result 504 | ) 505 | 506 | # Create result dict and append transcription to it 507 | result = {'transcriptions': {}} 508 | result['transcriptions'][self.file_language] = transcription_result 509 | 510 | if self.verbose: 511 | print(f"{result['transcriptions'][self.file_language]['text']}") 512 | 513 | # Translation task (to English) 514 | if self.translate and self.file_language != 'en': 515 | translation_task = partial( 516 | whisperx_task, 517 | task='translate', 518 | language='en' 519 | ) 520 | translation_result = little_helper.run_with_progress( 521 | description=f"[dark_blue]→ Translating ({'CUDA' if self.device == 'cuda:0' else 'CPU'}) [bold]{filepath.name}", 522 | task=translation_task 523 | ) 524 | 525 | # Speaker annotation 526 | if self.annotate: 527 | annotation_task = partial( 528 | whisperx_annotation, 529 | translation_result 530 | ) 531 | translation_result = little_helper.run_with_progress( 532 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}", 533 | task=annotation_task 534 | ) 535 | 536 | # Fill in missing timestamps and adjust word chunk length 537 | translation_result['segments'] = fill_missing_timestamps( 538 | translation_result['segments'] 539 | ) 540 | translation_result = self.adjust_word_chunk_length( 541 | translation_result 542 | ) 543 | result['transcriptions']['en'] = translation_result 544 | 545 | if self.verbose: 546 | print(f"{result['transcriptions']['en']['text']}") 547 | 548 | # Create full transcription with speaker annotation 549 | if self.annotate: 550 | result = self.create_text_with_speakers(result) 551 | 552 | logging.info(f"👨‍💻 Transcription completed in {time.time() - t_start:.2f} sec.") 553 | 554 | return {'transcription': result} 555 | 556 | def transcribe_with_insane_whisper(self, 
filepath: Path) -> dict: 557 | """ 558 | Transcribes a file using the 'insanely-fast-whisper' implementation: 559 | https://github.com/Vaibhavs10/insanely-fast-whisper 560 | 561 | This method utilizes the 'insanely-fast-whisper' implementation of 562 | OpenAI Whisper for automatic speech recognition on Mac M1-M4 devices. 563 | 564 | Parameters: 565 | - filepath (Path): The path to the audio file for transcription. 566 | 567 | Returns: 568 | - dict: A dictionary containing the transcription result and, if 569 | speaker detection is enabled, the speaker diarization result. 570 | The transcription result includes the recognized text and 571 | timestamps if available. 572 | """ 573 | import torch 574 | from transformers import pipeline 575 | from transformers import logging as hf_logger 576 | from whisply import diarize_utils 577 | 578 | hf_logger.set_verbosity_error() 579 | 580 | def insane_whisper_annotation(transcription_result: dict) -> dict: 581 | # Speaker annotation 582 | annotation_result = diarize_utils.diarize( 583 | transcription_result, 584 | diarization_model='pyannote/speaker-diarization-3.1', 585 | hf_token=self.hf_token, 586 | file_name=str(filepath), 587 | num_speakers=self.num_speakers, 588 | min_speakers=None, 589 | max_speakers=None, 590 | ) 591 | # Transform annotation_result to correct dict structure 592 | transcription_result = self.to_transcription_dict(annotation_result) 593 | return transcription_result 594 | 595 | # Start and time transcription 596 | logging.info( 597 | f"👨‍💻 Transcription started with 🚅 insane-whisper for {filepath.name}" 598 | ) 599 | t_start = time.time() 600 | 601 | try: 602 | pipe = pipeline( 603 | 'automatic-speech-recognition', 604 | model = self.model, 605 | torch_dtype = torch.float16, 606 | device = self.device, 607 | model_kwargs = { 608 | 'attn_implementation': 'eager' 609 | } 610 | ) 611 | 612 | # Define transcription function 613 | def transcription_task(): 614 | transcription_result = pipe( 615 | str(filepath), 616 | batch_size=1, 617 | return_timestamps='word', 618 | generate_kwargs={ 619 | 'use_cache': True, 620 | 'return_legacy_cache': False, 621 | 'language': self.file_language, 622 | 'task': "transcribe", 623 | 'forced_decoder_ids': None 624 | } 625 | ) 626 | return transcription_result 627 | 628 | # Transcription 629 | transcription_result = little_helper.run_with_progress( 630 | description=f"[cyan]→ Transcribing ({self.device.upper()}) [bold]{filepath.name}", 631 | task=transcription_task 632 | ) 633 | 634 | # Speaker annotation 635 | if self.annotate: 636 | transcription_result = little_helper.run_with_progress( 637 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}", 638 | task=partial( 639 | insane_whisper_annotation, 640 | transcription_result 641 | ) 642 | ) 643 | 644 | # Adjust word chunk length 645 | transcription_result = self.to_whisperx(transcription_result) 646 | transcription_result = self.adjust_word_chunk_length(transcription_result) 647 | 648 | # Build result dict 649 | result = {'transcriptions': {}} 650 | result['transcriptions'] = { 651 | self.file_language: transcription_result 652 | } 653 | 654 | if self.verbose: 655 | print(result['transcriptions'][self.file_language]['text']) 656 | 657 | # Translation 658 | if self.translate and self.file_language != 'en': 659 | def translation_task(): 660 | translation_result = pipe( 661 | str(filepath), 662 | batch_size=1, 663 | return_timestamps='word', 664 | generate_kwargs={ 665 | 'use_cache': True, 666 | 'return_legacy_cache': False, 
667 | 'task': 'translate', 668 | # 'forced_decoder_ids': None 669 | } 670 | ) 671 | return translation_result 672 | 673 | # Run the translation task 674 | translation_result = little_helper.run_with_progress( 675 | description=f"[dark_blue]→ Translating ({self.device.upper()}) [bold]{filepath.name}", 676 | task=translation_task 677 | ) 678 | 679 | # Speaker annotation 680 | if self.annotate: 681 | translation_result = little_helper.run_with_progress( 682 | description=f"[purple]→ Annotating ({self.device.upper()}) [bold]{filepath.name}", 683 | task=partial( 684 | insane_whisper_annotation, 685 | translation_result 686 | ) 687 | ) 688 | 689 | # Adjust word chunk length 690 | translation_result = self.to_whisperx(translation_result) 691 | translation_result = self.adjust_word_chunk_length(translation_result) 692 | 693 | result['transcriptions']['en'] = translation_result 694 | 695 | if self.verbose: 696 | print(result['transcriptions']['en']['text']) 697 | 698 | if self.annotate: 699 | # Create full transcription with speaker annotation 700 | result = self.create_text_with_speakers(result) 701 | 702 | except Exception as e: 703 | print(f'{e}') 704 | 705 | # Stop timing transcription 706 | logging.info(f"👨‍💻 Transcription completed in {time.time() - t_start:.2f} sec.") 707 | 708 | return {'transcription': result} 709 | 710 | def transcribe_with_faster_whisper( 711 | self, 712 | filepath: Path, 713 | num_workers: int = 1 714 | ) -> dict: 715 | """ 716 | Transcribes an audio file using the 'faster-whisper' implementation: 717 | https://github.com/SYSTRAN/faster-whisper 718 | 719 | This method utilizes the 'faster-whisper' implementation of OpenAI 720 | Whisper for automatic speech recognition. It loads the model and sets 721 | parameters for transcription. After transcription, it formats the 722 | result into segments with timestamps and combines them into a single 723 | text. If speaker detection is enabled, it also annotates speakers in 724 | the transcription result. 725 | 726 | Parameters: 727 | - filepath (Path): The path to the audio file for transcription. 728 | - num_workers (int): The number of workers to use for transcription. 729 | 730 | Returns: 731 | - dict: A dictionary containing the transcription result and, if 732 | speaker detection is enabled, the speaker diarization result. 733 | The transcription result includes the recognized text and 734 | segmented chunks with timestamps if available. 
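        Each chunk in the returned result has the following shape
        (values are illustrative):

            {'timestamp': (0.0, 2.5),
             'text': 'Hello there',
             'words': [{'word': 'Hello', 'start': 0.0,
                        'end': 1.1, 'score': 0.98}, ...]}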
735 | """ 736 | from faster_whisper import WhisperModel, BatchedInferencePipeline 737 | 738 | # Start and time transcription 739 | logging.info(f"👨‍💻 Transcription started with 🏃‍♀️‍➡️ faster-whisper for {filepath.name}") 740 | t_start = time.time() 741 | 742 | # Load model and set parameters 743 | model = BatchedInferencePipeline( 744 | model = WhisperModel( 745 | self.model, 746 | device='cpu' if self.device in ['mps', 'cpu'] else 'cuda', 747 | num_workers=num_workers, 748 | compute_type='int8' if self.device in ['mps', 'cpu'] else 'float16' 749 | )) 750 | 751 | # Define the transcription task 752 | def transcription_task(): 753 | segments, _ = model.transcribe( 754 | str(filepath), 755 | beam_size=5, 756 | language=self.file_language, 757 | word_timestamps=True, 758 | batch_size=16 759 | ) 760 | 761 | chunks = [] 762 | for segment in segments: 763 | seg = { 764 | 'timestamp': (float(f"{segment.start:.2f}"), float(f"{segment.end:.2f}")), 765 | 'text': segment.text.strip(), 766 | 'words': [{ 767 | 'word': i.word.strip(), 768 | 'start': float(f"{i.start:.2f}"), 769 | 'end': float(f"{i.end:.2f}"), 770 | 'score': float(f"{i.probability:.2f}") 771 | } for i in segment.words] 772 | } 773 | chunks.append(seg) 774 | 775 | # If verbose Flag 776 | if self.verbose: 777 | print(seg['text']) 778 | 779 | return chunks 780 | 781 | # Add progress bar and run the transcription task 782 | chunks = little_helper.run_with_progress( 783 | description=f"[cyan]→ Transcribing ({self.device.upper()}) [bold]{filepath.name}", 784 | task=transcription_task 785 | ) 786 | 787 | # Create result dict and append transcriptions to it 788 | result = {'transcriptions': {}} 789 | result['transcriptions'][self.file_language] = { 790 | 'text': ' '.join([segment['text'].strip() for segment in chunks]), 791 | 'chunks': chunks 792 | } 793 | 794 | # Translation 795 | if self.translate and self.file_language != 'en': 796 | # Define the translation task 797 | def translation_task(): 798 | segments, _ = model.transcribe( 799 | str(filepath), 800 | beam_size=5, 801 | task='translate', 802 | language='en', 803 | word_timestamps=True 804 | ) 805 | 806 | translation_chunks = [] 807 | for segment in segments: 808 | seg = { 809 | 'timestamp': (float(f"{segment.start:.2f}"), float(f"{segment.end:.2f}")), 810 | 'text': segment.text.strip(), 811 | 'words': [{ 812 | 'word': i.word.strip(), 813 | 'start': float(f"{i.start:.2f}"), 814 | 'end': float(f"{i.end:.2f}"), 815 | 'score': float(f"{i.probability:.2f}") 816 | } for i in segment.words] 817 | } 818 | translation_chunks.append(seg) 819 | 820 | # If verbose Flag 821 | if self.verbose: 822 | print(seg['text']) 823 | 824 | return translation_chunks 825 | 826 | # Add progress bar and run the translation task 827 | translation_chunks = little_helper.run_with_progress( 828 | description=f"[dark_blue]→ Translating ({self.device.upper()}) [bold]{filepath.name}", 829 | task=translation_task 830 | ) 831 | 832 | # Add translation to result dict 833 | result['transcriptions']['en'] = { 834 | 'text': ' '.join([segment['text'].strip() for segment in translation_chunks]), 835 | 'chunks': translation_chunks 836 | } 837 | 838 | # Stop timing transcription 839 | logging.info(f"👨‍💻 Transcription completed in {time.time() - t_start:.2f} sec.") 840 | 841 | return {'transcription': result} 842 | 843 | def detect_language(self, filepath, audio_array) -> str: 844 | """ 845 | Detects the language of the input file. 
846 | """ 847 | from faster_whisper import WhisperModel 848 | 849 | logging.info(f"Detecting language of file: {filepath.name}") 850 | 851 | def run_language_detection(): 852 | lang_detection_model = WhisperModel( 853 | models.set_supported_model( 854 | model=self.model_provided, 855 | implementation='faster-whisper', 856 | translation=self.translate 857 | ), 858 | device='cpu' if self.device in ['mps', 'cpu'] else 'cuda', 859 | compute_type='int8' if self.device in ['mps', 'cpu'] else 'float16' 860 | ) 861 | lang, score, _ = lang_detection_model.detect_language(audio_array) 862 | return lang, score 863 | 864 | lang, score = little_helper.run_with_progress( 865 | description=f"[dark_goldenrod]→ Detecting language for [bold]{filepath.name}", 866 | task=run_language_detection 867 | ) 868 | 869 | self.file_language = lang 870 | 871 | print(f'[blue1]→ Detected language "{lang}" with probability {score:.2f}') 872 | logging.info(f'Detected language → "{lang}" with probability {score:.2f}') 873 | 874 | def process_files(self, files) -> None: 875 | """ 876 | Processes a list of audio files for transcription and/or diarization. 877 | 878 | This method logs the processing parameters, extracts filepaths from the 879 | input list, and initializes an empty list for storing results. Each 880 | file is processed based on the compute device specified ('mps', 'cuda:0', 881 | or 'cpu'). Appropriate transcription method is chosen based on the 882 | device. Results, including file ids, paths, transcriptions, and 883 | diarizations, are stored in a dictionary and saved to a designated 884 | output directory. Each result is also appended to `self.processed_files`. 885 | 886 | Parameters: 887 | files (list of str): A list of file paths or file-like objects 888 | representing the audio files to be processed. 
889 | """ 890 | logging.info(f"Provided parameters for processing: {self.metadata}") 891 | 892 | # Get filepaths 893 | filepath_handler = FilePathProcessor(self.file_formats) 894 | [filepath_handler.get_filepaths(f) for f in files] 895 | self.filepaths = filepath_handler.filepaths 896 | 897 | # Process filepaths 898 | logging.info(f"Processing files: {self.filepaths}") 899 | 900 | self.processed_files = [] 901 | for idx, filepath in enumerate(self.filepaths): 902 | 903 | # Create and set output_dir and output_filepath 904 | self.output_dir = little_helper.set_output_dir(filepath, self.base_dir) 905 | output_filepath = self.output_dir / Path(filepath).stem 906 | 907 | # Convert file format 908 | filepath, audio_array = little_helper.check_file_format( 909 | filepath=filepath, 910 | del_originals=self.del_originals 911 | ) 912 | 913 | # Detect file language 914 | if not self.file_language: 915 | self.detect_language(filepath, audio_array) 916 | 917 | logging.info(f"Transcribing file: {filepath.name}") 918 | 919 | # Transcription and speaker annotation 920 | if self.device == 'mps': 921 | self.model = models.set_supported_model( 922 | self.model_provided, 923 | implementation='insane-whisper', 924 | translation=self.translate 925 | ) 926 | print(f'[blue1]→ Using {self.device.upper()} and 🚅 Insanely-Fast-Whisper with model "{self.model}"') 927 | result_data = self.transcribe_with_insane_whisper(filepath) 928 | 929 | elif self.device in ['cpu', 'cuda:0']: 930 | if self.annotate or self.subtitle: 931 | # WhisperX for annotation / subtitling 932 | self.model = models.set_supported_model( 933 | self.model_provided, 934 | implementation='whisperx', 935 | translation=self.translate 936 | ) 937 | print(f'[blue1]→ Using {self.device.upper()} and whisper🆇 with model "{self.model}"') 938 | result_data = self.transcribe_with_whisperx(filepath) 939 | else: 940 | # Faster-Whisper for raw transcription 941 | self.model = models.set_supported_model( 942 | self.model_provided, 943 | implementation='faster-whisper', 944 | translation=self.translate 945 | ) 946 | print(f'[blue1]→ Using {self.device.upper()} and 🏃‍♀️‍➡️ Faster-Whisper with model "{self.model}"') 947 | result_data = self.transcribe_with_faster_whisper(filepath) 948 | 949 | result = { 950 | 'id': f'file_00{idx + 1}', 951 | 'created': datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 952 | 'input_filepath': str(filepath.absolute()), 953 | 'output_filepath': str(Path(output_filepath).absolute()), 954 | 'written_files': None, 955 | 'device': self.device, 956 | 'model': self.model, 957 | 'transcription': result_data['transcription']['transcriptions'], 958 | } 959 | 960 | # Save results 961 | result['written_files'] = output_utils.OutputWriter( 962 | corrections=self.corrections 963 | ).save_results( 964 | result=result, 965 | export_formats=self.export_formats 966 | ) 967 | 968 | self.processed_files.append(result) 969 | 970 | if not self.file_language_provided: 971 | self.file_language = None 972 | --------------------------------------------------------------------------------