├── README.md ├── check_audio_length.py ├── convert_audio_to_mono_22050.py ├── download_yt_files.py ├── full_audio_files └── .gitkeep ├── main.py ├── requirements.txt └── yt_links.txt /README.md: -------------------------------------------------------------------------------- 1 | # Audio Segmentation and Transcription with WhisperX 2 | 3 | This repository contains Python scripts to segment long audio files and transcribe the resulting chunks using WhisperX. WhisperX builds on OpenAI's Whisper ASR (Automatic Speech Recognition) model, adding batched inference and accurate word- and character-level timestamps via forced alignment, which is what lets these scripts cut audio at precise boundaries. 4 | 5 | ## Purpose 6 | 7 | The purpose of this repository is to split long audio files into smaller chunks and transcribe each chunk into text. This is useful for creating training data for TTS or ASR models, generating subtitles for videos, or extracting specific spoken content from audio recordings. 8 | 9 | ## Usage 10 | 11 | ### 1. Prerequisites 12 | 13 | Before using the scripts in this repository, ensure that you have the following installed: 14 | 15 | - Python 3.x 16 | - `yt-dlp` for downloading YouTube audio files. You can download `yt-dlp` from [https://github.com/yt-dlp/yt-dlp/releases](https://github.com/yt-dlp/yt-dlp/releases). 17 | 18 | ### 2. Set Up 19 | 20 | 1. Clone the repository to your local machine: 21 | 22 | ```bash 23 | git clone https://github.com/your-username/audio-segmentation-transcription.git 24 | cd audio-segmentation-transcription 25 | ``` 26 | 27 | 2. Create a virtual environment (optional but recommended): 28 | 29 | ```bash 30 | python -m venv venv 31 | source venv/bin/activate # On Windows, use venv\Scripts\activate 32 | ``` 33 | 34 | 3. Install the required packages: 35 | 36 | ```bash 37 | pip install -r requirements.txt 38 | ``` 39 | 40 | ### 3. Audio Segmentation and Transcription 41 | 42 | #### Segmentation and Transcription from Local or YouTube Audio 43 | 44 | 1. Place your audio files (in WAV format) inside the `full_audio_files` directory. 45 | 46 | 2. Alternatively, to source audio from YouTube, list the video links in `yt_links.txt` (one per line; lines starting with `#` are ignored). 47 | 48 | 3. Run the `download_yt_files.py` script to download the YouTube audio and resample it to mono 22050 Hz. The downloaded files land in `yt_downloaded/{run_name}`; move the ones you want to process into `full_audio_files`: 49 | 50 | ```bash 51 | python download_yt_files.py 52 | ``` 53 | 54 | 4. Run the `main.py` script to segment and transcribe the audio files: 55 | 56 | ```bash 57 | python main.py 58 | ``` 59 | 60 | The segmented audio files and corresponding transcriptions will be saved in the `output_audio_segments` directory. 61 | 62 | ### 4. Output 63 | 64 | - **Segmented Audio Files**: The segmented audio files (in WAV format) will be saved in the `output_audio_segments/{run_name}/audio` directory. 65 | 66 | - **Transcriptions**: The transcriptions for each segment will be saved in `output_audio_segments/{run_name}/train.txt` (for training data) and `output_audio_segments/{run_name}/validation.txt` (for validation data). 67 | 68 | ## Notes 69 | 70 | - Ensure that you have enough free disk space; segmenting long recordings produces a large number of WAV files. 71 | 72 | - You can adjust the parameters in the scripts (such as `batch_size`, `compute_type`, and `language`) to customize the behavior of the WhisperX model according to your requirements. 73 | 74 | - For more information about WhisperX, refer to the official repository: [https://github.com/m-bain/whisperX](https://github.com/m-bain/whisperX). 75 | 76 | - This README was generated with the help of a GPT model; any fixes are more than welcome 😃 77 |
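- For reference, each line of `train.txt` and `validation.txt` pairs a segment path (relative to the run directory) with its transcription, separated by `|`, which is close to the LJSpeech-style metadata layout commonly used for TTS training. An illustrative line (the file name and text below are made up):

```
audio/my_recording_12.wav| transcribed text of the segment
```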
-------------------------------------------------------------------------------- /check_audio_length.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | 4 | import librosa 5 | from tqdm import tqdm 6 | 7 | # Simple script to keep track of how much audio you have already collected 8 | # Edit the folders list to match the folders you wish to measure 9 | folders = ['output_audio_segments/2023_10_25_12_34_brave-itinerary_6h/audio', 10 | 'output_audio_segments/2023_10_29_00_10_callous-class/audio', 11 | 'output_audio_segments/2023_10_30_00_00_tempered-reef/audio'] 12 | 13 | total_length = 0 14 | 15 | # Iterate through the list of folders and glob files 16 | for folder in folders: 17 | # Use os.path.join to create the full path to the folder 18 | folder_path = os.path.join(folder, '*.wav') 19 | 20 | # Use glob.glob to get a list of files matching the pattern in the folder 21 | files = glob.glob(folder_path) 22 | 23 | # Iterate through the files and calculate total length 24 | for file in tqdm(files): 25 | audio, sr = librosa.load(file, sr=None) # sr=None keeps the native rate; resampling is not needed just to measure length 26 | total_length += librosa.get_duration(y=audio, sr=sr) 27 | 28 | # Convert total length to hours 29 | total_length_hours = total_length / 3600 # 1 hour = 3600 seconds 30 | 31 | print(f"Total length of all files: {total_length_hours:.2f} hours") 32 | -------------------------------------------------------------------------------- /convert_audio_to_mono_22050.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import librosa 3 | import soundfile as sf 4 | from tqdm import tqdm 5 | 6 | input_folder = "full_audio_files/all/*.wav" 7 | 8 | for file in tqdm(glob.glob(input_folder)): 9 | # Load the audio at its native sample rate and channel layout (librosa would otherwise downmix and resample on load, making the steps below no-ops) 10 | audio, sr = librosa.load(file, sr=None, mono=False) 11 | 12 | audio = librosa.to_mono(audio) 13 | audio = librosa.resample(audio, orig_sr=sr, target_sr=22050) 14 | 15 | # Save the converted audio back to the same file 16 | sf.write(file, audio, 22050) 17 |
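If you want to double-check the conversion, a quick sanity pass over the same folder could look like the sketch below (it reuses `soundfile` and `glob`, which the script already imports; the glob pattern is the one assumed above):

```python
import glob

import soundfile as sf

# Report any file that is not mono / 22050 Hz after conversion
for file in glob.glob("full_audio_files/all/*.wav"):
    info = sf.info(file)
    if info.channels != 1 or info.samplerate != 22050:
        print(f"Needs another pass: {file} ({info.channels} ch, {info.samplerate} Hz)")
```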
-------------------------------------------------------------------------------- /download_yt_files.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | import subprocess 4 | from datetime import datetime 5 | 6 | import librosa 7 | import randomname 8 | import soundfile as sf 9 | from tqdm import tqdm 10 | 11 | 12 | # You have to install yt-dlp 13 | def check_ytdlp_availability(): 14 | try: 15 | # Try running yt-dlp with the --version option to check if it's available 16 | subprocess.run(["yt-dlp", "--version"], check=True) 17 | print("yt-dlp is available on your system.") 18 | except FileNotFoundError: 19 | print("yt-dlp is not available on your system.") 20 | print("Please download it from https://github.com/yt-dlp/yt-dlp/releases.") 21 | exit(1) 22 | 23 | 24 | check_ytdlp_availability() 25 | 26 | run_name = datetime.now().strftime("%Y_%m_%d_%H_%M") + f"_{randomname.get_name()}" 27 | yt_download_dir = os.path.join('yt_downloaded', run_name) 28 | os.makedirs(yt_download_dir, exist_ok=True) 29 | 30 | with open('yt_links.txt', 'r', encoding='utf-8') as f: 31 | all_links = f.readlines() 32 | all_links = [link.strip() for link in all_links if link.strip() and not link.startswith('#')] 33 | 34 | os.chdir(yt_download_dir) 35 | for link in tqdm(all_links, desc='Downloading from YouTube'): 36 | os.system(f"yt-dlp --quiet -f ba --audio-format wav --extract-audio {link}") 37 | 38 | # Convert to mono and 22050 Hz 39 | for file in tqdm(glob.glob('*.wav'), desc='Resampling all files'): 40 | audio, sr = librosa.load(file, sr=22050, mono=True) 41 | sf.write(file, audio, 22050) 42 | 43 | # Calculate total length of all files in the folder 44 | total_length = 0 45 | for file in glob.glob('*.wav'): 46 | audio, _ = librosa.load(file, sr=22050, mono=True) 47 | total_length += librosa.get_duration(y=audio) 48 | 49 | # Convert total length to hours 50 | total_length_hours = total_length / 3600 # 1 hour = 3600 seconds 51 | 52 | print(f"Total length of all files: {total_length_hours:.2f} hours") 53 |
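The download loop above interpolates each raw link into a shell command through `os.system`. If you would rather avoid the shell entirely, an equivalent loop via `subprocess` (already imported by the script) could look like this sketch: the same yt-dlp flags, just passed as an argument list, with `all_links` being the list built by the script:

```python
import subprocess

from tqdm import tqdm

for link in tqdm(all_links, desc='Downloading from YouTube'):
    # Same flags as the os.system call: best audio, extracted to WAV, quiet output
    subprocess.run(
        ["yt-dlp", "--quiet", "-f", "ba", "--extract-audio", "--audio-format", "wav", link],
        check=True,
    )
```

Note that `check=True` aborts on the first failed download, whereas `os.system` silently keeps going; drop it if you prefer the original behaviour.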
-------------------------------------------------------------------------------- /full_audio_files/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MisterCapi/auto_dataset_tts/1d220419726473bbef44ea36f4f99abd1df12693/full_audio_files/.gitkeep -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | from dataclasses import dataclass 4 | from datetime import datetime 5 | from typing import List 6 | 7 | import librosa 8 | import numpy as np 9 | import randomname 10 | import soundfile as sf 11 | import torch.cuda 12 | import whisperx 13 | from tqdm import tqdm 14 | 15 | device = "cuda" if torch.cuda.is_available() else "cpu" 16 | batch_size = 3 # reduce if low on GPU mem 17 | compute_type = "float16" # change to "int8" if low on GPU mem or running on CPU (may reduce accuracy) 18 | language = "pl" 19 | 20 | # Load the WhisperX transcription model and the alignment model for the chosen language 21 | model = whisperx.load_model("large-v2", device, compute_type=compute_type, language=language) 22 | model_a, metadata = whisperx.load_align_model(language_code=language, device=device) 23 | 24 | 25 | continue_run = "" 26 | while continue_run.lower() not in ('yes', 'no'): 27 | continue_run = input('Do you want to continue previous run? [yes/no]: ') 28 | 29 | continue_run = continue_run.lower() == 'yes' 30 | 31 | if continue_run: 32 | run_names = os.listdir('output_audio_segments') 33 | run_name = "" 34 | while run_name not in run_names: 35 | run_name = input("What run do you wish to continue?\nEnter name [ex. 2023_10_29_00_10_callous-class]: ") 36 | else: 37 | run_name = datetime.now().strftime("%Y_%m_%d_%H_%M") + f"_{randomname.get_name()}" 38 | print('Starting a new run.') 39 | 40 | # Load the list of files that were already processed if continuing a previous run 41 | files_done_path = os.path.join('output_audio_segments', run_name, 'files_done.txt') 42 | files_done = [] 43 | if continue_run: 44 | if os.path.exists(files_done_path): 45 | with open(files_done_path, 'r', encoding='utf-8') as f: 46 | files_done = [x.strip() for x in f.readlines()] 47 | 48 | 49 | @dataclass 50 | class Segment: 51 | text: str 52 | filepath: str 53 | duration: float 54 | 55 | 56 | def cut_sample_to_speech_only(audio_path) -> str: 57 | audio_sample = whisperx.load_audio(audio_path) 58 | result_sample = model.transcribe(audio_sample, batch_size=batch_size, chunk_size=30) 59 | result_sample = whisperx.align(result_sample["segments"], model_a, metadata, audio_sample, device, return_char_alignments=True) 60 | 61 | try: 62 | segment = result_sample['segments'][-1] 63 | chars = [char for char in segment['chars'] if char.get('end')] 64 | long_char_threshold = 0.5 # characters "spoken" for longer than this are treated as alignment artifacts 65 | end = 0 66 | for char in reversed(chars): 67 | if char.get('char').strip() and (char.get('end') - char.get('start')) < long_char_threshold: 68 | end = char['end'] 69 | break 70 | if not end: 71 | end = chars[-1]['end'] 72 | except IndexError: 73 | print(f"Skipping {audio_path}") 74 | return "" 75 | 76 | # Load audio with Librosa 77 | audio, sr = librosa.load(audio_path) 78 | 79 | # Calculate dB amplitude 80 | amplitude_db = librosa.amplitude_to_db(np.abs(audio), ref=np.max) 81 | 82 | # This logic is here to prevent whisper from cutting audio too short (this happened quite often in my experience) 83 | # If you trust whisper 100% (or have noisy audio) -> comment out all "optional" code below 84 | # ======= OPTIONAL CODE ======= 85 | window_size_sec = 0.05 86 | silence_threshold_db = -30 87 | window_size_samples = int(window_size_sec * sr) 88 | start_frame = int(end * sr) 89 | 90 | audio_windows = [amplitude_db[x:x + window_size_samples] for x in 91 | range(start_frame, len(amplitude_db), window_size_samples)] 92 | 93 | audio_windows = [np.max(x) for x in audio_windows] # peak dB of each 50 ms window after the proposed cut point 94 | 95 | for window in audio_windows: 96 | if window > silence_threshold_db: 97 | end += window_size_sec 98 | else: 99 | break 100 | # ======= END OPTIONAL CODE ======= 101 | 102 | sf.write(audio_path, audio[:int(end * sr)], int(sr)) 103 | return " ".join([seg['text'].strip() for seg in result_sample['segments']]) 104 | 105 |
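# For intuition, the aligned result consumed by cut_sample_to_speech_only() above has
# roughly the shape sketched below (the keys mirror how the code accesses the result;
# the values are illustrative only, not real model output):
#
#   result_sample = {
#       "segments": [
#           {"text": "przykladowe zdanie",
#            "chars": [{"char": "p", "start": 0.10, "end": 0.16},
#                      {"char": " ", "start": 0.52},                 # no 'end' -> filtered out
#                      {"char": "e", "start": 1.20, "end": 2.40}]},  # lasts >= 0.5 s -> skipped
#       ],
#   }
#
# Walking the last segment's chars backwards and keeping the first non-whitespace
# character shorter than long_char_threshold puts the cut point at end=0.16 in this
# sketch, instead of trusting the suspiciously long trailing character ending at 2.40.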
106 | def cut_and_save_audio(input_audio_path, segments, target_sampling_rate=22050): 107 | """ 108 | Load audio from input_audio_path, cut segments based on the provided start and end times, 109 | and save them into output_audio_segments/{run_name}/audio at the specified target_sampling_rate. 110 | 111 | Args: 112 | input_audio_path (str): Path to the input audio file. 113 | segments (list): List of dicts with 'start' and 'end' times (in seconds) of the audio segments to cut. 114 | target_sampling_rate (int, optional): Sampling rate of the output audio segments. Default is 22050. 115 | """ 116 | # Create the output directory if it doesn't exist 117 | output_dir_name = os.path.join('output_audio_segments', run_name, 'audio') 118 | os.makedirs(output_dir_name, exist_ok=True) 119 | 120 | # Load the input audio (librosa resamples it to the target rate on load) 121 | audio, sampling_rate = librosa.load(input_audio_path, sr=target_sampling_rate) 122 | 123 | # Cut and save audio segments 124 | outputs = [] 125 | output_prefix = os.path.splitext(os.path.basename(input_audio_path))[0] 126 | for idx, segment in tqdm(enumerate(segments), desc=input_audio_path, total=len(segments), leave=False): 127 | start_sample = int(segment['start'] * sampling_rate) 128 | end_sample = int(segment['end'] * sampling_rate) 129 | audio_segment = audio[start_sample:end_sample] 130 | output_path = os.path.join(output_dir_name, f"{output_prefix}_{idx + 1}.wav") 131 | sf.write(output_path, audio_segment, target_sampling_rate) 132 | # We run whisperx again on each cut sample to trim trailing non-speech audio 133 | segment_text = cut_sample_to_speech_only(output_path) 134 | if segment_text: 135 | outputs.append(Segment(text=segment_text, 136 | filepath=output_path, 137 | duration=librosa.get_duration(y=audio_segment, sr=sampling_rate))) 138 | else: 139 | try: 140 | os.remove(output_path) 141 | except FileNotFoundError: 142 | print(f"File '{output_path}' not found.") 143 | except PermissionError: 144 | print(f"You do not have permission to delete '{output_path}'.") 145 | except Exception as e: 146 | print(f"An error occurred: {e}") 147 | 148 | return outputs 149 | 150 | 151 | def create_segments_for_files(files_to_segment: List[str]): 152 | 153 | for audio_file in tqdm(sorted(files_to_segment, key=lambda x: os.path.getsize(x), reverse=True), 154 | desc="Transcribing and segmenting files"): 155 | if audio_file in files_done: 156 | continue 157 | 158 | audio = whisperx.load_audio(audio_file) 159 | result = model.transcribe(audio, batch_size=batch_size, chunk_size=30) 160 | 161 | # Align the whisper output 162 | result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=True) 163 | 164 | # Cut the file into one WAV per aligned segment and transcribe each piece 165 | target_sampling_rate = 22050 166 | 167 | segment_objects = cut_and_save_audio(audio_file, result["segments"], target_sampling_rate) 168 | 169 | for segment in segment_objects: 170 | segment.filepath = os.path.join(*segment.filepath.split(os.sep)[-2:]) # keep only the trailing "audio/<file>.wav" part 171 | segment.filepath = segment.filepath.replace('\\', '/') 172 | 173 | segment_objects = sorted(segment_objects, key=lambda x: x.duration, reverse=True) 174 | 175 | split_id = int(0.95 * len(segment_objects)) 176 | train_segments = segment_objects[:split_id] 177 | validation_segments = segment_objects[split_id:] 178 | 179 | output_dir_name = os.path.join('output_audio_segments', run_name) 180 | with open(os.path.join(output_dir_name, 'train.txt'), 'a', encoding='utf-8') as f: 181 | for segment in train_segments: 182 | f.write(f"{segment.filepath}| {segment.text.strip()}\n") 183 | 184 | with open(os.path.join(output_dir_name, 'validation.txt'), 'a', encoding='utf-8') as f: 185 | for segment in validation_segments: 186 | f.write(f"{segment.filepath}| {segment.text.strip()}\n") 187 | 188 | with open(files_done_path, 'a', encoding='utf-8') as f: 189 | f.write(f"{audio_file}\n") 190 | 191 | 192 | if __name__ == '__main__': 193 | print(f"Starting run for: {run_name}") 194 | files_to_segment = glob.glob("full_audio_files/*.wav") 195 | 
create_segments_for_files(files_to_segment) 196 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MisterCapi/auto_dataset_tts/1d220419726473bbef44ea36f4f99abd1df12693/requirements.txt -------------------------------------------------------------------------------- /yt_links.txt: -------------------------------------------------------------------------------- 1 | # Example youtube links for yt downloader, you can comment out lines using '#' 2 | # https://www.youtube.com/watch?v=SzRLUaHG_wk 3 | # https://www.youtube.com/watch?v=hUw7AI_VUyE 4 | # https://www.youtube.com/watch?v=YTP-4f2FnVQ 5 | # https://www.youtube.com/watch?v=D58YUx-SpUM 6 | # https://www.youtube.com/watch?v=RoC925_jMYw 7 | # https://www.youtube.com/watch?v=Wi4fYGmzgAY 8 | # https://www.youtube.com/watch?v=pnSaBK639-8 9 | # https://www.youtube.com/watch?v=FQOatj4EPAk 10 | # https://www.youtube.com/watch?v=KUte-BZUW-4 11 | # https://www.youtube.com/watch?v=p274nNifPBs 12 | # https://www.youtube.com/watch?v=Gi2y2OygoaE 13 | # https://www.youtube.com/watch?v=1KiIgAlHI5o 14 | # https://www.youtube.com/watch?v=2tPdvtAW1DY 15 | # https://www.youtube.com/watch?v=eIvrNhrnVDI 16 | # https://www.youtube.com/watch?v=0J8SRHLchEw 17 | # https://www.youtube.com/watch?v=vY0C1_iphi8 18 | # https://www.youtube.com/watch?v=S_pkrU9Cc8s 19 | # https://www.youtube.com/watch?v=o5sNIkFgWbE 20 | # new 21 | https://www.youtube.com/watch?v=3ZEpHRVw34Q 22 | https://www.youtube.com/watch?v=o-mt3M847QI 23 | https://www.youtube.com/watch?v=KU7ThZauNsQ 24 | https://www.youtube.com/watch?v=9VSyRI5WdvY 25 | https://www.youtube.com/watch?v=2Esl5B1f6NM 26 | https://www.youtube.com/watch?v=iO36PWK-lnk 27 | https://www.youtube.com/watch?v=NSgY94czLPM 28 | https://www.youtube.com/watch?v=VLKCaTOzLzg 29 | https://www.youtube.com/watch?v=KOK5NSG2sao 30 | https://www.youtube.com/watch?v=Iq7YgV5jzHY 31 | https://www.youtube.com/watch?v=OMJLNEP286k 32 | https://www.youtube.com/watch?v=HZy51oklubM 33 | https://www.youtube.com/watch?v=fTGlzaNfyfg 34 | https://www.youtube.com/watch?v=Hc7QBD1H6pw 35 | https://www.youtube.com/watch?v=qA58lSo_LkA 36 | https://www.youtube.com/watch?v=v_HDxDF4vNM 37 | https://www.youtube.com/watch?v=d3iKibviy2g 38 | https://www.youtube.com/watch?v=wB0jjk_C9QY 39 | https://www.youtube.com/watch?v=QCTYEFHbTlM 40 | https://www.youtube.com/watch?v=-T-OKluEEgA 41 | https://www.youtube.com/watch?v=sNJiKXTYsUQ 42 | https://www.youtube.com/watch?v=LDvB-cH1FeI 43 | https://www.youtube.com/watch?v=T7z0vhuTeRw 44 | https://www.youtube.com/watch?v=WHF9uXwdbyg 45 | https://www.youtube.com/watch?v=qE1vN0e2QpA 46 | https://www.youtube.com/watch?v=2MHe_I5wntc --------------------------------------------------------------------------------