├── README.md ├── check_audio_length.py ├── convert_audio_to_mono_22050.py ├── download_yt_files.py ├── full_audio_files └── .gitkeep ├── main.py ├── requirements.txt └── yt_links.txt /README.md: -------------------------------------------------------------------------------- 1 | # Audio Segmentation and Transcription with WhisperX 2 | 3 | This repository contains Python scripts to segment long audio files and transcribe the resulting chunks using WhisperX. WhisperX builds on OpenAI's Whisper ASR (Automatic Speech Recognition) model, adding batched inference and accurate word- and character-level timestamps via forced alignment, which is what lets these scripts cut audio at precise boundaries. 4 | 5 | ## Purpose 6 | 7 | The purpose of this repository is to split long audio files into smaller chunks and transcribe each chunk into text. This is useful for creating training data for TTS or ASR models, generating subtitles for videos, or extracting specific spoken content from audio recordings. 8 | 9 | ## Usage 10 | 11 | ### 1. Prerequisites 12 | 13 | Before using the scripts in this repository, ensure that you have the following installed: 14 | 15 | - Python 3.x 16 | - `yt-dlp` for downloading YouTube audio files. You can download `yt-dlp` from [https://github.com/yt-dlp/yt-dlp/releases](https://github.com/yt-dlp/yt-dlp/releases). 17 | 18 | ### 2. Set Up 19 | 20 | 1. Clone the repository to your local machine: 21 | 22 | ```bash 23 | git clone https://github.com/your-username/audio-segmentation-transcription.git 24 | cd audio-segmentation-transcription 25 | ``` 26 | 27 | 2. Create a virtual environment (optional but recommended): 28 | 29 | ```bash 30 | python -m venv venv 31 | source venv/bin/activate # On Windows, use venv\Scripts\activate 32 | ``` 33 | 34 | 3. Install the required packages: 35 | 36 | ```bash 37 | pip install -r requirements.txt 38 | ``` 39 | 40 | ### 3. Audio Segmentation and Transcription 41 | 42 | #### Segmentation and Transcription from Local or YouTube Audio 43 | 44 | 1. Place your audio files (in WAV format) inside the `full_audio_files` directory. 45 | 46 | 2. Alternatively, to source audio from YouTube, list the video links in `yt_links.txt` (one per line; lines starting with `#` are ignored). 47 | 48 | 3. Run the `download_yt_files.py` script to download the YouTube audio and resample it to mono 22050 Hz. The downloaded files land in `yt_downloaded/{run_name}`; move the ones you want to process into `full_audio_files`: 49 | 50 | ```bash 51 | python download_yt_files.py 52 | ``` 53 | 54 | 4. Run the `main.py` script to segment and transcribe the audio files: 55 | 56 | ```bash 57 | python main.py 58 | ``` 59 | 60 | The segmented audio files and corresponding transcriptions will be saved in the `output_audio_segments` directory. 61 | 62 | ### 4. Output 63 | 64 | - **Segmented Audio Files**: The segmented audio files (in WAV format) will be saved in the `output_audio_segments/{run_name}/audio` directory. 65 | 66 | - **Transcriptions**: The transcriptions for each segment will be saved in `output_audio_segments/{run_name}/train.txt` (for training data) and `output_audio_segments/{run_name}/validation.txt` (for validation data). 67 | 68 | ## Notes 69 | 70 | - Ensure that you have enough free disk space; segmenting long recordings produces a large number of WAV files. 71 | 72 | - You can adjust the parameters in the scripts (such as `batch_size`, `compute_type`, and `language`) to customize the behavior of the WhisperX model according to your requirements. 73 | 74 | - For more information about WhisperX, refer to the official repository: [https://github.com/m-bain/whisperX](https://github.com/m-bain/whisperX). 75 | 76 | - This README was generated with the help of a GPT model; any fixes are more than welcome 😃 77 |
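- For reference, each line of `train.txt` and `validation.txt` pairs a segment path (relative to the run directory) with its transcription, separated by `|`, which is close to the LJSpeech-style metadata layout commonly used for TTS training. An illustrative line (the file name and text below are made up):

```
audio/my_recording_12.wav| transcribed text of the segment
```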
-------------------------------------------------------------------------------- /check_audio_length.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | 4 | import librosa 5 | from tqdm import tqdm 6 | 7 | # Simple script to keep track of how much audio you have already collected 8 | # Edit the folders list to match the folders you wish to measure 9 | folders = ['output_audio_segments/2023_10_25_12_34_brave-itinerary_6h/audio', 10 | 'output_audio_segments/2023_10_29_00_10_callous-class/audio', 11 | 'output_audio_segments/2023_10_30_00_00_tempered-reef/audio'] 12 | 13 | total_length = 0 14 | 15 | # Iterate through the list of folders and glob files 16 | for folder in folders: 17 | # Use os.path.join to create the full path to the folder 18 | folder_path = os.path.join(folder, '*.wav') 19 | 20 | # Use glob.glob to get a list of files matching the pattern in the folder 21 | files = glob.glob(folder_path) 22 | 23 | # Iterate through the files and calculate total length 24 | for file in tqdm(files): 25 | audio, sr = librosa.load(file, sr=None) # sr=None keeps the native rate; resampling is not needed just to measure length 26 | total_length += librosa.get_duration(y=audio, sr=sr) 27 | 28 | # Convert total length to hours 29 | total_length_hours = total_length / 3600 # 1 hour = 3600 seconds 30 | 31 | print(f"Total length of all files: {total_length_hours:.2f} hours") 32 | -------------------------------------------------------------------------------- /convert_audio_to_mono_22050.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import librosa 3 | import soundfile as sf 4 | from tqdm import tqdm 5 | 6 | input_folder = "full_audio_files/all/*.wav" 7 | 8 | for file in tqdm(glob.glob(input_folder)): 9 | # Load the audio at its native sample rate and channel layout (librosa would otherwise downmix and resample on load, making the steps below no-ops) 10 | audio, sr = librosa.load(file, sr=None, mono=False) 11 | 12 | audio = librosa.to_mono(audio) 13 | audio = librosa.resample(audio, orig_sr=sr, target_sr=22050) 14 | 15 | # Save the converted audio back to the same file 16 | sf.write(file, audio, 22050) 17 |
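If you want to double-check the conversion, a quick sanity pass over the same folder could look like the sketch below (it reuses `soundfile` and `glob`, which the script already imports; the glob pattern is the one assumed above):

```python
import glob

import soundfile as sf

# Report any file that is not mono / 22050 Hz after conversion
for file in glob.glob("full_audio_files/all/*.wav"):
    info = sf.info(file)
    if info.channels != 1 or info.samplerate != 22050:
        print(f"Needs another pass: {file} ({info.channels} ch, {info.samplerate} Hz)")
```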
-------------------------------------------------------------------------------- /download_yt_files.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | import subprocess 4 | from datetime import datetime 5 | 6 | import librosa 7 | import randomname 8 | import soundfile as sf 9 | from tqdm import tqdm 10 | 11 | 12 | # You have to install yt-dlp 13 | def check_ytdlp_availability(): 14 | try: 15 | # Try running yt-dlp with the --version option to check if it's available 16 | subprocess.run(["yt-dlp", "--version"], check=True) 17 | print("yt-dlp is available on your system.") 18 | except FileNotFoundError: 19 | print("yt-dlp is not available on your system.") 20 | print("Please download it from https://github.com/yt-dlp/yt-dlp/releases.") 21 | exit(1) 22 | 23 | 24 | check_ytdlp_availability() 25 | 26 | run_name = datetime.now().strftime("%Y_%m_%d_%H_%M") + f"_{randomname.get_name()}" 27 | yt_download_dir = os.path.join('yt_downloaded', run_name) 28 | os.makedirs(yt_download_dir, exist_ok=True) 29 | 30 | with open('yt_links.txt', 'r', encoding='utf-8') as f: 31 | all_links = f.readlines() 32 | all_links = [link.strip() for link in all_links if link.strip() and not link.startswith('#')] 33 | 34 | os.chdir(yt_download_dir) 35 | for link in tqdm(all_links, desc='Downloading from YouTube'): 36 | os.system(f"yt-dlp --quiet -f ba --audio-format wav --extract-audio {link}") 37 | 38 | # Convert to mono and 22050 Hz 39 | for file in tqdm(glob.glob('*.wav'), desc='Resampling all files'): 40 | audio, sr = librosa.load(file, sr=22050, mono=True) 41 | sf.write(file, audio, 22050) 42 | 43 | # Calculate total length of all files in the folder 44 | total_length = 0 45 | for file in glob.glob('*.wav'): 46 | audio, _ = librosa.load(file, sr=22050, mono=True) 47 | total_length += librosa.get_duration(y=audio) 48 | 49 | # Convert total length to hours 50 | total_length_hours = total_length / 3600 # 1 hour = 3600 seconds 51 | 52 | print(f"Total length of all files: {total_length_hours:.2f} hours") 53 |
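The download loop above interpolates each raw link into a shell command through `os.system`. If you would rather avoid the shell entirely, an equivalent loop via `subprocess` (already imported by the script) could look like this sketch: the same yt-dlp flags, just passed as an argument list, with `all_links` being the list built by the script:

```python
import subprocess

from tqdm import tqdm

for link in tqdm(all_links, desc='Downloading from YouTube'):
    # Same flags as the os.system call: best audio, extracted to WAV, quiet output
    subprocess.run(
        ["yt-dlp", "--quiet", "-f", "ba", "--extract-audio", "--audio-format", "wav", link],
        check=True,
    )
```

Note that `check=True` aborts on the first failed download, whereas `os.system` silently keeps going; drop it if you prefer the original behaviour.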
-------------------------------------------------------------------------------- /full_audio_files/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MisterCapi/auto_dataset_tts/1d220419726473bbef44ea36f4f99abd1df12693/full_audio_files/.gitkeep -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | from dataclasses import dataclass 4 | from datetime import datetime 5 | from typing import List 6 | 7 | import librosa 8 | import numpy as np 9 | import randomname 10 | import soundfile as sf 11 | import torch.cuda 12 | import whisperx 13 | from tqdm import tqdm 14 | 15 | device = "cuda" if torch.cuda.is_available() else "cpu" 16 | batch_size = 3 # reduce if low on GPU mem 17 | compute_type = "float16" # change to "int8" if low on GPU mem or running on CPU (may reduce accuracy) 18 | language = "pl" 19 | 20 | # Load the WhisperX transcription model and the alignment model for the chosen language 21 | model = whisperx.load_model("large-v2", device, compute_type=compute_type, language=language) 22 | model_a, metadata = whisperx.load_align_model(language_code=language, device=device) 23 | 24 | 25 | continue_run = "" 26 | while continue_run.lower() not in ('yes', 'no'): 27 | continue_run = input('Do you want to continue previous run? [yes/no]: ') 28 | 29 | continue_run = continue_run.lower() == 'yes' 30 | 31 | if continue_run: 32 | run_names = os.listdir('output_audio_segments') 33 | run_name = "" 34 | while run_name not in run_names: 35 | run_name = input("What run do you wish to continue?\nEnter name [ex. 2023_10_29_00_10_callous-class]: ") 36 | else: 37 | run_name = datetime.now().strftime("%Y_%m_%d_%H_%M") + f"_{randomname.get_name()}" 38 | print('Starting a new run.') 39 | 40 | # Load the list of files that were already processed if continuing a previous run 41 | files_done_path = os.path.join('output_audio_segments', run_name, 'files_done.txt') 42 | files_done = [] 43 | if continue_run: 44 | if os.path.exists(files_done_path): 45 | with open(files_done_path, 'r', encoding='utf-8') as f: 46 | files_done = [x.strip() for x in f.readlines()] 47 | 48 | 49 | @dataclass 50 | class Segment: 51 | text: str 52 | filepath: str 53 | duration: float 54 | 55 | 56 | def cut_sample_to_speech_only(audio_path) -> str: 57 | audio_sample = whisperx.load_audio(audio_path) 58 | result_sample = model.transcribe(audio_sample, batch_size=batch_size, chunk_size=30) 59 | result_sample = whisperx.align(result_sample["segments"], model_a, metadata, audio_sample, device, return_char_alignments=True) 60 | 61 | try: 62 | segment = result_sample['segments'][-1] 63 | chars = [char for char in segment['chars'] if char.get('end')] 64 | long_char_threshold = 0.5 # characters "spoken" for longer than this are treated as alignment artifacts 65 | end = 0 66 | for char in reversed(chars): 67 | if char.get('char').strip() and (char.get('end') - char.get('start')) < long_char_threshold: 68 | end = char['end'] 69 | break 70 | if not end: 71 | end = chars[-1]['end'] 72 | except IndexError: 73 | print(f"Skipping {audio_path}") 74 | return "" 75 | 76 | # Load audio with Librosa 77 | audio, sr = librosa.load(audio_path) 78 | 79 | # Calculate dB amplitude 80 | amplitude_db = librosa.amplitude_to_db(np.abs(audio), ref=np.max) 81 | 82 | # This logic is here to prevent whisper from cutting audio too short (this happened quite often in my experience) 83 | # If you trust whisper 100% (or have noisy audio) -> comment out all "optional" code below 84 | # ======= OPTIONAL CODE ======= 85 | window_size_sec = 0.05 86 | silence_threshold_db = -30 87 | window_size_samples = int(window_size_sec * sr) 88 | start_frame = int(end * sr) 89 | 90 | audio_windows = [amplitude_db[x:x + window_size_samples] for x in 91 | range(start_frame, len(amplitude_db), window_size_samples)] 92 | 93 | audio_windows = [np.max(x) for x in audio_windows] # peak dB of each 50 ms window after the proposed cut point 94 | 95 | for window in audio_windows: 96 | if window > silence_threshold_db: 97 | end += window_size_sec 98 | else: 99 | break 100 | # ======= END OPTIONAL CODE ======= 101 | 102 | sf.write(audio_path, audio[:int(end * sr)], int(sr)) 103 | return " ".join([seg['text'].strip() for seg in result_sample['segments']]) 104 | 105 |
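# For intuition, the aligned result consumed by cut_sample_to_speech_only() above has
# roughly the shape sketched below (the keys mirror how the code accesses the result;
# the values are illustrative only, not real model output):
#
#   result_sample = {
#       "segments": [
#           {"text": "przykladowe zdanie",
#            "chars": [{"char": "p", "start": 0.10, "end": 0.16},
#                      {"char": " ", "start": 0.52},                 # no 'end' -> filtered out
#                      {"char": "e", "start": 1.20, "end": 2.40}]},  # lasts >= 0.5 s -> skipped
#       ],
#   }
#
# Walking the last segment's chars backwards and keeping the first non-whitespace
# character shorter than long_char_threshold puts the cut point at end=0.16 in this
# sketch, instead of trusting the suspiciously long trailing character ending at 2.40.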
106 | def cut_and_save_audio(input_audio_path, segments, target_sampling_rate=22050): 107 | """ 108 | Load audio from input_audio_path, cut segments based on the provided start and end times, 109 | and save them into output_audio_segments/{run_name}/audio at the specified target_sampling_rate. 110 | 111 | Args: 112 | input_audio_path (str): Path to the input audio file. 113 | segments (list): List of dicts with 'start' and 'end' times (in seconds) of the audio segments to cut. 114 | target_sampling_rate (int, optional): Sampling rate of the output audio segments. Default is 22050. 115 | """ 116 | # Create the output directory if it doesn't exist 117 | output_dir_name = os.path.join('output_audio_segments', run_name, 'audio') 118 | os.makedirs(output_dir_name, exist_ok=True) 119 | 120 | # Load the input audio (librosa resamples it to the target rate on load) 121 | audio, sampling_rate = librosa.load(input_audio_path, sr=target_sampling_rate) 122 | 123 | # Cut and save audio segments 124 | outputs = [] 125 | output_prefix = os.path.splitext(os.path.basename(input_audio_path))[0] 126 | for idx, segment in tqdm(enumerate(segments), desc=input_audio_path, total=len(segments), leave=False): 127 | start_sample = int(segment['start'] * sampling_rate) 128 | end_sample = int(segment['end'] * sampling_rate) 129 | audio_segment = audio[start_sample:end_sample] 130 | output_path = os.path.join(output_dir_name, f"{output_prefix}_{idx + 1}.wav") 131 | sf.write(output_path, audio_segment, target_sampling_rate) 132 | # We run whisperx again on each cut sample to trim trailing non-speech audio 133 | segment_text = cut_sample_to_speech_only(output_path) 134 | if segment_text: 135 | outputs.append(Segment(text=segment_text, 136 | filepath=output_path, 137 | duration=librosa.get_duration(y=audio_segment, sr=sampling_rate))) 138 | else: 139 | try: 140 | os.remove(output_path) 141 | except FileNotFoundError: 142 | print(f"File '{output_path}' not found.") 143 | except PermissionError: 144 | print(f"You do not have permission to delete '{output_path}'.") 145 | except Exception as e: 146 | print(f"An error occurred: {e}") 147 | 148 | return outputs 149 | 150 | 151 | def create_segments_for_files(files_to_segment: List[str]): 152 | 153 | for audio_file in tqdm(sorted(files_to_segment, key=lambda x: os.path.getsize(x), reverse=True), 154 | desc="Transcribing and segmenting files"): 155 | if audio_file in files_done: 156 | continue 157 | 158 | audio = whisperx.load_audio(audio_file) 159 | result = model.transcribe(audio, batch_size=batch_size, chunk_size=30) 160 | 161 | # Align the whisper output 162 | result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=True) 163 | 164 | # Cut the file into one WAV per aligned segment and transcribe each piece 165 | target_sampling_rate = 22050 166 | 167 | segment_objects = cut_and_save_audio(audio_file, result["segments"], target_sampling_rate) 168 | 169 | for segment in segment_objects: 170 | segment.filepath = os.path.join(*segment.filepath.split(os.sep)[-2:]) # keep only the trailing "audio/<file>.wav" part 171 | segment.filepath = segment.filepath.replace('\\', '/') 172 | 173 | segment_objects = sorted(segment_objects, key=lambda x: x.duration, reverse=True) 174 | 175 | split_id = int(0.95 * len(segment_objects)) 176 | train_segments = segment_objects[:split_id] 177 | validation_segments = segment_objects[split_id:] 178 | 179 | output_dir_name = os.path.join('output_audio_segments', run_name) 180 | with open(os.path.join(output_dir_name, 'train.txt'), 'a', encoding='utf-8') as f: 181 | for segment in train_segments: 182 | f.write(f"{segment.filepath}| {segment.text.strip()}\n") 183 | 184 | with open(os.path.join(output_dir_name, 'validation.txt'), 'a', encoding='utf-8') as f: 185 | for segment in validation_segments: 186 | f.write(f"{segment.filepath}| {segment.text.strip()}\n") 187 | 188 | with open(files_done_path, 'a', encoding='utf-8') as f: 189 | f.write(f"{audio_file}\n") 190 | 191 | 192 | if __name__ == '__main__': 193 | print(f"Starting run for: {run_name}") 194 | files_to_segment = glob.glob("full_audio_files/*.wav") 195 | 
create_segments_for_files(files_to_segment) 196 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MisterCapi/auto_dataset_tts/1d220419726473bbef44ea36f4f99abd1df12693/requirements.txt -------------------------------------------------------------------------------- /yt_links.txt: -------------------------------------------------------------------------------- 1 | # Example youtube links for yt downloader, you can comment out lines using '#' 2 | # https://www.youtube.com/watch?v=SzRLUaHG_wk 3 | # https://www.youtube.com/watch?v=hUw7AI_VUyE 4 | # https://www.youtube.com/watch?v=YTP-4f2FnVQ 5 | # https://www.youtube.com/watch?v=D58YUx-SpUM 6 | # https://www.youtube.com/watch?v=RoC925_jMYw 7 | # https://www.youtube.com/watch?v=Wi4fYGmzgAY 8 | # https://www.youtube.com/watch?v=pnSaBK639-8 9 | # https://www.youtube.com/watch?v=FQOatj4EPAk 10 | # https://www.youtube.com/watch?v=KUte-BZUW-4 11 | # https://www.youtube.com/watch?v=p274nNifPBs 12 | # https://www.youtube.com/watch?v=Gi2y2OygoaE 13 | # https://www.youtube.com/watch?v=1KiIgAlHI5o 14 | # https://www.youtube.com/watch?v=2tPdvtAW1DY 15 | # https://www.youtube.com/watch?v=eIvrNhrnVDI 16 | # https://www.youtube.com/watch?v=0J8SRHLchEw 17 | # https://www.youtube.com/watch?v=vY0C1_iphi8 18 | # https://www.youtube.com/watch?v=S_pkrU9Cc8s 19 | # https://www.youtube.com/watch?v=o5sNIkFgWbE 20 | # new 21 | https://www.youtube.com/watch?v=3ZEpHRVw34Q 22 | https://www.youtube.com/watch?v=o-mt3M847QI 23 | https://www.youtube.com/watch?v=KU7ThZauNsQ 24 | https://www.youtube.com/watch?v=9VSyRI5WdvY 25 | https://www.youtube.com/watch?v=2Esl5B1f6NM 26 | https://www.youtube.com/watch?v=iO36PWK-lnk 27 | https://www.youtube.com/watch?v=NSgY94czLPM 28 | https://www.youtube.com/watch?v=VLKCaTOzLzg 29 | https://www.youtube.com/watch?v=KOK5NSG2sao 30 | https://www.youtube.com/watch?v=Iq7YgV5jzHY 31 | https://www.youtube.com/watch?v=OMJLNEP286k 32 | https://www.youtube.com/watch?v=HZy51oklubM 33 | https://www.youtube.com/watch?v=fTGlzaNfyfg 34 | https://www.youtube.com/watch?v=Hc7QBD1H6pw 35 | https://www.youtube.com/watch?v=qA58lSo_LkA 36 | https://www.youtube.com/watch?v=v_HDxDF4vNM 37 | https://www.youtube.com/watch?v=d3iKibviy2g 38 | https://www.youtube.com/watch?v=wB0jjk_C9QY 39 | https://www.youtube.com/watch?v=QCTYEFHbTlM 40 | https://www.youtube.com/watch?v=-T-OKluEEgA 41 | https://www.youtube.com/watch?v=sNJiKXTYsUQ 42 | https://www.youtube.com/watch?v=LDvB-cH1FeI 43 | https://www.youtube.com/watch?v=T7z0vhuTeRw 44 | https://www.youtube.com/watch?v=WHF9uXwdbyg 45 | https://www.youtube.com/watch?v=qE1vN0e2QpA 46 | https://www.youtube.com/watch?v=2MHe_I5wntc --------------------------------------------------------------------------------