├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── slice.py
└── transcribe.py


/.gitignore:
--------------------------------------------------------------------------------
1 | text.list
2 | text.list.bak
3 | raw/
4 | venv/
5 | inputs/


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2023 litagin02
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
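 1 | # A tool for building TTS datasets
 2 | 
 3 | Starting from a set of audio files, this tool runs two steps:
 4 | 
 5 | 1. [slice.py](slice.py): splits the audio into utterances that fit within 2 to 12 seconds (using [Silero VAD](https://github.com/snakers4/silero-vad))
 6 | 2. [transcribe.py](transcribe.py): transcribes the sliced files and saves the text (using [Faster Whisper](https://github.com/SYSTRAN/faster-whisper))
 7 | 
 8 | Run the two scripts in that order: `transcribe.py` reads the files that `slice.py` writes.
 9 | 
10 | It was made for use with [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2/).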
11 | 
12 | ## Installation
13 | 
14 | **ffmpeg must be installed separately.** If you get an error such as "Couldn't find ffmpeg", search for something like "Windows ffmpeg install" and install it yourself.
15 | 
16 | ```
17 | git clone https://github.com/litagin02/slice-and-transcribe.git
18 | cd slice-and-transcribe
19 | python -m venv venv
20 | venv\Scripts\activate
21 | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
22 | pip install -r requirements.txt
23 | ```
24 | 
25 | ## Slicing audio
26 | 
27 | Create an `inputs` folder and put your wav files in it.
28 | 
29 | ```bash
30 | python slice.py
31 | ```
32 | 
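33 | The sliced results are saved in the `raw` directory. If `raw` already exists, it is deleted first.
34 | 
35 | By default, only utterances between 2 and 12 seconds long are saved.
36 | 
37 | Parameters:
38 | 
39 | - `--max_sec`, `-M`: maximum length in seconds; default 12
40 | - `--min_sec`, `-m`: minimum length in seconds; default 2
41 | - `--min_silence_dur_ms`, `-s`: minimum silence duration in milliseconds; default 700. A silence at least this long is treated as a break between utterances; shorter silences do not split the audio. Too small a value chops the audio into overly short pieces; too large a value makes each clip too long.
42 | 
43 | Example: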
44 | 
45 | ```bash
46 | python slice.py -M 15 -m 3 -s 1000
47 | ```
48 | 
49 | 
50 | ## Transcription
51 | 
52 | Transcribes the wav files in the `raw` directory and saves the text to `text.list`.
53 | 
54 | ```bash
55 | python transcribe.py speaker_name
56 | ```
57 | 
58 | Each line of the transcript is written as
59 | 
60 | ```
61 | Data/{speaker_name}/audios/wavs/{file_name}|{speaker_name}|JP|{text}
62 | ```
63 | 
64 | This format is ready to use with Bert-VITS2; edit [transcribe.py](transcribe.py) if you need something different.
65 | 
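
The line format described in the README can be read back with a few lines of Python. The following is a minimal sketch, not part of the repo: `Entry` and `parse_line` are hypothetical names, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One line of text.list (hypothetical helper type, not in the repo)."""

    wav_path: str
    speaker: str
    language: str
    text: str


def parse_line(line: str) -> Entry:
    # Split on the first three "|" only, so a "|" inside the transcript survives.
    wav_path, speaker, language, text = line.rstrip("\n").split("|", 3)
    return Entry(wav_path, speaker, language, text)


# Made-up sample line in the format that transcribe.py writes.
sample = "Data/alice/audios/wavs/sample-0.wav|alice|JP|こんにちは。\n"
entry = parse_line(sample)
print(entry.speaker, entry.text)
```

Limiting the split to three separators is a defensive choice for transcripts that happen to contain `|`.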


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | faster-whisper
2 | pydub
3 | soundfile
4 | tqdm
5 | 


--------------------------------------------------------------------------------
/slice.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import os
  3 | import shutil
  4 | 
  5 | import torch
  6 | from pydub import AudioSegment
  7 | from tqdm import tqdm
  8 | 
  9 | vad_model, utils = torch.hub.load(
 10 |     repo_or_dir="snakers4/silero-vad",
 11 |     model="silero_vad",
 12 |     onnx=True,
 13 | )
 14 | 
 15 | (get_speech_timestamps, _, read_audio, *_) = utils
 16 | 
 17 | 
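 18 | def get_stamps(audio_file, min_silence_dur_ms=700, min_sec=2):
 19 |     """
 20 |     min_silence_dur_ms:
 21 |         Silences at least this many milliseconds long are treated as breaks.
 22 |         Shorter silences do not split the audio.
 23 |         Too small a value chops the audio into overly short pieces;
 24 |         too large a value makes each clip too long.
 25 |         Probably needs tuning per dataset.
 26 |     min_sec:
 27 |         Utterances shorter than this many seconds are ignored. For TTS, dropping anything under 2 seconds may be a good idea.
 28 |     """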
 29 | 
 30 |     sampling_rate = 16000  # only 16 kHz or 8 kHz are supported
 31 | 
 32 |     wav = read_audio(audio_file, sampling_rate=sampling_rate)
 33 |     speech_timestamps = get_speech_timestamps(
 34 |         wav,
 35 |         vad_model,
 36 |         sampling_rate=sampling_rate,
 37 |         min_silence_duration_ms=min_silence_dur_ms,
 38 |         min_speech_duration_ms=min_sec * 1000,
 39 |     )
 40 | 
 41 |     return speech_timestamps
 42 | 
 43 | 
 44 | def split_wav(
 45 |     audio_file, target_dir="raw", max_sec=12, min_silence_dur_ms=700, min_sec=2
 46 | ):
 47 |     margin = 200  # padding in milliseconds added before and after each utterance
 48 |     upper_bound_ms = max_sec * 1000  # segments longer than this are skipped
 49 | 
 50 |     speech_timestamps = get_stamps(
 51 |         audio_file, min_silence_dur_ms=min_silence_dur_ms, min_sec=min_sec
 52 |     )
 53 | 
 54 |     # Load the WAV file
 55 |     audio = AudioSegment.from_wav(audio_file)
 56 | 
 57 |     # Resample to 44100 Hz
 58 |     audio = audio.set_frame_rate(44100)
 59 | 
 60 |     # Convert stereo to mono
 61 |     audio = audio.set_channels(1)
 62 | 
 63 |     total_ms = len(audio)
 64 | 
 65 |     file_name = os.path.splitext(os.path.basename(audio_file))[0]
 66 |     os.makedirs(target_dir, exist_ok=True)
 67 | 
 68 |     total_time_ms = 0
 69 | 
 70 |     # Split at each timestamp and save the segments to files
 71 |     for i, ts in enumerate(speech_timestamps):
 72 |         start_ms = max(ts["start"] / 16 - margin, 0)  # 16 samples per ms at 16 kHz
 73 |         end_ms = min(ts["end"] / 16 + margin, total_ms)
 74 |         if end_ms - start_ms > upper_bound_ms:
 75 |             continue
 76 |         segment = audio[start_ms:end_ms]
 77 |         segment.export(os.path.join(target_dir, f"{file_name}-{i}.wav"), format="wav")
 78 |         total_time_ms += end_ms - start_ms
 79 | 
 80 |     return total_time_ms / 1000
 81 | 
 82 | 
 83 | if __name__ == "__main__":
 84 |     parser = argparse.ArgumentParser()
 85 |     parser.add_argument("--max_sec", "-M", type=int, default=12)
 86 |     parser.add_argument("--min_sec", "-m", type=int, default=2)
 87 |     parser.add_argument("--min_silence_dur_ms", "-s", type=int, default=700)
 88 |     args = parser.parse_args()
 89 | 
 90 |     input_dir = "inputs"
 91 |     target_dir = "raw"
 92 |     min_sec = args.min_sec
 93 |     max_sec = args.max_sec
 94 |     min_silence_dur_ms = args.min_silence_dur_ms
 95 | 
 96 |     wav_files = [
 97 |         os.path.join(input_dir, f)
 98 |         for f in os.listdir(input_dir)
 99 |         if f.lower().endswith(".wav")
100 |     ]
101 |     if os.path.exists(target_dir):  # remove the existing output directory
102 |         print(f"The {target_dir} folder already exists; deleting it.")
103 |         shutil.rmtree(target_dir)
104 | 
105 |     total_sec = 0
106 |     for wav_file in tqdm(wav_files):
107 |         time_sec = split_wav(
108 |             wav_file,
109 |             target_dir,
110 |             max_sec=max_sec,
111 |             min_sec=min_sec,
112 |             min_silence_dur_ms=min_silence_dur_ms,
113 |         )
114 |         total_sec += time_sec
115 | 
116 |     print(f"Done! Total time: {total_sec / 60:.2f} min.")
117 | 
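
The arithmetic in the `split_wav` loop can be tried without any audio files. The sketch below is an illustration only: `to_ms_ranges` is a hypothetical helper that is not in the repo, and the timestamps are made up. It assumes, as `split_wav` does, that the VAD timestamps are sample indices at 16 kHz.

```python
def to_ms_ranges(timestamps, total_ms, margin=200, max_sec=12, sampling_rate=16000):
    """Convert VAD sample-index timestamps into padded millisecond ranges."""
    samples_per_ms = sampling_rate / 1000  # 16 at 16 kHz, the constant used in split_wav
    ranges = []
    for ts in timestamps:
        # Pad each side by `margin` ms, clamped to the bounds of the file.
        start_ms = max(ts["start"] / samples_per_ms - margin, 0)
        end_ms = min(ts["end"] / samples_per_ms + margin, total_ms)
        if end_ms - start_ms > max_sec * 1000:
            continue  # too long once padded, so it is skipped as in split_wav
        ranges.append((start_ms, end_ms))
    return ranges


# Made-up timestamps for a 25-second file.
stamps = [
    {"start": 0, "end": 48000},       # 0 s to 3 s
    {"start": 80000, "end": 320000},  # 5 s to 20 s, longer than max_sec
]
print(to_ms_ranges(stamps, total_ms=25000))
```

Only the first utterance survives here: the second is 15 seconds long before padding, which exceeds the 12-second upper bound. Note that the bound is checked after padding, so a segment slightly under `max_sec` can still be dropped.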


--------------------------------------------------------------------------------
/transcribe.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import sys
 3 | 
 4 | from faster_whisper import WhisperModel
 5 | from tqdm import tqdm
 6 | 
 7 | model = WhisperModel("large-v3", device="cuda", compute_type="float16")
 8 | 
 9 | 
10 | def transcribe(wav_path, initial_prompt=None):
11 |     segments, _ = model.transcribe(
12 |         wav_path, beam_size=5, language="ja", initial_prompt=initial_prompt
13 |     )
14 |     texts = [segment.text for segment in segments]
15 |     return "".join(texts)
16 | 
17 | 
18 | if __name__ == "__main__":
19 |     if len(sys.argv) != 2:
20 |         print(f"Usage: python {sys.argv[0]} <speaker_name>")
21 |         sys.exit(1)
22 |     speaker_name = sys.argv[1]
23 | 
24 |     wav_dir = "raw"
25 |     output_file = "text.list"
26 |     initial_prompt = "こんにちは。元気、ですかー？私は……ちゃんと元気だよ！"
27 | 
28 |     wav_files = [
29 |         os.path.join(wav_dir, f) for f in os.listdir(wav_dir) if f.endswith(".wav")
30 |     ]
31 |     if os.path.exists(output_file):
32 |         print(f"{output_file} already exists; backing it up to {output_file}.bak.")
33 |         if os.path.exists(output_file + ".bak"):
34 |             print(f"{output_file}.bak also exists; deleting it.")
35 |             os.remove(output_file + ".bak")
36 |         os.rename(output_file, output_file + ".bak")
37 | 
38 |     with open(output_file, "w", encoding="utf-8") as f:
39 |         for wav_file in tqdm(wav_files):
40 |             file_name = os.path.basename(wav_file)
41 |             text = transcribe(wav_file, initial_prompt=initial_prompt)
42 |             f.write(
43 |                 f"Data/{speaker_name}/audios/wavs/{file_name}|{speaker_name}|JP|{text}\n"
44 |             )
45 | 
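
The backup step in `transcribe.py` (delete the old `.bak`, then rename) could also be written with `os.replace`, which overwrites an existing destination. This is a sketch of an alternative, not the repo's code: `backup_if_exists` is a hypothetical helper, and the demo runs in a temporary directory.

```python
import os
import tempfile


def backup_if_exists(path):
    """Move `path` to `path + ".bak"`, replacing any older backup.

    Returns True if a backup was made, False if `path` did not exist.
    """
    if not os.path.exists(path):
        return False
    # os.replace overwrites an existing destination, so no separate os.remove is needed.
    os.replace(path, path + ".bak")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "text.list")
    for content in ("first run", "second run"):
        backup_if_exists(target)
        with open(target, "w", encoding="utf-8") as f:
            f.write(content)
    # The backup now holds what the first run wrote.
    with open(target + ".bak", encoding="utf-8") as f:
        print(f.read())
```

As in the original script, only one generation of backup is kept.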


--------------------------------------------------------------------------------