├── .gitignore ├── LICENSE ├── README.md ├── acoustic_forced_alignment ├── .gitignore ├── README.md ├── align_tg_words.py ├── assets │ ├── 2001000001.lab │ └── 2001000001.wav ├── build_dataset.py ├── check_tg.py ├── combine_tg.py ├── dictionaries │ └── opencpop-extension.txt ├── distribution.py ├── enhance_tg.py ├── reformat_wavs.py ├── requirements.txt ├── select_test_set.py ├── slice_tg.py ├── summary_pitch.py ├── validate_labels.py └── validate_lengths.py ├── midi-recognition ├── README.md ├── extract_midi.py └── merge_wavs.py └── variance-temp-solution ├── .gitignore ├── README.md ├── add_ph_num.py ├── add_ph_num_advanced.py ├── assets └── .gitkeep ├── convert_ds.py ├── convert_txt.py ├── correct_cents.py ├── eliminate_short.py ├── estimate_midi.py ├── get_pitch.py ├── requirements.txt └── rmvpe ├── __init__.py ├── constants.py ├── deepunet.py ├── inference.py ├── model.py ├── seq.py ├── spec.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | .vscode 3 | *.pyc 4 | __pycache__/ 5 | local_tools/ 6 | /venv/ 7 | 8 | .ipynb_checkpoints/ 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2023, Team OpenVPI 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | 2. Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | 3. Neither the name of the copyright holder nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MakeDiffSinger 2 | Pipelines and tools to build your own DiffSinger dataset. 3 | 4 | For the recommended standard dataset making pipelines, see: 5 | 6 | - acoustic-forced-alignment: make dataset from scratch with MFA for acoustic model training 7 | - variance-temp-solution: temporary solution to extend acoustic datasets into variance datasets 8 | 9 | For other useful pipelines and tools for making a dataset, welcome to raise issues or submit PRs. 
10 | 11 | ## DiffSinger dataset structure 12 | 13 | - dataset1/ 14 | - raw/ 15 | - wavs/ 16 | - recording1.wav 17 | - recording2.wav 18 | - ... 19 | - transcriptions.csv 20 | - dataset2/ 21 | - raw/ 22 | - wavs/ 23 | - ... 24 | - transcriptions.csv 25 | - ... 26 | 27 | ## Essential tools to process and label your datasets 28 | 29 | Dataset tools now have their own repository: [dataset-tools](https://github.com/openvpi/dataset-tools). 30 | 31 | There are mainly 3 components: 32 | 33 | - AudioSlicer: Slice your recordings into short segments 34 | - MinLabel: Label *.lab files containing word transcriptions for acoustic model training. 35 | - SlurCutter: Edit MIDI sequence in *.ds files for variance model training. -------------------------------------------------------------------------------- /acoustic_forced_alignment/.gitignore: -------------------------------------------------------------------------------- 1 | assets/mfa-*/ 2 | assets/*.zip 3 | segments/ 4 | textgrids/ 5 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/README.md: -------------------------------------------------------------------------------- 1 | # Making Datasets from Scratch (Forced Alignment) 2 | 3 | This pipeline will guide you to build your dataset from raw recordings with MFA (Montreal Forced Aligner). 4 | 5 | ## 0. Requirements 6 | 7 | This pipeline will require your dictionary having its corresponding MFA pretrained model. You can see currently supported dictionaries and download their MFA models in the table below: 8 | 9 | | dictionary name | dictionary file | MFA model | 10 | |:------------------:|:----------------------:|:--------------------------------------------------------------------------------------------:| 11 | | Opencpop extension | opencpop-extension.txt | [link](https://huggingface.co/datasets/fox7005/tool/resolve/main/mfa-opencpop-extension.zip) | 12 | 13 | Your recordings must meet the following conditions: 14 | 15 | 1. They must be in one single folder. Files in sub-folders will be ignored. 16 | 2. They must be in WAV format. 17 | 3. They must have a sampling rate higher than 32 kHz. 18 | 4. They should be clean, unaccompanied voices with no significant noise or reverb. 19 | 5. They should contain only voices from one single human. 20 | 21 | **NOTICE:** Before you train a model, you must obtain permission from the copyright holder of the dataset and make sure the provider is fully aware that you will train a model from their data, that you will or will not distribute the synthesized voices and model weights, and the potential risks of this kind of activity. 22 | 23 | ## 1. Clone repo and install dependencies 24 | 25 | ```bash 26 | git clone https://github.com/openvpi/MakeDiffSinger.git 27 | cd MakeDiffSinger/acoustic-forced-alignment 28 | conda create -n mfa python=3.8 --yes # you must use a Conda environment! 29 | conda activate mfa 30 | conda install -c conda-forge montreal-forced-aligner==2.0.6 --yes # install MFA 31 | pip install -r requirements.txt # install other requirements 32 | ``` 33 | 34 | ## 2. Prepare recordings and transcriptions 35 | 36 | ### 2.1 Audio slicing 37 | 38 | The raw data must be sliced into segments of about 5-15 seconds. We recommend using [AudioSlicer](../README.md#essential-tools-to-process-and-label-your-datasets), a simple GUI application that can automatically slice audio files via silence detection. 
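If you prefer to script this step instead of using the GUI, the basic idea of silence-detection slicing can be sketched in a few lines. This is only an illustration (it is not the AudioSlicer implementation), and its output should still be checked with `validate_lengths.py` below:

```python
# A rough silence-detection slicer (illustration only, not the AudioSlicer implementation).
# Non-silent intervals are merged greedily until a segment would exceed max_len seconds.
import pathlib

import librosa
import soundfile


def slice_by_silence(wav_path, out_dir, top_db=40, max_len=15.0):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    intervals = librosa.effects.split(y, top_db=top_db)  # (start, end) sample indices of non-silence
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = pathlib.Path(wav_path).stem
    seg_start = seg_end = None
    idx = 0
    for start, end in intervals:
        if seg_start is not None and (end - seg_start) / sr > max_len:
            # Flush the accumulated segment before it grows past max_len
            soundfile.write(out_dir / f'{stem}_{idx:03d}.wav', y[seg_start:seg_end], sr)
            idx += 1
            seg_start = None
        if seg_start is None:
            seg_start = start
        seg_end = end
    if seg_start is not None:
        soundfile.write(out_dir / f'{stem}_{idx:03d}.wav', y[seg_start:seg_end], sr)
```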
39 | 40 | Run the following command to validate your segment lengths and count the total length of your sliced segments: 41 | 42 | ```bash 43 | python validate_lengths.py --dir path/to/your/segments/ 44 | ``` 45 | 46 | ### 2.2 Label the segments 47 | 48 | All segments should have their transcriptions (or lyrics) annotated. See [assets/2001000001.wav](assets/2001000001.wav) and its corresponding label [assets/2001000001.lab](assets/2001000001.lab) as an example. 49 | 50 | Each segment should have one annotation file with the same filename as it and `.lab` extension, and placed in the same directory. In the annotation file, you should write all syllables sung or spoken in this segment. Syllables should be split by space, and only syllables that appears in the dictionary are allowed. In addition, all phonemes in the dictionary should be covered in the annotations. Please note that `SP`, `AP` and `` should not be included in the labels although they are in your final phoneme set. 51 | 52 | We developed [MinLabel](../README.md#essential-tools-to-process-and-label-your-datasets), a simple yet efficient tool to help finishing this step. 53 | 54 | Once you finish labeling, run the following command to validate your labels: 55 | 56 | ```bash 57 | python validate_labels.py --dir path/to/your/segments/ --dictionary path/to/your/dictionary.txt 58 | ``` 59 | 60 | This will ensure: 61 | 62 | - All recordings have their corresponding labels. 63 | - There are no unrecognizable phonemes that does not appear in the dictionary. 64 | - All phonemes in the dictionary are covered by the labels. 65 | 66 | If there are failed checks, please fix them and run again. 67 | 68 | A summary of your phoneme coverage will be generated. If there are some phonemes that have extremely few occurrences (for example, less than 20), it is highly recommended to add more recordings to cover these phonemes. 69 | 70 | ## 3. Forced Alignment 71 | 72 | ### 3.1 Reformat recordings 73 | 74 | Given the transcriptions of each segment, we are able to align the phoneme sequence to its corresponding audio, thus obtaining position and duration information of each phoneme. 75 | 76 | We use [Montreal Forced Aligner](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to do forced phoneme alignment. 77 | 78 | MFA fails on some platforms if the WAVs are not in 16kHz 16bit PCM format. The following command will reformat your recordings and copy the labels to another temporary directory. You may delete those temporary files afterwards. 79 | 80 | ```bash 81 | python reformat_wavs.py --src path/to/your/segments/ --dst path/to/tmp/dir/ 82 | ``` 83 | 84 | NOTE: `--normalize` can be added to normalize the audio files with respect to the peak value of the whole segments. This is especially helpful on aspiration detection during TextGrid enhancement if the original segments are too quite. 85 | 86 | ### 3.2 Run MFA on the corpus 87 | 88 | MFA will align your labels to your recordings and save the results to TextGrid files. 
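Once the alignment below has finished, a single result can be spot-checked with the `textgrid` package that is already listed in requirements.txt. A minimal sketch (the filename is just an example):

```python
# Print the word and phone intervals of one aligned TextGrid.
import textgrid

tg = textgrid.TextGrid()
tg.read('path/to/your/textgrids/2001000001.TextGrid')  # example path
words, phones = tg[0], tg[1]  # MFA writes a words tier and a phones tier
for word in words:
    print(f'word  {word.minTime:8.3f} {word.maxTime:8.3f}  {word.mark}')
for phone in phones:
    print(f'phone {phone.minTime:8.3f} {phone.maxTime:8.3f}  {phone.mark}')
```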
89 | 90 | Download the MFA model and run the following command: 91 | 92 | ```bash 93 | mfa align path/to/your/segments/ path/to/your/dictionary.txt path/to/your/model.zip path/to/your/textgrids/ --beam 100 --clean --overwrite 94 | ``` 95 | 96 | Run the following command to check if all TextGrids are successfully generated: 97 | 98 | ```bash 99 | python check_tg.py --wavs path/to/your/segments/ --tg path/to/your/textgrids/ 100 | ``` 101 | 102 | If the checks above fails, or the results are not good, please try another `--beam` value and run the MFA again. TextGrids generated by MFA are still raw and need further processing, so please do not edit them at this time. 103 | 104 | ### 3.3 Enhance and finish the TextGrids 105 | 106 | MFA results might not be good on some long utterances. In this section, we: 107 | 108 | - Try to reduce errors for long utterances 109 | - Detect `AP`s and add `SP`s which have not been labeled before. 110 | 111 | Run: 112 | 113 | ```bash 114 | python enhance_tg.py --wavs path/to/your/segments/ --dictionary path/to/your/dictionary.txt --src path/to/raw/textgrids/ --dst path/to/final/textgrids/ 115 | ``` 116 | 117 | NOTE: There are other useful arguments of this script. If you understand them, you can try to get better results through adjusting those parameters. 118 | 119 | The final TextGrids can be saved for future use. 120 | 121 | If you are interested in the word-level pitch distribution of your dataset, run the following command: 122 | 123 | ```bash 124 | python summary_pitch.py --wavs path/to/your/segments/ --tg path/to/final/textgrids/ 125 | ``` 126 | 127 | ### 3.4 (Optional) Manual TextGrids refinement 128 | 129 | With steps above, the TextGrids we get contains 2 tiers: the words and the phones. Manual refinement to your TextGrids may take lots of effort but will boost the performance and stability of your model. 130 | 131 | This section is a recommended (but not required) way to refine your TextGrids manually. Before you start, an additional dependency to achieve natural sorting needs to be installed: 132 | 133 | ```bash 134 | pip install natsort 135 | ``` 136 | 137 | #### 3.4.1 Combine the recordings and TextGrids 138 | 139 | A full dataset can contain hundreds or thousands of auto-sliced recording segments and their corresponding TextGrids. The following command will combine them into long ones: 140 | 141 | ```bash 142 | python combine_tg.py --wavs path/to/your/segments/ --tg path/to/your/final/textgrids/ --out path/to/your/combined/textgrids/ 143 | ``` 144 | 145 | This will combine all items with same name except their suffixes and add a `sentences` tier in the combined TextGrids. The new sentences tier controls how the long combined recordings are split into short sentences. If you have other suffix pattern (default: `"_\d+"`) or want to change the bit-depth (default: PCM_16) of the combined recordings, see `python combine_tg.py --help`. 146 | 147 | #### 3.4.2 Manual editing 148 | 149 | TextGrids can be viewed and edited with [Praat](https://github.com/praat/praat) or [vLabeler](https://github.com/sdercolin/vlabeler) (recommended). 150 | 151 | The editing mainly involves the sentences tier and the phones tier. When editing, please ensure the sentences tier is aligned with the words and phones tier; but it is not required to align the words tier to the phones tier. If you want to remove a sentence or not to include one area in any sentences, just leave an empty mark on that area. 
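Before moving on to re-slicing, a small script like the one below can catch sentence boundaries that drifted off the phone grid during editing. It is only an informal sanity check and not part of this pipeline; the path is a placeholder:

```python
# Informal check: every non-empty sentence boundary should coincide with some phone boundary.
import pathlib

import textgrid

TOLERANCE = 1e-4  # seconds

for tg_file in pathlib.Path('path/to/your/combined/textgrids/').glob('*.TextGrid'):
    tg = textgrid.TextGrid()
    tg.read(tg_file)
    sentences, phones = tg[0], tg[2]  # tiers: sentences, words, phones
    bounds = [p.minTime for p in phones] + [p.maxTime for p in phones]
    for sentence in sentences:
        if sentence.mark == '':
            continue
        for t in (sentence.minTime, sentence.maxTime):
            if not any(abs(t - b) <= TOLERANCE for b in bounds):
                print(f'{tg_file.name}: boundary {t:.3f} of sentence "{sentence.mark}" is off the phone grid')
```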
152 | 153 | #### 3.4.3 Slice the recordings and TextGrids 154 | 155 | After manual editing is finished, the words tier can be automatically re-aligned to the phones tier. Run: 156 | 157 | ```bash 158 | python align_tg_words.py --tg path/to/your/combined/textgrids --dictionary path/to/your/dictionary.txt --overwrite 159 | ``` 160 | 161 | NOTE 1: This will overwrite your TextGrid files. You can back them up before running the command, or specify another output directory with `--out` option. 162 | 163 | NOTE 2: This script is also compatible with segmented 2-tier TextGrids. 164 | 165 | Then the TextGrids and recordings can be sliced according to the boundaries stored in the sentences tiers. Run: 166 | 167 | ```bash 168 | python slice_tg.py --wavs path/to/your/combined/textgrids/ --out path/to/your/sliced/textgrids/refined/ 169 | ``` 170 | 171 | By default, the output segments will be re-numbered like `item_000`, `item_001`, ..., `item_XXX`. If you want to use the marks stored in the sentences tier as the filenames, or want to change the bit-depth (default: PCM_16) of the sliced recordings, or control other behaviors, see `python slice_tg.py --help`. 172 | 173 | Now you can use these manually refined and re-sliced TextGrids and recordings for further steps. 174 | 175 | ## 4. Build the final dataset 176 | 177 | The TextGrids need to be collected into a transcriptions.csv file as the final transcriptions. The CSV file will include the following columns: 178 | 179 | - name: the segment name 180 | - ph_seq: the phoneme sequence 181 | - ph_dur: the phoneme duration 182 | 183 | The recordings will be arranged like [this](../README.md#diffsinger-dataset-structure). 184 | 185 | Run: 186 | 187 | ```bash 188 | python build_dataset.py --wavs path/to/your/segments/ --tg path/to/final/textgrids/ --dataset path/to/your/dataset/ 189 | ``` 190 | 191 | NOTE 1: This will insert random silence parts around each segments by default for better `SP` stability. If you do not need these silence parts, for example, if your TextGrids have been manually refined, please use the `--skip_silence_insertion` option. 192 | 193 | NOTE 2: `--wav_subtype` can be used to specify the bit-depth of the saved WAV files. Options are `PCM_16` (default), `PCM_24`, `PCM_32`, `FLOAT`, and `DOUBLE`. 194 | 195 | After doing all things above, you should put it into data/ of the DiffSinger main repository. Now, your dataset can be used to train DiffSinger acoustic models. If you want to train DiffSinger variance models, please follow instructions [here](../variance-temp-solution/README.md). 196 | 197 | ## 5. Write configuration file 198 | 199 | Copy the template configration file from `configs/templates` in the DiffSinger repository to your data folder, or a new folder if working with multi-speaker model. Specify required fields in the configurations, check `DiffSinger/docs/ConfigurationSchemas.md` for help on the meanings of those fields. 200 | 201 | For automatic validation set selection, you can leave the following field as empty. If the field is not empty, the script will prompt a overwrite confirmation later. 202 | ```yaml 203 | ... 204 | test_prefixes: 205 | ... 206 | ``` 207 | 208 | And run: 209 | ```bash 210 | python select_test_set.py path/to/your/config.yaml [--rel_path ] 211 | ``` 212 | 213 | NOTE 1: `--rel_path` is probably necessary if there are relative paths in your config file. If only absolute paths exist in it, you can omit this argument. 214 | 215 | NOTE 2: There are other useful arguments of this script. 
You can use them to change the total number of validation samples. 216 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/align_tg_words.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import textgrid 5 | import tqdm 6 | 7 | 8 | @click.command(help='Align words tiers in TextGrids to phones tiers') 9 | @click.option('--tg', required=True, help='Path to TextGrids (2-tier or 3-tier format)') 10 | @click.option('--dictionary', required=True, help='Path to the dictionary file') 11 | @click.option( 12 | '--out', required=False, 13 | help='Path to save the aligned TextGrids (defaults to the input directory)' 14 | ) 15 | @click.option('--overwrite', is_flag=True, help='Overwrite existing files') 16 | def align_tg_words(tg, dictionary, out, overwrite): 17 | tg_path_in = pathlib.Path(tg) 18 | dict_path = pathlib.Path(dictionary) 19 | tg_path_out = pathlib.Path(out) if out is not None else tg_path_in 20 | tg_path_out.mkdir(parents=True, exist_ok=True) 21 | 22 | with open(dict_path, 'r', encoding='utf8') as f: 23 | rules = [ln.strip().split('\t') for ln in f.readlines()] 24 | dictionary = { 25 | 'SP': ['SP'], 26 | 'AP': ['AP'] 27 | } 28 | phoneme_set = {'SP', 'AP'} 29 | for r in rules: 30 | phonemes = r[1].split() 31 | dictionary[r[0]] = phonemes 32 | phoneme_set.update(phonemes) 33 | 34 | for tgfile in tqdm.tqdm(tg_path_in.glob('*.TextGrid')): 35 | tg = textgrid.TextGrid() 36 | tg.read(tgfile) 37 | old_words_tier: textgrid.IntervalTier = tg[-2] 38 | if old_words_tier.name != 'words': 39 | raise ValueError( 40 | f'Invalid tier name or order in \'{tgfile}\'. ' 41 | f'The words tier should be the 1st tier of a 2-tier TextGrid, ' 42 | f'or the 2nd tier of a 3-tier TextGrid.' 43 | ) 44 | phones_tier: textgrid.IntervalTier = tg[-1] 45 | new_words_tier = textgrid.IntervalTier(name='words') 46 | word_seq = [i.mark for i in old_words_tier] 47 | word_div = [] 48 | ph_seq = [i.mark for i in phones_tier] 49 | ph_dur = [i.duration() for i in phones_tier] 50 | idx = 0 51 | for i, word in enumerate(word_seq): 52 | if word not in dictionary: 53 | raise ValueError(f'Error invalid word in \'{tgfile}\' at {i}: {word}') 54 | word_ph_seq = dictionary[word] 55 | ph_num = len(word_ph_seq) 56 | word_div.append(ph_num) 57 | if word_ph_seq != ph_seq[idx: idx + ph_num]: 58 | print( 59 | f'Warning: word and phones mismatch in \'{tgfile}\' ' 60 | f'at word {i}, phone {idx}: {word} => {ph_seq[idx: idx + len(word_ph_seq)]}' 61 | ) 62 | idx += ph_num 63 | for i, phone in enumerate(ph_seq): 64 | if phone not in phoneme_set: 65 | raise ValueError(f'Error: invalid phone in \'{tgfile}\' at {i}: {phone}') 66 | if sum(word_div) != len(ph_dur): 67 | raise ValueError( 68 | f'Error: word_div does not sum to number of phones in \'{tgfile}\'. ' 69 | f'Check the warnings above for more detailed mismatching positions.' 70 | ) 71 | start = 0. 
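        # Rebuild the words tier: give each word an interval spanning the summed durations of its phones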
72 | idx = 0 73 | for j in range(len(word_seq)): 74 | end = start + sum(ph_dur[idx: idx + word_div[j]]) 75 | new_words_tier.add(minTime=start, maxTime=end, mark=word_seq[j]) 76 | start = end 77 | idx += word_div[j] 78 | tg.tiers[-2] = new_words_tier 79 | tg_file_out = tg_path_out / tgfile.name 80 | if tg_file_out.exists() and not overwrite: 81 | raise FileExistsError(str(tg_file_out)) 82 | tg.write(tg_file_out) 83 | 84 | 85 | if __name__ == '__main__': 86 | align_tg_words() 87 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/assets/2001000001.lab: -------------------------------------------------------------------------------- 1 | gan shou ting zai wo fa duan de zhi jian -------------------------------------------------------------------------------- /acoustic_forced_alignment/assets/2001000001.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/openvpi/MakeDiffSinger/ca134d36dc8eec06002a72cd0a59257abcf7bb84/acoustic_forced_alignment/assets/2001000001.wav -------------------------------------------------------------------------------- /acoustic_forced_alignment/build_dataset.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import pathlib 3 | import random 4 | 5 | import click 6 | import librosa 7 | import numpy as np 8 | import soundfile 9 | import tqdm 10 | from textgrid import TextGrid 11 | 12 | 13 | @click.command(help='Collect phoneme alignments into transcriptions.csv') 14 | @click.option('--wavs', required=True, help='Path to the segments directory') 15 | @click.option('--tg', required=True, help='Path to the final TextGrids directory') 16 | @click.option('--dataset', required=True, help='Path to dataset directory') 17 | @click.option('--skip_silence_insertion', is_flag=True, show_default=True, 18 | help='Do not insert silence around segments') 19 | @click.option('--wav_subtype', default="PCM_16", show_default=True, 20 | help='WAV subtype') 21 | def build_dataset(wavs, tg, dataset, skip_silence_insertion, wav_subtype): 22 | wavs = pathlib.Path(wavs) 23 | tg_dir = pathlib.Path(tg) 24 | del tg 25 | dataset = pathlib.Path(dataset) 26 | filelist = list(wavs.glob('*.wav')) 27 | 28 | dataset.mkdir(parents=True, exist_ok=True) 29 | (dataset / 'wavs').mkdir(exist_ok=True) 30 | transcriptions = [] 31 | samplerate = 44100 32 | min_sil = int(0.1 * samplerate) 33 | max_sil = int(0.5 * samplerate) 34 | for wavfile in tqdm.tqdm(filelist): 35 | y, _ = librosa.load(wavfile, sr=samplerate, mono=True) 36 | tgfile = tg_dir / wavfile.with_suffix('.TextGrid').name 37 | tg = TextGrid() 38 | tg.read(str(tgfile)) 39 | ph_seq = [ph.mark for ph in tg[1]] 40 | ph_dur = [ph.maxTime - ph.minTime for ph in tg[1]] 41 | if not skip_silence_insertion: 42 | if random.random() < 0.5: 43 | len_sil = random.randrange(min_sil, max_sil) 44 | y = np.concatenate((np.zeros((len_sil,), dtype=np.float32), y)) 45 | if ph_seq[0] == 'SP': 46 | ph_dur[0] += len_sil / samplerate 47 | else: 48 | ph_seq.insert(0, 'SP') 49 | ph_dur.insert(0, len_sil / samplerate) 50 | if random.random() < 0.5: 51 | len_sil = random.randrange(min_sil, max_sil) 52 | y = np.concatenate((y, np.zeros((len_sil,), dtype=np.float32))) 53 | if ph_seq[-1] == 'SP': 54 | ph_dur[-1] += len_sil / samplerate 55 | else: 56 | ph_seq.append('SP') 57 | ph_dur.append(len_sil / samplerate) 58 | ph_seq = ' '.join(ph_seq) 59 | ph_dur = ' '.join([str(round(d, 6)) for d in ph_dur]) 60 | 
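        # Save the (possibly silence-padded) audio and record its transcription entry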
soundfile.write(dataset / 'wavs' / wavfile.name, y, samplerate, subtype=wav_subtype) 61 | transcriptions.append({'name': wavfile.stem, 'ph_seq': ph_seq, 'ph_dur': ph_dur}) 62 | 63 | with open(dataset / 'transcriptions.csv', 'w', encoding='utf8', newline='') as f: 64 | writer = csv.DictWriter(f, fieldnames=['name', 'ph_seq', 'ph_dur']) 65 | writer.writeheader() 66 | writer.writerows(transcriptions) 67 | 68 | print(f'All wavs and transcriptions saved in {dataset}') 69 | 70 | 71 | if __name__ == '__main__': 72 | build_dataset() 73 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/check_tg.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import tqdm 5 | 6 | 7 | @click.command('Check if all TextGrids are generated') 8 | @click.option('--wavs', required=True, help='Path to the segments directory') 9 | @click.option('--tg', required=True, help='Path to the TextGrids directory') 10 | def check_tg(wavs, tg): 11 | wavs = pathlib.Path(wavs) 12 | tg = pathlib.Path(tg) 13 | missing = [] 14 | filelist = list(wavs.glob('*.wav')) 15 | for wavfile in tqdm.tqdm(filelist): 16 | tgfile = tg / wavfile.with_suffix('.TextGrid').name 17 | if not tgfile.exists(): 18 | missing.append(tgfile) 19 | if len(missing) > 0: 20 | print( 21 | 'These TextGrids are missing! There are possible severe errors in labels of those corresponding segments. ' 22 | 'If you do believe there are no errors, consider increase the \'--beam\' argument for MFA.') 23 | for fn in missing: 24 | print(f' - {fn}') 25 | else: 26 | print('All alignments have been successfully generated.') 27 | 28 | 29 | if __name__ == '__main__': 30 | check_tg() 31 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/combine_tg.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | import re 3 | from typing import Dict, List 4 | 5 | import click 6 | import librosa 7 | import natsort 8 | import numpy 9 | import soundfile 10 | import textgrid 11 | import tqdm 12 | 13 | 14 | def remove_suffix(string, suffix_pattern): 15 | match = re.search(f'{suffix_pattern}$', string) 16 | if not match: 17 | return string 18 | return string[:-len(match.group())] 19 | 20 | 21 | @click.command(help='Combine segmented 2-tier TextGrids and wavs into 3-tier TextGrids and long wavs') 22 | @click.option( 23 | '--wavs', required=True, 24 | help='Directory containing the segmented wav files' 25 | ) 26 | @click.option( 27 | '--tg', required=False, 28 | help='Directory containing the segmented TextGrid files (defaults to wav directory)' 29 | ) 30 | @click.option( 31 | '--out', required=True, 32 | help='Path to output directory for combined files' 33 | ) 34 | @click.option( 35 | '--suffix', required=False, default=r'_\d+', 36 | help='Filename suffix pattern for file combination' 37 | ) 38 | @click.option( 39 | '--wav_subtype', required=False, default='PCM_16', 40 | help='Wav subtype (defaults to PCM_16)' 41 | ) 42 | @click.option( 43 | '--overwrite', is_flag=True, 44 | help='Overwrite existing files' 45 | ) 46 | def combine_tg(wavs, tg, out, suffix, wav_subtype, overwrite): 47 | wav_path_in = pathlib.Path(wavs) 48 | tg_path_in = wav_path_in if tg is None else pathlib.Path(tg) 49 | del tg 50 | combined_path_out = pathlib.Path(out) 51 | combined_path_out.mkdir(parents=True, exist_ok=True) 52 | filelist: Dict[str, List[pathlib.Path]] = {} 53 | for 
tg_file in tg_path_in.glob('*.TextGrid'): 54 | stem = remove_suffix(tg_file.stem, suffix) 55 | if stem not in filelist: 56 | filelist[stem] = [tg_file] 57 | else: 58 | filelist[stem].append(tg_file) 59 | for name, files in tqdm.tqdm(sorted(filelist.items(), key=lambda kv: kv[0])): 60 | wav_segments = [] 61 | tg = textgrid.TextGrid() 62 | sentences_tier = textgrid.IntervalTier(name='sentences') 63 | words_tier = textgrid.IntervalTier(name='words') 64 | phones_tier = textgrid.IntervalTier(name='phones') 65 | sentence_start = 0. 66 | sr = None 67 | for tg_file in natsort.natsorted(files): 68 | wav_file = (wav_path_in / tg_file.name).with_suffix('.wav') 69 | waveform, sr_ = librosa.load(wav_file, sr=None) 70 | if sr is None: 71 | sr = sr_ 72 | else: 73 | assert sr_ == sr, f'Cannot combine \'{tg_file.stem}\': incompatible samplerate ({sr_} != {sr})' 74 | sentence_end = waveform.shape[0] / sr + sentence_start 75 | wav_segments.append(waveform) 76 | sentences_tier.add(minTime=sentence_start, maxTime=sentence_end, mark=wav_file.stem) 77 | sentence_tg = textgrid.TextGrid() 78 | sentence_tg.read(tg_file) 79 | start = sentence_start 80 | for j, word in enumerate(sentence_tg[0]): 81 | if j == len(sentence_tg[0]) - 1: 82 | end = sentence_end 83 | else: 84 | end = start + word.duration() 85 | words_tier.add(minTime=start, maxTime=end, mark=word.mark) 86 | start = end 87 | start = sentence_start 88 | for j, phone in enumerate(sentence_tg[1]): 89 | if j == len(sentence_tg[1]) - 1: 90 | end = sentence_end 91 | else: 92 | end = start + phone.duration() 93 | phones_tier.add(minTime=start, maxTime=end, mark=phone.mark) 94 | start = end 95 | sentence_start = sentence_end 96 | tg.append(sentences_tier) 97 | tg.append(words_tier) 98 | tg.append(phones_tier) 99 | 100 | tg_file_out = combined_path_out / f'{name}.TextGrid' 101 | wav_file_out = tg_file_out.with_suffix('.wav') 102 | if wav_file_out.exists() and not overwrite: 103 | raise FileExistsError(str(wav_file_out)) 104 | if tg_file_out.exists() and not overwrite: 105 | raise FileExistsError(str(tg_file_out)) 106 | 107 | tg.write(tg_file_out) 108 | full_wav = numpy.concatenate(wav_segments) 109 | soundfile.write(wav_file_out, full_wav, samplerate=sr, subtype=wav_subtype) 110 | 111 | 112 | if __name__ == '__main__': 113 | combine_tg() 114 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/dictionaries/opencpop-extension.txt: -------------------------------------------------------------------------------- 1 | a a 2 | ai ai 3 | an an 4 | ang ang 5 | ao ao 6 | ba b a 7 | bai b ai 8 | ban b an 9 | bang b ang 10 | bao b ao 11 | be b e 12 | bei b ei 13 | ben b en 14 | beng b eng 15 | ber b er 16 | bi b i 17 | bia b ia 18 | bian b ian 19 | biang b iang 20 | biao b iao 21 | bie b ie 22 | bin b in 23 | bing b ing 24 | biong b iong 25 | biu b iu 26 | bo b o 27 | bong b ong 28 | bou b ou 29 | bu b u 30 | bua b ua 31 | buai b uai 32 | buan b uan 33 | buang b uang 34 | bui b ui 35 | bun b un 36 | bv b v 37 | bve b ve 38 | ca c a 39 | cai c ai 40 | can c an 41 | cang c ang 42 | cao c ao 43 | ce c e 44 | cei c ei 45 | cen c en 46 | ceng c eng 47 | cer c er 48 | cha ch a 49 | chai ch ai 50 | chan ch an 51 | chang ch ang 52 | chao ch ao 53 | che ch e 54 | chei ch ei 55 | chen ch en 56 | cheng ch eng 57 | cher ch er 58 | chi ch ir 59 | chong ch ong 60 | chou ch ou 61 | chu ch u 62 | chua ch ua 63 | chuai ch uai 64 | chuan ch uan 65 | chuang ch uang 66 | chui ch ui 67 | chun ch un 68 | chuo ch uo 69 | chv ch v 70 | chyi ch 
i 71 | ci c i0 72 | cong c ong 73 | cou c ou 74 | cu c u 75 | cua c ua 76 | cuai c uai 77 | cuan c uan 78 | cuang c uang 79 | cui c ui 80 | cun c un 81 | cuo c uo 82 | cv c v 83 | cyi c i 84 | da d a 85 | dai d ai 86 | dan d an 87 | dang d ang 88 | dao d ao 89 | de d e 90 | dei d ei 91 | den d en 92 | deng d eng 93 | der d er 94 | di d i 95 | dia d ia 96 | dian d ian 97 | diang d iang 98 | diao d iao 99 | die d ie 100 | din d in 101 | ding d ing 102 | diong d iong 103 | diu d iu 104 | dong d ong 105 | dou d ou 106 | du d u 107 | dua d ua 108 | duai d uai 109 | duan d uan 110 | duang d uang 111 | dui d ui 112 | dun d un 113 | duo d uo 114 | dv d v 115 | dve d ve 116 | e e 117 | ei ei 118 | en en 119 | eng eng 120 | er er 121 | fa f a 122 | fai f ai 123 | fan f an 124 | fang f ang 125 | fao f ao 126 | fe f e 127 | fei f ei 128 | fen f en 129 | feng f eng 130 | fer f er 131 | fi f i 132 | fia f ia 133 | fian f ian 134 | fiang f iang 135 | fiao f iao 136 | fie f ie 137 | fin f in 138 | fing f ing 139 | fiong f iong 140 | fiu f iu 141 | fo f o 142 | fong f ong 143 | fou f ou 144 | fu f u 145 | fua f ua 146 | fuai f uai 147 | fuan f uan 148 | fuang f uang 149 | fui f ui 150 | fun f un 151 | fv f v 152 | fve f ve 153 | ga g a 154 | gai g ai 155 | gan g an 156 | gang g ang 157 | gao g ao 158 | ge g e 159 | gei g ei 160 | gen g en 161 | geng g eng 162 | ger g er 163 | gi g i 164 | gia g ia 165 | gian g ian 166 | giang g iang 167 | giao g iao 168 | gie g ie 169 | gin g in 170 | ging g ing 171 | giong g iong 172 | giu g iu 173 | gong g ong 174 | gou g ou 175 | gu g u 176 | gua g ua 177 | guai g uai 178 | guan g uan 179 | guang g uang 180 | gui g ui 181 | gun g un 182 | guo g uo 183 | gv g v 184 | gve g ve 185 | ha h a 186 | hai h ai 187 | han h an 188 | hang h ang 189 | hao h ao 190 | he h e 191 | hei h ei 192 | hen h en 193 | heng h eng 194 | her h er 195 | hi h i 196 | hia h ia 197 | hian h ian 198 | hiang h iang 199 | hiao h iao 200 | hie h ie 201 | hin h in 202 | hing h ing 203 | hiong h iong 204 | hiu h iu 205 | hong h ong 206 | hou h ou 207 | hu h u 208 | hua h ua 209 | huai h uai 210 | huan h uan 211 | huang h uang 212 | hui h ui 213 | hun h un 214 | huo h uo 215 | hv h v 216 | hve h ve 217 | ji j i 218 | jia j ia 219 | jian j ian 220 | jiang j iang 221 | jiao j iao 222 | jie j ie 223 | jin j in 224 | jing j ing 225 | jiong j iong 226 | jiu j iu 227 | ju j v 228 | juan j van 229 | jue j ve 230 | jun j vn 231 | ka k a 232 | kai k ai 233 | kan k an 234 | kang k ang 235 | kao k ao 236 | ke k e 237 | kei k ei 238 | ken k en 239 | keng k eng 240 | ker k er 241 | ki k i 242 | kia k ia 243 | kian k ian 244 | kiang k iang 245 | kiao k iao 246 | kie k ie 247 | kin k in 248 | king k ing 249 | kiong k iong 250 | kiu k iu 251 | kong k ong 252 | kou k ou 253 | ku k u 254 | kua k ua 255 | kuai k uai 256 | kuan k uan 257 | kuang k uang 258 | kui k ui 259 | kun k un 260 | kuo k uo 261 | kv k v 262 | kve k ve 263 | la l a 264 | lai l ai 265 | lan l an 266 | lang l ang 267 | lao l ao 268 | le l e 269 | lei l ei 270 | len l en 271 | leng l eng 272 | ler l er 273 | li l i 274 | lia l ia 275 | lian l ian 276 | liang l iang 277 | liao l iao 278 | lie l ie 279 | lin l in 280 | ling l ing 281 | liong l iong 282 | liu l iu 283 | lo l o 284 | long l ong 285 | lou l ou 286 | lu l u 287 | lua l ua 288 | luai l uai 289 | luan l uan 290 | luang l uang 291 | lui l ui 292 | lun l un 293 | luo l uo 294 | lv l v 295 | lve l ve 296 | ma m a 297 | mai m ai 298 | man m an 299 | mang m ang 300 | mao m ao 301 | me m e 302 | mei m 
ei 303 | men m en 304 | meng m eng 305 | mer m er 306 | mi m i 307 | mia m ia 308 | mian m ian 309 | miang m iang 310 | miao m iao 311 | mie m ie 312 | min m in 313 | ming m ing 314 | miong m iong 315 | miu m iu 316 | mo m o 317 | mong m ong 318 | mou m ou 319 | mu m u 320 | mua m ua 321 | muai m uai 322 | muan m uan 323 | muang m uang 324 | mui m ui 325 | mun m un 326 | mv m v 327 | mve m ve 328 | na n a 329 | nai n ai 330 | nan n an 331 | nang n ang 332 | nao n ao 333 | ne n e 334 | nei n ei 335 | nen n en 336 | neng n eng 337 | ner n er 338 | ni n i 339 | nia n ia 340 | nian n ian 341 | niang n iang 342 | niao n iao 343 | nie n ie 344 | nin n in 345 | ning n ing 346 | niong n iong 347 | niu n iu 348 | nong n ong 349 | nou n ou 350 | nu n u 351 | nua n ua 352 | nuai n uai 353 | nuan n uan 354 | nuang n uang 355 | nui n ui 356 | nun n un 357 | nuo n uo 358 | nv n v 359 | nve n ve 360 | o o 361 | ong ong 362 | ou ou 363 | pa p a 364 | pai p ai 365 | pan p an 366 | pang p ang 367 | pao p ao 368 | pe p e 369 | pei p ei 370 | pen p en 371 | peng p eng 372 | per p er 373 | pi p i 374 | pia p ia 375 | pian p ian 376 | piang p iang 377 | piao p iao 378 | pie p ie 379 | pin p in 380 | ping p ing 381 | piong p iong 382 | piu p iu 383 | po p o 384 | pong p ong 385 | pou p ou 386 | pu p u 387 | pua p ua 388 | puai p uai 389 | puan p uan 390 | puang p uang 391 | pui p ui 392 | pun p un 393 | pv p v 394 | pve p ve 395 | qi q i 396 | qia q ia 397 | qian q ian 398 | qiang q iang 399 | qiao q iao 400 | qie q ie 401 | qin q in 402 | qing q ing 403 | qiong q iong 404 | qiu q iu 405 | qu q v 406 | quan q van 407 | que q ve 408 | qun q vn 409 | ra r a 410 | rai r ai 411 | ran r an 412 | rang r ang 413 | rao r ao 414 | re r e 415 | rei r ei 416 | ren r en 417 | reng r eng 418 | rer r er 419 | ri r ir 420 | rong r ong 421 | rou r ou 422 | ru r u 423 | rua r ua 424 | ruai r uai 425 | ruan r uan 426 | ruang r uang 427 | rui r ui 428 | run r un 429 | ruo r uo 430 | rv r v 431 | ryi r i 432 | sa s a 433 | sai s ai 434 | san s an 435 | sang s ang 436 | sao s ao 437 | se s e 438 | sei s ei 439 | sen s en 440 | seng s eng 441 | ser s er 442 | sha sh a 443 | shai sh ai 444 | shan sh an 445 | shang sh ang 446 | shao sh ao 447 | she sh e 448 | shei sh ei 449 | shen sh en 450 | sheng sh eng 451 | sher sh er 452 | shi sh ir 453 | shong sh ong 454 | shou sh ou 455 | shu sh u 456 | shua sh ua 457 | shuai sh uai 458 | shuan sh uan 459 | shuang sh uang 460 | shui sh ui 461 | shun sh un 462 | shuo sh uo 463 | shv sh v 464 | shyi sh i 465 | si s i0 466 | song s ong 467 | sou s ou 468 | su s u 469 | sua s ua 470 | suai s uai 471 | suan s uan 472 | suang s uang 473 | sui s ui 474 | sun s un 475 | suo s uo 476 | sv s v 477 | syi s i 478 | ta t a 479 | tai t ai 480 | tan t an 481 | tang t ang 482 | tao t ao 483 | te t e 484 | tei t ei 485 | ten t en 486 | teng t eng 487 | ter t er 488 | ti t i 489 | tia t ia 490 | tian t ian 491 | tiang t iang 492 | tiao t iao 493 | tie t ie 494 | tin t in 495 | ting t ing 496 | tiong t iong 497 | tong t ong 498 | tou t ou 499 | tu t u 500 | tua t ua 501 | tuai t uai 502 | tuan t uan 503 | tuang t uang 504 | tui t ui 505 | tun t un 506 | tuo t uo 507 | tv t v 508 | tve t ve 509 | wa w a 510 | wai w ai 511 | wan w an 512 | wang w ang 513 | wao w ao 514 | we w e 515 | wei w ei 516 | wen w en 517 | weng w eng 518 | wer w er 519 | wi w i 520 | wo w o 521 | wong w ong 522 | wou w ou 523 | wu w u 524 | xi x i 525 | xia x ia 526 | xian x ian 527 | xiang x iang 528 | xiao x iao 529 | xie x ie 530 | xin x 
in 531 | xing x ing 532 | xiong x iong 533 | xiu x iu 534 | xu x v 535 | xuan x van 536 | xue x ve 537 | xun x vn 538 | ya y a 539 | yai y ai 540 | yan y En 541 | yang y ang 542 | yao y ao 543 | ye y E 544 | yei y ei 545 | yi y i 546 | yin y in 547 | ying y ing 548 | yo y o 549 | yong y ong 550 | you y ou 551 | yu y v 552 | yuan y van 553 | yue y ve 554 | yun y vn 555 | ywu y u 556 | za z a 557 | zai z ai 558 | zan z an 559 | zang z ang 560 | zao z ao 561 | ze z e 562 | zei z ei 563 | zen z en 564 | zeng z eng 565 | zer z er 566 | zha zh a 567 | zhai zh ai 568 | zhan zh an 569 | zhang zh ang 570 | zhao zh ao 571 | zhe zh e 572 | zhei zh ei 573 | zhen zh en 574 | zheng zh eng 575 | zher zh er 576 | zhi zh ir 577 | zhong zh ong 578 | zhou zh ou 579 | zhu zh u 580 | zhua zh ua 581 | zhuai zh uai 582 | zhuan zh uan 583 | zhuang zh uang 584 | zhui zh ui 585 | zhun zh un 586 | zhuo zh uo 587 | zhv zh v 588 | zhyi zh i 589 | zi z i0 590 | zong z ong 591 | zou z ou 592 | zu z u 593 | zua z ua 594 | zuai z uai 595 | zuan z uan 596 | zuang z uang 597 | zui z ui 598 | zun z un 599 | zuo z uo 600 | zv z v 601 | zyi z i 602 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/distribution.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | 3 | 4 | def draw_distribution(title, x_label, y_label, items: list, values: list, zoom=0.8): 5 | plt.figure(figsize=(int(len(items) * zoom), 10)) 6 | plt.bar(x=items, height=values) 7 | plt.tick_params(labelsize=15) 8 | plt.xlim(-1, len(items)) 9 | for a, b in zip(items, values): 10 | plt.text(a, b, b, ha='center', va='bottom', fontsize=15) 11 | plt.grid() 12 | plt.title(title, fontsize=30) 13 | plt.xlabel(x_label, fontsize=20) 14 | plt.ylabel(y_label, fontsize=20) 15 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/enhance_tg.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import librosa 5 | import numpy as np 6 | import parselmouth as pm 7 | import textgrid as tg 8 | import tqdm 9 | 10 | 11 | @click.command(help='Enhance and finish the TextGrids') 12 | @click.option('--wavs', required=True, help='Path to the segments directory') 13 | @click.option('--dictionary', required=True, help='Path to the dictionary file') 14 | @click.option('--src', required=True, help='Path to the raw TextGrids directory') 15 | @click.option('--dst', required=True, help='Path to the final TextGrids directory') 16 | @click.option('--f0_min', type=float, default=40., show_default=True, help='Minimum value of pitch') 17 | @click.option('--f0_max', type=float, default=1100., show_default=True, help='Maximum value of pitch') 18 | @click.option('--br_len', type=float, default=0.1, show_default=True, 19 | help='Minimum length of breath in seconds') 20 | @click.option('--br_db', type=float, default=-60., show_default=True, 21 | help='Threshold of RMS in dB for detecting breath') 22 | @click.option('--br_centroid', type=float, default=2000., show_default=True, 23 | help='Threshold of spectral centroid in Hz for detecting breath') 24 | @click.option('--time_step', type=float, default=0.005, show_default=True, 25 | help='Time step for feature extraction') 26 | @click.option('--min_space', type=float, default=0.04, show_default=True, 27 | help='Minimum length of space in seconds') 28 | @click.option('--voicing_thresh_vowel', 
type=float, default=0.45, show_default=True, 29 | help='Threshold of voicing for fixing long utterances') 30 | @click.option('--voicing_thresh_breath', type=float, default=0.6, show_default=True, 31 | help='Threshold of voicing for detecting breath') 32 | @click.option('--br_win_sz', type=float, default=0.05, show_default=True, 33 | help='Size of sliding window in seconds for detecting breath') 34 | def enhance_tg( 35 | wavs, dictionary, src, dst, 36 | f0_min, f0_max, br_len, br_db, br_centroid, 37 | time_step, min_space, voicing_thresh_vowel, voicing_thresh_breath, br_win_sz 38 | ): 39 | wavs = pathlib.Path(wavs) 40 | dict_path = pathlib.Path(dictionary) 41 | src = pathlib.Path(src) 42 | dst = pathlib.Path(dst) 43 | dst.mkdir(parents=True, exist_ok=True) 44 | 45 | with open(dict_path, 'r', encoding='utf8') as f: 46 | rules = [ln.strip().split('\t') for ln in f.readlines()] 47 | dictionary = {} 48 | phoneme_set = set() 49 | for r in rules: 50 | phonemes = r[1].split() 51 | dictionary[r[0]] = phonemes 52 | phoneme_set.update(phonemes) 53 | 54 | filelist = list(wavs.glob('*.wav')) 55 | for wavfile in tqdm.tqdm(filelist): 56 | tgfile = src / wavfile.with_suffix('.TextGrid').name 57 | textgrid = tg.TextGrid() 58 | textgrid.read(str(tgfile)) 59 | words = textgrid[0] 60 | phones = textgrid[1] 61 | sound = pm.Sound(str(wavfile)) 62 | f0_voicing_breath = sound.to_pitch_ac( 63 | time_step=time_step, 64 | voicing_threshold=voicing_thresh_breath, 65 | pitch_floor=f0_min, 66 | pitch_ceiling=f0_max, 67 | ).selected_array['frequency'] 68 | f0_voicing_vowel = sound.to_pitch_ac( 69 | time_step=time_step, 70 | voicing_threshold=voicing_thresh_vowel, 71 | pitch_floor=f0_min, 72 | pitch_ceiling=f0_max, 73 | ).selected_array['frequency'] 74 | y, sr = librosa.load(wavfile, sr=24000, mono=True) 75 | hop_size = int(time_step * sr) 76 | spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048, hop_length=hop_size).squeeze(0) 77 | 78 | # Fix long utterances 79 | i = j = 0 80 | while i < len(words): 81 | word = words[i] 82 | phone = phones[j] 83 | if word.mark is not None and word.mark != '': 84 | i += 1 85 | j += len(dictionary[word.mark]) 86 | continue 87 | if i == 0: 88 | i += 1 89 | j += 1 90 | continue 91 | prev_word = words[i - 1] 92 | prev_phone = phones[j - 1] 93 | # Extend length of long utterances 94 | while word.minTime < word.maxTime - time_step: 95 | pos = min(f0_voicing_vowel.shape[0] - 1, int(word.minTime / time_step)) 96 | if f0_voicing_vowel[pos] < f0_min: 97 | break 98 | prev_word.maxTime += time_step 99 | prev_phone.maxTime += time_step 100 | word.minTime += time_step 101 | phone.minTime += time_step 102 | i += 1 103 | j += 1 104 | 105 | # Detect aspiration 106 | i = j = 0 107 | while i < len(words): 108 | word = words[i] 109 | phone = phones[j] 110 | if word.mark is not None and word.mark != '': 111 | i += 1 112 | j += len(dictionary[word.mark]) 113 | continue 114 | if word.maxTime - word.minTime < br_len: 115 | i += 1 116 | j += 1 117 | continue 118 | ap_ranges = [] 119 | br_start = None 120 | win_pos = word.minTime 121 | while win_pos + br_win_sz <= word.maxTime: 122 | all_noisy = (f0_voicing_breath[ 123 | int(win_pos / time_step): int((win_pos + br_win_sz) / time_step)] < f0_min).all() 124 | rms_db = 20 * np.log10( 125 | np.clip(sound.get_rms(from_time=win_pos, to_time=win_pos + br_win_sz), a_min=1e-12, a_max=1)) 126 | # print(win_pos, win_pos + br_win_sz, all_noisy, rms_db) 127 | if all_noisy and rms_db >= br_db: 128 | if br_start is None: 129 | br_start = win_pos 130 | 
else: 131 | if br_start is not None: 132 | br_end = win_pos + br_win_sz - time_step 133 | if br_end - br_start >= br_len: 134 | centroid = spectral_centroid[int(br_start / time_step): int(br_end / time_step)].mean() 135 | if centroid >= br_centroid: 136 | ap_ranges.append((br_start, br_end)) 137 | br_start = None 138 | win_pos = br_end 139 | win_pos += time_step 140 | if br_start is not None: 141 | br_end = win_pos + br_win_sz - time_step 142 | if br_end - br_start >= br_len: 143 | centroid = spectral_centroid[int(br_start / time_step): int(br_end / time_step)].mean() 144 | if centroid >= br_centroid: 145 | ap_ranges.append((br_start, br_end)) 146 | # print(ap_ranges) 147 | if len(ap_ranges) == 0: 148 | i += 1 149 | j += 1 150 | continue 151 | words.removeInterval(word) 152 | phones.removeInterval(phone) 153 | if word.minTime < ap_ranges[0][0]: 154 | words.add(minTime=word.minTime, maxTime=ap_ranges[0][0], mark=None) 155 | phones.add(minTime=phone.minTime, maxTime=ap_ranges[0][0], mark=None) 156 | i += 1 157 | j += 1 158 | for k, ap in enumerate(ap_ranges): 159 | if k > 0: 160 | words.add(minTime=ap_ranges[k - 1][1], maxTime=ap[0], mark=None) 161 | phones.add(minTime=ap_ranges[k - 1][1], maxTime=ap[0], mark=None) 162 | i += 1 163 | j += 1 164 | words.add(minTime=ap[0], maxTime=min(word.maxTime, ap[1]), mark='AP') 165 | phones.add(minTime=ap[0], maxTime=min(word.maxTime, ap[1]), mark='AP') 166 | i += 1 167 | j += 1 168 | if ap_ranges[-1][1] < word.maxTime: 169 | words.add(minTime=ap_ranges[-1][1], maxTime=word.maxTime, mark=None) 170 | phones.add(minTime=ap_ranges[-1][1], maxTime=phone.maxTime, mark=None) 171 | i += 1 172 | j += 1 173 | 174 | # Remove short spaces 175 | i = j = 0 176 | while i < len(words): 177 | word = words[i] 178 | phone = phones[j] 179 | if word.mark is not None and word.mark != '': 180 | i += 1 181 | j += (1 if word.mark == 'AP' else len(dictionary[word.mark])) 182 | continue 183 | if word.maxTime - word.minTime >= min_space: 184 | word.mark = 'SP' 185 | phone.mark = 'SP' 186 | i += 1 187 | j += 1 188 | continue 189 | if i == 0: 190 | if len(words) >= 2: 191 | words[i + 1].minTime = word.minTime 192 | phones[j + 1].minTime = phone.minTime 193 | words.removeInterval(word) 194 | phones.removeInterval(phone) 195 | else: 196 | break 197 | elif i == len(words) - 1: 198 | if len(words) >= 2: 199 | words[i - 1].maxTime = word.maxTime 200 | phones[j - 1].maxTime = phone.maxTime 201 | words.removeInterval(word) 202 | phones.removeInterval(phone) 203 | else: 204 | break 205 | else: 206 | words[i - 1].maxTime = words[i + 1].minTime = (word.minTime + word.maxTime) / 2 207 | phones[j - 1].maxTime = phones[j + 1].minTime = (phone.minTime + phone.maxTime) / 2 208 | words.removeInterval(word) 209 | phones.removeInterval(phone) 210 | textgrid.write(str(dst / tgfile.name)) 211 | 212 | 213 | if __name__ == '__main__': 214 | enhance_tg() 215 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/reformat_wavs.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | import shutil 3 | 4 | import click 5 | import librosa 6 | import numpy as np 7 | import soundfile 8 | import tqdm 9 | 10 | 11 | @click.command(help='Reformat the WAV files to 16kHz, 16bit PCM mono format and copy labels') 12 | @click.option('--src', required=True, help='Source segments directory') 13 | @click.option('--dst', required=True, help='Target segments directory') 14 | @click.option( 15 | '--normalize', 16 | 
is_flag=True, show_default=True, default=False, 17 | help='Normalize the audio (peak calculated over all segments)' 18 | ) 19 | def reformat_wavs(src, dst, normalize): 20 | src = pathlib.Path(src).resolve() 21 | dst = pathlib.Path(dst).resolve() 22 | assert src != dst, 'src and dst should not be the same path' 23 | assert src.is_dir() and (not dst.exists() or dst.is_dir()), 'src and dst must be directories' 24 | dst.mkdir(parents=True, exist_ok=True) 25 | samplerate = 16000 26 | filelist = list(src.glob('*.wav')) 27 | max_y = 1.0 28 | if normalize: 29 | max_y = 0.0 30 | for file in tqdm.tqdm(filelist): 31 | y, _ = librosa.load(file, sr=samplerate, mono=True) 32 | max_y = max(max_y, np.max(np.abs(y))) 33 | max_y += 0.01 34 | for file in tqdm.tqdm(filelist): 35 | y, _ = librosa.load(file, sr=samplerate, mono=True) 36 | soundfile.write((dst / file.name), y / max_y, samplerate, subtype='PCM_16') 37 | annotation = file.with_suffix('.lab') 38 | shutil.copy(annotation, dst) 39 | print('Reformatting and copying done.') 40 | 41 | 42 | if __name__ == '__main__': 43 | reformat_wavs() 44 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/requirements.txt: -------------------------------------------------------------------------------- 1 | biopython==1.78 2 | click 3 | librosa<0.10.0 4 | matplotlib 5 | praatio<6.0.0 6 | praat-parselmouth 7 | pyyaml 8 | soundfile 9 | sox 10 | sqlalchemy==1.4.46 11 | textgrid 12 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/select_test_set.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import random 3 | from collections import defaultdict 4 | from pathlib import Path 5 | 6 | import click 7 | import yaml 8 | 9 | 10 | # noinspection PyShadowingBuiltins 11 | @click.command(help='Randomly select test samples') 12 | @click.argument( 13 | 'config', 14 | type=click.Path(file_okay=True, dir_okay=False, resolve_path=True, writable=True, path_type=Path), 15 | metavar="CONFIG" 16 | ) 17 | @click.option( 18 | '--rel_path', 19 | type=click.Path(file_okay=False, dir_okay=True, resolve_path=True, path_type=Path), 20 | default=None, 21 | help='Path that is relative to the paths mentioned in the config file.' 22 | ) 23 | @click.option( 24 | '--min', '_min', 25 | show_default=True, 26 | type=click.IntRange(min=1), 27 | default=10, 28 | help='Minimum number of test samples.' 29 | ) 30 | @click.option( 31 | '--max', '_max', 32 | show_default=True, 33 | type=click.IntRange(min=1), 34 | default=20, 35 | help='Maximum number of test samples (note that each speaker will have at least one test sample).' 36 | ) 37 | @click.option( 38 | '--per_speaker', 39 | show_default=True, 40 | type=click.IntRange(min=1), 41 | default=4, 42 | help='Expected number of test samples per speaker.' 43 | ) 44 | def select_test_set(config, rel_path, _min, _max, per_speaker): 45 | assert _min <= _max, 'min must be smaller or equal to max' 46 | with open(config, 'r', encoding='utf8') as f: 47 | hparams = yaml.safe_load(f) 48 | 49 | spk_map = None 50 | spk_ids = hparams['spk_ids'] 51 | speakers = hparams['speakers'] 52 | raw_data_dirs = list(map(Path, hparams['raw_data_dir'])) 53 | assert isinstance(speakers, list), 'Speakers must be a list' 54 | assert len(speakers) == len(raw_data_dirs), \ 55 | 'Number of raw data dirs must equal number of speaker names!' 
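    # Fall back to sequential speaker IDs (0, 1, 2, ...) when spk_ids is not set in the config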
56 | if not spk_ids: 57 | spk_ids = list(range(len(raw_data_dirs))) 58 | else: 59 | assert len(spk_ids) == len(raw_data_dirs), \ 60 | 'Length of explicitly given spk_ids must equal the number of raw datasets.' 61 | assert max(spk_ids) < hparams['num_spk'], \ 62 | f'Index in spk_id sequence {spk_ids} is out of range. All values should be smaller than num_spk.' 63 | 64 | spk_map = {} 65 | path_spk_map = defaultdict(list) 66 | for ds_id, (spk_name, raw_path, spk_id) in enumerate(zip(speakers, raw_data_dirs, spk_ids)): 67 | if spk_name in spk_map and spk_map[spk_name] != spk_id: 68 | raise ValueError(f'Invalid speaker ID assignment. Name \'{spk_name}\' is assigned ' 69 | f'with different speaker IDs: {spk_map[spk_name]} and {spk_id}.') 70 | spk_map[spk_name] = spk_id 71 | path_spk_map[spk_id].append((ds_id, rel_path / raw_path if rel_path else raw_path)) 72 | 73 | training_cases = [] 74 | for spk_raw_dirs in path_spk_map.values(): 75 | training_case = [] 76 | # training cases from the same speaker are grouped together 77 | for ds_id, raw_data_dir in spk_raw_dirs: 78 | with open(raw_data_dir / 'transcriptions.csv', 'r', encoding='utf8') as f: 79 | reader = csv.DictReader(f) 80 | for row in reader: 81 | if (raw_data_dir / 'wavs' / f'{row["name"]}.wav').exists(): 82 | training_case.append(f'{ds_id}:{row["name"]}') 83 | training_cases.append(training_case) 84 | 85 | test_prefixes = [] 86 | total = min(_max, max(_min, per_speaker * len(training_cases))) 87 | quotient, remainder = total // len(training_cases), total % len(training_cases) 88 | if quotient == 0: 89 | test_counts = [1] * len(training_cases) 90 | else: 91 | test_counts = [quotient + 1] * remainder + [quotient] * (len(training_cases) - remainder) 92 | for i, count in enumerate(test_counts): 93 | test_prefixes += sorted(random.sample(training_cases[i], count)) 94 | if not hparams['test_prefixes'] or click.confirm('Overwrite existing test prefixes?', abort=False): 95 | hparams['test_prefixes'] = test_prefixes 96 | hparams['num_valid_plots'] = len(test_prefixes) 97 | with open(config, 'w', encoding='utf8') as f: 98 | yaml.dump(hparams, f, sort_keys=False) 99 | print('Test prefixes saved.') 100 | else: 101 | print('Test prefixes not saved, aborted.') 102 | 103 | if __name__ == '__main__': 104 | select_test_set() 105 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/slice_tg.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import librosa 5 | import soundfile 6 | import textgrid 7 | import tqdm 8 | 9 | 10 | @click.command(help='Slice 3-tier TextGrids and long recordings into segmented 2-tier TextGrids and wavs') 11 | @click.option( 12 | '--wavs', required=True, 13 | help='Directory containing the segmented wav files' 14 | ) 15 | @click.option( 16 | '--tg', required=False, 17 | help='Directory containing the segmented TextGrid files (defaults to wav directory)' 18 | ) 19 | @click.option( 20 | '--out', required=True, 21 | help='Path to output directory for combined files' 22 | ) 23 | @click.option( 24 | '--preserve_sentence_names', is_flag=True, 25 | help='Whether to use sentence marks as filenames (will be re-numbered by default)' 26 | ) 27 | @click.option( 28 | '--digits', required=False, type=int, default=3, 29 | help='Number of suffix digits (defaults to 3, will be padded with zeros on the left)' 30 | ) 31 | @click.option( 32 | '--wav_subtype', required=False, default='PCM_16', 33 | help='Wav subtype 
(defaults to PCM_16)' 34 | ) 35 | @click.option( 36 | '--overwrite', is_flag=True, 37 | help='Overwrite existing files' 38 | ) 39 | def slice_tg(wavs, tg, out, preserve_sentence_names, digits, wav_subtype, overwrite): 40 | wav_path_in = pathlib.Path(wavs) 41 | tg_path_in = wav_path_in if tg is None else pathlib.Path(tg) 42 | del tg 43 | sliced_path_out = pathlib.Path(out) 44 | sliced_path_out.mkdir(parents=True, exist_ok=True) 45 | for tg_file in tqdm.tqdm(tg_path_in.glob('*.TextGrid')): 46 | tg = textgrid.TextGrid() 47 | tg.read(tg_file) 48 | wav, sr = librosa.load((wav_path_in / tg_file.name).with_suffix('.wav'), sr=None) 49 | sentences_tier = tg[0] 50 | words_tier = tg[1] 51 | phones_tier = tg[2] 52 | idx = 0 53 | for sentence in sentences_tier: 54 | if sentence.mark == '': 55 | continue 56 | sentence_tg = textgrid.TextGrid() 57 | sentence_words_tier = textgrid.IntervalTier(name='words') 58 | sentence_phones_tier = textgrid.IntervalTier(name='phones') 59 | for word in words_tier: 60 | min_time = max(sentence.minTime, word.minTime) 61 | max_time = min(sentence.maxTime, word.maxTime) 62 | if min_time >= max_time: 63 | continue 64 | sentence_words_tier.add( 65 | minTime=min_time - sentence.minTime, maxTime=max_time - sentence.minTime, mark=word.mark 66 | ) 67 | for phone in phones_tier: 68 | min_time = max(sentence.minTime, phone.minTime) 69 | max_time = min(sentence.maxTime, phone.maxTime) 70 | if min_time >= max_time: 71 | continue 72 | sentence_phones_tier.add( 73 | minTime=min_time - sentence.minTime, maxTime=max_time - sentence.minTime, mark=phone.mark 74 | ) 75 | sentence_tg.append(sentence_words_tier) 76 | sentence_tg.append(sentence_phones_tier) 77 | 78 | if preserve_sentence_names: 79 | tg_file_out = sliced_path_out / f'{sentence.mark}.TextGrid' 80 | wav_file_out = tg_file_out.with_suffix('.wav') 81 | else: 82 | tg_file_out = sliced_path_out / f'{tg_file.stem}_{str(idx).zfill(digits)}.TextGrid' 83 | wav_file_out = tg_file_out.with_suffix('.wav') 84 | if tg_file_out.exists() and not overwrite: 85 | raise FileExistsError(str(tg_file_out)) 86 | if wav_file_out.exists() and not overwrite: 87 | raise FileExistsError(str(wav_file_out)) 88 | 89 | sentence_tg.write(tg_file_out) 90 | sentence_wav = wav[int(sentence.minTime * sr): min(wav.shape[0], int(sentence.maxTime * sr) + 1)] 91 | soundfile.write( 92 | wav_file_out, 93 | sentence_wav, samplerate=sr, subtype=wav_subtype 94 | ) 95 | idx += 1 96 | 97 | 98 | if __name__ == '__main__': 99 | slice_tg() 100 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/summary_pitch.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import librosa 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import parselmouth as pm 8 | import tqdm 9 | from textgrid import TextGrid 10 | 11 | import distribution 12 | 13 | 14 | @click.command(help='Generate word-level pitch summary') 15 | @click.option('--wavs', required=True, help='Path to the segments directory') 16 | @click.option('--tg', required=True, help='Path to the TextGrids directory') 17 | def summary_pitch(wavs, tg): 18 | wavs = pathlib.Path(wavs) 19 | tg_dir = pathlib.Path(tg) 20 | del tg 21 | filelist = list(wavs.glob('*.wav')) 22 | 23 | pit_map = {} 24 | f0_min = 40. 25 | f0_max = 1100. 
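    # Voicing threshold passed to the Praat autocorrelation pitch tracker (via parselmouth)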
26 | voicing_thresh_vowel = 0.45 27 | for wavfile in tqdm.tqdm(filelist): 28 | tg = TextGrid() 29 | tg.read(tg_dir / wavfile.with_suffix('.TextGrid').name) 30 | timestep = 0.01 31 | f0 = pm.Sound(str(wavfile)).to_pitch_ac( 32 | time_step=timestep, 33 | voicing_threshold=voicing_thresh_vowel, 34 | pitch_floor=f0_min, 35 | pitch_ceiling=f0_max, 36 | ).selected_array['frequency'] 37 | pitch = 12. * np.log2(f0 / 440.) + 69. 38 | for word in tg[0]: 39 | if word.mark in ['AP', 'SP']: 40 | continue 41 | if word.maxTime - word.minTime < timestep: 42 | continue 43 | word_pit = pitch[int(word.minTime / timestep): int(word.maxTime / timestep)] 44 | word_pit = np.extract(word_pit >= 0, word_pit) 45 | if word_pit.shape[0] == 0: 46 | continue 47 | counts = np.bincount(word_pit.astype(np.int64)) 48 | midi = counts.argmax() 49 | if midi in pit_map: 50 | pit_map[midi] += 1 51 | else: 52 | pit_map[midi] = 1 53 | midi_keys = sorted(pit_map.keys()) 54 | midi_keys = list(range(midi_keys[0], midi_keys[-1] + 1)) 55 | distribution.draw_distribution( 56 | title='Pitch Distribution Summary', 57 | x_label='Pitch', 58 | y_label='Number of occurrences', 59 | items=[librosa.midi_to_note(k) for k in midi_keys], 60 | values=[pit_map.get(k, 0) for k in midi_keys] 61 | ) 62 | pitch_summary = wavs / 'pitch_distribution.jpg' 63 | plt.savefig(fname=pitch_summary, 64 | bbox_inches='tight', 65 | pad_inches=0.25) 66 | print(f'Pitch distribution summary saved to {pitch_summary}') 67 | 68 | 69 | if __name__ == '__main__': 70 | summary_pitch() 71 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/validate_labels.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import click 4 | import matplotlib.pyplot as plt 5 | import tqdm 6 | 7 | import distribution 8 | 9 | 10 | # noinspection PyShadowingBuiltins 11 | @click.command(help='Validate transcription labels') 12 | @click.option('--dir', required=True, help='Path to the segments directory') 13 | @click.option('--dictionary', required=True, help='Path to the dictionary file') 14 | def validate_labels(dir, dictionary): 15 | # Load dictionary 16 | dict_path = pathlib.Path(dictionary) 17 | with open(dict_path, 'r', encoding='utf8') as f: 18 | rules = [ln.strip().split('\t') for ln in f.readlines()] 19 | dictionary = {} 20 | phoneme_set = set() 21 | for r in rules: 22 | phonemes = r[1].split() 23 | dictionary[r[0]] = phonemes 24 | phoneme_set.update(phonemes) 25 | 26 | # Run checks 27 | check_failed = False 28 | covered = set() 29 | phoneme_map = {} 30 | for ph in sorted(phoneme_set): 31 | phoneme_map[ph] = 0 32 | 33 | segments_dir = pathlib.Path(dir) 34 | filelist = list(segments_dir.glob('*.wav')) 35 | 36 | for file in tqdm.tqdm(filelist): 37 | filename = file.stem 38 | annotation = file.with_suffix('.lab') 39 | if not annotation.exists(): 40 | print(f'No annotation found for \'{filename}\'!') 41 | check_failed = True 42 | continue 43 | with open(annotation, 'r', encoding='utf8') as f: 44 | syllables = f.read().strip().split() 45 | if not syllables: 46 | print(f'Annotation file \'{annotation}\' is empty!') 47 | check_failed = True 48 | else: 49 | oov = [] 50 | for s in syllables: 51 | if s not in dictionary: 52 | oov.append(s) 53 | else: 54 | for ph in dictionary[s]: 55 | phoneme_map[ph] += 1 56 | covered.update(dictionary[s]) 57 | if oov: 58 | print(f'Syllable(s) {oov} not allowed in annotation file \'{annotation}\'') 59 | check_failed = True 60 | 61 | # Phoneme 
coverage 62 | uncovered = phoneme_set - covered 63 | if uncovered: 64 | print(f'The following phonemes are not covered!') 65 | print(sorted(uncovered)) 66 | print('Please add more recordings to cover these phonemes.') 67 | check_failed = True 68 | 69 | if not check_failed: 70 | print('All annotations are well prepared.') 71 | 72 | phoneme_list = sorted(phoneme_set) 73 | phoneme_counts = [phoneme_map[ph] for ph in phoneme_list] 74 | distribution.draw_distribution( 75 | title='Phoneme Distribution Summary', 76 | x_label='Phoneme', 77 | y_label='Number of occurrences', 78 | items=phoneme_list, 79 | values=phoneme_counts 80 | ) 81 | phoneme_summary = segments_dir / 'phoneme_distribution.jpg' 82 | plt.savefig(fname=phoneme_summary, 83 | bbox_inches='tight', 84 | pad_inches=0.25) 85 | print(f'Phoneme distribution summary saved to {phoneme_summary}') 86 | 87 | 88 | if __name__ == '__main__': 89 | validate_labels() 90 | -------------------------------------------------------------------------------- /acoustic_forced_alignment/validate_lengths.py: -------------------------------------------------------------------------------- 1 | import librosa 2 | import tqdm 3 | import os 4 | import pathlib 5 | 6 | import click 7 | 8 | 9 | def length(src: str): 10 | if os.path.isfile(src) and src.endswith('.wav'): 11 | return librosa.get_duration(filename=src) / 3600. 12 | elif os.path.isdir(src): 13 | total = 0 14 | for ch in [os.path.join(src, c) for c in os.listdir(src)]: 15 | total += length(ch) 16 | return total 17 | return 0 18 | 19 | 20 | # noinspection PyShadowingBuiltins 21 | @click.command(help='Validate segment lengths') 22 | @click.option('--dir', required=True, help='Path to the segments directory') 23 | def validate_lengths(dir): 24 | dir = pathlib.Path(dir) 25 | assert dir.exists() and dir.is_dir(), 'The chosen path does not exist or is not a directory.' 26 | 27 | reported = False 28 | filelist = list(dir.glob('*.wav')) 29 | total_length = 0. 30 | for file in tqdm.tqdm(filelist): 31 | wave_seconds = librosa.get_duration(filename=str(file)) 32 | if wave_seconds < 2.: 33 | reported = True 34 | print(f'Too short! \'{file}\' has a length of {round(wave_seconds, 1)} seconds!') 35 | if wave_seconds > 20.: 36 | reported = True 37 | print(f'Too long! \'{file}\' has a length of {round(wave_seconds, 1)} seconds!') 38 | total_length += wave_seconds / 3600. 39 | 40 | print(f'Found {len(filelist)} segments with total length of {round(total_length, 2)} hours.') 41 | 42 | if not reported: 43 | print('All segments have proper length.') 44 | 45 | 46 | if __name__ == '__main__': 47 | validate_lengths() 48 | -------------------------------------------------------------------------------- /midi-recognition/README.md: -------------------------------------------------------------------------------- 1 | # MIDI Recognition 2 | 3 | ## 1. merge_wavs.py 4 | 5 | Merge short audio clips into long audio segments of similar length (e.g. 4 min) and a fixed sampling rate (e.g. 16000) and save the timestamps into tags.json. 6 | 7 | ## 2. extract_midi.py 8 | 9 | Extract MIDI sequences from of OpenSVIP json files, split them back into short clips according to tags.json, and add them into transcriptions.csv. 
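For reference, a minimal sketch of the tags.json structure shared by the two scripts, based on the fields written by merge_wavs.py (segment names follow its zero-padded counter; the clip names and duration values below are placeholders):

```json
{
  "00000000": [
    {"filename": "clip_001", "duration": 12.34},
    {"filename": "clip_002", "duration": 9.87}
  ]
}
```

extract_midi.py walks these durations to split the recognized note sequence back onto the original clips.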
10 | 11 | -------------------------------------------------------------------------------- /midi-recognition/extract_midi.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import pathlib 4 | 5 | import click 6 | import librosa 7 | from typing import List, Tuple 8 | 9 | 10 | @click.command(help='Extract MIDI sequences from OpenSVIP json files and add them into transcriptions.csv') 11 | @click.argument('json_dir', metavar='JSONS') 12 | @click.argument('csv_file', metavar='TRANSCRIPTIONS') 13 | @click.option('--key', type=int, default=0, show_default=True, 14 | metavar='SEMITONES', help='Key transition') 15 | def extract_midi(json_dir, csv_file, key): 16 | json_dir = pathlib.Path(json_dir).resolve() 17 | assert json_dir.exists(), 'The json directory does not exist.' 18 | tags_file = json_dir / 'tags.json' 19 | assert tags_file.exists(), 'The tags.json does not exist.' 20 | csv_file = pathlib.Path(csv_file).resolve() 21 | assert csv_file.resolve(), 'The path to transcriptions.csv does not exist.' 22 | tol = 0.001 23 | 24 | with open(tags_file, 'r', encoding='utf8') as f: 25 | tags: dict = json.load(f) 26 | 27 | # Read MIDI sequences 28 | note_seq_map: dict = {} # key: merged filename, value: note sequence 29 | for json_file in json_dir.iterdir(): 30 | if json_file.stem not in tags or not json_file.is_file() or json_file.suffix != '.json': 31 | continue 32 | with open(json_file, 'r', encoding='utf8') as f: 33 | json_obj: dict = json.load(f) 34 | assert len(json_obj['SongTempoList']) == 1, \ 35 | f'[ERROR] {json_file.name}: there must be one and only one single tempo in the project.' 36 | 37 | tempo = json_obj['SongTempoList'][0]['BPM'] 38 | midi_seq: list = json_obj['TrackList'][0]['NoteList'] 39 | note_seq: List[Tuple[str, float]] = [] # (note, duration) 40 | prev_pos: int = 0 # in ticks 41 | for i, midi in enumerate(midi_seq): 42 | if prev_pos < midi['StartPos']: 43 | note_seq.append( 44 | ('rest', (midi['StartPos'] - prev_pos) / 8 / tempo) 45 | ) 46 | note_seq.append( 47 | (librosa.midi_to_note(midi['KeyNumber'] + key, unicode=False), midi['Length'] / 8 / tempo) 48 | ) 49 | prev_pos = midi['StartPos'] + midi['Length'] 50 | remain_secs = prev_pos / 8 / tempo - sum(t['duration'] for t in tags[json_file.stem]) 51 | if remain_secs > tol: 52 | note_seq.append( 53 | ('rest', remain_secs) 54 | ) 55 | note_seq_map[json_file.stem] = note_seq 56 | 57 | # Load transcriptions 58 | transcriptions: dict = {} # key: split filename, value: attr dict 59 | with open(csv_file, 'r', encoding='utf8') as f: 60 | reader = csv.DictReader(f) 61 | for attrs in reader: 62 | transcriptions[attrs['name']] = attrs 63 | 64 | # Split note sequence and add into transcriptions 65 | for merged_name, note_seq in note_seq_map.items(): 66 | note_seq: Tuple[str, float] 67 | idx = 0 68 | offset = 0. 69 | cur_note_secs = 0. 70 | cur_clip_secs = 0. 71 | for split_tag in tags[merged_name]: 72 | split_note_seq = [] 73 | while idx < len(note_seq): 74 | cur_note_dur = note_seq[idx][1] - offset 75 | if cur_note_secs + cur_note_dur <= cur_clip_secs + split_tag['duration']: 76 | split_note_seq.append( 77 | (note_seq[idx][0], cur_note_dur) 78 | ) 79 | idx += 1 80 | cur_note_secs += cur_note_dur 81 | offset = 0. 
82 | else: 83 | offset = cur_clip_secs + split_tag['duration'] - cur_note_secs 84 | cur_note_secs += offset 85 | cur_clip_secs += split_tag['duration'] 86 | split_note_seq.append( 87 | (note_seq[idx][0], offset) 88 | ) 89 | break 90 | if idx == len(note_seq) and cur_clip_secs + split_tag['duration'] - cur_note_secs >= tol: 91 | split_note_seq.append( 92 | ('rest', cur_clip_secs + split_tag['duration'] - cur_note_secs) 93 | ) 94 | if split_tag['filename'] not in transcriptions: 95 | continue 96 | dst_dict = transcriptions[split_tag['filename']] 97 | dst_dict['note_seq'] = ' '.join(n[0] for n in split_note_seq) 98 | dst_dict['note_dur'] = ' '.join(str(n[1]) for n in split_note_seq) 99 | 100 | with open(csv_file, 'w', encoding='utf8', newline='') as f: 101 | writer = csv.DictWriter(f, fieldnames=['name', 'ph_seq', 'ph_dur', 'ph_num', 'note_seq', 'note_dur']) 102 | writer.writeheader() 103 | writer.writerows(v for _, v in transcriptions.items()) 104 | 105 | 106 | if __name__ == '__main__': 107 | extract_midi() 108 | -------------------------------------------------------------------------------- /midi-recognition/merge_wavs.py: -------------------------------------------------------------------------------- 1 | import tqdm 2 | import json 3 | import pathlib 4 | from collections import OrderedDict 5 | 6 | import click 7 | import librosa 8 | import numpy as np 9 | import soundfile 10 | 11 | 12 | @click.command(help='Merge clips into segments of similar length') 13 | @click.argument('input_wavs', metavar='INPUT_WAVS') 14 | @click.argument('output_wavs', metavar='OUTPUT_WAVS') 15 | @click.option('--length', type=int, required=False, default=240, metavar='SECONDS') 16 | @click.option('--sr', type=int, required=False, default=16000) 17 | def merge_wavs( 18 | input_wavs, output_wavs, length, sr 19 | ): 20 | input_wavs = pathlib.Path(input_wavs).resolve() 21 | assert input_wavs.exists(), 'The input directory does not exist.' 22 | output_wavs = pathlib.Path(output_wavs).resolve() 23 | assert not output_wavs.exists() or all(False for _ in output_wavs.iterdir()), \ 24 | 'The output directory is not empty.' 25 | 26 | output_wavs.mkdir(parents=True, exist_ok=True) 27 | tags = OrderedDict() 28 | count = 0 29 | cache: list[tuple[str, np.ndarray]] = [] 30 | cache_len = 0. 31 | 32 | def save_cache(): 33 | nonlocal tags, count, cache, cache_len 34 | waveform_merged = np.concatenate(tuple(c[1] for c in cache)) 35 | filename = (output_wavs / str(count).zfill(8)).with_suffix('.wav') 36 | soundfile.write( 37 | str(filename), 38 | waveform_merged, sr, format='WAV' 39 | ) 40 | tags[str(filename.stem)] = [ 41 | { 42 | 'filename': c[0], 43 | 'duration': c[1].shape[0] / sr 44 | } 45 | for c in cache 46 | ] 47 | cache.clear() 48 | cache_len = 0. 
49 | count += 1 50 | 51 | for wav in tqdm.tqdm(input_wavs.iterdir()): 52 | if not wav.is_file() or wav.suffix != '.wav': 53 | continue 54 | y, _ = librosa.load(wav, sr=sr, mono=True) 55 | cur_len = y.shape[0] / sr 56 | if len(cache) > 0 and cache_len + cur_len >= length: 57 | save_cache() 58 | cache.append((wav.stem, y)) 59 | cache_len += cur_len 60 | if len(cache) > 0: 61 | save_cache() 62 | 63 | tags_path = output_wavs / 'tags.json' 64 | with open(tags_path, 'w', encoding='utf8') as f: 65 | json.dump(tags, f, ensure_ascii=False, indent=2) 66 | print(f'Timestamps saved to {tags_path}') 67 | 68 | 69 | if __name__ == '__main__': 70 | merge_wavs() 71 | -------------------------------------------------------------------------------- /variance-temp-solution/.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | *.pyc 3 | __pycache__/ 4 | *.sh 5 | local_tools/ 6 | /venv/ 7 | 8 | .vscode 9 | .ipynb_checkpoints/ 10 | 11 | assets/* 12 | !assets/.gitkeep 13 | -------------------------------------------------------------------------------- /variance-temp-solution/README.md: -------------------------------------------------------------------------------- 1 | # Making variance datasets (temporary solution) 2 | 3 | This pipeline will guide you through migrating your old DiffSinger datasets to the new and complete format for both acoustic and variance model training. 4 | 5 | ## 1. Clone repo and install dependencies 6 | 7 | ```bash 8 | git clone https://github.com/openvpi/MakeDiffSinger.git 9 | cd MakeDiffSinger/variance-temp-solution 10 | pip install -r requirements.txt # or you can reuse a pre-existing DiffSinger environment 11 | ``` 12 | 13 | ## 2. Convert transcriptions 14 | 15 | Assume you have a DiffSinger dataset which contains a transcriptions.txt file. 16 | 17 | Run: 18 | 19 | ```bash 20 | python convert_txt.py path/to/your/transcriptions.txt 21 | ``` 22 | 23 | This will generate transcriptions.csv in the same folder as transcriptions.txt, which has three attributes: `name`, `ph_seq` and `ph_dur`. 24 | 25 | ## 3. Add `ph_num` attribute 26 | 27 | The attribute `ph_num` is needed for training the variance models, especially if you need to train the phoneme duration predictor. This attribute represents the number of phones that each word contains. 28 | 29 | In singing, vowels, instead of consonants, are used to align with the beginnings of notes. For this reason, each word should start with a vowel/AP/SP, and end with leading consonant(s) of the next word (if there are any). See the example below: 30 | 31 | ```text 32 | text | AP | shi | zhe | => word transcriptions (pinyin, romaji, etc.) 33 | ph_seq | AP | sh | ir | zh | e | => phoneme sequence 34 | ph_num | 2 | 2 | 1 | => word-level phoneme division 35 | ``` 36 | 37 | where `sh` and `zh` are consonants, and `AP`, `ir` and `e` can be regarded as vowels. There is one special case where a word can start with a consonant: isolated consonants. In this case, all phones in the word are consonants. 38 | 39 | For all monosyllabic phoneme systems (at most one vowel in one word), this step can be performed automatically. 40 | 41 | ### 3.1 two-part dictionaries (Chinese, Japanese, etc.) 42 | 43 | A two-part dictionary has "V" and "C-V" phoneme patterns. 44 | 45 | Run: 46 | 47 | ```bash 48 | python add_ph_num.py path/to/your/transcriptions.csv --dictionary path/to/your/dictionary.txt 49 | ``` 50 | 51 | ### 3.2 monosyllabic phoneme systems (Cantonese, Korean, etc.)
52 | 53 | A universal monosyllabic phoneme system has "C(m)-V-C(n)" (m,n >= 0) phoneme patterns. 54 | 55 | 1. Collect all vowels into vowels.txt, divided by spaces. 56 | 57 | 2. Collect all consonants into consonants.txt, divided by spaces. 58 | 59 | 3. Run: 60 | 61 | ```bash 62 | python add_ph_num.py path/to/your/transcriptions.csv --vowels vowels.txt --consonants consonants.txt 63 | ``` 64 | 65 | ### 3.3 polysyllabic phoneme systems (English, Russian, etc.) 66 | 67 | We recommend performing this step manually because word divisions cannot be inferred from phoneme sequences in these phoneme systems. 68 | 69 | > After finishing this step, the transcriptions.csv file can be directly used to train the phoneme duration predictor. If you want to train a pitch predictor, you must finish the remaining steps as follows. 70 | > 71 | 72 | ## 4. Estimate note values 73 | 74 | The note tier is another division of words besides the phoneme tier. See the example below: 75 | 76 | ```text 77 | ph_seq | AP | sh | ir | zh | e | => phoneme sequence 78 | ph_num | 2 | 2 | 1 | => word-level phoneme division 79 | note_seq | rest | D#3 | D#3 | C4 | => note sequence 80 | note_slur | 0 | 0 | 0 | 1 | => slur flag (will not be stored) 81 | ``` 82 | 83 | Note sequences can be automatically estimated and manually refined in two ways. 84 | 85 | ### 4.1 Infer a rough pitch value for each word 86 | 87 | The following program can infer a rough note value for each word. It does not produce slurs - slurs are hard to judge, and different people have different labeling styles. 88 | 89 | Run: 90 | 91 | ```bash 92 | python estimate_midi.py path/to/your/transcriptions.csv path/to/your/wavs 93 | ``` 94 | 95 | > **IMPORTANT** 96 | > 97 | > This step only estimates the rough MIDI value for each word. You have to refine the MIDI sequences, otherwise the pitch predictor will not be accurate. 98 | 99 | ### 4.2 (New!) Use the AI-powered MIDI extractor - SOME 100 | 101 | SOME (Singing-Oriented MIDI Extractor) is an NN-based MIDI extractor developed under the DiffSinger ecosystem. See guidance [here](https://github.com/openvpi/SOME#inference-via-pretrained-model-diffsinger-dataset) for using it on your DiffSinger dataset. 102 | 103 | ## 5. Refine MIDI sequences 104 | 105 | ### 5.1 take apart transcriptions.csv into DS files 106 | 107 | Run: 108 | 109 | ```bash 110 | python convert_ds.py csv2ds path/to/your/transcriptions.csv path/to/your/wavs 111 | ``` 112 | 113 | This will generate *.ds files matching your *.wav files in the same directory. 114 | 115 | > **IMPORTANT** 116 | > 117 | > In this step, we highly recommend using RMVPE, a more accurate NN-based pitch extraction algorithm, to get better pitch results. See guidance [here](#rmvpe-pitch-extraction-algorithm). 118 | > 119 | > Also note that after you finish manual MIDI refinement, please use the **same algorithm** and **same model** in your DiffSinger configuration files for variance model training to get the best results. 120 | 121 | ### 5.2 manually edit MIDI sequences 122 | 123 | Get the latest release of SlurCutter from [here](../README.md#essential-tools-to-process-and-label-your-datasets). This simple tool helps you adjust MIDI pitch in each DS file and cut notes into slurs if necessary. Be sure to back up your DS files before you start, since this tool will automatically save and overwrite an edited DS file.
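For orientation, each *.ds file written in step 5.1 is a JSON list with one object per sentence, and SlurCutter edits the note-related fields (`note_seq`, `note_dur`, `note_slur`). A trimmed sketch with placeholder values (the field set follows what `convert_ds.py csv2ds` writes; real files store `f0_timestep` as hop_size / sample_rate, 512 / 44100 by default):

```json
[
    {
        "offset": 0.0,
        "text": "AP sh ir zh e",
        "ph_seq": "AP sh ir zh e",
        "ph_dur": "0.3 0.12 0.5 0.08 0.6",
        "ph_num": "2 2 1",
        "note_seq": "rest D#3 C4",
        "note_dur": "0.42 0.58 0.6",
        "note_slur": "0 0 0",
        "f0_seq": "220.5 221.0 ...",
        "f0_timestep": "0.0116"
    }
]
```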
124 | 125 | ### 5.3 re-combine DS files into transcriptions.csv 126 | 127 | Run: 128 | 129 | ```bash 130 | python convert_ds.py ds2csv path/to/your/ds path/to/your/transcriptions.csv 131 | ``` 132 | 133 | This will generate a new transcriptions.csv from the DS files you just edited. Append `-f` if you are sure you want to overwrite the original transcription file (and the script complains about it). 134 | 135 | Now the transcriptions.csv can be used for all functionalities of DiffSinger training. 136 | 137 | `convert_ds.py ds2csv` supports DS files which have no corresponding WAV files. All sentences in these files will be assigned a virtual item name (the DS file stem, with a `#index` suffix when a file contains multiple sentences) and inserted into the transcriptions. This is a preparation to support using DS tuning projects to train a variance model. In addition, a `curves.json` file is written to support `f0` sequence refinement. 138 | 139 | ## (Appendix) other useful tools 140 | 141 | ### RMVPE pitch extraction algorithm 142 | 143 | convert_ds.py and estimate_midi.py support the state-of-the-art RMVPE pitch extraction algorithm. To use it: 144 | 145 | - Install PyTorch via [official guidance](https://pytorch.org/get-started/locally/). 146 | - Get RMVPE pretrained model [here](https://github.com/yxlllc/RMVPE/releases). 147 | - Put the RMVPE model.pt in `variance-temp-solution/assets/rmvpe/`. 148 | - Use `--pe rmvpe` when running `python convert_ds.py csv2ds` or `python estimate_midi.py`. 149 | 150 | ### correct_cents.py 151 | 152 | Apply cents correction to note sequences in a transcriptions.csv to offset out-of-tune errors. Pitch extracted from the corresponding waveforms is needed for reference. 153 | 154 | Usage: 155 | 156 | ```bash 157 | python correct_cents.py csv path/to/your/transcriptions.csv path/to/your/wavs 158 | ``` 159 | 160 | or 161 | 162 | ```bash 163 | python correct_cents.py ds path/to/your/ds/files 164 | ``` 165 | 166 | Note: this operation will overwrite your input file(s). 167 | 168 | ### eliminate_short.py 169 | 170 | Eliminate short slur notes in DS files. Slurs that are shorter than a given threshold (in seconds) will be merged into their neighboring notes within the same word. 171 | 172 | Usage: 173 | 174 | ```bash 175 | python eliminate_short.py path/to/your/ds/files THRESHOLD 176 | ``` 177 | 178 | Note: this operation will overwrite your input DS files. 179 | -------------------------------------------------------------------------------- /variance-temp-solution/add_ph_num.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import pathlib 3 | 4 | import click 5 | 6 | 7 | @click.command(help='Add ph_num attribute into transcriptions.csv') 8 | @click.argument('transcription', metavar='TRANSCRIPTIONS') 9 | @click.option('--dictionary', metavar='DICTIONARY') 10 | @click.option('--vowels', metavar='FILE') 11 | @click.option('--consonants', metavar='FILE') 12 | def add_ph_num( 13 | transcription: str, 14 | dictionary: str = None, 15 | vowels: str = None, 16 | consonants: str = None 17 | ): 18 | assert dictionary is not None or (vowels is not None and consonants is not None), \ 19 | 'Either dictionary file or vowels and consonants file should be specified.'
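# When --dictionary is given, vowel/consonant sets are derived from its entries:
# single-phoneme entries are vowels; for two-phoneme entries the first phoneme is
# a consonant and the second a vowel (AP/SP always count as vowels). ph_num is
# then computed by grouping each vowel with the consonants that follow it.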
20 | if dictionary is not None: 21 | dictionary = pathlib.Path(dictionary).resolve() 22 | vowels = {'SP', 'AP'} 23 | consonants = set() 24 | with open(dictionary, 'r', encoding='utf8') as f: 25 | rules = f.readlines() 26 | for r in rules: 27 | syllable, phonemes = r.split('\t') 28 | phonemes = phonemes.split() 29 | assert len(phonemes) <= 2, 'We only support two-phase dictionaries for automatically adding ph_num.' 30 | if len(phonemes) == 1: 31 | vowels.add(phonemes[0]) 32 | else: 33 | consonants.add(phonemes[0]) 34 | vowels.add(phonemes[1]) 35 | else: 36 | vowels_path = pathlib.Path(vowels).resolve() 37 | consonants_path = pathlib.Path(consonants).resolve() 38 | vowels = {'SP', 'AP'} 39 | consonants = set() 40 | with open(vowels_path, 'r', encoding='utf8') as f: 41 | vowels.update(f.read().split()) 42 | with open(consonants_path, 'r', encoding='utf8') as f: 43 | consonants.update(f.read().split()) 44 | overlapped = vowels.intersection(consonants) 45 | assert len(vowels.intersection(consonants)) == 0, \ 46 | 'Vowel set and consonant set overlapped. The following phonemes ' \ 47 | 'appear both as vowels and as consonants:\n' \ 48 | f'{sorted(overlapped)}' 49 | 50 | transcription = pathlib.Path(transcription).resolve() 51 | items: list[dict] = [] 52 | with open(transcription, 'r', encoding='utf8') as f: 53 | reader = csv.DictReader(f) 54 | for item in reader: 55 | items.append(item) 56 | 57 | for item in items: 58 | item: dict 59 | ph_seq = item['ph_seq'].split() 60 | for ph in ph_seq: 61 | assert ph in vowels or ph in consonants, \ 62 | f'Invalid phoneme symbol \'{ph}\' in \'{item["name"]}\'.' 63 | ph_num = [] 64 | i = 0 65 | while i < len(ph_seq): 66 | j = i + 1 67 | while j < len(ph_seq) and ph_seq[j] in consonants: 68 | j += 1 69 | ph_num.append(str(j - i)) 70 | i = j 71 | item['ph_num'] = ' '.join(ph_num) 72 | 73 | with open(transcription, 'w', encoding='utf8', newline='') as f: 74 | writer = csv.DictWriter(f, fieldnames=items[0].keys()) 75 | writer.writeheader() 76 | writer.writerows(items) 77 | 78 | 79 | if __name__ == '__main__': 80 | add_ph_num() 81 | -------------------------------------------------------------------------------- /variance-temp-solution/add_ph_num_advanced.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import pathlib 3 | from typing import Tuple, List 4 | 5 | import click 6 | import textgrid 7 | 8 | 9 | class RuleTerm: 10 | def __init__(self, key: str, is_wildcard: bool = False): 11 | self.key = key 12 | self.is_wildcard = is_wildcard 13 | 14 | def __repr__(self): 15 | if self.is_wildcard: 16 | return f"*{self.key}" 17 | return self.key 18 | 19 | 20 | class QueryTerm: 21 | def __init__(self, specified_key: str = None, wildcard_key: str = None): 22 | self.specified_key = specified_key 23 | self.wildcard_key = wildcard_key 24 | 25 | def __repr__(self): 26 | return str((self.specified_key, self.wildcard_key)) 27 | 28 | 29 | class TrieNode: 30 | def __init__(self): 31 | self.children = {} 32 | self.wildcards = {} 33 | self.value = None 34 | 35 | def __setitem__(self, key: Tuple[RuleTerm], value): 36 | if not key: 37 | self.value = value 38 | else: 39 | term, *key = key 40 | if term.is_wildcard: 41 | if term.key not in self.wildcards: 42 | self.wildcards[term.key] = TrieNode() 43 | self.wildcards[term.key][(*key,)] = value 44 | else: 45 | if term.key not in self.children: 46 | self.children[term.key] = TrieNode() 47 | self.children[term.key][(*key,)] = value 48 | 49 | def __getitem__(self, key: Tuple[RuleTerm]): 
50 | if not key: 51 | return self.value 52 | term, *key = key 53 | if term.is_wildcard: 54 | if term.key not in self.wildcards: 55 | return None 56 | return self.wildcards[term.key][(*key,)] 57 | if term.key not in self.children: 58 | return None 59 | return self.children[term.key][(*key,)] 60 | 61 | def find_paths(self, query: List[QueryTerm]) -> List[RuleTerm]: 62 | if not query: 63 | return [] 64 | term, *query = query 65 | paths = [] 66 | if term.specified_key in self.children: 67 | if self.children[term.specified_key].value is not None: 68 | paths.append([RuleTerm(term.specified_key, False)]) 69 | for path in self.children[term.specified_key].find_paths(query): 70 | paths.append([RuleTerm(term.specified_key, False), *path]) 71 | if term.wildcard_key in self.wildcards: 72 | if self.wildcards[term.wildcard_key].value is not None: 73 | paths.append([RuleTerm(term.wildcard_key, True)]) 74 | for path in self.wildcards[term.wildcard_key].find_paths(query): 75 | paths.append([RuleTerm(term.wildcard_key, True), *path]) 76 | return paths 77 | 78 | def find_best_path(self, query: Tuple[QueryTerm]) -> List[RuleTerm]: 79 | paths = self.find_paths(list(query)) 80 | return max( 81 | paths, 82 | default=None, 83 | key=lambda p: ( 84 | len(p), 85 | sum(not t.is_wildcard for t in p), 86 | min(enumerate(p), key=lambda e: (not e[1].is_wildcard, e[0]))[0] 87 | ) 88 | ) 89 | 90 | 91 | CONSONANT = 0 92 | VOWEL = 1 93 | LIQUID = 2 94 | 95 | 96 | @click.command(help='Add ph_num attribute into transcriptions.csv (advanced mode)') 97 | @click.argument( 98 | 'transcription', 99 | type=click.Path(exists=True, readable=True, path_type=pathlib.Path), 100 | metavar='TRANSCRIPTIONS' 101 | ) 102 | @click.option( 103 | '--tg', required=True, 104 | type=click.Path(exists=True, file_okay=False, dir_okay=True, path_type=pathlib.Path), 105 | help='Path to TextGrids' 106 | ) 107 | @click.option( 108 | '--vowels', 109 | type=click.Path(exists=True, readable=True, path_type=pathlib.Path), 110 | metavar='FILE', 111 | help='Path to the file containing vowels' 112 | ) 113 | @click.option( 114 | '--consonants', 115 | type=click.Path(exists=True, readable=True, path_type=pathlib.Path), 116 | metavar='FILE', 117 | help='Path to the file containing consonants' 118 | ) 119 | @click.option( 120 | '--liquids', 121 | type=click.Path(exists=True, readable=True, path_type=pathlib.Path), 122 | metavar='FILE', 123 | help='Path to the file containing liquids' 124 | ) 125 | def add_ph_num_advanced( 126 | transcription: pathlib.Path, 127 | tg: pathlib.Path, 128 | vowels: pathlib.Path = None, 129 | consonants: pathlib.Path = None, 130 | liquids: pathlib.Path = None 131 | ): 132 | with open(transcription, 'r', encoding='utf8') as f: 133 | reader = csv.DictReader(f) 134 | items = list(reader) 135 | phoneme_type_map = { 136 | 'AP': VOWEL, 137 | 'SP': VOWEL, 138 | } 139 | if vowels is not None: 140 | with open(vowels, 'r', encoding='utf8') as f: 141 | for v in f.read().split(): 142 | phoneme_type_map[v] = VOWEL 143 | if consonants is not None: 144 | with open(consonants, 'r', encoding='utf8') as f: 145 | for c in f.read().split(): 146 | phoneme_type_map[c] = CONSONANT 147 | if liquids is not None: 148 | with open(liquids, 'r', encoding='utf8') as f: 149 | for l in f.read().split(): 150 | phoneme_type_map[l] = LIQUID 151 | 152 | trie = TrieNode() 153 | trie[( 154 | RuleTerm(VOWEL, True), 155 | )] = [0] 156 | trie[( 157 | RuleTerm(CONSONANT, True), 158 | RuleTerm(LIQUID, True), 159 | RuleTerm(VOWEL, True), 160 | )] = [1] 161 | trie[( 162 | 
RuleTerm(LIQUID, True), 163 | RuleTerm(LIQUID, True), 164 | RuleTerm(VOWEL, True), 165 | )] = [1] 166 | 167 | for item in items: 168 | name = item['name'] 169 | tg_path = tg / f"{name}.TextGrid" 170 | tg_obj = textgrid.TextGrid() 171 | tg_obj.read(tg_path, encoding='utf8') 172 | words_tier = tg_obj[0] 173 | phones_tier = tg_obj[1] 174 | 175 | if item['ph_seq'].split() != [i.mark for i in phones_tier]: 176 | raise ValueError(f"Error: ph_seq mismatch in item: {name}") 177 | for phone_idx, phone_interval in enumerate(phones_tier): 178 | if phone_interval.mark not in phoneme_type_map: 179 | raise ValueError( 180 | f"Error: invalid phone in item: {name}, index: {phone_idx}, phone: {phone_interval.mark}" 181 | ) 182 | 183 | is_onset = [] 184 | for word_idx, word_interval in enumerate(words_tier): 185 | start_ph_idx = min( 186 | enumerate(tg_obj[1]), 187 | key=lambda e: abs(e[1].minTime - word_interval.minTime) 188 | )[0] 189 | end_ph_idx = min( 190 | enumerate(tg_obj[1]), 191 | key=lambda e: abs(e[1].maxTime - word_interval.maxTime) 192 | )[0] 193 | if phones_tier[start_ph_idx].minTime != word_interval.minTime: 194 | raise ValueError( 195 | f"Error: word minTime not aligned to phone minTime in item: " 196 | f"{name}, index: {word_idx}, word: {word_interval.mark}" 197 | ) 198 | if phones_tier[end_ph_idx].maxTime != word_interval.maxTime: 199 | raise ValueError( 200 | f"Error: word maxTime not aligned to phone maxTime in item: " 201 | f"{name}, index: {word_idx}, word: {word_interval.mark}" 202 | ) 203 | word_phones = [i.mark for i in phones_tier[start_ph_idx:end_ph_idx + 1]] 204 | i = 0 205 | while i < len(word_phones): 206 | query = [ 207 | QueryTerm(specified_key=ph, wildcard_key=phoneme_type_map[ph]) 208 | for ph in word_phones[i:] 209 | ] 210 | best_path = trie.find_best_path(query) 211 | if not best_path: 212 | is_onset.append(False) 213 | i += 1 214 | continue 215 | onsets = trie[best_path] 216 | is_onset.extend( 217 | j in onsets 218 | for j in range(len(best_path)) 219 | ) 220 | i += len(best_path) 221 | acc = 0 222 | ph_num = [] 223 | for flag in is_onset: 224 | if flag: 225 | if acc > 0: 226 | ph_num.append(acc) 227 | acc = 1 228 | else: 229 | acc += 1 230 | if acc > 0: 231 | ph_num.append(acc) 232 | item['ph_num'] = ' '.join(str(n) for n in ph_num) 233 | 234 | with open(transcription, 'w', encoding='utf8', newline='') as f: 235 | writer = csv.DictWriter(f, fieldnames=items[0].keys()) 236 | writer.writeheader() 237 | writer.writerows(items) 238 | 239 | 240 | if __name__ == '__main__': 241 | add_ph_num_advanced() 242 | -------------------------------------------------------------------------------- /variance-temp-solution/assets/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/openvpi/MakeDiffSinger/ca134d36dc8eec06002a72cd0a59257abcf7bb84/variance-temp-solution/assets/.gitkeep -------------------------------------------------------------------------------- /variance-temp-solution/convert_ds.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import pathlib 4 | from decimal import Decimal 5 | from math import isclose 6 | 7 | import click 8 | import librosa 9 | import numpy as np 10 | from tqdm import tqdm 11 | 12 | from get_pitch import get_pitch 13 | 14 | 15 | def try_resolve_note_slur_by_matching(ph_dur, ph_num, note_dur, tol): 16 | if len(ph_num) > len(note_dur): 17 | raise ValueError("ph_num should not be longer than note_dur.") 18 | 
ph_num_cum = np.cumsum([0] + ph_num) 19 | word_pos = np.cumsum([sum(ph_dur[l:r]) for l, r in zip(ph_num_cum[:-1], ph_num_cum[1:])]) 20 | note_pos = np.cumsum(note_dur) 21 | new_note_dur = [] 22 | 23 | note_slur = [] 24 | idx_word, idx_note = 0, 0 25 | slur = False 26 | while idx_word < len(word_pos) and idx_note < len(note_pos): 27 | if isclose(word_pos[idx_word], note_pos[idx_note], abs_tol=tol): 28 | note_slur.append(1 if slur else 0) 29 | new_note_dur.append(word_pos[idx_word]) 30 | idx_word += 1 31 | idx_note += 1 32 | slur = False 33 | elif note_pos[idx_note] > word_pos[idx_word]: 34 | raise ValueError("Cannot resolve note_slur by matching.") 35 | elif note_pos[idx_note] <= word_pos[idx_word]: 36 | note_slur.append(1 if slur else 0) 37 | new_note_dur.append(note_pos[idx_note]) 38 | idx_note += 1 39 | slur = True 40 | ret_note_dur = np.diff(new_note_dur, prepend=Decimal("0.0")).tolist() 41 | assert len(ret_note_dur) == len(note_slur) 42 | return ret_note_dur, note_slur 43 | 44 | 45 | def try_resolve_slur_by_slicing(ph_dur, ph_num, note_seq, note_dur, tol): 46 | ph_num_cum = np.cumsum([0] + ph_num) 47 | word_pos = np.cumsum([sum(ph_dur[l:r]) for l, r in zip(ph_num_cum[:-1], ph_num_cum[1:])]) 48 | note_pos = np.cumsum(note_dur) 49 | new_note_seq = [] 50 | new_note_dur = [] 51 | 52 | note_slur = [] 53 | idx_word, idx_note = 0, 0 54 | while idx_word < len(word_pos): 55 | slur = False 56 | if note_pos[idx_note] > word_pos[idx_word] and not isclose( 57 | note_pos[idx_note], word_pos[idx_word], abs_tol=tol 58 | ): 59 | new_note_seq.append(note_seq[idx_note]) 60 | new_note_dur.append(word_pos[idx_word]) 61 | note_slur.append(1 if slur else 0) 62 | else: 63 | while idx_note < len(note_pos) and ( 64 | note_pos[idx_note] < word_pos[idx_word] 65 | or isclose(note_pos[idx_note], word_pos[idx_word], abs_tol=tol) 66 | ): 67 | new_note_seq.append(note_seq[idx_note]) 68 | new_note_dur.append(note_pos[idx_note]) 69 | note_slur.append(1 if slur else 0) 70 | slur = True 71 | idx_note += 1 72 | if new_note_dur[-1] < word_pos[idx_word]: 73 | if isclose(new_note_dur[-1], word_pos[idx_word], abs_tol=tol): 74 | new_note_dur[-1] = word_pos[idx_word] 75 | else: 76 | new_note_seq.append(note_seq[idx_note]) 77 | new_note_dur.append(word_pos[idx_word]) 78 | note_slur.append(1 if slur else 0) 79 | idx_word += 1 80 | ret_note_dur = np.diff(new_note_dur, prepend=Decimal("0.0")).tolist() 81 | assert len(new_note_seq) == len(ret_note_dur) == len(note_slur) 82 | return new_note_seq, ret_note_dur, note_slur 83 | 84 | 85 | @click.group() 86 | def cli(): 87 | pass 88 | 89 | 90 | @click.command(help="Convert a transcription file to DS files") 91 | @click.argument( 92 | "transcription_file", 93 | type=click.Path( 94 | dir_okay=False, 95 | resolve_path=True, 96 | path_type=pathlib.Path, 97 | exists=True, 98 | readable=True, 99 | ), 100 | metavar="TRANSCRIPTIONS", 101 | ) 102 | @click.argument( 103 | "wavs_folder", 104 | type=click.Path(file_okay=False, resolve_path=True, path_type=pathlib.Path), 105 | metavar="FOLDER", 106 | ) 107 | @click.option( 108 | "--tolerance", 109 | "-t", 110 | type=float, 111 | default=0.005, 112 | help="Tolerance for ph_dur/note_dur mismatch", 113 | metavar="FLOAT", 114 | ) 115 | @click.option( 116 | "--hop_size", "-h", type=int, default=512, help="Hop size for f0_seq", metavar="INT" 117 | ) 118 | @click.option( 119 | "--sample_rate", 120 | "-s", 121 | type=int, 122 | default=44100, 123 | help="Sample rate of audio", 124 | metavar="INT", 125 | ) 126 | @click.option( 127 | "--pe", 128 | type=str, 129 
| default="parselmouth", 130 | help="Pitch extractor (parselmouth, rmvpe)", 131 | metavar="ALGORITHM", 132 | ) 133 | def csv2ds(transcription_file, wavs_folder, tolerance, hop_size, sample_rate, pe): 134 | """Convert a transcription file to DS file""" 135 | assert wavs_folder.is_dir(), "wavs folder not found." 136 | out_ds = {} 137 | out_exists = [] 138 | with open(transcription_file, "r", encoding="utf-8") as f: 139 | for trans_line in tqdm(csv.DictReader(f)): 140 | item_name = trans_line["name"] 141 | wav_fn = wavs_folder / f"{item_name}.wav" 142 | ds_fn = wavs_folder / f"{item_name}.ds" 143 | ph_dur = list(map(Decimal, trans_line["ph_dur"].strip().split())) 144 | ph_num = list(map(int, trans_line["ph_num"].strip().split())) 145 | note_seq = trans_line["note_seq"].strip().split() 146 | note_dur = list(map(Decimal, trans_line["note_dur"].strip().split())) 147 | note_glide = trans_line["note_glide"].strip().split() if "note_glide" in trans_line else None 148 | 149 | assert wav_fn.is_file(), f"{item_name}.wav not found." 150 | assert len(ph_dur) == sum(ph_num), "ph_dur and ph_num mismatch." 151 | assert len(note_seq) == len(note_dur), "note_seq and note_dur should have the same length." 152 | if note_glide: 153 | assert len(note_glide) == len(note_seq), "note_glide and note_seq should have the same length." 154 | assert isclose( 155 | sum(ph_dur), sum(note_dur), abs_tol=tolerance 156 | ), f"[{item_name}] ERROR: mismatch total duration: {sum(ph_dur) - sum(note_dur)}" 157 | 158 | # Resolve note_slur 159 | if "note_slur" in trans_line and trans_line["note_slur"]: 160 | note_slur = list(map(int, trans_line["note_slur"].strip().split())) 161 | else: 162 | try: 163 | note_dur, note_slur = try_resolve_note_slur_by_matching( 164 | ph_dur, ph_num, note_dur, tolerance 165 | ) 166 | except ValueError: 167 | # logging.warning(f"note_slur is not resolved by matching for {item_name}") 168 | note_seq, note_dur, note_slur = try_resolve_slur_by_slicing( 169 | ph_dur, ph_num, note_seq, note_dur, tolerance 170 | ) 171 | # Extract f0_seq 172 | wav, _ = librosa.load(wav_fn, sr=sample_rate, mono=True) 173 | # length = len(wav) + (win_size - hop_size) // 2 + (win_size - hop_size + 1) // 2 174 | # length = ceil((length - win_size) / hop_size) 175 | f0_timestep, f0, _ = get_pitch(pe, wav, hop_size, sample_rate) 176 | ds_content = [ 177 | { 178 | "offset": 0.0, 179 | "text": trans_line["ph_seq"], 180 | "ph_seq": trans_line["ph_seq"], 181 | "ph_dur": " ".join(str(round(d, 6)) for d in ph_dur), 182 | "ph_num": trans_line["ph_num"], 183 | "note_seq": " ".join(note_seq), 184 | "note_dur": " ".join(str(round(d, 6)) for d in note_dur), 185 | "note_slur": " ".join(map(str, note_slur)), 186 | "f0_seq": " ".join(map("{:.1f}".format, f0)), 187 | "f0_timestep": str(f0_timestep), 188 | } 189 | ] 190 | if note_glide: 191 | ds_content[0]["note_glide"] = " ".join(note_glide) 192 | out_ds[ds_fn] = ds_content 193 | if ds_fn.exists(): 194 | out_exists.append(ds_fn) 195 | if not out_exists or click.confirm(f"Overwrite {len(out_exists)} existing DS files?", abort=False): 196 | for ds_fn, ds_content in out_ds.items(): 197 | with open(ds_fn, "w", encoding="utf-8") as f: 198 | json.dump(ds_content, f, ensure_ascii=False, indent=4) 199 | else: 200 | click.echo("Aborted.") 201 | 202 | 203 | @click.command(help="Convert DS files to a transcription and curve files") 204 | @click.argument( 205 | "ds_folder", 206 | type=click.Path(file_okay=False, resolve_path=True, exists=True, path_type=pathlib.Path), 207 | metavar="FOLDER", 208 | ) 209 | 
@click.argument( 210 | "transcription_file", 211 | type=click.Path(file_okay=True, dir_okay=False, resolve_path=True, path_type=pathlib.Path), 212 | metavar="TRANSCRIPTIONS", 213 | ) 214 | @click.option( 215 | "--overwrite", 216 | "-f", 217 | is_flag=True, 218 | default=False, 219 | help="Overwrite existing transcription file", 220 | ) 221 | def ds2csv(ds_folder, transcription_file, overwrite): 222 | """Convert DS files to a transcription file""" 223 | if not overwrite and transcription_file.exists(): 224 | raise FileExistsError(f"{transcription_file} already exist.") 225 | 226 | transcriptions = [] 227 | any_with_glide = False 228 | # records that have corresponding wav files, assuming it's midi annotation 229 | for fp in tqdm(ds_folder.glob("*.ds"), ncols=80): 230 | if fp.with_suffix(".wav").exists(): 231 | with open(fp, "r", encoding="utf-8") as f: 232 | ds = json.load(f) 233 | transcriptions.append( 234 | { 235 | "name": fp.stem, 236 | "ph_seq": ds[0]["ph_seq"], 237 | "ph_dur": " ".join(str(round(Decimal(d), 6)) for d in ds[0]["ph_dur"].split()), 238 | "ph_num": ds[0]["ph_num"], 239 | "note_seq": ds[0]["note_seq"], 240 | "note_dur": " ".join(str(round(Decimal(d), 6)) for d in ds[0]["note_dur"].split()), 241 | # "note_slur": ds[0]["note_slur"], 242 | } 243 | ) 244 | if "note_glide" in ds[0]: 245 | any_with_glide = True 246 | transcriptions[-1]["note_glide"] = ds[0]["note_glide"] 247 | # Lone DS files. 248 | for fp in tqdm(ds_folder.glob("*.ds"), ncols=80): 249 | if not fp.with_suffix(".wav").exists(): 250 | with open(fp, "r", encoding="utf-8") as f: 251 | ds = json.load(f) 252 | for idx, sub_ds in enumerate(ds): 253 | item_name = f"{fp.stem}#{idx}" if len(ds) > 1 else fp.stem 254 | transcriptions.append( 255 | { 256 | "name": item_name, 257 | "ph_seq": sub_ds["ph_seq"], 258 | "ph_dur": " ".join(str(round(Decimal(d), 6)) for d in sub_ds["ph_dur"].split()), 259 | "ph_num": sub_ds["ph_num"], 260 | "note_seq": sub_ds["note_seq"], 261 | "note_dur": " ".join(str(round(Decimal(d), 6)) for d in sub_ds["note_dur"].split()), 262 | # "note_slur": sub_ds["note_slur"], 263 | } 264 | ) 265 | if "note_glide" in sub_ds: 266 | any_with_glide = True 267 | transcriptions[-1]["note_glide"] = sub_ds["note_glide"] 268 | if any_with_glide: 269 | for row in transcriptions: 270 | if "note_glide" not in row: 271 | row["note_glide"] = " ".join(["none"] * len(row["note_seq"].split())) 272 | with open(transcription_file, "w", newline="", encoding="utf-8") as f: 273 | writer = csv.DictWriter( 274 | f, 275 | fieldnames=[ 276 | "name", 277 | "ph_seq", 278 | "ph_dur", 279 | "ph_num", 280 | "note_seq", 281 | "note_dur", 282 | # "note_slur", 283 | ] + (["note_glide"] if any_with_glide else []), 284 | ) 285 | writer.writeheader() 286 | writer.writerows(transcriptions) 287 | 288 | 289 | cli.add_command(csv2ds) 290 | cli.add_command(ds2csv) 291 | 292 | if __name__ == "__main__": 293 | cli() 294 | -------------------------------------------------------------------------------- /variance-temp-solution/convert_txt.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import pathlib 3 | 4 | import click 5 | 6 | 7 | @click.command(help='Migrate transcriptions.txt in old datasets to transcriptions.csv') 8 | @click.argument('input_txt', metavar='INPUT') 9 | def convert_txt( 10 | input_txt: str 11 | ): 12 | input_txt = pathlib.Path(input_txt).resolve() 13 | assert input_txt.exists(), 'The input file does not exist.' 
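# Each line of transcriptions.txt is pipe-separated; only field 0 (item name),
# field 2 (phoneme sequence) and field 5 (phoneme durations) are carried over to the CSV.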
14 | with open(input_txt, 'r', encoding='utf8') as f: 15 | utterances = f.readlines() 16 | utterances = [u.split('|') for u in utterances] 17 | utterances = [ 18 | { 19 | 'name': u[0], 20 | 'ph_seq': u[2], 21 | 'ph_dur': u[5] 22 | } 23 | for u in utterances 24 | ] 25 | 26 | with open(input_txt.with_suffix('.csv'), 'w', encoding='utf8', newline='') as f: 27 | writer = csv.DictWriter(f, fieldnames=['name', 'ph_seq', 'ph_dur']) 28 | writer.writeheader() 29 | writer.writerows(utterances) 30 | 31 | 32 | if __name__ == '__main__': 33 | convert_txt() 34 | -------------------------------------------------------------------------------- /variance-temp-solution/correct_cents.py: -------------------------------------------------------------------------------- 1 | import json 2 | import math 3 | import warnings 4 | from collections import OrderedDict 5 | 6 | import librosa 7 | import numpy as np 8 | import tqdm 9 | import pathlib 10 | from csv import DictReader, DictWriter 11 | 12 | import click 13 | 14 | from get_pitch import get_pitch_parselmouth 15 | 16 | warns = [] 17 | 18 | 19 | def get_aligned_pitch(wav_path: pathlib.Path, total_secs: float, timestep: float): 20 | waveform, _ = librosa.load(wav_path, sr=44100, mono=True) 21 | _, f0, _ = get_pitch_parselmouth(waveform, 512, 44100) 22 | pitch = librosa.hz_to_midi(f0) 23 | if pitch.shape[0] < total_secs / timestep: 24 | pad = math.ceil(total_secs / timestep) - pitch.shape[0] 25 | pitch = np.pad(pitch, [0, pad], mode='constant', constant_values=[0, pitch[-1]]) 26 | return pitch 27 | 28 | 29 | def correct_cents_item( 30 | name: str, item: OrderedDict, ref_pitch: np.ndarray, 31 | timestep: float, error_ratio: float 32 | ): 33 | note_seq = item['note_seq'].split() 34 | note_dur = [float(d) for d in item['note_dur'].split()] 35 | assert len(note_seq) == len(note_dur) 36 | 37 | start = 0. 
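# For each non-rest note, reference pitch frames within +/-50 cents of the labeled
# note are averaged to get the corrected value; when too few frames fall inside that
# window (less than error_ratio of the note's frames), the note is also recorded in
# warnings.csv as a possible labeling error.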
38 | note_seq_correct = [] 39 | for i, (note, dur) in enumerate(zip(note_seq, note_dur)): 40 | end = start + dur 41 | if note == 'rest': 42 | start = end 43 | note_seq_correct.append('rest') 44 | continue 45 | 46 | midi = librosa.note_to_midi(note, round_midi=False) 47 | start_idx = math.floor(start / timestep) 48 | end_idx = math.ceil(end / timestep) 49 | note_pitch = ref_pitch[start_idx: end_idx] 50 | note_pitch_close = note_pitch[(note_pitch >= midi - 0.5) & (note_pitch < midi + 0.5)] 51 | if len(note_pitch_close) < len(note_pitch) * error_ratio or len(note_pitch) == 0: 52 | warns.append({ 53 | 'position': name, 54 | 'note_index': i, 55 | 'note_value': note 56 | }) 57 | if len(note_pitch) == 0 or len(note_pitch_close) == 0: 58 | start = end 59 | note_seq_correct.append(note) 60 | continue 61 | midi_correct = np.mean(note_pitch_close) 62 | note_seq_correct.append(librosa.midi_to_note(midi_correct, cents=True, unicode=False)) 63 | 64 | start = end 65 | 66 | item['note_seq'] = ' '.join(note_seq_correct) 67 | 68 | 69 | def save_warnings(save_dir: pathlib.Path): 70 | if len(warns) > 0: 71 | save_path = save_dir.resolve() / 'warnings.csv' 72 | with open(save_path, 'w', encoding='utf8', newline='') as f: 73 | writer = DictWriter(f, fieldnames=['position', 'note_index', 'note_value']) 74 | writer.writeheader() 75 | writer.writerows(warns) 76 | warnings.warn( 77 | message=f'possible labeling errors saved in {save_path}', 78 | category=UserWarning 79 | ) 80 | warnings.filterwarnings(action='default') 81 | 82 | 83 | @click.group(help='Apply cents correction to note sequences') 84 | def correct_cents(): 85 | pass 86 | 87 | 88 | @correct_cents.command(help='Apply cents correction to note sequences in transcriptions.csv') 89 | @click.argument('transcriptions', metavar='TRANSCRIPTIONS') 90 | @click.argument('waveforms', metavar='WAVS') 91 | @click.option('--error_ratio', metavar='RATIO', type=float, default=0.4, 92 | help='If the percentage of pitch points within a deviation of 50 cents compared to the note label ' 93 | 'is lower than this value, a warning will be raised.') 94 | def csv( 95 | transcriptions, 96 | waveforms, 97 | error_ratio 98 | ): 99 | transcriptions = pathlib.Path(transcriptions).resolve() 100 | waveforms = pathlib.Path(waveforms).resolve() 101 | with open(transcriptions, 'r', encoding='utf8') as f: 102 | reader = DictReader(f) 103 | items: list[OrderedDict] = [] 104 | for item in reader: 105 | items.append(OrderedDict(item)) 106 | 107 | timestep = 512 / 44100 108 | for item in tqdm.tqdm(items): 109 | item: OrderedDict 110 | ref_pitch = get_aligned_pitch( 111 | wav_path=waveforms / (item['name'] + '.wav'), 112 | total_secs=sum(float(d) for d in item['note_dur'].split()), 113 | timestep=timestep 114 | ) 115 | correct_cents_item( 116 | name=item['name'], item=item, ref_pitch=ref_pitch, 117 | timestep=timestep, error_ratio=error_ratio 118 | ) 119 | 120 | with open(transcriptions, 'w', encoding='utf8', newline='') as f: 121 | writer = DictWriter(f, fieldnames=['name', 'ph_seq', 'ph_dur', 'ph_num', 'note_seq', 'note_dur']) 122 | writer.writeheader() 123 | writer.writerows(items) 124 | save_warnings(transcriptions.parent) 125 | 126 | 127 | @correct_cents.command(help='Apply cents correction to note sequences in DS files') 128 | @click.argument('ds_dir', metavar='DS_DIR') 129 | @click.option('--error_ratio', metavar='RATIO', type=float, default=0.4, 130 | help='If the percentage of pitch points within a deviation of 50 cents compared to the note label ' 131 | 'is lower than this value, a 
warning will be raised.') 132 | def ds( 133 | ds_dir, 134 | error_ratio 135 | ): 136 | ds_dir = pathlib.Path(ds_dir).resolve() 137 | assert ds_dir.exists(), 'The directory of DS files does not exist.' 138 | 139 | timestep = 512 / 44100 140 | for ds_file in tqdm.tqdm(ds_dir.glob('*.ds')): 141 | if not ds_file.is_file(): 142 | continue 143 | 144 | assert ds_file.with_suffix('.wav').exists(), \ 145 | f'Missing corresponding .wav file of {ds_file.name}.' 146 | with open(ds_file, 'r', encoding='utf8') as f: 147 | params = json.load(f) 148 | if not isinstance(params, list): 149 | params = [params] 150 | params = [OrderedDict(p) for p in params] 151 | 152 | ref_pitch = get_aligned_pitch( 153 | wav_path=ds_file.with_suffix('.wav'), 154 | total_secs=params[-1]['offset'] + sum(float(d) for d in params[-1]['note_dur'].split()), 155 | timestep=timestep 156 | ) 157 | for i, param in enumerate(params): 158 | start_idx = math.floor(param['offset'] / timestep) 159 | end_idx = math.ceil((param['offset'] + sum(float(d) for d in param['note_dur'].split())) / timestep) 160 | correct_cents_item( 161 | name=f'{ds_file.stem}#{i}', item=param, ref_pitch=ref_pitch[start_idx: end_idx], 162 | timestep=timestep, error_ratio=error_ratio 163 | ) 164 | 165 | with open(ds_file, 'w', encoding='utf8') as f: 166 | json.dump(params, f, ensure_ascii=False, indent=2) 167 | save_warnings(ds_dir) 168 | 169 | 170 | if __name__ == '__main__': 171 | correct_cents() 172 | -------------------------------------------------------------------------------- /variance-temp-solution/eliminate_short.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | from collections import OrderedDict 4 | 5 | import click 6 | 7 | 8 | @click.command(help='Eliminate short slur notes in DS files') 9 | @click.argument('ds_dir', metavar='DS_DIR') 10 | @click.argument('threshold', type=float, metavar='THRESHOLD') 11 | def eliminate_short( 12 | ds_dir, 13 | threshold: float 14 | ): 15 | ds_dir = pathlib.Path(ds_dir).resolve() 16 | assert ds_dir.exists(), 'The directory of DS files does not exist.' 17 | 18 | for ds in ds_dir.iterdir(): 19 | if not ds.is_file() or ds.suffix != '.ds': 20 | continue 21 | 22 | with open(ds, 'r', encoding='utf8') as f: 23 | params = json.load(f) 24 | if not isinstance(params, list): 25 | params = [params] 26 | params = [OrderedDict(p) for p in params] 27 | 28 | for param in params: 29 | note_list = [ 30 | (note, float(dur), bool(int(slur))) 31 | for note, dur, slur 32 | in zip(param['note_seq'].split(), param['note_dur'].split(), param['note_slur'].split()) 33 | ] 34 | word_note_div = [] 35 | cache = [] 36 | for note in note_list: 37 | if len(cache) == 0 or note[2]: 38 | cache.append(note) 39 | else: 40 | word_note_div.append(cache) 41 | cache = [note] 42 | if len(cache) > 0: 43 | word_note_div.append(cache) 44 | 45 | word_note_div_new = [] 46 | for i in range(len(word_note_div)): 47 | word_note_seq = word_note_div[i] 48 | if len(word_note_seq) == 1 or all(n[1] < threshold for n in word_note_seq): 49 | word_note_div_new.append(word_note_seq) 50 | continue 51 | 52 | word_note_seq_new = [] 53 | j = 0 54 | prev_merge = 0. 
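# Merge strategy: leading short notes are absorbed into the first long note of the
# word; short notes between two long notes are split evenly between them; trailing
# short notes are absorbed into the last long note.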
55 | while word_note_seq[j][1] < threshold: 56 | # Enumerate leading short notes 57 | prev_merge += word_note_seq[j][1] 58 | j += 1 59 | # Iter note sequence 60 | while j < len(word_note_seq): 61 | k = j + 1 62 | while k < len(word_note_seq) and word_note_seq[k][1] < threshold: 63 | k += 1 64 | post_merge = sum(n[1] for n in word_note_seq[j + 1: k]) 65 | if k < len(word_note_seq): 66 | post_merge /= 2 67 | word_note_seq_new.append( 68 | (word_note_seq[j][0], prev_merge + word_note_seq[j][1] + post_merge, False) 69 | ) 70 | prev_merge = post_merge 71 | j = k 72 | 73 | word_note_div_new.append(word_note_seq_new) 74 | 75 | note_seq_new = [] 76 | note_dur_new = [] 77 | note_slur_new = [] 78 | for word_note_seq in word_note_div_new: 79 | note_seq_new += [n[0] for n in word_note_seq] 80 | note_dur_new += [n[1] for n in word_note_seq] 81 | note_slur_new += [pos > 0 for pos in range(len(word_note_seq))] 82 | param['note_seq'] = ' '.join(note_seq_new) 83 | param['note_dur'] = ' '.join(str(round(d, 6)) for d in note_dur_new) 84 | param['note_slur'] = ' '.join(str(int(s)) for s in note_slur_new) 85 | 86 | with open(ds, 'w', encoding='utf8') as f: 87 | json.dump(params, f, ensure_ascii=False, indent=2) 88 | 89 | 90 | if __name__ == '__main__': 91 | eliminate_short() 92 | -------------------------------------------------------------------------------- /variance-temp-solution/estimate_midi.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import math 3 | import pathlib 4 | 5 | import click 6 | import librosa 7 | import numpy as np 8 | import tqdm 9 | from typing import List 10 | 11 | from get_pitch import get_pitch 12 | 13 | 14 | @click.command(help='Estimate note pitch from transcriptions and corresponding waveforms') 15 | @click.argument('transcriptions', metavar='TRANSCRIPTIONS') 16 | @click.argument('waveforms', metavar='WAVS') 17 | @click.option('--pe', metavar='ALGORITHM', default='parselmouth', 18 | help='Pitch extractor (parselmouth, rmvpe)') 19 | @click.option('--rest_uv_ratio', metavar='RATIO', type=float, default=0.85, 20 | help='The minimum percentage of unvoiced length for a note to be regarded as rest') 21 | def estimate_midi( 22 | transcriptions: str, 23 | waveforms: str, 24 | pe: str = 'parselmouth', 25 | rest_uv_ratio: float = 0.85 26 | ): 27 | transcriptions = pathlib.Path(transcriptions).resolve() 28 | waveforms = pathlib.Path(waveforms).resolve() 29 | with open(transcriptions, 'r', encoding='utf8') as f: 30 | reader = csv.DictReader(f) 31 | items: List[dict] = [] 32 | for item in reader: 33 | items.append(item) 34 | 35 | timestep = 512 / 44100 36 | for item in tqdm.tqdm(items): 37 | item: dict 38 | ph_dur = [float(d) for d in item['ph_dur'].split()] 39 | ph_num = [int(n) for n in item['ph_num'].split()] 40 | assert sum(ph_num) == len(ph_dur), f'ph_num does not sum to number of phones in \'{item["name"]}\'.' 41 | 42 | word_dur = [] 43 | i = 0 44 | for num in ph_num: 45 | word_dur.append(sum(ph_dur[i: i + num])) 46 | i += num 47 | 48 | total_secs = sum(ph_dur) 49 | waveform, _ = librosa.load(waveforms / (item['name'] + '.wav'), sr=44100, mono=True) 50 | _, f0, uv = get_pitch(pe, waveform, 512, 44100) 51 | pitch = librosa.hz_to_midi(f0) 52 | if pitch.shape[0] < total_secs / timestep: 53 | pad = math.ceil(total_secs / timestep) - pitch.shape[0] 54 | pitch = np.pad(pitch, [0, pad], mode='constant', constant_values=[0, pitch[-1]]) 55 | uv = np.pad(uv, [0, pad], mode='constant') 56 | 57 | note_seq = [] 58 | note_dur = [] 59 | start = 0. 
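# A word is labeled 'rest' when fewer than (1 - rest_uv_ratio) of its frames are
# voiced; otherwise the most common rounded MIDI value is taken and refined as the
# mean of the frames within +/-50 cents of it.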
60 | for dur in word_dur: 61 | end = start + dur 62 | start_idx = math.floor(start / timestep) 63 | end_idx = math.ceil(end / timestep) 64 | word_pitch = pitch[start_idx: end_idx] 65 | word_uv = uv[start_idx: end_idx] 66 | word_valid_pitch = np.extract(~word_uv & (word_pitch >= 0), word_pitch) 67 | if len(word_valid_pitch) < (1 - rest_uv_ratio) * (end_idx - start_idx): 68 | note_seq.append('rest') 69 | else: 70 | counts = np.bincount(np.round(word_valid_pitch).astype(np.int64)) 71 | midi = counts.argmax() 72 | midi = np.mean(word_valid_pitch[(word_valid_pitch >= midi - 0.5) & (word_valid_pitch < midi + 0.5)]) 73 | note_seq.append(librosa.midi_to_note(midi, cents=True, unicode=False)) 74 | note_dur.append(dur) 75 | 76 | start = end 77 | 78 | item['note_seq'] = ' '.join(note_seq) 79 | item['note_dur'] = ' '.join([str(round(d, 6)) for d in note_dur]) 80 | 81 | with open(transcriptions, 'w', encoding='utf8', newline='') as f: 82 | writer = csv.DictWriter(f, fieldnames=['name', 'ph_seq', 'ph_dur', 'ph_num', 'note_seq', 'note_dur']) 83 | writer.writeheader() 84 | writer.writerows(items) 85 | 86 | 87 | if __name__ == '__main__': 88 | estimate_midi() 89 | -------------------------------------------------------------------------------- /variance-temp-solution/get_pitch.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import numpy as np 4 | import parselmouth 5 | 6 | 7 | def norm_f0(f0): 8 | f0 = np.log2(f0) 9 | return f0 10 | 11 | 12 | def denorm_f0(f0, uv, pitch_padding=None): 13 | f0 = 2 ** f0 14 | if uv is not None: 15 | f0[uv > 0] = 0 16 | if pitch_padding is not None: 17 | f0[pitch_padding] = 0 18 | return f0 19 | 20 | 21 | def interp_f0(f0, uv=None): 22 | if uv is None: 23 | uv = f0 == 0 24 | f0 = norm_f0(f0) 25 | if sum(uv) == len(f0): 26 | f0[uv] = -np.inf 27 | elif sum(uv) > 0: 28 | f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv]) 29 | return denorm_f0(f0, uv=None), uv 30 | 31 | 32 | def resample_align_curve(points: np.ndarray, original_timestep: float, target_timestep: float, align_length: int): 33 | t_max = (len(points) - 1) * original_timestep 34 | curve_interp = np.interp( 35 | np.arange(0, t_max, target_timestep), 36 | original_timestep * np.arange(len(points)), 37 | points 38 | ).astype(points.dtype) 39 | delta_l = align_length - len(curve_interp) 40 | if delta_l < 0: 41 | curve_interp = curve_interp[:align_length] 42 | elif delta_l > 0: 43 | curve_interp = np.concatenate((curve_interp, np.full(delta_l, fill_value=curve_interp[-1])), axis=0) 44 | return curve_interp 45 | 46 | 47 | def get_pitch_parselmouth(wav_data, hop_size, audio_sample_rate, interp_uv=True): 48 | time_step = hop_size / audio_sample_rate 49 | f0_min = 65. 50 | f0_max = 1100. 
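# Praat autocorrelation pitch tracking with one frame per hop; unvoiced frames
# (f0 == 0) are optionally filled by log-domain interpolation in interp_f0.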
51 | 52 | # noinspection PyArgumentList 53 | f0 = ( 54 | parselmouth.Sound(wav_data, sampling_frequency=audio_sample_rate) 55 | .to_pitch_ac( 56 | time_step=time_step, voicing_threshold=0.6, pitch_floor=f0_min, pitch_ceiling=f0_max 57 | ).selected_array["frequency"] 58 | ) 59 | uv = f0 == 0 60 | if interp_uv: 61 | f0, uv = interp_f0(f0, uv) 62 | return time_step, f0, uv 63 | 64 | 65 | rmvpe = None 66 | 67 | 68 | def get_pitch_rmvpe(wav_data, hop_size, audio_sample_rate, interp_uv=True): 69 | global rmvpe 70 | if rmvpe is None: 71 | from rmvpe import RMVPE 72 | rmvpe = RMVPE(pathlib.Path(__file__).parent / 'assets' / 'rmvpe' / 'model.pt') 73 | f0 = rmvpe.infer_from_audio(wav_data, sample_rate=audio_sample_rate) 74 | uv = f0 == 0 75 | f0, uv = interp_f0(f0, uv) 76 | 77 | time_step = hop_size / audio_sample_rate 78 | length = (wav_data.shape[0] + hop_size - 1) // hop_size 79 | f0_res = resample_align_curve(f0, 0.01, time_step, length) 80 | uv_res = resample_align_curve(uv.astype(np.float32), 0.01, time_step, length) > 0.5 81 | if not interp_uv: 82 | f0_res[uv_res] = 0 83 | return time_step, f0_res, uv_res 84 | 85 | 86 | def get_pitch(algorithm, wav_data, hop_size, audio_sample_rate, interp_uv=True): 87 | if algorithm == 'parselmouth': 88 | return get_pitch_parselmouth(wav_data, hop_size, audio_sample_rate, interp_uv=interp_uv) 89 | elif algorithm == 'rmvpe': 90 | return get_pitch_rmvpe(wav_data, hop_size, audio_sample_rate, interp_uv=interp_uv) 91 | else: 92 | raise ValueError(f" [x] Unknown f0 extractor: {algorithm}") 93 | -------------------------------------------------------------------------------- /variance-temp-solution/requirements.txt: -------------------------------------------------------------------------------- 1 | click 2 | librosa<0.10.0 3 | numpy==1.23.5 4 | praat-parselmouth==0.4.3 5 | tqdm 6 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/__init__.py: -------------------------------------------------------------------------------- 1 | from .inference import RMVPE 2 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/constants.py: -------------------------------------------------------------------------------- 1 | SAMPLE_RATE = 16000 2 | 3 | N_CLASS = 360 4 | 5 | N_MELS = 128 6 | MEL_FMIN = 30 7 | MEL_FMAX = 8000 8 | WINDOW_LENGTH = 1024 9 | CONST = 1997.3794084376191 10 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/deepunet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from .constants import N_MELS 4 | 5 | 6 | class ConvBlockRes(nn.Module): 7 | def __init__(self, in_channels, out_channels, momentum=0.01): 8 | super(ConvBlockRes, self).__init__() 9 | self.conv = nn.Sequential( 10 | nn.Conv2d(in_channels=in_channels, 11 | out_channels=out_channels, 12 | kernel_size=(3, 3), 13 | stride=(1, 1), 14 | padding=(1, 1), 15 | bias=False), 16 | nn.BatchNorm2d(out_channels, momentum=momentum), 17 | nn.ReLU(), 18 | 19 | nn.Conv2d(in_channels=out_channels, 20 | out_channels=out_channels, 21 | kernel_size=(3, 3), 22 | stride=(1, 1), 23 | padding=(1, 1), 24 | bias=False), 25 | nn.BatchNorm2d(out_channels, momentum=momentum), 26 | nn.ReLU(), 27 | ) 28 | if in_channels != out_channels: 29 | self.shortcut = nn.Conv2d(in_channels, out_channels, (1, 1)) 30 | self.is_shortcut = True 31 | else: 32 | self.is_shortcut = False 
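    # Residual block: when the channel counts differ, a 1x1 convolution projects the input
    # for the skip connection; otherwise the input is added to the output directly.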
33 | 34 | def forward(self, x): 35 | if self.is_shortcut: 36 | return self.conv(x) + self.shortcut(x) 37 | else: 38 | return self.conv(x) + x 39 | 40 | 41 | class ResEncoderBlock(nn.Module): 42 | def __init__(self, in_channels, out_channels, kernel_size, n_blocks=1, momentum=0.01): 43 | super(ResEncoderBlock, self).__init__() 44 | self.n_blocks = n_blocks 45 | self.conv = nn.ModuleList() 46 | self.conv.append(ConvBlockRes(in_channels, out_channels, momentum)) 47 | for i in range(n_blocks - 1): 48 | self.conv.append(ConvBlockRes(out_channels, out_channels, momentum)) 49 | self.kernel_size = kernel_size 50 | if self.kernel_size is not None: 51 | self.pool = nn.AvgPool2d(kernel_size=kernel_size) 52 | 53 | def forward(self, x): 54 | for i in range(self.n_blocks): 55 | x = self.conv[i](x) 56 | if self.kernel_size is not None: 57 | return x, self.pool(x) 58 | else: 59 | return x 60 | 61 | 62 | class ResDecoderBlock(nn.Module): 63 | def __init__(self, in_channels, out_channels, stride, n_blocks=1, momentum=0.01): 64 | super(ResDecoderBlock, self).__init__() 65 | out_padding = (0, 1) if stride == (1, 2) else (1, 1) 66 | self.n_blocks = n_blocks 67 | self.conv1 = nn.Sequential( 68 | nn.ConvTranspose2d(in_channels=in_channels, 69 | out_channels=out_channels, 70 | kernel_size=(3, 3), 71 | stride=stride, 72 | padding=(1, 1), 73 | output_padding=out_padding, 74 | bias=False), 75 | nn.BatchNorm2d(out_channels, momentum=momentum), 76 | nn.ReLU(), 77 | ) 78 | self.conv2 = nn.ModuleList() 79 | self.conv2.append(ConvBlockRes(out_channels * 2, out_channels, momentum)) 80 | for i in range(n_blocks-1): 81 | self.conv2.append(ConvBlockRes(out_channels, out_channels, momentum)) 82 | 83 | def forward(self, x, concat_tensor): 84 | x = self.conv1(x) 85 | x = torch.cat((x, concat_tensor), dim=1) 86 | for i in range(self.n_blocks): 87 | x = self.conv2[i](x) 88 | return x 89 | 90 | 91 | class Encoder(nn.Module): 92 | def __init__(self, in_channels, in_size, n_encoders, kernel_size, n_blocks, out_channels=16, momentum=0.01): 93 | super(Encoder, self).__init__() 94 | self.n_encoders = n_encoders 95 | self.bn = nn.BatchNorm2d(in_channels, momentum=momentum) 96 | self.layers = nn.ModuleList() 97 | self.latent_channels = [] 98 | for i in range(self.n_encoders): 99 | self.layers.append(ResEncoderBlock(in_channels, out_channels, kernel_size, n_blocks, momentum=momentum)) 100 | self.latent_channels.append([out_channels, in_size]) 101 | in_channels = out_channels 102 | out_channels *= 2 103 | in_size //= 2 104 | self.out_size = in_size 105 | self.out_channel = out_channels 106 | 107 | def forward(self, x): 108 | concat_tensors = [] 109 | x = self.bn(x) 110 | for i in range(self.n_encoders): 111 | _, x = self.layers[i](x) 112 | concat_tensors.append(_) 113 | return x, concat_tensors 114 | 115 | 116 | class Intermediate(nn.Module): 117 | def __init__(self, in_channels, out_channels, n_inters, n_blocks, momentum=0.01): 118 | super(Intermediate, self).__init__() 119 | self.n_inters = n_inters 120 | self.layers = nn.ModuleList() 121 | self.layers.append(ResEncoderBlock(in_channels, out_channels, None, n_blocks, momentum)) 122 | for i in range(self.n_inters-1): 123 | self.layers.append(ResEncoderBlock(out_channels, out_channels, None, n_blocks, momentum)) 124 | 125 | def forward(self, x): 126 | for i in range(self.n_inters): 127 | x = self.layers[i](x) 128 | return x 129 | 130 | 131 | class Decoder(nn.Module): 132 | def __init__(self, in_channels, n_decoders, stride, n_blocks, momentum=0.01): 133 | super(Decoder, self).__init__() 
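        # Each decoder stage halves the channel count; ResDecoderBlock upsamples, then
        # concatenates with the matching encoder feature map (fed in reverse order by forward()).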
134 | self.layers = nn.ModuleList() 135 | self.n_decoders = n_decoders 136 | for i in range(self.n_decoders): 137 | out_channels = in_channels // 2 138 | self.layers.append(ResDecoderBlock(in_channels, out_channels, stride, n_blocks, momentum)) 139 | in_channels = out_channels 140 | 141 | def forward(self, x, concat_tensors): 142 | for i in range(self.n_decoders): 143 | x = self.layers[i](x, concat_tensors[-1-i]) 144 | return x 145 | 146 | 147 | class TimbreFilter(nn.Module): 148 | def __init__(self, latent_rep_channels): 149 | super(TimbreFilter, self).__init__() 150 | self.layers = nn.ModuleList() 151 | for latent_rep in latent_rep_channels: 152 | self.layers.append(ConvBlockRes(latent_rep[0], latent_rep[0])) 153 | 154 | def forward(self, x_tensors): 155 | out_tensors = [] 156 | for i, layer in enumerate(self.layers): 157 | out_tensors.append(layer(x_tensors[i])) 158 | return out_tensors 159 | 160 | 161 | class DeepUnet0(nn.Module): 162 | def __init__(self, kernel_size, n_blocks, en_de_layers=5, inter_layers=4, in_channels=1, en_out_channels=16): 163 | super(DeepUnet0, self).__init__() 164 | self.encoder = Encoder(in_channels, N_MELS, en_de_layers, kernel_size, n_blocks, en_out_channels) 165 | self.intermediate = Intermediate(self.encoder.out_channel // 2, self.encoder.out_channel, inter_layers, n_blocks) 166 | self.tf = TimbreFilter(self.encoder.latent_channels) 167 | self.decoder = Decoder(self.encoder.out_channel, en_de_layers, kernel_size, n_blocks) 168 | 169 | def forward(self, x): 170 | x, concat_tensors = self.encoder(x) 171 | x = self.intermediate(x) 172 | x = self.decoder(x, concat_tensors) 173 | return x 174 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/inference.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torchaudio.transforms import Resample 4 | 5 | from .constants import * 6 | from .model import E2E0 7 | from .spec import MelSpectrogram 8 | from .utils import to_local_average_f0, to_viterbi_f0 9 | 10 | 11 | class RMVPE: 12 | def __init__(self, model_path, hop_length=160): 13 | self.resample_kernel = {} 14 | self.device = 'cuda' if torch.cuda.is_available() else 'cpu' 15 | self.model = E2E0(4, 1, (2, 2)).eval().to(self.device) 16 | ckpt = torch.load(model_path, map_location=self.device) 17 | self.model.load_state_dict(ckpt['model'], strict=False) 18 | self.mel_extractor = MelSpectrogram( 19 | N_MELS, SAMPLE_RATE, WINDOW_LENGTH, hop_length, None, MEL_FMIN, MEL_FMAX 20 | ).to(self.device) 21 | 22 | @torch.no_grad() 23 | def mel2hidden(self, mel): 24 | n_frames = mel.shape[-1] 25 | mel = F.pad(mel, (0, 32 * ((n_frames - 1) // 32 + 1) - n_frames), mode='constant') 26 | hidden = self.model(mel) 27 | return hidden[:, :n_frames] 28 | 29 | def decode(self, hidden, thred=0.03, use_viterbi=False): 30 | if use_viterbi: 31 | f0 = to_viterbi_f0(hidden, thred=thred) 32 | else: 33 | f0 = to_local_average_f0(hidden, thred=thred) 34 | return f0 35 | 36 | def infer_from_audio(self, audio, sample_rate=16000, thred=0.03, use_viterbi=False): 37 | audio = torch.from_numpy(audio).float().unsqueeze(0).to(self.device) 38 | if sample_rate == 16000: 39 | audio_res = audio 40 | else: 41 | key_str = str(sample_rate) 42 | if key_str not in self.resample_kernel: 43 | self.resample_kernel[key_str] = Resample(sample_rate, 16000, lowpass_filter_width=128) 44 | self.resample_kernel[key_str] = self.resample_kernel[key_str].to(self.device) 45 | 
audio_res = self.resample_kernel[key_str](audio) 46 | mel = self.mel_extractor(audio_res, center=True) 47 | hidden = self.mel2hidden(mel) 48 | f0 = self.decode(hidden, thred=thred, use_viterbi=use_viterbi) 49 | return f0 50 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/model.py: -------------------------------------------------------------------------------- 1 | from torch import nn 2 | 3 | from .constants import * 4 | from .deepunet import DeepUnet0 5 | from .seq import BiGRU 6 | 7 | 8 | class E2E0(nn.Module): 9 | def __init__(self, n_blocks, n_gru, kernel_size, en_de_layers=5, inter_layers=4, in_channels=1, 10 | en_out_channels=16): 11 | super(E2E0, self).__init__() 12 | self.unet = DeepUnet0(kernel_size, n_blocks, en_de_layers, inter_layers, in_channels, en_out_channels) 13 | self.cnn = nn.Conv2d(en_out_channels, 3, (3, 3), padding=(1, 1)) 14 | if n_gru: 15 | self.fc = nn.Sequential( 16 | BiGRU(3 * N_MELS, 256, n_gru), 17 | nn.Linear(512, N_CLASS), 18 | nn.Dropout(0.25), 19 | nn.Sigmoid() 20 | ) 21 | else: 22 | self.fc = nn.Sequential( 23 | nn.Linear(3 * N_MELS, N_CLASS), 24 | nn.Dropout(0.25), 25 | nn.Sigmoid() 26 | ) 27 | 28 | def forward(self, mel): 29 | mel = mel.transpose(-1, -2).unsqueeze(1) 30 | x = self.cnn(self.unet(mel)).transpose(1, 2).flatten(-2) 31 | x = self.fc(x) 32 | return x 33 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/seq.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | 4 | class BiGRU(nn.Module): 5 | def __init__(self, input_features, hidden_features, num_layers): 6 | super(BiGRU, self).__init__() 7 | self.gru = nn.GRU(input_features, hidden_features, num_layers=num_layers, batch_first=True, bidirectional=True) 8 | 9 | def forward(self, x): 10 | return self.gru(x)[0] 11 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/spec.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import torch.nn.functional as F 4 | from librosa.filters import mel 5 | 6 | 7 | class MelSpectrogram(torch.nn.Module): 8 | def __init__( 9 | self, 10 | n_mel_channels, 11 | sampling_rate, 12 | win_length, 13 | hop_length, 14 | n_fft=None, 15 | mel_fmin=0, 16 | mel_fmax=None, 17 | clamp=1e-5 18 | ): 19 | super().__init__() 20 | n_fft = win_length if n_fft is None else n_fft 21 | self.hann_window = {} 22 | mel_basis = mel( 23 | sr=sampling_rate, 24 | n_fft=n_fft, 25 | n_mels=n_mel_channels, 26 | fmin=mel_fmin, 27 | fmax=mel_fmax, 28 | htk=True) 29 | mel_basis = torch.from_numpy(mel_basis).float() 30 | self.register_buffer("mel_basis", mel_basis) 31 | self.n_fft = win_length if n_fft is None else n_fft 32 | self.hop_length = hop_length 33 | self.win_length = win_length 34 | self.sampling_rate = sampling_rate 35 | self.n_mel_channels = n_mel_channels 36 | self.clamp = clamp 37 | 38 | def forward(self, audio, keyshift=0, speed=1, center=True): 39 | factor = 2 ** (keyshift / 12) 40 | n_fft_new = int(np.round(self.n_fft * factor)) 41 | win_length_new = int(np.round(self.win_length * factor)) 42 | hop_length_new = int(np.round(self.hop_length * speed)) 43 | 44 | keyshift_key = str(keyshift) + '_' + str(audio.device) 45 | if keyshift_key not in self.hann_window: 46 | self.hann_window[keyshift_key] = torch.hann_window(win_length_new).to(audio.device) 47 | 48 | fft = 
torch.stft( 49 | audio, 50 | n_fft=n_fft_new, 51 | hop_length=hop_length_new, 52 | win_length=win_length_new, 53 | window=self.hann_window[keyshift_key], 54 | center=center, 55 | return_complex=True 56 | ) 57 | magnitude = fft.abs() 58 | 59 | if keyshift != 0: 60 | size = self.n_fft // 2 + 1 61 | resize = magnitude.size(1) 62 | if resize < size: 63 | magnitude = F.pad(magnitude, (0, 0, 0, size - resize)) 64 | magnitude = magnitude[:, :size, :] * self.win_length / win_length_new 65 | 66 | mel_output = torch.matmul(self.mel_basis, magnitude) 67 | log_mel_spec = torch.log(torch.clamp(mel_output, min=self.clamp)) 68 | return log_mel_spec 69 | -------------------------------------------------------------------------------- /variance-temp-solution/rmvpe/utils.py: -------------------------------------------------------------------------------- 1 | import librosa 2 | import numpy as np 3 | import torch 4 | 5 | from .constants import * 6 | 7 | 8 | def to_local_average_f0(hidden, center=None, thred=0.03): 9 | idx = torch.arange(N_CLASS, device=hidden.device)[None, None, :] # [B=1, T=1, N] 10 | idx_cents = idx * 20 + CONST # [B=1, N] 11 | if center is None: 12 | center = torch.argmax(hidden, dim=2, keepdim=True) # [B, T, 1] 13 | start = torch.clip(center - 4, min=0) # [B, T, 1] 14 | end = torch.clip(center + 5, max=N_CLASS) # [B, T, 1] 15 | idx_mask = (idx >= start) & (idx < end) # [B, T, N] 16 | weights = hidden * idx_mask # [B, T, N] 17 | product_sum = torch.sum(weights * idx_cents, dim=2) # [B, T] 18 | weight_sum = torch.sum(weights, dim=2) # [B, T] 19 | cents = product_sum / (weight_sum + (weight_sum == 0)) # avoid dividing by zero, [B, T] 20 | f0 = 10 * 2 ** (cents / 1200) 21 | uv = hidden.max(dim=2)[0] < thred # [B, T] 22 | f0 = f0 * ~uv 23 | return f0.squeeze(0).cpu().numpy() 24 | 25 | 26 | def to_viterbi_f0(hidden, thred=0.03): 27 | # Create viterbi transition matrix 28 | if not hasattr(to_viterbi_f0, 'transition'): 29 | xx, yy = np.meshgrid(range(N_CLASS), range(N_CLASS)) 30 | transition = np.maximum(30 - abs(xx - yy), 0) 31 | transition = transition / transition.sum(axis=1, keepdims=True) 32 | to_viterbi_f0.transition = transition 33 | 34 | # Convert to probability 35 | prob = hidden.squeeze(0).cpu().numpy() 36 | prob = prob.T 37 | prob = prob / prob.sum(axis=0) 38 | 39 | # Perform viterbi decoding 40 | path = librosa.sequence.viterbi(prob, to_viterbi_f0.transition).astype(np.int64) 41 | center = torch.from_numpy(path).unsqueeze(0).unsqueeze(-1).to(hidden.device) 42 | 43 | return to_local_average_f0(hidden, center=center, thred=thred) 44 | --------------------------------------------------------------------------------