├── .gitignore
├── LICENSE
├── README.md
├── README_ORIG.md
├── README_notes.md
├── add_duration_to_transcript.py
├── architectures.py
├── calculate_CDP_Ain_Aout.py
├── config
│   ├── lj_01.cfg
│   ├── lj_02.cfg
│   ├── lj_03.cfg
│   ├── lj_04.cfg
│   ├── lj_05.cfg
│   ├── lj_test.cfg
│   ├── lj_tutorial.cfg
│   ├── nancy2nick_01.cfg
│   ├── nancy2nick_02.cfg
│   ├── nancy_01.cfg
│   ├── nancyplusnick_01.cfg
│   ├── nancyplusnick_02.cfg
│   ├── nancyplusnick_03.cfg
│   ├── nancyplusnick_04_lcc.cfg
│   ├── nancyplusnick_05_lcc_sdpe.cfg
│   ├── project
│   │   ├── baseline.cfg
│   │   ├── c-labels_plus_te.cfg
│   │   ├── fa_as_attention.cfg
│   │   ├── fa_as_guide.cfg
│   │   ├── fa_as_target.cfg
│   │   ├── labels_minus_te.cfg
│   │   └── labels_plus_te.cfg
│   ├── ssw10
│   │   ├── G1ABC_01.cfg
│   │   ├── G1ABC_02.cfg
│   │   ├── G1ABC_03.cfg
│   │   ├── G1AB_03.cfg
│   │   ├── G1BC_03.cfg
│   │   ├── G1B_03.cfg
│   │   ├── G1_03.cfg
│   │   ├── G2_03.cfg
│   │   ├── README.md
│   │   ├── W2A_03.cfg
│   │   ├── W2B_03.cfg
│   │   └── W2_03.cfg
│   ├── vctk_01.cfg
│   ├── vctk_02.cfg
│   └── vctk_03_lcc.cfg
├── configuration.py
├── convert_alignment_to_guide.py
├── copy_synth_GL.py
├── copy_synth_SSRN_GL.py
├── data_load.py
├── doc
│   ├── festival_install.md
│   ├── old_readme.md
│   ├── preparing_new_database.md
│   ├── recipe_WaveRNN.md
│   ├── recipe_nancy.md
│   ├── recipe_nancy2nick.md
│   ├── recipe_nancyplusnick.md
│   ├── recipe_project.md
│   ├── recipe_semisupervised.md
│   └── recipe_vctk.md
├── fig
│   ├── aaa
│   ├── attention.gif
│   └── training_curves.png
├── labels2tra.sh
├── libutil.py
├── logger_setup.py
├── modules.py
├── networks.py
├── objective_measures.py
├── plot_loss.py
├── prepare_acoustic_features.py
├── prepare_attention_guides.py
├── requirements.txt
├── script
│   ├── add_durations_to_transcript.py
│   ├── add_speaker.py
│   ├── check_transcript.py
│   ├── festival
│   │   ├── csv2scm.py
│   │   ├── fix_transcript.py
│   │   ├── make_rich_phones.scm
│   │   ├── make_rich_phones_cmulex.scm
│   │   ├── make_rich_phones_combirpx_noplex.scm
│   │   └── multi_transcript.py
│   ├── get_transcriptions.sh
│   ├── interpolate_unvoiced.py
│   ├── libutil.py
│   ├── make_internal_webchart.py
│   ├── munge_nsf_data.py
│   ├── normalise_level.py
│   ├── pitchmark_speech.py
│   ├── prepare_world_features.py
│   ├── process_merlin_label.py
│   ├── process_merlin_label_positions.py
│   ├── remove_end_silences.py
│   └── split_speech.py
├── synthesise_validation_waveforms.py
├── synthesize.py
├── test.tmp
├── tool
│   └── bin
│       └── sv56demo
├── train.py
├── util
│   ├── find_free_gpu.py
│   ├── gpu_lock.py
│   ├── submit_tf.sh
│   ├── submit_tf_2.sh
│   ├── submit_tf_cpu.sh
│   ├── sync_code_from_afs.sh
│   └── sync_output_to_afs.sh
└── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.cfgc
3 | tool/*
4 | !tool/bin/sv56demo
--------------------------------------------------------------------------------
/README_ORIG.md:
--------------------------------------------------------------------------------
1 | # A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
2 |
3 | I implement yet another text-to-speech model, dc-tts, introduced in [Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention](https://arxiv.org/abs/1710.08969). My goal, however, is not just replicating the paper. Rather, I'd like to gain insights about various sound projects.
4 |
5 | ## Requirements
6 | * NumPy >= 1.11.1
7 | * TensorFlow >= 1.3 (Note that the API of `tf.contrib.layers.layer_norm` has changed since 1.3)
8 | * librosa
9 | * tqdm
10 | * matplotlib
11 | * scipy
12 |
13 | ## Data
14 |
15 |
16 |
17 |
18 |
19 |
20 | I train English models and a Korean model on four different speech datasets.
1. [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
2. [Nick Offerman's Audiobooks](https://www.audible.com.au/search?searchNarrator=Nick+Offerman)
3. [Kate Winslet's Audiobook](https://www.audible.com.au/pd/Classics/Therese-Raquin-Audiobook/B00FF0SLW4/ref=a_search_c4_1_3_srTtl?qid=1516854754&sr=1-3)
4. [KSS Dataset](https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset)
21 |
22 | The LJ Speech Dataset has recently become widely used as a benchmark for TTS because it is publicly available and contains 24 hours of reasonable-quality samples.
23 | Nick's and Kate's audiobooks are additionally used to see whether the model can learn even from smaller amounts of more variable speech. They are 18 hours and 5 hours long, respectively. Finally, the KSS Dataset is a Korean single-speaker speech dataset of more than 12 hours.
24 |
25 |
26 | ## Training
27 | * STEP 0. Download [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/) or prepare your own data.
28 | * STEP 1. Adjust hyperparameters in `hyperparams.py`. (If you want to do preprocessing, set `prepro` to True.)
29 | * STEP 2. Run `python train.py 1` for training Text2Mel. (If you set `prepro` to True, run `python prepro.py` first.)
30 | * STEP 3. Run `python train.py 2` for training SSRN.
31 |
32 | You can do STEPs 2 and 3 at the same time if you have more than one GPU card.
33 |
34 | ## Training Curves
35 |
36 |
37 |
38 | ## Attention Plot
39 |
40 |
41 | ## Sample Synthesis
42 | I generate speech samples based on the [Harvard Sentences](http://www.cs.columbia.edu/~hgs/audio/harvard.html), as the original paper does. The sentence list is already included in the repo.
43 |
44 | * Run `synthesize.py` and check the files in `samples`.
45 |
46 | ## Generated Samples
47 |
48 | | Dataset | Samples |
49 | | :----- |:-------------|
50 | | LJ | [50k](https://soundcloud.com/kyubyong-park/sets/dc_tts) [200k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_200k) [310k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_310k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_800k)|
51 | | Nick | [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_40k) [170k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_170k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_800k)|
52 | | Kate| [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_40k) [160k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_160k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_800k) |
53 | | KSS| [400k](https://soundcloud.com/kyubyong-park/sets/dc_tts_ko_400k) |
54 |
55 | ## Pretrained Model for LJ
56 |
57 | Download [this](https://www.dropbox.com/s/1oyipstjxh2n5wo/LJ_logdir.tar?dl=0).
58 |
59 | ## Notes
60 |
61 | * The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
62 | * The paper fixed the learning rate to 0.001, but it didn't work for me. So I decayed it.
63 | * I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.
64 | * The authors claimed that the model can be trained within a day, but unfortunately I was not so lucky. However, it is obviously much faster than Tacotron, as it uses only convolution layers.
65 | * Thanks to the guided attention, the attention plot looks monotonic almost from the beginning (a guide matrix of this form is sketched at the end of this file). I guess this holds the alignment tight so it won't lose track.
66 | * The paper didn't mention dropout. I applied it as I believe it helps with regularization.
67 | * Check also other TTS models such as [Tacotron](https://github.com/kyubyong/tacotron) and [Deep Voice 3](https://github.com/kyubyong/deepvoice3).
68 |
69 |
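For reference, a minimal sketch of the guided attention weight matrix described in the paper (the band width corresponds to the `g` hyperparameter in this codebase's config files); the function below is illustrative only, not code from this repository:

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Guide matrix W[n, t] = 1 - exp(-(n/N - t/T)**2 / (2*g**2)).

    Entries near the diagonal are close to 0 (little penalty) and entries far
    from it approach 1, so multiplying W element-wise with the attention matrix
    and averaging gives the guided attention loss term.
    """
    W = np.zeros((N, T), dtype=np.float32)
    for n in range(N):
        for t in range(T):
            W[n, t] = 1.0 - np.exp(-((n / float(N) - t / float(T)) ** 2) / (2.0 * g ** 2))
    return W

W = guided_attention_weights(40, 100)  # e.g. 40 phones against 100 coarse mel frames
```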
--------------------------------------------------------------------------------
/add_duration_to_transcript.py:
--------------------------------------------------------------------------------
1 |
2 | import numpy as np
3 | import os
4 | import sys
5 | import os.path
6 |
7 | # Usage: python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file
8 | fa_matrices_dir = sys.argv[1]
9 | transcript = sys.argv[2]
10 | out_transcript = sys.argv[3]
11 |
12 | with open(transcript, 'r') as f:
13 | tra = f.readlines()
14 |
15 | f = open(out_transcript,'w')
16 |
17 |
18 | for t in tra:
19 |
20 | file = t.split('|')[0]
21 | phones = t.split('|')[3]
22 | num_phones = len(phones.split(' '))
23 |
24 | if os.path.exists(fa_matrices_dir + file + '.npy'):
25 |
26 | A = np.load( fa_matrices_dir + file + '.npy')
27 | durations = np.sum(A,axis=0)
28 |         durations = 4*durations # durations from the attention matrix are in 50ms (coarse) frames; convert to 12.5ms frames
29 | assert(num_phones == A.shape[1])
30 | assert(num_phones == len(durations))
31 |
32 | f.write(t.rstrip() + '||' + " ".join(map(str, durations.astype(int))) + '\n')
33 |
34 | else:
35 | print("Missing " + file)
36 |
37 | f.close()
38 |
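# Worked example of the conversion above (illustrative numbers): a phone whose column of the
# attention matrix sums to 3 coarse frames covers 3 * 50 ms = 150 ms of speech, which is
# written to the transcript as 4 * 3 = 12 frames at the model's 12.5 ms frame shift.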
--------------------------------------------------------------------------------
/calculate_CDP_Ain_Aout.py:
--------------------------------------------------------------------------------
1 |
2 | import sys
3 | import math
4 | import glob
5 | import numpy as np
6 | import matplotlib
7 | import matplotlib.pyplot as plt
8 |
9 | def get_att_per_input(A):
10 |
11 | att_per_input = np.trim_zeros(np.sum(A,axis=1),'b')
12 | num_input = len(att_per_input)
13 |
14 | return att_per_input, num_input
15 |
16 | # Coverage deviation penalty: punishes input tokens that are under-represented in the output
17 | # (lack of attention -> skips) or over-represented (too much attention -> prolongation).
18 | def getCDP(A):
19 |
20 | att_per_input, num_input = get_att_per_input(A)
21 | tmp = (1. - att_per_input )**2
22 | tmp = np.sum( np.log( 1. + (1. - att_per_input )**2 ) )
23 | CDP = tmp / num_input # removed the minus sign
24 |
25 | return CDP
26 |
27 | def getEnt(A):
28 |
29 | Entr = 0.0
30 |
31 | for a in A: # traverse rows of A
32 | norm = np.sum(a)
33 | normPd = a
34 | if norm != 0.0:
35 | normPd = [ p / norm for p in a ]
36 | entr = np.sum( [ ( p * np.log(p) if (p!=0.0) else 0.0 ) for p in normPd ] )
37 | Entr += entr
38 |
39 | Entr = -Entr/A.shape[0]
40 | Entr /= np.log(A.shape[1]) # to normalise it between 0-1
41 |
42 | return Entr
43 |
44 |
45 | # Absentmindedness penalty: punishes scattered attention; dispersion is measured via the entropy
46 | def getAP(A):
47 |
48 | att_per_input, num_input = get_att_per_input(A)
49 | num_output = A.shape[1]
50 | A = A[:num_input,:]
51 | APout = 0.0
52 | APin = 0.0
53 |
54 | APin = getEnt(A)
55 | APout = getEnt(np.transpose(A))
56 |
57 | return APin, APout
58 |
59 | def main(file_name):
60 |
61 | A = np.load(file_name) # matrix axis 0: input (phones) / axis 1: output (acoustic)
62 | print(A.shape)
63 | # fig, ax = plt.subplots()
64 | # im = ax.imshow(A)
65 | # fig.colorbar(im)
66 | # plt.ylabel('Encoder timestep')
67 | # plt.xlabel('Decoder timestep')
68 | # plt.show()
69 |
70 | CDP = getCDP(A)
71 | APin, APout = getAP(A)
72 |
73 | print('CDP: ' + str(CDP))
74 | print('Ain: ' + str(APin))
75 | print('Aout: ' + str(APout))
76 |
77 | if __name__ == '__main__':
78 |
79 | # Input attention matrix - numpy format - axis 0: input (phones) / axis 1: output (acoustic)
80 |
81 | file_name = sys.argv[1]
82 |
83 | main(file_name)
84 |
85 |
86 |
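# Quick sanity check (illustrative only): a perfectly diagonal attention matrix should score
# close to zero on all three metrics, e.g.
#
#     import numpy as np
#     A = np.eye(5)   # each of 5 inputs attended by exactly one output frame
#     getCDP(A)       # -> 0.0, since every input receives exactly one unit of attention
#     getAP(A)        # -> (APin, APout) both ~0.0, since there is no dispersion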
--------------------------------------------------------------------------------
/config/lj_01.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = os.path.join(topworkdir, 'lj_test', 'data') ## use older data
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
24 |
25 |
26 |
27 | # data locations:
28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
29 |
30 |
31 | transcript = os.path.join(datadir, 'transcript.csv')
32 | test_transcript = os.path.join(datadir, 'test_transcript.csv')
33 | waveforms = os.path.join(datadir, 'wav_trim')
34 |
35 |
36 |
37 | input_type = 'phones' ## letters or phones
38 | ## Combilex RPX plus extra symbols:-
39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '>', \
40 | '<_END_>', '<_START_>', \
41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\
42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\
43 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
44 |
45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
46 | max_N = 180 # Maximum number of characters. # !!!
47 | max_T = 210 # Maximum number of mel frames. # !!!
48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
51 |
52 |
53 |
54 | # signal processing
55 | trim_before_spectrogram_extraction = 0
56 | vocoder = 'griffin_lim'
57 | sr = 22050 # Sampling rate.
58 | n_fft = 2048 # fft points (samples)
59 | frame_shift = 0.0125 # seconds
60 | frame_length = 0.05 # seconds
61 | hop_length = int(sr * frame_shift)
62 | win_length = int(sr * frame_length)
63 | prepro = True # don't extract spectrograms on the fly
64 | full_dim = n_fft//2+1
65 | n_mels = 80 # Number of Mel banks to generate
66 | power = 1.5 # Exponent for amplifying the predicted magnitude
67 | n_iter = 50 # Number of inversion iterations
68 | preemphasis = .97
69 | max_db = 100
70 | ref_db = 20
71 |
72 |
73 | # Model
74 | r = 4 # Reduction factor. Do not change this.
75 | dropout_rate = 0.05
76 | e = 128 # == embedding
77 | d = 256 # == hidden units of Text2Mel
78 | c = 512 # == hidden units of SSRN
79 | attention_win_size = 3
80 | g = 0.2 ## determines width of band in attention guide
81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
82 |
83 | ## loss weights : T2M
84 | lw_mel = 0.3333
85 | lw_bd1 = 0.3333
86 | lw_att = 0.3333
87 | ## : SSRN
88 | lw_mag = 0.5
89 | lw_bd2 = 0.5
90 |
91 |
92 | ## validation:
93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
94 | validation_sentences_to_evaluate = 32
95 | validation_sentences_to_synth_params = 3
96 |
97 |
98 | # training scheme
99 | restart_from_savepath = []
100 | lr = 0.001 # Initial learning rate.
101 | batchsize = {'t2m': 32, 'ssrn': 32}
102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
103 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
104 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space.
105 | max_epochs = 300
106 |
107 | # attention plotting during training
108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
110 |
--------------------------------------------------------------------------------
/config/lj_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## rpx as lj_01, 16kHz as lj_02, but experiment with training babbler before T2M
5 |
6 | import os
7 |
8 |
9 | ## Take name of this file to be the config name:
10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
11 |
12 | ## Define place to put outputs relative to this config file's location;
13 | ## supply an absolute path to work elsewhere:
14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
15 |
16 | voicedir = os.path.join(topworkdir, config_name)
17 | logdir = os.path.join(voicedir, 'train')
18 | sampledir = os.path.join(voicedir, 'synth')
19 |
20 | ## Change featuredir to absolute path to use existing features
21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data
22 | coarse_audio_dir = os.path.join(featuredir, 'mels')
23 | full_mel_dir = os.path.join(featuredir, 'full_mels')
24 | full_audio_dir = os.path.join(featuredir, 'mags')
25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
26 |
27 | bucket_data_by = 'audio_length'
28 |
29 | # data locations:
30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
31 |
32 |
33 | transcript = os.path.join(datadir, 'transcript.csv')
34 | test_transcript = os.path.join(datadir, 'test_transcript.csv')
35 | waveforms = os.path.join(datadir, 'wav_trim')
36 |
37 |
38 |
39 | input_type = 'phones' ## letters or phones
40 | ## Combilex RPX plus extra symbols:-
41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '>', \
42 | '<_END_>', '<_START_>', \
43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\
44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\
45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
46 |
47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
48 | max_N = 180 # Maximum number of characters. # !!!
49 | max_T = 210 # Maximum number of mel frames. # !!!
50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
51 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
53 |
54 |
55 |
56 | # signal processing
57 | trim_before_spectrogram_extraction = 0
58 | vocoder = 'griffin_lim'
59 | sr = 16000 # Sampling rate.
60 | n_fft = 2048 # fft points (samples)
61 | frame_shift = 0.0125 # seconds
62 | frame_length = 0.05 # seconds
63 | hop_length = int(sr * frame_shift)
64 | win_length = int(sr * frame_length)
65 | prepro = True # don't extract spectrograms on the fly
66 | full_dim = n_fft//2+1
67 | n_mels = 80 # Number of Mel banks to generate
68 | power = 1.5 # Exponent for amplifying the predicted magnitude
69 | n_iter = 50 # Number of inversion iterations
70 | preemphasis = .97
71 | max_db = 100
72 | ref_db = 20
73 |
74 |
75 | # Model
76 | r = 4 # Reduction factor. Do not change this.
77 | dropout_rate = 0.05
78 | e = 128 # == embedding
79 | d = 256 # == hidden units of Text2Mel
80 | c = 512 # == hidden units of SSRN
81 | attention_win_size = 3
82 | g = 0.2 ## determines width of band in attention guide
83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
84 |
85 | ## loss weights : T2M
86 | lw_mel = 0.3333
87 | lw_bd1 = 0.3333
88 | lw_att = 0.3333
89 | ## : SSRN
90 | lw_mag = 0.5
91 | lw_bd2 = 0.5
92 |
93 | loss_weights = {'babbler': {'L1': 0.5, 'binary_divergence': 0.5}} ## TODO: make other loss config consistent
94 |
95 | ## validation:
96 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
97 | validation_sentences_to_evaluate = 32
98 | validation_sentences_to_synth_params = 3
99 |
100 |
101 | # training scheme
102 | restart_from_savepath = []
103 | lr = 0.001 # Initial learning rate.
104 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32}
105 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
106 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
107 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space.
108 | max_epochs = 300
109 |
110 | # attention plotting during training
111 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
112 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
113 |
--------------------------------------------------------------------------------
/config/lj_04.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## rpx as lj_01, 16kHz as lj_02, but experiment with training babbler before T2M
5 | ## stage 2: fine tune on subset of transcribed data
6 | import os
7 |
8 |
9 | ## Take name of this file to be the config name:
10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
11 |
12 | ## Define place to put outputs relative to this config file's location;
13 | ## supply an absolute path to work elsewhere:
14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
15 |
16 | voicedir = os.path.join(topworkdir, config_name)
17 | logdir = os.path.join(voicedir, 'train')
18 | sampledir = os.path.join(voicedir, 'synth')
19 |
20 | ## Change featuredir to absolute path to use existing features
21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data
22 | coarse_audio_dir = os.path.join(featuredir, 'mels')
23 | full_mel_dir = os.path.join(featuredir, 'full_mels')
24 | full_audio_dir = os.path.join(featuredir, 'mags')
25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
26 |
27 |
28 |
29 | # data locations:
30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
31 |
32 |
33 | transcript = os.path.join(datadir, 'transcript.csv')
34 | test_transcript = os.path.join(datadir, 'test_transcript.csv')
35 | waveforms = os.path.join(datadir, 'wav_trim')
36 |
37 |
38 |
39 | input_type = 'phones' ## letters or phones
40 | ## Combilex RPX plus extra symbols:-
41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '>', \
42 | '<_END_>', '<_START_>', \
43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\
44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\
45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
46 |
47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
48 | max_N = 180 # Maximum number of characters. # !!!
49 | max_T = 210 # Maximum number of mel frames. # !!!
50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
51 | n_utts = 1000 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
53 |
54 |
55 |
56 | # signal processing
57 | trim_before_spectrogram_extraction = 0
58 | vocoder = 'griffin_lim'
59 | sr = 16000 # Sampling rate.
60 | n_fft = 2048 # fft points (samples)
61 | frame_shift = 0.0125 # seconds
62 | frame_length = 0.05 # seconds
63 | hop_length = int(sr * frame_shift)
64 | win_length = int(sr * frame_length)
65 | prepro = True # don't extract spectrograms on the fly
66 | full_dim = n_fft//2+1
67 | n_mels = 80 # Number of Mel banks to generate
68 | power = 1.5 # Exponent for amplifying the predicted magnitude
69 | n_iter = 50 # Number of inversion iterations
70 | preemphasis = .97
71 | max_db = 100
72 | ref_db = 20
73 |
74 |
75 | # Model
76 | r = 4 # Reduction factor. Do not change this.
77 | dropout_rate = 0.05
78 | e = 128 # == embedding
79 | d = 256 # == hidden units of Text2Mel
80 | c = 512 # == hidden units of SSRN
81 | attention_win_size = 3
82 | g = 0.2 ## determines width of band in attention guide
83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
84 |
85 | ## loss weights : T2M
86 | lw_mel = 0.3333
87 | lw_bd1 = 0.3333
88 | lw_att = 0.3333
89 | ## : SSRN
90 | lw_mag = 0.5
91 | lw_bd2 = 0.5
92 |
93 |
94 | ## validation:
95 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
96 | validation_sentences_to_evaluate = 32
97 | validation_sentences_to_synth_params = 3
98 |
99 |
100 | # training scheme
101 | restart_from_savepath = []
102 | lr = 0.001 # Initial learning rate.
103 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32}
104 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
105 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
106 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space.
107 | max_epochs = 1000
108 |
109 | # attention plotting during training
110 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
111 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
112 |
--------------------------------------------------------------------------------
/config/lj_test.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = os.path.join(voicedir, 'data')
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
24 |
25 |
26 |
27 | # data locations:
28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
29 |
30 |
31 | transcript = os.path.join(datadir, 'transcript.csv')
32 | test_transcript = os.path.join(datadir, 'test_transcript.csv')
33 | waveforms = os.path.join(datadir, 'wav_trim')
34 |
35 |
36 |
37 | input_type = 'phones' ## letters or phones
38 | ## Combilex RPX plus extra symbols:-
39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '>', \
40 | '<_END_>', '<_START_>', \
41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\
42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\
43 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
44 |
45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
46 | max_N = 180 # Maximum number of characters. # !!!
47 | max_T = 210 # Maximum number of mel frames. # !!!
48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
49 | n_utts = 200 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
51 |
52 |
53 |
54 | # signal processing
55 | trim_before_spectrogram_extraction = 0
56 | vocoder = 'griffin_lim'
57 | sr = 22050 # Sampling rate.
58 | n_fft = 2048 # fft points (samples)
59 | frame_shift = 0.0125 # seconds
60 | frame_length = 0.05 # seconds
61 | hop_length = int(sr * frame_shift)
62 | win_length = int(sr * frame_length)
63 | prepro = True # don't extract spectrograms on the fly
64 | full_dim = n_fft//2+1
65 | n_mels = 80 # Number of Mel banks to generate
66 | power = 1.5 # Exponent for amplifying the predicted magnitude
67 | n_iter = 50 # Number of inversion iterations
68 | preemphasis = .97
69 | max_db = 100
70 | ref_db = 20
71 |
72 |
73 | # Model
74 | r = 4 # Reduction factor. Do not change this.
75 | dropout_rate = 0.05
76 | e = 128 # == embedding
77 | d = 256 # == hidden units of Text2Mel
78 | c = 512 # == hidden units of SSRN
79 | attention_win_size = 3
80 | g = 0.2 ## determines width of band in attention guide
81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
82 |
83 | ## loss weights : T2M
84 | lw_mel = 0.3333
85 | lw_bd1 = 0.3333
86 | lw_att = 0.3333
87 | ## : SSRN
88 | lw_mag = 0.5
89 | lw_bd2 = 0.5
90 |
91 |
92 | ## validation:
93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
94 | validation_sentences_to_evaluate = 32
95 | validation_sentences_to_synth_params = 3
96 |
97 |
98 | # training scheme
99 | restart_from_savepath = []
100 | lr = 0.001 # Initial learning rate.
101 | batchsize = {'t2m': 32, 'ssrn': 32}
102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
103 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters
104 | save_every_n_epochs = 2 ## as well as 5 latest models, how often to archive a model
105 | max_epochs = 4
106 |
107 | # attention plotting during training
108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
--------------------------------------------------------------------------------
/config/lj_tutorial.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = os.path.join(voicedir, 'data')
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
24 |
25 |
26 |
27 | # Data locations:
28 | datadir = '/path/to/LJSpeech-1.1/'
29 |
30 |
31 | transcript = os.path.join(datadir, 'transcript.csv')
32 | test_transcript = os.path.join(datadir, 'test_transcript.csv')
33 | waveforms = os.path.join(datadir, 'wav_trim')
34 |
35 |
36 | input_type = 'phones' ## letters or phones
37 |
38 | ## CMU phones:
39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh']
40 |
41 | max_N = 150 # Maximum number of characters/phones
42 | max_T = 200 # Maximum number of mel frames
43 |
44 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
45 | n_utts = 200 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
46 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
47 |
48 |
49 |
50 | # signal processing
51 | trim_before_spectrogram_extraction = 0
52 | vocoder = 'griffin_lim'
53 | sr = 22050 # Sampling rate.
54 | n_fft = 2048 # fft points (samples)
55 | frame_shift = 0.0125 # seconds
56 | frame_length = 0.05 # seconds
57 | hop_length = int(sr * frame_shift)
58 | win_length = int(sr * frame_length)
59 | prepro = True # don't extract spectrograms on the fly
60 | full_dim = n_fft//2+1
61 | n_mels = 80 # Number of Mel banks to generate
62 | power = 1.5 # Exponent for amplifying the predicted magnitude
63 | n_iter = 50 # Number of inversion iterations
64 | preemphasis = .97
65 | max_db = 100
66 | ref_db = 20
67 |
68 |
69 | # Model
70 | r = 4 # Reduction factor. Do not change this.
71 | dropout_rate = 0.05
72 | e = 128 # == embedding
73 | d = 256 # == hidden units of Text2Mel
74 | c = 512 # == hidden units of SSRN
75 | attention_win_size = 3
76 | g = 0.2 ## determines width of band in attention guide
77 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
78 |
79 | ## loss weights : T2M
80 | lw_mel = 0.25
81 | lw_bd1 = 0.25
82 | lw_att = 0.25
83 | lw_t2m_l2 = 0.25
84 | ## : SSRN
85 | lw_mag = 0.3333
86 | lw_bd2 = 0.3333
87 | lw_ssrn_l2 = 0.3333
88 |
89 |
90 | ## validation:
91 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
92 | validation_sentences_to_evaluate = 32
93 | validation_sentences_to_synth_params = 3
94 |
95 |
96 | # training scheme
97 | restart_from_savepath = []
98 | lr = 0.001 # Initial learning rate.
99 | batchsize = {'t2m': 32, 'ssrn': 8}
100 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
101 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters
102 | save_every_n_epochs = 5 ## as well as 5 latest models, how often to archive a model
103 | max_epochs = 10
104 |
105 | # attention plotting during training
106 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
107 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
108 |
--------------------------------------------------------------------------------
/config/nancy2nick_01.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | nancy base voice retrained on nick for ICPhS in first instance.
6 | '''
7 |
8 |
9 | import os
10 |
11 |
12 | ## Take name of this file to be the config name:
13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
14 |
15 | ## Define place to put outputs relative to this config file's location;
16 | ## supply an absolute path to work elsewhere:
17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
18 |
19 | voicedir = os.path.join(topworkdir, config_name)
20 | logdir = os.path.join(voicedir, 'train')
21 | sampledir = os.path.join(voicedir, 'synth')
22 |
23 | ## Change featuredir to absolute path to use existing features
24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data
25 | coarse_audio_dir = os.path.join(featuredir, 'mels')
26 | full_mel_dir = os.path.join(featuredir, 'full_mels')
27 | full_audio_dir = os.path.join(featuredir, 'mags')
28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
29 |
30 |
31 |
32 | # data locations:
33 | datadir = '/group/project/cstr2/owatts/data/nick16k/'
34 |
35 |
36 | transcript = os.path.join(datadir, 'transcript.csv')
37 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv')
38 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
39 | waveforms = os.path.join(datadir, 'wav_trim')
40 |
41 |
42 |
43 | input_type = 'phones' ## letters or phones
44 |
45 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
46 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
47 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
48 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
49 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
50 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
51 |
52 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
53 | max_N = 140 # Maximum number of characters. # !!!
54 | max_T = 200 # Maximum number of mel frames. # !!!
55 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
56 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
57 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
58 |
59 |
60 |
61 | # signal processing
62 | trim_before_spectrogram_extraction = 0
63 | vocoder = 'griffin_lim'
64 | sr = 16000 # Sampling rate.
65 | n_fft = 2048 # fft points (samples)
66 | frame_shift = 0.0125 # seconds
67 | frame_length = 0.05 # seconds
68 | hop_length = int(sr * frame_shift)
69 | win_length = int(sr * frame_length)
70 | prepro = True # don't extract spectrograms on the fly
71 | full_dim = n_fft//2+1
72 | n_mels = 80 # Number of Mel banks to generate
73 | power = 1.5 # Exponent for amplifying the predicted magnitude
74 | n_iter = 50 # Number of inversion iterations
75 | preemphasis = .97
76 | max_db = 100
77 | ref_db = 20
78 |
79 |
80 | # Model
81 | r = 4 # Reduction factor. Do not change this.
82 | dropout_rate = 0.05
83 | e = 128 # == embedding
84 | d = 256 # == hidden units of Text2Mel
85 | c = 512 # == hidden units of SSRN
86 | attention_win_size = 3
87 | g = 0.2 ## determines width of band in attention guide
88 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
89 |
90 | ## loss weights : T2M
91 | lw_mel = 0.3333
92 | lw_bd1 = 0.3333
93 | lw_att = 0.3333
94 | ## : SSRN
95 | lw_mag = 0.5
96 | lw_bd2 = 0.5
97 |
98 |
99 | ## validation:
100 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
101 | #### NYT00 matches 24 of nancy's sentences.
102 | validation_sentences_to_evaluate = 24
103 | validation_sentences_to_synth_params = 16
104 |
105 |
106 | # training scheme
107 | restart_from_savepath = []
108 | lr = 0.001 # Initial learning rate.
109 | batchsize = {'t2m': 32, 'ssrn': 32}
110 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters
111 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model
112 | max_epochs = 100
113 |
114 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:--
115 | WORK='/disk/scratch/oliver/dc_tts_osw_clean/work/'
116 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancy_01/train-t2m/model_epoch_1000'), \
117 | ('SSRN', WORK+'/nancy_01/train-ssrn/model_epoch_1000')]
118 |
119 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only
120 | ## those variables whose scopes match one of the patterns in the list
--------------------------------------------------------------------------------
/config/nancy_01.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | nancy base voice for ICPhS in first instance.
6 | '''
7 |
8 |
9 | import os
10 |
11 |
12 | ## Take name of this file to be the config name:
13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
14 |
15 | ## Define place to put outputs relative to this config file's location;
16 | ## supply an absolute path to work elsewhere:
17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
18 |
19 | voicedir = os.path.join(topworkdir, config_name)
20 | logdir = os.path.join(voicedir, 'train')
21 | sampledir = os.path.join(voicedir, 'synth')
22 |
23 | ## Change featuredir to absolute path to use existing features
24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data
25 | coarse_audio_dir = os.path.join(featuredir, 'mels')
26 | full_mel_dir = os.path.join(featuredir, 'full_mels')
27 | full_audio_dir = os.path.join(featuredir, 'mags')
28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
29 |
30 |
31 |
32 | # data locations:
33 | datadir = '/group/project/cstr2/owatts/data/nancy/derived'
34 |
35 |
36 | transcript = os.path.join(datadir, 'transcript.csv')
37 | test_transcript = '/group/project/cstr2/owatts/data/nick16k/transcript_mrt.csv'
38 |
39 |
40 | waveforms = os.path.join(datadir, 'wav_trim')
41 |
42 |
43 |
44 | input_type = 'phones' ## letters or phones
45 |
46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
47 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
52 |
53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
54 | max_N = 140 # Maximum number of characters. # !!!
55 | max_T = 200 # Maximum number of mel frames. # !!!
56 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
57 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
58 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
59 |
60 |
61 |
62 | # signal processing
63 | trim_before_spectrogram_extraction = 0
64 | vocoder = 'griffin_lim'
65 | sr = 16000 # Sampling rate.
66 | n_fft = 2048 # fft points (samples)
67 | frame_shift = 0.0125 # seconds
68 | frame_length = 0.05 # seconds
69 | hop_length = int(sr * frame_shift)
70 | win_length = int(sr * frame_length)
71 | prepro = True # don't extract spectrograms on the fly
72 | full_dim = n_fft//2+1
73 | n_mels = 80 # Number of Mel banks to generate
74 | power = 1.5 # Exponent for amplifying the predicted magnitude
75 | n_iter = 50 # Number of inversion iterations
76 | preemphasis = .97
77 | max_db = 100
78 | ref_db = 20
79 |
80 |
81 | # Model
82 | r = 4 # Reduction factor. Do not change this.
83 | dropout_rate = 0.05
84 | e = 128 # == embedding
85 | d = 256 # == hidden units of Text2Mel
86 | c = 512 # == hidden units of SSRN
87 | attention_win_size = 3
88 | g = 0.2 ## determines width of band in attention guide
89 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
90 |
91 | ## loss weights : T2M
92 | lw_mel = 0.3333
93 | lw_bd1 = 0.3333
94 | lw_att = 0.3333
95 | ## : SSRN
96 | lw_mag = 0.5
97 | lw_bd2 = 0.5
98 |
99 |
100 | ## validation:
101 | validpatt = 'NYT00' ## sentence names containing this substring will be held out of training.
102 | #### NYT00 matches 24 of nancy's sentences.
103 | validation_sentences_to_evaluate = 24
104 | validation_sentences_to_synth_params = 16
105 |
106 |
107 | # training scheme
108 | restart_from_savepath = []
109 | lr = 0.001 # Initial learning rate.
110 | batchsize = {'t2m': 32, 'ssrn': 32}
111 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
112 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
113 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
114 | max_epochs = 1000
115 |
116 | # attention plotting during training
117 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
118 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
--------------------------------------------------------------------------------
/config/nancyplusnick_01.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | Combined nancy & nick voice for ICPhS in first instance.
6 | '''
7 |
8 |
9 | import os
10 |
11 |
12 | ## Take name of this file to be the config name:
13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
14 |
15 | ## Define place to put outputs relative to this config file's location;
16 | ## supply an absolute path to work elsewhere:
17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
18 |
19 | voicedir = os.path.join(topworkdir, config_name)
20 | logdir = os.path.join(voicedir, 'train')
21 | sampledir = os.path.join(voicedir, 'synth')
22 |
23 | ## Change featuredir to absolute path to use existing features
24 | featuredir = os.path.join(topworkdir, config_name, 'data')
25 | coarse_audio_dir = os.path.join(featuredir, 'mels')
26 | full_mel_dir = os.path.join(featuredir, 'full_mels')
27 | full_audio_dir = os.path.join(featuredir, 'mags')
28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
29 |
30 |
31 |
32 | # data locations:
33 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/'
34 |
35 |
36 | transcript = os.path.join(datadir, 'transcript.csv')
37 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
38 |
39 |
40 | waveforms = os.path.join(datadir, 'wav_trim')
41 |
42 |
43 |
44 | input_type = 'phones' ## letters or phones
45 |
46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
47 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
52 |
53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
54 | max_N = 140 # Maximum number of characters. # !!!
55 | max_T = 200 # Maximum number of mel frames. # !!!
56 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
57 | speaker_list = [''] + ['nancy', 'nick']
58 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake
59 | speaker_embedding_size = 128
60 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
61 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
62 |
63 |
64 |
65 | # signal processing
66 | trim_before_spectrogram_extraction = 0
67 | vocoder = 'griffin_lim'
68 | sr = 16000 # Sampling rate.
69 | n_fft = 2048 # fft points (samples)
70 | frame_shift = 0.0125 # seconds
71 | frame_length = 0.05 # seconds
72 | hop_length = int(sr * frame_shift)
73 | win_length = int(sr * frame_length)
74 | prepro = True # don't extract spectrograms on the fly
75 | full_dim = n_fft//2+1
76 | n_mels = 80 # Number of Mel banks to generate
77 | power = 1.5 # Exponent for amplifying the predicted magnitude
78 | n_iter = 50 # Number of inversion iterations
79 | preemphasis = .97
80 | max_db = 100
81 | ref_db = 20
82 |
83 |
84 | # Model
85 | r = 4 # Reduction factor. Do not change this.
86 | dropout_rate = 0.05
87 | e = 128 # == embedding
88 | d = 256 # == hidden units of Text2Mel
89 | c = 512 # == hidden units of SSRN
90 | attention_win_size = 3
91 | g = 0.2 ## determines width of band in attention guide
92 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
93 |
94 | ## loss weights : T2M
95 | lw_mel = 0.3333
96 | lw_bd1 = 0.3333
97 | lw_att = 0.3333
98 | ## : SSRN
99 | lw_mag = 0.5
100 | lw_bd2 = 0.5
101 |
102 |
103 | ## validation:
104 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
105 | #### NYT00 matches 24 of nancy's sentences.
106 | validation_sentences_to_evaluate = 24
107 | validation_sentences_to_synth_params = 16
108 |
109 |
110 | # training scheme
111 | restart_from_savepath = []
112 | lr = 0.001 # Initial learning rate.
113 | batchsize = {'t2m': 32, 'ssrn': 32}
114 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
115 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
116 | max_epochs = 1000
117 |
118 |
--------------------------------------------------------------------------------
/config/nancyplusnick_02.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | Combined nancy & nick voice for ICPhS in first instance.
6 | Stage 2: fine tune on nick only
7 | '''
8 |
9 |
10 | import os
11 |
12 |
13 | ## Take name of this file to be the config name:
14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
15 |
16 | ## Define place to put outputs relative to this config file's location;
17 | ## supply an absolute path to work elsewhere:
18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
19 |
20 | voicedir = os.path.join(topworkdir, config_name)
21 | logdir = os.path.join(voicedir, 'train')
22 | sampledir = os.path.join(voicedir, 'synth')
23 |
24 | ## Change featuredir to absolute path to use existing features
25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data')
26 | coarse_audio_dir = os.path.join(featuredir, 'mels')
27 | full_mel_dir = os.path.join(featuredir, 'full_mels')
28 | full_audio_dir = os.path.join(featuredir, 'mags')
29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
30 |
31 |
32 |
33 | # data locations:
34 | datadir = '/group/project/cstr2/owatts/data/nick16k/'
35 |
36 |
37 | transcript = os.path.join(datadir, 'transcript.csv')
38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv')
39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
40 |
41 | waveforms = os.path.join(datadir, 'wav_trim')
42 |
43 |
44 |
45 | input_type = 'phones' ## letters or phones
46 |
47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
48 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
53 |
54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
55 | max_N = 140 # Maximum number of characters. # !!!
56 | max_T = 200 # Maximum number of mel frames. # !!!
57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
58 | speaker_list = [''] + ['nancy', 'nick']
59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake
60 | speaker_embedding_size = 128
61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
63 |
64 |
65 |
66 | # signal processing
67 | trim_before_spectrogram_extraction = 0
68 | vocoder = 'griffin_lim'
69 | sr = 16000 # Sampling rate.
70 | n_fft = 2048 # fft points (samples)
71 | frame_shift = 0.0125 # seconds
72 | frame_length = 0.05 # seconds
73 | hop_length = int(sr * frame_shift)
74 | win_length = int(sr * frame_length)
75 | prepro = True # don't extract spectrograms on the fly
76 | full_dim = n_fft//2+1
77 | n_mels = 80 # Number of Mel banks to generate
78 | power = 1.5 # Exponent for amplifying the predicted magnitude
79 | n_iter = 50 # Number of inversion iterations
80 | preemphasis = .97
81 | max_db = 100
82 | ref_db = 20
83 |
84 |
85 | # Model
86 | r = 4 # Reduction factor. Do not change this.
87 | dropout_rate = 0.05
88 | e = 128 # == embedding
89 | d = 256 # == hidden units of Text2Mel
90 | c = 512 # == hidden units of SSRN
91 | attention_win_size = 3
92 | g = 0.2 ## determines width of band in attention guide
93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
94 |
95 | ## loss weights : T2M
96 | lw_mel = 0.3333
97 | lw_bd1 = 0.3333
98 | lw_att = 0.3333
99 | ## : SSRN
100 | lw_mag = 0.5
101 | lw_bd2 = 0.5
102 |
103 |
104 | ## validation:
105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
106 | #### NYT00 matches 24 of nancy's sentences.
107 | validation_sentences_to_evaluate = 24
108 | validation_sentences_to_synth_params = 16
109 |
110 |
111 | # training scheme
112 | restart_from_savepath = []
113 | lr = 0.001 # Initial learning rate.
114 | batchsize = {'t2m': 32, 'ssrn': 32}
115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters
116 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model
117 | max_epochs = 100
118 |
119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:--
120 | WORK=topworkdir
121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')]
122 |
123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only
124 | ## those variables whose scopes match one of the patterns in the list
--------------------------------------------------------------------------------
/config/nancyplusnick_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | Combined nancy & nick voice for ICPhS in first instance.
6 | Stage 2: fine tune on nick only -- in this version, set attention loss weight very high
7 | '''
8 |
9 |
10 | import os
11 |
12 |
13 | ## Take name of this file to be the config name:
14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
15 |
16 | ## Define place to put outputs relative to this config file's location;
17 | ## supply an absolute path to work elsewhere:
18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
19 |
20 | voicedir = os.path.join(topworkdir, config_name)
21 | logdir = os.path.join(voicedir, 'train')
22 | sampledir = os.path.join(voicedir, 'synth')
23 |
24 | ## Change featuredir to absolute path to use existing features
25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data')
26 | coarse_audio_dir = os.path.join(featuredir, 'mels')
27 | full_mel_dir = os.path.join(featuredir, 'full_mels')
28 | full_audio_dir = os.path.join(featuredir, 'mags')
29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
30 |
31 |
32 |
33 | # data locations:
34 | datadir = '/group/project/cstr2/owatts/data/nick16k/'
35 |
36 |
37 | transcript = os.path.join(datadir, 'transcript.csv')
38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv')
39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
40 |
41 | waveforms = os.path.join(datadir, 'wav_trim')
42 |
43 |
44 |
45 | input_type = 'phones' ## letters or phones
46 |
47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
48 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
53 |
54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
55 | max_N = 140 # Maximum number of characters. # !!!
56 | max_T = 200 # Maximum number of mel frames. # !!!
57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
58 | speaker_list = [''] + ['nancy', 'nick']
59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake
60 | speaker_embedding_size = 128
61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
63 |
64 |
65 |
66 | # signal processing
67 | trim_before_spectrogram_extraction = 0
68 | vocoder = 'griffin_lim'
69 | sr = 16000 # Sampling rate.
70 | n_fft = 2048 # fft points (samples)
71 | frame_shift = 0.0125 # seconds
72 | frame_length = 0.05 # seconds
73 | hop_length = int(sr * frame_shift)
74 | win_length = int(sr * frame_length)
75 | prepro = True # don't extract spectrograms on the fly
76 | full_dim = n_fft//2+1
77 | n_mels = 80 # Number of Mel banks to generate
78 | power = 1.5 # Exponent for amplifying the predicted magnitude
79 | n_iter = 50 # Number of inversion iterations
80 | preemphasis = .97
81 | max_db = 100
82 | ref_db = 20
83 |
84 |
85 | # Model
86 | r = 4 # Reduction factor. Do not change this.
87 | dropout_rate = 0.05
88 | e = 128 # == embedding
89 | d = 256 # == hidden units of Text2Mel
90 | c = 512 # == hidden units of SSRN
91 | attention_win_size = 3
92 | g = 0.2 ## determines width of band in attention guide
93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
94 |
95 | ## loss weights : T2M
96 | lw_mel = 0.15
97 | lw_bd1 = 0.15
98 | lw_att = 0.9
99 | ## : SSRN
100 | lw_mag = 0.5
101 | lw_bd2 = 0.5
102 |
103 |
104 | ## validation:
105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
106 | #### NYT00 matches 24 of nancy's sentences.
107 | validation_sentences_to_evaluate = 24
108 | validation_sentences_to_synth_params = 16
109 |
110 |
111 | # training scheme
112 | restart_from_savepath = []
113 | lr = 0.001 # Initial learning rate.
114 | batchsize = {'t2m': 32, 'ssrn': 32}
115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters
116 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model
117 | max_epochs = 100
118 |
119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:--
120 | WORK = topworkdir
121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')]
122 |
123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only
124 | ## those variables whose scopes match one of the patterns in the list
--------------------------------------------------------------------------------
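
The three T2M weights above (lw_mel = 0.15, lw_bd1 = 0.15, lw_att = 0.9) deliberately skew the Text2Mel objective towards the attention term for this fine-tuning stage. A minimal sketch of how such weights are presumably combined; the names of the individual loss terms are assumptions:

def t2m_loss(l1_mel, bd_mel, att_loss, lw_mel=0.15, lw_bd1=0.15, lw_att=0.9):
    # weighted sum of mel L1 loss, mel binary-divergence loss and attention loss
    return lw_mel * l1_mel + lw_bd1 * bd_mel + lw_att * att_loss

def ssrn_loss(l1_mag, bd_mag, lw_mag=0.5, lw_bd2=0.5):
    # weighted sum of magnitude L1 loss and magnitude binary-divergence loss
    return lw_mag * l1_mag + lw_bd2 * bd_mag
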
/config/nancyplusnick_04_lcc.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | Combined nancy & nick voice for ICPhS in first instance
6 | Version 4: train with learned channel contributions from the 2 speakers
7 | '''
8 |
9 |
10 | import os
11 |
12 |
13 | ## Take name of this file to be the config name:
14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
15 |
16 | ## Define place to put outputs relative to this config file's location;
17 | ## supply an absolute path to work elsewhere:
18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
19 |
20 | voicedir = os.path.join(topworkdir, config_name)
21 | logdir = os.path.join(voicedir, 'train')
22 | sampledir = os.path.join(voicedir, 'synth')
23 |
24 | ## Change featuredir to absolute path to use existing features
25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data')
26 | coarse_audio_dir = os.path.join(featuredir, 'mels')
27 | full_mel_dir = os.path.join(featuredir, 'full_mels')
28 | full_audio_dir = os.path.join(featuredir, 'mags')
29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
30 |
31 |
32 |
33 | # data locations:
34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/'
35 |
36 |
37 | transcript = os.path.join(datadir, 'transcript.csv')
38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
39 |
40 |
41 | waveforms = os.path.join(datadir, 'wav_trim')
42 |
43 |
44 |
45 | input_type = 'phones' ## letters or phones
46 |
47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
48 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
53 |
54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
55 | max_N = 140 # Maximum number of characters. # !!!
56 | max_T = 200 # Maximum number of mel frames. # !!!
57 | multispeaker = ['learn_channel_contributions'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
58 | speaker_list = [''] + ['nancy', 'nick']
59 | nspeakers = len(speaker_list) + 2
60 | speaker_embedding_size = 128
61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
63 |
64 |
65 |
66 | # signal processing
67 | trim_before_spectrogram_extraction = 0
68 | vocoder = 'griffin_lim'
69 | sr = 16000 # Sampling rate.
70 | n_fft = 2048 # fft points (samples)
71 | frame_shift = 0.0125 # seconds
72 | frame_length = 0.05 # seconds
73 | hop_length = int(sr * frame_shift)
74 | win_length = int(sr * frame_length)
75 | prepro = True # don't extract spectrograms on the fly
76 | full_dim = n_fft//2+1
77 | n_mels = 80 # Number of Mel banks to generate
78 | power = 1.5 # Exponent for amplifying the predicted magnitude
79 | n_iter = 50 # Number of inversion iterations
80 | preemphasis = .97
81 | max_db = 100
82 | ref_db = 20
83 |
84 |
85 | # Model
86 | r = 4 # Reduction factor. Do not change this.
87 | dropout_rate = 0.05
88 | e = 128 # == embedding
89 | d = 256 # == hidden units of Text2Mel
90 | c = 512 # == hidden units of SSRN
91 | attention_win_size = 3
92 | g = 0.2 ## determines width of band in attention guide
93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
94 |
95 | ## loss weights : T2M
96 | lw_mel = 0.3333
97 | lw_bd1 = 0.3333
98 | lw_att = 0.3333
99 | ## : SSRN
100 | lw_mag = 0.5
101 | lw_bd2 = 0.5
102 |
103 |
104 | ## validation:
105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
106 | #### NYT00 matches 24 of nancy's sentences.
107 | validation_sentences_to_evaluate = 24
108 | validation_sentences_to_synth_params = 16
109 |
110 |
111 | # training scheme
112 | restart_from_savepath = []
113 | lr = 0.001 # Initial learning rate.
114 | batchsize = {'t2m': 32, 'ssrn': 32}
115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
116 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
117 | max_epochs = 1000
118 |
119 |
--------------------------------------------------------------------------------
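
A sketch of the basic multi-speaker mechanism behind speaker_list, nspeakers and speaker_embedding_size, assuming the embedding-concatenation approach selected by positions such as 'text_encoder_input' in the earlier configs (the 'learn_channel_contributions' option used here is a different mechanism and is not reproduced): a speaker id indexes an embedding table, and the looked-up vector is tiled over time and concatenated to the module's input. This is an assumption about networks.py, not a quote from it.

import tensorflow as tf

def add_speaker_code(inputs, speaker_ids, nspeakers=5, speaker_embedding_size=128):
    """inputs: (B, T, D) float32; speaker_ids: (B,) int32 indices into speaker_list."""
    table = tf.get_variable('speaker_embedding', [nspeakers, speaker_embedding_size])
    code = tf.nn.embedding_lookup(table, speaker_ids)                     # (B, E)
    code = tf.tile(tf.expand_dims(code, 1), [1, tf.shape(inputs)[1], 1])  # (B, T, E)
    return tf.concat([inputs, code], axis=-1)                             # (B, T, D+E)
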
/config/nancyplusnick_05_lcc_sdpe.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | '''
5 | Combined nancy & nick voice for ICPhS in first instance
6 | Version 5: train with learned channel contributions from the 2 speakers and speaker-dependent phones
7 | '''
8 |
9 |
10 | import os
11 |
12 |
13 | ## Take name of this file to be the config name:
14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
15 |
16 | ## Define place to put outputs relative to this config file's location;
17 | ## supply an absolute path to work elsewhere:
18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
19 |
20 | voicedir = os.path.join(topworkdir, config_name)
21 | logdir = os.path.join(voicedir, 'train')
22 | sampledir = os.path.join(voicedir, 'synth')
23 |
24 | ## Change featuredir to absolute path to use existing features
25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data')
26 | coarse_audio_dir = os.path.join(featuredir, 'mels')
27 | full_mel_dir = os.path.join(featuredir, 'full_mels')
28 | full_audio_dir = os.path.join(featuredir, 'mags')
29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
30 |
31 |
32 |
33 | # data locations:
34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/'
35 |
36 |
37 | transcript = os.path.join(datadir, 'transcript.csv')
38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv')
39 |
40 |
41 | waveforms = os.path.join(datadir, 'wav_trim')
42 |
43 |
44 |
45 | input_type = 'phones' ## letters or phones
46 |
47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \
48 | '>', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \
49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \
50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \
51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \
52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
53 |
54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS.
55 | max_N = 140 # Maximum number of characters. # !!!
56 | max_T = 200 # Maximum number of mel frames. # !!!
57 | multispeaker = ['learn_channel_contributions', 'speaker_dependent_phones'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
58 | speaker_list = [''] + ['nancy', 'nick']
59 | nspeakers = len(speaker_list) + 2
60 | speaker_embedding_size = 128
61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
63 |
64 |
65 |
66 | # signal processing
67 | trim_before_spectrogram_extraction = 0
68 | vocoder = 'griffin_lim'
69 | sr = 16000 # Sampling rate.
70 | n_fft = 2048 # fft points (samples)
71 | frame_shift = 0.0125 # seconds
72 | frame_length = 0.05 # seconds
73 | hop_length = int(sr * frame_shift)
74 | win_length = int(sr * frame_length)
75 | prepro = True # don't extract spectrograms on the fly
76 | full_dim = n_fft//2+1
77 | n_mels = 80 # Number of Mel banks to generate
78 | power = 1.5 # Exponent for amplifying the predicted magnitude
79 | n_iter = 50 # Number of inversion iterations
80 | preemphasis = .97
81 | max_db = 100
82 | ref_db = 20
83 |
84 |
85 | # Model
86 | r = 4 # Reduction factor. Do not change this.
87 | dropout_rate = 0.05
88 | e = 128 # == embedding
89 | d = 256 # == hidden units of Text2Mel
90 | c = 512 # == hidden units of SSRN
91 | attention_win_size = 3
92 | g = 0.2 ## determines width of band in attention guide
93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
94 |
95 | ## loss weights : T2M
96 | lw_mel = 0.3333
97 | lw_bd1 = 0.3333
98 | lw_att = 0.3333
99 | ## : SSRN
100 | lw_mag = 0.5
101 | lw_bd2 = 0.5
102 |
103 |
104 | ## validation:
105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick
106 | #### NYT00 matches 24 of nancy's sentences.
107 | validation_sentences_to_evaluate = 24
108 | validation_sentences_to_synth_params = 16
109 |
110 |
111 | # training scheme
112 | restart_from_savepath = []
113 | lr = 0.001 # Initial learning rate.
114 | batchsize = {'t2m': 32, 'ssrn': 32}
115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters
116 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
117 | max_epochs = 1000
118 |
119 |
--------------------------------------------------------------------------------
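
A sketch of the attention guide whose band width is controlled by g above, following the standard DC-TTS guided-attention formulation; prepare_attention_guides.py presumably writes per-utterance matrices of this general form, while a "global" guide would use the fixed max_N x max_T dimensions.

import numpy as np

def attention_guide(N, T, g=0.2):
    """Penalty matrix W of shape (N, T): near zero close to the diagonal n/N == t/T, near one away from it."""
    n = np.arange(N, dtype=np.float32).reshape(-1, 1) / N
    t = np.arange(T, dtype=np.float32).reshape(1, -1) / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
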
/config/project/baseline.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides')
24 | ## Set this to the empty string ('') to use a global attention guide
25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/'
26 | #attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/fa_attention_guides_punctuation_all_quotes/'
27 |
28 | # Data locations:
29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
30 |
31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM.csv')
32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv')
33 |
34 | waveforms = os.path.join(datadir, 'wavs_trim')
35 |
36 |
37 | input_type = 'phones' ## letters or phones
38 |
39 | ## CMU phones:
40 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh']
41 |
42 | # Train
43 | max_N = 150 # Maximum number of characters/phones
44 | max_T = 264 # Maximum number of mel frames
45 |
46 | turn_off_monotonic_for_synthesis = True
47 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis)
48 |
49 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
50 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
51 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
52 |
53 |
54 |
55 | # signal processing
56 | trim_before_spectrogram_extraction = 0
57 | vocoder = 'griffin_lim'
58 | sr = 22050 # Sampling rate.
59 | n_fft = 2048 # fft points (samples)
60 | frame_shift = 0.0125 # seconds
61 | frame_length = 0.05 # seconds
62 | hop_length = int(sr * frame_shift)
63 | win_length = int(sr * frame_length)
64 | prepro = True # don't extract spectrograms on the fly
65 | full_dim = n_fft//2+1
66 | n_mels = 80 # Number of Mel banks to generate
67 | power = 1.5 # Exponent for amplifying the predicted magnitude
68 | n_iter = 50 # Number of inversion iterations
69 | preemphasis = .97
70 | max_db = 100
71 | ref_db = 20
72 |
73 |
74 | # Model
75 | r = 4 # Reduction factor. Do not change this.
76 | dropout_rate = 0.05
77 | e = 128 # == embedding
78 | d = 256 # == hidden units of Text2Mel
79 | c = 512 # == hidden units of SSRN
80 | attention_win_size = 3
81 | g = 0.2 ## determines width of band in attention guide
82 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
83 |
84 | ## loss weights : T2M
85 | lw_mel = 0.333
86 | lw_bd1 = 0.333
87 | lw_att = 0.333
88 | lw_t2m_l2 = 0.0
89 | ## : SSRN
90 | lw_mag = 0.5
91 | lw_bd2 = 0.5
92 | lw_ssrn_l2 = 0.0
93 |
94 |
95 | ## validation:
96 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training.
97 | validation_sentences_to_evaluate = 5
98 | validation_sentences_to_synth_params = 3
99 |
100 |
101 | # training scheme
102 | restart_from_savepath = []
103 | lr = 0.0001 # Initial learning rate.
104 | beta1 = 0.5
105 | beta2 = 0.9
106 | epsilon = 0.000001
107 | decay_lr = False
108 | batchsize = {'t2m': 8, 'ssrn': 2}
109 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
110 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters
111 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model
112 | max_epochs = 500
113 |
114 | # attention plotting during training
115 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices
116 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for
117 |
--------------------------------------------------------------------------------
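
A sketch of how the signal-processing settings above map onto mel and magnitude spectrogram extraction, assuming the usual DC-TTS-style pipeline; prepare_acoustic_features.py may differ in detail.

import librosa
import numpy as np

sr, n_fft = 22050, 2048
hop_length, win_length = int(sr * 0.0125), int(sr * 0.05)
n_mels, preemph, max_db, ref_db = 80, 0.97, 100, 20

def get_spectrograms(wav_path):
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - preemph * y[:-1])                  # pre-emphasis
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length))
    mel = np.dot(librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels), mag)
    mel = 20 * np.log10(np.maximum(1e-5, mel))                     # amplitude -> dB
    mag = 20 * np.log10(np.maximum(1e-5, mag))
    mel = np.clip((mel - ref_db + max_db) / max_db, 1e-8, 1)       # normalise with ref_db / max_db
    mag = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1)
    return mel.T.astype(np.float32), mag.T.astype(np.float32)      # (T, n_mels), (T, 1+n_fft//2)
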
/config/project/fa_as_attention.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides')
24 | ## Set this to the empty string ('') to use a global attention guide
25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/'
26 | use_external_durations = True
27 |
28 | # Data locations:
29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
30 |
31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17_durations.csv')
32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes_durations_updated.csv')
33 | waveforms = os.path.join(datadir, 'wavs_trim')
34 |
35 |
36 | input_type = 'phones' ## letters or phones
37 |
38 | ## CMU phones:
39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh']
40 |
41 | # Train
42 | max_N = 150 # Maximum number of characters/phones
43 | max_T = 264 # Maximum number of mel frames
44 |
45 | turn_off_monotonic_for_synthesis = True
46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis)
47 |
48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
51 |
52 |
53 |
54 | # signal processing
55 | trim_before_spectrogram_extraction = 0
56 | vocoder = 'griffin_lim'
57 | sr = 22050 # Sampling rate.
58 | n_fft = 2048 # fft points (samples)
59 | frame_shift = 0.0125 # seconds
60 | frame_length = 0.05 # seconds
61 | hop_length = int(sr * frame_shift)
62 | win_length = int(sr * frame_length)
63 | prepro = True # don't extract spectrograms on the fly
64 | full_dim = n_fft//2+1
65 | n_mels = 80 # Number of Mel banks to generate
66 | power = 1.5 # Exponent for amplifying the predicted magnitude
67 | n_iter = 50 # Number of inversion iterations
68 | preemphasis = .97
69 | max_db = 100
70 | ref_db = 20
71 |
72 |
73 | # Model
74 | r = 4 # Reduction factor. Do not change this.
75 | dropout_rate = 0.05
76 | e = 128 # == embedding
77 | d = 256 # == hidden units of Text2Mel
78 | c = 512 # == hidden units of SSRN
79 | attention_win_size = 3
80 | g = 0.2 ## determines width of band in attention guide
81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
82 |
83 | ## loss weights : T2M
84 | lw_mel = 0.333
85 | lw_bd1 = 0.333
86 | lw_att = 0.333
87 | lw_t2m_l2 = 0.0
88 | ## : SSRN
89 | lw_mag = 0.5
90 | lw_bd2 = 0.5
91 | lw_ssrn_l2 = 0.0
92 |
93 |
94 | ## validation:
95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training.
96 | validation_sentences_to_evaluate = 5
97 | validation_sentences_to_synth_params = 3
98 |
99 |
100 | # training scheme
101 | restart_from_savepath = []
102 | lr = 0.0001 # Initial learning rate.
103 | beta1 = 0.5
104 | beta2 = 0.9
105 | epsilon = 0.000001
106 | decay_lr = False
107 | batchsize = {'t2m': 8, 'ssrn': 2}
108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters
110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model
111 | max_epochs = 500
112 |
113 | # attention plotting during training
114 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for
116 |
--------------------------------------------------------------------------------
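
With use_external_durations = True, the duration columns in the transcripts above supply a forced-alignment path that can stand in for learned attention. A minimal sketch, assuming durations are per-phone frame counts; the helper is hypothetical, not the repo's convert_alignment_to_guide.py.

import numpy as np

def durations_to_alignment(durations, max_N, max_T):
    """durations: per-phone frame counts -> hard alignment matrix A of shape (max_N, max_T)."""
    A = np.zeros((max_N, max_T), dtype=np.float32)
    t = 0
    for n, d in enumerate(durations):
        A[n, t:t + int(d)] = 1.0
        t += int(d)
    return A
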
/config/project/fa_as_guide.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides')
24 | ## Set this to the empty string ('') to use a global attention guide
25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/W0.1_attention_guides_dctts/' # contains FA guides
26 |
27 | # Data locations:
28 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
29 |
30 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv')
31 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv')
32 | waveforms = os.path.join(datadir, 'wavs_trim')
33 |
34 |
35 | input_type = 'phones' ## letters or phones
36 |
37 | ## CMU phones:
38 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh']
39 |
40 | # Train
41 | max_N = 150 # Maximum number of characters/phones
42 | max_T = 264 # Maximum number of mel frames
43 |
44 | turn_off_monotonic_for_synthesis = True
45 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis)
46 |
47 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
48 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
49 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
50 |
51 |
52 |
53 | # signal processing
54 | trim_before_spectrogram_extraction = 0
55 | vocoder = 'griffin_lim'
56 | sr = 22050 # Sampling rate.
57 | n_fft = 2048 # fft points (samples)
58 | frame_shift = 0.0125 # seconds
59 | frame_length = 0.05 # seconds
60 | hop_length = int(sr * frame_shift)
61 | win_length = int(sr * frame_length)
62 | prepro = True # don't extract spectrograms on the fly
63 | full_dim = n_fft//2+1
64 | n_mels = 80 # Number of Mel banks to generate
65 | power = 1.5 # Exponent for amplifying the predicted magnitude
66 | n_iter = 50 # Number of inversion iterations
67 | preemphasis = .97
68 | max_db = 100
69 | ref_db = 20
70 |
71 |
72 | # Model
73 | r = 4 # Reduction factor. Do not change this.
74 | dropout_rate = 0.05
75 | e = 128 # == embedding
76 | d = 256 # == hidden units of Text2Mel
77 | c = 512 # == hidden units of SSRN
78 | attention_win_size = 3
79 | g = 0.2 ## determines width of band in attention guide
80 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
81 |
82 | ## loss weights : T2M
83 | lw_mel = 0.333
84 | lw_bd1 = 0.333
85 | lw_att = 0.333
86 | lw_t2m_l2 = 0.0
87 | ## : SSRN
88 | lw_mag = 0.5
89 | lw_bd2 = 0.5
90 | lw_ssrn_l2 = 0.0
91 |
92 |
93 | ## validation:
94 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training.
95 | validation_sentences_to_evaluate = 5
96 | validation_sentences_to_synth_params = 3
97 |
98 |
99 | # training scheme
100 | restart_from_savepath = []
101 | lr = 0.0001 # Initial learning rate.
102 | beta1 = 0.5
103 | beta2 = 0.9
104 | epsilon = 0.000001
105 | decay_lr = False
106 | batchsize = {'t2m': 8, 'ssrn': 2}
107 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
108 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters
109 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model
110 | max_epochs = 500
111 |
112 | # attention plotting during training
113 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices
114 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for
115 |
--------------------------------------------------------------------------------
/config/project/fa_as_target.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | import os
5 |
6 |
7 | ## Take name of this file to be the config name:
8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
9 |
10 | ## Define place to put outputs relative to this config file's location;
11 | ## supply an absolute path to work elsewhere:
12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work')))
13 |
14 | voicedir = os.path.join(topworkdir, config_name)
15 | logdir = os.path.join(voicedir, 'train')
16 | sampledir = os.path.join(voicedir, 'synth')
17 |
18 | ## Change featuredir to absolute path to use existing features
19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
20 | coarse_audio_dir = os.path.join(featuredir, 'mels')
21 | full_mel_dir = os.path.join(featuredir, 'full_mels')
22 | full_audio_dir = os.path.join(featuredir, 'mags')
23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides')
24 | ## Set this to the empty string ('') to use a global attention guide
25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_dctts/' # contains FA matrix
26 | attention_guide_fa = True
27 |
28 | # Data locations:
29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/'
30 |
31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv')
32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv')
33 | waveforms = os.path.join(datadir, 'wavs_trim')
34 |
35 |
36 | input_type = 'phones' ## letters or phones
37 |
38 | ## CMU phones:
39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh']
40 |
41 | # Train
42 | max_N = 150 # Maximum number of characters/phones
43 | max_T = 264 # Maximum number of mel frames
44 |
45 | turn_off_monotonic_for_synthesis = True
46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis)
47 |
48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
51 |
52 |
53 |
54 | # signal processing
55 | trim_before_spectrogram_extraction = 0
56 | vocoder = 'griffin_lim'
57 | sr = 22050 # Sampling rate.
58 | n_fft = 2048 # fft points (samples)
59 | frame_shift = 0.0125 # seconds
60 | frame_length = 0.05 # seconds
61 | hop_length = int(sr * frame_shift)
62 | win_length = int(sr * frame_length)
63 | prepro = True # don't extract spectrograms on the fly
64 | full_dim = n_fft//2+1
65 | n_mels = 80 # Number of Mel banks to generate
66 | power = 1.5 # Exponent for amplifying the predicted magnitude
67 | n_iter = 50 # Number of inversion iterations
68 | preemphasis = .97
69 | max_db = 100
70 | ref_db = 20
71 |
72 |
73 | # Model
74 | r = 4 # Reduction factor. Do not change this.
75 | dropout_rate = 0.05
76 | e = 128 # == embedding
77 | d = 256 # == hidden units of Text2Mel
78 | c = 512 # == hidden units of SSRN
79 | attention_win_size = 3
80 | g = 0.2 ## determines width of band in attention guide
81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
82 |
83 | ## loss weights : T2M
84 | lw_mel =0.333
85 | lw_bd1 =0.333
86 | lw_att =0.333
87 | lw_t2m_l2 = 0.0
88 | ## : SSRN
89 | lw_mag = 0.5
90 | lw_bd2 = 0.5
91 | lw_ssrn_l2 = 0.0
92 |
93 |
94 | ## validation:
95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training.
96 | validation_sentences_to_evaluate = 5
97 | validation_sentences_to_synth_params = 3
98 |
99 |
100 | # training scheme
101 | restart_from_savepath = []
102 | lr = 0.0001 # Initial learning rate.
103 | beta1 = 0.5
104 | beta2 = 0.9
105 | epsilon = 0.000001
106 | decay_lr = False
107 | batchsize = {'t2m': 8, 'ssrn': 2}
108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters
110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model
111 | max_epochs = 500
112 |
113 | # attention plotting during training
114 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices
115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for
116 |
--------------------------------------------------------------------------------
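
beta1, beta2 and epsilon above read as Adam hyperparameters; below is a minimal TensorFlow 1.x sketch of the corresponding optimiser setup, with decay_lr = False implying a fixed learning rate. This is an assumption about train.py, not a quote from it.

import tensorflow as tf

def make_train_op(loss, lr=0.0001, beta1=0.5, beta2=0.9, epsilon=1e-6):
    optimiser = tf.train.AdamOptimizer(learning_rate=lr, beta1=beta1, beta2=beta2, epsilon=epsilon)
    return optimiser.minimize(loss)
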
/config/ssw10/G1ABC_01.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_01/'
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 | # data locations:
29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv'
31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv'
32 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
33 |
34 | input_type = 'phones' ## letters or phones
35 |
36 | ### CMU lex version:-
37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
38 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
42 | 'w', 'y', 'z', 'zh']
43 |
44 | max_N = 173 # Maximum number of characters. # !!!
45 | max_T = 210 # Maximum number of mel frames. # !!!
46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
49 |
50 |
51 | # signal processing
52 | trim_before_spectrogram_extraction = 0
53 | vocoder = 'griffin_lim'
54 | sr = 16000 # Sampling rate.
55 | n_fft = 1024 # fft points (samples)
56 | hop_length = 256 # int(sr * frame_shift)
57 | win_length = 1024 # int(sr * frame_length)
58 | prepro = True # don't extract spectrograms on the fly
59 | full_dim = n_fft//2+1
60 | n_mels = 80 # Number of Mel banks to generate
61 | power = 1.5 # Exponent for amplifying the predicted magnitude
62 | n_iter = 50 # Number of inversion iterations
63 | preemphasis = .97
64 | max_db = 100
65 | ref_db = 20
66 |
67 |
68 | # Model
69 | r = 4 # Reduction factor. Do not change this.
70 | dropout_rate = 0.05
71 | e = 128 # == embedding
72 | d = 256 # == hidden units of Text2Mel
73 | c = 512 # == hidden units of SSRN
74 | attention_win_size = 3
75 | g = 0.2 ## determines width of band in attention guide
76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
77 |
78 | ## loss weights : T2M
79 | lw_mel = 0.3333
80 | lw_bd1 = 0.3333
81 | lw_att = 0.3333
82 | ## : SSRN
83 | lw_mag = 0.5
84 | lw_bd2 = 0.5
85 |
86 |
87 | ## validation:
88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI
89 | validation_sentences_to_evaluate = 4
90 | validation_sentences_to_synth_params = 4
91 |
92 |
93 | # training scheme
94 | restart_from_savepath = []
95 | lr = 0.0002 # Initial learning rate.
96 | beta1 = 0.5
97 | beta2 = 0.9
98 | epsilon = 0.000001
99 | decay_lr = False
100 | batchsize = {'t2m': 16, 'ssrn': 32}
101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
104 | max_epochs = 1000
105 |
106 | # attention plotting during training
107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
109 |
--------------------------------------------------------------------------------
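
A sketch of how power and n_iter are used at synthesis time, assuming the usual DC-TTS Griffin-Lim inversion of the predicted normalised, dB-scaled magnitudes; the exact de-normalisation has to mirror feature extraction.

import librosa
import numpy as np
from scipy.signal import lfilter

def griffin_lim_synthesis(mag_norm, sr=16000, n_fft=1024, hop_length=256, win_length=1024,
                          n_iter=50, power=1.5, max_db=100, ref_db=20, preemph=0.97):
    """mag_norm: (T, 1+n_fft//2) magnitudes in (0, 1] as stored in full_audio_dir."""
    mag = np.clip(mag_norm, 0, 1) * max_db - max_db + ref_db   # undo normalisation (back to dB)
    mag = np.power(10.0, mag * 0.05) ** power                  # dB -> amplitude, sharpened by `power`
    wav = librosa.griffinlim(mag.T, n_iter=n_iter, hop_length=hop_length, win_length=win_length)
    return lfilter([1], [1, -preemph], wav)                    # de-emphasis
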
/config/ssw10/G1ABC_02.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/'
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 | # data locations:
29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv'
31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv'
32 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
33 |
34 | input_type = 'phones' ## letters or phones
35 |
36 | ### CMU lex version:-
37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
38 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
42 | 'w', 'y', 'z', 'zh']
43 |
44 | max_N = 173 # Maximum number of characters. # !!!
45 | max_T = 210 # Maximum number of mel frames. # !!!
46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
49 |
50 |
51 | # signal processing
52 | trim_before_spectrogram_extraction = 0
53 | vocoder = 'griffin_lim'
54 | sr = 16000 # Sampling rate.
55 | n_fft = 2048 # fft points (samples)
56 | hop_length = 200 # int(sr * 0.0125)
57 | win_length = 800 # int(sr * 0.05)
58 | prepro = True # don't extract spectrograms on the fly
59 | full_dim = n_fft//2+1
60 | n_mels = 80 # Number of Mel banks to generate
61 | power = 1.5 # Exponent for amplifying the predicted magnitude
62 | n_iter = 50 # Number of inversion iterations
63 | preemphasis = .97
64 | max_db = 100
65 | ref_db = 20
66 |
67 |
68 | # Model
69 | r = 4 # Reduction factor. Do not change this.
70 | dropout_rate = 0.05
71 | e = 128 # == embedding
72 | d = 256 # == hidden units of Text2Mel
73 | c = 512 # == hidden units of SSRN
74 | attention_win_size = 3
75 | g = 0.2 ## determines width of band in attention guide
76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None]
77 |
78 | ## loss weights : T2M
79 | lw_mel = 0.3333
80 | lw_bd1 = 0.3333
81 | lw_att = 0.3333
82 | ## : SSRN
83 | lw_mag = 0.5
84 | lw_bd2 = 0.5
85 |
86 |
87 | ## validation:
88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI
89 | validation_sentences_to_evaluate = 4
90 | validation_sentences_to_synth_params = 4
91 |
92 |
93 | # training scheme
94 | restart_from_savepath = []
95 | lr = 0.0002 # Initial learning rate.
96 | beta1 = 0.5
97 | beta2 = 0.9
98 | epsilon = 0.000001
99 | decay_lr = False
100 | batchsize = {'t2m': 16, 'ssrn': 32}
101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
104 | max_epochs = 1000
105 |
106 | # attention plotting during training
107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
109 |
--------------------------------------------------------------------------------
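
A sketch of what random_reduction_on_the_fly presumably does, based on its comment: the coarse Text2Mel targets are the full mel frames taken every r-th frame, with the starting shift drawn at random each time a batch is built (a fixed shift of 0 otherwise). The helper is hypothetical, not the repo's data_load.py.

import numpy as np

def reduce_mel(full_mel, r=4, random_shift=True):
    """full_mel: (T, n_mels) -> coarse mel of roughly T // r frames."""
    shift = np.random.randint(r) if random_shift else 0
    return full_mel[shift::r, :]
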
/config/ssw10/G1ABC_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!!
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 | # data locations:
29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv'
31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv'
32 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
33 |
34 | input_type = 'phones' ## letters or phones
35 |
36 | ### CMU lex version:-
37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
38 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
42 | 'w', 'y', 'z', 'zh']
43 |
44 | max_N = 173 # Maximum number of characters. # !!!
45 | max_T = 210 # Maximum number of mel frames. # !!!
46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
49 |
50 |
51 | # signal processing
52 | trim_before_spectrogram_extraction = 0
53 | vocoder = 'griffin_lim'
54 | sr = 16000 # Sampling rate.
55 | n_fft = 2048 # fft points (samples)
56 | hop_length = 200 # int(sr * 0.0125)
57 | win_length = 800 # int(sr * 0.05)
58 | prepro = True # don't extract spectrograms on the fly
59 | full_dim = n_fft//2+1
60 | n_mels = 80 # Number of Mel banks to generate
61 | power = 1.5 # Exponent for amplifying the predicted magnitude
62 | n_iter = 50 # Number of inversion iterations
63 | preemphasis = .97
64 | max_db = 100
65 | ref_db = 20
66 |
67 |
68 | # Model
69 | r = 4 # Reduction factor. Do not change this.
70 | dropout_rate = 0.05
71 | e = 128 # == embedding
72 | d = 256 # == hidden units of Text2Mel
73 | c = 512 # == hidden units of SSRN
74 | attention_win_size = 3
75 | g = 0.2 ## determines width of band in attention guide
76 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
77 |
78 | ## loss weights : T2M
79 | lw_mel = 0.3333
80 | lw_bd1 = 0.3333
81 | lw_att = 0.3333
82 | ## : SSRN
83 | lw_mag = 0.5
84 | lw_bd2 = 0.5
85 |
86 |
87 | ## validation:
88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI
89 | validation_sentences_to_evaluate = 4
90 | validation_sentences_to_synth_params = 4
91 |
92 |
93 | # training scheme
94 | restart_from_savepath = []
95 | lr = 0.001 # Initial learning rate.
96 | batchsize = {'t2m': 32, 'ssrn': 32}
97 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
98 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
99 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
100 | max_epochs = 1000
101 |
102 | # attention plotting during training
103 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
104 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
105 |
--------------------------------------------------------------------------------
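
A sketch of how validpatt is presumably applied when building the train/validation split (a hypothetical helper, not the repo's data_load.py code): utterances whose names contain the substring are held out.

def split_by_validpatt(utterance_ids, validpatt='LJ050-'):
    valid = [u for u in utterance_ids if validpatt in u]
    train = [u for u in utterance_ids if validpatt not in u]
    return train, valid
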
/config/ssw10/G1AB_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## As DCTTS but remove attention
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!!
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 |
29 | # data locations:
30 | use_external_durations = True
31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv'
33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv'
34 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
35 |
36 | input_type = 'phones' ## letters or phones
37 |
38 | ### CMU lex version:-
39 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
40 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
41 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
42 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
43 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
44 | 'w', 'y', 'z', 'zh']
45 |
46 | max_N = 173 # Maximum number of characters. # !!!
47 | max_T = 210 # Maximum number of mel frames. # !!!
48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
51 |
52 |
53 | # signal processing
54 | trim_before_spectrogram_extraction = 0
55 | vocoder = 'griffin_lim'
56 | sr = 16000 # Sampling rate.
57 | n_fft = 2048 # fft points (samples)
58 | hop_length = 200 # int(sr * 0.0125)
59 | win_length = 800 # int(sr * 0.05)
60 | prepro = True # don't extract spectrograms on the fly
61 | full_dim = n_fft//2+1
62 | n_mels = 80 # Number of Mel banks to generate
63 | power = 1.5 # Exponent for amplifying the predicted magnitude
64 | n_iter = 50 # Number of inversion iterations
65 | preemphasis = .97
66 | max_db = 100
67 | ref_db = 20
68 |
69 |
70 | # Model
71 | r = 4 # Reduction factor. Do not change this.
72 | dropout_rate = 0.05
73 | e = 128 # == embedding
74 | d = 256 # == hidden units of Text2Mel
75 | c = 512 # == hidden units of SSRN
76 | attention_win_size = 3
77 | g = 0.2 ## determines width of band in attention guide
78 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
79 |
80 | ## loss weights : T2M
81 | lw_mel = 0.3333
82 | lw_bd1 = 0.3333
83 | lw_att = 0.3333
84 | ## : SSRN
85 | lw_mag = 0.5
86 | lw_bd2 = 0.5
87 |
88 |
89 | ## validation:
90 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI
91 | validation_sentences_to_evaluate = 4
92 | validation_sentences_to_synth_params = 4
93 |
94 |
95 | # training scheme
96 | restart_from_savepath = []
97 | lr = 0.001 # Initial learning rate.
98 | batchsize = {'t2m': 32, 'ssrn': 32}
99 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
100 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
101 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model
102 | max_epochs = 1000
103 |
104 | # attention plotting during training
105 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
106 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
107 |
--------------------------------------------------------------------------------
/config/ssw10/G1BC_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## As DCTTS but remove attention
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!!
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 |
29 | # data locations:
30 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
31 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv'
32 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv'
33 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
34 |
35 | text_encoder_type = 'minimal_feedforward'
36 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels'
37 | merlin_lab_dim = 416
38 |
39 |
40 | input_type = 'phones' ## letters or phones
41 |
42 | ### CMU lex version:-
43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
44 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
48 | 'w', 'y', 'z', 'zh']
49 |
50 | max_N = 173 # Maximum number of characters. # !!!
51 | max_T = 210 # Maximum number of mel frames. # !!!
52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
55 |
56 |
57 | # signal processing
58 | trim_before_spectrogram_extraction = 0
59 | vocoder = 'griffin_lim'
60 | sr = 16000 # Sampling rate.
61 | n_fft = 2048 # fft points (samples)
62 | hop_length = 200 # int(sr * 0.0125)
63 | win_length = 800 # int(sr * 0.05)
64 | prepro = True # don't extract spectrograms on the fly
65 | full_dim = n_fft//2+1
66 | n_mels = 80 # Number of Mel banks to generate
67 | power = 1.5 # Exponent for amplifying the predicted magnitude
68 | n_iter = 50 # Number of inversion iterations
69 | preemphasis = .97
70 | max_db = 100
71 | ref_db = 20
72 |
73 |
74 | # Model
75 | r = 4 # Reduction factor. Do not change this.
76 | dropout_rate = 0.05
77 | e = 128 # == embedding
78 | d = 256 # == hidden units of Text2Mel
79 | c = 512 # == hidden units of SSRN
80 | attention_win_size = 3
81 | g = 0.2 ## determines width of band in attention guide
82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
83 |
84 | ## loss weights : T2M
85 | lw_mel = 0.3333
86 | lw_bd1 = 0.3333
87 | lw_att = 0.3333
88 | ## : SSRN
89 | lw_mag = 0.5
90 | lw_bd2 = 0.5
91 |
92 |
93 | ## validation:
94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
95 | validation_sentences_to_evaluate = 4
96 | validation_sentences_to_synth_params = 4
97 |
98 |
99 | # training scheme
100 | restart_from_savepath = []
101 | lr = 0.001 # Initial learning rate.
102 | batchsize = {'t2m': 32, 'ssrn': 32}
103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
105 | save_every_n_epochs = 100 ## in addition to keeping the 5 most recent checkpoints, how often to archive a model
106 | max_epochs = 1000
107 |
108 | # attention plotting during training
109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
111 |
--------------------------------------------------------------------------------
/config/ssw10/G1B_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## As DCTTS but remove attention
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!!
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 |
27 |
28 |
29 | # data locations:
30 | use_external_durations = True
31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv'
33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv'
34 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
35 |
36 | text_encoder_type = 'minimal_feedforward'
37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels'
38 | merlin_lab_dim = 416
39 |
40 | input_type = 'phones' ## letters or phones
41 |
42 | ### CMU lex version:-
43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
44 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
48 | 'w', 'y', 'z', 'zh']
49 |
50 | max_N = 173 # Maximum number of characters. # !!!
51 | max_T = 210 # Maximum number of mel frames. # !!!
52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
55 |
56 |
57 | # signal processing
58 | trim_before_spectrogram_extraction = 0
59 | vocoder = 'griffin_lim'
60 | sr = 16000 # Sampling rate.
61 | n_fft = 2048 # fft points (samples)
62 | hop_length = 200 # int(sr * 0.0125)
63 | win_length = 800 # int(sr * 0.05)
64 | prepro = True # don't extract spectrograms on the fly
65 | full_dim = n_fft//2+1
66 | n_mels = 80 # Number of Mel banks to generate
67 | power = 1.5 # Exponent for amplifying the predicted magnitude
68 | n_iter = 50 # Number of inversion iterations
69 | preemphasis = .97
70 | max_db = 100
71 | ref_db = 20
72 |
73 |
74 | # Model
75 | r = 4 # Reduction factor. Do not change this.
76 | dropout_rate = 0.05
77 | e = 128 # == embedding
78 | d = 256 # == hidden units of Text2Mel
79 | c = 512 # == hidden units of SSRN
80 | attention_win_size = 3
81 | g = 0.2 ## determines width of band in attention guide
82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
83 |
84 | ## loss weights : T2M
85 | lw_mel = 0.3333
86 | lw_bd1 = 0.3333
87 | lw_att = 0.3333
88 | ## : SSRN
89 | lw_mag = 0.5
90 | lw_bd2 = 0.5
91 |
92 |
93 | ## validation:
94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
95 | validation_sentences_to_evaluate = 4
96 | validation_sentences_to_synth_params = 4
97 |
98 |
99 | # training scheme
100 | restart_from_savepath = []
101 | lr = 0.001 # Initial learning rate.
102 | batchsize = {'t2m': 32, 'ssrn': 32}
103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
105 | save_every_n_epochs = 100 ## in addition to keeping the 5 most recent checkpoints, how often to archive a model
106 | max_epochs = 1000
107 |
108 | # attention plotting during training
109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
111 |
--------------------------------------------------------------------------------
/config/ssw10/G1_03.cfg:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #/usr/bin/python2
3 |
4 | ## As DCTTS but remove attention
5 |
6 | import os
7 |
8 | ## Take name of this file to be the config name:
9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension
10 |
11 | ## Define place to put outputs relative to this config file's location;
12 | ## supply an absolute path to work elsewhere:
13 | topworkdir = '/disk/scratch/script_project/ssw10/work/'
14 |
15 | voicedir = os.path.join(topworkdir, config_name)
16 | logdir = os.path.join(voicedir, 'train')
17 | sampledir = os.path.join(voicedir, 'synth')
18 |
19 | ## Change featuredir to absolute path to use existing features
20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!!
21 | coarse_audio_dir = os.path.join(featuredir, 'mels')
22 | full_mel_dir = os.path.join(featuredir, 'full_mels')
23 | full_audio_dir = os.path.join(featuredir, 'mags')
24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide
25 |
26 | history_type = 'fractional_position_in_phone'
27 |
28 |
29 | # data locations:
30 | use_external_durations = True
31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/'
32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv'
33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv'
34 | waveforms = os.path.join(datadirwav, 'wav_trim16k')
35 |
36 | text_encoder_type = 'minimal_feedforward'
37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels'
38 | merlin_lab_dim = 416
39 |
40 | input_type = 'phones' ## letters or phones
41 |
42 | ### CMU lex version:-
43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \
44 | '<;>', '<>', '>', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \
45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \
46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \
47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \
48 | 'w', 'y', 'z', 'zh']
49 |
50 | max_N = 173 # Maximum number of characters. # !!!
51 | max_T = 210 # Maximum number of mel frames. # !!!
52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input
53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set
54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
55 |
56 |
57 | # signal processing
58 | trim_before_spectrogram_extraction = 0
59 | vocoder = 'griffin_lim'
60 | sr = 16000 # Sampling rate.
61 | n_fft = 2048 # fft points (samples)
62 | hop_length = 200 # int(sr * 0.0125)
63 | win_length = 800 # int(sr * 0.05)
64 | prepro = True # don't extract spectrograms on the fly
65 | full_dim = n_fft//2+1
66 | n_mels = 80 # Number of Mel banks to generate
67 | power = 1.5 # Exponent for amplifying the predicted magnitude
68 | n_iter = 50 # Number of inversion iterations
69 | preemphasis = .97
70 | max_db = 100
71 | ref_db = 20
72 |
73 |
74 | # Model
75 | r = 4 # Reduction factor. Do not change this.
76 | dropout_rate = 0.05
77 | e = 128 # == embedding
78 | d = 256 # == hidden units of Text2Mel
79 | c = 512 # == hidden units of SSRN
80 | attention_win_size = 3
81 | g = 0.2 ## determines width of band in attention guide
82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None]
83 |
84 | ## loss weights : T2M
85 | lw_mel = 0.3333
86 | lw_bd1 = 0.3333
87 | lw_att = 0.3333
88 | ## : SSRN
89 | lw_mag = 0.5
90 | lw_bd2 = 0.5
91 |
92 |
93 | ## validation:
94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI
95 | validation_sentences_to_evaluate = 4
96 | validation_sentences_to_synth_params = 4
97 |
98 |
99 | # training scheme
100 | restart_from_savepath = []
101 | lr = 0.001 # Initial learning rate.
102 | batchsize = {'t2m': 32, 'ssrn': 32}
103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8)
104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters
105 | save_every_n_epochs = 100 ## in addition to keeping the 5 most recent checkpoints, how often to archive a model
106 | max_epochs = 1000
107 |
108 | # attention plotting during training
109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices
110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for
111 |
--------------------------------------------------------------------------------
/config/ssw10/README.md:
--------------------------------------------------------------------------------
1 | # Voices for SSW10 experiment
2 |
3 | Instructions for recreating SSW10 voices
4 |
5 |
6 | ## Tools
7 |
8 |
9 | ### DCTTS code
10 | ```
11 | TOOLDIR=/disk/scratch/script_project/ssw10/tools/
12 | mkdir -p $TOOLDIR
13 | cd $TOOLDIR
14 | git clone https://github.com/oliverwatts/dc_tts_osw.git dc_tts_osw_A
15 | cd dc_tts_osw_A
16 | ```
17 |
18 | ### Festival
19 |
20 | ```
21 | INSTALL_DIR=$TOOLDIR/festival
22 | mkdir -p $INSTALL_DIR
23 | cd $INSTALL_DIR
24 |
25 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz
26 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz
27 |
28 | tar xvf festival-2.4-release.tar.gz
29 | tar xvf speech_tools-2.4-release.tar.gz
30 |
31 | cd speech_tools
32 | ./configure --prefix=$INSTALL_DIR
33 | gmake
34 |
35 | cd ../festival
36 | ./configure --prefix=$INSTALL_DIR
37 | gmake
38 |
39 | cd $INSTALL_DIR
40 |
41 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz
42 | tar xvf festvox_cmu_us_awb_cg.tar.gz
43 |
44 | ## Get lexicons for the english voice:
45 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz
46 | tar xvf festlex_CMU.tar.gz
47 |
48 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz
49 | tar xvf festlex_POSLEX.tar.gz
50 |
51 | ## test
52 | cd $INSTALL_DIR/festival/bin
53 |
54 | ## run the *locally installed* festival (NB: initial ./ is important!)
55 | ./festival
56 |
57 | festival> (voice_cmu_us_awb_cg)
58 | festival> (utt.save.wave (SayText "If i'm speaking then installation actually went ok.") "test.wav" 'riff)
59 | ```
60 |
61 |
62 |
63 |
64 |
65 |
66 |
67 | ## Data
68 |
69 |
70 | ### Download LJSpeech
71 |
72 | ```
73 | DATADIR=/disk/scratch/script_project/ssw10/data
74 | mkdir -p $DATADIR
75 |
76 | cd $DATADIR
77 | wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
78 | bunzip2 LJSpeech-1.1.tar.bz2
79 | tar xvf LJSpeech-1.1.tar
80 | rm LJSpeech-1.1.tar
81 | ```
82 |
83 |
84 | ### Phonetise the transcription with Festival + CMU lexicon
85 |
86 | ```
87 | CODEDIR=/disk/scratch/oliver/dc_tts_osw/
88 | cd $CODEDIR
89 | python ./script/festival/csv2scm.py -i $DATADIR/LJSpeech-1.1/metadata.csv -o $DATADIR/LJSpeech-1.1/utts.data
90 |
91 | cd $DATADIR/LJSpeech-1.1/
92 | FEST=$TOOLDIR/festival/festival/bin/festival
93 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_cmulex.scm
94 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp2.csv
95 |
96 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp2.csv > ./transcript_cmu.csv
97 | ```
98 |
99 |
--------------------------------------------------------------------------------
/configuration.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #!/usr/bin/env python2
3 | import os
4 | import imp
5 | import inspect
6 |
7 |
8 | CONFIG_DEFAULTS = [
9 | ('initialise_weights_from_existing', [], ''),
10 | ('update_weights', [], ''),
11 | ('num_threads', 8, 'how many threads get_batch should use to build training batches of data (default: 8)'),
12 | ('plot_attention_every_n_epochs', 0, 'set to 0 if you do not wish to plot attention matrices'),
13 | ('num_sentences_to_plot_attention', 0, 'number of sentences to plot attention matrices for'),
14 | ('concatenate_query', True, 'Concatenate [R Q] to get audio decoder input, or just take R?'),
15 | ('use_external_durations', False, 'Use externally supplied durations in 6th field of transcript for fixed attention matrix A'),
16 | ('text_encoder_type', 'DCTTS_standard', 'one of DCTTS_standard/none/minimal_feedforward'),
17 | ('merlin_label_dir', '', 'npy format phone labels converted from merlin using process_merlin_label.py'),
18 | ('merlin_lab_dim', 592, ''),
19 | ('bucket_data_by', 'text_length', 'One of audio_length/text_length. Label length will be used if merlin_label_dir is set and bucket_data_by=="text_length"'),
20 | ('history_type', 'DCTTS_standard', 'DCTTS_standard/fractional_position_in_phone/absolute_position_in_phone/minimal_history'),
21 | ('beta1', 0.9, 'ADAM setting - default value from original dctts repo'),
22 | ('beta2', 0.999, 'ADAM setting - default value from original dctts repo'),
23 | ('epsilon', 0.00000001, 'ADAM setting - default value from original dctts repo'),
24 | ('decay_lr', True, 'learning rate decay - default value from original dctts repo'),
25 | ('squash_output_t2m', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'),
26 | ('squash_output_ssrn', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'),
27 | ('store_synth_features', False, 'store .npy file of features alongside output .wav file'),
28 | ('turn_off_monotonic_for_synthesis',False, 'turns off FIA mechanism for synthesis, should be False during training'),
29 | ('lw_cdp',0.0,''),
30 | ('lw_ain',0.0,''),
31 | ('lw_aout',0.0,''),
32 | ('attention_guide_fa',False,'use attention guide as target - MSE attention loss'),
33 | ('select_central',False,'use only centre phones from Merlin labels'),
34 | ('MerlinTextEncWithPhoneEmbedding',False,'use Merlin labels and phone embeddings as input of TextEncoder')
35 | ]
36 |
37 | ## Intended to have hp as a module, but this doesn't allow pickling and therefore
38 | ## use in parallel processing. So, convert module into an object of arbitrary type
39 | ## ("Hyperparams") having same attributes:
40 | class Hyperparams(object):
41 | def __init__(self, module_object):
42 | for (key, value) in module_object.__dict__.items():
43 | if key.startswith('_'):
44 | continue
45 | if inspect.ismodule(value): # e.g. from os imported at top of config
46 | continue
47 | setattr(self, key, module_object.__dict__[key])
48 | def validate(self):
49 | '''
50 | Supply defaults of the appropriate type for various things if missing --
51 | TODO: Currently this is just to supply values for variables added later in development.
52 | Should we have some filling in of defaults like this more permanently, or should
53 | everything be explicitly set in a config file?
54 | '''
55 | for (varname, default_value, help_string) in CONFIG_DEFAULTS:
56 | if not hasattr(self, varname):
57 | setattr(self, varname, default_value)
58 |
59 |
60 | def load_config(config_fname):
61 | config = os.path.abspath(config_fname)
62 | assert os.path.isfile(config), 'Config file %s does not exist'%(config)
63 | settings = imp.load_source('config', config)
64 | hp = Hyperparams(settings)
65 | hp.validate()
66 | return hp
67 |
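# Usage sketch: load a config and read its settings (the config path here is just an
# illustration -- any file under config/ works the same way):
#
#     from configuration import load_config
#     hp = load_config('config/lj_tutorial.cfg')
#     print(hp.sr, hp.vocoder)       # attributes taken directly from the .cfg module
#     print(hp.num_threads)          # settings missing from the .cfg fall back to CONFIG_DEFAULTS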
68 |
69 |
70 | ### https://stackoverflow.com/questions/1325673/how-to-add-property-to-a-class-dynamically
71 |
72 | # class atdict(dict):
73 | # __getattr__= dict.__getitem__
74 | # __setattr__= dict.__setitem__
75 | # __delattr__= dict.__delitem__
76 |
--------------------------------------------------------------------------------
/convert_alignment_to_guide.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import math
4 | import glob
5 | import numpy as np
6 | import matplotlib
7 | import matplotlib.pyplot as plt
8 | from scipy.ndimage.filters import gaussian_filter
9 | from libutil import save_floats_as_8bit
10 | import tqdm
11 | from concurrent.futures import ProcessPoolExecutor
12 |
13 | gD = 0.2
14 | gW = 0.1
15 |
16 | DEBUG = False
17 |
18 | def main(file_name,out_file):
19 |
20 | F = np.load(file_name)
21 | F = np.transpose(F)
22 |
23 | ndim, tdim = F.shape # x: encoder (N) / y: decoder (T)
24 |
25 | ## Convert alignment to attention guide
26 | if DEBUG:
27 | D = np.zeros((ndim, tdim), dtype=np.float32) # diagonal guide
28 | W = np.zeros((ndim, tdim), dtype=np.float32) # alignment based guide
29 |
30 | for n_pos in range(ndim):
31 | for t_pos in range(tdim):
32 |
33 | n_pos_new = np.argmax(F[:,t_pos])
34 | W[n_pos,t_pos] = 1 - np.exp( -(n_pos / float(ndim) - n_pos_new / float(ndim) ) ** 2 / (2 * gW * gW))
35 |
36 | if DEBUG:
37 | D[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(tdim) - n_pos / float(ndim)) ** 2 / (2 * gD * gD))
38 |
39 | ## Smooth the alignment based guide
40 | S = gaussian_filter(W, sigma=1) # trying to blur
41 | # needs min max norm here to make sure 0-1
42 | S = ( S - S.min()) / ( S.max() - S.min() )
43 |
44 | save_floats_as_8bit(S, out_file)
45 |
46 | if DEBUG:
47 |
48 | D = ( D - D.min()) / ( D.max() - D.min() )
49 | W = ( W - W.min()) / ( W.max() - W.min() )
50 |
51 | for plot_type in range(0,3):
52 |
53 | ## Visualization
54 | if plot_type==0:
55 | M = F+D # add forced alignment path to help visualisation
56 | elif plot_type == 1:
57 | M = F+W # add forced alignment path to help visualisation
58 | elif plot_type == 2:
59 | M = F+S # add forced alignment path to help visualisation
60 |
61 | fig, ax = plt.subplots()
62 | im = ax.imshow(M)
63 | # plt.title('Diagonal (top), Alignment based (middle), Alignment based smoothed (bottom)', fontsize=8)
64 | fig.colorbar(im,fraction=0.035, pad=0.04)
65 | plt.ylabel('Encoder timestep', fontsize=12)
66 | plt.xlabel('Decoder timestep', fontsize=12)
67 |
68 | if plot_type==0:
69 | plt.title('Diagonal attention guide', fontsize=12)
70 | plt.savefig('attention_guide_diagonal.pdf')
71 | elif plot_type == 1:
72 | plt.title('Forced alignment based attention guide', fontsize=12)
73 | plt.savefig('attention_guide_fa.pdf')
74 | elif plot_type == 2:
75 | plt.title('Forced alignment based attention guide (smoothed)', fontsize=12)
76 | plt.savefig('attention_guide_fa_smooth.pdf')
77 |
78 | plt.show()
79 |
80 | if __name__ == '__main__':
81 |
82 | # Usage: python convert_alignment_to_guide.py <dir_of_forced_alignment_matrices> <output_dir_for_guides>
83 |
84 | inputdir = sys.argv[1]
85 | outputdir = sys.argv[2]
86 | ncores = 5
87 | executor = ProcessPoolExecutor(max_workers=ncores)
88 | futures = []
89 | for file in glob.iglob(inputdir + '/*.npy'):
90 | outfile = os.path.join(outputdir, os.path.basename(file))  # works whether or not outputdir ends in '/'
91 | futures.append(executor.submit(main, file, outfile))
92 |
93 | proc_list = [future.result() for future in tqdm.tqdm(futures)]
94 |
95 |
--------------------------------------------------------------------------------
/copy_synth_GL.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # /usr/bin/python2
3 |
4 | from __future__ import print_function
5 |
6 | import os
7 | import glob
8 |
9 | import numpy as np
10 | from utils import spectrogram2wav
11 | from data_load import load_data
12 | import soundfile
13 | from tqdm import tqdm
14 | from configuration import load_config
15 |
16 | from argparse import ArgumentParser
17 |
18 | from libutil import basename, safe_makedir
19 |
20 | def copy_synth_GL(hp, outdir):
21 |
22 | safe_makedir(outdir)
23 |
24 | dataset = load_data(hp, mode="synthesis")
25 | fnames, texts = dataset['fpaths'], dataset['texts']
26 | bases = [basename(fname) for fname in fnames]
27 |
28 | for base in bases:
29 | print("Working on file %s"%(base))
30 | mag = np.load(os.path.join(hp.full_audio_dir, base + '.npy'))
31 | wav = spectrogram2wav(hp, mag)
32 | soundfile.write(outdir + "/%s.wav"%(base), wav, hp.sr)
33 |
34 | def main_work():
35 |
36 | #################################################
37 |
38 | # ============= Process command line ============
39 |
40 | a = ArgumentParser()
41 | a.add_argument('-c', dest='config', required=True, type=str)
42 | a.add_argument('-o', dest='outdir', required=True, type=str)
43 | opts = a.parse_args()
44 |
45 | # ===============================================
46 |
47 | hp = load_config(opts.config)
48 | copy_synth_GL(hp, opts.outdir)
49 |
50 | if __name__=="__main__":
51 |
52 | main_work()
53 |
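# Usage sketch (the output directory here is arbitrary):
#     python copy_synth_GL.py -c config/lj_tutorial.cfg -o /tmp/copy_synth_GL
# This loads the stored magnitude spectrograms for the sentences selected by
# load_data(hp, mode="synthesis") and resynthesises waveforms with Griffin-Lim,
# i.e. copy-synthesis without any trained model.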
--------------------------------------------------------------------------------
/copy_synth_SSRN_GL.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # /usr/bin/python2
3 |
4 | from __future__ import print_function
5 |
6 | import os
7 | import glob
8 |
9 | import numpy as np
10 | from utils import spectrogram2wav
11 | from data_load import load_data
12 | import soundfile
13 | from tqdm import tqdm
14 | from configuration import load_config
15 |
16 | from argparse import ArgumentParser
17 |
18 | from libutil import basename, safe_makedir
19 | from synthesize import synth_mel2mag, list2batch, restore_latest_model_parameters
20 | from architectures import SSRNGraph
21 | import tensorflow as tf
22 |
23 | def copy_synth_SSRN_GL(hp, outdir):
24 |
25 | safe_makedir(outdir)
26 |
27 | dataset = load_data(hp, mode="synthesis")
28 | fnames, texts = dataset['fpaths'], dataset['texts']
29 | bases = [basename(fname) for fname in fnames]
30 | mels = [np.load(os.path.join(hp.coarse_audio_dir, base + '.npy')) for base in bases]
31 | lengths = [a.shape[0] for a in mels]
32 | mels = list2batch(mels, 0)
33 |
34 | g = SSRNGraph(hp, mode="synthesize"); print("Graph (ssrn) loaded")
35 |
36 | with tf.Session() as sess:
37 | sess.run(tf.global_variables_initializer())
38 | ssrn_epoch = restore_latest_model_parameters(sess, hp, 'ssrn')
39 |
40 | print('Run SSRN...')
41 | Z = synth_mel2mag(hp, mels, g, sess)
42 |
43 | for i, mag in enumerate(Z):
44 | print("Working on %s"%(bases[i]))
45 | mag = mag[:lengths[i]*hp.r,:] ### trim to generated length
46 | wav = spectrogram2wav(hp, mag)
47 | soundfile.write(outdir + "/%s.wav"%(bases[i]), wav, hp.sr)
48 |
49 |
50 |
51 |
52 | def main_work():
53 |
54 | #################################################
55 |
56 | # ============= Process command line ============
57 |
58 | a = ArgumentParser()
59 | a.add_argument('-c', dest='config', required=True, type=str)
60 | a.add_argument('-o', dest='outdir', required=True, type=str)
61 | opts = a.parse_args()
62 |
63 | # ===============================================
64 |
65 | hp = load_config(opts.config)
66 | copy_synth_SSRN_GL(hp, opts.outdir)
67 |
68 | if __name__=="__main__":
69 |
70 | main_work()
71 |
--------------------------------------------------------------------------------
/doc/festival_install.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Installing Festival
4 |
5 | ## Basic Festival install of Spanish and Scottish voices
6 | ```
7 | INSTALL_DIR=/afs/some/convenient/directory/festival
8 |
9 | mkdir -p $INSTALL_DIR
10 | cd $INSTALL_DIR
11 |
12 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz
13 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz
14 |
15 | tar xvf festival-2.4-release.tar.gz
16 | tar xvf speech_tools-2.4-release.tar.gz
17 |
18 | cd speech_tools
19 | ./configure --prefix=$INSTALL_DIR
20 | gmake
21 |
22 | cd ../festival
23 | ./configure --prefix=$INSTALL_DIR
24 | gmake
25 |
26 | cd $INSTALL_DIR
27 |
28 | ## Get spanish voice:
29 | wget http://festvox.org/packed/festival/1.4.1/festvox_ellpc11k.tar.gz
30 | tar xvf festvox_ellpc11k.tar.gz
31 |
32 | ## Get scottish english voice:
33 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz
34 | tar xvf festvox_cmu_us_awb_cg.tar.gz
35 |
36 | ## Get lexicons for the english voice:
37 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz
38 | tar xvf festlex_CMU.tar.gz
39 |
40 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz
41 | tar xvf festlex_POSLEX.tar.gz
42 |
43 | ## test
44 | cd $INSTALL_DIR/festival/bin
45 |
46 | ## run the *locally installed* festival (NB: initial ./ is important!)
47 | ./festival
48 | festival> (voice_el_diphone )
49 | festival> (SayText "La rica salsa canaria se llama mojo pic'on.")
50 |
51 | festival> (voice_cmu_us_awb_cg)
52 | festival> (SayText "If i'm speaking then installation actually went ok.")
53 |
54 |
55 | ## synthesise to file:
56 |
57 | (utt.save.wave (SynthText "La rica salsa canaria se llama mojo pic'on.") "/path/to/output/file.wav" 'riff)
58 | ```
59 |
60 |
61 | ## Combilex installation
62 |
63 | Given the file `combilex.tar` which contains 3 combilex surface form lexicons (gam, rpx, edi), install like this:
64 |
65 | ```
66 | cp combilex.tar $INSTALL_DIR/festival/
67 | cd $INSTALL_DIR/festival/
68 | tar xvf combilex.tar
69 | ```
70 |
71 | For processing the Nancy data, which contains a French word with a nasalised vowel present in the lexicon but not the phoneset definition, I needed to edit `$INSTALL_DIR/festival/lib/combilex_phones.scm` and add the line:
72 |
73 | ```
74 | (o~ + l 2 3 + n 0 0) ;; added missing nasalised vowel
75 | ```
76 |
77 | after the line:
78 |
79 | ```
80 | (@U + d 2 3 + 0 0 0) ;ou
81 | ```
82 |
83 |
84 | ## Cleaning up
85 |
86 | ```
87 | cd $INSTALL_DIR
88 | rm *.tar.gz
89 |
90 | cd $INSTALL_DIR/festival
91 | rm *.tar
92 | ```
93 |
94 |
95 | ## Note for UoE users
96 |
97 | If the installation is run over SSH, make sure you are on an *actual* machine and not on hare or bruegel, as these are just gateways and won't have a C compiler installed.
--------------------------------------------------------------------------------
/doc/recipe_WaveRNN.md:
--------------------------------------------------------------------------------
1 |
2 | ## DCTTS + WaveRNN
3 |
4 | To generate DCTTS samples using WaveRNN set the following flag in your config file:
5 | ```
6 | store_synth_features = True
7 | ```
8 | and run the normal DCTTS synthesis script:
9 | ```
10 | cd ophelia
11 | dctts_synth_dir='/tmp/dctts_synth_dir/'
12 | ./util/submit_tf.sh synthesize.py -c config/lj_tutorial.cfg -N 5 -odir ${dctts_synth_dir}
13 | ```
14 |
15 | This saves the generated magnitude files (.npy) and Griffin-Lim wavefiles in the directory `dctts_synth_dir`.
16 |
17 | To generate WaveRNN wavefiles from these magnitude files:
18 | ```
19 | cd Tacotron
20 | wavernn_synth_dir='/tmp/wavernn_synth_dir/'
21 | python synthesize_dctts_wavernn.py -i ${dctts_synth_dir} -o ${wavernn_synth_dir}
22 | ```
23 |
24 | ## Notes on Tacotron+WaveRNN code installation
25 |
26 | ```
27 | git clone https://github.com/cassiavb/Tacotron.git
28 | cd Tacotron/
29 | virtualenv --distribute --python=/usr/bin/python3.6 env
30 | source env/bin/activate
31 | pip install --upgrade pip
32 | pip install torch torchvision
33 | pip install -r requirements.txt
34 | pip install numba==0.48
35 | ```
36 |
37 |
--------------------------------------------------------------------------------
/doc/recipe_nancy.md:
--------------------------------------------------------------------------------
1 |
2 | # Blizzard Nancy corpus preparation and voice building
3 |
4 | For base voice for ICPhS in first instance.
5 |
6 |
7 | ```
8 | ### get the (publicly downloadable) data from CSTR datawww:-
9 | mkdir /group/project/cstr2/owatts/data/nancy/
10 | cd /group/project/cstr2/owatts/data/nancy/
11 | mkdir original
12 |
13 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/wavn.tgz ./original/
14 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.ssil.zip ./original/
15 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.zip ./original/
16 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/prompts.data ./original/
17 |
18 | ### compare checksums with published ones:
19 | md5sum ./original/*
20 | 0a4860a69bca56d7e9f8170306ff3709 ./original/lab.ssil.zip
21 | aeae7916d881a8eef255a6fe05e77e77 ./original/lab.zip
22 | 650b44f7252aed564d190b76a98cb490 ./original/prompts.data
23 | bb2a80dd1423f87ba12d2074af8e7a3f ./original/wavn.tgz
24 |
25 | ### ...all OK.
26 |
27 | cd ./original
28 | tar xvf wavn.tgz
29 | unzip lab.ssil.zip
30 | unzip lab.zip
31 |
32 | rm *.zip
33 | rm *.tgz
34 |
35 | ### information about the data:-
36 | http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/:
37 |
38 | * prompts.data - File with all of the prompt texts in filename order.
39 | * corrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as originally recorded, after some labels produced by the Lessac Front-end were corrected to reflect what the voice actor actually said.
40 | * uncorrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as produced by the Lessac Front-end from the prompts.data file without correction to the labels for what the voice actor actually said.
41 | * lab.ssil.zip - Zipped set of files with Lesseme labels that include the result of automated segmentation of the Lesseme labels in the corrected.gui file before the ssil label is collapsed into the preceding or following label.
42 | * lab.zip
43 |
44 | cd /disk/scratch/oliver/dc_tts_osw_clean
45 | mkdir /group/project/cstr2/owatts/data/nancy/derived/
46 | python ./script/normalise_level.py -i /group/project/cstr2/owatts/data/nancy/original/wavn/ \
47 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ -ncores 25
48 |
49 | ./util/submit_tf_cpu.sh ./script/split_speech.py \
50 | -w /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ \
51 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_trim/ -dB 30 -ncores 25 -trimonly
52 |
53 | rm -r /group/project/cstr2/owatts/data/nancy/derived/wav_norm/
54 |
55 |
56 |
57 | ## transcription (needed to add o~ to combilex rpx phoneset in Festival):-
58 |
59 | ## use existing scheme format transcript:-
60 | cp /group/project/cstr2/owatts/data/nancy/original/prompts.data /group/project/cstr2/owatts/data/nancy/derived/utts.data
61 | cd /group/project/cstr2/owatts/data/nancy/derived/
62 |
63 | CODEDIR=/disk/scratch/oliver/dc_tts_osw_clean
64 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival
65 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm
66 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv
67 |
68 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv
69 |
70 |
71 |
72 | ### get phone list to add to config:
73 | python ./script/check_transcript.py -i /group/project/cstr2/owatts/data/nancy/derived/transcript.csv -phone
74 |
75 |
76 |
77 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy_01.cfg -ncores 25
78 |
79 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy_01.cfg -ncores 25
80 |
81 |
82 | ## train
83 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m t2m
84 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m ssrn
85 | ```
--------------------------------------------------------------------------------
/doc/recipe_nancy2nick.md:
--------------------------------------------------------------------------------
1 |
2 | # Very naive speaker adaptation to convert Nancy to Nick
3 |
4 | The simplest way to train on a small database is to fine tune
5 | a speaker-dependent voice to the new database. This works
6 | surprisingly well even where the base voice is of a different
7 | sex and accent to the target speaker, as this example shows.
8 |
9 |
10 | ## Prepare nick data
11 |
12 |
13 |
14 | We will use a version of the nick data which has been downsampled to
15 | 16kHz with sox:
16 |
17 | ```
18 | /afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/
19 | ```
20 |
21 | It was converted from the 48kHz version here:
22 |
23 | ```
24 | /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/wav/
25 | ```
26 |
27 | ### waveforms
28 | ```
29 | INDIR=/afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/
30 | OUTDIR=/group/project/cstr2/owatts/data/nick16k/
31 |
32 | python ./script/normalise_level.py -i $INDIR -o $OUTDIR/wav_norm/ -ncores 25
33 |
34 | ./util/submit_tf_cpu.sh ./script/split_speech.py -w $OUTDIR/wav_norm/ -o $OUTDIR/wav_trim/ -dB 30 -ncores 25 -trimonly
35 | ```
36 |
37 | ### transcript
38 |
39 | Gather texts:
40 |
41 | ```
42 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/herald_* ; do
43 | BASE=`basename $FNAME .txt` ;
44 | TEXT=`cat $FNAME` ;
45 | echo "${BASE}||${TEXT}" ;
46 | done > /group/project/cstr2/owatts/data/nick16k/metadata.csv
47 |
48 |
49 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/hvd_* ; do
50 | BASE=`basename $FNAME .txt` ;
51 | TEXT=`cat $FNAME` ;
52 | echo "${BASE}||${TEXT}" ;
53 | done > /group/project/cstr2/owatts/data/nick16k/metadata_hvd.csv
54 |
55 |
56 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/mrt_* ; do
57 | BASE=`basename $FNAME .txt` ;
58 | TEXT=`cat $FNAME` ;
59 | echo "${BASE}||${TEXT}" ;
60 | done > /group/project/cstr2/owatts/data/nick16k/metadata_mrt.csv
61 | ```
62 |
63 | Phonetise:
64 |
65 | ```
66 | CODEDIR=`pwd`
67 | DATADIR=/group/project/cstr2/owatts/data/nick16k/
68 | python ./script/festival/csv2scm.py -i $DATADIR/metadata.csv -o $DATADIR/utts.data
69 |
70 | cd $DATADIR/
71 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival
72 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm
73 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv
74 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv
75 |
76 |
77 | cd $CODEDIR
78 | for TESTSET in mrt hvd ; do
79 | mkdir /group/project/cstr2/owatts/data/nick16k/test_${TESTSET}
80 | python ./script/festival/csv2scm.py -i $DATADIR/metadata_${TESTSET}.csv -o $DATADIR/test_${TESTSET}/utts.data
81 | done
82 |
83 |
84 | for TESTSET in mrt hvd ; do
85 | cd /group/project/cstr2/owatts/data/nick16k/test_${TESTSET}
86 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv
87 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv
88 | cp transcript.csv ../transcript_${TESTSET}.csv
89 | done
90 |
91 | ```
92 |
93 |
94 |
95 |
96 | ### features
97 | ```
98 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy2nick_01.cfg -ncores 15
99 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy2nick_01.cfg -ncores 15
100 | ```
101 |
102 |
103 | ## training
104 |
105 | Config `nancy2nick_01` updates all weights pretrained on the Nancy data:
106 |
107 | ```
108 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m t2m
109 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m ssrn
110 | ```
111 |
112 | Config `nancy2nick_02` updates all weights pretrained on the Nancy data, except
113 | those of the text encoder, which are kept frozen (a sketch of the relevant settings follows the command below):
114 |
115 | ```
116 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_02.cfg -m t2m
117 | ```
118 |
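The two configs themselves are not reproduced in this recipe; as a rough sketch, the kind of
settings involved are `initialise_weights_from_existing` and `update_weights` (both declared in
`configuration.py`). The values below are illustrative assumptions only, not the actual contents
of `nancy2nick_01` or `nancy2nick_02`:

```
## hypothetical values only -- adjust the path and scopes to your own setup
initialise_weights_from_existing = ['/path/to/work/nancy_01/train-t2m/checkpoint']
update_weights = []   ## assumed: list the variable scopes to update here; see configuration.py
```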
119 | Previously-trained SSRN can be softlinked without retraining:
120 | ```
121 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancy2nick_02/train-ssrn/
122 | ```
123 |
124 | ## Synthesis
125 | ```
126 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_01.cfg
127 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_02.cfg
128 | ```
129 |
130 |
--------------------------------------------------------------------------------
/doc/recipe_nancyplusnick.md:
--------------------------------------------------------------------------------
1 |
2 | # Train on Nancy + Nick
3 |
4 |
5 |
6 | ## Combine Nancy & Nick data already used for nancy_01 and nancy2nick_*
7 |
8 |
9 | Combine transcripts, adding speaker codes:
10 | ```
11 | DATADIR=/group/project/cstr2/owatts/data/nick_plus_nancy
12 | mkdir $DATADIR
13 |
14 | grep -v ^$ /group/project/cstr2/owatts/data/nick16k/transcript.csv | awk '{print $0"|nick"}' > $DATADIR/transcript.csv
15 | grep -v ^$ /group/project/cstr2/owatts/data/nancy/derived/transcript.csv | awk '{print $0"|nancy"}' | grep -v NYT00 >> $DATADIR/transcript.csv
16 |
17 | # (remove empty lines and NYT00 section for which attention guides were not made)
18 |
19 | cp /group/project/cstr2/owatts/data/nick16k/transcript_{mrt,hvd}.csv $DATADIR
20 | ```
21 |
22 | Combine acoustic features and attention guides:
23 |
24 | ```
25 | mkdir -p work/nancyplusnick_01/data/{attention_guides,full_mels,mels,mags}/
26 | for SUBDIR in attention_guides full_mels mels mags ; do
27 | for VOICE in nancy2nick_01 nancy_01 ; do
28 | ln -s ${PWD}/work/$VOICE/data/$SUBDIR/* work/nancyplusnick_01/data/$SUBDIR/ ;
29 | done
30 | done
31 | ```
32 |
33 |
34 |
35 | Prepare a config file and train (a sketch of the speaker-related setting follows the command below):
36 |
37 | ```
38 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_01.cfg -m t2m
39 | ```
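
A hedged sketch of the speaker-related setting the config presumably adds (the position names
come from the comment on `multispeaker` in the other configs; the particular choice shown is an
illustrative assumption, not the real `nancyplusnick_01` value):

```
multispeaker = ['text_encoder_input', 'audio_decoder_input']  ## where to inject speaker embeddings
```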
40 |
41 | Previously-trained SSRN can be softlinked without retraining:
42 | ```
43 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancyplusnick_01/train-ssrn/
44 | ```
45 |
46 |
47 | Synth
48 |
49 | ```
50 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nick
51 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nancy
52 | ```
53 |
54 |
55 |
56 |
57 | Fine tune on nick only:
58 |
59 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_02.cfg -m t2m
60 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_02/train-ssrn/
61 |
62 |
63 |
64 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_02.cfg -N 10 -speaker nick
65 |
66 |
67 |
68 | set attention loss weight very high:
69 |
70 |
71 | ```
72 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_03.cfg -m t2m
73 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_03/train-ssrn/
74 |
75 |
76 |
77 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_03.cfg -N 10 -speaker nick
78 |
79 |
80 | ```
81 |
82 |
83 |
84 | Try learning channel contributions for each speaker:
85 |
86 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_04_lcc.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_04_lcc/train-ssrn/
87 |
88 |
89 |
90 |
91 |
92 | Try learning channel contributions for each speaker + SD-phone embeddings:
93 |
94 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_05_lcc_sdpe.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_05_lcc_sdpe/train-ssrn/
95 |
96 |
97 |
--------------------------------------------------------------------------------
/doc/recipe_project.md:
--------------------------------------------------------------------------------
1 | ## Required tools
2 |
3 | ```
4 | git clone -b project https://github.com/CSTR-Edinburgh/ophelia.git
5 | git clone https://github.com/AvashnaGovender/Merlyn.git
6 | git clone https://github.com/AvashnaGovender/Tacotron.git
7 | ```
8 |
9 | ## DCTTS + WaveRNN
10 |
11 | See recipe [here.](recipe_WaveRNN.md)
12 |
13 | ## Attention experiments
14 |
15 | ### Obtaining forced alignment labels:
16 |
17 | Step 5a - Run forced alignment in:
18 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md
19 |
20 | ### FA as target
21 |
22 | Convert forced alignment labels to the forced alignment matrix:
23 |
24 | Step 6 - Get durations and create guides:
25 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md
26 |
27 | To use FA as target in DCTTS see config file:
28 | [fa_as_target.cfg](../config/project/fa_as_target.cfg)
29 |
30 | ### FA as guides
31 |
32 | Create attention guides from forced alignment matrix
33 |
34 | ```
35 | cd ophelia/
36 | python convert_alignment_to_guide.py fa_matrix_dir fa_guide_dir
37 | ```
38 |
39 | To use FA as guide in DCTTS see config file:
40 | [fa_as_guide.cfg](../config/project/fa_as_guide.cfg)
41 |
42 | ### FA as attention
43 |
44 | Add phone level duration to transcript.csv using forced alignment matrix
45 |
46 | ```
47 | cd ophelia/
48 | python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file
49 | ```
50 |
51 | To use FA as attention in DCTTS see config file:
52 | [fa_as_attention.cfg](../config/project/fa_as_attention.cfg)
53 |
54 | ## Text Encoder experiments
55 |
56 | ### Labels -/+ TE
57 |
58 | Convert state labels to 416 normalised label features (needs state labels and question file)
59 |
60 | ```
61 | cd Merlyn/
62 | python scripts/prepare_inputs.py
63 | ```
64 |
65 | To use Labels-TE in DCTTS see config file:
66 | [labels_minus_te.cfg](../config/project/labels_minus_te.cfg)
67 |
68 | To use Labels+TE in DCTTS see config file:
69 | [labels_plus_te.cfg](../config/project/labels_plus_te.cfg)
70 |
71 | To use C-Labels+TE in DCTTS see config file:
72 | [c-labels_plus_te.cfg](../config/project/c-labels_plus_te.cfg)
73 |
74 | ### PE&Labels + TE
75 |
76 | Create a new transcription file whose phone sequence comes from the labels, by replacing the phone sequence of the transcript file with the phone sequence from HTS-style labels:
77 | ```
78 | cd ophelia/
79 | ./labels2tra.sh labels_dir transcript_file new_transcript_file
80 | ```
81 |
82 | To use PE&Labels+TE set MerlinTextEncWithPhoneEmbedding to True in the config file.
83 |
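A minimal sketch of the relevant config line (the option is declared in `configuration.py`;
everything else stays as in your base config):

```
MerlinTextEncWithPhoneEmbedding = True  ## use Merlin labels plus phone embeddings as TextEncoder input
```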
84 | ## Gross error detection experiments
85 |
86 | To calculate CDP, Ain and Aout:
87 | ```
88 | cd ophelia/
89 | python calculate_CDP_Ain_Aout.py attention_matrix.npy
90 | ```
91 |
92 | ## FIA experiments
93 |
94 | To generate without FIA (forcibly incremental attention) set turn_off_monotonic_for_synthesis to True in the config file.
95 |
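Again as a minimal sketch (the option is declared in `configuration.py` and should remain False during training):

```
turn_off_monotonic_for_synthesis = True  ## disable the FIA mechanism at synthesis time
```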
96 | ## Tacotron experiments
97 |
98 | See readme in Tacotron repository: https://github.com/AvashnaGovender/Tacotron
99 |
--------------------------------------------------------------------------------
/doc/recipe_semisupervised.md:
--------------------------------------------------------------------------------
1 | ## Semisupervised training
2 |
3 | ### Babbler training
4 |
5 | Train 'babbler' (300 epochs only):
6 |
7 | ```
8 | ./util/submit_tf.sh ./train.py -c ./config/lj_03.cfg -m babbler
9 | ```
10 |
11 | Note that this wasn't implemented when I trained the voice before - hope it works OK:
12 |
13 | ```
14 | bucket_data_by = 'audio_length'
15 | ```
16 |
17 | In any case, text in transcript is ignored when training babbler.
18 |
19 | Copy existing SSRN:
20 |
21 | ```
22 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_03/
23 | ```
24 |
25 | Synthesise by babbling:
26 |
27 | ```
28 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_03.cfg -babble
29 | ```
30 |
31 | (Note: all sentences are currently seeded with the same start (all zeros from padding) so all babbled outputs will be identical)
32 |
33 |
34 | ### Fine tuning
35 |
36 | Fine tune with text as conventional model on transcribed subset (1000 sentences) of the data:
37 |
38 | ```
39 | ./util/submit_tf.sh ./train.py -c ./config/lj_05.cfg -m t2m
40 |
41 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_05/
42 |
43 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_05.cfg -N 10
44 | ```
45 |
46 | ### Baseline
47 |
48 | Compare training from scratch on 1000 sentences:
49 |
50 | ```
51 | ./util/submit_tf.sh ./train.py -c ./config/lj_04.cfg -m t2m
52 |
53 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_04/
54 |
55 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_04.cfg -N 10
56 |
57 | ```
58 |
59 |
60 |
61 |
--------------------------------------------------------------------------------
/fig/aaa:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/fig/attention.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/attention.gif
--------------------------------------------------------------------------------
/fig/training_curves.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/training_curves.png
--------------------------------------------------------------------------------
/labels2tra.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # (bash is required: the script uses $'...' quoting and substring expansion with a negative length)
3 | # Replace phone sequence of transcript file with phone sequence from HTS style labels
4 | # Usage: ./labels2tra.sh labels_dir transcript_file new_transcript_file
5 |
6 | labelsdir=$1
7 | trafile=$2
8 | newtrafile=$3
9 |
10 | cat $trafile | while IFS=$'|' read -r file nada text ps
11 | do
12 |
13 | grep -r "\[2\]" $labelsdir/$file.lab | sed 's/\+.*//' | sed 's/.*-//' > ~/tmp/test.txt
14 |
15 | newps=`cat ~/tmp/test.txt | tr '\n' ' '`
16 |
17 | # Rebuild the transcript line, wrapping the new phone sequence in <_START_> / <_END_> (the first and last 4 characters of $newps are dropped)
18 | echo $file"||"$text"|<_START_> "${newps:4:-4}"<_END_>"
19 |
20 | done > $newtrafile
21 |
--------------------------------------------------------------------------------
/libutil.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #!/usr/bin/env python2
3 |
4 |
5 | import os
6 | import re
7 | import numpy as np
8 | import codecs
9 |
10 |
11 |
12 | def safe_makedir(dir):
13 | if not os.path.isdir(dir):
14 | os.makedirs(dir)
15 |
16 | def writelist(seq, fname):
17 | path, _ = os.path.split(fname)
18 | safe_makedir(path)
19 | f = codecs.open(fname, 'w', encoding='utf8')
20 | f.write('\n'.join(seq) + '\n')
21 | f.close()
22 |
23 | def readlist(fname):
24 | f = codecs.open(fname, 'r', encoding='utf8')
25 | data = f.readlines()
26 | f.close()
27 | data = [line.strip('\n') for line in data]
28 | data = [l for l in data if l != '']
29 | return data
30 |
31 | def read_norm_data(fname, stream_names):
32 | out = {}
33 | vals = np.loadtxt(fname)
34 | mean_ix = 0
35 | for stream in stream_names:
36 | std_ix = mean_ix + 1
37 | out[stream] = (vals[mean_ix], vals[std_ix])
38 | mean_ix += 2
39 | return out
40 |
41 | def makedirecs(direcs):
42 | for direc in direcs:
43 | if not os.path.isdir(direc):
44 | os.makedirs(direc)
45 |
46 | def basename(fname):
47 | path, name = os.path.split(fname)
48 | base = re.sub('\.[^\.]+\Z','',name)
49 | return base
50 |
51 | get_basename = basename # alias
52 | def get_speech(infile, dimension):
53 | f = open(infile, 'rb')
54 | speech = np.fromfile(f, dtype=np.float32)
55 | f.close()
56 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension)
57 | speech = speech.reshape((-1, dimension))
58 | return speech
59 |
60 | def put_speech(m_data, filename):
61 | m_data = np.array(m_data, 'float32') # Ensuring float32 output
62 | fid = open(filename, 'wb')
63 | m_data.tofile(fid)
64 | fid.close()
65 | return
66 |
67 | def save_floats_as_8bit(data, fname):
68 | '''
69 | Lossily store data in range [0, 1] with 8 bit resolution
70 | '''
71 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max())
72 |
73 | maxval = np.iinfo(np.uint8).max
74 | data_scaled = (data * maxval).astype(np.uint8)
75 | np.save(fname, data_scaled)
76 |
77 | def read_floats_from_8bit(fname):
78 | maxval = np.iinfo(np.uint8).max
79 | data = (np.load(fname).astype(np.float32)) / maxval
80 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max())
81 | return data
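# Round-trip sketch: save_floats_as_8bit(arr, 'arr.npy') followed by read_floats_from_8bit('arr.npy')
# recovers values in [0, 1] to within 1/255, the quantisation step of the 8-bit encoding.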
82 |
83 |
--------------------------------------------------------------------------------
/logger_setup.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import os
3 | import sys
4 | import subprocess
5 | import socket
6 | import numpy
7 | import tensorflow
8 | from libutil import safe_makedir
9 |
10 | def logger_setup(logdir):
11 |
12 | safe_makedir(logdir)
13 |
14 | ## Get new unique named logfile for each run:
15 | i = 1
16 | while True:
17 | logfile = os.path.join(logdir, 'log_{:06d}.txt'.format(i))
18 | if not os.path.isfile(logfile):
19 | break
20 | else:
21 | i += 1
22 |
23 | logger = logging.getLogger()
24 | logger.setLevel(logging.DEBUG)
25 |
26 | formatter = logging.Formatter('%(asctime)s | %(threadName)-3.3s | %(levelname)-1.1s | %(message)s')
27 |
28 | fh = logging.FileHandler(logfile)
29 | fh.setLevel(logging.DEBUG)
30 | fh.setFormatter(formatter)
31 | logger.addHandler(fh)
32 |
33 | ch = logging.StreamHandler()
34 | ch.setLevel(logging.DEBUG)
35 | ch.setFormatter(formatter)
36 | logger.addHandler(ch)
37 |
38 | logger.info('Set up logger to write to console and %s'%(logfile))
39 |
40 | log_environment_information(logger, logfile)
41 |
42 |
43 | def log_environment_information(logger, logfile):
44 | ### This function's contents adjusted from Merlin (https://github.com/CSTR-Edinburgh/merlin/blob/master/src/run_merlin.py)
45 | ### TODO: other things to log here?
46 | logger.info('Installation information:')
47 | logger.info(' Merlin directory: '+os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), os.pardir)))
48 | logger.info(' PATH:')
49 | env_PATHs = os.getenv('PATH')
50 | if env_PATHs:
51 | env_PATHs = env_PATHs.split(':')
52 | for p in env_PATHs:
53 | if len(p)>0: logger.info(' '+p)
54 | logger.info(' LD_LIBRARY_PATH:')
55 | env_LD_LIBRARY_PATHs = os.getenv('LD_LIBRARY_PATH')
56 | if env_LD_LIBRARY_PATHs:
57 | env_LD_LIBRARY_PATHs = env_LD_LIBRARY_PATHs.split(':')
58 | for p in env_LD_LIBRARY_PATHs:
59 | if len(p)>0: logger.info(' '+p)
60 | logger.info(' Python version: '+sys.version.replace('\n',''))
61 | logger.info(' PYTHONPATH:')
62 | env_PYTHONPATHs = os.getenv('PYTHONPATH')
63 | if env_PYTHONPATHs:
64 | env_PYTHONPATHs = env_PYTHONPATHs.split(':')
65 | for p in env_PYTHONPATHs:
66 | if len(p)>0:
67 | logger.info(' '+p)
68 | logger.info(' Numpy version: '+numpy.version.version)
69 | logger.info(' Tensorflow version: '+tensorflow.__version__)
70 | #logger.info(' THEANO_FLAGS: '+os.getenv('THEANO_FLAGS'))
71 | #logger.info(' device: '+theano.config.device)
72 |
73 | # Check for the presence of git
74 | ret = os.system('git status > /dev/null')
75 | if ret==0:
76 | logger.info(' Git is available in the working directory:')
77 | git_describe = subprocess.Popen(['git', 'describe', '--tags', '--always'], stdout=subprocess.PIPE).communicate()[0][:-1]
78 | logger.info(' DC_TTS_OSW version: {}'.format(git_describe))
79 | git_branch = subprocess.Popen(['git', 'rev-parse', '--abbrev-ref', 'HEAD'], stdout=subprocess.PIPE).communicate()[0][:-1]
80 | logger.info(' branch: {}'.format(git_branch))
81 | git_diff = subprocess.Popen(['git', 'diff', '--name-status'], stdout=subprocess.PIPE).communicate()[0]
82 | if sys.version_info.major >= 3:
83 | git_diff = git_diff.decode('utf-8')
84 | git_diff = git_diff.replace('\t',' ').split('\n')
85 | logger.info(' diff to DC_TTS_OSW version:')
86 | for filediff in git_diff:
87 | if len(filediff)>0: logger.info(' '+filediff)
88 | logger.info(' (all diffs logged in '+os.path.basename(logfile)+'.gitdiff'+')')
89 | os.system('git diff > '+logfile+'.gitdiff')
90 |
91 | logger.info('Execution information:')
92 | logger.info(' HOSTNAME: '+socket.getfqdn())
93 | logger.info(' USER: '+os.getenv('USER'))
94 | logger.info(' PID: '+str(os.getpid()))
95 | PBS_JOBID = os.getenv('PBS_JOBID')
96 | if PBS_JOBID:
97 | logger.info(' PBS_JOBID: '+PBS_JOBID)
98 |
99 |
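100 | ## Minimal usage sketch (illustrative; the log directory name is made up):
101 | ##
102 | ##   import logging
103 | ##   from logger_setup import logger_setup
104 | ##   logger_setup('./work/my_experiment/log')
105 | ##   logging.info('this goes to the console and to log_NNNNNN.txt in that directory')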
--------------------------------------------------------------------------------
/objective_measures.py:
--------------------------------------------------------------------------------
1 |
2 | '''
3 | TODO: logSpecDbDist appropriate? (both mels & mags?)
4 | TODO: compute output length error?
5 | TODO: work out best way of handling the fact that predicted *coarse* features
6 | can correspond to text but be arbitrarily 'out of phase' with reference.
7 | Multiple references? Or compare against full-time resolution reference?
8 | '''
9 | import logging
10 | from mcd import dtw
11 | import mcd.metrics_fast as mt
12 |
13 | def compute_dtw_error(reference, predictions):
14 | minCostTot = 0.0
15 | framesTot = 0
16 | for (nat, synth) in zip(reference, predictions):
17 | nat, synth = nat.astype('float64'), synth.astype('float64')
18 | minCost, path = dtw.dtw(nat, synth, mt.logSpecDbDist)
19 | frames = len(nat)
20 | minCostTot += minCost
21 | framesTot += frames
22 | mean_score = minCostTot / framesTot
23 | print ('overall LSD = %f (%s frames nat/synth)' % (mean_score, framesTot))
24 | return mean_score
25 |
26 | def compute_simple_LSD(reference_list, prediction_list):
27 | costTot = 0.0
28 | framesTot = 0
29 | for (synth, nat) in zip(prediction_list, reference_list):
30 | #synth = prediction_tensor[i,:,:].astype('float64')
31 | # len_nat = len(nat)
32 | assert len(synth) == len(nat)
33 | #synth = synth[:len_nat, :]
34 | nat = nat.astype('float64')
35 | synth = synth.astype('float64')
36 | cost = sum([
37 | mt.logSpecDbDist(natFrame, synthFrame)
38 | for natFrame, synthFrame in zip(nat, synth)
39 | ])
40 | framesTot += len(nat)
41 | costTot += cost
42 | return costTot / framesTot
43 |
44 |
45 |
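46 | ## Illustrative usage sketch (file names are made up; each feature file is a
47 | ## (frames x dims) numpy array such as those written by prepare_acoustic_features.py):
48 | ##
49 | ##   import numpy as np
50 | ##   reference = [np.load(f) for f in ['ref/utt01.npy', 'ref/utt02.npy']]
51 | ##   predictions = [np.load(f) for f in ['synth/utt01.npy', 'synth/utt02.npy']]
52 | ##   lsd = compute_dtw_error(reference, predictions)   ## DTW-aligned log spectral distortion
53 | ##   # compute_simple_LSD(reference, predictions)      ## frame-by-frame; lengths must already match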
--------------------------------------------------------------------------------
/plot_loss.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - February 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | import os
10 | import fileinput
11 | from argparse import ArgumentParser
12 |
13 | from libutil import readlist
14 | import numpy as np
15 | import pylab as pl
16 | def main_work():
17 |
18 | #################################################
19 |
20 | # ======== Get stuff from command line ==========
21 |
22 | a = ArgumentParser()
23 | a.add_argument('-o', dest='outfile', required=True)
24 | a.add_argument('-l', dest='logfile', required=True)
25 | opts = a.parse_args()
26 |
27 | # ===============================================
28 |
29 | log = readlist(opts.logfile)
30 | log = [line.split('|') for line in log]
31 | log = [line[3].strip() for line in log if len(line) >=4]
32 |
33 | #validation = [line.replace('validation epoch ', '') for line in log if line.startswith('validation epoch')]
34 | #train = [line.replace('train epoch ', '') for line in log if line.startswith('validation epoch')]
35 |
36 | validation = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('validation epoch')]
37 | train = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('train epoch')]
38 | validation = np.array(validation, dtype=float)
39 | train = np.array(train, dtype=float)
40 | print train.shape
41 | print validation.shape
42 |
43 | pl.subplot(211)
44 | pl.plot(validation.flatten())
45 | pl.subplot(212)
46 | pl.plot(train[:,:4])
47 | pl.savefig(opts.outfile) ; pl.show()  ## write the figure to the required -o target as well as displaying it
48 | if __name__=="__main__":
49 |
50 | main_work()
51 |
52 |
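53 | ## Illustrative usage (log path is made up -- point -l at a log file written by logger_setup.py):
54 | ##   python plot_loss.py -l work/my_experiment/log/log_000001.txt -o loss_curves.pdf
55 | ## Top panel: validation losses per epoch; bottom panel: first four columns of the training losses.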
--------------------------------------------------------------------------------
/prepare_acoustic_features.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #! /usr/bin/env python2
3 | '''
4 | Based on code by kyubyong park at https://www.github.com/kyubyong/dc_tts
5 | '''
6 |
7 | from __future__ import print_function
8 |
9 | import os
10 | import sys
11 | import glob
12 | from argparse import ArgumentParser
13 | from concurrent.futures import ProcessPoolExecutor
14 | import numpy as np
15 |
16 | import tqdm
17 |
18 | from libutil import safe_makedir
19 | from configuration import load_config
20 | from utils import load_spectrograms
21 |
22 | def proc(fpath, hp):
23 |
24 | if not os.path.isfile(fpath):
25 | return
26 |
27 | fname, mel, mag, full_mel = load_spectrograms(hp, fpath)
28 | np.save("{}/{}".format(hp.coarse_audio_dir, fname.replace("wav", "npy")), mel)
29 | np.save("{}/{}".format(hp.full_audio_dir, fname.replace("wav", "npy")), mag)
30 | np.save("{}/{}".format(hp.full_mel_dir, fname.replace("wav", "npy")), full_mel)
31 |
32 |
33 | def main_work():
34 |
35 | #################################################
36 |
37 | # ============= Process command line ============
38 |
39 | a = ArgumentParser()
40 | a.add_argument('-c', dest='config', required=True, type=str)
41 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing')
42 | opts = a.parse_args()
43 |
44 | # ===============================================
45 |
46 | hp = load_config(opts.config)
47 |
48 | fpaths = sorted(glob.glob(hp.waveforms + '/*.wav'))
49 |
50 | safe_makedir(hp.coarse_audio_dir)
51 | safe_makedir(hp.full_audio_dir)
52 | safe_makedir(hp.full_mel_dir)
53 |
54 | executor = ProcessPoolExecutor(max_workers=opts.ncores)
55 | futures = []
56 | for fpath in fpaths:
57 | futures.append(executor.submit(
58 | proc, fpath, hp))
59 | proc_list = [future.result() for future in tqdm.tqdm(futures)]
60 |
61 |
62 | if __name__=="__main__":
63 |
64 | main_work()
65 |
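66 | ## Illustrative usage (config name assumed -- any .cfg defining hp.waveforms and the three
67 | ## output directories coarse_audio_dir / full_audio_dir / full_mel_dir will do):
68 | ##   python prepare_acoustic_features.py -c ./config/<yourconfig>.cfg -ncores 8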
--------------------------------------------------------------------------------
/prepare_attention_guides.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #!/usr/bin/env python2
3 |
4 | from __future__ import print_function
5 |
6 | from utils import get_attention_guide
7 | import os
8 | from data_load import load_data
9 | import numpy as np
10 | import tqdm
11 | from concurrent.futures import ProcessPoolExecutor
12 |
13 | from argparse import ArgumentParser
14 |
15 | from libutil import basename, save_floats_as_8bit, safe_makedir
16 | from configuration import load_config
17 |
18 | def proc(fpath, text_length, hp):
19 |
20 | base = basename(fpath)
21 | melfile = hp.coarse_audio_dir + os.path.sep + base + '.npy'
22 | attfile = hp.attention_guide_dir + os.path.sep + base # without '.npy'
23 | if not os.path.isfile(melfile):
24 | print('file %s not found'%(melfile))
25 | return
26 | speech_length = np.load(melfile).shape[0]
27 | att = get_attention_guide(text_length, speech_length, g=hp.g)
28 | save_floats_as_8bit(att, attfile)
29 |
30 |
31 | def main_work():
32 |
33 | #################################################
34 |
35 | # ============= Process command line ============
36 |
37 | a = ArgumentParser()
38 | a.add_argument('-c', dest='config', required=True, type=str)
39 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing')
40 | opts = a.parse_args()
41 |
42 | # ===============================================
43 |
44 | hp = load_config(opts.config)
45 | assert hp.attention_guide_dir
46 |
47 | dataset = load_data(hp)
48 | fpaths, text_lengths = dataset['fpaths'], dataset['text_lengths']
49 |
50 | if hp.merlin_label_dir:
51 | text_lengths = dataset['label_lengths']
52 |
53 | assert os.path.exists(hp.coarse_audio_dir)
54 | safe_makedir(hp.attention_guide_dir)
55 |
56 | executor = ProcessPoolExecutor(max_workers=opts.ncores)
57 | futures = []
58 | for (fpath, text_length) in zip(fpaths, text_lengths):
59 | futures.append(executor.submit(proc, fpath, text_length, hp))
60 | proc_list = [future.result() for future in tqdm.tqdm(futures)]
61 |
62 |
63 | if __name__=="__main__":
64 |
65 | main_work()
66 |
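67 | ## Illustrative usage (config name assumed). Run after prepare_acoustic_features.py, since the
68 | ## guide for each utterance is sized from the number of frames in its coarse mel file:
69 | ##   python prepare_attention_guides.py -c ./config/<yourconfig>.cfg -ncores 8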
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.16.4
2 | absl-py==0.6.1
3 | astor==0.7.1
4 | audioread==2.1.6
5 | backports.functools-lru-cache==1.5
6 | backports.weakref==1.0.post1
7 | bashplotlib==0.6.5
8 | cffi==1.11.5
9 | cycler==0.10.0
10 | decorator==4.3.0
11 | enum34==1.1.6
12 | funcsigs==1.0.2
13 | futures==3.2.0
14 | gast==0.2.0
15 | grpcio==1.16.1
16 | h5py==2.8.0
17 | htk-io==0.5
18 | joblib==0.13.0
19 | Keras-Applications==1.0.6
20 | Keras-Preprocessing==1.0.5
21 | kiwisolver==1.0.1
22 | librosa==0.6.2
23 | llvmlite==0.25.0
24 | Markdown==3.0.1
25 | matplotlib==2.2.3
26 | mcd==0.4
27 | mock==2.0.0
28 | numba==0.40.1
29 | pbr==5.1.1
30 | protobuf==3.6.1
31 | pycparser==2.19
32 | pyparsing==2.3.0
33 | python-dateutil==2.7.5
34 | pytz==2018.7
35 | regex==2019.2.3
36 | resampy==0.2.1
37 | scikit-learn==0.20.0
38 | scipy==1.1.0
39 | singledispatch==3.4.0.3
40 | six==1.11.0
41 | SoundFile==0.10.2
42 | subprocess32==3.5.3
43 | tensorboard==1.12.0
44 | tensorflow-gpu==1.12.0
45 | termcolor==1.1.0
46 | tqdm==4.28.1
47 | Werkzeug==0.14.1
48 |
--------------------------------------------------------------------------------
/script/add_speaker.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | from argparse import ArgumentParser
4 |
5 |
6 | def main_work():
7 |
8 | a = ArgumentParser()
9 | a.add_argument('-i', dest='infile', required=True)
10 | a.add_argument('-o', dest='outfile', required=True)
11 | opts = a.parse_args()
12 |
13 | outf = open(opts.outfile, 'w')
14 | transcript = open(opts.infile, 'r').read().split('\n')
15 |
16 | for line in transcript:
17 | spk = line.split('_')[0]
18 | outf.writelines(line+'||'+spk+'\n')
19 | outf.close()
20 |
21 | if __name__ == "__main__":
22 | main_work()
23 |
--------------------------------------------------------------------------------
/script/festival/csv2scm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - February 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | import os
10 | from argparse import ArgumentParser
11 | import codecs
12 |
13 |
14 | def main_work():
15 |
16 | #################################################
17 |
18 | # ======== Get stuff from command line ==========
19 |
20 | a = ArgumentParser()
21 | a.add_argument('-i', dest='infile', required=True, \
22 | help= "File in LJ speech transcription format: https://keithito.com/LJ-Speech-Dataset/")
23 | a.add_argument('-o', dest='outfile', required=True, \
24 | help= "File in Festival utts.data scheme format")
25 | opts = a.parse_args()
26 |
27 | # ===============================================
28 |
29 | f = codecs.open(opts.infile, 'r', encoding='utf8')
30 | lines = f.readlines()
31 | f.close()
32 |
33 | f = codecs.open(opts.outfile, 'w', encoding='utf8')
34 | for line in lines:
35 | fields = line.strip('\n\r ').split('|')
36 | assert len(fields) >= 3
37 | name, _, text = fields[:3]
38 | text = text.replace('"', '\\"')
39 | f.write('(%s "%s")\n'%(name, text))
40 | f.close()
41 |
42 |
43 |
44 |
45 | if __name__=="__main__":
46 |
47 | main_work()
48 |
49 |
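50 | ## Illustrative example (made-up utterance): an input line such as
51 | ##   utt_001|Some raw text.|Some normalised text.
52 | ## is written out as the Festival utts.data entry
53 | ##   (utt_001 "Some normalised text.")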
--------------------------------------------------------------------------------
/script/festival/fix_transcript.py:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 | import sys, codecs, re, regex
9 |
10 |
11 | f = codecs.open(sys.argv[1], encoding='utf8', errors='ignore')
12 | text = f.read()
13 | f.close()
14 |
15 | lines = text.split('\n')
16 |
17 |
18 | all_seps = set()
19 | for line in lines:
20 | phones = line.split('|')[-1].strip('\n\r ').split(' ')
21 | seps = set([phone for phone in phones if phone.startswith('<') and phone.endswith('>')])
22 | all_seps.update(seps)
23 | #print seps
24 | #print phones
25 |
26 |
27 |
28 | # print all_seps
29 |
30 | badseps = []
31 | for sep in all_seps:
32 |
33 | if regex.match('\A[\p{P}\p{Z}]+\Z', sep.strip('<>')):
34 | #puncs.append(sep)
35 | pass
36 | elif sep in ["<'s>", '<_END_>', '<_START_>']:
37 | pass
38 | else:
39 | badseps.append(sep)
40 |
41 | for sep in badseps:
42 | text = text.replace(sep, '<>')
43 |
44 |
45 |
46 |
47 |
48 | #bad_strings = sys.argv[2:]
49 |
50 |
51 | # for bad in bad_strings:
52 | # lines = lines.replace('<'+bad+'>', '<>')
53 |
54 | print text.encode('utf8')
--------------------------------------------------------------------------------
/script/festival/make_rich_phones.scm:
--------------------------------------------------------------------------------
1 |
2 |
3 | ; Usage:
4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct
5 | ; lexicon installed:
6 |
7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival
8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm
9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv
10 |
11 |
12 |
13 | ;;; Taken from build_unitsel:
14 | ;; Is this the last segment in a word [more complicated than you would think]
15 | (define (seg_word_final seg)
16 | "(seg_word_final seg)
17 | Is this segment word final?"
18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
19 | (silence (car (cadr (car (PhoneSet.description '(silences))))))
20 | next_seg_word)
21 | (if (item.next seg)
22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure))))
23 | (if (or (equal? this_seg_word next_seg_word)
24 | (string-equal (item.feat seg "name") silence))
25 | nil
26 | t)))
27 |
28 |
29 | (define (following_punc seg)
30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token))
32 | (if this_seg_token
33 | ( format t "<%s> " (item.name this_seg_token))
34 | ( format t "<> " ) %% else
35 | )
36 | )
37 |
38 |
39 | (define (print_phones_punc utt)
40 | (if (utt.relation.present utt 'Segment)
41 | (begin
42 | ( format t "<_START_> " )
43 | (mapcar
44 | (lambda (x)
45 | ( if (not (equal? (item.name x) "#") ) ;; TODO: unhardcode silent symbols
46 | (format t "%s " (item.name x) )
47 | )
48 | (if (seg_word_final x)
49 | (following_punc x)
50 | )
51 | )
52 | (utt.relation.items utt 'Segment))
53 | ( format t "<_END_>" )
54 | )
55 | (format t "Utterance contains no Segments\n"))
56 | nil)
57 |
58 | ;;; Taken from build_unitsel:
59 | ;; Do the linguistic side of synthesis.
60 | (define (utt.synth_toSegment_text utt)
61 | (Initialize utt)
62 | (Text utt)
63 | (Token_POS utt) ;; when utt.synth is called
64 | (Token utt)
65 | (POS utt)
66 | (Phrasify utt)
67 | (Word utt)
68 | (Pauses utt)
69 | (Intonation utt)
70 | (PostLex utt))
71 |
72 | (define (synth_utts utts_data_file)
73 | (set! uttlist (load utts_data_file t))
74 | (mapcar
75 | (lambda (line)
76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize
77 | (format t "___KEEP___%s|" (car line))
78 | (format t "%s|" (car (cdr line)) )
79 | (format t "%s|" (car (cdr line)) )
80 | (print_phones_punc utt)
81 | ; (utt.relation.print utt 'Text)
82 | ( format t "\n" )
83 | t)
84 | uttlist)
85 | )
86 |
87 |
88 | (if (not (member_string 'unilex-rpx (lex.list)))
89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm"))))
90 |
91 | ; (if (not (member_string 'cmudict (lex.list)))
92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm"))))
93 |
94 |
95 | ;(require 'unilex_phones)
96 | ;(lex.select 'unilex-rpx)
97 |
98 |
99 |
100 |
101 | (if (not (member_string 'combilex-rpx (lex.list)))
102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm"))))
103 | (lex.select 'combilex-rpx)
104 | (require 'postlex)
105 | (set! postlex_rules_hooks (list postlex_apos_s_check
106 | postlex_intervoc_r
107 | postlex_the_vs_thee
108 | postlex_a
109 | ))
110 | ; (set! postlex_rules_hooks (list
111 | ; postlex_intervoc_r
112 | ; ))
113 |
114 |
115 |
116 | ; (lex.select 'cmudict)
117 |
118 |
119 |
120 | ;(set! utt1 (Utterance Text "Hello there, world, isn't it a nice day?!"))
121 | ;(utt.synth utt1)
122 | ;(print_phones_punc utt1)
123 |
124 | (synth_utts "./utts.data")
125 |
126 |
127 |
128 |
129 |
130 |
131 |
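132 | ; Each line kept by the grep/sed pipeline above becomes one row of transcript.csv, of the
133 | ; rough shape (illustrative; the phone symbols depend on the lexicon selected above):
134 | ;   uttname|input text|input text|<_START_> phone phone <> phone phone <,> ... <_END_>
135 | ; where the <...> markers carry any punctuation following a word (<> when there is none).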
--------------------------------------------------------------------------------
/script/festival/make_rich_phones_cmulex.scm:
--------------------------------------------------------------------------------
1 |
2 |
3 | ; Usage:
4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct
5 | ; lexicon installed:
6 |
7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival
8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones_cmulex.scm
9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv
10 |
11 |
12 |
13 | ;;; Taken from build_unitsel:
14 | ;; Is this the last segment in a word [more complicated than you would think]
15 | (define (seg_word_final seg)
16 | "(seg_word_final seg)
17 | Is this segment word final?"
18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
19 | (silence (car (cadr (car (PhoneSet.description '(silences))))))
20 | next_seg_word)
21 | (if (item.next seg)
22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure))))
23 | (if (or (equal? this_seg_word next_seg_word)
24 | (string-equal (item.feat seg "name") silence))
25 | nil
26 | t)))
27 |
28 |
29 | (define (following_punc seg)
30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token))
32 | (if this_seg_token
33 | ( format t "<%s> " (item.name this_seg_token))
34 | ( format t "<> " ) %% else
35 | )
36 | )
37 |
38 |
39 | (define (print_phones_punc utt)
40 | (if (utt.relation.present utt 'Segment)
41 | (begin
42 | ( format t "<_START_> " )
43 | (mapcar
44 | (lambda (x)
45 | ( if (not (equal? (item.name x) "pau") ) ;; TODO: unhardcode silent symbols
46 | (format t "%s " (item.name x) )
47 | )
48 | (if (seg_word_final x)
49 | (following_punc x)
50 | )
51 | )
52 | (utt.relation.items utt 'Segment))
53 | ( format t "<_END_>" )
54 | )
55 | (format t "Utterance contains no Segments\n"))
56 | nil)
57 |
58 | ;;; Taken from build_unitsel:
59 | ;; Do the linguistic side of synthesis.
60 | (define (utt.synth_toSegment_text utt)
61 | (Initialize utt)
62 | (Text utt)
63 | (Token_POS utt) ;; when utt.synth is called
64 | (Token utt)
65 | (POS utt)
66 | (Phrasify utt)
67 | (Word utt)
68 | (Pauses utt)
69 | (Intonation utt)
70 | (PostLex utt))
71 |
72 | (define (synth_utts utts_data_file)
73 | (set! uttlist (load utts_data_file t))
74 | (mapcar
75 | (lambda (line)
76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize
77 | (format t "___KEEP___%s||" (car line))
78 | ;(format t "%s|" (car (cdr line)) )
79 | (format t "%s|" (car (cdr line)) )
80 | (print_phones_punc utt)
81 | ; (utt.relation.print utt 'Text)
82 | ( format t "\n" )
83 | t)
84 | uttlist)
85 | )
86 |
87 |
88 |
89 |
90 |
91 |
92 | ; (if (not (member_string 'combilex-rpx (lex.list)))
93 | ; (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm"))))
94 | (lex.select 'cmu)
95 |
96 |
97 | (synth_utts "./utts.data")
98 |
99 |
100 |
101 |
102 |
103 |
104 |
--------------------------------------------------------------------------------
/script/festival/make_rich_phones_combirpx_noplex.scm:
--------------------------------------------------------------------------------
1 |
2 |
3 | ; Usage:
4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct
5 | ; lexicon installed:
6 |
7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival
8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones_combirpx_noplex.scm
9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv
10 |
11 |
12 |
13 | ;;; Taken from build_unitsel:
14 | ;; Is this the last segment in a word [more complicated than you would think]
15 | (define (seg_word_final seg)
16 | "(seg_word_final seg)
17 | Is this segment word final?"
18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
19 | (silence (car (cadr (car (PhoneSet.description '(silences))))))
20 | next_seg_word)
21 | (if (item.next seg)
22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure))))
23 | (if (or (equal? this_seg_word next_seg_word)
24 | (string-equal (item.feat seg "name") silence))
25 | nil
26 | t)))
27 |
28 |
29 | (define (following_punc seg)
30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure)))
31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token))
32 | (if this_seg_token
33 | ( format t "<%s> " (item.name this_seg_token))
34 | ( format t "<> " ) %% else
35 | )
36 | )
37 |
38 |
39 | (define (print_phones_punc utt)
40 | (if (utt.relation.present utt 'Segment)
41 | (begin
42 | ( format t "<_START_> " )
43 | (mapcar
44 | (lambda (x)
45 | ( if (not (equal? (item.name x) "#") ) ;; TODO: unhardcode silent symbols
46 | (format t "%s " (item.name x) )
47 | )
48 | (if (seg_word_final x)
49 | (following_punc x)
50 | )
51 | )
52 | (utt.relation.items utt 'Segment))
53 | ( format t "<_END_>" )
54 | )
55 | (format t "Utterance contains no Segments\n"))
56 | nil)
57 |
58 | ;;; Taken from build_unitsel:
59 | ;; Do the linguistic side of synthesis.
60 | (define (utt.synth_toSegment_text utt)
61 | (Initialize utt)
62 | (Text utt)
63 | (Token_POS utt) ;; when utt.synth is called
64 | (Token utt)
65 | (POS utt)
66 | (Phrasify utt)
67 | (Word utt)
68 | (Pauses utt)
69 | (Intonation utt)
70 | (PostLex utt))
71 |
72 | (define (synth_utts utts_data_file)
73 | (set! uttlist (load utts_data_file t))
74 | (mapcar
75 | (lambda (line)
76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize
77 | (format t "___KEEP___%s||" (car line))
78 | ;(format t "%s|" (car (cdr line)) )
79 | (format t "%s|" (car (cdr line)) )
80 | (print_phones_punc utt)
81 | ; (utt.relation.print utt 'Text)
82 | ( format t "\n" )
83 | t)
84 | uttlist)
85 | )
86 |
87 |
88 | (if (not (member_string 'unilex-rpx (lex.list)))
89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm"))))
90 |
91 | ; (if (not (member_string 'cmudict (lex.list)))
92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm"))))
93 |
94 |
95 | ;(require 'unilex_phones)
96 | ;(lex.select 'unilex-rpx)
97 |
98 |
99 |
100 |
101 | (if (not (member_string 'combilex-rpx (lex.list)))
102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm"))))
103 | (lex.select 'combilex-rpx)
104 | ; (require 'postlex)
105 | ; (set! postlex_rules_hooks (list postlex_apos_s_check
106 | ; postlex_intervoc_r
107 | ; postlex_the_vs_thee
108 | ; postlex_a
109 | ; ))
110 | ; (set! postlex_rules_hooks (list
111 | ; postlex_intervoc_r
112 | ; ))
113 |
114 |
115 |
116 | ; (lex.select 'cmudict)
117 |
118 |
119 |
120 | ;(set! utt1 (Utterance Text "Hello there, world, isn't it a nice day?!"))
121 | ;(utt.synth utt1)
122 | ;(print_phones_punc utt1)
123 |
124 | (synth_utts "./utts.data")
125 |
126 |
127 |
128 |
129 |
130 |
131 |
--------------------------------------------------------------------------------
/script/festival/multi_transcript.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | from argparse import ArgumentParser
4 |
5 |
6 | def main_work():
7 |
8 | a = ArgumentParser()
9 | a.add_argument('-i', dest='infile', required=True)
10 | a.add_argument('-o', dest='outfile', required=True)
11 | opts = a.parse_args()
12 |
13 | o = open(opts.outfile, 'w')
14 |
15 | with open(opts.infile, 'r') as f:
16 | for line in f.readlines()[:-1]:
17 | if line[3] == '_': # if clauses dealing with different length p-numbers (not really applicable for public VCTK)
18 | speaker_id = line[0:3]
19 | elif line[4] == '_':
20 | speaker_id = line[0:4]
21 | elif line[5] == '_':
22 | speaker_id = line[0:5]
23 | else:
24 | print('Something is wrong with the input file - speaker ID cannot be parsed!')
25 | o.write('{}|{}\n'.format(line.rstrip(), speaker_id))
26 |
27 | o.close()
28 |
29 | if __name__ == "__main__":
30 | main_work()
31 |
32 |
--------------------------------------------------------------------------------
/script/get_transcriptions.sh:
--------------------------------------------------------------------------------
1 |
2 | python ./script/festival/csv2scm.py -i $1 -o utts.data
3 |
4 | FEST='/afs/inf.ed.ac.uk/user/s15/s1520337/Documents/festival/festival/bin/festival'
5 |
6 | SCRIPT=./script/festival/make_rich_phones_cmulex.scm
7 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv
8 |
9 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./new_transcript.csv
10 |
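11 | # Illustrative usage (assumptions: run from the repository root, CODEDIR set to that same
12 | # directory, and FEST above edited to point at your own Festival binary):
13 | #   ./script/get_transcriptions.sh /path/to/metadata.csv
14 | # where metadata.csv is in LJ-speech transcription format (see csv2scm.py). The phone-level
15 | # transcript is left in ./new_transcript.csv.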
--------------------------------------------------------------------------------
/script/libutil.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import numpy as np
4 |
5 | #### TODO -- this module is duplicated 1 level up -- sort this out!
6 |
7 |
8 | def load_config(config_fname):
9 | config = {}
10 | execfile(config_fname, config)
11 | del config['__builtins__']
12 | _, config_name = os.path.split(config_fname)
13 | config_name = config_name.replace('.cfg','').replace('.conf','')
14 | config['config_name'] = config_name
15 | return config
16 |
17 |
18 | def safe_makedir(dir):
19 | if not os.path.isdir(dir):
20 | os.makedirs(dir)
21 |
22 | def writelist(seq, fname):
23 | f = open(fname, 'w')
24 | f.write('\n'.join(seq) + '\n')
25 | f.close()
26 |
27 | def readlist(fname):
28 | f = open(fname, 'r')
29 | data = f.readlines()
30 | f.close()
31 | return [line.strip('\n') for line in data]
32 |
33 | def read_norm_data(fname, stream_names):
34 | out = {}
35 | vals = np.loadtxt(fname)
36 | mean_ix = 0
37 | for stream in stream_names:
38 | std_ix = mean_ix + 1
39 | out[stream] = (vals[mean_ix], vals[std_ix])
40 | mean_ix += 2
41 | return out
42 |
43 |
44 | def makedirecs(direcs):
45 | for direc in direcs:
46 | if not os.path.isdir(direc):
47 | os.makedirs(direc)
48 |
49 | def basename(fname):
50 | path, name = os.path.split(fname)
51 | base = re.sub('\.[^\.]+\Z','',name)
52 | return base
53 |
54 | get_basename = basename # alias
55 | def get_speech(infile, dimension):
56 | f = open(infile, 'rb')
57 | speech = np.fromfile(f, dtype=np.float32)
58 | f.close()
59 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension)
60 | speech = speech.reshape((-1, dimension))
61 | return speech
62 |
63 | def put_speech(m_data, filename):
64 | m_data = np.array(m_data, 'float32') # Ensuring float32 output
65 | fid = open(filename, 'wb')
66 | m_data.tofile(fid)
67 | fid.close()
68 | return
--------------------------------------------------------------------------------
/script/make_internal_webchart.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - March 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 |
7 | import sys, os
8 | from string import strip
9 | from argparse import ArgumentParser
10 |
11 | def main_work(voice_dirs, names=[], pattern='', outfile='', title=''):
12 |
13 | #for v in voice_dirs:
14 | # print v
15 | #sys.exit('wvswv')
16 | #voice_dirs = opts.d
17 | voice_dirs = [string for string in voice_dirs if os.path.isdir(string)]
18 |
19 | if names:
20 | #names = opts.n
21 | if not len(names) == len(voice_dirs):
22 | print '------'
23 | for name in names:
24 | print name
25 | for v in voice_dirs:
26 | print v
27 | sys.exit('len(names) != len(voice_dirs)')
28 | else:
29 | names = [direc.strip('/').split('/')[-1] for direc in voice_dirs]
30 | print names
31 |
32 | # for i in range(0,len(inputs),2):
33 | # name = inputs[i]
34 | # voice_dir = inputs[i+1]
35 | # names.append(name)
36 | # voice_dirs.append(voice_dir)
37 |
38 |
39 | #################################################
40 |
41 | ### only keep utts appearing in all conditions
42 | uttnames=[]
43 | all_utts = []
44 | for voice_dir in voice_dirs:
45 | print voice_dir
46 | print os.listdir(voice_dir)
47 | print '-----'
48 | all_utts.extend(os.listdir(voice_dir))
49 |
50 | for unique_utt in set(all_utts):
51 | if unique_utt.endswith('.wav'):
52 | if all_utts.count(unique_utt) == len(names):
53 | uttnames.append(unique_utt)
54 |
55 |
56 | if pattern:
57 | uttnames = [name for name in uttnames if pattern in name]
58 |
59 | # for voice_dir in voice_dirs:
60 | # for uttname in os.listdir(voice_dir):
61 | # if uttname not in uttnames:
62 | # uttnames.append(uttname)
63 |
64 |
65 | if len(uttnames) == 0:
66 | sys.exit('no utterances found in common!')
67 |
68 |
69 | output = ''
70 |
71 |
72 | if title:
73 | output += '<h1>' + title + '</h1>\n'
74 |
75 | ## table top and toprow
76 | output += '<table>\n'
77 | output += '<tbody>\n'
78 | output += "<tr>\n"
79 | output += '<td>Condition</td>\n'
80 | for (name,voice_dir) in zip(names, voice_dirs):
81 | _, voice = os.path.split(voice_dir)
82 | #output += voice
83 |
84 |
85 | output += '<td>%s</td>\n'%(name)
86 | output += '</tr>\n'
87 |
88 | for uttname in sorted(uttnames):
89 |
90 | output += "<tr>\n"
91 |
92 | output += '<td>%s</td>\n'%(uttname.replace(".wav", ""))
93 | for voice_dir in voice_dirs:
94 |
95 | wavename=os.path.join(voice_dir, uttname)
96 | output += '<td>\n'
97 | output += get_audio_control(wavename)
98 |
99 | output += "</td>\n"
100 | output += '</tr>\n'
101 | output += '</tbody></table>\n'
102 |
103 |
104 | if outfile:
105 | f = open(outfile, 'w')
106 | f.write(output)
107 | f.close()
108 | else:
109 | print output
110 |
111 |
112 |
113 |
114 | def get_audio_control(fname):
115 | return '''<audio controls="controls"><source src="%s" type="audio/wav" /></audio>\n'''%(fname)
116 |
117 |
118 |
119 |
120 | if __name__=="__main__":
121 |
122 | #################################################
123 |
124 | # ======== Get stuff from command line ==========
125 |
126 | a = ArgumentParser()
127 | a.add_argument('-o', dest='outfile', default='', type=str, \
128 | help= "If not given, print to console")
129 | a.add_argument('-d', nargs='+', required=True, help='list of directories with samples')
130 | a.add_argument('-n', nargs='+', required=False, help='list of names -- use directory names if not given')
131 | a.add_argument('-p', dest='pattern', default='', type=str)
132 | a.add_argument('-title', default='', type=str)
133 |
134 | opts = a.parse_args()
135 |
136 |
137 | # ===============================================
138 |
139 | main_work(opts.d, opts.n, opts.pattern, opts.outfile, opts.title)
140 |
141 |
142 |
143 |
144 |
145 |
146 |
147 |
--------------------------------------------------------------------------------
/script/munge_nsf_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - March 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | import os
10 | from argparse import ArgumentParser
11 |
12 | import soundfile as sf
13 |
14 | from concurrent.futures import ProcessPoolExecutor
15 | from functools import partial
16 | from tqdm import tqdm
17 | import subprocess
18 |
19 | import numpy as np
20 | #from generate import read_est_file
21 |
22 |
23 | ## TODO : copied from generate
24 | def read_est_file(est_file):
25 |
26 | with open(est_file) as fid:
27 | header_size = 1 # init
28 | for line in fid:
29 | if line == 'EST_Header_End\n':
30 | break
31 | header_size += 1
32 | ## now check there is at least 1 line beyond the header:
33 | status_ok = False
34 | for (i,line) in enumerate(fid):
35 | if i > header_size:
36 | status_ok = True
37 | if not status_ok:
38 | return np.array([])
39 |
40 | # Read text: TODO: improve skiprows
41 | data = np.loadtxt(est_file, skiprows=header_size)
42 | data = np.atleast_2d(data)
43 | return data
44 |
45 |
46 | def main_work():
47 |
48 | #################################################
49 |
50 | # ======== Get stuff from command line ==========
51 |
52 | a = ArgumentParser()
53 | a.add_argument('-i', dest='indir', required=True) ## mels
54 | a.add_argument('-of', dest='outdir_f', required=True, \
55 | help= "Put output f0 here: make it if it doesn't exist")
56 | a.add_argument('-om', dest='outdir_m', required=True, \
57 | help= "Put output mels here: make it if it doesn't exist")
58 | a.add_argument('-f', dest='fzdir', required=True)
59 | # a.add_argument('-framerate', required=False, default=0.005, type=float, help='rate in seconds for F0 track frames')
60 | # a.add_argument('-pattern', default='', \
61 | # help= "If given, only normalise files whose base contains this substring")
62 | a.add_argument('-ncores', default=1, type=int)
63 | #a.add_argument('-waveformat', default=False, action='store_true', help='call sox to format data (16 bit ).')
64 |
65 | # a.add_argument('-twopass', default=False, action='store_true', help='Run initially on a subset of data to guess sensible limits, then run again. Assumes all data is from same speaker.')
66 | opts = a.parse_args()
67 |
68 | # ===============================================
69 |
70 | for direc in [opts.outdir_f, opts.outdir_m]:
71 | if not os.path.isdir(direc):
72 | os.makedirs(direc)
73 |
74 | flist = sorted(glob.glob(opts.indir + '/*.npy'))
75 |
76 | print flist
77 | # print 'Extract with range %s %s'%(min_f0, max_f0)
78 | executor = ProcessPoolExecutor(max_workers=opts.ncores)
79 | futures = []
80 | for mel_file in flist:
81 | futures.append(executor.submit(
82 | partial(process, mel_file, opts.fzdir, opts.outdir_f, opts.outdir_m)))
83 | return [future.result() for future in tqdm(futures)]
84 |
85 |
86 | def put_speech(m_data, filename):
87 | m_data = np.array(m_data, 'float32') # Ensuring float32 output
88 | fid = open(filename, 'wb')
89 | m_data.tofile(fid)
90 | fid.close()
91 | return
92 |
93 |
94 |
95 | def process(mel_file, fzdir, outdir_f, outdir_m):
96 | _, base = os.path.split(mel_file)
97 | base = base.replace('.npy', '')
98 |
99 | mels = np.load(mel_file)
100 | fz_file = os.path.join(fzdir, base + '.f0')
101 |
102 | fz = read_est_file(fz_file)[:,2] # .reshape(-1,1)
103 |
104 | m,_ = mels.shape
105 | f = fz.shape[0]
106 |
107 |
108 | fz[fz<0.0] = 0.0
109 |
110 | if m > f:
111 | diff = m - f
112 | fz = np.pad(fz, (0,diff), 'constant').reshape(-1,1)
113 |
114 | put_speech(fz, os.path.join(outdir_f, base+'.f0'))
115 | put_speech(mels, os.path.join(outdir_m, base+'.mfbsp'))
116 |
117 | # print fz.shape
118 | # print mels.shape
119 | # print fz
120 |
121 |
122 |
123 |
124 |
125 | # out_file = os.path.join(outdir, base + '.pm')
126 |
127 | # in_wav_file = os.path.join(fzdir, base + '_tmp.wav') ### !!!!!
128 | # cmd = 'sox %s -r 16000 %s '%(wavefile, in_wav_file)
129 | # subprocess.call(cmd, shell=True)
130 |
131 | # out_fz_file = os.path.join(fzdir, base + '.f0')
132 | # cmd = _reaper_bin + " -s -e %s -x %s -m %s -a -u 0.005 -i %s -p %s -f %s >/dev/null" % (framerate, max_f0, min_f0, in_wav_file, out_est_file, out_fz_file)
133 | # subprocess.call(cmd, shell=True)
134 |
135 |
136 |
137 |
138 |
139 | if __name__=="__main__":
140 |
141 | main_work()
142 |
143 |
--------------------------------------------------------------------------------
/script/normalise_level.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - March 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | import os
10 | from argparse import ArgumentParser
11 |
12 | import soundfile as sf
13 |
14 | from concurrent.futures import ProcessPoolExecutor
15 | from functools import partial
16 | from tqdm import tqdm
17 |
18 |
19 | HERE = os.path.realpath(os.path.abspath(os.path.dirname(__file__)))
20 | sv56 = HERE + '/../tool/bin/sv56demo'
21 |
22 | if not os.path.isfile(sv56):
23 |
24 | ## Check required executables are available:
25 |
26 | from distutils.spawn import find_executable
27 |
28 | required_executables = ['sv56demo']
29 |
30 | for executable in required_executables:
31 | if not find_executable(executable):
32 | sys.exit('%s command line tool must be on system path '%(executable))
33 |
34 | sv56 = 'sv56demo'
35 |
36 |
37 | def main_work():
38 |
39 | #################################################
40 |
41 | # ======== Get stuff from command line ==========
42 |
43 | a = ArgumentParser()
44 | a.add_argument('-i', dest='indir', required=True)
45 | a.add_argument('-o', dest='outdir', required=True, \
46 | help= "Put output here: make it if it doesn't exist")
47 | a.add_argument('-pattern', default='', \
48 | help= "If given, only normalise files whose base contains this substring")
49 | a.add_argument('-ncores', default=1, type=int)
50 | opts = a.parse_args()
51 |
52 | # ===============================================
53 |
54 | for direc in [opts.outdir]:
55 | if not os.path.isdir(direc):
56 | os.makedirs(direc)
57 |
58 | flist = sorted(glob.glob(opts.indir + '/*.wav'))
59 |
60 | executor = ProcessPoolExecutor(max_workers=opts.ncores)
61 | futures = []
62 | for wave_file in flist:
63 | futures.append(executor.submit(
64 | partial(process, wave_file, opts.outdir, pattern=opts.pattern)))
65 | return [future.result() for future in tqdm(futures)]
66 |
67 |
68 |
69 |
70 |
71 |
72 | def process(wavefile, outdir, pattern=''):
73 | _, base = os.path.split(wavefile)
74 |
75 | if pattern:
76 | if pattern not in base:
77 | return
78 |
79 | # print base
80 |
81 | raw_in = os.path.join(outdir, base.replace('.wav','.raw'))
82 | raw_out = os.path.join(outdir, base.replace('.wav','_norm.raw'))
83 | logfile = os.path.join(outdir, base.replace('.wav','.log'))
84 | wav_out = os.path.join(outdir, base)
85 |
86 | data, samplerate = sf.read(wavefile, dtype='int16')
87 | sf.write(raw_in, data, samplerate, subtype='PCM_16')
88 | os.system('%s -log %s -q -lev -26.0 -sf %s %s %s'%(sv56, logfile, samplerate, raw_in, raw_out))
89 | norm_data, samplerate = sf.read(raw_out, dtype='int16', samplerate=samplerate, channels=1, subtype='PCM_16')
90 | sf.write(wav_out, norm_data, samplerate)
91 |
92 | os.system('rm %s %s'%(raw_in, raw_out))
93 |
94 |
95 |
96 |
97 | if __name__=="__main__":
98 |
99 | main_work()
100 |
101 |
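102 | ## Illustrative usage (directory names are made up): normalise every .wav in one directory
103 | ## to the sv56 reference level of -26 dB and write the results to another:
104 | ##   python script/normalise_level.py -i wav_in/ -o wav_norm/ -ncores 4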
--------------------------------------------------------------------------------
/script/process_merlin_label.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - February 2019
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 |
7 | import sys
8 | import os
9 | import glob
10 | from argparse import ArgumentParser
11 |
12 | from libutil import get_speech, basename, safe_makedir
13 | from scipy.signal import argrelextrema
14 | import numpy as np
15 | import matplotlib as mpl
16 | mpl.use('PDF')
17 | import pylab as pl
18 |
19 | def merlin_state_label_to_phone(labfile):
20 | labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks
21 | starts = labels[:,0].astype(int)[::5].reshape(-1,1)
22 | ends = labels[:,1].astype(int)[4::5].reshape(-1,1)
23 | fc = labels[:,2][::5]
24 | fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1)
25 | phone_label = np.hstack([starts, ends, fc])
26 | return phone_label
27 |
28 |
29 | def minmax_norm(X, data_min, data_max):
30 | data_range = data_max - data_min
31 | data_range[data_range<=0.0] = 1.0
32 | mini, maxi = 0.01, 0.99 ## merlin's default desired range
33 | X_std = (X - data_min) / data_range
34 | X_scaled = X_std * (maxi - mini) + mini
35 | return X_scaled
36 |
37 |
38 | def process_merlin_label(bin_label_fname, text_lab_dir, phonedim=416, subphonedim=9):
39 |
40 | text_label = os.path.join(text_lab_dir, basename(bin_label_fname) + '.lab')
41 | assert os.path.isfile(text_label), 'No text file for %s '%(basename(bin_label_fname))
42 |
43 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim)
44 |
45 | ## fraction through phone (forwards)
46 | fraction_through_phone_forwards = labfrombin[:,-1]
47 |
48 | ## This is a surprisingly noisy signal which never seems to start at 0.0! Find minima:-
49 | (minima, ) = argrelextrema(fraction_through_phone_forwards, np.less)
50 |
51 | ## first frame is always a start:
52 | minima = np.insert(minima, 0, 0)
53 |
54 | ## check size against text file:
55 | labfromtext = merlin_state_label_to_phone(text_label)
56 | assert labfromtext.shape[0] == minima.shape[0]
57 |
58 | lab = labfrombin[minima,:-subphonedim] ## discard frame level feats, and take first frame of each phone
59 |
60 | return lab
61 |
62 |
63 |
64 | def main_work():
65 |
66 | #################################################
67 |
68 | # ============= Process command line ============
69 |
70 | a = ArgumentParser()
71 |
72 | a.add_argument('-b', dest='binlabdir', required=True)
73 | a.add_argument('-t', dest='text_lab_dir', required=True)
74 | a.add_argument('-n', dest='norm_info_fname', required=True)
75 | a.add_argument('-o', dest='outdir', required=True)
76 | a.add_argument('-binext', dest='binext', required=False, default='lab')
77 | a.add_argument('-skipterminals', action='store_true', default=False)
78 |
79 |
80 | opts = a.parse_args()
81 |
82 | # ===============================================
83 |
84 | safe_makedir(opts.outdir)
85 |
86 | norm_info = get_speech(opts.norm_info_fname, 425)[:,:-9]
87 | data_min = norm_info[0,:]
88 | data_max = norm_info[1,:]
89 | data_range = data_max - data_min
90 |
91 | text_label_files = set([basename(f) for f in glob.glob(opts.text_lab_dir + '/*.lab')])
92 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.' + opts.binext) )
93 | print binary_label_files
94 | for binlab in binary_label_files:
95 | base = basename(binlab)
96 | if base not in text_label_files:
97 | continue
98 | print base
99 | lab = process_merlin_label(binlab, opts.text_lab_dir)
100 | if opts.skipterminals:
101 | lab = lab[1:-1,:] ## NB: don't remove the last 2 as in durations, as the final punct doesn't feature here
102 | norm_lab = minmax_norm(lab, data_min, data_max)
103 |
104 | if 0: ## piano roll style plot:
105 | pl.imshow(norm_lab, interpolation='nearest')
106 | pl.gray()
107 | pl.savefig('/afs/inf.ed.ac.uk/user/o/owatts/temp/fig.pdf')
108 | sys.exit('abckdubv')
109 |
110 | np.save(opts.outdir + '/' + base, norm_lab)
111 |
112 |
113 | if __name__=="__main__":
114 | main_work()
115 |
116 |
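117 | ## Illustrative usage (paths are made up). -b takes Merlin's frame-level binary label files,
118 | ## -t the corresponding state-level .lab files, and -n Merlin's min/max norm_info file
119 | ## (425 columns: 416 phone-level + 9 frame-level features, matching the defaults above):
120 | ##   python script/process_merlin_label.py -b lab_bin/ -t lab_state/ -n norm_info.dat -o merlin_label/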
--------------------------------------------------------------------------------
/script/process_merlin_label_positions.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - February 2019
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 |
7 | import sys
8 | import os
9 | import glob
10 | from argparse import ArgumentParser
11 |
12 | from libutil import get_speech, basename, safe_makedir
13 | from scipy.signal import argrelextrema
14 | import numpy as np
15 | import matplotlib as mpl
16 | mpl.use('PDF')
17 | import pylab as pl
18 |
19 | from scipy import interpolate
20 |
21 |
22 | # def merlin_state_label_to_phone(labfile):
23 | # labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks
24 | # starts = labels[:,0].astype(int)[::5].reshape(-1,1)
25 | # ends = labels[:,1].astype(int)[4::5].reshape(-1,1)
26 | # fc = labels[:,2][::5]
27 | # fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1)
28 | # phone_label = np.hstack([starts, ends, fc])
29 | # return phone_label
30 |
31 |
32 | def minmax_norm(X, data_min, data_max):
33 | data_range = data_max - data_min
34 | data_range[data_range<=0.0] = 1.0
35 | mini, maxi = 0.01, 0.99 ## merlin's default desired range
36 | X_std = (X - data_min) / data_range
37 | X_scaled = X_std * (maxi - mini) + mini
38 | return X_scaled
39 |
40 |
41 | def process_merlin_positions(bin_label_fname, audio_dir, phonedim=416, subphonedim=9, \
42 | inrate=5.0, outrate=12.5):
43 |
44 | audio_fname = os.path.join(audio_dir, basename(bin_label_fname) + '.npy')
45 | assert os.path.isfile(audio_fname), 'No audio file for %s '%(basename(bin_label_fname))
46 | audio = np.load(audio_fname)
47 |
48 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim)
49 |
50 | positions = labfrombin[:,-subphonedim:]
51 |
52 | nframes, dim = positions.shape
53 | assert dim==9
54 |
55 | new_nframes, _ = audio.shape
56 |
57 | old_x = np.linspace((inrate/2.0), nframes*inrate, nframes, endpoint=False) ## place points at frame centres
58 |
59 | f = interpolate.interp1d(old_x, positions, axis=0, kind='nearest', bounds_error=False, fill_value='extrapolate') ## nearest to avoid weird averaging effects near segment boundaries
60 |
61 | new_x = np.linspace((outrate/2.0), new_nframes*outrate, new_nframes, endpoint=False)
62 | new_positions = f(new_x)
63 |
64 | return new_positions
65 |
66 |
67 |
68 | def main_work():
69 |
70 | #################################################
71 |
72 | # ============= Process command line ============
73 |
74 | a = ArgumentParser()
75 |
76 | a.add_argument('-b', dest='binlabdir', required=True)
77 | a.add_argument('-f', dest='audio_dir', required=True)
78 | a.add_argument('-n', dest='norm_info_fname', required=True)
79 | a.add_argument('-o', dest='outdir', required=True)
80 | a.add_argument('-binext', dest='binext', required=False, default='lab')
81 |
82 | a.add_argument('-ir', dest='inrate', type=float, default=5.0)
83 | a.add_argument('-or', dest='outrate', type=float, default=12.5)
84 |
85 | opts = a.parse_args()
86 |
87 | # ===============================================
88 |
89 | safe_makedir(opts.outdir)
90 |
91 | norm_info = get_speech(opts.norm_info_fname, 425)[:,-9:]
92 | data_min = norm_info[0,:]
93 | data_max = norm_info[1,:]
94 | data_range = data_max - data_min
95 |
96 | audio_files = set([basename(f) for f in glob.glob(opts.audio_dir + '/*.npy')])
97 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.' + opts.binext) )
98 |
99 | for binlab in binary_label_files:
100 | base = basename(binlab)
101 | if base not in audio_files:
102 | continue
103 | print base
104 | positions = process_merlin_positions(binlab, opts.audio_dir, inrate=opts.inrate, outrate=opts.outrate)
105 | norm_positions = minmax_norm(positions, data_min, data_max)
106 |
107 | np.save(opts.outdir + '/' + base, norm_positions)
108 |
109 |
110 | if __name__=="__main__":
111 | main_work()
112 |
113 |
--------------------------------------------------------------------------------
/script/remove_end_silences.py:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | import fileinput, sys
5 |
6 | infile = sys.argv[1]
7 |
8 | i = 0
9 | for line in fileinput.input(infile):
10 | name, t1, t2, phones, speaker, durs = line.strip(' \n').split('|') # [-1]
11 | durs = durs.split(' ')
12 | assert durs[0] == '24'
13 | assert durs[-2] == '24'
14 | assert durs[-1] == '0'
15 | durs[0] = '0'
16 | durs[-2] = '0'
17 | durs[-1] = '0'
18 | # if i==5:
19 | # break
20 | durs = ' '.join(durs)
21 | line2 = '|'.join([name, t1, t2, phones, speaker, durs])
22 | print line2
23 | i += 1
24 | # print i
25 |
--------------------------------------------------------------------------------
/script/split_speech.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - June 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | import os
10 | import fileinput
11 | from argparse import ArgumentParser
12 |
13 |
14 |
15 | import librosa
16 | from concurrent.futures import ProcessPoolExecutor
17 | from functools import partial
18 | import numpy as np
19 | import soundfile
20 | from libutil import safe_makedir, get_basename
21 | from tqdm import tqdm
22 |
23 | # librosa.effects.split(y, top_db=60, ref=np.max, frame_length=2048, hop_length=512)
24 | # returns intervals, where:
25 | # intervals[i] == (start_i, end_i) are the start and end time (in samples) of non-silent interval i.
26 |
27 |
28 |
29 |
30 | def main_work():
31 |
32 | #################################################
33 |
34 | # ======== Get stuff from command line ==========
35 |
36 | a = ArgumentParser()
37 |
38 | a.add_argument('-w', dest='wave_dir', required=True)
39 | a.add_argument('-o', dest='output_dir', required=True)
40 | a.add_argument('-N', dest='nfiles', type=int, default=0)
41 | a.add_argument('-l', dest='wavlist_file', default=None)
42 | a.add_argument('-dB', dest='top_db', default=30, type=int)
43 | a.add_argument('-trimonly', action='store_true', default=False)
44 | a.add_argument('-ncores', type=int, default=0)
45 | a.add_argument('-endpad', type=float, default=0.3)
46 | opts = a.parse_args()
47 |
48 | # ===============================================
49 |
50 | trim_waves_in_directory(opts.wave_dir, opts.output_dir, num_workers=opts.ncores, \
51 | tqdm=tqdm, nfiles=opts.nfiles, top_db=opts.top_db, trimonly=opts.trimonly, endpad=opts.endpad)
52 |
53 | def trim_waves_in_directory(in_dir, out_dir, num_workers=1, tqdm=lambda x: x, \
54 | nfiles=0, top_db=30, trimonly=False, endpad=0.3):
55 | safe_makedir(out_dir)
56 | wave_files = sorted(glob.glob(in_dir + '/*.wav'))
57 | if nfiles > 0:
58 | wave_files = wave_files[:min(nfiles, len(wave_files))]
59 |
60 | if num_workers:
61 | executor = ProcessPoolExecutor(max_workers=num_workers)
62 | futures = []
63 | for (index, wave_file) in enumerate(wave_files):
64 | futures.append(executor.submit(
65 | partial(_process_utterance, wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad)))
66 | return [future.result() for future in tqdm(futures)]
67 | else: ## serial processing
68 | for wave_file in tqdm(wave_files):
69 | _process_utterance(wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad)
70 |
71 |
72 |
73 | def _process_utterance(wav_path, out_dir, top_db=30, end_pad_sec=0.3, pad_sec=0.01, minimum_duration_sec=0.5, trimonly=False):
74 |
75 | wav, fs = soundfile.read(wav_path) ## TODO: assert mono
76 |
77 | if len(wav) > 0:
78 |
79 | pad = int(pad_sec * fs)
80 | end_pad = int(end_pad_sec * fs)
81 | # print pad
82 | base = get_basename(wav_path)
83 | # print base
84 | _, (start, end) = librosa.effects.trim(wav, top_db=top_db)
85 | start = max(0, (start - end_pad))
86 | end = min(len(wav), (end + end_pad))
87 |
88 | if start < end:
89 | wav = wav[start:end]
90 |
91 | if trimonly:
92 | ofile = os.path.join(out_dir, base + '.wav')
93 | soundfile.write(ofile, wav, fs)
94 | else:
95 | starts_ends = librosa.effects.split(wav, top_db=top_db)
96 | starts_ends[:,0] -= pad
97 | starts_ends[:,1] += pad
98 | starts_ends = np.clip(starts_ends, 0, wav.size)
99 | lengths = starts_ends[:,1] - starts_ends[:,0]
100 | starts_ends = starts_ends[lengths > fs * minimum_duration_sec]
101 |
102 |
103 | for (i, (s,e)) in enumerate(starts_ends):
104 |
105 | ofile = os.path.join(out_dir, base + '_seg%s.wav'%(str(i+1).zfill(4)))
106 | # print ofile
107 | soundfile.write(ofile, wav[s:e], fs)
108 |
109 | else:
110 |
111 | print "File discarded: " + wav_path
112 |
113 | def test():
114 | safe_makedir('/tmp/splitwaves/')
115 | _process_utterance('/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/data/nick/wav/herald_030.wav', '/tmp/splitwaves/')
116 |
117 | if __name__=="__main__":
118 |
119 | main_work()
120 |
121 |
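122 | ## Illustrative usage (directory names are made up):
123 | ##   python script/split_speech.py -w wav_in/ -o wav_split/ -dB 30 -ncores 4
124 | ## or, to trim leading/trailing silence only without splitting into segments:
125 | ##   python script/split_speech.py -w wav_in/ -o wav_trim/ -trimonly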
--------------------------------------------------------------------------------
/synthesise_validation_waveforms.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | ## Project: SCRIPT - February 2018
4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk
5 |
6 | import sys
7 | import os
8 | import glob
9 | from argparse import ArgumentParser
10 |
11 | import imp
12 |
13 | import numpy as np
14 |
15 | from utils import spectrogram2wav
16 | # from scipy.io.wavfile import write
17 | import soundfile
18 |
19 | import tqdm
20 | from concurrent.futures import ProcessPoolExecutor
21 |
22 | import tensorflow as tf
23 | from architectures import SSRNGraph
24 | from synthesize import make_mel_batch, split_batch, synth_mel2mag
25 | from configuration import load_config
26 |
27 |
28 | def synth_wave(hp, magfile):
29 | mag = np.load(magfile)
30 | #print ('mag shape %s'%(str(mag.shape)))
31 | wav = spectrogram2wav(hp, mag)
32 | outfile = magfile.replace('.mag.npy', '.wav')
33 | outfile = outfile.replace('.npy', '.wav')
34 | #print magfile
35 | #print outfile
36 | #print
37 | # write(outfile, hp.sr, wav)
38 | soundfile.write(outfile, wav, hp.sr)
39 |
40 | def main_work():
41 |
42 | #################################################
43 |
44 | # ======== Get stuff from command line ==========
45 |
46 | a = ArgumentParser()
47 | a.add_argument('-c', dest='config', required=True, type=str)
48 | a.add_argument('-ncores', type=int, default=1)
49 | opts = a.parse_args()
50 |
51 | # ===============================================
52 |
53 | hp = load_config(opts.config)
54 |
55 | ### 1) convert saved coarse mels to mags with latest-trained SSRN
56 | print('mel2mag: restore last saved SSRN')
57 | g = SSRNGraph(hp, mode="synthesize")
58 | with tf.Session() as sess:
59 | sess.run(tf.global_variables_initializer())
60 |
61 | ## TODO: use restore_latest_model_parameters from synthesize?
62 | var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'SSRN')
63 | saver2 = tf.train.Saver(var_list=var_list)
64 | savepath = hp.logdir + "-ssrn"
65 | latest_checkpoint = tf.train.latest_checkpoint(savepath)
66 | if latest_checkpoint is None: sys.exit('No SSRN at %s?'%(savepath))
67 | ssrn_epoch = latest_checkpoint.strip('/ ').split('/')[-1].replace('model_epoch_', '')
68 | saver2.restore(sess, latest_checkpoint)
69 | print("SSRN Restored from latest epoch %s"%(ssrn_epoch))
70 |
71 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.npy')
72 | filelist = [fname for fname in filelist if not fname.endswith('.mag.npy')]
73 | batch, lengths = make_mel_batch(hp, filelist, oracle=False)
74 | Z = synth_mel2mag(hp, batch, g, sess, batchsize=32)
75 | print ('synthesised mags, now splitting batch:')
76 | maglist = split_batch(Z, lengths)
77 | for (infname, outdata) in tqdm.tqdm(zip(filelist, maglist)):
78 | np.save(infname.replace('.npy','.mag.npy'), outdata)
79 |
80 |
81 |
82 | ### 2) GL in parallel for both t2m and ssrn validation set
83 | print('GL for SSRN validation')
84 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.mag.npy') + \
85 | glob.glob(hp.logdir + '-ssrn/validation_epoch_*/*.npy')
86 |
87 | if opts.ncores==1:
88 | for fname in tqdm.tqdm(filelist):
89 | synth_wave(hp, fname)
90 | else:
91 | executor = ProcessPoolExecutor(max_workers=opts.ncores)
92 | futures = []
93 | for fpath in filelist:
94 | futures.append(executor.submit(synth_wave, hp, fpath))
95 | proc_list = [future.result() for future in tqdm.tqdm(futures)]
96 |
97 |
98 |
99 | if __name__=="__main__":
100 |
101 | main_work()
102 |
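103 | ## Illustrative usage (config name assumed). Expects coarse mels already written under
104 | ## <logdir>-t2m/validation_epoch_*/ and a trained SSRN checkpoint under <logdir>-ssrn/:
105 | ##   python synthesise_validation_waveforms.py -c ./config/<yourconfig>.cfg -ncores 4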
--------------------------------------------------------------------------------
/test.tmp:
--------------------------------------------------------------------------------
1 | testing
2 |
--------------------------------------------------------------------------------
/tool/bin/sv56demo:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/tool/bin/sv56demo
--------------------------------------------------------------------------------
/util/find_free_gpu.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | '''Print the index of the first idle GPU reported by `nvidia-smi pmon`, or -1 if none is free.'''
3 | 
4 | import re
5 | import subprocess
6 | 
7 | sp = subprocess.Popen(['nvidia-smi', 'pmon', '-c', '1'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
8 | out_str = sp.communicate()[0].decode('utf-8')  ## stdout of a single pmon sampling cycle
9 | 
10 | lines = out_str.split('\n')
11 | lines = [line for line in lines if not line.startswith('#')]   ## drop header/comment lines
12 | lines = [re.split(r'\s+', line.strip(' ')) for line in lines]  ## split each row on whitespace
13 | lines = [line for line in lines if line != ['']]               ## drop empty lines
14 | 
15 | free_gpu = -1
16 | for line in lines:
17 |     gpu_id = line[0]
18 |     process_info = ''.join(line[1:])
19 |     ## an idle GPU shows only '-' placeholders in its process columns
20 |     if process_info.replace('-', '') == '':
21 |         free_gpu = gpu_id
22 |         break
23 | 
24 | print(free_gpu)  ## works under python2 and python3; util/submit_tf_2.sh calls this with python2
25 | 
--------------------------------------------------------------------------------
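
`util/submit_tf_2.sh` (below) uses the output of this script to set `CUDA_VISIBLE_DEVICES` before launching a job. The same pattern can be used from inside a Python program; the snippet that follows is a hypothetical sketch (the relative path and the fall-back-to-CPU behaviour are assumptions), and the environment variable must be set before TensorFlow is imported for the device mask to take effect.

```python
## hypothetical sketch: pick a free GPU the same way util/submit_tf_2.sh does,
## but from inside a Python program, before importing tensorflow
import os
import subprocess

free_gpu = subprocess.check_output(
    ['python2', 'util/find_free_gpu.py']).decode('utf-8').strip()

if free_gpu != '-1':
    os.environ['CUDA_VISIBLE_DEVICES'] = free_gpu   # run on the idle GPU
else:
    os.environ['CUDA_VISIBLE_DEVICES'] = ''         # assumption: fall back to CPU

import tensorflow as tf  # imported only after the device mask is set
```
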
/util/submit_tf.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | PYTHON=python
3 |
4 |
5 | ## Generic script for submitting any tensorflow job to GPU
6 | # usage: submit_tf.sh [scriptname.py script_arguments ... ]
7 |
8 | ## Location of this script: assume gpu_lock.py is in the same place
9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P )
10 |
11 |
12 | gpu_id=$(python2 $SCRIPTPATH/gpu_lock.py --id-to-hog)
13 |
14 | ### paths for tensorflow (works on hynek)
15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6
16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64
17 | export CUDA_HOME=/opt/cuda-8.0.44/
18 |
19 | export KERAS_BACKEND=tensorflow
20 | export CUDA_VISIBLE_DEVICES=$gpu_id
21 |
22 | if [ $gpu_id -gt -1 ]; then
23 |
24 | $PYTHON "$@"  ## quote "$@" so arguments containing spaces are passed through intact
25 |
26 | python2 $SCRIPTPATH/gpu_lock.py --free $gpu_id
27 | else
28 | echo 'Let us wait! No GPU is available!'
29 |
30 | fi
31 |
--------------------------------------------------------------------------------
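
One fragility of the script above: the GPU lock is only released when control reaches the `gpu_lock.py --free` line, so if the wrapper itself is interrupted the GPU stays hogged. Below is a hedged Python sketch of the same hog / run / free pattern with the release wrapped in `try/finally`; it assumes only the `--id-to-hog` and `--free` flags already used above, and that a negative ID means no GPU was available.

```python
## sketch of the hog / run / free pattern from util/submit_tf.sh, with the
## GPU release guaranteed by try/finally even if the job raises or is interrupted
import os
import subprocess
import sys

gpu_id = subprocess.check_output(
    ['python2', 'util/gpu_lock.py', '--id-to-hog']).decode('utf-8').strip()

if int(gpu_id) < 0:
    sys.exit('Let us wait! No GPU is available!')

env = dict(os.environ, KERAS_BACKEND='tensorflow', CUDA_VISIBLE_DEVICES=gpu_id)
try:
    # sys.argv[1:] plays the role of "$@" in the shell script
    subprocess.check_call([sys.executable] + sys.argv[1:], env=env)
finally:
    subprocess.call(['python2', 'util/gpu_lock.py', '--free', gpu_id])
```
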
/util/submit_tf_2.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | PYTHON=python
3 |
4 |
5 | ## Generic script for submitting any tensorflow job to GPU
6 | # usage: submit_tf_2.sh [scriptname.py script_arguments ... ]
7 |
8 | ## Location of this script: assume find_free_gpu.py is in the same place
9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P )
10 |
11 |
12 | gpu_id=$(python2 $SCRIPTPATH/find_free_gpu.py)
13 |
14 | ### paths for tensorflow (works on hynek)
15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6
16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64
17 | export CUDA_HOME=/opt/cuda-8.0.44/
18 |
19 | export KERAS_BACKEND=tensorflow
20 | export CUDA_VISIBLE_DEVICES=$gpu_id
21 |
22 | if [ $gpu_id -gt -1 ]; then
23 |
24 | $PYTHON "$@"  ## quote "$@" so arguments containing spaces are passed through intact
25 |
26 | else
27 | echo 'Let us wait! No GPU is available!'
28 |
29 | fi
30 |
--------------------------------------------------------------------------------
/util/submit_tf_cpu.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | PYTHON=python
3 |
4 |
5 | ## Generic script for running any tensorflow job on CPU only (no GPU is requested or locked)
6 | # usage: submit_tf_cpu.sh [scriptname.py script_arguments ... ]
7 |
8 |
9 | ### paths for tensorflow (works on hynek)
10 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6
11 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64
12 | export CUDA_HOME=/opt/cuda-8.0.44/
13 |
14 | export KERAS_BACKEND=tensorflow
15 | export CUDA_VISIBLE_DEVICES=""
16 |
17 | $PYTHON "$@"  ## quote "$@" so arguments containing spaces are passed through intact
18 |
19 |
--------------------------------------------------------------------------------
/util/sync_code_from_afs.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | SERVERDIR=/path/on/GPU/machine/to/dc_tts_osw/ ## change this to e.g. /disk/scratch/.../dc_tts_osw
4 | SOURCE=/path/on/AFS/to/dc_tts_osw/ ## e.g. in your AFS home directory
5 |
6 | rsync -avzh --progress $SOURCE/ $SERVERDIR/
7 | # --exclude='.git/'
8 |
--------------------------------------------------------------------------------
/util/sync_output_to_afs.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | DEST=~/synthetic_speech ## e.g. a directory on AFS from which it is easy to listen to the audio
4 |
5 | PATTERN=$1 ## substring of the config to sync
6 |
7 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/validation_epoch_*/*.wav $DEST
8 | rsync -av --relative ./work/*${PATTERN}*/train-ssrn/validation_epoch_*/*.wav $DEST
9 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/alignment* $DEST
10 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.wav $DEST
11 | rsync -av --relative ./work/*${PATTERN}*/synth*/*/*.wav $DEST
12 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.png $DEST
13 |
14 |
15 | echo "Synced to: $DEST"
--------------------------------------------------------------------------------