├── .gitignore ├── LICENSE ├── README.md ├── README_ORIG.md ├── README_notes.md ├── add_duration_to_transcript.py ├── architectures.py ├── calculate_CDP_Ain_Aout.py ├── config ├── lj_01.cfg ├── lj_02.cfg ├── lj_03.cfg ├── lj_04.cfg ├── lj_05.cfg ├── lj_test.cfg ├── lj_tutorial.cfg ├── nancy2nick_01.cfg ├── nancy2nick_02.cfg ├── nancy_01.cfg ├── nancyplusnick_01.cfg ├── nancyplusnick_02.cfg ├── nancyplusnick_03.cfg ├── nancyplusnick_04_lcc.cfg ├── nancyplusnick_05_lcc_sdpe.cfg ├── project │ ├── baseline.cfg │ ├── c-labels_plus_te.cfg │ ├── fa_as_attention.cfg │ ├── fa_as_guide.cfg │ ├── fa_as_target.cfg │ ├── labels_minus_te.cfg │ └── labels_plus_te.cfg ├── ssw10 │ ├── G1ABC_01.cfg │ ├── G1ABC_02.cfg │ ├── G1ABC_03.cfg │ ├── G1AB_03.cfg │ ├── G1BC_03.cfg │ ├── G1B_03.cfg │ ├── G1_03.cfg │ ├── G2_03.cfg │ ├── README.md │ ├── W2A_03.cfg │ ├── W2B_03.cfg │ └── W2_03.cfg ├── vctk_01.cfg ├── vctk_02.cfg └── vctk_03_lcc.cfg ├── configuration.py ├── convert_alignment_to_guide.py ├── copy_synth_GL.py ├── copy_synth_SSRN_GL.py ├── data_load.py ├── doc ├── festival_install.md ├── old_readme.md ├── preparing_new_database.md ├── recipe_WaveRNN.md ├── recipe_nancy.md ├── recipe_nancy2nick.md ├── recipe_nancyplusnick.md ├── recipe_project.md ├── recipe_semisupervised.md └── recipe_vctk.md ├── fig ├── aaa ├── attention.gif └── training_curves.png ├── labels2tra.sh ├── libutil.py ├── logger_setup.py ├── modules.py ├── networks.py ├── objective_measures.py ├── plot_loss.py ├── prepare_acoustic_features.py ├── prepare_attention_guides.py ├── requirements.txt ├── script ├── add_durations_to_transcript.py ├── add_speaker.py ├── check_transcript.py ├── festival │ ├── csv2scm.py │ ├── fix_transcript.py │ ├── make_rich_phones.scm │ ├── make_rich_phones_cmulex.scm │ ├── make_rich_phones_combirpx_noplex.scm │ └── multi_transcript.py ├── get_transcriptions.sh ├── interpolate_unvoiced.py ├── libutil.py ├── make_internal_webchart.py ├── munge_nsf_data.py ├── normalise_level.py ├── pitchmark_speech.py ├── prepare_world_features.py ├── process_merlin_label.py ├── process_merlin_label_positions.py ├── remove_end_silences.py └── split_speech.py ├── synthesise_validation_waveforms.py ├── synthesize.py ├── test.tmp ├── tool └── bin │ └── sv56demo ├── train.py ├── util ├── find_free_gpu.py ├── gpu_lock.py ├── submit_tf.sh ├── submit_tf_2.sh ├── submit_tf_cpu.sh ├── sync_code_from_afs.sh └── sync_output_to_afs.sh └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.cfgc 3 | tool/* 4 | !tool/bin/sv56demo -------------------------------------------------------------------------------- /README_ORIG.md: -------------------------------------------------------------------------------- 1 | # A TensorFlow Implementation of DC-TTS: yet another text-to-speech model 2 | 3 | I implement yet another text-to-speech model, dc-tts, introduced in [Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention](https://arxiv.org/abs/1710.08969). My goal, however, is not just replicating the paper. Rather, I'd like to gain insights into various sound projects. 4 | 5 | ## Requirements 6 | * NumPy >= 1.11.1 7 | * TensorFlow >= 1.3 (Note that the API of `tf.contrib.layers.layer_norm` has changed since 1.3) 8 | * librosa 9 | * tqdm 10 | * matplotlib 11 | * scipy 12 | 13 | ## Data 14 | 15 | 16 | 17 | 18 | 19 | 20 | I train English models and a Korean model on four different speech datasets.

1. [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
2. [Nick Offerman's Audiobooks](https://www.audible.com.au/search?searchNarrator=Nick+Offerman)
3. [Kate Winslet's Audiobook](https://www.audible.com.au/pd/Classics/Therese-Raquin-Audiobook/B00FF0SLW4/ref=a_search_c4_1_3_srTtl?qid=1516854754&sr=1-3)
4. [KSS Dataset](https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset) 21 | 22 | The LJ Speech Dataset has recently become widely used as a benchmark dataset for TTS because it is publicly available and contains 24 hours of reasonable-quality samples. 23 | Nick's and Kate's audiobooks are additionally used to see whether the model can learn even from smaller amounts of more variable speech. They are 18 hours and 5 hours long, respectively. Finally, KSS Dataset is a Korean single-speaker speech dataset that lasts more than 12 hours. 24 | 25 | 26 | ## Training 27 | * STEP 0. Download [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/) or prepare your own data. 28 | * STEP 1. Adjust hyper parameters in `hyperparams.py`. (If you want to do preprocessing, set `prepro` to True.) 29 | * STEP 2. Run `python train.py 1` for training Text2Mel. (If you set `prepro` to True, run `python prepro.py` first.) 30 | * STEP 3. Run `python train.py 2` for training SSRN. 31 | 32 | You can run STEPs 2 and 3 at the same time if you have more than one GPU. 33 | 34 | ## Training Curves 35 | 36 | 37 | 38 | ## Attention Plot 39 | 40 | 41 | ## Sample Synthesis 42 | I generate speech samples based on [Harvard Sentences](http://www.cs.columbia.edu/~hgs/audio/harvard.html), as the original paper does; the sentence list is already included in the repo. 43 | 44 | * Run `synthesize.py` and check the files in `samples`. 45 | 46 | ## Generated Samples 47 | 48 | | Dataset | Samples | 49 | | :----- |:-------------| 50 | | LJ | [50k](https://soundcloud.com/kyubyong-park/sets/dc_tts) [200k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_200k) [310k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_310k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_800k)| 51 | | Nick | [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_40k) [170k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_170k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_800k)| 52 | | Kate| [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_40k) [160k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_160k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_800k) | 53 | | KSS| [400k](https://soundcloud.com/kyubyong-park/sets/dc_tts_ko_400k) | 54 | 55 | ## Pretrained Model for LJ 56 | 57 | Download [this](https://www.dropbox.com/s/1oyipstjxh2n5wo/LJ_logdir.tar?dl=0). 58 | 59 | ## Notes 60 | 61 | * The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization. 62 | * The paper fixed the learning rate to 0.001, but that didn't work for me. So I decayed it. 63 | * I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating the two networks mitigates the burden of training. 64 | * The authors claimed that the model can be trained within a day, but unfortunately the luck was not mine. Still, it is clearly much faster than Tacotron, as it uses only convolutional layers. 65 | * Thanks to the guided attention, the attention plot looks monotonic almost from the beginning. I guess the guide holds the alignment tight so it doesn't lose track. 66 | * The paper didn't mention dropouts. I applied them, as I believe they help with regularization. 67 | * Check also other TTS models such as [Tacotron](https://github.com/kyubyong/tacotron) and [Deep Voice 3](https://github.com/kyubyong/deepvoice3).
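For reference, this is roughly what a decayed learning rate looks like; a minimal sketch assuming TF 1.x (the exact schedule used here may differ, and the decay parameters below are illustrative):

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Start from the paper's 0.001 and, for example, halve the rate every 100k steps.
lr = tf.train.exponential_decay(learning_rate=0.001,
                                global_step=global_step,
                                decay_steps=100000,
                                decay_rate=0.5,
                                staircase=True)
optimizer = tf.train.AdamOptimizer(lr)
```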
68 | 69 | -------------------------------------------------------------------------------- /add_duration_to_transcript.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import os 4 | import sys 5 | import os.path 6 | 7 | # Usage: python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file 8 | fa_matrices_dir = sys.argv[1] 9 | transcript = sys.argv[2] 10 | out_transcript = sys.argv[3] 11 | 12 | with open(transcript, 'r') as f: 13 | tra = f.readlines() 14 | 15 | f = open(out_transcript,'w') 16 | 17 | 18 | for t in tra: 19 | 20 | file = t.split('|')[0] 21 | phones = t.split('|')[3] 22 | num_phones = len(phones.split(' ')) 23 | 24 | if os.path.exists(fa_matrices_dir + file + '.npy'): 25 | 26 | A = np.load( fa_matrices_dir + file + '.npy') 27 | durations = np.sum(A,axis=0) # sum attention mass over output frames to get per-phone durations 28 | durations = 4*durations # durations are counted in coarse (50ms) frames; multiply by 4 to express them in 12.5ms frames 29 | assert(num_phones == A.shape[1]) 30 | assert(num_phones == len(durations)) 31 | 32 | f.write(t.rstrip() + '||' + " ".join(map(str, durations.astype(int))) + '\n') 33 | 34 | else: 35 | print("Missing " + file) 36 | 37 | f.close() 38 | -------------------------------------------------------------------------------- /calculate_CDP_Ain_Aout.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import math 4 | import glob 5 | import numpy as np 6 | import matplotlib 7 | import matplotlib.pyplot as plt 8 | 9 | def get_att_per_input(A): 10 | 11 | att_per_input = np.trim_zeros(np.sum(A,axis=1),'b') # total attention received by each input token; drop trailing zero padding 12 | num_input = len(att_per_input) 13 | 14 | return att_per_input, num_input 15 | 16 | # coverage deviation penalty (CDP): punishes input tokens that are under-represented in the output 17 | # (lack of attention: skips) or over-represented (too much attention: prolongations) 18 | def getCDP(A): 19 | 20 | att_per_input, num_input = get_att_per_input(A) 21 | 22 | tmp = np.sum( np.log( 1. + (1. - att_per_input )**2 ) )
23 | CDP = tmp / num_input # removed the minus sign 24 | 25 | return CDP 26 | 27 | def getEnt(A): 28 | 29 | Entr = 0.0 30 | 31 | for a in A: # traverse rows of A 32 | norm = np.sum(a) 33 | normPd = a 34 | if norm != 0.0: 35 | normPd = [ p / norm for p in a ] 36 | entr = np.sum( [ ( p * np.log(p) if (p!=0.0) else 0.0 ) for p in normPd ] ) 37 | Entr += entr 38 | 39 | Entr = -Entr/A.shape[0] 40 | Entr /= np.log(A.shape[1]) # to normalise it between 0 and 1 41 | 42 | return Entr 43 | 44 | 45 | # absentmindedness penalty: punishes scattered attention; dispersion is measured via the entropy 46 | def getAP(A): 47 | 48 | att_per_input, num_input = get_att_per_input(A) 49 | 50 | A = A[:num_input,:] # drop zero-padded rows beyond the real input length 51 | 52 | 53 | 54 | APin = getEnt(A) 55 | APout = getEnt(np.transpose(A)) 56 | 57 | return APin, APout 58 | 59 | def main(file_name): 60 | 61 | A = np.load(file_name) # matrix axis 0: input (phones) / axis 1: output (acoustic) 62 | print(A.shape) 63 | # fig, ax = plt.subplots() 64 | # im = ax.imshow(A) 65 | # fig.colorbar(im) 66 | # plt.ylabel('Encoder timestep') 67 | # plt.xlabel('Decoder timestep') 68 | # plt.show() 69 | 70 | CDP = getCDP(A) 71 | APin, APout = getAP(A) 72 | 73 | print('CDP: ' + str(CDP)) 74 | print('Ain: ' + str(APin)) 75 | print('Aout: ' + str(APout)) 76 | 77 | if __name__ == '__main__': 78 | 79 | # Input attention matrix - numpy format - axis 0: input (phones) / axis 1: output (acoustic) 80 | 81 | file_name = sys.argv[1] 82 | 83 | main(file_name) 84 | 85 | 86 | -------------------------------------------------------------------------------- /config/lj_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(topworkdir, 'lj_test', 'data') ## use older data 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # data locations: 28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | ## Combilex RPX plus extra symbols:- 39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 40 | '<_END_>', '<_START_>', \ 41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 43 | 'l', 'l!', 'lw', 'm', 'm!', 'n',
'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 44 | 45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 46 | max_N = 180 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel = 0.3333 85 | lw_bd1 = 0.3333 86 | lw_att = 0.3333 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | 91 | 92 | ## validation: 93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 94 | validation_sentences_to_evaluate = 32 95 | validation_sentences_to_synth_params = 3 96 | 97 | 98 | # training scheme 99 | restart_from_savepath = [] 100 | lr = 0.001 # Initial learning rate. 101 | batchsize = {'t2m': 32, 'ssrn': 32} 102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 103 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 104 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
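## A note on the validation hold-out defined above: validpatt acts as a plain substring filter over
## utterance names, roughly like this sketch (names hypothetical; the real filtering presumably
## happens during data loading):
##     train_utts = [u for u in all_utts if validpatt not in u]
##     valid_utts = [u for u in all_utts if validpatt in u]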
105 | max_epochs = 300 106 | 107 | # attention plotting during training 108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 110 | -------------------------------------------------------------------------------- /config/lj_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## RPX phones as in lj_01 and 16kHz as in lj_02, but experiment with training the babbler before T2M 5 | 6 | import os 7 | 8 | 9 | ## Take name of this file to be the config name: 10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 11 | 12 | ## Define place to put outputs relative to this config file's location; 13 | ## supply an absolute path to work elsewhere: 14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 15 | 16 | voicedir = os.path.join(topworkdir, config_name) 17 | logdir = os.path.join(voicedir, 'train') 18 | sampledir = os.path.join(voicedir, 'synth') 19 | 20 | ## Change featuredir to absolute path to use existing features 21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data 22 | coarse_audio_dir = os.path.join(featuredir, 'mels') 23 | full_mel_dir = os.path.join(featuredir, 'full_mels') 24 | full_audio_dir = os.path.join(featuredir, 'mags') 25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 26 | 27 | bucket_data_by = 'audio_length' 28 | 29 | # data locations: 30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | 32 | 33 | transcript = os.path.join(datadir, 'transcript.csv') 34 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 35 | waveforms = os.path.join(datadir, 'wav_trim') 36 | 37 | 38 | 39 | input_type = 'phones' ## letters or phones 40 | ## Combilex RPX plus extra symbols:- 41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 42 | '<_END_>', '<_START_>', \ 43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 46 | 47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 48 | max_N = 180 # Maximum number of characters. # !!! 49 | max_T = 210 # Maximum number of mel frames. # !!! 50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 51 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 53 | 54 | 55 | 56 | # signal processing 57 | trim_before_spectrogram_extraction = 0 58 | vocoder = 'griffin_lim' 59 | sr = 16000 # Sampling rate.
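## For concreteness: with sr = 16000, the 12.5 ms shift and 50 ms window set below work out to
## hop_length = int(16000 * 0.0125) = 200 samples and win_length = int(16000 * 0.05) = 800 samples.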
60 | n_fft = 2048 # fft points (samples) 61 | frame_shift = 0.0125 # seconds 62 | frame_length = 0.05 # seconds 63 | hop_length = int(sr * frame_shift) 64 | win_length = int(sr * frame_length) 65 | prepro = True # don't extract spectrograms on the fly 66 | full_dim = n_fft//2+1 67 | n_mels = 80 # Number of Mel banks to generate 68 | power = 1.5 # Exponent for amplifying the predicted magnitude 69 | n_iter = 50 # Number of inversion iterations 70 | preemphasis = .97 71 | max_db = 100 72 | ref_db = 20 73 | 74 | 75 | # Model 76 | r = 4 # Reduction factor. Do not change this. 77 | dropout_rate = 0.05 78 | e = 128 # == embedding 79 | d = 256 # == hidden units of Text2Mel 80 | c = 512 # == hidden units of SSRN 81 | attention_win_size = 3 82 | g = 0.2 ## determines width of band in attention guide 83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 84 | 85 | ## loss weights : T2M 86 | lw_mel = 0.3333 87 | lw_bd1 = 0.3333 88 | lw_att = 0.3333 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | 93 | loss_weights = {'babbler': {'L1': 0.5, 'binary_divergence': 0.5}} ## TODO: make other loss config consistent 94 | 95 | ## validation: 96 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 97 | validation_sentences_to_evaluate = 32 98 | validation_sentences_to_synth_params = 3 99 | 100 | 101 | # training scheme 102 | restart_from_savepath = [] 103 | lr = 0.001 # Initial learning rate. 104 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32} 105 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 106 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 107 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
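## A note on the babbler loss defined above: like SSRN's lw_mag/lw_bd2 pair, it weights an L1 term
## and a binary-divergence term equally (0.5 each); binary divergence is the cross-entropy-style
## spectrogram loss from the DC-TTS paper, here presumably applied to the babbler's mel predictions.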
108 | max_epochs = 300 109 | 110 | # attention plotting during training 111 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 112 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 113 | -------------------------------------------------------------------------------- /config/lj_04.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## RPX phones as in lj_01 and 16kHz as in lj_02, but experiment with training the babbler before T2M 5 | ## stage 2: fine-tune on a subset of the transcribed data 6 | import os 7 | 8 | 9 | ## Take name of this file to be the config name: 10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 11 | 12 | ## Define place to put outputs relative to this config file's location; 13 | ## supply an absolute path to work elsewhere: 14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 15 | 16 | voicedir = os.path.join(topworkdir, config_name) 17 | logdir = os.path.join(voicedir, 'train') 18 | sampledir = os.path.join(voicedir, 'synth') 19 | 20 | ## Change featuredir to absolute path to use existing features 21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data 22 | coarse_audio_dir = os.path.join(featuredir, 'mels') 23 | full_mel_dir = os.path.join(featuredir, 'full_mels') 24 | full_audio_dir = os.path.join(featuredir, 'mags') 25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 26 | 27 | 28 | 29 | # data locations: 30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | 32 | 33 | transcript = os.path.join(datadir, 'transcript.csv') 34 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 35 | waveforms = os.path.join(datadir, 'wav_trim') 36 | 37 | 38 | 39 | input_type = 'phones' ## letters or phones 40 | ## Combilex RPX plus extra symbols:- 41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 42 | '<_END_>', '<_START_>', \ 43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 46 | 47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 48 | max_N = 180 # Maximum number of characters. # !!! 49 | max_T = 210 # Maximum number of mel frames. # !!! 50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 51 | n_utts = 1000 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 53 | 54 | 55 | 56 | # signal processing 57 | trim_before_spectrogram_extraction = 0 58 | vocoder = 'griffin_lim' 59 | sr = 16000 # Sampling rate.
60 | n_fft = 2048 # fft points (samples) 61 | frame_shift = 0.0125 # seconds 62 | frame_length = 0.05 # seconds 63 | hop_length = int(sr * frame_shift) 64 | win_length = int(sr * frame_length) 65 | prepro = True # don't extract spectrograms on the fly 66 | full_dim = n_fft//2+1 67 | n_mels = 80 # Number of Mel banks to generate 68 | power = 1.5 # Exponent for amplifying the predicted magnitude 69 | n_iter = 50 # Number of inversion iterations 70 | preemphasis = .97 71 | max_db = 100 72 | ref_db = 20 73 | 74 | 75 | # Model 76 | r = 4 # Reduction factor. Do not change this. 77 | dropout_rate = 0.05 78 | e = 128 # == embedding 79 | d = 256 # == hidden units of Text2Mel 80 | c = 512 # == hidden units of SSRN 81 | attention_win_size = 3 82 | g = 0.2 ## determines width of band in attention guide 83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 84 | 85 | ## loss weights : T2M 86 | lw_mel = 0.3333 87 | lw_bd1 = 0.3333 88 | lw_att = 0.3333 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | 93 | 94 | ## validation: 95 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 96 | validation_sentences_to_evaluate = 32 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.001 # Initial learning rate. 103 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32} 104 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 105 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 106 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
107 | max_epochs = 1000 108 | 109 | # attention plotting during training 110 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 111 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 112 | -------------------------------------------------------------------------------- /config/lj_test.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(voicedir, 'data') 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # data locations: 28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | ## Combilex RPX plus extra symbols:- 39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 40 | '<_END_>', '<_START_>', \ 41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 43 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 44 | 45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 46 | max_N = 180 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 200 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate.
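## NB: at 22.05 kHz the frame settings below do not give whole-sample sizes; int() truncates
## 22050 * 0.0125 = 275.625 to hop_length = 275 and 22050 * 0.05 = 1102.5 to win_length = 1102.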
58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel = 0.3333 85 | lw_bd1 = 0.3333 86 | lw_att = 0.3333 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | 91 | 92 | ## validation: 93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI 94 | validation_sentences_to_evaluate = 32 95 | validation_sentences_to_synth_params = 3 96 | 97 | 98 | # training scheme 99 | restart_from_savepath = [] 100 | lr = 0.001 # Initial learning rate. 101 | batchsize = {'t2m': 32, 'ssrn': 32} 102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 103 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 104 | save_every_n_epochs = 2 ## how often to archive a model, in addition to keeping the 5 most recent 105 | max_epochs = 4 106 | 107 | # attention plotting during training 108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for -------------------------------------------------------------------------------- /config/lj_tutorial.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(voicedir, 'data') 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # Data locations: 28 | datadir = '/path/to/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 |
36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] ## NB: '<_END_>' and '<_START_>' already appear in the first list, so they are not repeated here 40 | 41 | max_N = 150 # Maximum number of characters/phones 42 | max_T = 200 # Maximum number of mel frames 43 | 44 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 45 | n_utts = 200 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 46 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 47 | 48 | 49 | 50 | # signal processing 51 | trim_before_spectrogram_extraction = 0 52 | vocoder = 'griffin_lim' 53 | sr = 22050 # Sampling rate. 54 | n_fft = 2048 # fft points (samples) 55 | frame_shift = 0.0125 # seconds 56 | frame_length = 0.05 # seconds 57 | hop_length = int(sr * frame_shift) 58 | win_length = int(sr * frame_length) 59 | prepro = True # don't extract spectrograms on the fly 60 | full_dim = n_fft//2+1 61 | n_mels = 80 # Number of Mel banks to generate 62 | power = 1.5 # Exponent for amplifying the predicted magnitude 63 | n_iter = 50 # Number of inversion iterations 64 | preemphasis = .97 65 | max_db = 100 66 | ref_db = 20 67 | 68 | 69 | # Model 70 | r = 4 # Reduction factor. Do not change this. 71 | dropout_rate = 0.05 72 | e = 128 # == embedding 73 | d = 256 # == hidden units of Text2Mel 74 | c = 512 # == hidden units of SSRN 75 | attention_win_size = 3 76 | g = 0.2 ## determines width of band in attention guide 77 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 78 | 79 | ## loss weights : T2M 80 | lw_mel = 0.25 81 | lw_bd1 = 0.25 82 | lw_att = 0.25 83 | lw_t2m_l2 = 0.25 84 | ## : SSRN 85 | lw_mag = 0.3333 86 | lw_bd2 = 0.3333 87 | lw_ssrn_l2 = 0.3333 88 | 89 | 90 | ## validation: 91 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI 92 | validation_sentences_to_evaluate = 32 93 | validation_sentences_to_synth_params = 3 94 | 95 | 96 | # training scheme 97 | restart_from_savepath = [] 98 | lr = 0.001 # Initial learning rate.
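## NB: the ssrn batch size below is smaller than t2m's, plausibly because SSRN trains on
## full-resolution magnitude spectrograms (full_dim = 2048//2 + 1 = 1025 bins, against n_mels = 80
## coarse mel bins), so each SSRN example costs considerably more memory.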
99 | batchsize = {'t2m': 32, 'ssrn': 8} 100 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 101 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 102 | save_every_n_epochs = 5 ## how often to archive a model, in addition to keeping the 5 most recent 103 | max_epochs = 10 104 | 105 | # attention plotting during training 106 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 107 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 108 | -------------------------------------------------------------------------------- /config/nancy2nick_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | nancy base voice retrained on nick, for ICPhS in the first instance. 6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 38 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | waveforms = os.path.join(datadir, 'wav_trim') 40 | 41 | 42 | 43 | input_type = 'phones' ## letters or phones 44 | 45 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 46 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 47 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 48 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 49 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 50 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 51 | 52 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 53 | max_N = 140 # Maximum number of characters. # !!! 54 | max_T = 200 # Maximum number of mel frames. # !!! 55 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 56 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 57 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
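## A minimal sketch of what the random reduction above means (names hypothetical; the real code
## presumably lives in data_load.py): with reduction factor r = 4, the coarse features keep every
## 4th mel frame, starting from a freshly drawn offset each time the utterance is loaded:
##     shift = np.random.randint(0, r)
##     coarse_mel = full_mel[shift::r, :]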
58 | 59 | 60 | 61 | # signal processing 62 | trim_before_spectrogram_extraction = 0 63 | vocoder = 'griffin_lim' 64 | sr = 16000 # Sampling rate. 65 | n_fft = 2048 # fft points (samples) 66 | frame_shift = 0.0125 # seconds 67 | frame_length = 0.05 # seconds 68 | hop_length = int(sr * frame_shift) 69 | win_length = int(sr * frame_length) 70 | prepro = True # don't extract spectrograms on the fly 71 | full_dim = n_fft//2+1 72 | n_mels = 80 # Number of Mel banks to generate 73 | power = 1.5 # Exponent for amplifying the predicted magnitude 74 | n_iter = 50 # Number of inversion iterations 75 | preemphasis = .97 76 | max_db = 100 77 | ref_db = 20 78 | 79 | 80 | # Model 81 | r = 4 # Reduction factor. Do not change this. 82 | dropout_rate = 0.05 83 | e = 128 # == embedding 84 | d = 256 # == hidden units of Text2Mel 85 | c = 512 # == hidden units of SSRN 86 | attention_win_size = 3 87 | g = 0.2 ## determines width of band in attention guide 88 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 89 | 90 | ## loss weights : T2M 91 | lw_mel = 0.3333 92 | lw_bd1 = 0.3333 93 | lw_att = 0.3333 94 | ## : SSRN 95 | lw_mag = 0.5 96 | lw_bd2 = 0.5 97 | 98 | 99 | ## validation: 100 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 101 | #### NYT00 matches 24 of nancy's sentences. 102 | validation_sentences_to_evaluate = 24 103 | validation_sentences_to_synth_params = 16 104 | 105 | 106 | # training scheme 107 | restart_from_savepath = [] 108 | lr = 0.001 # Initial learning rate. 109 | batchsize = {'t2m': 32, 'ssrn': 32} 110 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 111 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model 112 | max_epochs = 100 113 | 114 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 115 | WORK='/disk/scratch/oliver/dc_tts_osw_clean/work/' 116 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancy_01/train-t2m/model_epoch_1000'), \ 117 | ('SSRN', WORK+'/nancy_01/train-ssrn/model_epoch_1000')] 118 | 119 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 120 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancy_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | nancy base voice for ICPhS in first instance. 
6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nancy/derived' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = '/group/project/cstr2/owatts/data/nick16k/transcript_mrt.csv' 38 | 39 | 40 | waveforms = os.path.join(datadir, 'wav_trim') 41 | 42 | 43 | 44 | input_type = 'phones' ## letters or phones 45 | 46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 47 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 52 | 53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 54 | max_N = 140 # Maximum number of characters. # !!! 55 | max_T = 200 # Maximum number of mel frames. # !!! 56 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 57 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 58 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 59 | 60 | 61 | 62 | # signal processing 63 | trim_before_spectrogram_extraction = 0 64 | vocoder = 'griffin_lim' 65 | sr = 16000 # Sampling rate. 66 | n_fft = 2048 # fft points (samples) 67 | frame_shift = 0.0125 # seconds 68 | frame_length = 0.05 # seconds 69 | hop_length = int(sr * frame_shift) 70 | win_length = int(sr * frame_length) 71 | prepro = True # don't extract spectrograms on the fly 72 | full_dim = n_fft//2+1 73 | n_mels = 80 # Number of Mel banks to generate 74 | power = 1.5 # Exponent for amplifying the predicted magnitude 75 | n_iter = 50 # Number of inversion iterations 76 | preemphasis = .97 77 | max_db = 100 78 | ref_db = 20 79 | 80 | 81 | # Model 82 | r = 4 # Reduction factor. Do not change this.
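## For orientation: with r = 4 and the 12.5 ms frame shift above, one coarse frame covers 50 ms;
## if max_T counts these coarse frames (as in the original DC-TTS code), max_T = 200 allows
## at most 200 * 0.05 = 10 seconds of speech per utterance.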
83 | dropout_rate = 0.05 84 | e = 128 # == embedding 85 | d = 256 # == hidden units of Text2Mel 86 | c = 512 # == hidden units of SSRN 87 | attention_win_size = 3 88 | g = 0.2 ## determines width of band in attention guide 89 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 90 | 91 | ## loss weights : T2M 92 | lw_mel = 0.3333 93 | lw_bd1 = 0.3333 94 | lw_att = 0.3333 95 | ## : SSRN 96 | lw_mag = 0.5 97 | lw_bd2 = 0.5 98 | 99 | 100 | ## validation: 101 | validpatt = 'NYT00' ## sentence names containing this substring will be held out of training. 102 | #### NYT00 matches 24 of nancy's sentences. 103 | validation_sentences_to_evaluate = 24 104 | validation_sentences_to_synth_params = 16 105 | 106 | 107 | # training scheme 108 | restart_from_savepath = [] 109 | lr = 0.001 # Initial learning rate. 110 | batchsize = {'t2m': 32, 'ssrn': 32} 111 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 112 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 113 | save_every_n_epochs = 100 ## how often to archive a model, in addition to keeping the 5 most recent 114 | max_epochs = 1000 115 | 116 | # attention plotting during training 117 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 118 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for -------------------------------------------------------------------------------- /config/nancyplusnick_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 38 | 39 | 40 | waveforms = os.path.join(datadir, 'wav_trim') 41 | 42 | 43 | 44 | input_type = 'phones' ## letters or phones 45 | 46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 47 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
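## The multispeaker settings a few lines below splice learned speaker embeddings into the listed
## network positions. A minimal sketch of the lookup side, assuming TF 1.x (names hypothetical):
##     speaker_table = tf.get_variable('speaker_embedding', [nspeakers, speaker_embedding_size])
##     speaker_vector = tf.nn.embedding_lookup(speaker_table, speaker_ids)
## The looked-up vector is then combined (e.g. concatenated) with the activations at each position.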
52 | 53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 54 | max_N = 140 # Maximum number of characters. # !!! 55 | max_T = 200 # Maximum number of mel frames. # !!! 56 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 57 | speaker_list = [''] + ['nancy', 'nick'] 58 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 59 | speaker_embedding_size = 128 60 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 61 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 62 | 63 | 64 | 65 | # signal processing 66 | trim_before_spectrogram_extraction = 0 67 | vocoder = 'griffin_lim' 68 | sr = 16000 # Sampling rate. 69 | n_fft = 2048 # fft points (samples) 70 | frame_shift = 0.0125 # seconds 71 | frame_length = 0.05 # seconds 72 | hop_length = int(sr * frame_shift) 73 | win_length = int(sr * frame_length) 74 | prepro = True # don't extract spectrograms on the fly 75 | full_dim = n_fft//2+1 76 | n_mels = 80 # Number of Mel banks to generate 77 | power = 1.5 # Exponent for amplifying the predicted magnitude 78 | n_iter = 50 # Number of inversion iterations 79 | preemphasis = .97 80 | max_db = 100 81 | ref_db = 20 82 | 83 | 84 | # Model 85 | r = 4 # Reduction factor. Do not change this. 86 | dropout_rate = 0.05 87 | e = 128 # == embedding 88 | d = 256 # == hidden units of Text2Mel 89 | c = 512 # == hidden units of SSRN 90 | attention_win_size = 3 91 | g = 0.2 ## determines width of band in attention guide 92 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 93 | 94 | ## loss weights : T2M 95 | lw_mel = 0.3333 96 | lw_bd1 = 0.3333 97 | lw_att = 0.3333 98 | ## : SSRN 99 | lw_mag = 0.5 100 | lw_bd2 = 0.5 101 | 102 | 103 | ## validation: 104 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 105 | #### NYT00 matches 24 of nancy's sentences. 106 | validation_sentences_to_evaluate = 24 107 | validation_sentences_to_synth_params = 16 108 | 109 | 110 | # training scheme 111 | restart_from_savepath = [] 112 | lr = 0.001 # Initial learning rate. 113 | batchsize = {'t2m': 32, 'ssrn': 32} 114 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 115 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 116 | max_epochs = 1000 117 | 118 | -------------------------------------------------------------------------------- /config/nancyplusnick_02.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 
6 | Stage 2: fine-tune on nick only 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this.
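## For reference, the attention guide whose width g is set below follows the DC-TTS paper:
## W[n, t] = 1 - exp(-((n/N - t/T)**2) / (2 * g**2)), which penalises attention far from the
## diagonal; larger g widens the tolerated band.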
87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 100 118 | 119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 120 | WORK=topworkdir 121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')] 122 | 123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 124 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancyplusnick_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 
6 | Stage 2: fine-tune on nick only -- in this version, set the attention loss weight very high 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this.
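## NB: the point of this config (see the docstring above) shows up in the loss weights below:
## lw_att = 0.9 against 0.15 each for the mel and binary-divergence terms, so the attention loss
## dominates, presumably to hold the stage-1 alignment in place while fine-tuning on nick.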
87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.15 97 | lw_bd1 = 0.15 98 | lw_att = 0.9 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 100 118 | 119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 120 | WORK=topworkdir 121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')] 122 | 123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 124 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancyplusnick_04_lcc.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance 6 | Version 4: train with learned channel contributions from the 2 speakers 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absoluate path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 
'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['learn_channel_contributions'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 2 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this. 87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 
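## 'learn_channel_contributions' is only described by the docstring above; one
## plausible reading is a learned per-speaker weight on each hidden channel.
## An illustrative numpy sketch -- the names here are invented, not taken from
## modules.py or networks.py:
import numpy as np
n_spk, channels = 4, 256
contrib = np.ones((n_spk, channels), dtype=np.float32)  # trainable per-speaker channel weights
h = np.random.rand(10, channels)                        # hidden activations for one utterance
h = h * contrib[1]                                      # rescale channels for speaker id 1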
114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 100 ## as well as the 5 latest models, how often to archive a model 117 | max_epochs = 1000 118 | 119 | -------------------------------------------------------------------------------- /config/nancyplusnick_05_lcc_sdpe.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance 6 | Version 5: train with learned channel contributions from the 2 speakers, plus speaker-dependent phones 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['learn_channel_contributions', 'speaker_dependent_phones'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 2 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 
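## A quick sanity check of the frame arithmetic encoded by the derived
## settings below, at this sampling rate:
assert int(16000 * 0.0125) == 200    # hop_length: 12.5 ms shift = 200 samples
assert int(16000 * 0.05) == 800      # win_length: 50 ms window = 800 samples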
70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this. 87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 1000 118 | 119 | -------------------------------------------------------------------------------- /config/project/baseline.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/' 26 | #attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/fa_attention_guides_punctuation_all_quotes/' 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 33 | 34 | waveforms = os.path.join(datadir, 
'wavs_trim') 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | 39 | ## CMU phones: 40 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 41 | 42 | # Train 43 | max_N = 150 # Maximum number of characters/phones 44 | max_T = 264 # Maximum number of mel frames 45 | 46 | turn_off_monotonic_for_synthesis = True 47 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 48 | 49 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 50 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 51 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 52 | 53 | 54 | 55 | # signal processing 56 | trim_before_spectrogram_extraction = 0 57 | vocoder = 'griffin_lim' 58 | sr = 22050 # Sampling rate. 59 | n_fft = 2048 # fft points (samples) 60 | frame_shift = 0.0125 # seconds 61 | frame_length = 0.05 # seconds 62 | hop_length = int(sr * frame_shift) 63 | win_length = int(sr * frame_length) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel =0.333 86 | lw_bd1 =0.333 87 | lw_att =0.333 88 | lw_t2m_l2 = 0.0 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | lw_ssrn_l2 = 0.0 93 | 94 | 95 | ## validation: 96 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 97 | validation_sentences_to_evaluate = 5 98 | validation_sentences_to_synth_params = 3 99 | 100 | 101 | # training scheme 102 | restart_from_savepath = [] 103 | lr = 0.0001 # Initial learning rate. 
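## A minimal sketch of the held-out split that validpatt implies: any utterance
## whose name contains the substring is kept for validation. The '|'-separated
## column layout is an assumption, not necessarily this repo's transcript format:
with open(transcript) as f:
    rows = [line.strip().split('|') for line in f if line.strip()]
valid = [r for r in rows if validpatt in r[0]]
train = [r for r in rows if validpatt not in r[0]]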
104 | beta1 = 0.5 105 | beta2 = 0.9 106 | epsilon = 0.000001 107 | decay_lr = False 108 | batchsize = {'t2m': 8, 'ssrn': 2} 109 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 110 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 111 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 112 | max_epochs = 500 113 | 114 | # attention plotting during training 115 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 116 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 117 | -------------------------------------------------------------------------------- /config/project/fa_as_attention.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/' 26 | use_external_durations = True 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17_durations.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes_durations_updated.csv') 33 | waveforms = os.path.join(datadir, 'wavs_trim') 34 | 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 40 | 41 | # Train 42 | max_N = 150 # Maximum number of characters/phones 43 | max_T = 264 # Maximum number of mel frames 44 | 45 | turn_off_monotonic_for_synthesis = True 46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 47 | 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel =0.333 85 | lw_bd1 =0.333 86 | lw_att =0.333 87 | lw_t2m_l2 = 0.0 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | lw_ssrn_l2 = 0.0 92 | 93 | 94 | ## validation: 95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 96 | validation_sentences_to_evaluate = 5 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.0001 # Initial learning rate. 
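## With use_external_durations = True the transcript rows carry forced-alignment
## durations. A sketch (not repo code) of how per-phone frame counts can stand in
## for the alignment that attention would otherwise have to learn:
import numpy as np
def durations_to_attention(durs):
    N, T = len(durs), int(sum(durs))
    A = np.zeros((N, T), dtype=np.float32)
    t = 0
    for n, d in enumerate(durs):
        A[n, t:t + int(d)] = 1.0     # phone n is attended to for d frames
        t += int(d)
    return A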
103 | beta1 = 0.5 104 | beta2 = 0.9 105 | epsilon = 0.000001 106 | decay_lr = False 107 | batchsize = {'t2m': 8, 'ssrn': 2} 108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 111 | max_epochs = 500 112 | 113 | # attention plotting during training 114 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 116 | -------------------------------------------------------------------------------- /config/project/fa_as_guide.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/W0.1_attention_guides_dctts/' # contains FA guides 26 | 27 | # Data locations: 28 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 29 | 30 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv') 31 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 32 | waveforms = os.path.join(datadir, 'wavs_trim') 33 | 34 | 35 | input_type = 'phones' ## letters or phones 36 | 37 | ## CMU phones: 38 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 39 | 40 | # Train 41 | max_N = 150 # Maximum number of characters/phones 42 | max_T = 264 # Maximum number of mel frames 43 | 44 | turn_off_monotonic_for_synthesis = True 45 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 46 | 47 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 48 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 49 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 50 | 51 | 52 | 53 | # signal processing 54 | trim_before_spectrogram_extraction = 0 55 | vocoder = 'griffin_lim' 56 | sr = 22050 # Sampling rate. 57 | n_fft = 2048 # fft points (samples) 58 | frame_shift = 0.0125 # seconds 59 | frame_length = 0.05 # seconds 60 | hop_length = int(sr * frame_shift) 61 | win_length = int(sr * frame_length) 62 | prepro = True # don't extract spectrograms on the fly 63 | full_dim = n_fft//2+1 64 | n_mels = 80 # Number of Mel banks to generate 65 | power = 1.5 # Exponent for amplifying the predicted magnitude 66 | n_iter = 50 # Number of inversion iterations 67 | preemphasis = .97 68 | max_db = 100 69 | ref_db = 20 70 | 71 | 72 | # Model 73 | r = 4 # Reduction factor. Do not change this. 74 | dropout_rate = 0.05 75 | e = 128 # == embedding 76 | d = 256 # == hidden units of Text2Mel 77 | c = 512 # == hidden units of SSRN 78 | attention_win_size = 3 79 | g = 0.2 ## determines width of band in attention guide 80 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 81 | 82 | ## loss weights : T2M 83 | lw_mel =0.333 84 | lw_bd1 =0.333 85 | lw_att =0.333 86 | lw_t2m_l2 = 0.0 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | lw_ssrn_l2 = 0.0 91 | 92 | 93 | ## validation: 94 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 95 | validation_sentences_to_evaluate = 5 96 | validation_sentences_to_synth_params = 3 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.0001 # Initial learning rate. 
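## The forced-alignment guides above replace the default diagonal guide. For
## reference, the diagonal guide of the DC-TTS paper, using the g set above
## (a sketch; the repo presumably builds its own guides in prepare_attention_guides.py):
import numpy as np
def diagonal_guide(N, T, g=0.2):
    n = np.arange(N).reshape(-1, 1) / float(N)
    t = np.arange(T).reshape(1, -1) / float(T)
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))  # near 0 on the diagonal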
102 | beta1 = 0.5 103 | beta2 = 0.9 104 | epsilon = 0.000001 105 | decay_lr = False 106 | batchsize = {'t2m': 8, 'ssrn': 2} 107 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 108 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 109 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 110 | max_epochs = 500 111 | 112 | # attention plotting during training 113 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 114 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 115 | -------------------------------------------------------------------------------- /config/project/fa_as_target.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_dctts/' # contains FA matrix 26 | attention_guide_fa = True 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 33 | waveforms = os.path.join(datadir, 'wavs_trim') 34 | 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 40 | 41 | # Train 42 | max_N = 150 # Maximum number of characters/phones 43 | max_T = 264 # Maximum number of mel frames 44 | 45 | turn_off_monotonic_for_synthesis = True 46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 47 | 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel =0.333 85 | lw_bd1 =0.333 86 | lw_att =0.333 87 | lw_t2m_l2 = 0.0 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | lw_ssrn_l2 = 0.0 92 | 93 | 94 | ## validation: 95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 96 | validation_sentences_to_evaluate = 5 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.0001 # Initial learning rate. 
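## attention_guide_fa = True flags that attention_guide_dir holds forced-alignment
## matrices rather than penalty masks. A plausible reading of the two attention-loss
## modes, as a numpy sketch (not the repo's exact code):
import numpy as np
def att_loss(A, G, fa_target=False):
    if fa_target:
        return np.abs(A - G).mean()   # FA matrix used as an explicit target
    return (A * G).mean()             # guide used as a penalty on off-diagonal mass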
103 | beta1 = 0.5 104 | beta2 = 0.9 105 | epsilon = 0.000001 106 | decay_lr = False 107 | batchsize = {'t2m': 8, 'ssrn': 2} 108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 111 | max_epochs = 500 112 | 113 | # attention plotting during training 114 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 116 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_01/' 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 
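## A minimal sketch of how a symbol table like vocab is typically turned into model
## inputs (the repo's own mapping presumably lives in data_load.py):
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = dict(enumerate(vocab))
phones = ['<_START_>', 'hh', 'ax', 'l', 'ow', '<.>', '<_END_>']
encoded = [char2idx[p] for p in phones]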
49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 1024 # fft points (samples) 56 | hop_length = 256 # int(sr * frame_shift) 57 | win_length = 1024 # int(sr * frame_length) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.0002 # Initial learning rate. 96 | beta1 = 0.5 97 | beta2 = 0.9 98 | epsilon = 0.000001 99 | decay_lr = False 100 | batchsize = {'t2m': 16, 'ssrn': 32} 101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 104 | max_epochs = 1000 105 | 106 | # attention plotting during training 107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 109 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_02.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = 
'/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 2048 # fft points (samples) 56 | hop_length = 200 # int(sr * 0.0125) 57 | win_length = 800 # int(sr * 0.05) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.0002 # Initial learning rate. 
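## These analysis settings differ from settings_01 in G1ABC_01; a quick check of
## the frame rates each implies at sr = 16000:
for name, hop, win, nfft in [('settings_01', 256, 1024, 1024),
                             ('settings_02', 200, 800, 2048)]:
    print('%s: %.1f ms shift, %.1f ms window, n_fft %d'
          % (name, 1000.0 * hop / 16000, 1000.0 * win / 16000, nfft))
## settings_01: 16.0 ms shift, 64.0 ms window; settings_02: 12.5 ms and 50.0 ms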
96 | beta1 = 0.5 97 | beta2 = 0.9 98 | epsilon = 0.000001 99 | decay_lr = False 100 | batchsize = {'t2m': 16, 'ssrn': 32} 101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 104 | max_epochs = 1000 105 | 106 | # attention plotting during training 107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 109 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 
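## max_N and max_T bound the padded text and coarse-mel lengths; longer utterances
## are presumably dropped at load time (cf. data_load.py). A toy version of that
## filter:
toy = [(['ih', 't'] * 5, 120), (['ax'] * 200, 300)]   # (phones, coarse frames)
kept = [(p, t) for (p, t) in toy if len(p) <= max_N and t <= max_T]
assert len(kept) == 1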
49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 2048 # fft points (samples) 56 | hop_length = 200 # int(sr * 0.0125) 57 | win_length = 800 # int(sr * 0.05) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.001 # Initial learning rate. 96 | batchsize = {'t2m': 32, 'ssrn': 32} 97 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 98 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 99 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 100 | max_epochs = 1000 101 | 102 | # attention plotting during training 103 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 104 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 105 | -------------------------------------------------------------------------------- /config/ssw10/G1AB_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 
21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ### CMU lex version:- 39 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 40 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 41 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 42 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 43 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 44 | 'w', 'y', 'z', 'zh'] 45 | 46 | max_N = 173 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | # signal processing 54 | trim_before_spectrogram_extraction = 0 55 | vocoder = 'griffin_lim' 56 | sr = 16000 # Sampling rate. 57 | n_fft = 2048 # fft points (samples) 58 | hop_length = 200 # int(sr * 0.0125) 59 | win_length = 800 # int(sr * 0.05) 60 | prepro = True # don't extract spectrograms on the fly 61 | full_dim = n_fft//2+1 62 | n_mels = 80 # Number of Mel banks to generate 63 | power = 1.5 # Exponent for amplifying the predicted magnitude 64 | n_iter = 50 # Number of inversion iterations 65 | preemphasis = .97 66 | max_db = 100 67 | ref_db = 20 68 | 69 | 70 | # Model 71 | r = 4 # Reduction factor. Do not change this. 72 | dropout_rate = 0.05 73 | e = 128 # == embedding 74 | d = 256 # == hidden units of Text2Mel 75 | c = 512 # == hidden units of SSRN 76 | attention_win_size = 3 77 | g = 0.2 ## determines width of band in attention guide 78 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 79 | 80 | ## loss weights : T2M 81 | lw_mel = 0.3333 82 | lw_bd1 = 0.3333 83 | lw_att = 0.3333 84 | ## : SSRN 85 | lw_mag = 0.5 86 | lw_bd2 = 0.5 87 | 88 | 89 | ## validation: 90 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 91 | validation_sentences_to_evaluate = 4 92 | validation_sentences_to_synth_params = 4 93 | 94 | 95 | # training scheme 96 | restart_from_savepath = [] 97 | lr = 0.001 # Initial learning rate. 
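## This ablation removes attention: with use_external_durations, the encoder
## output can simply be repeated according to the oracle durations. A numpy
## sketch of that upsampling (illustrative, not repo code):
import numpy as np
def expand_by_duration(K, durs):
    return np.repeat(K, durs, axis=0)    # K: (N, d) -> (sum(durs), d)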
98 | batchsize = {'t2m': 32, 'ssrn': 32} 99 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 100 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 101 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 102 | max_epochs = 1000 103 | 104 | # attention plotting during training 105 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 106 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 107 | -------------------------------------------------------------------------------- /config/ssw10/G1BC_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 32 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 33 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 34 | 35 | text_encoder_type = 'minimal_feedforward' 36 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 37 | merlin_lab_dim = 416 38 | 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/G1B_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 
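## This config (like G1BC_03 above) swaps the learned text encoder for
## 'minimal_feedforward' over precomputed Merlin labels (merlin_lab_dim = 416).
## An illustrative sketch of the idea; the array contents and projection width
## are hypothetical:
import numpy as np
lab = np.random.rand(60, 416)      # stand-in for one utterance's (N, 416) label matrix
W = 0.01 * np.random.randn(416, 512)
encoded = np.tanh(lab.dot(W))      # shallow projection in place of the deep encoder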
21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | text_encoder_type = 'minimal_feedforward' 37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 38 | merlin_lab_dim = 416 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs.
SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/G1_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absolute path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 25 | 26 | history_type = 'fractional_position_in_phone' 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | text_encoder_type = 'minimal_feedforward' 37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 38 | merlin_lab_dim = 416 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings).
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 
102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/README.md: -------------------------------------------------------------------------------- 1 | # Voices for SSW10 experiment 2 | 3 | Instructions for recreating SSW10 voices 4 | 5 | 6 | ## Tools 7 | 8 | 9 | ### DCTTS code 10 | ``` 11 | TOOLDIR=/disk/scratch/script_project/ssw10/tools/ 12 | mkdir -p $TOOLDIR 13 | cd $TOOLDIR 14 | git clone https://github.com/oliverwatts/dc_tts_osw.git dc_tts_osw_A 15 | cd dc_tts_osw_A 16 | ``` 17 | 18 | ### Festival 19 | 20 | ``` 21 | INSTALL_DIR=$TOOLDIR/festival 22 | mkdir -p $INSTALL_DIR 23 | cd $INSTALL_DIR 24 | 25 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz 26 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz 27 | 28 | tar xvf festival-2.4-release.tar.gz 29 | tar xvf speech_tools-2.4-release.tar.gz 30 | 31 | cd speech_tools 32 | ./configure --prefix=$INSTALL_DIR 33 | gmake 34 | 35 | cd ../festival 36 | ./configure --prefix=$INSTALL_DIR 37 | gmake 38 | 39 | cd $INSTALL_DIR 40 | 41 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz 42 | tar xvf festvox_cmu_us_awb_cg.tar.gz 43 | 44 | ## Get lexicons for the english voice: 45 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz 46 | tar xvf festlex_CMU.tar.gz 47 | 48 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz 49 | tar xvf festlex_POSLEX.tar.gz 50 | 51 | ## test 52 | cd $INSTALL_DIR/festival/bin 53 | 54 | ## run the *locally installed* festival (NB: initial ./ is important!) 
55 | ./festival 56 | 57 | festival> (voice_cmu_us_awb_cg) 58 | festival> (utt.save.wave (SayText "If i'm speaking then installation actually went ok.") "test.wav" 'riff) 59 | ``` 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | ## Data 68 | 69 | 70 | ### Download LJSpeech 71 | 72 | ``` 73 | DATADIR=/disk/scratch/script_project/ssw10/data 74 | mkdir -p $DATADIR 75 | 76 | cd $DATADIR 77 | wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 78 | bunzip2 LJSpeech-1.1.tar.bz2 79 | tar xvf LJSpeech-1.1.tar 80 | rm LJSpeech-1.1.tar 81 | ``` 82 | 83 | 84 | ### Phonetise the transcription with Festival + CMU lexicon 85 | 86 | ``` 87 | CODEDIR=/disk/scratch/oliver/dc_tts_osw/ 88 | cd $CODEDIR 89 | python ./script/festival/csv2scm.py -i $DATADIR/LJSpeech-1.1/metadata.csv -o $DATADIR/LJSpeech-1.1/utts.data 90 | 91 | cd $DATADIR/LJSpeech-1.1/ 92 | FEST=$TOOLDIR/festival/festival/bin/festival 93 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_cmulex.scm 94 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp2.csv 95 | 96 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp2.csv > ./transcript_cmu.csv 97 | ``` 98 | 99 | -------------------------------------------------------------------------------- /configuration.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | import os 4 | import imp 5 | import inspect 6 | 7 | 8 | CONFIG_DEFAULTS = [ 9 | ('initialise_weights_from_existing', [], ''), 10 | ('update_weights', [], ''), 11 | ('num_threads', 8, 'how many threads get_batch should use to build training batches of data (default: 8)'), 12 | ('plot_attention_every_n_epochs', 0, 'set to 0 if you do not wish to plot attention matrices'), 13 | ('num_sentences_to_plot_attention', 0, 'number of sentences to plot attention matrices for'), 14 | ('concatenate_query', True, 'Concatenate [R Q] to get audio decoder input, or just take R?'), 15 | ('use_external_durations', False, 'Use externally supplied durations in 6th field of transcript for fixed attention matrix A'), 16 | ('text_encoder_type', 'DCTTS_standard', 'one of DCTTS_standard/none/minimal_feedforward'), 17 | ('merlin_label_dir', '', 'npy format phone labels converted from merlin using process_merlin_label.py'), 18 | ('merlin_lab_dim', 592, ''), 19 | ('bucket_data_by', 'text_length', 'One of audio_length/text_length.
Label length will be used if merlin_label_dir is set and bucket_data_by=="text_length"'), 20 | ('history_type', 'DCTTS_standard', 'DCTTS_standard/fractional_position_in_phone/absolute_position_in_phone/minimal_history'), 21 | ('beta1', 0.9, 'ADAM setting - default value from original dctts repo'), 22 | ('beta2', 0.999, 'ADAM setting - default value from original dctts repo'), 23 | ('epsilon', 0.00000001 , 'ADAM setting - default value from original dctts repo'), 24 | ('decay_lr', True , 'learning rate decay - default value from original dctts repo'), 25 | ('squash_output_t2m', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'), 26 | ('squash_output_ssrn', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'), 27 | ('store_synth_features', False, 'store .npy file of features alongside output .wav file'), 28 | ('turn_off_monotonic_for_synthesis',False, 'turns off FIA mechanism for synthesis, should be False during training'), 29 | ('lw_cdp',0.0,''), 30 | ('lw_ain',0.0,''), 31 | ('lw_aout',0.0,''), 32 | ('attention_guide_fa',False,'use attention guide as target - MSE attention loss'), 33 | ('select_central',False,'use only centre phones from Merlin labels'), 34 | ('MerlinTextEncWithPhoneEmbedding',False,'use Merlin labels and phone embeddings as input of TextEncoder') 35 | ] 36 | 37 | ## Intended to have hp as a module, but this doesn't allow pickling and therefore 38 | ## use in parallel processing. So, convert module into an object of arbitrary type 39 | ## ("Hyperparams") having same attributes: 40 | class Hyperparams(object): 41 | def __init__(self, module_object): 42 | for (key, value) in module_object.__dict__.items(): 43 | if key.startswith('_'): 44 | continue 45 | if inspect.ismodule(value): # e.g. from os imported at top of config 46 | continue 47 | setattr(self, key, module_object.__dict__[key]) 48 | def validate(self): 49 | ''' 50 | Supply defaults for various things of appropriate type if missing -- 51 | TODO: Currently this is just to supply values for variables added later in development. 52 | Should we have some filling in of defaults like this more permanently, or should 53 | everything be explicitly set in a config file?
54 | ''' 55 | for (varname, default_value, help_string) in CONFIG_DEFAULTS: 56 | if not hasattr(self, varname): 57 | setattr(self, varname, default_value) 58 | 59 | 60 | def load_config(config_fname): 61 | config = os.path.abspath(config_fname) 62 | assert os.path.isfile(config), 'Config file %s does not exist'%(config) 63 | settings = imp.load_source('config', config) 64 | hp = Hyperparams(settings) 65 | hp.validate() 66 | return hp 67 | 68 | 69 | 70 | ### https://stackoverflow.com/questions/1325673/how-to-add-property-to-a-class-dynamically 71 | 72 | # class atdict(dict): 73 | # __getattr__= dict.__getitem__ 74 | # __setattr__= dict.__setitem__ 75 | # __delattr__= dict.__delitem__ 76 | -------------------------------------------------------------------------------- /convert_alignment_to_guide.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import math 4 | import glob 5 | import numpy as np 6 | import matplotlib 7 | import matplotlib.pyplot as plt 8 | from scipy.ndimage.filters import gaussian_filter 9 | from libutil import save_floats_as_8bit 10 | import tqdm 11 | from concurrent.futures import ProcessPoolExecutor 12 | 13 | gD = 0.2 14 | gW = 0.1 15 | 16 | DEBUG = False 17 | 18 | def main(file_name,out_file): 19 | 20 | F = np.load(file_name) 21 | F = np.transpose(F) 22 | 23 | ndim, tdim = F.shape # x: encoder (N) / y: decoder (T) 24 | 25 | ## Convert alignment to attention guide 26 | if DEBUG: 27 | D = np.zeros((ndim, tdim), dtype=np.float32) # diagonal guide 28 | W = np.zeros((ndim, tdim), dtype=np.float32) # alignment based guide 29 | 30 | for n_pos in range(ndim): 31 | for t_pos in range(tdim): 32 | 33 | n_pos_new = np.argmax(F[:,t_pos]) # encoder position the forced alignment assigns to this decoder frame 34 | W[n_pos,t_pos] = 1 - np.exp( -(n_pos / float(ndim) - n_pos_new / float(ndim) ) ** 2 / (2 * gW * gW)) 35 | 36 | if DEBUG: 37 | D[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(tdim) - n_pos / float(ndim)) ** 2 / (2 * gD * gD)) 38 | 39 | ## Smooth the alignment based guide 40 | S = gaussian_filter(W, sigma=1) # trying to blur 41 | # needs min max norm here to make sure 0-1 42 | S = ( S - S.min()) / ( S.max() - S.min() ) 43 | 44 | save_floats_as_8bit(S, out_file) 45 | 46 | if DEBUG: 47 | 48 | D = ( D - D.min()) / ( D.max() - D.min() ) 49 | W = ( W - W.min()) / ( W.max() - W.min() ) 50 | 51 | for plot_type in range(0,3): 52 | 53 | ## Visualization 54 | if plot_type==0: 55 | M = F+D # add forced alignment path to help visualisation 56 | elif plot_type == 1: 57 | M = F+W # add forced alignment path to help visualisation 58 | elif plot_type == 2: 59 | M = F+S # add forced alignment path to help visualisation 60 | 61 | fig, ax = plt.subplots() 62 | im = ax.imshow(M) 63 | # plt.title('Diagonal (top), Alignment based (middle), Alignment based smoothed (bottom)', fontsize=8) 64 | fig.colorbar(im,fraction=0.035, pad=0.04) 65 | plt.ylabel('Encoder timestep', fontsize=12) 66 | plt.xlabel('Decoder timestep', fontsize=12) 67 | 68 | if plot_type==0: 69 | plt.title('Diagonal attention guide', fontsize=12) 70 | plt.savefig('attention_guide_diagonal.pdf') 71 | elif plot_type == 1: 72 | plt.title('Forced alignment based attention guide', fontsize=12) 73 | plt.savefig('attention_guide_fa.pdf') 74 | elif plot_type == 2: 75 | plt.title('Forced alignment based attention guide (smoothed)', fontsize=12) 76 | plt.savefig('attention_guide_fa_smooth.pdf') 77 | 78 | plt.show() 79 | 80 | if __name__ == '__main__': 81 | 82 | # Usage: python convert_alignment_to_guide.py fa_matrix_dir fa_guide_dir (directories of input/output .npy files) 83 | 84 | inputdir
= sys.argv[1] 85 | outputdir = sys.argv[2] 86 | ncores = 5 87 | executor = ProcessPoolExecutor(max_workers=ncores) 88 | futures = [] 89 | for file in glob.iglob(inputdir + '/*.npy'): 90 | outfile = outputdir + '/' + file.split('/')[-1] # add separator so outputdir need not end in '/' 91 | futures.append(executor.submit(main, file, outfile)) 92 | 93 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 94 | 95 | -------------------------------------------------------------------------------- /copy_synth_GL.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import glob 8 | 9 | import numpy as np 10 | from utils import spectrogram2wav 11 | from data_load import load_data 12 | import soundfile 13 | from tqdm import tqdm 14 | from configuration import load_config 15 | 16 | from argparse import ArgumentParser 17 | 18 | from libutil import basename, safe_makedir 19 | 20 | def copy_synth_GL(hp, outdir): 21 | 22 | safe_makedir(outdir) 23 | 24 | dataset = load_data(hp, mode="synthesis") 25 | fnames, texts = dataset['fpaths'], dataset['texts'] 26 | bases = [basename(fname) for fname in fnames] 27 | 28 | for base in bases: 29 | print("Working on file %s"%(base)) 30 | mag = np.load(os.path.join(hp.full_audio_dir, base + '.npy')) 31 | wav = spectrogram2wav(hp, mag) 32 | soundfile.write(outdir + "/%s.wav"%(base), wav, hp.sr) 33 | 34 | def main_work(): 35 | 36 | ################################################# 37 | 38 | # ============= Process command line ============ 39 | 40 | a = ArgumentParser() 41 | a.add_argument('-c', dest='config', required=True, type=str) 42 | a.add_argument('-o', dest='outdir', required=True, type=str) 43 | opts = a.parse_args() 44 | 45 | # =============================================== 46 | 47 | hp = load_config(opts.config) 48 | copy_synth_GL(hp, opts.outdir) 49 | 50 | if __name__=="__main__": 51 | 52 | main_work() 53 | -------------------------------------------------------------------------------- /copy_synth_SSRN_GL.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import glob 8 | 9 | import numpy as np 10 | from utils import spectrogram2wav 11 | from data_load import load_data 12 | import soundfile 13 | from tqdm import tqdm 14 | from configuration import load_config 15 | 16 | from argparse import ArgumentParser 17 | 18 | from libutil import basename, safe_makedir 19 | from synthesize import synth_mel2mag, list2batch, restore_latest_model_parameters 20 | from architectures import SSRNGraph 21 | import tensorflow as tf 22 | 23 | def copy_synth_SSRN_GL(hp, outdir): 24 | 25 | safe_makedir(outdir) 26 | 27 | dataset = load_data(hp, mode="synthesis") 28 | fnames, texts = dataset['fpaths'], dataset['texts'] 29 | bases = [basename(fname) for fname in fnames] 30 | mels = [np.load(os.path.join(hp.coarse_audio_dir, base + '.npy')) for base in bases] 31 | lengths = [a.shape[0] for a in mels] 32 | mels = list2batch(mels, 0) 33 | 34 | g = SSRNGraph(hp, mode="synthesize"); print("Graph (ssrn) loaded") 35 | 36 | with tf.Session() as sess: 37 | sess.run(tf.global_variables_initializer()) 38 | ssrn_epoch = restore_latest_model_parameters(sess, hp, 'ssrn') 39 | 40 | print('Run SSRN...') 41 | Z = synth_mel2mag(hp, mels, g, sess) 42 | 43 | for i, mag in enumerate(Z): 44 | print("Working on %s"%(bases[i])) 45 | mag =
mag[:lengths[i]*hp.r,:] ### trim to generated length 46 | wav = spectrogram2wav(hp, mag) 47 | soundfile.write(outdir + "/%s.wav"%(bases[i]), wav, hp.sr) # use bases[i]: 'base' is not defined in this scope 48 | 49 | 50 | 51 | 52 | def main_work(): 53 | 54 | ################################################# 55 | 56 | # ============= Process command line ============ 57 | 58 | a = ArgumentParser() 59 | a.add_argument('-c', dest='config', required=True, type=str) 60 | a.add_argument('-o', dest='outdir', required=True, type=str) 61 | opts = a.parse_args() 62 | 63 | # =============================================== 64 | 65 | hp = load_config(opts.config) 66 | copy_synth_SSRN_GL(hp, opts.outdir) 67 | 68 | if __name__=="__main__": 69 | 70 | main_work() 71 | -------------------------------------------------------------------------------- /doc/festival_install.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Installing Festival 4 | 5 | ## Basic Festival install of Spanish and Scottish voices 6 | ``` 7 | INSTALL_DIR=/afs/some/convenient/directory/festival 8 | 9 | mkdir -p $INSTALL_DIR 10 | cd $INSTALL_DIR 11 | 12 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz 13 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz 14 | 15 | tar xvf festival-2.4-release.tar.gz 16 | tar xvf speech_tools-2.4-release.tar.gz 17 | 18 | cd speech_tools 19 | ./configure --prefix=$INSTALL_DIR 20 | gmake 21 | 22 | cd ../festival 23 | ./configure --prefix=$INSTALL_DIR 24 | gmake 25 | 26 | cd $INSTALL_DIR 27 | 28 | ## Get spanish voice: 29 | wget http://festvox.org/packed/festival/1.4.1/festvox_ellpc11k.tar.gz 30 | tar xvf festvox_ellpc11k.tar.gz 31 | 32 | ## Get scottish english voice: 33 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz 34 | tar xvf festvox_cmu_us_awb_cg.tar.gz 35 | 36 | ## Get lexicons for the english voice: 37 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz 38 | tar xvf festlex_CMU.tar.gz 39 | 40 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz 41 | tar xvf festlex_POSLEX.tar.gz 42 | 43 | ## test 44 | cd $INSTALL_DIR/festival/bin 45 | 46 | ## run the *locally installed* festival (NB: initial ./ is important!)
47 | ./festival 48 | festival> (voice_el_diphone ) 49 | festival> (SayText "La rica salsa canaria se llama mojo pic'on.") 50 | 51 | festival> (voice_cmu_us_awb_cg) 52 | festival> (SayText "If i'm speaking then installation actually went ok.") 53 | 54 | 55 | ## synthesise to file: 56 | 57 | (utt.save.wave (SynthText "La rica salsa canaria se llama mojo pic'on.") "/path/to/output/file.wav" 'riff) 58 | ``` 59 | 60 | 61 | ## Combilex installation 62 | 63 | Given the file `combilex.tar` which contains 3 combilex surface form lexicons (gam, rpx, edi), install like this: 64 | 65 | ``` 66 | cp combilex.tar $INSTALL_DIR/festival/ 67 | cd $INSTALL_DIR/festival/ 68 | tar xvf combilex.tar 69 | ``` 70 | 71 | For processing the Nancy data, which contains a French word with a nasalised vowel present in the lexicon but not the phoneset definition, I needed to edit `$INSTALL_DIR/festival/lib/combilex_phones.scm` and add the line: 72 | 73 | ``` 74 | (o~ + l 2 3 + n 0 0) ;; added missing nasalised vowel 75 | ``` 76 | 77 | after the line: 78 | 79 | ``` 80 | (@U + d 2 3 + 0 0 0) ;ou 81 | ``` 82 | 83 | 84 | ## Cleaning up 85 | 86 | ``` 87 | cd $INSTALL_DIR 88 | rm *.tar.gz 89 | 90 | cd $INSTALL_DIR/festival 91 | rm *.tar 92 | ``` 93 | 94 | 95 | ## Note for UoE users 96 | 97 | If the installation is run through SSH, make sure you are on an *actual* machine and not just hare or bruegel, as these are just gateways and won't have a C compiler installed. -------------------------------------------------------------------------------- /doc/recipe_WaveRNN.md: -------------------------------------------------------------------------------- 1 | 2 | ## DCTTS + WaveRNN 3 | 4 | To generate DCTTS samples using WaveRNN, set the following flag in your config file: 5 | ``` 6 | store_synth_features = True 7 | ``` 8 | and run the normal DCTTS synthesis script: 9 | ``` 10 | cd ophelia 11 | dctts_synth_dir='/tmp/dctts_synth_dir/' 12 | ./util/submit_tf.sh synthesize.py -c config/lj_tutorial.cfg -N 5 -odir ${dctts_synth_dir} 13 | ``` 14 | 15 | This saves the generated magnitude files (.npy) and Griffin-Lim wavefiles in the directory dctts_synth_dir. 16 | 17 | To generate WaveRNN wavefiles from these magnitude files: 18 | ``` 19 | cd Tacotron 20 | wavernn_synth_dir='/tmp/wavernn_synth_dir/' 21 | python synthesize_dctts_wavernn.py -i ${dctts_synth_dir} -o ${wavernn_synth_dir} 22 | ``` 23 | 24 | ## Notes on Tacotron+WaveRNN code installation 25 | 26 | ``` 27 | git clone https://github.com/cassiavb/Tacotron.git 28 | cd Tacotron/ 29 | virtualenv --distribute --python=/usr/bin/python3.6 env 30 | source env/bin/activate 31 | pip install --upgrade pip 32 | pip install torch torchvision 33 | pip install -r requirements.txt 34 | pip install numba==0.48 35 | ``` 36 | 37 | -------------------------------------------------------------------------------- /doc/recipe_nancy.md: -------------------------------------------------------------------------------- 1 | 2 | # Blizzard Nancy corpus preparation and voice building 3 | 4 | For the base voice for ICPhS, in the first instance.
5 | 6 | 7 | ``` 8 | ### get the (publicly downloadable) data from CSTR datawww:- 9 | mkdir /group/project/cstr2/owatts/data/nancy/ 10 | cd /group/project/cstr2/owatts/data/nancy/ 11 | mkdir original 12 | 13 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/wavn.tgz ./original/ 14 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.ssil.zip ./original/ 15 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.zip ./original/ 16 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/prompts.data ./original/ 17 | 18 | ### compare checksums with published ones: 19 | md5sum ./original/* 20 | 0a4860a69bca56d7e9f8170306ff3709 ./original/lab.ssil.zip 21 | aeae7916d881a8eef255a6fe05e77e77 ./original/lab.zip 22 | 650b44f7252aed564d190b76a98cb490 ./original/prompts.data 23 | bb2a80dd1423f87ba12d2074af8e7a3f ./original/wavn.tgz 24 | 25 | ### ...all OK. 26 | 27 | cd ./original 28 | tar xvf wavn.tgz 29 | unzip lab.ssil.zip 30 | unzip lab.zip 31 | 32 | rm *.zip 33 | rm *.tgz 34 | 35 | ### information about the data:- 36 | http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/: 37 | 38 | * prompts.data - File with all of the prompt texts in filename order. 39 | * corrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as originally recorded, after some labels produced by the Lessac Front-end were corrected to reflect what the voice actor actually said. 40 | * uncorrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as produced by the Lessac Front-end from the prompts.data file without correction to the labels for what the voice actor actually said. 41 | * lab.ssil.zip - Zipped set of files with Lesseme labels that include the result of automated segmentation of the Lesseme labels in the corrected.gui file before the ssil label is collapsed into the preceding or following label. 
42 | * lab.zip 43 | 44 | cd /disk/scratch/oliver/dc_tts_osw_clean 45 | mkdir /group/project/cstr2/owatts/data/nancy/derived/ 46 | python ./script/normalise_level.py -i /group/project/cstr2/owatts/data/nancy/original/wavn/ \ 47 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ -ncores 25 48 | 49 | ./util/submit_tf_cpu.sh ./script/split_speech.py \ 50 | -w /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ \ 51 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_trim/ -dB 30 -ncores 25 -trimonly 52 | 53 | rm -r /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ 54 | 55 | 56 | 57 | ## transcription (needed to add o~ to combilex rpx phoneset in Festival):- 58 | 59 | ## use existing scheme format transcript:- 60 | cp /group/project/cstr2/owatts/data/nancy/original/prompts.data /group/project/cstr2/owatts/data/nancy/derived/utts.data 61 | cd /group/project/cstr2/owatts/data/nancy/derived/ 62 | 63 | CODEDIR=/disk/scratch/oliver/dc_tts_osw_clean 64 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival 65 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm 66 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 67 | 68 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 69 | 70 | 71 | 72 | ### get phone list to add to config: 73 | python ./script/check_transcript.py -i /group/project/cstr2/owatts/data/nancy/derived/transcript.csv -phone 74 | 75 | 76 | 77 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy_01.cfg -ncores 25 78 | 79 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy_01.cfg -ncores 25 80 | 81 | 82 | ## train 83 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m t2m 84 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m ssrn 85 | ``` -------------------------------------------------------------------------------- /doc/recipe_nancy2nick.md: -------------------------------------------------------------------------------- 1 | 2 | # Very naive speaker adaptation to convert Nancy to Nick 3 | 4 | The simplest way to train on a small database is to fine tune 5 | a speaker-dependent voice to the new database. This works 6 | surprisingly well even where the base voice is of a different 7 | sex and accent to the target speaker, as this example shows.
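The warm-starting used in this recipe is driven by a handful of hyperparameters whose defaults live in `CONFIG_DEFAULTS` in `configuration.py` (`initialise_weights_from_existing` and `update_weights` both default to `[]`; `restart_from_savepath` appears in the configs). The fragment below is only a sketch of the general shape of a fine-tuning config -- the checkpoint path is a hypothetical placeholder and the exact semantics of `update_weights` are assumed here, so check `nancy2nick_01.cfg` and `nancy2nick_02.cfg` for the real settings:

```
## Hypothetical fine-tuning fragment (illustrative only -- see the real nancy2nick configs):
initialise_weights_from_existing = ['/path/to/work/nancy_01/train-t2m/']  ## assumed: base-voice checkpoint(s) to load before training starts
update_weights = []  ## default [] from configuration.py; presumed to restrict updates to the named weights, freezing the rest (e.g. the text encoder)
restart_from_savepath = []  ## leave empty when warm-starting from another voice rather than resuming training
```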
8 | 9 | 10 | ## Prepare nick data 11 | 12 | 13 | 14 | We will use a version of the nick data which has been downsampled to 15 | 16kHz with sox: 16 | 17 | ``` 18 | /afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/ 19 | ``` 20 | 21 | It was converted from the 48kHz version here: 22 | 23 | ``` 24 | /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/wav/ 25 | ``` 26 | 27 | ### waveforms 28 | ``` 29 | INDIR=/afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/ 30 | OUTDIR=/group/project/cstr2/owatts/data/nick16k/ 31 | 32 | python ./script/normalise_level.py -i $INDIR -o $OUTDIR/wav_norm/ -ncores 25 33 | 34 | ./util/submit_tf_cpu.sh ./script/split_speech.py -w $OUTDIR/wav_norm/ -o $OUTDIR/wav_trim/ -dB 30 -ncores 25 -trimonly 35 | ``` 36 | 37 | ### transcript 38 | 39 | Gather texts: 40 | 41 | ``` 42 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/herald_* ; do 43 | BASE=`basename $FNAME .txt` ; 44 | TEXT=`cat $FNAME` ; 45 | echo "${BASE}||${TEXT}" ; 46 | done > /group/project/cstr2/owatts/data/nick16k/metadata.csv 47 | 48 | 49 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/hvd_* ; do 50 | BASE=`basename $FNAME .txt` ; 51 | TEXT=`cat $FNAME` ; 52 | echo "${BASE}||${TEXT}" ; 53 | done > /group/project/cstr2/owatts/data/nick16k/metadata_hvd.csv 54 | 55 | 56 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/mrt_* ; do 57 | BASE=`basename $FNAME .txt` ; 58 | TEXT=`cat $FNAME` ; 59 | echo "${BASE}||${TEXT}" ; 60 | done > /group/project/cstr2/owatts/data/nick16k/metadata_mrt.csv 61 | ``` 62 | 63 | Phonetise: 64 | 65 | ``` 66 | CODEDIR=`pwd` 67 | DATADIR=/group/project/cstr2/owatts/data/nick16k/ 68 | python ./script/festival/csv2scm.py -i $DATADIR/metadata.csv -o $DATADIR/utts.data 69 | 70 | cd $DATADIR/ 71 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival 72 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm 73 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 74 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 75 | 76 | 77 | cd $CODEDIR 78 | for TESTSET in mrt hvd ; do 79 | mkdir /group/project/cstr2/owatts/data/nick16k/test_${TESTSET} 80 | python ./script/festival/csv2scm.py -i $DATADIR/metadata_${TESTSET}.csv -o $DATADIR/test_${TESTSET}/utts.data 81 | done 82 | 83 | 84 | for TESTSET in mrt hvd ; do 85 | cd /group/project/cstr2/owatts/data/nick16k/test_${TESTSET} 86 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 87 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 88 | cp transcript.csv ../transcript_${TESTSET}.csv 89 | done 90 | 91 | ``` 92 | 93 | 94 | 95 | 96 | ### features 97 | ``` 98 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy2nick_01.cfg -ncores 15 99 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy2nick_01.cfg -ncores 15 100 | ``` 101 | 102 | 103 | ## training 104 | 105 | Config `nancy2nick_01` updates all weights pretrained on the Nancy data: 106 | 107 | ``` 108 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m t2m 109 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m ssrn 110 | ``` 111 | 112 | Config `nancy2nick_02` updates all weights pretrained on the Nancy data, except 113 |
those of textencoder which are kept frozen: 114 | 115 | ``` 116 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_02.cfg -m t2m 117 | ``` 118 | 119 | Previously-trained SSRN can be softlinked without retraining: 120 | ``` 121 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancy2nick_02/train-ssrn/ 122 | ``` 123 | 124 | ## Synthesis 125 | ``` 126 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_01.cfg 127 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_02.cfg 128 | ``` 129 | 130 | -------------------------------------------------------------------------------- /doc/recipe_nancyplusnick.md: -------------------------------------------------------------------------------- 1 | 2 | # Train on Nancy + Nick 3 | 4 | 5 | 6 | ## Combine Nancy & Nick data already used for nancy_01 and nancy2nick_* 7 | 8 | 9 | Combine transcripts, adding speaker codes: 10 | ``` 11 | DATADIR=/group/project/cstr2/owatts/data/nick_plus_nancy 12 | mkdir $DATADIR 13 | 14 | grep -v ^$ /group/project/cstr2/owatts/data/nick16k/transcript.csv | awk '{print $0"|nick"}' > $DATADIR/transcript.csv 15 | grep -v ^$ /group/project/cstr2/owatts/data/nancy/derived/transcript.csv | awk '{print $0"|nancy"}' | grep -v NYT00 >> $DATADIR/transcript.csv 16 | 17 | # (remove empty lines and NYT00 section for which attention guides were not made) 18 | 19 | cp /group/project/cstr2/owatts/data/nick16k/transcript_{mrt,hvd}.csv $DATADIR 20 | ``` 21 | 22 | Combine acoustic features and attention guides: 23 | 24 | ``` 25 | mkdir -p work/nancyplusnick_01/data/{attention_guides,full_mels,mels,mags}/ 26 | for SUBDIR in attention_guides full_mels mels mags ; do 27 | for VOICE in nancy2nick_01 nancy_01 ; do 28 | ln -s ${PWD}/work/$VOICE/data/$SUBDIR/* work/nancyplusnick_01/data/$SUBDIR/ ; 29 | done 30 | done 31 | ``` 32 | 33 | 34 | 35 | Prepare config file and train: 36 | 37 | ``` 38 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_01.cfg -m t2m 39 | ``` 40 | 41 | Previously-trained SSRN can be softlinked without retraining: 42 | ``` 43 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancyplusnick_01/train-ssrn/ 44 | ``` 45 | 46 | 47 | Synth 48 | 49 | ``` 50 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nick 51 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nancy 52 | ``` 53 | 54 | 55 | 56 | 57 | Fine tune on nick only: 58 | 59 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_02.cfg -m t2m 60 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_02/train-ssrn/ 61 | 62 | 63 | 64 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_02.cfg -N 10 -speaker nick 65 | 66 | 67 | 68 | set attention loss weight very high: 69 | 70 | 71 | ``` 72 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_03.cfg -m t2m 73 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_03/train-ssrn/ 74 | 75 | 76 | 77 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_03.cfg -N 10 -speaker nick 78 | 79 | 80 | ``` 81 | 82 | 83 | 84 | Try learning channel contributions for each speaker: 85 | 86 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_04_lcc.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_04_lcc/train-ssrn/ 87 | 88 | 89 | 90 | 91 | 92 | Try learning channel contributions for each speaker + SD-phone embeddings: 93 | 94 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_05_lcc_sdpe.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ 
./work/nancyplusnick_05_lcc_sdpe/train-ssrn/ 95 | 96 | 97 | -------------------------------------------------------------------------------- /doc/recipe_project.md: -------------------------------------------------------------------------------- 1 | ## Required tools 2 | 3 | ``` 4 | git clone -b project https://github.com/CSTR-Edinburgh/ophelia.git 5 | git clone https://github.com/AvashnaGovender/Merlyn.git 6 | git clone https://github.com/AvashnaGovender/Tacotron.git 7 | ``` 8 | 9 | ## DCTTS + WaveRNN 10 | 11 | See recipe [here](recipe_WaveRNN.md). 12 | 13 | ## Attention experiments 14 | 15 | ### Obtaining forced alignment labels: 16 | 17 | Step 5a - Run forced alignment in: 18 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md 19 | 20 | ### FA as target 21 | 22 | Convert forced alignment labels to the forced alignment matrix: 23 | 24 | Step 6 - Get durations and create guides: 25 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md 26 | 27 | To use FA as target in DCTTS see config file: 28 | [fa_as_target.cfg](../config/project/fa_as_target.cfg) 29 | 30 | ### FA as guides 31 | 32 | Create attention guides from the forced alignment matrices: 33 | 34 | ``` 35 | cd ophelia/ 36 | python convert_alignment_to_guide.py fa_matrix_dir fa_guide_dir 37 | ``` 38 | 39 | To use FA as guide in DCTTS see config file: 40 | [fa_as_guide.cfg](../config/project/fa_as_guide.cfg) 41 | 42 | ### FA as attention 43 | 44 | Add phone-level durations to transcript.csv using the forced alignment matrices: 45 | 46 | ``` 47 | cd ophelia/ 48 | python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file 49 | ``` 50 | 51 | To use FA as attention in DCTTS see config file: 52 | [fa_as_attention.cfg](../config/project/fa_as_attention.cfg) 53 | 54 | ## Text Encoder experiments 55 | 56 | ### Labels -/+ TE 57 | 58 | Convert state labels to 416-dimensional normalised label features (needs state labels and question file): 59 | 60 | ``` 61 | cd Merlyn/ 62 | python scripts/prepare_inputs.py 63 | ``` 64 | 65 | To use Labels-TE in DCTTS see config file: 66 | [labels_minus_te.cfg](../config/project/labels_minus_te.cfg) 67 | 68 | To use Labels+TE in DCTTS see config file: 69 | [labels_plus_te.cfg](../config/project/labels_plus_te.cfg) 70 | 71 | To use C-Labels+TE in DCTTS see config file: 72 | [c-labels_plus_te.cfg](../config/project/c-labels_plus_te.cfg) 73 | 74 | ### PE&Labels + TE 75 | 76 | Create a new transcription file with the phoneme sequence taken from labels, by replacing the phone sequence of the transcript file with the phone sequence from HTS-style labels: 77 | ``` 78 | cd ophelia/ 79 | ./labels2tra.sh labels_dir transcript_file new_transcript_file 80 | ``` 81 | 82 | To use PE&Labels+TE set MerlinTextEncWithPhoneEmbedding to True in the config file. 83 | 84 | ## Gross error detection experiments 85 | 86 | To calculate CDP, Ain and Aout: 87 | ``` 88 | cd ophelia/ 89 | python calculate_CDP_Ain_Aout.py attention_matrix.npy 90 | ``` 91 | 92 | ## FIA experiments 93 | 94 | To generate without FIA (forcibly incremental attention) set turn_off_monotonic_for_synthesis to True in the config file.
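For reference, the FIA switch is an ordinary config-file assignment; the flag name, its default of `False`, and the warning about training come straight from `CONFIG_DEFAULTS` in `configuration.py`:

```
turn_off_monotonic_for_synthesis = True  ## default False; disables the FIA mechanism at synthesis time -- should stay False during training
```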
95 | 96 | ## Tacotron experiments 97 | 98 | See readme in Tacotron repository: https://github.com/AvashnaGovender/Tacotron 99 | -------------------------------------------------------------------------------- /doc/recipe_semisupervised.md: -------------------------------------------------------------------------------- 1 | ## Semisupervised training 2 | 3 | ### Babbler training 4 | 5 | Train 'babbler' (300 epochs only): 6 | 7 | ``` 8 | ./util/submit_tf.sh ./train.py -c ./config/lj_03.cfg -m babbler 9 | ``` 10 | 11 | Note that this wasn't implemented when I trained the voice before - hope it works OK: 12 | 13 | ``` 14 | bucket_data_by = 'audio_length' 15 | ``` 16 | 17 | In any case, text in transcript is ignored when training babbler. 18 | 19 | Copy existing SSRN: 20 | 21 | ``` 22 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_03/ 23 | ``` 24 | 25 | Synthesise by babbling: 26 | 27 | ``` 28 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_03.cfg -babble 29 | ``` 30 | 31 | (Note: all sentences are currently seeded with the same start (all zeros from padding) so all babbled outputs will be identical) 32 | 33 | 34 | ### Fine tuning 35 | 36 | Fine tune with text as conventional model on transcribed subset (1000 sentences) of the data: 37 | 38 | ``` 39 | ./util/submit_tf.sh ./train.py -c ./config/lj_05.cfg -m t2m 40 | 41 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_05/ 42 | 43 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_05.cfg -N 10 44 | ``` 45 | 46 | ### Baseline 47 | 48 | Compare training from scratch on 1000 sentences: 49 | 50 | ``` 51 | ./util/submit_tf.sh ./train.py -c ./config/lj_04.cfg -m t2m 52 | 53 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_04/ 54 | 55 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_04.cfg -N 10 56 | 57 | ``` 58 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /fig/aaa: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /fig/attention.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/attention.gif -------------------------------------------------------------------------------- /fig/training_curves.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/training_curves.png -------------------------------------------------------------------------------- /labels2tra.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # (bash is required for the ${newps:4:-4} substring expansion used below) 3 | # Replace phone sequence of transcript file with phone sequence from HTS style labels 4 | # Usage: ./labels2tra.sh labels_dir transcript_file new_transcript_file 5 | 6 | labelsdir=$1 7 | trafile=$2 8 | newtrafile=$3 9 | 10 | cat $trafile | while IFS=$'|' read -r file nada text ps 11 | do 12 | 13 | grep -r "\[2\]" $labelsdir/$file.lab | sed 's/\+.*//' | sed 's/.*-//' > ~/tmp/test.txt 14 | 15 | newps=`cat ~/tmp/test.txt | tr '\n' ' '` 16 | 17 | # To print start and end 18 | echo $file"||"$text"|<_START_> "${newps:4:-4}"<_END_>" 19 | 20 | done >
$newtrafile 21 | -------------------------------------------------------------------------------- /libutil.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | 4 | 5 | import os 6 | import re 7 | import numpy as np 8 | import codecs 9 | 10 | 11 | 12 | def safe_makedir(dir): 13 | if not os.path.isdir(dir): 14 | os.makedirs(dir) 15 | 16 | def writelist(seq, fname): 17 | path, _ = os.path.split(fname) 18 | safe_makedir(path) 19 | f = codecs.open(fname, 'w', encoding='utf8') 20 | f.write('\n'.join(seq) + '\n') 21 | f.close() 22 | 23 | def readlist(fname): 24 | f = codecs.open(fname, 'r', encoding='utf8') 25 | data = f.readlines() 26 | f.close() 27 | data = [line.strip('\n') for line in data] 28 | data = [l for l in data if l != ''] 29 | return data 30 | 31 | def read_norm_data(fname, stream_names): 32 | out = {} 33 | vals = np.loadtxt(fname) 34 | mean_ix = 0 35 | for stream in stream_names: 36 | std_ix = mean_ix + 1 37 | out[stream] = (vals[mean_ix], vals[std_ix]) 38 | mean_ix += 2 39 | return out 40 | 41 | def makedirecs(direcs): 42 | for direc in direcs: 43 | if not os.path.isdir(direc): 44 | os.makedirs(direc) 45 | 46 | def basename(fname): 47 | path, name = os.path.split(fname) 48 | base = re.sub('\.[^\.]+\Z','',name) 49 | return base 50 | 51 | get_basename = basename # alias 52 | def get_speech(infile, dimension): 53 | f = open(infile, 'rb') 54 | speech = np.fromfile(f, dtype=np.float32) 55 | f.close() 56 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension) 57 | speech = speech.reshape((-1, dimension)) 58 | return speech 59 | 60 | def put_speech(m_data, filename): 61 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 62 | fid = open(filename, 'wb') 63 | m_data.tofile(fid) 64 | fid.close() 65 | return 66 | 67 | def save_floats_as_8bit(data, fname): 68 | ''' 69 | Lossily store data in range [0, 1] with 8 bit resolution 70 | ''' 71 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max()) 72 | 73 | maxval = np.iinfo(np.uint8).max 74 | data_scaled = (data * maxval).astype(np.uint8) 75 | np.save(fname, data_scaled) 76 | 77 | def read_floats_from_8bit(fname): 78 | maxval = np.iinfo(np.uint8).max 79 | data = (np.load(fname).astype(np.float32)) / maxval 80 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max()) 81 | return data 82 | 83 | -------------------------------------------------------------------------------- /logger_setup.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import sys 4 | import subprocess 5 | import socket 6 | import numpy 7 | import tensorflow 8 | from libutil import safe_makedir 9 | 10 | def logger_setup(logdir): 11 | 12 | safe_makedir(logdir) 13 | 14 | ## Get new unique named logfile for each run: 15 | i = 1 16 | while True: 17 | logfile = os.path.join(logdir, 'log_{:06d}.txt'.format(i)) 18 | if not os.path.isfile(logfile): 19 | break 20 | else: 21 | i += 1 22 | 23 | logger = logging.getLogger() 24 | logger.setLevel(logging.DEBUG) 25 | 26 | formatter = logging.Formatter('%(asctime)s | %(threadName)-3.3s | %(levelname)-1.1s | %(message)s') 27 | 28 | fh = logging.FileHandler(logfile) 29 | fh.setLevel(logging.DEBUG) 30 | fh.setFormatter(formatter) 31 | logger.addHandler(fh) 32 | 33 | ch = logging.StreamHandler() 34 | ch.setLevel(logging.DEBUG) 35 | ch.setFormatter(formatter) 36 | logger.addHandler(ch) 
37 | 38 | logger.info('Set up logger to write to console and %s'%(logfile)) 39 | 40 | log_environment_information(logger, logfile) 41 | 42 | 43 | def log_environment_information(logger, logfile): 44 | ### This function's contents adjusted from Merlin (https://github.com/CSTR-Edinburgh/merlin/blob/master/src/run_merlin.py) 45 | ### TODO: other things to log here? 46 | logger.info('Installation information:') 47 | logger.info(' Code directory: '+os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), os.pardir))) 48 | logger.info(' PATH:') 49 | env_PATHs = os.getenv('PATH') 50 | if env_PATHs: 51 | env_PATHs = env_PATHs.split(':') 52 | for p in env_PATHs: 53 | if len(p)>0: logger.info(' '+p) 54 | logger.info(' LD_LIBRARY_PATH:') 55 | env_LD_LIBRARY_PATHs = os.getenv('LD_LIBRARY_PATH') 56 | if env_LD_LIBRARY_PATHs: 57 | env_LD_LIBRARY_PATHs = env_LD_LIBRARY_PATHs.split(':') 58 | for p in env_LD_LIBRARY_PATHs: 59 | if len(p)>0: logger.info(' '+p) 60 | logger.info(' Python version: '+sys.version.replace('\n','')) 61 | logger.info(' PYTHONPATH:') 62 | env_PYTHONPATHs = os.getenv('PYTHONPATH') 63 | if env_PYTHONPATHs: 64 | env_PYTHONPATHs = env_PYTHONPATHs.split(':') 65 | for p in env_PYTHONPATHs: 66 | if len(p)>0: 67 | logger.info(' '+p) 68 | logger.info(' Numpy version: '+numpy.version.version) 69 | logger.info(' Tensorflow version: '+tensorflow.__version__) 70 | #logger.info(' THEANO_FLAGS: '+os.getenv('THEANO_FLAGS')) 71 | #logger.info(' device: '+theano.config.device) 72 | 73 | # Check for the presence of git 74 | ret = os.system('git status > /dev/null') 75 | if ret==0: 76 | logger.info(' Git is available in the working directory:') 77 | git_describe = subprocess.Popen(['git', 'describe', '--tags', '--always'], stdout=subprocess.PIPE).communicate()[0][:-1] 78 | logger.info(' DC_TTS_OSW version: {}'.format(git_describe)) 79 | git_branch = subprocess.Popen(['git', 'rev-parse', '--abbrev-ref', 'HEAD'], stdout=subprocess.PIPE).communicate()[0][:-1] 80 | logger.info(' branch: {}'.format(git_branch)) 81 | git_diff = subprocess.Popen(['git', 'diff', '--name-status'], stdout=subprocess.PIPE).communicate()[0] 82 | if sys.version_info.major >= 3: 83 | git_diff = git_diff.decode('utf-8') 84 | git_diff = git_diff.replace('\t',' ').split('\n') 85 | logger.info(' diff to DC_TTS_OSW version:') 86 | for filediff in git_diff: 87 | if len(filediff)>0: logger.info(' '+filediff) 88 | logger.info(' (all diffs logged in '+os.path.basename(logfile)+'.gitdiff'+')') 89 | os.system('git diff > '+logfile+'.gitdiff') 90 | 91 | logger.info('Execution information:') 92 | logger.info(' HOSTNAME: '+socket.getfqdn()) 93 | logger.info(' USER: '+os.getenv('USER')) 94 | logger.info(' PID: '+str(os.getpid())) 95 | PBS_JOBID = os.getenv('PBS_JOBID') 96 | if PBS_JOBID: 97 | logger.info(' PBS_JOBID: '+PBS_JOBID) 98 | 99 | -------------------------------------------------------------------------------- /objective_measures.py: -------------------------------------------------------------------------------- 1 | 2 | ''' 3 | TODO: logSpecDbDist appropriate? (both mels & mags?) 4 | TODO: compute output length error? 5 | TODO: work out best way of handling the fact that predicted *coarse* features 6 | can correspond to text but be arbitrarily 'out of phase' with reference. 7 | Multiple references? Or compare against full-time resolution reference?
8 | ''' 9 | import logging 10 | from mcd import dtw 11 | import mcd.metrics_fast as mt 12 | 13 | def compute_dtw_error(reference, predictions): 14 | minCostTot = 0.0 15 | framesTot = 0 16 | for (nat, synth) in zip(reference, predictions): 17 | nat, synth = nat.astype('float64'), synth.astype('float64') 18 | minCost, path = dtw.dtw(nat, synth, mt.logSpecDbDist) 19 | frames = len(nat) 20 | minCostTot += minCost 21 | framesTot += frames 22 | mean_score = minCostTot / framesTot 23 | print ('overall LSD = %f (%s frames nat/synth)' % (mean_score, framesTot)) 24 | return mean_score 25 | 26 | def compute_simple_LSD(reference_list, prediction_list): 27 | costTot = 0.0 28 | framesTot = 0 29 | for (synth, nat) in zip(prediction_list, reference_list): 30 | #synth = prediction_tensor[i,:,:].astype('float64') 31 | # len_nat = len(nat) 32 | assert len(synth) == len(nat) 33 | #synth = synth[:len_nat, :] 34 | nat = nat.astype('float64') 35 | synth = synth.astype('float64') 36 | cost = sum([ 37 | mt.logSpecDbDist(natFrame, synthFrame) 38 | for natFrame, synthFrame in zip(nat, synth) 39 | ]) 40 | framesTot += len(nat) 41 | costTot += cost 42 | return costTot / framesTot 43 | 44 | 45 | -------------------------------------------------------------------------------- /plot_loss.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | import fileinput 11 | from argparse import ArgumentParser 12 | 13 | from libutil import readlist 14 | import numpy as np 15 | import pylab as pl 16 | def main_work(): 17 | 18 | ################################################# 19 | 20 | # ======== Get stuff from command line ========== 21 | 22 | a = ArgumentParser() 23 | a.add_argument('-o', dest='outfile', required=True) 24 | a.add_argument('-l', dest='logfile', required=True) 25 | opts = a.parse_args() 26 | 27 | # =============================================== 28 | 29 | log = readlist(opts.logfile) 30 | log = [line.split('|') for line in log] 31 | log = [line[3].strip() for line in log if len(line) >=4] 32 | 33 | #validation = [line.replace('validation epoch ', '') for line in log if line.startswith('validation epoch')] 34 | #train = [line.replace('train epoch ', '') for line in log if line.startswith('validation epoch')] 35 | 36 | validation = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('validation epoch')] 37 | train = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('train epoch')] 38 | validation = np.array(validation, dtype=float) 39 | train = np.array(train, dtype=float) 40 | print train.shape 41 | print validation.shape 42 | 43 | pl.subplot(211) 44 | pl.plot(validation.flatten()) 45 | pl.subplot(212) 46 | pl.plot(train[:,:4]) 47 | pl.show() 48 | if __name__=="__main__": 49 | 50 | main_work() 51 | 52 | -------------------------------------------------------------------------------- /prepare_acoustic_features.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #! 
/usr/bin/env python2 3 | ''' 4 | Based on code by kyubyong park at https://www.github.com/kyubyong/dc_tts 5 | ''' 6 | 7 | from __future__ import print_function 8 | 9 | import os 10 | import sys 11 | import glob 12 | from argparse import ArgumentParser 13 | from concurrent.futures import ProcessPoolExecutor 14 | import numpy as np 15 | 16 | import tqdm 17 | 18 | from libutil import safe_makedir 19 | from configuration import load_config 20 | from utils import load_spectrograms 21 | 22 | def proc(fpath, hp): 23 | 24 | if not os.path.isfile(fpath): 25 | return 26 | 27 | fname, mel, mag, full_mel = load_spectrograms(hp, fpath) 28 | np.save("{}/{}".format(hp.coarse_audio_dir, fname.replace("wav", "npy")), mel) 29 | np.save("{}/{}".format(hp.full_audio_dir, fname.replace("wav", "npy")), mag) 30 | np.save("{}/{}".format(hp.full_mel_dir, fname.replace("wav", "npy")), full_mel) 31 | 32 | 33 | def main_work(): 34 | 35 | ################################################# 36 | 37 | # ============= Process command line ============ 38 | 39 | a = ArgumentParser() 40 | a.add_argument('-c', dest='config', required=True, type=str) 41 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing') 42 | opts = a.parse_args() 43 | 44 | # =============================================== 45 | 46 | hp = load_config(opts.config) 47 | 48 | fpaths = sorted(glob.glob(hp.waveforms + '/*.wav')) 49 | 50 | safe_makedir(hp.coarse_audio_dir) 51 | safe_makedir(hp.full_audio_dir) 52 | safe_makedir(hp.full_mel_dir) 53 | 54 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 55 | futures = [] 56 | for fpath in fpaths: 57 | futures.append(executor.submit( 58 | proc, fpath, hp)) 59 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 60 | 61 | 62 | if __name__=="__main__": 63 | 64 | main_work() 65 | -------------------------------------------------------------------------------- /prepare_attention_guides.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | 4 | from __future__ import print_function 5 | 6 | from utils import get_attention_guide 7 | import os 8 | from data_load import load_data 9 | import numpy as np 10 | import tqdm 11 | from concurrent.futures import ProcessPoolExecutor 12 | 13 | from argparse import ArgumentParser 14 | 15 | from libutil import basename, save_floats_as_8bit, safe_makedir 16 | from configuration import load_config 17 | 18 | def proc(fpath, text_length, hp): 19 | 20 | base = basename(fpath) 21 | melfile = hp.coarse_audio_dir + os.path.sep + base + '.npy' 22 | attfile = hp.attention_guide_dir + os.path.sep + base # without '.npy' 23 | if not os.path.isfile(melfile): 24 | print('file %s not found'%(melfile)) 25 | return 26 | speech_length = np.load(melfile).shape[0] 27 | att = get_attention_guide(text_length, speech_length, g=hp.g) 28 | save_floats_as_8bit(att, attfile) 29 | 30 | 31 | def main_work(): 32 | 33 | ################################################# 34 | 35 | # ============= Process command line ============ 36 | 37 | a = ArgumentParser() 38 | a.add_argument('-c', dest='config', required=True, type=str) 39 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing') 40 | opts = a.parse_args() 41 | 42 | # =============================================== 43 | 44 | hp = load_config(opts.config) 45 | assert hp.attention_guide_dir 46 | 47 | dataset = load_data(hp) 48 | fpaths, text_lengths = dataset['fpaths'], 
dataset['text_lengths'] 49 | 50 | if hp.merlin_label_dir: 51 | text_lengths = dataset['label_lengths'] 52 | 53 | assert os.path.exists(hp.coarse_audio_dir) 54 | safe_makedir(hp.attention_guide_dir) 55 | 56 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 57 | futures = [] 58 | for (fpath, text_length) in zip(fpaths, text_lengths): 59 | futures.append(executor.submit(proc, fpath, text_length, hp)) 60 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 61 | 62 | 63 | if __name__=="__main__": 64 | 65 | main_work() 66 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.16.4 2 | absl-py==0.6.1 3 | astor==0.7.1 4 | audioread==2.1.6 5 | backports.functools-lru-cache==1.5 6 | backports.weakref==1.0.post1 7 | bashplotlib==0.6.5 8 | cffi==1.11.5 9 | cycler==0.10.0 10 | decorator==4.3.0 11 | enum34==1.1.6 12 | funcsigs==1.0.2 13 | futures==3.2.0 14 | gast==0.2.0 15 | grpcio==1.16.1 16 | h5py==2.8.0 17 | htk-io==0.5 18 | joblib==0.13.0 19 | Keras-Applications==1.0.6 20 | Keras-Preprocessing==1.0.5 21 | kiwisolver==1.0.1 22 | librosa==0.6.2 23 | llvmlite==0.25.0 24 | Markdown==3.0.1 25 | matplotlib==2.2.3 26 | mcd==0.4 27 | mock==2.0.0 28 | numba==0.40.1 29 | pbr==5.1.1 30 | protobuf==3.6.1 31 | pycparser==2.19 32 | pyparsing==2.3.0 33 | python-dateutil==2.7.5 34 | pytz==2018.7 35 | regex==2019.2.3 36 | resampy==0.2.1 37 | scikit-learn==0.20.0 38 | scipy==1.1.0 39 | singledispatch==3.4.0.3 40 | six==1.11.0 41 | SoundFile==0.10.2 42 | subprocess32==3.5.3 43 | tensorboard==1.12.0 44 | tensorflow-gpu==1.12.0 45 | termcolor==1.1.0 46 | tqdm==4.28.1 47 | Werkzeug==0.14.1 48 | -------------------------------------------------------------------------------- /script/add_speaker.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from argparse import ArgumentParser 4 | 5 | 6 | def main_work(): 7 | 8 | a = ArgumentParser() 9 | a.add_argument('-i', dest='infile', required=True) 10 | a.add_argument('-o', dest='outfile', required=True) 11 | opts = a.parse_args() 12 | 13 | outf = open(opts.outfile, 'w') 14 | transcript = open(opts.infile, 'r').read().split('\n') 15 | 16 | for line in transcript: 17 | spk = line.split('_')[0] 18 | outf.writelines(line+'||'+spk+'\n') 19 | outf.close() 20 | 21 | if __name__ == "__main__": 22 | main_work() 23 | -------------------------------------------------------------------------------- /script/festival/csv2scm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | import codecs 12 | 13 | 14 | def main_work(): 15 | 16 | ################################################# 17 | 18 | # ======== Get stuff from command line ========== 19 | 20 | a = ArgumentParser() 21 | a.add_argument('-i', dest='infile', required=True, \ 22 | help= "File in LJ speech transcription format: https://keithito.com/LJ-Speech-Dataset/") 23 | a.add_argument('-o', dest='outfile', required=True, \ 24 | help= "File in Festival utts.data scheme format") 25 | opts = a.parse_args() 26 | 27 | # =============================================== 28 | 29 | f = codecs.open(opts.infile, 'r', encoding='utf8') 30 | 
lines = f.readlines() 31 | f.close() 32 | 33 | f = codecs.open(opts.outfile, 'w', encoding='utf8') 34 | for line in lines: 35 | fields = line.strip('\n\r ').split('|') 36 | assert len(fields) >= 3 37 | name, _, text = fields[:3] 38 | text = text.replace('"', '\\"') 39 | f.write('(%s "%s")\n'%(name, text)) 40 | f.close() 41 | 42 | 43 | 44 | 45 | if __name__=="__main__": 46 | 47 | main_work() 48 | 49 | -------------------------------------------------------------------------------- /script/festival/fix_transcript.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | import sys, codecs, re, regex 9 | 10 | 11 | f = codecs.open(sys.argv[1], encoding='utf8', errors='ignore') 12 | text = f.read() 13 | f.close() 14 | 15 | lines = text.split('\n') 16 | 17 | 18 | all_seps = set() 19 | for line in lines: 20 | phones = line.split('|')[-1].strip('\n\r ').split(' ') 21 | seps = set([phone for phone in phones if phone.startswith('<') and phone.endswith('>')]) 22 | all_seps.update(seps) 23 | #print seps 24 | #print phones 25 | 26 | 27 | 28 | # print all_seps 29 | 30 | badseps = [] 31 | for sep in all_seps: 32 | 33 | if regex.match('\A[\p{P}\p{Z}]+\Z', sep.strip('<>')): 34 | #puncs.append(sep) 35 | pass 36 | elif sep in ["<'s>", '<_END_>', '<_START_>']: 37 | pass 38 | else: 39 | badseps.append(sep) 40 | 41 | for sep in badseps: 42 | text = text.replace(sep, '<>') 43 | 44 | 45 | 46 | 47 | 48 | #bad_strings = sys.argv[2:] 49 | 50 | 51 | # for bad in bad_strings: 52 | # lines = lines.replace('<'+bad+'>', '<>') 53 | 54 | print text.encode('utf8') -------------------------------------------------------------------------------- /script/festival/make_rich_phones.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? 
(item.name x) "#") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s|" (car line)) 78 | (format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | (if (not (member_string 'unilex-rpx (lex.list))) 89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm")))) 90 | 91 | ; (if (not (member_string 'cmudict (lex.list))) 92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm")))) 93 | 94 | 95 | ;(require 'unilex_phones) 96 | ;(lex.select 'unilex-rpx) 97 | 98 | 99 | 100 | 101 | (if (not (member_string 'combilex-rpx (lex.list))) 102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 103 | (lex.select 'combilex-rpx) 104 | (require 'postlex) 105 | (set! postlex_rules_hooks (list postlex_apos_s_check 106 | postlex_intervoc_r 107 | postlex_the_vs_thee 108 | postlex_a 109 | )) 110 | ; (set! postlex_rules_hooks (list 111 | ; postlex_intervoc_r 112 | ; )) 113 | 114 | 115 | 116 | ; (lex.select 'cmudict) 117 | 118 | 119 | 120 | ;(set! utt1 (Utterance Text "Hello there, world, isn't it a nice day?!")) 121 | ;(utt.synth utt1) 122 | ;(print_phones_punc utt1) 123 | 124 | (synth_utts "./utts.data") 125 | 126 | 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /script/festival/make_rich_phones_cmulex.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! 
this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? (item.name x) "pau") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s||" (car line)) 78 | ;(format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | 89 | 90 | 91 | 92 | ; (if (not (member_string 'combilex-rpx (lex.list))) 93 | ; (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 94 | (lex.select 'cmu) 95 | 96 | 97 | (synth_utts "./utts.data") 98 | 99 | 100 | 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /script/festival/make_rich_phones_combirpx_noplex.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! 
this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? (item.name x) "#") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s||" (car line)) 78 | ;(format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | (if (not (member_string 'unilex-rpx (lex.list))) 89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm")))) 90 | 91 | ; (if (not (member_string 'cmudict (lex.list))) 92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm")))) 93 | 94 | 95 | ;(require 'unilex_phones) 96 | ;(lex.select 'unilex-rpx) 97 | 98 | 99 | 100 | 101 | (if (not (member_string 'combilex-rpx (lex.list))) 102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 103 | (lex.select 'combilex-rpx) 104 | ; (require 'postlex) 105 | ; (set! postlex_rules_hooks (list postlex_apos_s_check 106 | ; postlex_intervoc_r 107 | ; postlex_the_vs_thee 108 | ; postlex_a 109 | ; )) 110 | ; (set! postlex_rules_hooks (list 111 | ; postlex_intervoc_r 112 | ; )) 113 | 114 | 115 | 116 | ; (lex.select 'cmudict) 117 | 118 | 119 | 120 | ;(set! 
utt1 (Utterance Text "Hello there, world, isn't it a nice day?!")) 121 | ;(utt.synth utt1) 122 | ;(print_phones_punc utt1) 123 | 124 | (synth_utts "./utts.data") 125 | 126 | 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /script/festival/multi_transcript.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from argparse import ArgumentParser 4 | 5 | 6 | def main_work(): 7 | 8 | a = ArgumentParser() 9 | a.add_argument('-i', dest='infile', required=True) 10 | a.add_argument('-o', dest='outfile', required=True) 11 | opts = a.parse_args() 12 | 13 | o = open(opts.outfile, 'w') 14 | 15 | with open(opts.infile, 'r') as f: 16 | for line in f.readlines()[:-1]: 17 | if line[3] == '_': # if clauses dealing with different length p-numbers (not really applicable for public VCTK) 18 | speaker_id = line[0:3] 19 | elif line[4] == '_': 20 | speaker_id = line[0:4] 21 | elif line[5] == '_': 22 | speaker_id = line[0:5] 23 | else: 24 | print('Something is wrong with the input file - speaker ID cannot be parsed!') 25 | o.write('{}|{}\n'.format(line.rstrip(), speaker_id)) 26 | 27 | o.close() 28 | 29 | if __name__ == "__main__": 30 | main_work() 31 | 32 | -------------------------------------------------------------------------------- /script/get_transcriptions.sh: -------------------------------------------------------------------------------- 1 | 2 | python ./script/festival/csv2scm.py -i $1 -o utts.data 3 | 4 | FEST='/afs/inf.ed.ac.uk/user/s15/s1520337/Documents/festival/festival/bin/festival' 5 | 6 | SCRIPT=./script/festival/make_rich_phones_cmulex.scm 7 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 8 | 9 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./new_transcript.csv 10 | -------------------------------------------------------------------------------- /script/libutil.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import numpy as np 4 | 5 | #### TODO -- this module is duplicated 1 level up -- sort this out! 
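## A minimal usage sketch for load_config() below -- the file name and key here are
## hypothetical examples; real .cfg files under ./config define many more settings.
## Configs are plain Python, exec'd into a dict with Python 2's execfile:
##
##     # example.cfg contains, say:  g = 0.2
##     cfg = load_config('example.cfg')
##     print cfg['config_name'], cfg['g']    # -> example 0.2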
6 | 7 | 8 | def load_config(config_fname): 9 | config = {} 10 | execfile(config_fname, config) 11 | del config['__builtins__'] 12 | _, config_name = os.path.split(config_fname) 13 | config_name = config_name.replace('.cfg','').replace('.conf','') 14 | config['config_name'] = config_name 15 | return config 16 | 17 | 18 | def safe_makedir(dir): 19 | if not os.path.isdir(dir): 20 | os.makedirs(dir) 21 | 22 | def writelist(seq, fname): 23 | f = open(fname, 'w') 24 | f.write('\n'.join(seq) + '\n') 25 | f.close() 26 | 27 | def readlist(fname): 28 | f = open(fname, 'r') 29 | data = f.readlines() 30 | f.close() 31 | return [line.strip('\n') for line in data] 32 | 33 | def read_norm_data(fname, stream_names): 34 | out = {} 35 | vals = np.loadtxt(fname) 36 | mean_ix = 0 37 | for stream in stream_names: 38 | std_ix = mean_ix + 1 39 | out[stream] = (vals[mean_ix], vals[std_ix]) 40 | mean_ix += 2 41 | return out 42 | 43 | 44 | def makedirecs(direcs): 45 | for direc in direcs: 46 | if not os.path.isdir(direc): 47 | os.makedirs(direc) 48 | 49 | def basename(fname): 50 | path, name = os.path.split(fname) 51 | base = re.sub('\.[^\.]+\Z','',name) 52 | return base 53 | 54 | get_basename = basename # alias 55 | def get_speech(infile, dimension): 56 | f = open(infile, 'rb') 57 | speech = np.fromfile(f, dtype=np.float32) 58 | f.close() 59 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension) 60 | speech = speech.reshape((-1, dimension)) 61 | return speech 62 | 63 | def put_speech(m_data, filename): 64 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 65 | fid = open(filename, 'wb') 66 | m_data.tofile(fid) 67 | fid.close() 68 | return -------------------------------------------------------------------------------- /script/make_internal_webchart.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys, os 8 | from string import strip 9 | from argparse import ArgumentParser 10 | 11 | def main_work(voice_dirs, names=[], pattern='', outfile='', title=''): 12 | 13 | #for v in voice_dirs: 14 | # print v 15 | #sys.exit('wvswv') 16 | #voice_dirs = opts.d 17 | voice_dirs = [string for string in voice_dirs if os.path.isdir(string)] 18 | 19 | if names: 20 | #names = opts.n 21 | if not len(names) == len(voice_dirs): 22 | print '------' 23 | for name in names: 24 | print name 25 | for v in voice_dirs: 26 | print v 27 | sys.exit('len(names) != len(voice_dirs)') 28 | else: 29 | names = [direc.strip('/').split('/')[-1] for direc in voice_dirs] 30 | print names 31 | 32 | # for i in range(0,len(inputs),2): 33 | # name = inputs[i] 34 | # voice_dir = inputs[i+1] 35 | # names.append(name) 36 | # voice_dirs.append(voice_dir) 37 | 38 | 39 | ################################################# 40 | 41 | ### only keep utts appearing in all conditions 42 | uttnames=[] 43 | all_utts = [] 44 | for voice_dir in voice_dirs: 45 | print voice_dir 46 | print os.listdir(voice_dir) 47 | print '-----' 48 | all_utts.extend(os.listdir(voice_dir)) 49 | 50 | for unique_utt in set(all_utts): 51 | if unique_utt.endswith('.wav'): 52 | if all_utts.count(unique_utt) == len(names): 53 | uttnames.append(unique_utt) 54 | 55 | 56 | if pattern: 57 | uttnames = [name for name in uttnames if pattern in name] 58 | 59 | # for voice_dir in voice_dirs: 60 | # for uttname in os.listdir(voice_dir): 61 | # if 
uttname not in uttnames: 62 | # uttnames.append(uttname) 63 | 64 | 65 | if len(uttnames) == 0: 66 | sys.exit('no utterances found in common!') 67 | 68 | 69 | output = '' 70 | 71 | 72 | if title: 73 | output += '
<h1>' + title + '</h1>\n' 74 | 75 | ## table top and toprow 76 | output += '<table border=1>\n' 77 | output += '<tr>\n' 78 | output += "<td>Condition</td>\n" 79 | output += '\n' 80 | for (name,voice_dir) in zip(names, voice_dirs): 81 | _, voice = os.path.split(voice_dir) 82 | #output += voice 83 | 84 | 85 | output += '<td>%s</td>\n'%(name) 86 | output += '</tr>\n' 87 | 88 | for uttname in sorted(uttnames): 89 | 90 | output += "<tr>\n" 91 | 92 | output += '<td>%s</td>\n'%(uttname.replace(".wav", "")) 93 | for voice_dir in voice_dirs: 94 | 95 | wavename=os.path.join(voice_dir, uttname) 96 | output += '<td>\n' 97 | output += get_audio_control(wavename) 98 | 99 | output += "</td>\n" 100 | output += '</tr>\n' 101 | output += '</table>\n<p>&nbsp;</p>\n' 102 | 103 | 104 | if outfile: 105 | f = open(outfile, 'w') 106 | f.write(output) 107 | f.close() 108 | else: 109 | print output 110 | 111 | 112 | 113 | 114 | def get_audio_control(fname): 115 | return '''<audio controls="controls"><source src="%s" type="audio/wav" /></audio>
\n'''%(fname) 116 | 117 | 118 | 119 | 120 | if __name__=="__main__": 121 | 122 | ################################################# 123 | 124 | # ======== Get stuff from command line ========== 125 | 126 | a = ArgumentParser() 127 | a.add_argument('-o', dest='outfile', default='', type=str, \ 128 | help= "If not given, print to console") 129 | a.add_argument('-d', nargs='+', required=True, help='list of directories with samples') 130 | a.add_argument('-n', nargs='+', required=False, help='list of names -- use directory names if not given') 131 | a.add_argument('-p', dest='pattern', default='', type=str) 132 | a.add_argument('-title', default='', type=str) 133 | 134 | opts = a.parse_args() 135 | 136 | 137 | # =============================================== 138 | 139 | main_work(opts.d, opts.n, opts.pattern, opts.outfile, opts.title) 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /script/munge_nsf_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | 12 | import soundfile as sf 13 | 14 | from concurrent.futures import ProcessPoolExecutor 15 | from functools import partial 16 | from tqdm import tqdm 17 | import subprocess 18 | 19 | import numpy as np 20 | #from generate import read_est_file 21 | 22 | 23 | ## TODO : copied from generate 24 | def read_est_file(est_file): 25 | 26 | with open(est_file) as fid: 27 | header_size = 1 # init 28 | for line in fid: 29 | if line == 'EST_Header_End\n': 30 | break 31 | header_size += 1 32 | ## now check there is at least 1 line beyond the header: 33 | status_ok = False 34 | for (i,line) in enumerate(fid): 35 | if i > header_size: 36 | status_ok = True 37 | if not status_ok: 38 | return np.array([]) 39 | 40 | # Read text: TODO: improve skiprows 41 | data = np.loadtxt(est_file, skiprows=header_size) 42 | data = np.atleast_2d(data) 43 | return data 44 | 45 | 46 | def main_work(): 47 | 48 | ################################################# 49 | 50 | # ======== Get stuff from command line ========== 51 | 52 | a = ArgumentParser() 53 | a.add_argument('-i', dest='indir', required=True) ## mels 54 | a.add_argument('-of', dest='outdir_f', required=True, \ 55 | help= "Put output f0 here: make it if it doesn't exist") 56 | a.add_argument('-om', dest='outdir_m', required=True, \ 57 | help= "Put output mels here: make it if it doesn't exist") 58 | a.add_argument('-f', dest='fzdir', required=True) 59 | # a.add_argument('-framerate', required=False, default=0.005, type=float, help='rate in seconds for F0 track frames') 60 | # a.add_argument('-pattern', default='', \ 61 | # help= "If given, only normalise files whose base contains this substring") 62 | a.add_argument('-ncores', default=1, type=int) 63 | #a.add_argument('-waveformat', default=False, action='store_true', help='call sox to format data (16 bit ).') 64 | 65 | # a.add_argument('-twopass', default=False, action='store_true', help='Run initially on a subset of data to guess sensible limits, then run again. 
Assumes all data is from same speaker.') 66 | opts = a.parse_args() 67 | 68 | # =============================================== 69 | 70 | for direc in [opts.outdir_f, opts.outdir_m]: 71 | if not os.path.isdir(direc): 72 | os.makedirs(direc) 73 | 74 | flist = sorted(glob.glob(opts.indir + '/*.npy')) 75 | 76 | print flist 77 | # print 'Extract with range %s %s'%(min_f0, max_f0) 78 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 79 | futures = [] 80 | for mel_file in flist: 81 | futures.append(executor.submit( 82 | partial(process, mel_file, opts.fzdir, opts.outdir_f, opts.outdir_m))) 83 | return [future.result() for future in tqdm(futures)] 84 | 85 | 86 | def put_speech(m_data, filename): 87 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 88 | fid = open(filename, 'wb') 89 | m_data.tofile(fid) 90 | fid.close() 91 | return 92 | 93 | 94 | 95 | def process(mel_file, fzdir, outdir_f, outdir_m): 96 | _, base = os.path.split(mel_file) 97 | base = base.replace('.npy', '') 98 | 99 | mels = np.load(mel_file) 100 | fz_file = os.path.join(fzdir, base + '.f0') 101 | 102 | fz = read_est_file(fz_file)[:,2] # .reshape(-1,1) 103 | 104 | m,_ = mels.shape 105 | f = fz.shape[0] 106 | 107 | 108 | fz[fz<0.0] = 0.0 109 | 110 | if m > f: 111 | diff = m - f 112 | fz = np.pad(fz, (0,diff), 'constant').reshape(-1,1) 113 | 114 | put_speech(fz, os.path.join(outdir_f, base+'.f0')) 115 | put_speech(mels, os.path.join(outdir_m, base+'.mfbsp')) 116 | 117 | # print fz.shape 118 | # print mels.shape 119 | # print fz 120 | 121 | 122 | 123 | 124 | 125 | # out_file = os.path.join(outdir, base + '.pm') 126 | 127 | # in_wav_file = os.path.join(fzdir, base + '_tmp.wav') ### !!!!! 128 | # cmd = 'sox %s -r 16000 %s '%(wavefile, in_wav_file) 129 | # subprocess.call(cmd, shell=True) 130 | 131 | # out_fz_file = os.path.join(fzdir, base + '.f0') 132 | # cmd = _reaper_bin + " -s -e %s -x %s -m %s -a -u 0.005 -i %s -p %s -f %s >/dev/null" % (framerate, max_f0, min_f0, in_wav_file, out_est_file, out_fz_file) 133 | # subprocess.call(cmd, shell=True) 134 | 135 | 136 | 137 | 138 | 139 | if __name__=="__main__": 140 | 141 | main_work() 142 | 143 | -------------------------------------------------------------------------------- /script/normalise_level.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | 12 | import soundfile as sf 13 | 14 | from concurrent.futures import ProcessPoolExecutor 15 | from functools import partial 16 | from tqdm import tqdm 17 | 18 | 19 | HERE = os.path.realpath(os.path.abspath(os.path.dirname(__file__))) 20 | sv56 = HERE + '/../tool/bin/sv56demo' 21 | 22 | if not os.path.isfile(sv56): 23 | 24 | ## Check required executables are available: 25 | 26 | from distutils.spawn import find_executable 27 | 28 | required_executables = ['sv56demo'] 29 | 30 | for executable in required_executables: 31 | if not find_executable(executable): 32 | sys.exit('%s command line tool must be on system path '%(executable)) 33 | 34 | sv56 = 'sv56demo' 35 | 36 | 37 | def main_work(): 38 | 39 | ################################################# 40 | 41 | # ======== Get stuff from command line ========== 42 | 43 | a = ArgumentParser() 44 | a.add_argument('-i', dest='indir', required=True) 45 | a.add_argument('-o', dest='outdir', 
required=True, \ 46 | help= "Put output here: make it if it doesn't exist" 47 | a.add_argument('-pattern', default='', \ 48 | help= "If given, only normalise files whose base contains this substring") 49 | a.add_argument('-ncores', default=1, type=int) 50 | opts = a.parse_args() 51 | 52 | # =============================================== 53 | 54 | for direc in [opts.outdir]: 55 | if not os.path.isdir(direc): 56 | os.makedirs(direc) 57 | 58 | flist = sorted(glob.glob(opts.indir + '/*.wav')) 59 | 60 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 61 | futures = [] 62 | for wave_file in flist: 63 | futures.append(executor.submit( 64 | partial(process, wave_file, opts.outdir, pattern=opts.pattern))) 65 | return [future.result() for future in tqdm(futures)] 66 | 67 | 68 | 69 | 70 | 71 | 72 | def process(wavefile, outdir, pattern=''): 73 | _, base = os.path.split(wavefile) 74 | 75 | if pattern: 76 | if pattern not in base: 77 | return 78 | 79 | # print base 80 | 81 | raw_in = os.path.join(outdir, base.replace('.wav','.raw')) 82 | raw_out = os.path.join(outdir, base.replace('.wav','_norm.raw')) 83 | logfile = os.path.join(outdir, base.replace('.wav','.log')) 84 | wav_out = os.path.join(outdir, base) 85 | 86 | data, samplerate = sf.read(wavefile, dtype='int16') 87 | sf.write(raw_in, data, samplerate, subtype='PCM_16') 88 | os.system('%s -log %s -q -lev -26.0 -sf %s %s %s'%(sv56, logfile, samplerate, raw_in, raw_out)) 89 | norm_data, samplerate = sf.read(raw_out, dtype='int16', samplerate=samplerate, channels=1, subtype='PCM_16') 90 | sf.write(wav_out, norm_data, samplerate) 91 | 92 | os.system('rm %s %s'%(raw_in, raw_out)) 93 | 94 | 95 | 96 | 97 | if __name__=="__main__": 98 | 99 | main_work() 100 | 101 | -------------------------------------------------------------------------------- /script/process_merlin_label.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2019 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys 8 | import os 9 | import glob 10 | from argparse import ArgumentParser 11 | 12 | from libutil import get_speech, basename, safe_makedir 13 | from scipy.signal import argrelextrema 14 | import numpy as np 15 | import matplotlib as mpl 16 | mpl.use('PDF') 17 | import pylab as pl 18 | 19 | def merlin_state_label_to_phone(labfile): 20 | labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks 21 | starts = labels[:,0].astype(int)[::5].reshape(-1,1) 22 | ends = labels[:,1].astype(int)[4::5].reshape(-1,1) 23 | fc = labels[:,2][::5] 24 | fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1) 25 | phone_label = np.hstack([starts, ends, fc]) 26 | return phone_label 27 | 28 | 29 | def minmax_norm(X, data_min, data_max): 30 | data_range = data_max - data_min 31 | data_range[data_range<=0.0] = 1.0 32 | mini, maxi = 0.01, 0.99 # ## merlin's default desired range 33 | X_std = (X - data_min) / data_range 34 | X_scaled = X_std * (maxi - mini) + mini 35 | return X_scaled 36 | 37 | 38 | def process_merlin_label(bin_label_fname, text_lab_dir, phonedim=416, subphonedim=9): 39 | 40 | text_label = os.path.join(text_lab_dir, basename(bin_label_fname) + '.lab') 41 | assert os.path.isfile(text_label), 'No text file for %s '%(basename(bin_label_fname)) 42 | 43 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim) 44 | 45 | ## fraction through phone (forwards) 46 | fraction_through_phone_forwards = labfrombin[:,-1] 47 | 48 | ## This is a surprisingly noisy signal which never seems to start at 0.0! Find minima:- 49 | (minima, ) = argrelextrema(fraction_through_phone_forwards, np.less) 50 | 51 | ## first frame is always a start: 52 | minima = np.insert(minima, 0, 0) 53 | 54 | ## check size against text file: 55 | labfromtext = merlin_state_label_to_phone(text_label) 56 | assert labfromtext.shape[0] == minima.shape[0] 57 | 58 | lab = labfrombin[minima,:-subphonedim] ## discard frame level feats, and take first frame of each phone 59 | 60 | return lab 61 | 62 | 63 | 64 | def main_work(): 65 | 66 | ################################################# 67 | 68 | # ============= Process command line ============ 69 | 70 | a = ArgumentParser() 71 | 72 | a.add_argument('-b', dest='binlabdir', required=True) 73 | a.add_argument('-t', dest='text_lab_dir', required=True) 74 | a.add_argument('-n', dest='norm_info_fname', required=True) 75 | a.add_argument('-o', dest='outdir', required=True) 76 | a.add_argument('-binext', dest='binext', required=False, default='lab') 77 | a.add_argument('-skipterminals', action='store_true', default=False) 78 | 79 | 80 | opts = a.parse_args() 81 | 82 | # =============================================== 83 | 84 | safe_makedir(opts.outdir) 85 | 86 | norm_info = get_speech(opts.norm_info_fname, 425)[:,:-9] 87 | data_min = norm_info[0,:] 88 | data_max = norm_info[1,:] 89 | data_range = data_max - data_min 90 | 91 | text_label_files = set([basename(f) for f in glob.glob(opts.text_lab_dir + '/*.lab')]) 92 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.' + opts.binext) ) 93 | print binary_label_files 94 | for binlab in binary_label_files: 95 | base = basename(binlab) 96 | if base not in text_label_files: 97 | continue 98 | print base 99 | lab = process_merlin_label(binlab, opts.text_lab_dir) 100 | if opts.skipterminals: 101 | lab = lab[1:-1,:] ## NB: don't remove the last 2 as in durations, as the final punct doesn't feature here 102 | norm_lab = minmax_norm(lab, data_min, data_max) 103 | 104 | if 0: ## piano roll style plot: 105 | pl.imshow(norm_lab, interpolation='nearest') 106 | pl.gray() 107 | pl.savefig('/afs/inf.ed.ac.uk/user/o/owatts/temp/fig.pdf') 108 | sys.exit('abckdubv') 109 | 110 | np.save(opts.outdir + '/' + base, norm_lab) 111 | 112 | 113 | if __name__=="__main__": 114 | main_work() 115 | 116 | -------------------------------------------------------------------------------- /script/process_merlin_label_positions.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2019 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys 8 | import os 9 | import glob 10 | from argparse import ArgumentParser 11 | 12 | from libutil import get_speech, basename, safe_makedir 13 | from scipy.signal import argrelextrema 14 | import numpy as np 15 | import matplotlib as mpl 16 | mpl.use('PDF') 17 | import pylab as pl 18 | 19 | from scipy import interpolate 20 | 21 | 22 | # def merlin_state_label_to_phone(labfile): 23 | # labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks 24 | # starts = labels[:,0].astype(int)[::5].reshape(-1,1) 25 | # ends = labels[:,1].astype(int)[4::5].reshape(-1,1) 26 | # fc = labels[:,2][::5] 27 | # fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1) 28 | # phone_label = np.hstack([starts, ends, fc]) 29 | # return phone_label 30 | 31 | 32 | def minmax_norm(X, data_min, data_max): 33 | data_range = data_max - data_min 34 | data_range[data_range<=0.0] = 1.0 35 | mini, maxi = 0.01, 0.99 # ## merlin's default desired range 36 | X_std = (X - data_min) / data_range 37 | X_scaled = X_std * (maxi - mini) + mini 38 | return X_scaled 39 | 40 | 41 | def process_merlin_positions(bin_label_fname, audio_dir, phonedim=416, subphonedim=9, \ 42 | inrate=5.0, outrate=12.5): 43 | 44 | audio_fname = os.path.join(audio_dir, basename(bin_label_fname) + '.npy') 45 | assert os.path.isfile(audio_fname), 'No audio file for %s '%(basename(bin_label_fname)) 46 | audio = np.load(audio_fname) 47 | 48 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim) 49 | 50 | positions = labfrombin[:,-subphonedim:] 51 | 52 | nframes, dim = positions.shape 53 | assert dim==9 54 | 55 | new_nframes, _ = audio.shape 56 | 57 | old_x = np.linspace((inrate/2.0), nframes*inrate, nframes, endpoint=False) ## place points at frame centres 58 | 59 | f = interpolate.interp1d(old_x, positions, axis=0, kind='nearest', bounds_error=False, fill_value='extrapolate') ## nearest to avoid weird averaging effects near segment boundaries 60 | 61 | new_x = np.linspace((outrate/2.0), new_nframes*outrate, new_nframes, endpoint=False) 62 | new_positions = f(new_x) 63 | 64 | return new_positions 65 | 66 | 67 | 68 | def main_work(): 69 | 70 | ################################################# 71 | 72 | # ============= Process command line ============ 73 | 74 | a = ArgumentParser() 75 | 76 | a.add_argument('-b', dest='binlabdir', required=True) 77 | a.add_argument('-f', dest='audio_dir', required=True) 78 | a.add_argument('-n', dest='norm_info_fname', required=True) 79 | a.add_argument('-o', dest='outdir', required=True) 80 | a.add_argument('-binext', dest='binext', required=False, default='lab') 81 | 82 | a.add_argument('-ir', dest='inrate', type=float, default=5.0) 83 | a.add_argument('-or', dest='outrate', type=float, default=12.5) 84 | 85 | opts = a.parse_args() 86 | 87 | # =============================================== 88 | 89 | safe_makedir(opts.outdir) 90 | 91 | norm_info = get_speech(opts.norm_info_fname, 425)[:,-9:] 92 | data_min = norm_info[0,:] 93 | data_max = norm_info[1,:] 94 | data_range = data_max - data_min 95 | 96 | audio_files = set([basename(f) for f in glob.glob(opts.audio_dir + '/*.npy')]) 97 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.'
+ opts.binext) ) 98 | 99 | for binlab in binary_label_files: 100 | base = basename(binlab) 101 | if base not in audio_files: 102 | continue 103 | print base 104 | positions = process_merlin_positions(binlab, opts.audio_dir, inrate=opts.inrate, outrate=opts.outrate) 105 | norm_positions = minmax_norm(positions, data_min, data_max) 106 | 107 | np.save(opts.outdir + '/' + base, norm_positions) 108 | 109 | 110 | if __name__=="__main__": 111 | main_work() 112 | 113 | -------------------------------------------------------------------------------- /script/remove_end_silences.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | import fileinput, sys 5 | 6 | infile = sys.argv[1] 7 | 8 | i = 0 9 | for line in fileinput.input(infile): 10 | name, t1, t2, phones, speaker, durs = line.strip(' \n').split('|') # [-1] 11 | durs = durs.split(' ') 12 | assert durs[0] == '24' 13 | assert durs[-2] == '24' 14 | assert durs[-1] == '0' 15 | durs[0] = '0' 16 | durs[-2] = '0' 17 | durs[-1] = '0' 18 | # if i==5: 19 | # break 20 | durs = ' '.join(durs) 21 | line2 = '|'.join([name, t1, t2, phones, speaker, durs]) 22 | print line2 23 | i += 1 24 | # print i 25 | -------------------------------------------------------------------------------- /script/split_speech.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - June 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | import fileinput 11 | from argparse import ArgumentParser 12 | 13 | 14 | 15 | import librosa 16 | from concurrent.futures import ProcessPoolExecutor 17 | from functools import partial 18 | import numpy as np 19 | import soundfile 20 | from libutil import safe_makedir, get_basename 21 | from tqdm import tqdm 22 | 23 | # librosa.effects.split(y, top_db=60, ref=, frame_length=2048, hop_length=512)[source] 24 | 25 | # intervals[i] == (start_i, end_i) are the start and end time (in samples) of non-silent interval i. 
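# A minimal sketch of how these two calls fit together (librosa 0.6 API, as pinned in
# requirements.txt; 'example.wav' is a hypothetical input):
#
#     import librosa, soundfile
#     wav, fs = soundfile.read('example.wav')                   # mono float array
#     _, (start, end) = librosa.effects.trim(wav, top_db=30)    # locate leading/trailing silence
#     for i, (s, e) in enumerate(librosa.effects.split(wav, top_db=30)):
#         soundfile.write('chunk_%d.wav' % i, wav[s:e], fs)     # one file per non-silent interval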
26 | 27 | 28 | 29 | 30 | def main_work(): 31 | 32 | ################################################# 33 | 34 | # ======== Get stuff from command line ========== 35 | 36 | a = ArgumentParser() 37 | a = ArgumentParser() 38 | a.add_argument('-w', dest='wave_dir', required=True) 39 | a.add_argument('-o', dest='output_dir', required=True) 40 | a.add_argument('-N', dest='nfiles', type=int, default=0) 41 | a.add_argument('-l', dest='wavlist_file', default=None) 42 | a.add_argument('-dB', dest='top_db', default=30, type=int) 43 | a.add_argument('-trimonly', action='store_true', default=False) 44 | a.add_argument('-ncores', type=int, default=0) 45 | a.add_argument('-endpad', type=float, default=0.3) 46 | opts = a.parse_args() 47 | 48 | # =============================================== 49 | 50 | trim_waves_in_directory(opts.wave_dir, opts.output_dir, num_workers=opts.ncores, \ 51 | tqdm=tqdm, nfiles=opts.nfiles, top_db=opts.top_db, trimonly=opts.trimonly, endpad=opts.endpad) 52 | 53 | def trim_waves_in_directory(in_dir, out_dir, num_workers=1, tqdm=lambda x: x, \ 54 | nfiles=0, top_db=30, trimonly=False, endpad=0.3): 55 | safe_makedir(out_dir) 56 | wave_files = sorted(glob.glob(in_dir + '/*.wav')) 57 | if nfiles > 0: 58 | wave_files = wave_files[:min(nfiles, len(wave_files))] 59 | 60 | if num_workers: 61 | executor = ProcessPoolExecutor(max_workers=num_workers) 62 | futures = [] 63 | for (index, wave_file) in enumerate(wave_files): 64 | futures.append(executor.submit( 65 | partial(_process_utterance, wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad))) 66 | return [future.result() for future in tqdm(futures)] 67 | else: ## serial processing 68 | for wave_file in tqdm(wave_files): 69 | _process_utterance(wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad) 70 | 71 | 72 | 73 | def _process_utterance(wav_path, out_dir, top_db=30, end_pad_sec=0.3, pad_sec=0.01, minimum_duration_sec=0.5, trimonly=False): 74 | 75 | wav, fs = soundfile.read(wav_path) ## TODO: assert mono 76 | 77 | if len(wav) > 0: 78 | 79 | pad = int(pad_sec * fs) 80 | end_pad = int(end_pad_sec * fs) 81 | # print pad 82 | base = get_basename(wav_path) 83 | # print base 84 | _, (start, end) = librosa.effects.trim(wav, top_db=top_db) 85 | start = max(0, (start - end_pad)) 86 | end = min(len(wav), (end + end_pad)) 87 | 88 | if start < end: 89 | wav = wav[start:end] 90 | 91 | if trimonly: 92 | ofile = os.path.join(out_dir, base + '.wav') 93 | soundfile.write(ofile, wav, fs) 94 | else: 95 | starts_ends = librosa.effects.split(wav, top_db=top_db) 96 | starts_ends[:,0] -= pad 97 | starts_ends[:,1] += pad 98 | starts_ends = np.clip(starts_ends, 0, wav.size) 99 | lengths = starts_ends[:,1] - starts_ends[:,0] 100 | starts_ends = starts_ends[lengths > fs * minimum_duration_sec] 101 | 102 | 103 | for (i, (s,e)) in enumerate(starts_ends): 104 | 105 | ofile = os.path.join(out_dir, base + '_seg%s.wav'%(str(i+1).zfill(4))) 106 | # print ofile 107 | soundfile.write(ofile, wav[s:e], fs) 108 | 109 | else: 110 | 111 | print "File discarded: " + wav_path 112 | 113 | def test(): 114 | safe_makedir('/tmp/splitwaves/') 115 | _process_utterance('/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/data/nick/wav/herald_030.wav', '/tmp/splitwaves/') 116 | 117 | if __name__=="__main__": 118 | 119 | main_work() 120 | 121 | -------------------------------------------------------------------------------- /synthesise_validation_waveforms.py: -------------------------------------------------------------------------------- 
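## Usage sketch for the script below, inferred from its argparse options; the config
## path is a hypothetical example. It converts the coarse mels saved at validation
## time to full spectrograms with the latest SSRN checkpoint, then runs Griffin-Lim:
##
##     python synthesise_validation_waveforms.py -c config/lj_01.cfg -ncores 4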
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | from argparse import ArgumentParser 10 | 11 | import imp 12 | 13 | import numpy as np 14 | 15 | from utils import spectrogram2wav 16 | # from scipy.io.wavfile import write 17 | import soundfile 18 | 19 | import tqdm 20 | from concurrent.futures import ProcessPoolExecutor 21 | 22 | import tensorflow as tf 23 | from architectures import SSRNGraph 24 | from synthesize import make_mel_batch, split_batch, synth_mel2mag 25 | from configuration import load_config 26 | 27 | 28 | def synth_wave(hp, magfile): 29 | mag = np.load(magfile) 30 | #print ('mag shape %s'%(str(mag.shape))) 31 | wav = spectrogram2wav(hp, mag) 32 | outfile = magfile.replace('.mag.npy', '.wav') 33 | outfile = outfile.replace('.npy', '.wav') 34 | #print magfile 35 | #print outfile 36 | #print 37 | # write(outfile, hp.sr, wav) 38 | soundfile.write(outfile, wav, hp.sr) 39 | 40 | def main_work(): 41 | 42 | ################################################# 43 | 44 | # ======== Get stuff from command line ========== 45 | 46 | a = ArgumentParser() 47 | a.add_argument('-c', dest='config', required=True, type=str) 48 | a.add_argument('-ncores', type=int, default=1) 49 | opts = a.parse_args() 50 | 51 | # =============================================== 52 | 53 | hp = load_config(opts.config) 54 | 55 | ### 1) convert saved coarse mels to mags with latest-trained SSRN 56 | print('mel2mag: restore last saved SSRN') 57 | g = SSRNGraph(hp, mode="synthesize") 58 | with tf.Session() as sess: 59 | sess.run(tf.global_variables_initializer()) 60 | 61 | ## TODO: use restore_latest_model_parameters from synthesize? 
62 | var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'SSRN') 63 | saver2 = tf.train.Saver(var_list=var_list) 64 | savepath = hp.logdir + "-ssrn" 65 | latest_checkpoint = tf.train.latest_checkpoint(savepath) 66 | if latest_checkpoint is None: sys.exit('No SSRN at %s?'%(savepath)) 67 | ssrn_epoch = latest_checkpoint.strip('/ ').split('/')[-1].replace('model_epoch_', '') 68 | saver2.restore(sess, latest_checkpoint) 69 | print("SSRN Restored from latest epoch %s"%(ssrn_epoch)) 70 | 71 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.npy') 72 | filelist = [fname for fname in filelist if not fname.endswith('.mag.npy')] 73 | batch, lengths = make_mel_batch(hp, filelist, oracle=False) 74 | Z = synth_mel2mag(hp, batch, g, sess, batchsize=32) 75 | print ('synthesised mags, now splitting batch:') 76 | maglist = split_batch(Z, lengths) 77 | for (infname, outdata) in tqdm.tqdm(zip(filelist, maglist)): 78 | np.save(infname.replace('.npy','.mag.npy'), outdata) 79 | 80 | 81 | 82 | ### 2) GL in parallel for both t2m and ssrn validation set 83 | print('GL for SSRN validation') 84 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.mag.npy') + \ 85 | glob.glob(hp.logdir + '-ssrn/validation_epoch_*/*.npy') 86 | 87 | if opts.ncores==1: 88 | for fname in tqdm.tqdm(filelist): 89 | synth_wave(hp, fname) 90 | else: 91 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 92 | futures = [] 93 | for fpath in filelist: 94 | futures.append(executor.submit(synth_wave, hp, fpath)) 95 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 96 | 97 | 98 | 99 | if __name__=="__main__": 100 | 101 | main_work() 102 | -------------------------------------------------------------------------------- /test.tmp: -------------------------------------------------------------------------------- 1 | testing 2 | -------------------------------------------------------------------------------- /tool/bin/sv56demo: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/tool/bin/sv56demo -------------------------------------------------------------------------------- /util/find_free_gpu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import subprocess 4 | import re 5 | 6 | sp = subprocess.Popen(['nvidia-smi', 'pmon', '-c', '1'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) 7 | out_str = sp.communicate() 8 | 9 | #print out_str 10 | 11 | lines = out_str[0].split('\n') 12 | lines = [line for line in lines if not line.startswith('#')] # filter comments 13 | #lines = [line.replace('-', '') for line in lines] # remove - placeholders used to mark columns for idle GPUs 14 | lines = [re.split('\s+', line.strip(' ')) for line in lines] # split on whitespace 15 | lines = [line for line in lines if line != ['']] # remove empty lines 16 | 17 | free_gpu = -1 18 | for line in lines: 19 | #print line 20 | gpu_id = line[0] 21 | process_info = ''.join(line[1:]) 22 | #print gpu_id 23 | #print process_info 24 | if process_info.replace('-', '') == '': ## does it contain only - placeholders used to mark columns for idle GPUs? 
25 | free_gpu = gpu_id 26 | break 27 | print free_gpu 28 | 29 | -------------------------------------------------------------------------------- /util/submit_tf.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... ] 7 | 8 | ## Location of this script: assume gpu_lock.py is in same place - 9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P ) 10 | 11 | 12 | gpu_id=$(python2 $SCRIPTPATH/gpu_lock.py --id-to-hog) 13 | 14 | ### paths for tensorflow (works on hynek) 15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 17 | export CUDA_HOME=/opt/cuda-8.0.44/ 18 | 19 | export KERAS_BACKEND=tensorflow 20 | export CUDA_VISIBLE_DEVICES=$gpu_id 21 | 22 | if [ $gpu_id -gt -1 ]; then 23 | 24 | $PYTHON $@ 25 | 26 | python2 $SCRIPTPATH/gpu_lock.py --free $gpu_id 27 | else 28 | echo 'Let us wait! No GPU is available!' 29 | 30 | fi 31 | -------------------------------------------------------------------------------- /util/submit_tf_2.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... ] 7 | 8 | ## Location of this script: assume gpu_lock.py is in same place - 9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P ) 10 | 11 | 12 | gpu_id=$(python2 $SCRIPTPATH/find_free_gpu.py) 13 | 14 | ### paths for tensorflow (works on hynek) 15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 17 | export CUDA_HOME=/opt/cuda-8.0.44/ 18 | 19 | export KERAS_BACKEND=tensorflow 20 | export CUDA_VISIBLE_DEVICES=$gpu_id 21 | 22 | if [ $gpu_id -gt -1 ]; then 23 | 24 | $PYTHON $@ 25 | 26 | else 27 | echo 'Let us wait! No GPU is available!' 28 | 29 | fi 30 | -------------------------------------------------------------------------------- /util/submit_tf_cpu.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... 
] 7 | 8 | 9 | ### paths for tensorflow (works on hynek) 10 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 11 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 12 | export CUDA_HOME=/opt/cuda-8.0.44/ 13 | 14 | export KERAS_BACKEND=tensorflow 15 | export CUDA_VISIBLE_DEVICES="" 16 | 17 | $PYTHON $@ 18 | 19 | -------------------------------------------------------------------------------- /util/sync_code_from_afs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | SERVERDIR=/path/on/GPU/machine/to/dc_tts_osw/ ## change this to e.g. /disk/scratch/.../dc_tts_osw 4 | SOURCE=/path/on/AFS/to/dc_tts_osw/ ## e.g. in your AFS home directory 5 | 6 | rsync -avzh --progress $SOURCE/ $SERVERDIR/ 7 | # --exclude='.git/' 8 | -------------------------------------------------------------------------------- /util/sync_output_to_afs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | DEST=~/synthetic_speech ## e.g. directory on AFS where examining audio is straightforward 4 | 5 | PATTERN=$1 ## substring of the config to sync 6 | 7 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/validation_epoch_*/*.wav $DEST 8 | rsync -av --relative ./work/*${PATTERN}*/train-ssrn/validation_epoch_*/*.wav $DEST 9 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/alignment* $DEST 10 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.wav $DEST 11 | rsync -av --relative ./work/*${PATTERN}*/synth*/*/*.wav $DEST 12 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.png $DEST 13 | 14 | 15 | echo "Synced to: $DEST" --------------------------------------------------------------------------------
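## Example invocation of the sync script above (hypothetical pattern): passing a
## substring of a config name syncs that voice's validation and synthesis output:
##
##     ./util/sync_output_to_afs.sh lj_01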