├── .gitignore ├── LICENSE ├── README.md ├── README_ORIG.md ├── README_notes.md ├── add_duration_to_transcript.py ├── architectures.py ├── calculate_CDP_Ain_Aout.py ├── config ├── lj_01.cfg ├── lj_02.cfg ├── lj_03.cfg ├── lj_04.cfg ├── lj_05.cfg ├── lj_test.cfg ├── lj_tutorial.cfg ├── nancy2nick_01.cfg ├── nancy2nick_02.cfg ├── nancy_01.cfg ├── nancyplusnick_01.cfg ├── nancyplusnick_02.cfg ├── nancyplusnick_03.cfg ├── nancyplusnick_04_lcc.cfg ├── nancyplusnick_05_lcc_sdpe.cfg ├── project │ ├── baseline.cfg │ ├── c-labels_plus_te.cfg │ ├── fa_as_attention.cfg │ ├── fa_as_guide.cfg │ ├── fa_as_target.cfg │ ├── labels_minus_te.cfg │ └── labels_plus_te.cfg ├── ssw10 │ ├── G1ABC_01.cfg │ ├── G1ABC_02.cfg │ ├── G1ABC_03.cfg │ ├── G1AB_03.cfg │ ├── G1BC_03.cfg │ ├── G1B_03.cfg │ ├── G1_03.cfg │ ├── G2_03.cfg │ ├── README.md │ ├── W2A_03.cfg │ ├── W2B_03.cfg │ └── W2_03.cfg ├── vctk_01.cfg ├── vctk_02.cfg └── vctk_03_lcc.cfg ├── configuration.py ├── convert_alignment_to_guide.py ├── copy_synth_GL.py ├── copy_synth_SSRN_GL.py ├── data_load.py ├── doc ├── festival_install.md ├── old_readme.md ├── preparing_new_database.md ├── recipe_WaveRNN.md ├── recipe_nancy.md ├── recipe_nancy2nick.md ├── recipe_nancyplusnick.md ├── recipe_project.md ├── recipe_semisupervised.md └── recipe_vctk.md ├── fig ├── aaa ├── attention.gif └── training_curves.png ├── labels2tra.sh ├── libutil.py ├── logger_setup.py ├── modules.py ├── networks.py ├── objective_measures.py ├── plot_loss.py ├── prepare_acoustic_features.py ├── prepare_attention_guides.py ├── requirements.txt ├── script ├── add_durations_to_transcript.py ├── add_speaker.py ├── check_transcript.py ├── festival │ ├── csv2scm.py │ ├── fix_transcript.py │ ├── make_rich_phones.scm │ ├── make_rich_phones_cmulex.scm │ ├── make_rich_phones_combirpx_noplex.scm │ └── multi_transcript.py ├── get_transcriptions.sh ├── interpolate_unvoiced.py ├── libutil.py ├── make_internal_webchart.py ├── munge_nsf_data.py ├── normalise_level.py ├── pitchmark_speech.py ├── prepare_world_features.py ├── process_merlin_label.py ├── process_merlin_label_positions.py ├── remove_end_silences.py └── split_speech.py ├── synthesise_validation_waveforms.py ├── synthesize.py ├── test.tmp ├── tool └── bin │ └── sv56demo ├── train.py ├── util ├── find_free_gpu.py ├── gpu_lock.py ├── submit_tf.sh ├── submit_tf_2.sh ├── submit_tf_cpu.sh ├── sync_code_from_afs.sh └── sync_output_to_afs.sh └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.cfgc 3 | tool/* 4 | !tool/bin/sv56demo -------------------------------------------------------------------------------- /README_ORIG.md: -------------------------------------------------------------------------------- 1 | # A TensorFlow Implementation of DC-TTS: yet another text-to-speech model 2 | 3 | I implement yet another text-to-speech model, dc-tts, introduced in [Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention](https://arxiv.org/abs/1710.08969). My goal, however, is not just replicating the paper. Rather, I'd like to gain insights into various sound projects. 4 | 5 | ## Requirements 6 | * NumPy >= 1.11.1 7 | * TensorFlow >= 1.3 (Note that the API of `tf.contrib.layers.layer_norm` has changed since 1.3) 8 | * librosa 9 | * tqdm 10 | * matplotlib 11 | * scipy 12 | 13 | ## Data 14 | 15 | 16 | 17 | 18 | 19 | 20 | I train English models and a Korean model on four different speech datasets.

1. [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
2. [Nick Offerman's Audiobooks](https://www.audible.com.au/search?searchNarrator=Nick+Offerman)
3. [Kate Winslet's Audiobook](https://www.audible.com.au/pd/Classics/Therese-Raquin-Audiobook/B00FF0SLW4/ref=a_search_c4_1_3_srTtl?qid=1516854754&sr=1-3)
4. [KSS Dataset](https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset) 21 | 22 | The LJ Speech Dataset has recently become widely used as a benchmark dataset for TTS because it is publicly available and contains 24 hours of reasonable-quality samples. 23 | Nick's and Kate's audiobooks are additionally used to see whether the model can learn even from smaller amounts of more variable speech. They are 18 hours and 5 hours long, respectively. Finally, KSS Dataset is a Korean single-speaker speech dataset that lasts more than 12 hours. 24 | 25 | 26 | ## Training 27 | * STEP 0. Download [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/) or prepare your own data. 28 | * STEP 1. Adjust hyper parameters in `hyperparams.py`. (If you want to do preprocessing, set `prepro` to True.) 29 | * STEP 2. Run `python train.py 1` for training Text2Mel. (If you set `prepro` to True, run `python prepro.py` first.) 30 | * STEP 3. Run `python train.py 2` for training SSRN. 31 | 32 | You can run STEPs 2 and 3 at the same time if you have more than one GPU. 33 | 34 | ## Training Curves 35 | 36 | 37 | 38 | ## Attention Plot 39 | 40 | 41 | ## Sample Synthesis 42 | I generate speech samples based on [Harvard Sentences](http://www.cs.columbia.edu/~hgs/audio/harvard.html), as the original paper does; the sentence list is already included in the repo. 43 | 44 | * Run `synthesize.py` and check the files in `samples`. 45 | 46 | ## Generated Samples 47 | 48 | | Dataset | Samples | 49 | | :----- |:-------------| 50 | | LJ | [50k](https://soundcloud.com/kyubyong-park/sets/dc_tts) [200k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_200k) [310k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_310k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_lj_800k)| 51 | | Nick | [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_40k) [170k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_170k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_nick_800k)| 52 | | Kate| [40k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_40k) [160k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_160k) [300k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_300k) [800k](https://soundcloud.com/kyubyong-park/sets/dc_tts_kate_800k) | 53 | | KSS| [400k](https://soundcloud.com/kyubyong-park/sets/dc_tts_ko_400k) | 54 | 55 | ## Pretrained Model for LJ 56 | 57 | Download [this](https://www.dropbox.com/s/1oyipstjxh2n5wo/LJ_logdir.tar?dl=0). 58 | 59 | ## Notes 60 | 61 | * The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization. 62 | * The paper fixed the learning rate to 0.001, but that didn't work for me. So I decayed it. 63 | * I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating the two networks mitigates the burden of training. 64 | * The authors claimed that the model can be trained within a day, but unfortunately the luck was not mine. Still, it is clearly much faster than Tacotron, as it uses only convolutional layers. 65 | * Thanks to the guided attention, the attention plot looks monotonic almost from the beginning. I guess the guide holds the alignment tight so it doesn't lose track. 66 | * The paper didn't mention dropouts. I applied them, as I believe they help with regularization. 67 | * Check also other TTS models such as [Tacotron](https://github.com/kyubyong/tacotron) and [Deep Voice 3](https://github.com/kyubyong/deepvoice3).
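For reference, this is roughly what a decayed learning rate looks like; a minimal sketch assuming TF 1.x (the exact schedule used here may differ, and the decay parameters below are illustrative):

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Start from the paper's 0.001 and, for example, halve the rate every 100k steps.
lr = tf.train.exponential_decay(learning_rate=0.001,
                                global_step=global_step,
                                decay_steps=100000,
                                decay_rate=0.5,
                                staircase=True)
optimizer = tf.train.AdamOptimizer(lr)
```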
68 | 69 | -------------------------------------------------------------------------------- /add_duration_to_transcript.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import os 4 | import sys 5 | import os.path 6 | 7 | # Usage: python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file 8 | fa_matrices_dir = sys.argv[1] 9 | transcript = sys.argv[2] 10 | out_transcript = sys.argv[3] 11 | 12 | with open(transcript, 'r') as f: 13 | tra = f.readlines() 14 | 15 | f = open(out_transcript,'w') 16 | 17 | 18 | for t in tra: 19 | 20 | file = t.split('|')[0] 21 | phones = t.split('|')[3] 22 | num_phones = len(phones.split(' ')) 23 | 24 | if os.path.exists(fa_matrices_dir + file + '.npy'): 25 | 26 | A = np.load( fa_matrices_dir + file + '.npy') 27 | durations = np.sum(A,axis=0) # sum attention mass over output frames to get per-phone durations 28 | durations = 4*durations # durations are counted in coarse (50ms) frames; multiply by 4 to express them in 12.5ms frames 29 | assert(num_phones == A.shape[1]) 30 | assert(num_phones == len(durations)) 31 | 32 | f.write(t.rstrip() + '||' + " ".join(map(str, durations.astype(int))) + '\n') 33 | 34 | else: 35 | print("Missing " + file) 36 | 37 | f.close() 38 | -------------------------------------------------------------------------------- /calculate_CDP_Ain_Aout.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import math 4 | import glob 5 | import numpy as np 6 | import matplotlib 7 | import matplotlib.pyplot as plt 8 | 9 | def get_att_per_input(A): 10 | 11 | att_per_input = np.trim_zeros(np.sum(A,axis=1),'b') # total attention received by each input token; drop trailing zero padding 12 | num_input = len(att_per_input) 13 | 14 | return att_per_input, num_input 15 | 16 | # coverage deviation penalty (CDP): punishes input tokens that are under-represented in the output 17 | # (lack of attention: skips) or over-represented (too much attention: prolongations) 18 | def getCDP(A): 19 | 20 | att_per_input, num_input = get_att_per_input(A) 21 | 22 | tmp = np.sum( np.log( 1. + (1. - att_per_input )**2 ) )
23 | CDP = tmp / num_input # removed the minus sign 24 | 25 | return CDP 26 | 27 | def getEnt(A): 28 | 29 | Entr = 0.0 30 | 31 | for a in A: # traverse rows of A 32 | norm = np.sum(a) 33 | normPd = a 34 | if norm != 0.0: 35 | normPd = [ p / norm for p in a ] 36 | entr = np.sum( [ ( p * np.log(p) if (p!=0.0) else 0.0 ) for p in normPd ] ) 37 | Entr += entr 38 | 39 | Entr = -Entr/A.shape[0] 40 | Entr /= np.log(A.shape[1]) # to normalise it between 0 and 1 41 | 42 | return Entr 43 | 44 | 45 | # absentmindedness penalty: punishes scattered attention; dispersion is measured via the entropy 46 | def getAP(A): 47 | 48 | att_per_input, num_input = get_att_per_input(A) 49 | 50 | A = A[:num_input,:] # drop zero-padded rows beyond the real input length 51 | 52 | 53 | 54 | APin = getEnt(A) 55 | APout = getEnt(np.transpose(A)) 56 | 57 | return APin, APout 58 | 59 | def main(file_name): 60 | 61 | A = np.load(file_name) # matrix axis 0: input (phones) / axis 1: output (acoustic) 62 | print(A.shape) 63 | # fig, ax = plt.subplots() 64 | # im = ax.imshow(A) 65 | # fig.colorbar(im) 66 | # plt.ylabel('Encoder timestep') 67 | # plt.xlabel('Decoder timestep') 68 | # plt.show() 69 | 70 | CDP = getCDP(A) 71 | APin, APout = getAP(A) 72 | 73 | print('CDP: ' + str(CDP)) 74 | print('Ain: ' + str(APin)) 75 | print('Aout: ' + str(APout)) 76 | 77 | if __name__ == '__main__': 78 | 79 | # Input attention matrix - numpy format - axis 0: input (phones) / axis 1: output (acoustic) 80 | 81 | file_name = sys.argv[1] 82 | 83 | main(file_name) 84 | 85 | 86 | -------------------------------------------------------------------------------- /config/lj_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(topworkdir, 'lj_test', 'data') ## use older data 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # data locations: 28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | ## Combilex RPX plus extra symbols:- 39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 40 | '<_END_>', '<_START_>', \ 41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 43 | 'l', 'l!', 'lw', 'm', 'm!', 'n',
'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 44 | 45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 46 | max_N = 180 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel = 0.3333 85 | lw_bd1 = 0.3333 86 | lw_att = 0.3333 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | 91 | 92 | ## validation: 93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 94 | validation_sentences_to_evaluate = 32 95 | validation_sentences_to_synth_params = 3 96 | 97 | 98 | # training scheme 99 | restart_from_savepath = [] 100 | lr = 0.001 # Initial learning rate. 101 | batchsize = {'t2m': 32, 'ssrn': 32} 102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 103 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 104 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
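## A note on the validation hold-out defined above: validpatt acts as a plain substring filter over
## utterance names, roughly like this sketch (names hypothetical; the real filtering presumably
## happens during data loading):
##     train_utts = [u for u in all_utts if validpatt not in u]
##     valid_utts = [u for u in all_utts if validpatt in u]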
105 | max_epochs = 300 106 | 107 | # attention plotting during training 108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 110 | -------------------------------------------------------------------------------- /config/lj_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## RPX phones as in lj_01 and 16kHz as in lj_02, but experiment with training the babbler before T2M 5 | 6 | import os 7 | 8 | 9 | ## Take name of this file to be the config name: 10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 11 | 12 | ## Define place to put outputs relative to this config file's location; 13 | ## supply an absolute path to work elsewhere: 14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 15 | 16 | voicedir = os.path.join(topworkdir, config_name) 17 | logdir = os.path.join(voicedir, 'train') 18 | sampledir = os.path.join(voicedir, 'synth') 19 | 20 | ## Change featuredir to absolute path to use existing features 21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data 22 | coarse_audio_dir = os.path.join(featuredir, 'mels') 23 | full_mel_dir = os.path.join(featuredir, 'full_mels') 24 | full_audio_dir = os.path.join(featuredir, 'mags') 25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 26 | 27 | bucket_data_by = 'audio_length' 28 | 29 | # data locations: 30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | 32 | 33 | transcript = os.path.join(datadir, 'transcript.csv') 34 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 35 | waveforms = os.path.join(datadir, 'wav_trim') 36 | 37 | 38 | 39 | input_type = 'phones' ## letters or phones 40 | ## Combilex RPX plus extra symbols:- 41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 42 | '<_END_>', '<_START_>', \ 43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 46 | 47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 48 | max_N = 180 # Maximum number of characters. # !!! 49 | max_T = 210 # Maximum number of mel frames. # !!! 50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 51 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 53 | 54 | 55 | 56 | # signal processing 57 | trim_before_spectrogram_extraction = 0 58 | vocoder = 'griffin_lim' 59 | sr = 16000 # Sampling rate.
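## For concreteness: with sr = 16000, the 12.5 ms shift and 50 ms window set below work out to
## hop_length = int(16000 * 0.0125) = 200 samples and win_length = int(16000 * 0.05) = 800 samples.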
60 | n_fft = 2048 # fft points (samples) 61 | frame_shift = 0.0125 # seconds 62 | frame_length = 0.05 # seconds 63 | hop_length = int(sr * frame_shift) 64 | win_length = int(sr * frame_length) 65 | prepro = True # don't extract spectrograms on the fly 66 | full_dim = n_fft//2+1 67 | n_mels = 80 # Number of Mel banks to generate 68 | power = 1.5 # Exponent for amplifying the predicted magnitude 69 | n_iter = 50 # Number of inversion iterations 70 | preemphasis = .97 71 | max_db = 100 72 | ref_db = 20 73 | 74 | 75 | # Model 76 | r = 4 # Reduction factor. Do not change this. 77 | dropout_rate = 0.05 78 | e = 128 # == embedding 79 | d = 256 # == hidden units of Text2Mel 80 | c = 512 # == hidden units of SSRN 81 | attention_win_size = 3 82 | g = 0.2 ## determines width of band in attention guide 83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 84 | 85 | ## loss weights : T2M 86 | lw_mel = 0.3333 87 | lw_bd1 = 0.3333 88 | lw_att = 0.3333 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | 93 | loss_weights = {'babbler': {'L1': 0.5, 'binary_divergence': 0.5}} ## TODO: make other loss config consistent 94 | 95 | ## validation: 96 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 97 | validation_sentences_to_evaluate = 32 98 | validation_sentences_to_synth_params = 3 99 | 100 | 101 | # training scheme 102 | restart_from_savepath = [] 103 | lr = 0.001 # Initial learning rate. 104 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32} 105 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 106 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 107 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
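## A note on the babbler loss defined above: like SSRN's lw_mag/lw_bd2 pair, it weights an L1 term
## and a binary-divergence term equally (0.5 each); binary divergence is the cross-entropy-style
## spectrogram loss from the DC-TTS paper, here presumably applied to the babbler's mel predictions.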
108 | max_epochs = 300 109 | 110 | # attention plotting during training 111 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 112 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 113 | -------------------------------------------------------------------------------- /config/lj_04.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## RPX phones as in lj_01 and 16kHz as in lj_02, but experiment with training the babbler before T2M 5 | ## stage 2: fine-tune on a subset of the transcribed data 6 | import os 7 | 8 | 9 | ## Take name of this file to be the config name: 10 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 11 | 12 | ## Define place to put outputs relative to this config file's location; 13 | ## supply an absolute path to work elsewhere: 14 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 15 | 16 | voicedir = os.path.join(topworkdir, config_name) 17 | logdir = os.path.join(voicedir, 'train') 18 | sampledir = os.path.join(voicedir, 'synth') 19 | 20 | ## Change featuredir to absolute path to use existing features 21 | featuredir = os.path.join(topworkdir, 'lj_02', 'data') ## use older data 22 | coarse_audio_dir = os.path.join(featuredir, 'mels') 23 | full_mel_dir = os.path.join(featuredir, 'full_mels') 24 | full_audio_dir = os.path.join(featuredir, 'mags') 25 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 26 | 27 | 28 | 29 | # data locations: 30 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | 32 | 33 | transcript = os.path.join(datadir, 'transcript.csv') 34 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 35 | waveforms = os.path.join(datadir, 'wav_trim') 36 | 37 | 38 | 39 | input_type = 'phones' ## letters or phones 40 | ## Combilex RPX plus extra symbols:- 41 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 42 | '<_END_>', '<_START_>', \ 43 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 44 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 45 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 46 | 47 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 48 | max_N = 180 # Maximum number of characters. # !!! 49 | max_T = 210 # Maximum number of mel frames. # !!! 50 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 51 | n_utts = 1000 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 52 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 53 | 54 | 55 | 56 | # signal processing 57 | trim_before_spectrogram_extraction = 0 58 | vocoder = 'griffin_lim' 59 | sr = 16000 # Sampling rate.
60 | n_fft = 2048 # fft points (samples) 61 | frame_shift = 0.0125 # seconds 62 | frame_length = 0.05 # seconds 63 | hop_length = int(sr * frame_shift) 64 | win_length = int(sr * frame_length) 65 | prepro = True # don't extract spectrograms on the fly 66 | full_dim = n_fft//2+1 67 | n_mels = 80 # Number of Mel banks to generate 68 | power = 1.5 # Exponent for amplifying the predicted magnitude 69 | n_iter = 50 # Number of inversion iterations 70 | preemphasis = .97 71 | max_db = 100 72 | ref_db = 20 73 | 74 | 75 | # Model 76 | r = 4 # Reduction factor. Do not change this. 77 | dropout_rate = 0.05 78 | e = 128 # == embedding 79 | d = 256 # == hidden units of Text2Mel 80 | c = 512 # == hidden units of SSRN 81 | attention_win_size = 3 82 | g = 0.2 ## determines width of band in attention guide 83 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 84 | 85 | ## loss weights : T2M 86 | lw_mel = 0.3333 87 | lw_bd1 = 0.3333 88 | lw_att = 0.3333 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | 93 | 94 | ## validation: 95 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 96 | validation_sentences_to_evaluate = 32 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.001 # Initial learning rate. 103 | batchsize = {'t2m': 32, 'ssrn': 32, 'babbler': 32} 104 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 105 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 106 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model. This has been disabled (0) to save disk space. 
107 | max_epochs = 1000 108 | 109 | # attention plotting during training 110 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 111 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 112 | -------------------------------------------------------------------------------- /config/lj_test.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(voicedir, 'data') 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # data locations: 28 | datadir = '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | ## Combilex RPX plus extra symbols:- 39 | vocab = ['', '', "<'>", "<'s>", '<)>', '<]>', '<">', '<,>', '<.>', '<:>', '<;>', '<>', '', \ 40 | '<_END_>', '<_START_>', \ 41 | '@', '@@', '@U', 'A', 'D', 'E', 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U',\ 42 | 'U@', 'V', 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', 'j', 'k',\ 43 | 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 44 | 45 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 46 | max_N = 180 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 200 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate.
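## NB: at 22.05 kHz the frame settings below do not give whole-sample sizes; int() truncates
## 22050 * 0.0125 = 275.625 to hop_length = 275 and 22050 * 0.05 = 1102.5 to win_length = 1102.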
58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel = 0.3333 85 | lw_bd1 = 0.3333 86 | lw_att = 0.3333 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | 91 | 92 | ## validation: 93 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI 94 | validation_sentences_to_evaluate = 32 95 | validation_sentences_to_synth_params = 3 96 | 97 | 98 | # training scheme 99 | restart_from_savepath = [] 100 | lr = 0.001 # Initial learning rate. 101 | batchsize = {'t2m': 32, 'ssrn': 32} 102 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 103 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 104 | save_every_n_epochs = 2 ## how often to archive a model, in addition to keeping the 5 most recent 105 | max_epochs = 4 106 | 107 | # attention plotting during training 108 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 109 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for -------------------------------------------------------------------------------- /config/lj_tutorial.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absolute path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = os.path.join(voicedir, 'data') 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 24 | 25 | 26 | 27 | # Data locations: 28 | datadir = '/path/to/LJSpeech-1.1/' 29 | 30 | 31 | transcript = os.path.join(datadir, 'transcript.csv') 32 | test_transcript = os.path.join(datadir, 'test_transcript.csv') 33 | waveforms = os.path.join(datadir, 'wav_trim') 34 | 35 |
36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] ## NB: '<_END_>' and '<_START_>' already appear in the first list, so they are not repeated here 40 | 41 | max_N = 150 # Maximum number of characters/phones 42 | max_T = 200 # Maximum number of mel frames 43 | 44 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 45 | n_utts = 200 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 46 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 47 | 48 | 49 | 50 | # signal processing 51 | trim_before_spectrogram_extraction = 0 52 | vocoder = 'griffin_lim' 53 | sr = 22050 # Sampling rate. 54 | n_fft = 2048 # fft points (samples) 55 | frame_shift = 0.0125 # seconds 56 | frame_length = 0.05 # seconds 57 | hop_length = int(sr * frame_shift) 58 | win_length = int(sr * frame_length) 59 | prepro = True # don't extract spectrograms on the fly 60 | full_dim = n_fft//2+1 61 | n_mels = 80 # Number of Mel banks to generate 62 | power = 1.5 # Exponent for amplifying the predicted magnitude 63 | n_iter = 50 # Number of inversion iterations 64 | preemphasis = .97 65 | max_db = 100 66 | ref_db = 20 67 | 68 | 69 | # Model 70 | r = 4 # Reduction factor. Do not change this. 71 | dropout_rate = 0.05 72 | e = 128 # == embedding 73 | d = 256 # == hidden units of Text2Mel 74 | c = 512 # == hidden units of SSRN 75 | attention_win_size = 3 76 | g = 0.2 ## determines width of band in attention guide 77 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 78 | 79 | ## loss weights : T2M 80 | lw_mel = 0.25 81 | lw_bd1 = 0.25 82 | lw_att = 0.25 83 | lw_t2m_l2 = 0.25 84 | ## : SSRN 85 | lw_mag = 0.3333 86 | lw_bd2 = 0.3333 87 | lw_ssrn_l2 = 0.3333 88 | 89 | 90 | ## validation: 91 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out the 50th chapter of LJ. TODO: mention SD vs. SI 92 | validation_sentences_to_evaluate = 32 93 | validation_sentences_to_synth_params = 3 94 | 95 | 96 | # training scheme 97 | restart_from_savepath = [] 98 | lr = 0.001 # Initial learning rate.
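## NB: the ssrn batch size below is smaller than t2m's, plausibly because SSRN trains on
## full-resolution magnitude spectrograms (full_dim = 2048//2 + 1 = 1025 bins, against n_mels = 80
## coarse mel bins), so each SSRN example costs considerably more memory.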
99 | batchsize = {'t2m': 32, 'ssrn': 8} 100 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 101 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 102 | save_every_n_epochs = 5 ## how often to archive a model, in addition to keeping the 5 most recent 103 | max_epochs = 10 104 | 105 | # attention plotting during training 106 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 107 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 108 | -------------------------------------------------------------------------------- /config/nancy2nick_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | nancy base voice retrained on nick, for ICPhS in the first instance. 6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 38 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | waveforms = os.path.join(datadir, 'wav_trim') 40 | 41 | 42 | 43 | input_type = 'phones' ## letters or phones 44 | 45 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 46 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 47 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 48 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 49 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 50 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 51 | 52 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 53 | max_N = 140 # Maximum number of characters. # !!! 54 | max_T = 200 # Maximum number of mel frames. # !!! 55 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 56 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 57 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features.
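## A minimal sketch of what the random reduction above means (names hypothetical; the real code
## presumably lives in data_load.py): with reduction factor r = 4, the coarse features keep every
## 4th mel frame, starting from a freshly drawn offset each time the utterance is loaded:
##     shift = np.random.randint(0, r)
##     coarse_mel = full_mel[shift::r, :]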
58 | 59 | 60 | 61 | # signal processing 62 | trim_before_spectrogram_extraction = 0 63 | vocoder = 'griffin_lim' 64 | sr = 16000 # Sampling rate. 65 | n_fft = 2048 # fft points (samples) 66 | frame_shift = 0.0125 # seconds 67 | frame_length = 0.05 # seconds 68 | hop_length = int(sr * frame_shift) 69 | win_length = int(sr * frame_length) 70 | prepro = True # don't extract spectrograms on the fly 71 | full_dim = n_fft//2+1 72 | n_mels = 80 # Number of Mel banks to generate 73 | power = 1.5 # Exponent for amplifying the predicted magnitude 74 | n_iter = 50 # Number of inversion iterations 75 | preemphasis = .97 76 | max_db = 100 77 | ref_db = 20 78 | 79 | 80 | # Model 81 | r = 4 # Reduction factor. Do not change this. 82 | dropout_rate = 0.05 83 | e = 128 # == embedding 84 | d = 256 # == hidden units of Text2Mel 85 | c = 512 # == hidden units of SSRN 86 | attention_win_size = 3 87 | g = 0.2 ## determines width of band in attention guide 88 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 89 | 90 | ## loss weights : T2M 91 | lw_mel = 0.3333 92 | lw_bd1 = 0.3333 93 | lw_att = 0.3333 94 | ## : SSRN 95 | lw_mag = 0.5 96 | lw_bd2 = 0.5 97 | 98 | 99 | ## validation: 100 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 101 | #### NYT00 matches 24 of nancy's sentences. 102 | validation_sentences_to_evaluate = 24 103 | validation_sentences_to_synth_params = 16 104 | 105 | 106 | # training scheme 107 | restart_from_savepath = [] 108 | lr = 0.001 # Initial learning rate. 109 | batchsize = {'t2m': 32, 'ssrn': 32} 110 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 111 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model 112 | max_epochs = 100 113 | 114 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 115 | WORK='/disk/scratch/oliver/dc_tts_osw_clean/work/' 116 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancy_01/train-t2m/model_epoch_1000'), \ 117 | ('SSRN', WORK+'/nancy_01/train-ssrn/model_epoch_1000')] 118 | 119 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 120 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancy_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | nancy base voice for ICPhS in first instance. 
6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') ## use older data 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nancy/derived' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = '/group/project/cstr2/owatts/data/nick16k/transcript_mrt.csv' 38 | 39 | 40 | waveforms = os.path.join(datadir, 'wav_trim') 41 | 42 | 43 | 44 | input_type = 'phones' ## letters or phones 45 | 46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 47 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 52 | 53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 54 | max_N = 140 # Maximum number of characters. # !!! 55 | max_T = 200 # Maximum number of mel frames. # !!! 56 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 57 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 58 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 59 | 60 | 61 | 62 | # signal processing 63 | trim_before_spectrogram_extraction = 0 64 | vocoder = 'griffin_lim' 65 | sr = 16000 # Sampling rate. 66 | n_fft = 2048 # fft points (samples) 67 | frame_shift = 0.0125 # seconds 68 | frame_length = 0.05 # seconds 69 | hop_length = int(sr * frame_shift) 70 | win_length = int(sr * frame_length) 71 | prepro = True # don't extract spectrograms on the fly 72 | full_dim = n_fft//2+1 73 | n_mels = 80 # Number of Mel banks to generate 74 | power = 1.5 # Exponent for amplifying the predicted magnitude 75 | n_iter = 50 # Number of inversion iterations 76 | preemphasis = .97 77 | max_db = 100 78 | ref_db = 20 79 | 80 | 81 | # Model 82 | r = 4 # Reduction factor. Do not change this.
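## For orientation: with r = 4 and the 12.5 ms frame shift above, one coarse frame covers 50 ms;
## if max_T counts these coarse frames (as in the original DC-TTS code), max_T = 200 allows
## at most 200 * 0.05 = 10 seconds of speech per utterance.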
83 | dropout_rate = 0.05 84 | e = 128 # == embedding 85 | d = 256 # == hidden units of Text2Mel 86 | c = 512 # == hidden units of SSRN 87 | attention_win_size = 3 88 | g = 0.2 ## determines width of band in attention guide 89 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 90 | 91 | ## loss weights : T2M 92 | lw_mel = 0.3333 93 | lw_bd1 = 0.3333 94 | lw_att = 0.3333 95 | ## : SSRN 96 | lw_mag = 0.5 97 | lw_bd2 = 0.5 98 | 99 | 100 | ## validation: 101 | validpatt = 'NYT00' ## sentence names containing this substring will be held out of training. 102 | #### NYT00 matches 24 of nancy's sentences. 103 | validation_sentences_to_evaluate = 24 104 | validation_sentences_to_synth_params = 16 105 | 106 | 107 | # training scheme 108 | restart_from_savepath = [] 109 | lr = 0.001 # Initial learning rate. 110 | batchsize = {'t2m': 32, 'ssrn': 32} 111 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 112 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 113 | save_every_n_epochs = 100 ## how often to archive a model, in addition to keeping the 5 most recent 114 | max_epochs = 1000 115 | 116 | # attention plotting during training 117 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 118 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for -------------------------------------------------------------------------------- /config/nancyplusnick_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 6 | ''' 7 | 8 | 9 | import os 10 | 11 | 12 | ## Take name of this file to be the config name: 13 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 14 | 15 | ## Define place to put outputs relative to this config file's location; 16 | ## supply an absolute path to work elsewhere: 17 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 18 | 19 | voicedir = os.path.join(topworkdir, config_name) 20 | logdir = os.path.join(voicedir, 'train') 21 | sampledir = os.path.join(voicedir, 'synth') 22 | 23 | ## Change featuredir to absolute path to use existing features 24 | featuredir = os.path.join(topworkdir, config_name, 'data') 25 | coarse_audio_dir = os.path.join(featuredir, 'mels') 26 | full_mel_dir = os.path.join(featuredir, 'full_mels') 27 | full_audio_dir = os.path.join(featuredir, 'mags') 28 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 29 | 30 | 31 | 32 | # data locations: 33 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 34 | 35 | 36 | transcript = os.path.join(datadir, 'transcript.csv') 37 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 38 | 39 | 40 | waveforms = os.path.join(datadir, 'wav_trim') 41 | 42 | 43 | 44 | input_type = 'phones' ## letters or phones 45 | 46 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 47 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 48 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 49 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 50 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 51 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z']
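## The multispeaker settings a few lines below splice learned speaker embeddings into the listed
## network positions. A minimal sketch of the lookup side, assuming TF 1.x (names hypothetical):
##     speaker_table = tf.get_variable('speaker_embedding', [nspeakers, speaker_embedding_size])
##     speaker_vector = tf.nn.embedding_lookup(speaker_table, speaker_ids)
## The looked-up vector is then combined (e.g. concatenated) with the activations at each position.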
52 | 53 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 54 | max_N = 140 # Maximum number of characters. # !!! 55 | max_T = 200 # Maximum number of mel frames. # !!! 56 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 57 | speaker_list = [''] + ['nancy', 'nick'] 58 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 59 | speaker_embedding_size = 128 60 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 61 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 62 | 63 | 64 | 65 | # signal processing 66 | trim_before_spectrogram_extraction = 0 67 | vocoder = 'griffin_lim' 68 | sr = 16000 # Sampling rate. 69 | n_fft = 2048 # fft points (samples) 70 | frame_shift = 0.0125 # seconds 71 | frame_length = 0.05 # seconds 72 | hop_length = int(sr * frame_shift) 73 | win_length = int(sr * frame_length) 74 | prepro = True # don't extract spectrograms on the fly 75 | full_dim = n_fft//2+1 76 | n_mels = 80 # Number of Mel banks to generate 77 | power = 1.5 # Exponent for amplifying the predicted magnitude 78 | n_iter = 50 # Number of inversion iterations 79 | preemphasis = .97 80 | max_db = 100 81 | ref_db = 20 82 | 83 | 84 | # Model 85 | r = 4 # Reduction factor. Do not change this. 86 | dropout_rate = 0.05 87 | e = 128 # == embedding 88 | d = 256 # == hidden units of Text2Mel 89 | c = 512 # == hidden units of SSRN 90 | attention_win_size = 3 91 | g = 0.2 ## determines width of band in attention guide 92 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 93 | 94 | ## loss weights : T2M 95 | lw_mel = 0.3333 96 | lw_bd1 = 0.3333 97 | lw_att = 0.3333 98 | ## : SSRN 99 | lw_mag = 0.5 100 | lw_bd2 = 0.5 101 | 102 | 103 | ## validation: 104 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 105 | #### NYT00 matches 24 of nancy's sentences. 106 | validation_sentences_to_evaluate = 24 107 | validation_sentences_to_synth_params = 16 108 | 109 | 110 | # training scheme 111 | restart_from_savepath = [] 112 | lr = 0.001 # Initial learning rate. 113 | batchsize = {'t2m': 32, 'ssrn': 32} 114 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 115 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 116 | max_epochs = 1000 117 | 118 | -------------------------------------------------------------------------------- /config/nancyplusnick_02.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 
6 | Stage 2: fine-tune on nick only 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this.
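## For reference, the attention guide whose width g is set below follows the DC-TTS paper:
## W[n, t] = 1 - exp(-((n/N - t/T)**2) / (2 * g**2)), which penalises attention far from the
## diagonal; larger g widens the tolerated band.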
87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 10 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 100 118 | 119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 120 | WORK=topworkdir 121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')] 122 | 123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 124 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancyplusnick_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance. 
6 | Stage 2: fine-tune on nick only -- in this version, set the attention loss weight very high 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancy2nick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick16k/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_hvd.csv') 39 | #test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['text_encoder_input', 'audio_decoder_input'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 10 ## add 10 extra slots for posterity's sake 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data; any other positive integer selects that many sentences from the beginning of the training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this.
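## NB: the point of this config (see the docstring above) shows up in the loss weights below:
## lw_att = 0.9 against 0.15 each for the mel and binary-divergence terms, so the attention loss
## dominates, presumably to hold the stage-1 alignment in place while fine-tuning on nick.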
87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.15 97 | lw_bd1 = 0.15 98 | lw_att = 0.9 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 1 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 0 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 100 118 | 119 | ## copy existing model parameters -- list of (scope, checkpoint) pairs:-- 120 | WORK=topworkdir 121 | initialise_weights_from_existing = [('Text2Mel', WORK+'/nancyplusnick_01/train-t2m/model_epoch_1000')] 122 | 123 | update_weights = [] ## If empty, update all trainable variables. Otherwise, update only 124 | ## those variables whose scopes match one of the patterns in the list -------------------------------------------------------------------------------- /config/nancyplusnick_04_lcc.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance 6 | Version 4: train with learned channel contributions from the 2 speakers 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absoluate path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 
'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['learn_channel_contributions'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 2 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this. 87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 
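## 'learn_channel_contributions' is only described by the docstring above; one
## plausible reading is a learned per-speaker weight on each hidden channel.
## An illustrative numpy sketch -- the names here are invented, not taken from
## modules.py or networks.py:
import numpy as np
n_spk, channels = 4, 256
contrib = np.ones((n_spk, channels), dtype=np.float32)  # trainable per-speaker channel weights
h = np.random.rand(10, channels)                        # hidden activations for one utterance
h = h * contrib[1]                                      # rescale channels for speaker id 1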
114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 100 ## as well as the 5 latest models, how often to archive a model 117 | max_epochs = 1000 118 | 119 | -------------------------------------------------------------------------------- /config/nancyplusnick_05_lcc_sdpe.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ''' 5 | Combined nancy & nick voice for ICPhS in first instance 6 | Version 5: train with learned channel contributions from the 2 speakers, plus speaker-dependent phones 7 | ''' 8 | 9 | 10 | import os 11 | 12 | 13 | ## Take name of this file to be the config name: 14 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 15 | 16 | ## Define place to put outputs relative to this config file's location; 17 | ## supply an absolute path to work elsewhere: 18 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 19 | 20 | voicedir = os.path.join(topworkdir, config_name) 21 | logdir = os.path.join(voicedir, 'train') 22 | sampledir = os.path.join(voicedir, 'synth') 23 | 24 | ## Change featuredir to absolute path to use existing features 25 | featuredir = os.path.join(topworkdir, 'nancyplusnick_01', 'data') 26 | coarse_audio_dir = os.path.join(featuredir, 'mels') 27 | full_mel_dir = os.path.join(featuredir, 'full_mels') 28 | full_audio_dir = os.path.join(featuredir, 'mags') 29 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 30 | 31 | 32 | 33 | # data locations: 34 | datadir = '/group/project/cstr2/owatts/data/nick_plus_nancy/' 35 | 36 | 37 | transcript = os.path.join(datadir, 'transcript.csv') 38 | test_transcript = os.path.join(datadir, 'transcript_mrt.csv') 39 | 40 | 41 | waveforms = os.path.join(datadir, 'wav_trim') 42 | 43 | 44 | 45 | input_type = 'phones' ## letters or phones 46 | 47 | vocab = ['', '', "<'s>", '<,>', '<.>', '<:>', '<;>', '<>', \ 48 | '', '<_END_>', '<_START_>', '@', '@@', '@U', 'A', 'D', 'E', \ 49 | 'E@', 'I', 'I@', 'N', 'O', 'OI', 'Q', 'S', 'T', 'U', 'U@', 'V', \ 50 | 'Z', 'a', 'aI', 'aU', 'b', 'd', 'dZ', 'eI', 'f', 'g', 'h', 'i', \ 51 | 'j', 'k', 'l', 'l!', 'lw', 'm', 'm!', 'n', 'n!', 'o^', 'o~', 'p', \ 52 | 'r', 's', 't', 'tS', 'u', 'v', 'w', 'z'] 53 | 54 | #vocab = "PE abcdefghijklmnopqrstuvwxyz'.?" # P: Padding, E: EOS. 55 | max_N = 140 # Maximum number of characters. # !!! 56 | max_T = 200 # Maximum number of mel frames. # !!! 57 | multispeaker = ['learn_channel_contributions', 'speaker_dependent_phones'] ## []: speaker dependent. text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 58 | speaker_list = [''] + ['nancy', 'nick'] 59 | nspeakers = len(speaker_list) + 2 60 | speaker_embedding_size = 128 61 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 62 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 63 | 64 | 65 | 66 | # signal processing 67 | trim_before_spectrogram_extraction = 0 68 | vocoder = 'griffin_lim' 69 | sr = 16000 # Sampling rate. 
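## A quick sanity check of the frame arithmetic encoded by the derived
## settings below, at this sampling rate:
assert int(16000 * 0.0125) == 200    # hop_length: 12.5 ms shift = 200 samples
assert int(16000 * 0.05) == 800      # win_length: 50 ms window = 800 samples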
70 | n_fft = 2048 # fft points (samples) 71 | frame_shift = 0.0125 # seconds 72 | frame_length = 0.05 # seconds 73 | hop_length = int(sr * frame_shift) 74 | win_length = int(sr * frame_length) 75 | prepro = True # don't extract spectrograms on the fly 76 | full_dim = n_fft//2+1 77 | n_mels = 80 # Number of Mel banks to generate 78 | power = 1.5 # Exponent for amplifying the predicted magnitude 79 | n_iter = 50 # Number of inversion iterations 80 | preemphasis = .97 81 | max_db = 100 82 | ref_db = 20 83 | 84 | 85 | # Model 86 | r = 4 # Reduction factor. Do not change this. 87 | dropout_rate = 0.05 88 | e = 128 # == embedding 89 | d = 256 # == hidden units of Text2Mel 90 | c = 512 # == hidden units of SSRN 91 | attention_win_size = 3 92 | g = 0.2 ## determines width of band in attention guide 93 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 94 | 95 | ## loss weights : T2M 96 | lw_mel = 0.3333 97 | lw_bd1 = 0.3333 98 | lw_att = 0.3333 99 | ## : SSRN 100 | lw_mag = 0.5 101 | lw_bd2 = 0.5 102 | 103 | 104 | ## validation: 105 | validpatt = 'herald_9' ## herald_9 matches 98 sentences from nick 106 | #### NYT00 matches 24 of nancy's sentences. 107 | validation_sentences_to_evaluate = 24 108 | validation_sentences_to_synth_params = 16 109 | 110 | 111 | # training scheme 112 | restart_from_savepath = [] 113 | lr = 0.001 # Initial learning rate. 114 | batchsize = {'t2m': 32, 'ssrn': 32} 115 | validate_every_n_epochs = 10 ## how often to compute validation score and save speech parameters 116 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 117 | max_epochs = 1000 118 | 119 | -------------------------------------------------------------------------------- /config/project/baseline.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/' 26 | #attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/fa_attention_guides_punctuation_all_quotes/' 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 33 | 34 | waveforms = os.path.join(datadir, 
'wavs_trim') 35 | 36 | 37 | input_type = 'phones' ## letters or phones 38 | 39 | ## CMU phones: 40 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 41 | 42 | # Train 43 | max_N = 150 # Maximum number of characters/phones 44 | max_T = 264 # Maximum number of mel frames 45 | 46 | turn_off_monotonic_for_synthesis = True 47 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 48 | 49 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 50 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 51 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 52 | 53 | 54 | 55 | # signal processing 56 | trim_before_spectrogram_extraction = 0 57 | vocoder = 'griffin_lim' 58 | sr = 22050 # Sampling rate. 59 | n_fft = 2048 # fft points (samples) 60 | frame_shift = 0.0125 # seconds 61 | frame_length = 0.05 # seconds 62 | hop_length = int(sr * frame_shift) 63 | win_length = int(sr * frame_length) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel =0.333 86 | lw_bd1 =0.333 87 | lw_att =0.333 88 | lw_t2m_l2 = 0.0 89 | ## : SSRN 90 | lw_mag = 0.5 91 | lw_bd2 = 0.5 92 | lw_ssrn_l2 = 0.0 93 | 94 | 95 | ## validation: 96 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 97 | validation_sentences_to_evaluate = 5 98 | validation_sentences_to_synth_params = 3 99 | 100 | 101 | # training scheme 102 | restart_from_savepath = [] 103 | lr = 0.0001 # Initial learning rate. 
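## A minimal sketch of the held-out split that validpatt implies: any utterance
## whose name contains the substring is kept for validation. The '|'-separated
## column layout is an assumption, not necessarily this repo's transcript format:
with open(transcript) as f:
    rows = [line.strip().split('|') for line in f if line.strip()]
valid = [r for r in rows if validpatt in r[0]]
train = [r for r in rows if validpatt not in r[0]]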
104 | beta1 = 0.5 105 | beta2 = 0.9 106 | epsilon = 0.000001 107 | decay_lr = False 108 | batchsize = {'t2m': 8, 'ssrn': 2} 109 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 110 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 111 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 112 | max_epochs = 500 113 | 114 | # attention plotting during training 115 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 116 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 117 | -------------------------------------------------------------------------------- /config/project/fa_as_attention.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_punctuation_all_quotes/' 26 | use_external_durations = True 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17_durations.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes_durations_updated.csv') 33 | waveforms = os.path.join(datadir, 'wavs_trim') 34 | 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 40 | 41 | # Train 42 | max_N = 150 # Maximum number of characters/phones 43 | max_T = 264 # Maximum number of mel frames 44 | 45 | turn_off_monotonic_for_synthesis = True 46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 47 | 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel =0.333 85 | lw_bd1 =0.333 86 | lw_att =0.333 87 | lw_t2m_l2 = 0.0 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | lw_ssrn_l2 = 0.0 92 | 93 | 94 | ## validation: 95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 96 | validation_sentences_to_evaluate = 5 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.0001 # Initial learning rate. 
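## With use_external_durations = True the transcript rows carry forced-alignment
## durations. A sketch (not repo code) of how per-phone frame counts can stand in
## for the alignment that attention would otherwise have to learn:
import numpy as np
def durations_to_attention(durs):
    N, T = len(durs), int(sum(durs))
    A = np.zeros((N, T), dtype=np.float32)
    t = 0
    for n, d in enumerate(durs):
        A[n, t:t + int(d)] = 1.0     # phone n is attended to for d frames
        t += int(d)
    return A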
103 | beta1 = 0.5 104 | beta2 = 0.9 105 | epsilon = 0.000001 106 | decay_lr = False 107 | batchsize = {'t2m': 8, 'ssrn': 2} 108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 111 | max_epochs = 500 112 | 113 | # attention plotting during training 114 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 116 | -------------------------------------------------------------------------------- /config/project/fa_as_guide.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/W0.1_attention_guides_dctts/' # contains FA guides 26 | 27 | # Data locations: 28 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 29 | 30 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv') 31 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 32 | waveforms = os.path.join(datadir, 'wavs_trim') 33 | 34 | 35 | input_type = 'phones' ## letters or phones 36 | 37 | ## CMU phones: 38 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 39 | 40 | # Train 41 | max_N = 150 # Maximum number of characters/phones 42 | max_T = 264 # Maximum number of mel frames 43 | 44 | turn_off_monotonic_for_synthesis = True 45 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 46 | 47 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 48 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 49 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 50 | 51 | 52 | 53 | # signal processing 54 | trim_before_spectrogram_extraction = 0 55 | vocoder = 'griffin_lim' 56 | sr = 22050 # Sampling rate. 57 | n_fft = 2048 # fft points (samples) 58 | frame_shift = 0.0125 # seconds 59 | frame_length = 0.05 # seconds 60 | hop_length = int(sr * frame_shift) 61 | win_length = int(sr * frame_length) 62 | prepro = True # don't extract spectrograms on the fly 63 | full_dim = n_fft//2+1 64 | n_mels = 80 # Number of Mel banks to generate 65 | power = 1.5 # Exponent for amplifying the predicted magnitude 66 | n_iter = 50 # Number of inversion iterations 67 | preemphasis = .97 68 | max_db = 100 69 | ref_db = 20 70 | 71 | 72 | # Model 73 | r = 4 # Reduction factor. Do not change this. 74 | dropout_rate = 0.05 75 | e = 128 # == embedding 76 | d = 256 # == hidden units of Text2Mel 77 | c = 512 # == hidden units of SSRN 78 | attention_win_size = 3 79 | g = 0.2 ## determines width of band in attention guide 80 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 81 | 82 | ## loss weights : T2M 83 | lw_mel =0.333 84 | lw_bd1 =0.333 85 | lw_att =0.333 86 | lw_t2m_l2 = 0.0 87 | ## : SSRN 88 | lw_mag = 0.5 89 | lw_bd2 = 0.5 90 | lw_ssrn_l2 = 0.0 91 | 92 | 93 | ## validation: 94 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 95 | validation_sentences_to_evaluate = 5 96 | validation_sentences_to_synth_params = 3 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.0001 # Initial learning rate. 
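## The forced-alignment guides above replace the default diagonal guide. For
## reference, the diagonal guide of the DC-TTS paper, using the g set above
## (a sketch; the repo presumably builds its own guides in prepare_attention_guides.py):
import numpy as np
def diagonal_guide(N, T, g=0.2):
    n = np.arange(N).reshape(-1, 1) / float(N)
    t = np.arange(T).reshape(1, -1) / float(T)
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))  # near 0 on the diagonal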
102 | beta1 = 0.5 103 | beta2 = 0.9 104 | epsilon = 0.000001 105 | decay_lr = False 106 | batchsize = {'t2m': 8, 'ssrn': 2} 107 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 108 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 109 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 110 | max_epochs = 500 111 | 112 | # attention plotting during training 113 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 114 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 115 | -------------------------------------------------------------------------------- /config/project/fa_as_target.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | import os 5 | 6 | 7 | ## Take name of this file to be the config name: 8 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 9 | 10 | ## Define place to put outputs relative to this config file's location; 11 | ## supply an absoluate path to work elsewhere: 12 | topworkdir = os.path.realpath(os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'work'))) 13 | 14 | voicedir = os.path.join(topworkdir, config_name) 15 | logdir = os.path.join(voicedir, 'train') 16 | sampledir = os.path.join(voicedir, 'synth') 17 | 18 | ## Change featuredir to absolute path to use existing features 19 | featuredir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 20 | coarse_audio_dir = os.path.join(featuredir, 'mels') 21 | full_mel_dir = os.path.join(featuredir, 'full_mels') 22 | full_audio_dir = os.path.join(featuredir, 'mags') 23 | #attention_guide_dir = os.path.join(featuredir, 'attention_guides') 24 | ## Set this to the empty string ('') to global attention guide 25 | attention_guide_dir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/attention_guides_dctts/' # contains FA matrix 26 | attention_guide_fa = True 27 | 28 | # Data locations: 29 | datadir = '/disk/scratch/cvbotinh/data/BC2013/CB-JE-LCL-FFM-EM/' 30 | 31 | transcript = os.path.join(datadir, 'trimmed_transcript_trainset_with_punctuation_with_all_quotes_CB-EM_less17.csv') 32 | test_transcript = os.path.join(datadir, 'transcript_testset_shuffled_with_punctuation_with_all_quotes.csv') 33 | waveforms = os.path.join(datadir, 'wavs_trim') 34 | 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ## CMU phones: 39 | vocab = ['', '<_END_>', '<_START_>'] + ['', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'] 40 | 41 | # Train 42 | max_N = 150 # Maximum number of characters/phones 43 | max_T = 264 # Maximum number of mel frames 44 | 45 | turn_off_monotonic_for_synthesis = True 46 | sampledir = sampledir + '_' + str(turn_off_monotonic_for_synthesis) 47 | 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | 54 | # signal processing 55 | trim_before_spectrogram_extraction = 0 56 | vocoder = 'griffin_lim' 57 | sr = 22050 # Sampling rate. 58 | n_fft = 2048 # fft points (samples) 59 | frame_shift = 0.0125 # seconds 60 | frame_length = 0.05 # seconds 61 | hop_length = int(sr * frame_shift) 62 | win_length = int(sr * frame_length) 63 | prepro = True # don't extract spectrograms on the fly 64 | full_dim = n_fft//2+1 65 | n_mels = 80 # Number of Mel banks to generate 66 | power = 1.5 # Exponent for amplifying the predicted magnitude 67 | n_iter = 50 # Number of inversion iterations 68 | preemphasis = .97 69 | max_db = 100 70 | ref_db = 20 71 | 72 | 73 | # Model 74 | r = 4 # Reduction factor. Do not change this. 75 | dropout_rate = 0.05 76 | e = 128 # == embedding 77 | d = 256 # == hidden units of Text2Mel 78 | c = 512 # == hidden units of SSRN 79 | attention_win_size = 3 80 | g = 0.2 ## determines width of band in attention guide 81 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 82 | 83 | ## loss weights : T2M 84 | lw_mel =0.333 85 | lw_bd1 =0.333 86 | lw_att =0.333 87 | lw_t2m_l2 = 0.0 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | lw_ssrn_l2 = 0.0 92 | 93 | 94 | ## validation: 95 | validpatt = 'CB-EM-55-6' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SIL 96 | validation_sentences_to_evaluate = 5 97 | validation_sentences_to_synth_params = 3 98 | 99 | 100 | # training scheme 101 | restart_from_savepath = [] 102 | lr = 0.0001 # Initial learning rate. 
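## attention_guide_fa = True flags that attention_guide_dir holds forced-alignment
## matrices rather than penalty masks. A plausible reading of the two attention-loss
## modes, as a numpy sketch (not the repo's exact code):
import numpy as np
def att_loss(A, G, fa_target=False):
    if fa_target:
        return np.abs(A - G).mean()   # FA matrix used as an explicit target
    return (A * G).mean()             # guide used as a penalty on off-diagonal mass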
103 | beta1 = 0.5 104 | beta2 = 0.9 105 | epsilon = 0.000001 106 | decay_lr = False 107 | batchsize = {'t2m': 8, 'ssrn': 2} 108 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 109 | validate_every_n_epochs = 50 ## how often to compute validation score and save speech parameters 110 | save_every_n_epochs = 50 ## as well as 5 latest models, how often to archive a model 111 | max_epochs = 500 112 | 113 | # attention plotting during training 114 | plot_attention_every_n_epochs = 50 ## set to 0 if you do not wish to plot attention matrices 115 | num_sentences_to_plot_attention = 3 ## number of sentences to plot attention matrices for 116 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_01.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_01/' 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 
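## A minimal sketch of how a symbol table like vocab is typically turned into model
## inputs (the repo's own mapping presumably lives in data_load.py):
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = dict(enumerate(vocab))
phones = ['<_START_>', 'hh', 'ax', 'l', 'ow', '<.>', '<_END_>']
encoded = [char2idx[p] for p in phones]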
49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 1024 # fft points (samples) 56 | hop_length = 256 # int(sr * frame_shift) 57 | win_length = 1024 # int(sr * frame_length) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.0002 # Initial learning rate. 96 | beta1 = 0.5 97 | beta2 = 0.9 98 | epsilon = 0.000001 99 | decay_lr = False 100 | batchsize = {'t2m': 16, 'ssrn': 32} 101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 104 | max_epochs = 1000 105 | 106 | # attention plotting during training 107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 109 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_02.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = 
'/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 2048 # fft points (samples) 56 | hop_length = 200 # int(sr * 0.0125) 57 | win_length = 800 # int(sr * 0.05) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = None ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.0002 # Initial learning rate. 
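## These analysis settings differ from settings_01 in G1ABC_01; a quick check of
## the frame rates each implies at sr = 16000:
for name, hop, win, nfft in [('settings_01', 256, 1024, 1024),
                             ('settings_02', 200, 800, 2048)]:
    print('%s: %.1f ms shift, %.1f ms window, n_fft %d'
          % (name, 1000.0 * hop / 16000, 1000.0 * win / 16000, nfft))
## settings_01: 16.0 ms shift, 64.0 ms window; settings_02: 12.5 ms and 50.0 ms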
96 | beta1 = 0.5 97 | beta2 = 0.9 98 | epsilon = 0.000001 99 | decay_lr = False 100 | batchsize = {'t2m': 16, 'ssrn': 32} 101 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 102 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 103 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 104 | max_epochs = 1000 105 | 106 | # attention plotting during training 107 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 108 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 109 | -------------------------------------------------------------------------------- /config/ssw10/G1ABC_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## Audio data downsampled to 16000 Hz and CMU lex transcription 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | # data locations: 29 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 30 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 31 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 32 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 33 | 34 | input_type = 'phones' ## letters or phones 35 | 36 | ### CMU lex version:- 37 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 38 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 39 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 40 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 41 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 42 | 'w', 'y', 'z', 'zh'] 43 | 44 | max_N = 173 # Maximum number of characters. # !!! 45 | max_T = 210 # Maximum number of mel frames. # !!! 46 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 47 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 48 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 
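## max_N and max_T bound the padded text and coarse-mel lengths; longer utterances
## are presumably dropped at load time (cf. data_load.py). A toy version of that
## filter:
toy = [(['ih', 't'] * 5, 120), (['ax'] * 200, 300)]   # (phones, coarse frames)
kept = [(p, t) for (p, t) in toy if len(p) <= max_N and t <= max_T]
assert len(kept) == 1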
49 | 50 | 51 | # signal processing 52 | trim_before_spectrogram_extraction = 0 53 | vocoder = 'griffin_lim' 54 | sr = 16000 # Sampling rate. 55 | n_fft = 2048 # fft points (samples) 56 | hop_length = 200 # int(sr * 0.0125) 57 | win_length = 800 # int(sr * 0.05) 58 | prepro = True # don't extract spectrograms on the fly 59 | full_dim = n_fft//2+1 60 | n_mels = 80 # Number of Mel banks to generate 61 | power = 1.5 # Exponent for amplifying the predicted magnitude 62 | n_iter = 50 # Number of inversion iterations 63 | preemphasis = .97 64 | max_db = 100 65 | ref_db = 20 66 | 67 | 68 | # Model 69 | r = 4 # Reduction factor. Do not change this. 70 | dropout_rate = 0.05 71 | e = 128 # == embedding 72 | d = 256 # == hidden units of Text2Mel 73 | c = 512 # == hidden units of SSRN 74 | attention_win_size = 3 75 | g = 0.2 ## determines width of band in attention guide 76 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 77 | 78 | ## loss weights : T2M 79 | lw_mel = 0.3333 80 | lw_bd1 = 0.3333 81 | lw_att = 0.3333 82 | ## : SSRN 83 | lw_mag = 0.5 84 | lw_bd2 = 0.5 85 | 86 | 87 | ## validation: 88 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 89 | validation_sentences_to_evaluate = 4 90 | validation_sentences_to_synth_params = 4 91 | 92 | 93 | # training scheme 94 | restart_from_savepath = [] 95 | lr = 0.001 # Initial learning rate. 96 | batchsize = {'t2m': 32, 'ssrn': 32} 97 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 98 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 99 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 100 | max_epochs = 1000 101 | 102 | # attention plotting during training 103 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 104 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 105 | -------------------------------------------------------------------------------- /config/ssw10/G1AB_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 
21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | input_type = 'phones' ## letters or phones 37 | 38 | ### CMU lex version:- 39 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 40 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 41 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 42 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 43 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 44 | 'w', 'y', 'z', 'zh'] 45 | 46 | max_N = 173 # Maximum number of characters. # !!! 47 | max_T = 210 # Maximum number of mel frames. # !!! 48 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 49 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 50 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 51 | 52 | 53 | # signal processing 54 | trim_before_spectrogram_extraction = 0 55 | vocoder = 'griffin_lim' 56 | sr = 16000 # Sampling rate. 57 | n_fft = 2048 # fft points (samples) 58 | hop_length = 200 # int(sr * 0.0125) 59 | win_length = 800 # int(sr * 0.05) 60 | prepro = True # don't extract spectrograms on the fly 61 | full_dim = n_fft//2+1 62 | n_mels = 80 # Number of Mel banks to generate 63 | power = 1.5 # Exponent for amplifying the predicted magnitude 64 | n_iter = 50 # Number of inversion iterations 65 | preemphasis = .97 66 | max_db = 100 67 | ref_db = 20 68 | 69 | 70 | # Model 71 | r = 4 # Reduction factor. Do not change this. 72 | dropout_rate = 0.05 73 | e = 128 # == embedding 74 | d = 256 # == hidden units of Text2Mel 75 | c = 512 # == hidden units of SSRN 76 | attention_win_size = 3 77 | g = 0.2 ## determines width of band in attention guide 78 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 79 | 80 | ## loss weights : T2M 81 | lw_mel = 0.3333 82 | lw_bd1 = 0.3333 83 | lw_att = 0.3333 84 | ## : SSRN 85 | lw_mag = 0.5 86 | lw_bd2 = 0.5 87 | 88 | 89 | ## validation: 90 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 91 | validation_sentences_to_evaluate = 4 92 | validation_sentences_to_synth_params = 4 93 | 94 | 95 | # training scheme 96 | restart_from_savepath = [] 97 | lr = 0.001 # Initial learning rate. 
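## This ablation removes attention: with use_external_durations, the encoder
## output can simply be repeated according to the oracle durations. A numpy
## sketch of that upsampling (illustrative, not repo code):
import numpy as np
def expand_by_duration(K, durs):
    return np.repeat(K, durs, axis=0)    # K: (N, d) -> (sum(durs), d)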
98 | batchsize = {'t2m': 32, 'ssrn': 32} 99 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 100 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 101 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 102 | max_epochs = 1000 103 | 104 | # attention plotting during training 105 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 106 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 107 | -------------------------------------------------------------------------------- /config/ssw10/G1BC_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 31 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu.csv' 32 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd.csv' 33 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 34 | 35 | text_encoder_type = 'minimal_feedforward' 36 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 37 | merlin_lab_dim = 416 38 | 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). 
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/G1B_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absoluate path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 
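## This config (like G1BC_03 above) swaps the learned text encoder for
## 'minimal_feedforward' over precomputed Merlin labels (merlin_lab_dim = 416).
## An illustrative sketch of the idea; the array contents and projection width
## are hypothetical:
import numpy as np
lab = np.random.rand(60, 416)      # stand-in for one utterance's (N, 416) label matrix
W = 0.01 * np.random.randn(416, 512)
encoded = np.tanh(lab.dot(W))      # shallow projection in place of the deep encoder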
21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 25 | 26 | 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | text_encoder_type = 'minimal_feedforward' 37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 38 | merlin_lab_dim = 416 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings). Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs.
SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/G1_03.cfg: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/python2 3 | 4 | ## As DCTTS but remove attention 5 | 6 | import os 7 | 8 | ## Take name of this file to be the config name: 9 | config_name = os.path.split(__file__)[-1].split('.')[0] ## remove path and extension 10 | 11 | ## Define place to put outputs relative to this config file's location; 12 | ## supply an absolute path to work elsewhere: 13 | topworkdir = '/disk/scratch/script_project/ssw10/work/' 14 | 15 | voicedir = os.path.join(topworkdir, config_name) 16 | logdir = os.path.join(voicedir, 'train') 17 | sampledir = os.path.join(voicedir, 'synth') 18 | 19 | ## Change featuredir to absolute path to use existing features 20 | featuredir = '/disk/scratch/script_project/ssw10/features/settings_02/' ## NB: use features from settings 2!! 21 | coarse_audio_dir = os.path.join(featuredir, 'mels') 22 | full_mel_dir = os.path.join(featuredir, 'full_mels') 23 | full_audio_dir = os.path.join(featuredir, 'mags') 24 | attention_guide_dir = os.path.join(featuredir, 'attention_guides') ## Set this to the empty string ('') to use a global attention guide 25 | 26 | history_type = 'fractional_position_in_phone' 27 | 28 | 29 | # data locations: 30 | use_external_durations = True 31 | datadirwav = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/' ### '/group/project/cstr2/owatts/data/LJSpeech-1.1/' 32 | transcript = '/disk/scratch/script_project/ssw10/data/LJSpeech-1.1/transcript_cmu_durations_02.csv' 33 | test_transcript = '/disk/scratch/script_project/ssw10/data/transcript_cmulex_hvd_durations_02_noendsil.csv' 34 | waveforms = os.path.join(datadirwav, 'wav_trim16k') 35 | 36 | text_encoder_type = 'minimal_feedforward' 37 | merlin_label_dir = '/disk/scratch/script_project/ssw10/features/from_merlin/labels' 38 | merlin_lab_dim = 416 39 | 40 | input_type = 'phones' ## letters or phones 41 | 42 | ### CMU lex version:- 43 | vocab = ['', '', '<">', "<'>", "<'s>", '<)>', '<,>', '<.>', '<:>', \ 44 | '<;>', '<>', '', '<]>', '<_END_>', '<_START_>', 'aa', 'ae', 'ah', \ 45 | 'ao', 'aw', 'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'er', 'ey', \ 46 | 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', \ 47 | 'oy', 'p', 'pau', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', \ 48 | 'w', 'y', 'z', 'zh'] 49 | 50 | max_N = 173 # Maximum number of characters. # !!! 51 | max_T = 210 # Maximum number of mel frames. # !!! 52 | multispeaker = [] ## list of positions at which to add speaker embeddings to handle multi-speaker training. [] means speaker dependent (no embeddings).
Possible positions: text_encoder_input, text_encoder_towards_end, audio_decoder_input, ssrn_input, audio_encoder_input 53 | n_utts = 0 ## 0 means use all data, other positive integer means select this many sentences from beginning of training set 54 | random_reduction_on_the_fly = True ## Randomly choose shift when performing reduction to get coarse features. 55 | 56 | 57 | # signal processing 58 | trim_before_spectrogram_extraction = 0 59 | vocoder = 'griffin_lim' 60 | sr = 16000 # Sampling rate. 61 | n_fft = 2048 # fft points (samples) 62 | hop_length = 200 # int(sr * 0.0125) 63 | win_length = 800 # int(sr * 0.05) 64 | prepro = True # don't extract spectrograms on the fly 65 | full_dim = n_fft//2+1 66 | n_mels = 80 # Number of Mel banks to generate 67 | power = 1.5 # Exponent for amplifying the predicted magnitude 68 | n_iter = 50 # Number of inversion iterations 69 | preemphasis = .97 70 | max_db = 100 71 | ref_db = 20 72 | 73 | 74 | # Model 75 | r = 4 # Reduction factor. Do not change this. 76 | dropout_rate = 0.05 77 | e = 128 # == embedding 78 | d = 256 # == hidden units of Text2Mel 79 | c = 512 # == hidden units of SSRN 80 | attention_win_size = 3 81 | g = 0.2 ## determines width of band in attention guide 82 | norm = 'layer' ## type of normalisation layers to use: from ['layer', 'batch', None] 83 | 84 | ## loss weights : T2M 85 | lw_mel = 0.3333 86 | lw_bd1 = 0.3333 87 | lw_att = 0.3333 88 | ## : SSRN 89 | lw_mag = 0.5 90 | lw_bd2 = 0.5 91 | 92 | 93 | ## validation: 94 | validpatt = 'LJ050-' ## sentence names containing this substring will be held out of training. In this case we will hold out 50th chapter of LJ. TODO: mention SD vs. SI 95 | validation_sentences_to_evaluate = 4 96 | validation_sentences_to_synth_params = 4 97 | 98 | 99 | # training scheme 100 | restart_from_savepath = [] 101 | lr = 0.001 # Initial learning rate. 
102 | batchsize = {'t2m': 32, 'ssrn': 32} 103 | num_threads = 8 # how many threads get_batch should use to build training batches of data (default: 8) 104 | validate_every_n_epochs = 100 ## how often to compute validation score and save speech parameters 105 | save_every_n_epochs = 100 ## as well as 5 latest models, how often to archive a model 106 | max_epochs = 1000 107 | 108 | # attention plotting during training 109 | plot_attention_every_n_epochs = 0 ## set to 0 if you do not wish to plot attention matrices 110 | num_sentences_to_plot_attention = 0 ## number of sentences to plot attention matrices for 111 | -------------------------------------------------------------------------------- /config/ssw10/README.md: -------------------------------------------------------------------------------- 1 | # Voices for SSW10 experiment 2 | 3 | Instructions for recreating SSW10 voices 4 | 5 | 6 | ## Tools 7 | 8 | 9 | ### DCTTS code 10 | ``` 11 | TOOLDIR=/disk/scratch/script_project/ssw10/tools/ 12 | mkdir -p $TOOLDIR 13 | cd $TOOLDIR 14 | git clone https://github.com/oliverwatts/dc_tts_osw.git dc_tts_osw_A 15 | cd dc_tts_osw_A 16 | ``` 17 | 18 | ### Festival 19 | 20 | ``` 21 | INSTALL_DIR=$TOOLDIR/festival 22 | mkdir -p $INSTALL_DIR 23 | cd $INSTALL_DIR 24 | 25 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz 26 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz 27 | 28 | tar xvf festival-2.4-release.tar.gz 29 | tar xvf speech_tools-2.4-release.tar.gz 30 | 31 | cd speech_tools 32 | ./configure --prefix=$INSTALL_DIR 33 | gmake 34 | 35 | cd ../festival 36 | ./configure --prefix=$INSTALL_DIR 37 | gmake 38 | 39 | cd $INSTALL_DIR 40 | 41 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz 42 | tar xvf festvox_cmu_us_awb_cg.tar.gz 43 | 44 | ## Get lexicons for the english voice: 45 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz 46 | tar xvf festlex_CMU.tar.gz 47 | 48 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz 49 | tar xvf festlex_POSLEX.tar.gz 50 | 51 | ## test 52 | cd $INSTALL_DIR/festival/bin 53 | 54 | ## run the *locally installed* festival (NB: initial ./ is important!) 
55 | ./festival 56 | 57 | festival> (voice_cmu_us_awb_cg) 58 | festival> (utt.save.wave (SayText "If i'm speaking then installation actually went ok.") "test.wav" 'riff) 59 | ``` 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | ## Data 68 | 69 | 70 | ### Download LJSpeech 71 | 72 | ``` 73 | DATADIR=/disk/scratch/script_project/ssw10/data 74 | mkdir -p $DATADIR 75 | 76 | cd $DATADIR 77 | wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 78 | bunzip2 LJSpeech-1.1.tar.bz2 79 | tar xvf LJSpeech-1.1.tar 80 | rm LJSpeech-1.1.tar 81 | ``` 82 | 83 | 84 | ### Phonetise the transcription with Festival + CMU lexicon 85 | 86 | ``` 87 | CODEDIR=/disk/scratch/oliver/dc_tts_osw/ 88 | cd $CODEDIR 89 | python ./script/festival/csv2scm.py -i $DATADIR/LJSpeech-1.1/metadata.csv -o $DATADIR/LJSpeech-1.1/utts.data 90 | 91 | cd $DATADIR/LJSpeech-1.1/ 92 | FEST=$TOOLDIR/festival/festival/bin/festival 93 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_cmulex.scm 94 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp2.csv 95 | 96 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp2.csv > ./transcript_cmu.csv 97 | ``` 98 | 99 | -------------------------------------------------------------------------------- /configuration.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | import os 4 | import imp 5 | import inspect 6 | 7 | 8 | CONFIG_DEFAULTS = [ 9 | ('initialise_weights_from_existing', [], ''), 10 | ('update_weights', [], ''), 11 | ('num_threads', 8, 'how many threads get_batch should use to build training batches of data (default: 8)'), 12 | ('plot_attention_every_n_epochs', 0, 'set to 0 if you do not wish to plot attention matrices'), 13 | ('num_sentences_to_plot_attention', 0, 'number of sentences to plot attention matrices for'), 14 | ('concatenate_query', True, 'Concatenate [R Q] to get audio decoder input, or just take R?'), 15 | ('use_external_durations', False, 'Use externally supplied durations in 6th field of transcript for fixed attention matrix A'), 16 | ('text_encoder_type', 'DCTTS_standard', 'one of DCTTS_standard/none/minimal_feedforward'), 17 | ('merlin_label_dir', '', 'npy format phone labels converted from merlin using process_merlin_label.py'), 18 | ('merlin_lab_dim', 592, ''), 19 | ('bucket_data_by', 'text_length', 'One of audio_length/text_length.
Label length will be used if merlin_label_dir is set and bucket_data_by=="text_length"'), 20 | ('history_type', 'DCTTS_standard', 'DCTTS_standard/fractional_position_in_phone/absolute_position_in_phone/minimal_history'), 21 | ('beta1', 0.9, 'ADAM setting - default value from original dctts repo'), 22 | ('beta2', 0.999, 'ADAM setting - default value from original dctts repo'), 23 | ('epsilon', 0.00000001 , 'ADAM setting - default value from original dctts repo'), 24 | ('decay_lr', True , 'learning rate decay - default value from original dctts repo'), 25 | ('squash_output_t2m', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'), 26 | ('squash_output_ssrn', True, 'apply sigmoid to output - binary divergence loss will be disabled if False'), 27 | ('store_synth_features', False, 'store .npy file of features alongside output .wav file'), 28 | ('turn_off_monotonic_for_synthesis',False, 'turns off FIA mechanism for synthesis, should be False during training'), 29 | ('lw_cdp',0.0,''), 30 | ('lw_ain',0.0,''), 31 | ('lw_aout',0.0,''), 32 | ('attention_guide_fa',False,'use attention guide as target - MSE attention loss'), 33 | ('select_central',False,'use only centre phones from Merlin labels'), 34 | ('MerlinTextEncWithPhoneEmbedding',False,'use Merlin labels and phone embeddings as input of TextEncoder') 35 | ] 36 | 37 | ## Intended to have hp as a module, but this doesn't allow pickling and therefore 38 | ## use in parallel processing. So, convert module into an object of arbitrary type 39 | ## ("Hyperparams") having same attributes: 40 | class Hyperparams(object): 41 | def __init__(self, module_object): 42 | for (key, value) in module_object.__dict__.items(): 43 | if key.startswith('_'): 44 | continue 45 | if inspect.ismodule(value): # e.g. from os imported at top of config 46 | continue 47 | setattr(self, key, module_object.__dict__[key]) 48 | def validate(self): 49 | ''' 50 | Supply defaults for various things of appropriate type if missing -- 51 | TODO: Currently this is just to supply values for variables added later in development. 52 | Should we have some filling in of defaults like this more permanently, or should 53 | everything be explicitly set in a config file?
54 | ''' 55 | for (varname, default_value, help_string) in CONFIG_DEFAULTS: 56 | if not hasattr(self, varname): 57 | setattr(self, varname, default_value) 58 | 59 | 60 | def load_config(config_fname): 61 | config = os.path.abspath(config_fname) 62 | assert os.path.isfile(config), 'Config file %s does not exist'%(config) 63 | settings = imp.load_source('config', config) 64 | hp = Hyperparams(settings) 65 | hp.validate() 66 | return hp 67 | 68 | 69 | 70 | ### https://stackoverflow.com/questions/1325673/how-to-add-property-to-a-class-dynamically 71 | 72 | # class atdict(dict): 73 | # __getattr__= dict.__getitem__ 74 | # __setattr__= dict.__setitem__ 75 | # __delattr__= dict.__delitem__ 76 | -------------------------------------------------------------------------------- /convert_alignment_to_guide.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import math 4 | import glob 5 | import numpy as np 6 | import matplotlib 7 | import matplotlib.pyplot as plt 8 | from scipy.ndimage.filters import gaussian_filter 9 | from libutil import save_floats_as_8bit 10 | import tqdm 11 | from concurrent.futures import ProcessPoolExecutor 12 | 13 | gD = 0.2 14 | gW = 0.1 15 | 16 | DEBUG = False 17 | 18 | def main(file_name,out_file): 19 | 20 | F = np.load(file_name) 21 | F = np.transpose(F) 22 | 23 | ndim, tdim = F.shape # x: encoder (N) / y: decoder (T) 24 | 25 | ## Convert alignment to attention guide 26 | if DEBUG: 27 | D = np.zeros((ndim, tdim), dtype=np.float32) # diagonal guide 28 | W = np.zeros((ndim, tdim), dtype=np.float32) # alignment based guide 29 | 30 | for n_pos in range(ndim): 31 | for t_pos in range(tdim): 32 | 33 | n_pos_new = np.argmax(F[:,t_pos]) # encoder position the forced alignment assigns to this decoder frame 34 | W[n_pos,t_pos] = 1 - np.exp( -(n_pos / float(ndim) - n_pos_new / float(ndim) ) ** 2 / (2 * gW * gW)) 35 | 36 | if DEBUG: 37 | D[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(tdim) - n_pos / float(ndim)) ** 2 / (2 * gD * gD)) 38 | 39 | ## Smooth the alignment based guide 40 | S = gaussian_filter(W, sigma=1) # trying to blur 41 | # needs min max norm here to make sure 0-1 42 | S = ( S - S.min()) / ( S.max() - S.min() ) 43 | 44 | save_floats_as_8bit(S, out_file) 45 | 46 | if DEBUG: 47 | 48 | D = ( D - D.min()) / ( D.max() - D.min() ) 49 | W = ( W - W.min()) / ( W.max() - W.min() ) 50 | 51 | for plot_type in range(0,3): 52 | 53 | ## Visualization 54 | if plot_type==0: 55 | M = F+D # add forced alignment path to help visualisation 56 | elif plot_type == 1: 57 | M = F+W # add forced alignment path to help visualisation 58 | elif plot_type == 2: 59 | M = F+S # add forced alignment path to help visualisation 60 | 61 | fig, ax = plt.subplots() 62 | im = ax.imshow(M) 63 | # plt.title('Diagonal (top), Alignment based (middle), Alignment based smoothed (bottom)', fontsize=8) 64 | fig.colorbar(im,fraction=0.035, pad=0.04) 65 | plt.ylabel('Encoder timestep', fontsize=12) 66 | plt.xlabel('Decoder timestep', fontsize=12) 67 | 68 | if plot_type==0: 69 | plt.title('Diagonal attention guide', fontsize=12) 70 | plt.savefig('attention_guide_diagonal.pdf') 71 | elif plot_type == 1: 72 | plt.title('Forced alignment based attention guide', fontsize=12) 73 | plt.savefig('attention_guide_fa.pdf') 74 | elif plot_type == 2: 75 | plt.title('Forced alignment based attention guide (smoothed)', fontsize=12) 76 | plt.savefig('attention_guide_fa_smooth.pdf') 77 | 78 | plt.show() 79 | 80 | if __name__ == '__main__': 81 | 82 | # Usage: python convert_alignment_to_guide.py fa_matrix_dir fa_guide_dir (directories of input/output .npy files) 83 | 84 | inputdir
= sys.argv[1] 85 | outputdir = sys.argv[2] 86 | ncores = 5 87 | executor = ProcessPoolExecutor(max_workers=ncores) 88 | futures = [] 89 | for file in glob.iglob(inputdir + '/*.npy'): 90 | outfile = outputdir + '/' + file.split('/')[-1] # add separator so outputdir need not end in '/' 91 | futures.append(executor.submit(main, file, outfile)) 92 | 93 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 94 | 95 | -------------------------------------------------------------------------------- /copy_synth_GL.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import glob 8 | 9 | import numpy as np 10 | from utils import spectrogram2wav 11 | from data_load import load_data 12 | import soundfile 13 | from tqdm import tqdm 14 | from configuration import load_config 15 | 16 | from argparse import ArgumentParser 17 | 18 | from libutil import basename, safe_makedir 19 | 20 | def copy_synth_GL(hp, outdir): 21 | 22 | safe_makedir(outdir) 23 | 24 | dataset = load_data(hp, mode="synthesis") 25 | fnames, texts = dataset['fpaths'], dataset['texts'] 26 | bases = [basename(fname) for fname in fnames] 27 | 28 | for base in bases: 29 | print("Working on file %s"%(base)) 30 | mag = np.load(os.path.join(hp.full_audio_dir, base + '.npy')) 31 | wav = spectrogram2wav(hp, mag) 32 | soundfile.write(outdir + "/%s.wav"%(base), wav, hp.sr) 33 | 34 | def main_work(): 35 | 36 | ################################################# 37 | 38 | # ============= Process command line ============ 39 | 40 | a = ArgumentParser() 41 | a.add_argument('-c', dest='config', required=True, type=str) 42 | a.add_argument('-o', dest='outdir', required=True, type=str) 43 | opts = a.parse_args() 44 | 45 | # =============================================== 46 | 47 | hp = load_config(opts.config) 48 | copy_synth_GL(hp, opts.outdir) 49 | 50 | if __name__=="__main__": 51 | 52 | main_work() 53 | -------------------------------------------------------------------------------- /copy_synth_SSRN_GL.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import glob 8 | 9 | import numpy as np 10 | from utils import spectrogram2wav 11 | from data_load import load_data 12 | import soundfile 13 | from tqdm import tqdm 14 | from configuration import load_config 15 | 16 | from argparse import ArgumentParser 17 | 18 | from libutil import basename, safe_makedir 19 | from synthesize import synth_mel2mag, list2batch, restore_latest_model_parameters 20 | from architectures import SSRNGraph 21 | import tensorflow as tf 22 | 23 | def copy_synth_SSRN_GL(hp, outdir): 24 | 25 | safe_makedir(outdir) 26 | 27 | dataset = load_data(hp, mode="synthesis") 28 | fnames, texts = dataset['fpaths'], dataset['texts'] 29 | bases = [basename(fname) for fname in fnames] 30 | mels = [np.load(os.path.join(hp.coarse_audio_dir, base + '.npy')) for base in bases] 31 | lengths = [a.shape[0] for a in mels] 32 | mels = list2batch(mels, 0) 33 | 34 | g = SSRNGraph(hp, mode="synthesize"); print("Graph (ssrn) loaded") 35 | 36 | with tf.Session() as sess: 37 | sess.run(tf.global_variables_initializer()) 38 | ssrn_epoch = restore_latest_model_parameters(sess, hp, 'ssrn') 39 | 40 | print('Run SSRN...') 41 | Z = synth_mel2mag(hp, mels, g, sess) 42 | 43 | for i, mag in enumerate(Z): 44 | print("Working on %s"%(bases[i])) 45 | mag =
mag[:lengths[i]*hp.r,:] ### trim to generated length 46 | wav = spectrogram2wav(hp, mag) 47 | soundfile.write(outdir + "/%s.wav"%(bases[i]), wav, hp.sr) # use bases[i]: 'base' is not defined in this scope 48 | 49 | 50 | 51 | 52 | def main_work(): 53 | 54 | ################################################# 55 | 56 | # ============= Process command line ============ 57 | 58 | a = ArgumentParser() 59 | a.add_argument('-c', dest='config', required=True, type=str) 60 | a.add_argument('-o', dest='outdir', required=True, type=str) 61 | opts = a.parse_args() 62 | 63 | # =============================================== 64 | 65 | hp = load_config(opts.config) 66 | copy_synth_SSRN_GL(hp, opts.outdir) 67 | 68 | if __name__=="__main__": 69 | 70 | main_work() 71 | -------------------------------------------------------------------------------- /doc/festival_install.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Installing Festival 4 | 5 | ## Basic Festival install of Spanish and Scottish voices 6 | ``` 7 | INSTALL_DIR=/afs/some/convenient/directory/festival 8 | 9 | mkdir -p $INSTALL_DIR 10 | cd $INSTALL_DIR 11 | 12 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festival-2.4-release.tar.gz 13 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/speech_tools-2.4-release.tar.gz 14 | 15 | tar xvf festival-2.4-release.tar.gz 16 | tar xvf speech_tools-2.4-release.tar.gz 17 | 18 | cd speech_tools 19 | ./configure --prefix=$INSTALL_DIR 20 | gmake 21 | 22 | cd ../festival 23 | ./configure --prefix=$INSTALL_DIR 24 | gmake 25 | 26 | cd $INSTALL_DIR 27 | 28 | ## Get spanish voice: 29 | wget http://festvox.org/packed/festival/1.4.1/festvox_ellpc11k.tar.gz 30 | tar xvf festvox_ellpc11k.tar.gz 31 | 32 | ## Get scottish english voice: 33 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/voices/festvox_cmu_us_awb_cg.tar.gz 34 | tar xvf festvox_cmu_us_awb_cg.tar.gz 35 | 36 | ## Get lexicons for the english voice: 37 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_CMU.tar.gz 38 | tar xvf festlex_CMU.tar.gz 39 | 40 | wget http://www.cstr.ed.ac.uk/downloads/festival/2.4/festlex_POSLEX.tar.gz 41 | tar xvf festlex_POSLEX.tar.gz 42 | 43 | ## test 44 | cd $INSTALL_DIR/festival/bin 45 | 46 | ## run the *locally installed* festival (NB: initial ./ is important!)
47 | ./festival 48 | festival> (voice_el_diphone ) 49 | festival> (SayText "La rica salsa canaria se llama mojo pic'on.") 50 | 51 | festival> (voice_cmu_us_awb_cg) 52 | festival> (SayText "If i'm speaking then installation actually went ok.") 53 | 54 | 55 | ## synthesise to file: 56 | 57 | (utt.save.wave (SynthText "La rica salsa canaria se llama mojo pic'on.") "/path/to/output/file.wav" 'riff) 58 | ``` 59 | 60 | 61 | ## Combilex installation 62 | 63 | Given the file `combilex.tar` which contains 3 combilex surface form lexicons (gam, rpx, edi), install like this: 64 | 65 | ``` 66 | cp combilex.tar $INSTALL_DIR/festival/ 67 | cd $INSTALL_DIR/festival/ 68 | tar xvf combilex.tar 69 | ``` 70 | 71 | For processing the Nancy data, which contains a French word with a nasalised vowel present in the lexicon but not the phoneset definition, I needed to edit `$INSTALL_DIR/festival/lib/combilex_phones.scm` and add the line: 72 | 73 | ``` 74 | (o~ + l 2 3 + n 0 0) ;; added missing nasalised vowel 75 | ``` 76 | 77 | after the line: 78 | 79 | ``` 80 | (@U + d 2 3 + 0 0 0) ;ou 81 | ``` 82 | 83 | 84 | ## Cleaning up 85 | 86 | ``` 87 | cd $INSTALL_DIR 88 | rm *.tar.gz 89 | 90 | cd $INSTALL_DIR/festival 91 | rm *.tar 92 | ``` 93 | 94 | 95 | ## Note for UoE users 96 | 97 | If the installation is run through SSH, make sure you are on an *actual* machine and not just hare or bruegel, as these are just gateways and won't have a C compiler installed. -------------------------------------------------------------------------------- /doc/recipe_WaveRNN.md: -------------------------------------------------------------------------------- 1 | 2 | ## DCTTS + WaveRNN 3 | 4 | To generate DCTTS samples using WaveRNN, set the following flag in your config file: 5 | ``` 6 | store_synth_features = True 7 | ``` 8 | and run the normal DCTTS synthesis script: 9 | ``` 10 | cd ophelia 11 | dctts_synth_dir='/tmp/dctts_synth_dir/' 12 | ./util/submit_tf.sh synthesize.py -c config/lj_tutorial.cfg -N 5 -odir ${dctts_synth_dir} 13 | ``` 14 | 15 | This saves the generated magnitude files (.npy) and Griffin-Lim wavefiles in the directory dctts_synth_dir. 16 | 17 | To generate WaveRNN wavefiles from these magnitude files: 18 | ``` 19 | cd Tacotron 20 | wavernn_synth_dir='/tmp/wavernn_synth_dir/' 21 | python synthesize_dctts_wavernn.py -i ${dctts_synth_dir} -o ${wavernn_synth_dir} 22 | ``` 23 | 24 | ## Notes on Tacotron+WaveRNN code installation 25 | 26 | ``` 27 | git clone https://github.com/cassiavb/Tacotron.git 28 | cd Tacotron/ 29 | virtualenv --distribute --python=/usr/bin/python3.6 env 30 | source env/bin/activate 31 | pip install --upgrade pip 32 | pip install torch torchvision 33 | pip install -r requirements.txt 34 | pip install numba==0.48 35 | ``` 36 | 37 | -------------------------------------------------------------------------------- /doc/recipe_nancy.md: -------------------------------------------------------------------------------- 1 | 2 | # Blizzard Nancy corpus preparation and voice building 3 | 4 | For the base voice for ICPhS, in the first instance.
5 | 6 | 7 | ``` 8 | ### get the (publicly downloadable) data from CSTR datawww:- 9 | mkdir /group/project/cstr2/owatts/data/nancy/ 10 | cd /group/project/cstr2/owatts/data/nancy/ 11 | mkdir original 12 | 13 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/wavn.tgz ./original/ 14 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.ssil.zip ./original/ 15 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/lab.zip ./original/ 16 | cp /afs/inf.ed.ac.uk/group/cstr/datawww/blizzard2011/lessac/prompts.data ./original/ 17 | 18 | ### compare checksums with published ones: 19 | md5sum ./original/* 20 | 0a4860a69bca56d7e9f8170306ff3709 ./original/lab.ssil.zip 21 | aeae7916d881a8eef255a6fe05e77e77 ./original/lab.zip 22 | 650b44f7252aed564d190b76a98cb490 ./original/prompts.data 23 | bb2a80dd1423f87ba12d2074af8e7a3f ./original/wavn.tgz 24 | 25 | ### ...all OK. 26 | 27 | cd ./original 28 | tar xvf wavn.tgz 29 | unzip lab.ssil.zip 30 | unzip lab.zip 31 | 32 | rm *.zip 33 | rm *.tgz 34 | 35 | ### information about the data:- 36 | http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/: 37 | 38 | * prompts.data - File with all of the prompt texts in filename order. 39 | * corrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as originally recorded, after some labels produced by the Lessac Front-end were corrected to reflect what the voice actor actually said. 40 | * uncorrected.gui - File with all of the prompts in Lesseme labeled format in the order of the Nancy corpus as produced by the Lessac Front-end from the prompts.data file without correction to the labels for what the voice actor actually said. 41 | * lab.ssil.zip - Zipped set of files with Lesseme labels that include the result of automated segmentation of the Lesseme labels in the corrected.gui file before the ssil label is collapsed into the preceding or following label. 
42 | * lab.zip 43 | 44 | cd /disk/scratch/oliver/dc_tts_osw_clean 45 | mkdir /group/project/cstr2/owatts/data/nancy/derived/ 46 | python ./script/normalise_level.py -i /group/project/cstr2/owatts/data/nancy/original/wavn/ \ 47 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ -ncores 25 48 | 49 | ./util/submit_tf_cpu.sh ./script/split_speech.py \ 50 | -w /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ \ 51 | -o /group/project/cstr2/owatts/data/nancy/derived/wav_trim/ -dB 30 -ncores 25 -trimonly 52 | 53 | rm -r /group/project/cstr2/owatts/data/nancy/derived/wav_norm/ 54 | 55 | 56 | 57 | ## transcription (needed to add o~ to combilex rpx phoneset in Festival):- 58 | 59 | ## use existing scheme format transcript:- 60 | cp /group/project/cstr2/owatts/data/nancy/original/prompts.data /group/project/cstr2/owatts/data/nancy/derived/utts.data 61 | cd /group/project/cstr2/owatts/data/nancy/derived/ 62 | 63 | CODEDIR=/disk/scratch/oliver/dc_tts_osw_clean 64 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival 65 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm 66 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 67 | 68 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 69 | 70 | 71 | 72 | ### get phone list to add to config: 73 | python ./script/check_transcript.py -i /group/project/cstr2/owatts/data/nancy/derived/transcript.csv -phone 74 | 75 | 76 | 77 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy_01.cfg -ncores 25 78 | 79 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy_01.cfg -ncores 25 80 | 81 | 82 | ## train 83 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m t2m 84 | ./util/submit_tf.sh ./train.py -c config/nancy_01.cfg -m ssrn 85 | ``` -------------------------------------------------------------------------------- /doc/recipe_nancy2nick.md: -------------------------------------------------------------------------------- 1 | 2 | # Very naive speaker adaptation to convert Nancy to Nick 3 | 4 | The simplest way to train on a small database is to fine tune 5 | a speaker-dependent voice to the new database. This works 6 | surprisingly well even where the base voice is of a different 7 | sex and accent to the target speaker, as this example shows.
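The warm-starting used in this recipe is driven by a handful of hyperparameters whose defaults live in `CONFIG_DEFAULTS` in `configuration.py` (`initialise_weights_from_existing` and `update_weights` both default to `[]`; `restart_from_savepath` appears in the configs). The fragment below is only a sketch of the general shape of a fine-tuning config -- the checkpoint path is a hypothetical placeholder and the exact semantics of `update_weights` are assumed here, so check `nancy2nick_01.cfg` and `nancy2nick_02.cfg` for the real settings:

```
## Hypothetical fine-tuning fragment (illustrative only -- see the real nancy2nick configs):
initialise_weights_from_existing = ['/path/to/work/nancy_01/train-t2m/']  ## assumed: base-voice checkpoint(s) to load before training starts
update_weights = []  ## default [] from configuration.py; presumed to restrict updates to the named weights, freezing the rest (e.g. the text encoder)
restart_from_savepath = []  ## leave empty when warm-starting from another voice rather than resuming training
```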
8 | 9 | 10 | ## Prepare nick data 11 | 12 | 13 | 14 | We will use a version of the nick data which has been downsampled to 15 | 16kHz with sox: 16 | 17 | ``` 18 | /afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/ 19 | ``` 20 | 21 | It was converted from the 48kHz version here: 22 | 23 | ``` 24 | /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/wav/ 25 | ``` 26 | 27 | ### waveforms 28 | ``` 29 | INDIR=/afs/inf.ed.ac.uk/group/cstr/projects/nst/cvbotinh/SCRIPT/ICPhS19/samples/second_submission/natural_16k/ 30 | OUTDIR=/group/project/cstr2/owatts/data/nick16k/ 31 | 32 | python ./script/normalise_level.py -i $INDIR -o $OUTDIR/wav_norm/ -ncores 25 33 | 34 | ./util/submit_tf_cpu.sh ./script/split_speech.py -w $OUTDIR/wav_norm/ -o $OUTDIR/wav_trim/ -dB 30 -ncores 25 -trimonly 35 | ``` 36 | 37 | ### transcript 38 | 39 | Gather texts: 40 | 41 | ``` 42 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/herald_* ; do 43 | BASE=`basename $FNAME .txt` ; 44 | TEXT=`cat $FNAME` ; 45 | echo "${BASE}||${TEXT}" ; 46 | done > /group/project/cstr2/owatts/data/nick16k/metadata.csv 47 | 48 | 49 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/hvd_* ; do 50 | BASE=`basename $FNAME .txt` ; 51 | TEXT=`cat $FNAME` ; 52 | echo "${BASE}||${TEXT}" ; 53 | done > /group/project/cstr2/owatts/data/nick16k/metadata_hvd.csv 54 | 55 | 56 | for FNAME in /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/mrt_* ; do 57 | BASE=`basename $FNAME .txt` ; 58 | TEXT=`cat $FNAME` ; 59 | echo "${BASE}||${TEXT}" ; 60 | done > /group/project/cstr2/owatts/data/nick16k/metadata_mrt.csv 61 | ``` 62 | 63 | Phonetise: 64 | 65 | ``` 66 | CODEDIR=`pwd` 67 | DATADIR=/group/project/cstr2/owatts/data/nick16k/ 68 | python ./script/festival/csv2scm.py -i $DATADIR/metadata.csv -o $DATADIR/utts.data 69 | 70 | cd $DATADIR/ 71 | FEST=/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/tool/festival/festival/src/main/festival 72 | SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm 73 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 74 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 75 | 76 | 77 | cd $CODEDIR 78 | for TESTSET in mrt hvd ; do 79 | mkdir /group/project/cstr2/owatts/data/nick16k/test_${TESTSET} 80 | python ./script/festival/csv2scm.py -i $DATADIR/metadata_${TESTSET}.csv -o $DATADIR/test_${TESTSET}/utts.data 81 | done 82 | 83 | 84 | for TESTSET in mrt hvd ; do 85 | cd /group/project/cstr2/owatts/data/nick16k/test_${TESTSET} 86 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 87 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript.csv 88 | cp transcript.csv ../transcript_${TESTSET}.csv 89 | done 90 | 91 | ``` 92 | 93 | 94 | 95 | 96 | ### features 97 | ``` 98 | ./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/nancy2nick_01.cfg -ncores 15 99 | ./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/nancy2nick_01.cfg -ncores 15 100 | ``` 101 | 102 | 103 | ## training 104 | 105 | Config `nancy2nick_01` updates all weights pretrained on the Nancy data: 106 | 107 | ``` 108 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m t2m 109 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_01.cfg -m ssrn 110 | ``` 111 | 112 | Config `nancy2nick_02` updates all weights pretrained on the Nancy data, except 113 |
those of textencoder which are kept frozen: 114 | 115 | ``` 116 | ./util/submit_tf.sh ./train.py -c ./config/nancy2nick_02.cfg -m t2m 117 | ``` 118 | 119 | Previously-trained SSRN can be softlinked without retraining: 120 | ``` 121 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancy2nick_02/train-ssrn/ 122 | ``` 123 | 124 | ## Synthesis 125 | ``` 126 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_01.cfg 127 | ./util/submit_tf.sh ./synthesize.py -c config/nancy2nick_02.cfg 128 | ``` 129 | 130 | -------------------------------------------------------------------------------- /doc/recipe_nancyplusnick.md: -------------------------------------------------------------------------------- 1 | 2 | # Train on Nancy + Nick 3 | 4 | 5 | 6 | ## Combine Nancy & Nick data already used for nancy_01 and nancy2nick_* 7 | 8 | 9 | Combine transcripts, adding speaker codes: 10 | ``` 11 | DATADIR=/group/project/cstr2/owatts/data/nick_plus_nancy 12 | mkdir $DATADIR 13 | 14 | grep -v ^$ /group/project/cstr2/owatts/data/nick16k/transcript.csv | awk '{print $0"|nick"}' > $DATADIR/transcript.csv 15 | grep -v ^$ /group/project/cstr2/owatts/data/nancy/derived/transcript.csv | awk '{print $0"|nancy"}' | grep -v NYT00 >> $DATADIR/transcript.csv 16 | 17 | # (remove empty lines and NYT00 section for which attention guides were not made) 18 | 19 | cp /group/project/cstr2/owatts/data/nick16k/transcript_{mrt,hvd}.csv $DATADIR 20 | ``` 21 | 22 | Combine acoustic features and attention guides: 23 | 24 | ``` 25 | mkdir -p work/nancyplusnick_01/data/{attention_guides,full_mels,mels,mags}/ 26 | for SUBDIR in attention_guides full_mels mels mags ; do 27 | for VOICE in nancy2nick_01 nancy_01 ; do 28 | ln -s ${PWD}/work/$VOICE/data/$SUBDIR/* work/nancyplusnick_01/data/$SUBDIR/ ; 29 | done 30 | done 31 | ``` 32 | 33 | 34 | 35 | Prepare config file and train: 36 | 37 | ``` 38 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_01.cfg -m t2m 39 | ``` 40 | 41 | Previously-trained SSRN can be softlinked without retraining: 42 | ``` 43 | cp -rs `pwd`/work/nancy2nick_01/train-ssrn/ `pwd`/work/nancyplusnick_01/train-ssrn/ 44 | ``` 45 | 46 | 47 | Synth 48 | 49 | ``` 50 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nick 51 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_01.cfg -N 10 -speaker nancy 52 | ``` 53 | 54 | 55 | 56 | 57 | Fine tune on nick only: 58 | 59 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_02.cfg -m t2m 60 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_02/train-ssrn/ 61 | 62 | 63 | 64 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_02.cfg -N 10 -speaker nick 65 | 66 | 67 | 68 | set attention loss weight very high: 69 | 70 | 71 | ``` 72 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_03.cfg -m t2m 73 | cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_03/train-ssrn/ 74 | 75 | 76 | 77 | ./util/submit_tf.sh ./synthesize.py -c ./config/nancyplusnick_03.cfg -N 10 -speaker nick 78 | 79 | 80 | ``` 81 | 82 | 83 | 84 | Try learning channel contributions for each speaker: 85 | 86 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_04_lcc.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ ./work/nancyplusnick_04_lcc/train-ssrn/ 87 | 88 | 89 | 90 | 91 | 92 | Try learning channel contributions for each speaker + SD-phone embeddings: 93 | 94 | ./util/submit_tf.sh ./train.py -c ./config/nancyplusnick_05_lcc_sdpe.cfg -m t2m ; cp -rs $PWD/work/nancy2nick_01/train-ssrn/ 
./work/nancyplusnick_05_lcc_sdpe/train-ssrn/ 95 | 96 | 97 | -------------------------------------------------------------------------------- /doc/recipe_project.md: -------------------------------------------------------------------------------- 1 | ## Required tools 2 | 3 | ``` 4 | git clone -b project https://github.com/CSTR-Edinburgh/ophelia.git 5 | git clone https://github.com/AvashnaGovender/Merlyn.git 6 | git clone https://github.com/AvashnaGovender/Tacotron.git 7 | ``` 8 | 9 | ## DCTTS + WaveRNN 10 | 11 | See recipe [here](recipe_WaveRNN.md). 12 | 13 | ## Attention experiments 14 | 15 | ### Obtaining forced alignment labels: 16 | 17 | Step 5a - Run forced alignment in: 18 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md 19 | 20 | ### FA as target 21 | 22 | Convert forced alignment labels to the forced alignment matrix: 23 | 24 | Step 6 - Get durations and create guides: 25 | https://github.com/AvashnaGovender/Tacotron/blob/master/PAG_recipe.md 26 | 27 | To use FA as target in DCTTS see config file: 28 | [fa_as_target.cfg](../config/project/fa_as_target.cfg) 29 | 30 | ### FA as guides 31 | 32 | Create attention guides from the forced alignment matrices: 33 | 34 | ``` 35 | cd ophelia/ 36 | python convert_alignment_to_guide.py fa_matrix_dir fa_guide_dir 37 | ``` 38 | 39 | To use FA as guide in DCTTS see config file: 40 | [fa_as_guide.cfg](../config/project/fa_as_guide.cfg) 41 | 42 | ### FA as attention 43 | 44 | Add phone-level durations to transcript.csv using the forced alignment matrices: 45 | 46 | ``` 47 | cd ophelia/ 48 | python add_duration_to_transcript.py fa_matrix_dir transcript_file new_transcript_file 49 | ``` 50 | 51 | To use FA as attention in DCTTS see config file: 52 | [fa_as_attention.cfg](../config/project/fa_as_attention.cfg) 53 | 54 | ## Text Encoder experiments 55 | 56 | ### Labels -/+ TE 57 | 58 | Convert state labels to 416-dimensional normalised label features (needs state labels and question file): 59 | 60 | ``` 61 | cd Merlyn/ 62 | python scripts/prepare_inputs.py 63 | ``` 64 | 65 | To use Labels-TE in DCTTS see config file: 66 | [labels_minus_te.cfg](../config/project/labels_minus_te.cfg) 67 | 68 | To use Labels+TE in DCTTS see config file: 69 | [labels_plus_te.cfg](../config/project/labels_plus_te.cfg) 70 | 71 | To use C-Labels+TE in DCTTS see config file: 72 | [c-labels_plus_te.cfg](../config/project/c-labels_plus_te.cfg) 73 | 74 | ### PE&Labels + TE 75 | 76 | Create a new transcription file with the phoneme sequence taken from labels, by replacing the phone sequence of the transcript file with the phone sequence from HTS-style labels: 77 | ``` 78 | cd ophelia/ 79 | ./labels2tra.sh labels_dir transcript_file new_transcript_file 80 | ``` 81 | 82 | To use PE&Labels+TE set MerlinTextEncWithPhoneEmbedding to True in the config file. 83 | 84 | ## Gross error detection experiments 85 | 86 | To calculate CDP, Ain and Aout: 87 | ``` 88 | cd ophelia/ 89 | python calculate_CDP_Ain_Aout.py attention_matrix.npy 90 | ``` 91 | 92 | ## FIA experiments 93 | 94 | To generate without FIA (forcibly incremental attention) set turn_off_monotonic_for_synthesis to True in the config file.
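For reference, the FIA switch is an ordinary config-file assignment; the flag name, its default of `False`, and the warning about training come straight from `CONFIG_DEFAULTS` in `configuration.py`:

```
turn_off_monotonic_for_synthesis = True  ## default False; disables the FIA mechanism at synthesis time -- should stay False during training
```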
95 | 96 | ## Tacotron experiments 97 | 98 | See readme in Tacotron repository: https://github.com/AvashnaGovender/Tacotron 99 | -------------------------------------------------------------------------------- /doc/recipe_semisupervised.md: -------------------------------------------------------------------------------- 1 | ## Semisupervised training 2 | 3 | ### Babbler training 4 | 5 | Train 'babbler' (300 epochs only): 6 | 7 | ``` 8 | ./util/submit_tf.sh ./train.py -c ./config/lj_03.cfg -m babbler 9 | ``` 10 | 11 | Note that this wasn't implemented when I trained the voice before - hope it works OK: 12 | 13 | ``` 14 | bucket_data_by = 'audio_length' 15 | ``` 16 | 17 | In any case, text in transcript is ignored when training babbler. 18 | 19 | Copy existing SSRN: 20 | 21 | ``` 22 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_03/ 23 | ``` 24 | 25 | Synthesise by babbling: 26 | 27 | ``` 28 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_03.cfg -babble 29 | ``` 30 | 31 | (Note: all sentences are currently seeded with the same start (all zeros from padding) so all babbled outputs will be identical) 32 | 33 | 34 | ### Fine tuning 35 | 36 | Fine tune with text as conventional model on transcribed subset (1000 sentences) of the data: 37 | 38 | ``` 39 | ./util/submit_tf.sh ./train.py -c ./config/lj_05.cfg -m t2m 40 | 41 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_05/ 42 | 43 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_05.cfg -N 10 44 | ``` 45 | 46 | ### Baseline 47 | 48 | Compare training from scratch on 1000 sentences: 49 | 50 | ``` 51 | ./util/submit_tf.sh ./train.py -c ./config/lj_04.cfg -m t2m 52 | 53 | cp -rs /disk/scratch/oliver/dc_tts_osw_clean/work/lj_02/train-ssrn /disk/scratch/oliver/dc_tts_osw_clean/work/lj_04/ 54 | 55 | ./util/submit_tf.sh ./synthesize.py -c ./config/lj_04.cfg -N 10 56 | 57 | ``` 58 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /fig/aaa: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /fig/attention.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/attention.gif -------------------------------------------------------------------------------- /fig/training_curves.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/fig/training_curves.png -------------------------------------------------------------------------------- /labels2tra.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # (bash is required for the ${newps:4:-4} substring expansion used below) 3 | # Replace phone sequence of transcript file with phone sequence from HTS style labels 4 | # Usage: ./labels2tra.sh labels_dir transcript_file new_transcript_file 5 | 6 | labelsdir=$1 7 | trafile=$2 8 | newtrafile=$3 9 | 10 | cat $trafile | while IFS=$'|' read -r file nada text ps 11 | do 12 | 13 | grep -r "\[2\]" $labelsdir/$file.lab | sed 's/\+.*//' | sed 's/.*-//' > ~/tmp/test.txt 14 | 15 | newps=`cat ~/tmp/test.txt | tr '\n' ' '` 16 | 17 | # To print start and end 18 | echo $file"||"$text"|<_START_> "${newps:4:-4}"<_END_>" 19 | 20 | done >
$newtrafile 21 | -------------------------------------------------------------------------------- /libutil.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | 4 | 5 | import os 6 | import re 7 | import numpy as np 8 | import codecs 9 | 10 | 11 | 12 | def safe_makedir(dir): 13 | if not os.path.isdir(dir): 14 | os.makedirs(dir) 15 | 16 | def writelist(seq, fname): 17 | path, _ = os.path.split(fname) 18 | safe_makedir(path) 19 | f = codecs.open(fname, 'w', encoding='utf8') 20 | f.write('\n'.join(seq) + '\n') 21 | f.close() 22 | 23 | def readlist(fname): 24 | f = codecs.open(fname, 'r', encoding='utf8') 25 | data = f.readlines() 26 | f.close() 27 | data = [line.strip('\n') for line in data] 28 | data = [l for l in data if l != ''] 29 | return data 30 | 31 | def read_norm_data(fname, stream_names): 32 | out = {} 33 | vals = np.loadtxt(fname) 34 | mean_ix = 0 35 | for stream in stream_names: 36 | std_ix = mean_ix + 1 37 | out[stream] = (vals[mean_ix], vals[std_ix]) 38 | mean_ix += 2 39 | return out 40 | 41 | def makedirecs(direcs): 42 | for direc in direcs: 43 | if not os.path.isdir(direc): 44 | os.makedirs(direc) 45 | 46 | def basename(fname): 47 | path, name = os.path.split(fname) 48 | base = re.sub('\.[^\.]+\Z','',name) 49 | return base 50 | 51 | get_basename = basename # alias 52 | def get_speech(infile, dimension): 53 | f = open(infile, 'rb') 54 | speech = np.fromfile(f, dtype=np.float32) 55 | f.close() 56 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension) 57 | speech = speech.reshape((-1, dimension)) 58 | return speech 59 | 60 | def put_speech(m_data, filename): 61 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 62 | fid = open(filename, 'wb') 63 | m_data.tofile(fid) 64 | fid.close() 65 | return 66 | 67 | def save_floats_as_8bit(data, fname): 68 | ''' 69 | Lossily store data in range [0, 1] with 8 bit resolution 70 | ''' 71 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max()) 72 | 73 | maxval = np.iinfo(np.uint8).max 74 | data_scaled = (data * maxval).astype(np.uint8) 75 | np.save(fname, data_scaled) 76 | 77 | def read_floats_from_8bit(fname): 78 | maxval = np.iinfo(np.uint8).max 79 | data = (np.load(fname).astype(np.float32)) / maxval 80 | assert (data.max() <= 1.0) and (data.min() >= 0.0), (data.min(), data.max()) 81 | return data 82 | 83 | -------------------------------------------------------------------------------- /logger_setup.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import sys 4 | import subprocess 5 | import socket 6 | import numpy 7 | import tensorflow 8 | from libutil import safe_makedir 9 | 10 | def logger_setup(logdir): 11 | 12 | safe_makedir(logdir) 13 | 14 | ## Get new unique named logfile for each run: 15 | i = 1 16 | while True: 17 | logfile = os.path.join(logdir, 'log_{:06d}.txt'.format(i)) 18 | if not os.path.isfile(logfile): 19 | break 20 | else: 21 | i += 1 22 | 23 | logger = logging.getLogger() 24 | logger.setLevel(logging.DEBUG) 25 | 26 | formatter = logging.Formatter('%(asctime)s | %(threadName)-3.3s | %(levelname)-1.1s | %(message)s') 27 | 28 | fh = logging.FileHandler(logfile) 29 | fh.setLevel(logging.DEBUG) 30 | fh.setFormatter(formatter) 31 | logger.addHandler(fh) 32 | 33 | ch = logging.StreamHandler() 34 | ch.setLevel(logging.DEBUG) 35 | ch.setFormatter(formatter) 36 | logger.addHandler(ch) 
37 | 38 | logger.info('Set up logger to write to console and %s'%(logfile)) 39 | 40 | log_environment_information(logger, logfile) 41 | 42 | 43 | def log_environment_information(logger, logfile): 44 | ### This function's contents adjusted from Merlin (https://github.com/CSTR-Edinburgh/merlin/blob/master/src/run_merlin.py) 45 | ### TODO: other things to log here? 46 | logger.info('Installation information:') 47 | logger.info(' Code directory: '+os.path.abspath(os.path.join(os.path.dirname(os.path.realpath(__file__)), os.pardir))) 48 | logger.info(' PATH:') 49 | env_PATHs = os.getenv('PATH') 50 | if env_PATHs: 51 | env_PATHs = env_PATHs.split(':') 52 | for p in env_PATHs: 53 | if len(p)>0: logger.info(' '+p) 54 | logger.info(' LD_LIBRARY_PATH:') 55 | env_LD_LIBRARY_PATHs = os.getenv('LD_LIBRARY_PATH') 56 | if env_LD_LIBRARY_PATHs: 57 | env_LD_LIBRARY_PATHs = env_LD_LIBRARY_PATHs.split(':') 58 | for p in env_LD_LIBRARY_PATHs: 59 | if len(p)>0: logger.info(' '+p) 60 | logger.info(' Python version: '+sys.version.replace('\n','')) 61 | logger.info(' PYTHONPATH:') 62 | env_PYTHONPATHs = os.getenv('PYTHONPATH') 63 | if env_PYTHONPATHs: 64 | env_PYTHONPATHs = env_PYTHONPATHs.split(':') 65 | for p in env_PYTHONPATHs: 66 | if len(p)>0: 67 | logger.info(' '+p) 68 | logger.info(' Numpy version: '+numpy.version.version) 69 | logger.info(' Tensorflow version: '+tensorflow.__version__) 70 | #logger.info(' THEANO_FLAGS: '+os.getenv('THEANO_FLAGS')) 71 | #logger.info(' device: '+theano.config.device) 72 | 73 | # Check for the presence of git 74 | ret = os.system('git status > /dev/null') 75 | if ret==0: 76 | logger.info(' Git is available in the working directory:') 77 | git_describe = subprocess.Popen(['git', 'describe', '--tags', '--always'], stdout=subprocess.PIPE).communicate()[0][:-1] 78 | logger.info(' DC_TTS_OSW version: {}'.format(git_describe)) 79 | git_branch = subprocess.Popen(['git', 'rev-parse', '--abbrev-ref', 'HEAD'], stdout=subprocess.PIPE).communicate()[0][:-1] 80 | logger.info(' branch: {}'.format(git_branch)) 81 | git_diff = subprocess.Popen(['git', 'diff', '--name-status'], stdout=subprocess.PIPE).communicate()[0] 82 | if sys.version_info.major >= 3: 83 | git_diff = git_diff.decode('utf-8') 84 | git_diff = git_diff.replace('\t',' ').split('\n') 85 | logger.info(' diff to DC_TTS_OSW version:') 86 | for filediff in git_diff: 87 | if len(filediff)>0: logger.info(' '+filediff) 88 | logger.info(' (all diffs logged in '+os.path.basename(logfile)+'.gitdiff'+')') 89 | os.system('git diff > '+logfile+'.gitdiff') 90 | 91 | logger.info('Execution information:') 92 | logger.info(' HOSTNAME: '+socket.getfqdn()) 93 | logger.info(' USER: '+os.getenv('USER')) 94 | logger.info(' PID: '+str(os.getpid())) 95 | PBS_JOBID = os.getenv('PBS_JOBID') 96 | if PBS_JOBID: 97 | logger.info(' PBS_JOBID: '+PBS_JOBID) 98 | 99 | -------------------------------------------------------------------------------- /objective_measures.py: -------------------------------------------------------------------------------- 1 | 2 | ''' 3 | TODO: logSpecDbDist appropriate? (both mels & mags?) 4 | TODO: compute output length error? 5 | TODO: work out best way of handling the fact that predicted *coarse* features 6 | can correspond to text but be arbitrarily 'out of phase' with reference. 7 | Multiple references? Or compare against full-time resolution reference?
8 | ''' 9 | import logging 10 | from mcd import dtw 11 | import mcd.metrics_fast as mt 12 | 13 | def compute_dtw_error(reference, predictions): 14 | minCostTot = 0.0 15 | framesTot = 0 16 | for (nat, synth) in zip(reference, predictions): 17 | nat, synth = nat.astype('float64'), synth.astype('float64') 18 | minCost, path = dtw.dtw(nat, synth, mt.logSpecDbDist) 19 | frames = len(nat) 20 | minCostTot += minCost 21 | framesTot += frames 22 | mean_score = minCostTot / framesTot 23 | print ('overall LSD = %f (%s frames nat/synth)' % (mean_score, framesTot)) 24 | return mean_score 25 | 26 | def compute_simple_LSD(reference_list, prediction_list): 27 | costTot = 0.0 28 | framesTot = 0 29 | for (synth, nat) in zip(prediction_list, reference_list): 30 | #synth = prediction_tensor[i,:,:].astype('float64') 31 | # len_nat = len(nat) 32 | assert len(synth) == len(nat) 33 | #synth = synth[:len_nat, :] 34 | nat = nat.astype('float64') 35 | synth = synth.astype('float64') 36 | cost = sum([ 37 | mt.logSpecDbDist(natFrame, synthFrame) 38 | for natFrame, synthFrame in zip(nat, synth) 39 | ]) 40 | framesTot += len(nat) 41 | costTot += cost 42 | return costTot / framesTot 43 | 44 | 45 | -------------------------------------------------------------------------------- /plot_loss.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | import fileinput 11 | from argparse import ArgumentParser 12 | 13 | from libutil import readlist 14 | import numpy as np 15 | import pylab as pl 16 | def main_work(): 17 | 18 | ################################################# 19 | 20 | # ======== Get stuff from command line ========== 21 | 22 | a = ArgumentParser() 23 | a.add_argument('-o', dest='outfile', required=True) 24 | a.add_argument('-l', dest='logfile', required=True) 25 | opts = a.parse_args() 26 | 27 | # =============================================== 28 | 29 | log = readlist(opts.logfile) 30 | log = [line.split('|') for line in log] 31 | log = [line[3].strip() for line in log if len(line) >=4] 32 | 33 | #validation = [line.replace('validation epoch ', '') for line in log if line.startswith('validation epoch')] 34 | #train = [line.replace('train epoch ', '') for line in log if line.startswith('validation epoch')] 35 | 36 | validation = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('validation epoch')] 37 | train = [line.split(':')[1].strip().split(' ') for line in log if line.startswith('train epoch')] 38 | validation = np.array(validation, dtype=float) 39 | train = np.array(train, dtype=float) 40 | print train.shape 41 | print validation.shape 42 | 43 | pl.subplot(211) 44 | pl.plot(validation.flatten()) 45 | pl.subplot(212) 46 | pl.plot(train[:,:4]) 47 | pl.show() 48 | if __name__=="__main__": 49 | 50 | main_work() 51 | 52 | -------------------------------------------------------------------------------- /prepare_acoustic_features.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #! 
/usr/bin/env python2 3 | ''' 4 | Based on code by kyubyong park at https://www.github.com/kyubyong/dc_tts 5 | ''' 6 | 7 | from __future__ import print_function 8 | 9 | import os 10 | import sys 11 | import glob 12 | from argparse import ArgumentParser 13 | from concurrent.futures import ProcessPoolExecutor 14 | import numpy as np 15 | 16 | import tqdm 17 | 18 | from libutil import safe_makedir 19 | from configuration import load_config 20 | from utils import load_spectrograms 21 | 22 | def proc(fpath, hp): 23 | 24 | if not os.path.isfile(fpath): 25 | return 26 | 27 | fname, mel, mag, full_mel = load_spectrograms(hp, fpath) 28 | np.save("{}/{}".format(hp.coarse_audio_dir, fname.replace("wav", "npy")), mel) 29 | np.save("{}/{}".format(hp.full_audio_dir, fname.replace("wav", "npy")), mag) 30 | np.save("{}/{}".format(hp.full_mel_dir, fname.replace("wav", "npy")), full_mel) 31 | 32 | 33 | def main_work(): 34 | 35 | ################################################# 36 | 37 | # ============= Process command line ============ 38 | 39 | a = ArgumentParser() 40 | a.add_argument('-c', dest='config', required=True, type=str) 41 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing') 42 | opts = a.parse_args() 43 | 44 | # =============================================== 45 | 46 | hp = load_config(opts.config) 47 | 48 | fpaths = sorted(glob.glob(hp.waveforms + '/*.wav')) 49 | 50 | safe_makedir(hp.coarse_audio_dir) 51 | safe_makedir(hp.full_audio_dir) 52 | safe_makedir(hp.full_mel_dir) 53 | 54 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 55 | futures = [] 56 | for fpath in fpaths: 57 | futures.append(executor.submit( 58 | proc, fpath, hp)) 59 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 60 | 61 | 62 | if __name__=="__main__": 63 | 64 | main_work() 65 | -------------------------------------------------------------------------------- /prepare_attention_guides.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python2 3 | 4 | from __future__ import print_function 5 | 6 | from utils import get_attention_guide 7 | import os 8 | from data_load import load_data 9 | import numpy as np 10 | import tqdm 11 | from concurrent.futures import ProcessPoolExecutor 12 | 13 | from argparse import ArgumentParser 14 | 15 | from libutil import basename, save_floats_as_8bit, safe_makedir 16 | from configuration import load_config 17 | 18 | def proc(fpath, text_length, hp): 19 | 20 | base = basename(fpath) 21 | melfile = hp.coarse_audio_dir + os.path.sep + base + '.npy' 22 | attfile = hp.attention_guide_dir + os.path.sep + base # without '.npy' 23 | if not os.path.isfile(melfile): 24 | print('file %s not found'%(melfile)) 25 | return 26 | speech_length = np.load(melfile).shape[0] 27 | att = get_attention_guide(text_length, speech_length, g=hp.g) 28 | save_floats_as_8bit(att, attfile) 29 | 30 | 31 | def main_work(): 32 | 33 | ################################################# 34 | 35 | # ============= Process command line ============ 36 | 37 | a = ArgumentParser() 38 | a.add_argument('-c', dest='config', required=True, type=str) 39 | a.add_argument('-ncores', default=1, type=int, help='Number of cores for parallel processing') 40 | opts = a.parse_args() 41 | 42 | # =============================================== 43 | 44 | hp = load_config(opts.config) 45 | assert hp.attention_guide_dir 46 | 47 | dataset = load_data(hp) 48 | fpaths, text_lengths = dataset['fpaths'], 
dataset['text_lengths'] 49 | 50 | if hp.merlin_label_dir: 51 | text_lengths = dataset['label_lengths'] 52 | 53 | assert os.path.exists(hp.coarse_audio_dir) 54 | safe_makedir(hp.attention_guide_dir) 55 | 56 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 57 | futures = [] 58 | for (fpath, text_length) in zip(fpaths, text_lengths): 59 | futures.append(executor.submit(proc, fpath, text_length, hp)) 60 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 61 | 62 | 63 | if __name__=="__main__": 64 | 65 | main_work() 66 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.16.4 2 | absl-py==0.6.1 3 | astor==0.7.1 4 | audioread==2.1.6 5 | backports.functools-lru-cache==1.5 6 | backports.weakref==1.0.post1 7 | bashplotlib==0.6.5 8 | cffi==1.11.5 9 | cycler==0.10.0 10 | decorator==4.3.0 11 | enum34==1.1.6 12 | funcsigs==1.0.2 13 | futures==3.2.0 14 | gast==0.2.0 15 | grpcio==1.16.1 16 | h5py==2.8.0 17 | htk-io==0.5 18 | joblib==0.13.0 19 | Keras-Applications==1.0.6 20 | Keras-Preprocessing==1.0.5 21 | kiwisolver==1.0.1 22 | librosa==0.6.2 23 | llvmlite==0.25.0 24 | Markdown==3.0.1 25 | matplotlib==2.2.3 26 | mcd==0.4 27 | mock==2.0.0 28 | numba==0.40.1 29 | pbr==5.1.1 30 | protobuf==3.6.1 31 | pycparser==2.19 32 | pyparsing==2.3.0 33 | python-dateutil==2.7.5 34 | pytz==2018.7 35 | regex==2019.2.3 36 | resampy==0.2.1 37 | scikit-learn==0.20.0 38 | scipy==1.1.0 39 | singledispatch==3.4.0.3 40 | six==1.11.0 41 | SoundFile==0.10.2 42 | subprocess32==3.5.3 43 | tensorboard==1.12.0 44 | tensorflow-gpu==1.12.0 45 | termcolor==1.1.0 46 | tqdm==4.28.1 47 | Werkzeug==0.14.1 48 | -------------------------------------------------------------------------------- /script/add_speaker.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from argparse import ArgumentParser 4 | 5 | 6 | def main_work(): 7 | 8 | a = ArgumentParser() 9 | a.add_argument('-i', dest='infile', required=True) 10 | a.add_argument('-o', dest='outfile', required=True) 11 | opts = a.parse_args() 12 | 13 | outf = open(opts.outfile, 'w') 14 | transcript = open(opts.infile, 'r').read().split('\n') 15 | 16 | for line in transcript: 17 | spk = line.split('_')[0] 18 | outf.writelines(line+'||'+spk+'\n') 19 | outf.close() 20 | 21 | if __name__ == "__main__": 22 | main_work() 23 | -------------------------------------------------------------------------------- /script/festival/csv2scm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | import codecs 12 | 13 | 14 | def main_work(): 15 | 16 | ################################################# 17 | 18 | # ======== Get stuff from command line ========== 19 | 20 | a = ArgumentParser() 21 | a.add_argument('-i', dest='infile', required=True, \ 22 | help= "File in LJ speech transcription format: https://keithito.com/LJ-Speech-Dataset/") 23 | a.add_argument('-o', dest='outfile', required=True, \ 24 | help= "File in Festival utts.data scheme format") 25 | opts = a.parse_args() 26 | 27 | # =============================================== 28 | 29 | f = codecs.open(opts.infile, 'r', encoding='utf8') 30 | 
lines = f.readlines() 31 | f.close() 32 | 33 | f = codecs.open(opts.outfile, 'w', encoding='utf8') 34 | for line in lines: 35 | fields = line.strip('\n\r ').split('|') 36 | assert len(fields) >= 3 37 | name, _, text = fields[:3] 38 | text = text.replace('"', '\\"') 39 | f.write('(%s "%s")\n'%(name, text)) 40 | f.close() 41 | 42 | 43 | 44 | 45 | if __name__=="__main__": 46 | 47 | main_work() 48 | 49 | -------------------------------------------------------------------------------- /script/festival/fix_transcript.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | import sys, codecs, re, regex 9 | 10 | 11 | f = codecs.open(sys.argv[1], encoding='utf8', errors='ignore') 12 | text = f.read() 13 | f.close() 14 | 15 | lines = text.split('\n') 16 | 17 | 18 | all_seps = set() 19 | for line in lines: 20 | phones = line.split('|')[-1].strip('\n\r ').split(' ') 21 | seps = set([phone for phone in phones if phone.startswith('<') and phone.endswith('>')]) 22 | all_seps.update(seps) 23 | #print seps 24 | #print phones 25 | 26 | 27 | 28 | # print all_seps 29 | 30 | badseps = [] 31 | for sep in all_seps: 32 | 33 | if regex.match('\A[\p{P}\p{Z}]+\Z', sep.strip('<>')): 34 | #puncs.append(sep) 35 | pass 36 | elif sep in ["<'s>", '<_END_>', '<_START_>']: 37 | pass 38 | else: 39 | badseps.append(sep) 40 | 41 | for sep in badseps: 42 | text = text.replace(sep, '<>') 43 | 44 | 45 | 46 | 47 | 48 | #bad_strings = sys.argv[2:] 49 | 50 | 51 | # for bad in bad_strings: 52 | # lines = lines.replace('<'+bad+'>', '<>') 53 | 54 | print text.encode('utf8') -------------------------------------------------------------------------------- /script/festival/make_rich_phones.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? 
(item.name x) "#") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s|" (car line)) 78 | (format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | (if (not (member_string 'unilex-rpx (lex.list))) 89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm")))) 90 | 91 | ; (if (not (member_string 'cmudict (lex.list))) 92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm")))) 93 | 94 | 95 | ;(require 'unilex_phones) 96 | ;(lex.select 'unilex-rpx) 97 | 98 | 99 | 100 | 101 | (if (not (member_string 'combilex-rpx (lex.list))) 102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 103 | (lex.select 'combilex-rpx) 104 | (require 'postlex) 105 | (set! postlex_rules_hooks (list postlex_apos_s_check 106 | postlex_intervoc_r 107 | postlex_the_vs_thee 108 | postlex_a 109 | )) 110 | ; (set! postlex_rules_hooks (list 111 | ; postlex_intervoc_r 112 | ; )) 113 | 114 | 115 | 116 | ; (lex.select 'cmudict) 117 | 118 | 119 | 120 | ;(set! utt1 (Utterance Text "Hello there, world, isn't it a nice day?!")) 121 | ;(utt.synth utt1) 122 | ;(print_phones_punc utt1) 123 | 124 | (synth_utts "./utts.data") 125 | 126 | 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /script/festival/make_rich_phones_cmulex.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! 
this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? (item.name x) "pau") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s||" (car line)) 78 | ;(format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | 89 | 90 | 91 | 92 | ; (if (not (member_string 'combilex-rpx (lex.list))) 93 | ; (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 94 | (lex.select 'cmu) 95 | 96 | 97 | (synth_utts "./utts.data") 98 | 99 | 100 | 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /script/festival/make_rich_phones_combirpx_noplex.scm: -------------------------------------------------------------------------------- 1 | 2 | 3 | ; Usage: 4 | ; cd to directory containing ./utts.data, then with a version of Festival with the correct 5 | ; lexicon installed: 6 | 7 | ; FEST=/afs/inf.ed.ac.uk/user/o/owatts/sim2/oliver/tool/festival/festival/bin/festival 8 | ; SCRIPT=/afs/inf.ed.ac.uk/user/o/owatts/repos/dc_tts/script/make_rich_phones.scm 9 | ; $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript.csv 10 | 11 | 12 | 13 | ;;; Taken from build_unitsel: 14 | ;; Is this the last segment in a word [more complicated than you would think] 15 | (define (seg_word_final seg) 16 | "(seg_word_final seg) 17 | Is this segment word final?" 18 | (let ((this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 19 | (silence (car (cadr (car (PhoneSet.description '(silences)))))) 20 | next_seg_word) 21 | (if (item.next seg) 22 | (set! next_seg_word (item.parent (item.relation.parent (item.next seg) 'SylStructure)))) 23 | (if (or (equal? this_seg_word next_seg_word) 24 | (string-equal (item.feat seg "name") silence)) 25 | nil 26 | t))) 27 | 28 | 29 | (define (following_punc seg) 30 | ( set! this_seg_word (item.parent (item.relation.parent seg 'SylStructure))) 31 | ( set! 
this_seg_token (item.relation.next this_seg_word 'Token)) 32 | (if this_seg_token 33 | ( format t "<%s> " (item.name this_seg_token)) 34 | ( format t "<> " ) %% else 35 | ) 36 | ) 37 | 38 | 39 | (define (print_phones_punc utt) 40 | (if (utt.relation.present utt 'Segment) 41 | (begin 42 | ( format t "<_START_> " ) 43 | (mapcar 44 | (lambda (x) 45 | ( if (not (equal? (item.name x) "#") ) ;; TODO: unhardcode silent symbols 46 | (format t "%s " (item.name x) ) 47 | ) 48 | (if (seg_word_final x) 49 | (following_punc x) 50 | ) 51 | ) 52 | (utt.relation.items utt 'Segment)) 53 | ( format t "<_END_>" ) 54 | ) 55 | (format t "Utterance contains no Segments\n")) 56 | nil) 57 | 58 | ;;; Taken from build_unitsel: 59 | ;; Do the linguistic side of synthesis. 60 | (define (utt.synth_toSegment_text utt) 61 | (Initialize utt) 62 | (Text utt) 63 | (Token_POS utt) ;; when utt.synth is called 64 | (Token utt) 65 | (POS utt) 66 | (Phrasify utt) 67 | (Word utt) 68 | (Pauses utt) 69 | (Intonation utt) 70 | (PostLex utt)) 71 | 72 | (define (synth_utts utts_data_file) 73 | (set! uttlist (load utts_data_file t)) 74 | (mapcar 75 | (lambda (line) 76 | (set! utt (utt.synth_toSegment_text (eval (list 'Utterance 'Text (car (cdr line)))))) ;;; Initialize 77 | (format t "___KEEP___%s||" (car line)) 78 | ;(format t "%s|" (car (cdr line)) ) 79 | (format t "%s|" (car (cdr line)) ) 80 | (print_phones_punc utt) 81 | ; (utt.relation.print utt 'Text) 82 | ( format t "\n" ) 83 | t) 84 | uttlist) 85 | ) 86 | 87 | 88 | (if (not (member_string 'unilex-rpx (lex.list))) 89 | (load (path-append lexdir "unilex/" (string-append 'unilex-rpx ".scm")))) 90 | 91 | ; (if (not (member_string 'cmudict (lex.list))) 92 | ; (load (path-append lexdir "cmu/" (string-append 'cmudict-0.4 ".scm")))) 93 | 94 | 95 | ;(require 'unilex_phones) 96 | ;(lex.select 'unilex-rpx) 97 | 98 | 99 | 100 | 101 | (if (not (member_string 'combilex-rpx (lex.list))) 102 | (load (path-append lexdir "combilex/" (string-append 'combilex-rpx ".scm")))) 103 | (lex.select 'combilex-rpx) 104 | ; (require 'postlex) 105 | ; (set! postlex_rules_hooks (list postlex_apos_s_check 106 | ; postlex_intervoc_r 107 | ; postlex_the_vs_thee 108 | ; postlex_a 109 | ; )) 110 | ; (set! postlex_rules_hooks (list 111 | ; postlex_intervoc_r 112 | ; )) 113 | 114 | 115 | 116 | ; (lex.select 'cmudict) 117 | 118 | 119 | 120 | ;(set! 
utt1 (Utterance Text "Hello there, world, isn't it a nice day?!")) 121 | ;(utt.synth utt1) 122 | ;(print_phones_punc utt1) 123 | 124 | (synth_utts "./utts.data") 125 | 126 | 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /script/festival/multi_transcript.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from argparse import ArgumentParser 4 | 5 | 6 | def main_work(): 7 | 8 | a = ArgumentParser() 9 | a.add_argument('-i', dest='infile', required=True) 10 | a.add_argument('-o', dest='outfile', required=True) 11 | opts = a.parse_args() 12 | 13 | o = open(opts.outfile, 'w') 14 | 15 | with open(opts.infile, 'r') as f: 16 | for line in f.readlines()[:-1]: 17 | if line[3] == '_': # if clauses dealing with different length p-numbers (not really applicable for public VCTK) 18 | speaker_id = line[0:3] 19 | elif line[4] == '_': 20 | speaker_id = line[0:4] 21 | elif line[5] == '_': 22 | speaker_id = line[0:5] 23 | else: 24 | print('Something is wrong with the input file - speaker ID cannot be parsed!') 25 | o.write('{}|{}\n'.format(line.rstrip(), speaker_id)) 26 | 27 | o.close() 28 | 29 | if __name__ == "__main__": 30 | main_work() 31 | 32 | -------------------------------------------------------------------------------- /script/get_transcriptions.sh: -------------------------------------------------------------------------------- 1 | 2 | python ./script/festival/csv2scm.py -i $1 -o utts.data 3 | 4 | FEST='/afs/inf.ed.ac.uk/user/s15/s1520337/Documents/festival/festival/bin/festival' 5 | 6 | SCRIPT=./script/festival/make_rich_phones_cmulex.scm 7 | $FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv 8 | 9 | python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./new_transcript.csv 10 | -------------------------------------------------------------------------------- /script/libutil.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import numpy as np 4 | 5 | #### TODO -- this module is duplicated 1 level up -- sort this out! 
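## A minimal usage sketch for load_config() below -- the file name and key here are
## hypothetical examples; real .cfg files under ./config define many more settings.
## Configs are plain Python, exec'd into a dict with Python 2's execfile:
##
##     # example.cfg contains, say:  g = 0.2
##     cfg = load_config('example.cfg')
##     print cfg['config_name'], cfg['g']    # -> example 0.2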
6 | 7 | 8 | def load_config(config_fname): 9 | config = {} 10 | execfile(config_fname, config) 11 | del config['__builtins__'] 12 | _, config_name = os.path.split(config_fname) 13 | config_name = config_name.replace('.cfg','').replace('.conf','') 14 | config['config_name'] = config_name 15 | return config 16 | 17 | 18 | def safe_makedir(dir): 19 | if not os.path.isdir(dir): 20 | os.makedirs(dir) 21 | 22 | def writelist(seq, fname): 23 | f = open(fname, 'w') 24 | f.write('\n'.join(seq) + '\n') 25 | f.close() 26 | 27 | def readlist(fname): 28 | f = open(fname, 'r') 29 | data = f.readlines() 30 | f.close() 31 | return [line.strip('\n') for line in data] 32 | 33 | def read_norm_data(fname, stream_names): 34 | out = {} 35 | vals = np.loadtxt(fname) 36 | mean_ix = 0 37 | for stream in stream_names: 38 | std_ix = mean_ix + 1 39 | out[stream] = (vals[mean_ix], vals[std_ix]) 40 | mean_ix += 2 41 | return out 42 | 43 | 44 | def makedirecs(direcs): 45 | for direc in direcs: 46 | if not os.path.isdir(direc): 47 | os.makedirs(direc) 48 | 49 | def basename(fname): 50 | path, name = os.path.split(fname) 51 | base = re.sub('\.[^\.]+\Z','',name) 52 | return base 53 | 54 | get_basename = basename # alias 55 | def get_speech(infile, dimension): 56 | f = open(infile, 'rb') 57 | speech = np.fromfile(f, dtype=np.float32) 58 | f.close() 59 | assert speech.size % float(dimension) == 0.0,'specified dimension %s not compatible with data'%(dimension) 60 | speech = speech.reshape((-1, dimension)) 61 | return speech 62 | 63 | def put_speech(m_data, filename): 64 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 65 | fid = open(filename, 'wb') 66 | m_data.tofile(fid) 67 | fid.close() 68 | return -------------------------------------------------------------------------------- /script/make_internal_webchart.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys, os 8 | from string import strip 9 | from argparse import ArgumentParser 10 | 11 | def main_work(voice_dirs, names=[], pattern='', outfile='', title=''): 12 | 13 | #for v in voice_dirs: 14 | # print v 15 | #sys.exit('wvswv') 16 | #voice_dirs = opts.d 17 | voice_dirs = [string for string in voice_dirs if os.path.isdir(string)] 18 | 19 | if names: 20 | #names = opts.n 21 | if not len(names) == len(voice_dirs): 22 | print '------' 23 | for name in names: 24 | print name 25 | for v in voice_dirs: 26 | print v 27 | sys.exit('len(names) != len(voice_dirs)') 28 | else: 29 | names = [direc.strip('/').split('/')[-1] for direc in voice_dirs] 30 | print names 31 | 32 | # for i in range(0,len(inputs),2): 33 | # name = inputs[i] 34 | # voice_dir = inputs[i+1] 35 | # names.append(name) 36 | # voice_dirs.append(voice_dir) 37 | 38 | 39 | ################################################# 40 | 41 | ### only keep utts appearing in all conditions 42 | uttnames=[] 43 | all_utts = [] 44 | for voice_dir in voice_dirs: 45 | print voice_dir 46 | print os.listdir(voice_dir) 47 | print '-----' 48 | all_utts.extend(os.listdir(voice_dir)) 49 | 50 | for unique_utt in set(all_utts): 51 | if unique_utt.endswith('.wav'): 52 | if all_utts.count(unique_utt) == len(names): 53 | uttnames.append(unique_utt) 54 | 55 | 56 | if pattern: 57 | uttnames = [name for name in uttnames if pattern in name] 58 | 59 | # for voice_dir in voice_dirs: 60 | # for uttname in os.listdir(voice_dir): 61 | # if 
uttname not in uttnames: 62 | # uttnames.append(uttname) 63 | 64 | 65 | if len(uttnames) == 0: 66 | sys.exit('no utterances found in common!') 67 | 68 | 69 | output = '' 70 | 71 | 72 | if title: 73 | output += '
<h1>' + title + '</h1>\n' 74 | 75 | ## table top and toprow 76 | output += '<table border=1>\n' 77 | output += '<tr>\n' 78 | output += "<td>Condition</td>\n" 79 | output += '\n' 80 | for (name,voice_dir) in zip(names, voice_dirs): 81 | _, voice = os.path.split(voice_dir) 82 | #output += voice 83 | 84 | 85 | output += '<td>%s</td>\n'%(name) 86 | output += '</tr>\n' 87 | 88 | for uttname in sorted(uttnames): 89 | 90 | output += "<tr>\n" 91 | 92 | output += '<td>%s</td>\n'%(uttname.replace(".wav", "")) 93 | for voice_dir in voice_dirs: 94 | 95 | wavename=os.path.join(voice_dir, uttname) 96 | output += '<td>\n' 97 | output += get_audio_control(wavename) 98 | 99 | output += "</td>\n" 100 | output += '</tr>\n' 101 | output += '</table>\n<p>&nbsp;</p>\n' 102 | 103 | 104 | if outfile: 105 | f = open(outfile, 'w') 106 | f.write(output) 107 | f.close() 108 | else: 109 | print output 110 | 111 | 112 | 113 | 114 | def get_audio_control(fname): 115 | return '''<audio controls="controls"><source src="%s" type="audio/wav" /></audio>
\n'''%(fname) 116 | 117 | 118 | 119 | 120 | if __name__=="__main__": 121 | 122 | ################################################# 123 | 124 | # ======== Get stuff from command line ========== 125 | 126 | a = ArgumentParser() 127 | a.add_argument('-o', dest='outfile', default='', type=str, \ 128 | help= "If not given, print to console") 129 | a.add_argument('-d', nargs='+', required=True, help='list of directories with samples') 130 | a.add_argument('-n', nargs='+', required=False, help='list of names -- use directory names if not given') 131 | a.add_argument('-p', dest='pattern', default='', type=str) 132 | a.add_argument('-title', default='', type=str) 133 | 134 | opts = a.parse_args() 135 | 136 | 137 | # =============================================== 138 | 139 | main_work(opts.d, opts.n, opts.pattern, opts.outfile, opts.title) 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /script/munge_nsf_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | 12 | import soundfile as sf 13 | 14 | from concurrent.futures import ProcessPoolExecutor 15 | from functools import partial 16 | from tqdm import tqdm 17 | import subprocess 18 | 19 | import numpy as np 20 | #from generate import read_est_file 21 | 22 | 23 | ## TODO : copied from generate 24 | def read_est_file(est_file): 25 | 26 | with open(est_file) as fid: 27 | header_size = 1 # init 28 | for line in fid: 29 | if line == 'EST_Header_End\n': 30 | break 31 | header_size += 1 32 | ## now check there is at least 1 line beyond the header: 33 | status_ok = False 34 | for (i,line) in enumerate(fid): 35 | if i > header_size: 36 | status_ok = True 37 | if not status_ok: 38 | return np.array([]) 39 | 40 | # Read text: TODO: improve skiprows 41 | data = np.loadtxt(est_file, skiprows=header_size) 42 | data = np.atleast_2d(data) 43 | return data 44 | 45 | 46 | def main_work(): 47 | 48 | ################################################# 49 | 50 | # ======== Get stuff from command line ========== 51 | 52 | a = ArgumentParser() 53 | a.add_argument('-i', dest='indir', required=True) ## mels 54 | a.add_argument('-of', dest='outdir_f', required=True, \ 55 | help= "Put output f0 here: make it if it doesn't exist") 56 | a.add_argument('-om', dest='outdir_m', required=True, \ 57 | help= "Put output mels here: make it if it doesn't exist") 58 | a.add_argument('-f', dest='fzdir', required=True) 59 | # a.add_argument('-framerate', required=False, default=0.005, type=float, help='rate in seconds for F0 track frames') 60 | # a.add_argument('-pattern', default='', \ 61 | # help= "If given, only normalise files whose base contains this substring") 62 | a.add_argument('-ncores', default=1, type=int) 63 | #a.add_argument('-waveformat', default=False, action='store_true', help='call sox to format data (16 bit ).') 64 | 65 | # a.add_argument('-twopass', default=False, action='store_true', help='Run initially on a subset of data to guess sensible limits, then run again. 
Assumes all data is from same speaker.') 66 | opts = a.parse_args() 67 | 68 | # =============================================== 69 | 70 | for direc in [opts.outdir_f, opts.outdir_m]: 71 | if not os.path.isdir(direc): 72 | os.makedirs(direc) 73 | 74 | flist = sorted(glob.glob(opts.indir + '/*.npy')) 75 | 76 | print flist 77 | # print 'Extract with range %s %s'%(min_f0, max_f0) 78 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 79 | futures = [] 80 | for mel_file in flist: 81 | futures.append(executor.submit( 82 | partial(process, mel_file, opts.fzdir, opts.outdir_f, opts.outdir_m))) 83 | return [future.result() for future in tqdm(futures)] 84 | 85 | 86 | def put_speech(m_data, filename): 87 | m_data = np.array(m_data, 'float32') # Ensuring float32 output 88 | fid = open(filename, 'wb') 89 | m_data.tofile(fid) 90 | fid.close() 91 | return 92 | 93 | 94 | 95 | def process(mel_file, fzdir, outdir_f, outdir_m): 96 | _, base = os.path.split(mel_file) 97 | base = base.replace('.npy', '') 98 | 99 | mels = np.load(mel_file) 100 | fz_file = os.path.join(fzdir, base + '.f0') 101 | 102 | fz = read_est_file(fz_file)[:,2] # .reshape(-1,1) 103 | 104 | m,_ = mels.shape 105 | f = fz.shape[0] 106 | 107 | 108 | fz[fz<0.0] = 0.0 109 | 110 | if m > f: 111 | diff = m - f 112 | fz = np.pad(fz, (0,diff), 'constant').reshape(-1,1) 113 | 114 | put_speech(fz, os.path.join(outdir_f, base+'.f0')) 115 | put_speech(mels, os.path.join(outdir_m, base+'.mfbsp')) 116 | 117 | # print fz.shape 118 | # print mels.shape 119 | # print fz 120 | 121 | 122 | 123 | 124 | 125 | # out_file = os.path.join(outdir, base + '.pm') 126 | 127 | # in_wav_file = os.path.join(fzdir, base + '_tmp.wav') ### !!!!! 128 | # cmd = 'sox %s -r 16000 %s '%(wavefile, in_wav_file) 129 | # subprocess.call(cmd, shell=True) 130 | 131 | # out_fz_file = os.path.join(fzdir, base + '.f0') 132 | # cmd = _reaper_bin + " -s -e %s -x %s -m %s -a -u 0.005 -i %s -p %s -f %s >/dev/null" % (framerate, max_f0, min_f0, in_wav_file, out_est_file, out_fz_file) 133 | # subprocess.call(cmd, shell=True) 134 | 135 | 136 | 137 | 138 | 139 | if __name__=="__main__": 140 | 141 | main_work() 142 | 143 | -------------------------------------------------------------------------------- /script/normalise_level.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - March 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | from argparse import ArgumentParser 11 | 12 | import soundfile as sf 13 | 14 | from concurrent.futures import ProcessPoolExecutor 15 | from functools import partial 16 | from tqdm import tqdm 17 | 18 | 19 | HERE = os.path.realpath(os.path.abspath(os.path.dirname(__file__))) 20 | sv56 = HERE + '/../tool/bin/sv56demo' 21 | 22 | if not os.path.isfile(sv56): 23 | 24 | ## Check required executables are available: 25 | 26 | from distutils.spawn import find_executable 27 | 28 | required_executables = ['sv56demo'] 29 | 30 | for executable in required_executables: 31 | if not find_executable(executable): 32 | sys.exit('%s command line tool must be on system path '%(executable)) 33 | 34 | sv56 = 'sv56demo' 35 | 36 | 37 | def main_work(): 38 | 39 | ################################################# 40 | 41 | # ======== Get stuff from command line ========== 42 | 43 | a = ArgumentParser() 44 | a.add_argument('-i', dest='indir', required=True) 45 | a.add_argument('-o', dest='outdir', 
required=True, \ 46 | help= "Put output here: make it if it doesn't exist" 47 | a.add_argument('-pattern', default='', \ 48 | help= "If given, only normalise files whose base contains this substring") 49 | a.add_argument('-ncores', default=1, type=int) 50 | opts = a.parse_args() 51 | 52 | # =============================================== 53 | 54 | for direc in [opts.outdir]: 55 | if not os.path.isdir(direc): 56 | os.makedirs(direc) 57 | 58 | flist = sorted(glob.glob(opts.indir + '/*.wav')) 59 | 60 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 61 | futures = [] 62 | for wave_file in flist: 63 | futures.append(executor.submit( 64 | partial(process, wave_file, opts.outdir, pattern=opts.pattern))) 65 | return [future.result() for future in tqdm(futures)] 66 | 67 | 68 | 69 | 70 | 71 | 72 | def process(wavefile, outdir, pattern=''): 73 | _, base = os.path.split(wavefile) 74 | 75 | if pattern: 76 | if pattern not in base: 77 | return 78 | 79 | # print base 80 | 81 | raw_in = os.path.join(outdir, base.replace('.wav','.raw')) 82 | raw_out = os.path.join(outdir, base.replace('.wav','_norm.raw')) 83 | logfile = os.path.join(outdir, base.replace('.wav','.log')) 84 | wav_out = os.path.join(outdir, base) 85 | 86 | data, samplerate = sf.read(wavefile, dtype='int16') 87 | sf.write(raw_in, data, samplerate, subtype='PCM_16') 88 | os.system('%s -log %s -q -lev -26.0 -sf %s %s %s'%(sv56, logfile, samplerate, raw_in, raw_out)) 89 | norm_data, samplerate = sf.read(raw_out, dtype='int16', samplerate=samplerate, channels=1, subtype='PCM_16') 90 | sf.write(wav_out, norm_data, samplerate) 91 | 92 | os.system('rm %s %s'%(raw_in, raw_out)) 93 | 94 | 95 | 96 | 97 | if __name__=="__main__": 98 | 99 | main_work() 100 | 101 | -------------------------------------------------------------------------------- /script/process_merlin_label.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2019 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys 8 | import os 9 | import glob 10 | from argparse import ArgumentParser 11 | 12 | from libutil import get_speech, basename, safe_makedir 13 | from scipy.signal import argrelextrema 14 | import numpy as np 15 | import matplotlib as mpl 16 | mpl.use('PDF') 17 | import pylab as pl 18 | 19 | def merlin_state_label_to_phone(labfile): 20 | labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks 21 | starts = labels[:,0].astype(int)[::5].reshape(-1,1) 22 | ends = labels[:,1].astype(int)[4::5].reshape(-1,1) 23 | fc = labels[:,2][::5] 24 | fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1) 25 | phone_label = np.hstack([starts, ends, fc]) 26 | return phone_label 27 | 28 | 29 | def minmax_norm(X, data_min, data_max): 30 | data_range = data_max - data_min 31 | data_range[data_range<=0.0] = 1.0 32 | mini, maxi = 0.01, 0.99 # ## merlin's default desired range 33 | X_std = (X - data_min) / data_range 34 | X_scaled = X_std * (maxi - mini) + mini 35 | return X_scaled 36 | 37 | 38 | def process_merlin_label(bin_label_fname, text_lab_dir, phonedim=416, subphonedim=9): 39 | 40 | text_label = os.path.join(text_lab_dir, basename(bin_label_fname) + '.lab') 41 | assert os.path.isfile(text_label), 'No text file for %s '%(basename(bin_label_fname)) 42 | 43 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim) 44 | 45 | ## fraction through phone (forwards) 46 | fraction_through_phone_forwards = labfrombin[:,-1] 47 | 48 | ## This is a surprisingly noisy signal which never seems to start at 0.0! Find minima:- 49 | (minima, ) = argrelextrema(fraction_through_phone_forwards, np.less) 50 | 51 | ## first frame is always a start: 52 | minima = np.insert(minima, 0, 0) 53 | 54 | ## check size against text file: 55 | labfromtext = merlin_state_label_to_phone(text_label) 56 | assert labfromtext.shape[0] == minima.shape[0] 57 | 58 | lab = labfrombin[minima,:-subphonedim] ## discard frame level feats, and take first frame of each phone 59 | 60 | return lab 61 | 62 | 63 | 64 | def main_work(): 65 | 66 | ################################################# 67 | 68 | # ============= Process command line ============ 69 | 70 | a = ArgumentParser() 71 | 72 | a.add_argument('-b', dest='binlabdir', required=True) 73 | a.add_argument('-t', dest='text_lab_dir', required=True) 74 | a.add_argument('-n', dest='norm_info_fname', required=True) 75 | a.add_argument('-o', dest='outdir', required=True) 76 | a.add_argument('-binext', dest='binext', required=False, default='lab') 77 | a.add_argument('-skipterminals', action='store_true', default=False) 78 | 79 | 80 | opts = a.parse_args() 81 | 82 | # =============================================== 83 | 84 | safe_makedir(opts.outdir) 85 | 86 | norm_info = get_speech(opts.norm_info_fname, 425)[:,:-9] 87 | data_min = norm_info[0,:] 88 | data_max = norm_info[1,:] 89 | data_range = data_max - data_min 90 | 91 | text_label_files = set([basename(f) for f in glob.glob(opts.text_lab_dir + '/*.lab')]) 92 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.' + opts.binext) ) 93 | print binary_label_files 94 | for binlab in binary_label_files: 95 | base = basename(binlab) 96 | if base not in text_label_files: 97 | continue 98 | print base 99 | lab = process_merlin_label(binlab, opts.text_lab_dir) 100 | if opts.skipterminals: 101 | lab = lab[1:-1,:] ## NB: don't remove the last 2 as in durations, as the final punct doesn't feature here 102 | norm_lab = minmax_norm(lab, data_min, data_max) 103 | 104 | if 0: ## piano roll style plot: 105 | pl.imshow(norm_lab, interpolation='nearest') 106 | pl.gray() 107 | pl.savefig('/afs/inf.ed.ac.uk/user/o/owatts/temp/fig.pdf') 108 | sys.exit('abckdubv') 109 | 110 | np.save(opts.outdir + '/' + base, norm_lab) 111 | 112 | 113 | if __name__=="__main__": 114 | main_work() 115 | 116 | -------------------------------------------------------------------------------- /script/process_merlin_label_positions.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2019 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | 7 | import sys 8 | import os 9 | import glob 10 | from argparse import ArgumentParser 11 | 12 | from libutil import get_speech, basename, safe_makedir 13 | from scipy.signal import argrelextrema 14 | import numpy as np 15 | import matplotlib as mpl 16 | mpl.use('PDF') 17 | import pylab as pl 18 | 19 | from scipy import interpolate 20 | 21 | 22 | # def merlin_state_label_to_phone(labfile): 23 | # labels = np.loadtxt(labfile, dtype=str, comments=None) ## default comments='#' breaks 24 | # starts = labels[:,0].astype(int)[::5].reshape(-1,1) 25 | # ends = labels[:,1].astype(int)[4::5].reshape(-1,1) 26 | # fc = labels[:,2][::5] 27 | # fc = np.array([line.replace('[2]','') for line in fc]).reshape(-1,1) 28 | # phone_label = np.hstack([starts, ends, fc]) 29 | # return phone_label 30 | 31 | 32 | def minmax_norm(X, data_min, data_max): 33 | data_range = data_max - data_min 34 | data_range[data_range<=0.0] = 1.0 35 | mini, maxi = 0.01, 0.99 # ## merlin's default desired range 36 | X_std = (X - data_min) / data_range 37 | X_scaled = X_std * (maxi - mini) + mini 38 | return X_scaled 39 | 40 | 41 | def process_merlin_positions(bin_label_fname, audio_dir, phonedim=416, subphonedim=9, \ 42 | inrate=5.0, outrate=12.5): 43 | 44 | audio_fname = os.path.join(audio_dir, basename(bin_label_fname) + '.npy') 45 | assert os.path.isfile(audio_fname), 'No audio file for %s '%(basename(bin_label_fname)) 46 | audio = np.load(audio_fname) 47 | 48 | labfrombin = get_speech(bin_label_fname, phonedim+subphonedim) 49 | 50 | positions = labfrombin[:,-subphonedim:] 51 | 52 | nframes, dim = positions.shape 53 | assert dim==9 54 | 55 | new_nframes, _ = audio.shape 56 | 57 | old_x = np.linspace((inrate/2.0), nframes*inrate, nframes, endpoint=False) ## place points at frame centres 58 | 59 | f = interpolate.interp1d(old_x, positions, axis=0, kind='nearest', bounds_error=False, fill_value='extrapolate') ## nearest to avoid weird averaging effects near segment boundaries 60 | 61 | new_x = np.linspace((outrate/2.0), new_nframes*outrate, new_nframes, endpoint=False) 62 | new_positions = f(new_x) 63 | 64 | return new_positions 65 | 66 | 67 | 68 | def main_work(): 69 | 70 | ################################################# 71 | 72 | # ============= Process command line ============ 73 | 74 | a = ArgumentParser() 75 | 76 | a.add_argument('-b', dest='binlabdir', required=True) 77 | a.add_argument('-f', dest='audio_dir', required=True) 78 | a.add_argument('-n', dest='norm_info_fname', required=True) 79 | a.add_argument('-o', dest='outdir', required=True) 80 | a.add_argument('-binext', dest='binext', required=False, default='lab') 81 | 82 | a.add_argument('-ir', dest='inrate', type=float, default=5.0) 83 | a.add_argument('-or', dest='outrate', type=float, default=12.5) 84 | 85 | opts = a.parse_args() 86 | 87 | # =============================================== 88 | 89 | safe_makedir(opts.outdir) 90 | 91 | norm_info = get_speech(opts.norm_info_fname, 425)[:,-9:] 92 | data_min = norm_info[0,:] 93 | data_max = norm_info[1,:] 94 | data_range = data_max - data_min 95 | 96 | audio_files = set([basename(f) for f in glob.glob(opts.audio_dir + '/*.npy')]) 97 | binary_label_files = sorted(glob.glob(opts.binlabdir + '/*.'
+ opts.binext) ) 98 | 99 | for binlab in binary_label_files: 100 | base = basename(binlab) 101 | if base not in audio_files: 102 | continue 103 | print base 104 | positions = process_merlin_positions(binlab, opts.audio_dir, inrate=opts.inrate, outrate=opts.outrate) 105 | norm_positions = minmax_norm(positions, data_min, data_max) 106 | 107 | np.save(opts.outdir + '/' + base, norm_positions) 108 | 109 | 110 | if __name__=="__main__": 111 | main_work() 112 | 113 | -------------------------------------------------------------------------------- /script/remove_end_silences.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | import fileinput, sys 5 | 6 | infile = sys.argv[1] 7 | 8 | i = 0 9 | for line in fileinput.input(infile): 10 | name, t1, t2, phones, speaker, durs = line.strip(' \n').split('|') # [-1] 11 | durs = durs.split(' ') 12 | assert durs[0] == '24' 13 | assert durs[-2] == '24' 14 | assert durs[-1] == '0' 15 | durs[0] = '0' 16 | durs[-2] = '0' 17 | durs[-1] = '0' 18 | # if i==5: 19 | # break 20 | durs = ' '.join(durs) 21 | line2 = '|'.join([name, t1, t2, phones, speaker, durs]) 22 | print line2 23 | i += 1 24 | # print i 25 | -------------------------------------------------------------------------------- /script/split_speech.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - June 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | import os 10 | import fileinput 11 | from argparse import ArgumentParser 12 | 13 | 14 | 15 | import librosa 16 | from concurrent.futures import ProcessPoolExecutor 17 | from functools import partial 18 | import numpy as np 19 | import soundfile 20 | from libutil import safe_makedir, get_basename 21 | from tqdm import tqdm 22 | 23 | # librosa.effects.split(y, top_db=60, ref=, frame_length=2048, hop_length=512)[source] 24 | 25 | # intervals[i] == (start_i, end_i) are the start and end time (in samples) of non-silent interval i. 
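# A minimal sketch of how these two calls fit together (librosa 0.6 API, as pinned in
# requirements.txt; 'example.wav' is a hypothetical input):
#
#     import librosa, soundfile
#     wav, fs = soundfile.read('example.wav')                   # mono float array
#     _, (start, end) = librosa.effects.trim(wav, top_db=30)    # locate leading/trailing silence
#     for i, (s, e) in enumerate(librosa.effects.split(wav, top_db=30)):
#         soundfile.write('chunk_%d.wav' % i, wav[s:e], fs)     # one file per non-silent interval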
26 | 27 | 28 | 29 | 30 | def main_work(): 31 | 32 | ################################################# 33 | 34 | # ======== Get stuff from command line ========== 35 | 36 | a = ArgumentParser() 37 | a = ArgumentParser() 38 | a.add_argument('-w', dest='wave_dir', required=True) 39 | a.add_argument('-o', dest='output_dir', required=True) 40 | a.add_argument('-N', dest='nfiles', type=int, default=0) 41 | a.add_argument('-l', dest='wavlist_file', default=None) 42 | a.add_argument('-dB', dest='top_db', default=30, type=int) 43 | a.add_argument('-trimonly', action='store_true', default=False) 44 | a.add_argument('-ncores', type=int, default=0) 45 | a.add_argument('-endpad', type=float, default=0.3) 46 | opts = a.parse_args() 47 | 48 | # =============================================== 49 | 50 | trim_waves_in_directory(opts.wave_dir, opts.output_dir, num_workers=opts.ncores, \ 51 | tqdm=tqdm, nfiles=opts.nfiles, top_db=opts.top_db, trimonly=opts.trimonly, endpad=opts.endpad) 52 | 53 | def trim_waves_in_directory(in_dir, out_dir, num_workers=1, tqdm=lambda x: x, \ 54 | nfiles=0, top_db=30, trimonly=False, endpad=0.3): 55 | safe_makedir(out_dir) 56 | wave_files = sorted(glob.glob(in_dir + '/*.wav')) 57 | if nfiles > 0: 58 | wave_files = wave_files[:min(nfiles, len(wave_files))] 59 | 60 | if num_workers: 61 | executor = ProcessPoolExecutor(max_workers=num_workers) 62 | futures = [] 63 | for (index, wave_file) in enumerate(wave_files): 64 | futures.append(executor.submit( 65 | partial(_process_utterance, wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad))) 66 | return [future.result() for future in tqdm(futures)] 67 | else: ## serial processing 68 | for wave_file in tqdm(wave_files): 69 | _process_utterance(wave_file, out_dir, top_db=top_db, trimonly=trimonly, end_pad_sec=endpad) 70 | 71 | 72 | 73 | def _process_utterance(wav_path, out_dir, top_db=30, end_pad_sec=0.3, pad_sec=0.01, minimum_duration_sec=0.5, trimonly=False): 74 | 75 | wav, fs = soundfile.read(wav_path) ## TODO: assert mono 76 | 77 | if len(wav) > 0: 78 | 79 | pad = int(pad_sec * fs) 80 | end_pad = int(end_pad_sec * fs) 81 | # print pad 82 | base = get_basename(wav_path) 83 | # print base 84 | _, (start, end) = librosa.effects.trim(wav, top_db=top_db) 85 | start = max(0, (start - end_pad)) 86 | end = min(len(wav), (end + end_pad)) 87 | 88 | if start < end: 89 | wav = wav[start:end] 90 | 91 | if trimonly: 92 | ofile = os.path.join(out_dir, base + '.wav') 93 | soundfile.write(ofile, wav, fs) 94 | else: 95 | starts_ends = librosa.effects.split(wav, top_db=top_db) 96 | starts_ends[:,0] -= pad 97 | starts_ends[:,1] += pad 98 | starts_ends = np.clip(starts_ends, 0, wav.size) 99 | lengths = starts_ends[:,1] - starts_ends[:,0] 100 | starts_ends = starts_ends[lengths > fs * minimum_duration_sec] 101 | 102 | 103 | for (i, (s,e)) in enumerate(starts_ends): 104 | 105 | ofile = os.path.join(out_dir, base + '_seg%s.wav'%(str(i+1).zfill(4))) 106 | # print ofile 107 | soundfile.write(ofile, wav[s:e], fs) 108 | 109 | else: 110 | 111 | print "File discarded: " + wav_path 112 | 113 | def test(): 114 | safe_makedir('/tmp/splitwaves/') 115 | _process_utterance('/afs/inf.ed.ac.uk/group/cstr/projects/simple4all_2/oliver/data/nick/wav/herald_030.wav', '/tmp/splitwaves/') 116 | 117 | if __name__=="__main__": 118 | 119 | main_work() 120 | 121 | -------------------------------------------------------------------------------- /synthesise_validation_waveforms.py: -------------------------------------------------------------------------------- 
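## Usage sketch for the script below, inferred from its argparse options; the config
## path is a hypothetical example. It converts the coarse mels saved at validation
## time to full spectrograms with the latest SSRN checkpoint, then runs Griffin-Lim:
##
##     python synthesise_validation_waveforms.py -c config/lj_01.cfg -ncores 4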
1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | ## Project: SCRIPT - February 2018 4 | ## Contact: Oliver Watts - owatts@staffmail.ed.ac.uk 5 | 6 | import sys 7 | import os 8 | import glob 9 | from argparse import ArgumentParser 10 | 11 | import imp 12 | 13 | import numpy as np 14 | 15 | from utils import spectrogram2wav 16 | # from scipy.io.wavfile import write 17 | import soundfile 18 | 19 | import tqdm 20 | from concurrent.futures import ProcessPoolExecutor 21 | 22 | import tensorflow as tf 23 | from architectures import SSRNGraph 24 | from synthesize import make_mel_batch, split_batch, synth_mel2mag 25 | from configuration import load_config 26 | 27 | 28 | def synth_wave(hp, magfile): 29 | mag = np.load(magfile) 30 | #print ('mag shape %s'%(str(mag.shape))) 31 | wav = spectrogram2wav(hp, mag) 32 | outfile = magfile.replace('.mag.npy', '.wav') 33 | outfile = outfile.replace('.npy', '.wav') 34 | #print magfile 35 | #print outfile 36 | #print 37 | # write(outfile, hp.sr, wav) 38 | soundfile.write(outfile, wav, hp.sr) 39 | 40 | def main_work(): 41 | 42 | ################################################# 43 | 44 | # ======== Get stuff from command line ========== 45 | 46 | a = ArgumentParser() 47 | a.add_argument('-c', dest='config', required=True, type=str) 48 | a.add_argument('-ncores', type=int, default=1) 49 | opts = a.parse_args() 50 | 51 | # =============================================== 52 | 53 | hp = load_config(opts.config) 54 | 55 | ### 1) convert saved coarse mels to mags with latest-trained SSRN 56 | print('mel2mag: restore last saved SSRN') 57 | g = SSRNGraph(hp, mode="synthesize") 58 | with tf.Session() as sess: 59 | sess.run(tf.global_variables_initializer()) 60 | 61 | ## TODO: use restore_latest_model_parameters from synthesize? 
62 | var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'SSRN') 63 | saver2 = tf.train.Saver(var_list=var_list) 64 | savepath = hp.logdir + "-ssrn" 65 | latest_checkpoint = tf.train.latest_checkpoint(savepath) 66 | if latest_checkpoint is None: sys.exit('No SSRN at %s?'%(savepath)) 67 | ssrn_epoch = latest_checkpoint.strip('/ ').split('/')[-1].replace('model_epoch_', '') 68 | saver2.restore(sess, latest_checkpoint) 69 | print("SSRN Restored from latest epoch %s"%(ssrn_epoch)) 70 | 71 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.npy') 72 | filelist = [fname for fname in filelist if not fname.endswith('.mag.npy')] 73 | batch, lengths = make_mel_batch(hp, filelist, oracle=False) 74 | Z = synth_mel2mag(hp, batch, g, sess, batchsize=32) 75 | print ('synthesised mags, now splitting batch:') 76 | maglist = split_batch(Z, lengths) 77 | for (infname, outdata) in tqdm.tqdm(zip(filelist, maglist)): 78 | np.save(infname.replace('.npy','.mag.npy'), outdata) 79 | 80 | 81 | 82 | ### 2) GL in parallel for both t2m and ssrn validation set 83 | print('GL for SSRN validation') 84 | filelist = glob.glob(hp.logdir + '-t2m/validation_epoch_*/*.mag.npy') + \ 85 | glob.glob(hp.logdir + '-ssrn/validation_epoch_*/*.npy') 86 | 87 | if opts.ncores==1: 88 | for fname in tqdm.tqdm(filelist): 89 | synth_wave(hp, fname) 90 | else: 91 | executor = ProcessPoolExecutor(max_workers=opts.ncores) 92 | futures = [] 93 | for fpath in filelist: 94 | futures.append(executor.submit(synth_wave, hp, fpath)) 95 | proc_list = [future.result() for future in tqdm.tqdm(futures)] 96 | 97 | 98 | 99 | if __name__=="__main__": 100 | 101 | main_work() 102 | -------------------------------------------------------------------------------- /test.tmp: -------------------------------------------------------------------------------- 1 | testing 2 | -------------------------------------------------------------------------------- /tool/bin/sv56demo: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSTR-Edinburgh/ophelia/46042a80d045b6fd6403601e9e8fb447d304b6b3/tool/bin/sv56demo -------------------------------------------------------------------------------- /util/find_free_gpu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import subprocess 4 | import re 5 | 6 | sp = subprocess.Popen(['nvidia-smi', 'pmon', '-c', '1'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) 7 | out_str = sp.communicate() 8 | 9 | #print out_str 10 | 11 | lines = out_str[0].split('\n') 12 | lines = [line for line in lines if not line.startswith('#')] # filter comments 13 | #lines = [line.replace('-', '') for line in lines] # remove - placeholders used to mark columns for idle GPUs 14 | lines = [re.split('\s+', line.strip(' ')) for line in lines] # split on whitespace 15 | lines = [line for line in lines if line != ['']] # remove empty lines 16 | 17 | free_gpu = -1 18 | for line in lines: 19 | #print line 20 | gpu_id = line[0] 21 | process_info = ''.join(line[1:]) 22 | #print gpu_id 23 | #print process_info 24 | if process_info.replace('-', '') == '': ## does it contain only - placeholders used to mark columns for idle GPUs? 
25 | free_gpu = gpu_id 26 | break 27 | print free_gpu 28 | 29 | -------------------------------------------------------------------------------- /util/submit_tf.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... ] 7 | 8 | ## Location of this script: assume gpu_lock.py is in same place - 9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P ) 10 | 11 | 12 | gpu_id=$(python2 $SCRIPTPATH/gpu_lock.py --id-to-hog) 13 | 14 | ### paths for tensorflow (works on hynek) 15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 17 | export CUDA_HOME=/opt/cuda-8.0.44/ 18 | 19 | export KERAS_BACKEND=tensorflow 20 | export CUDA_VISIBLE_DEVICES=$gpu_id 21 | 22 | if [ $gpu_id -gt -1 ]; then 23 | 24 | $PYTHON $@ 25 | 26 | python2 $SCRIPTPATH/gpu_lock.py --free $gpu_id 27 | else 28 | echo 'Let us wait! No GPU is available!' 29 | 30 | fi 31 | -------------------------------------------------------------------------------- /util/submit_tf_2.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... ] 7 | 8 | ## Location of this script: assume gpu_lock.py is in same place - 9 | SCRIPTPATH=$( cd $(dirname $0) ; pwd -P ) 10 | 11 | 12 | gpu_id=$(python2 $SCRIPTPATH/find_free_gpu.py) 13 | 14 | ### paths for tensorflow (works on hynek) 15 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 16 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 17 | export CUDA_HOME=/opt/cuda-8.0.44/ 18 | 19 | export KERAS_BACKEND=tensorflow 20 | export CUDA_VISIBLE_DEVICES=$gpu_id 21 | 22 | if [ $gpu_id -gt -1 ]; then 23 | 24 | $PYTHON $@ 25 | 26 | else 27 | echo 'Let us wait! No GPU is available!' 28 | 29 | fi 30 | -------------------------------------------------------------------------------- /util/submit_tf_cpu.sh: -------------------------------------------------------------------------------- 1 | 2 | PYTHON=python 3 | 4 | 5 | ## Generic script for submitting any tensorflow job to GPU 6 | # usage: submit.sh [scriptname.py script_arguments ... 
] 7 | 8 | 9 | ### paths for tensorflow (works on hynek) 10 | # export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6 11 | export LD_LIBRARY_PATH=/opt/cuda-8.0.44/extras/CUPTI/lib64/:/opt/cuda-8.0.44/:/opt/cuda-8.0.44/lib64:/opt/cuDNN-7.0/:/opt/cuDNN-6.0_8.0/:/opt/cuda/:/opt/cuDNN-6.0_8.0/lib64:/opt/cuDNN-6.0/lib6:/opt/cuda-9.0.176.1/lib64/:/opt/cuda-9.1.85/lib64/:/opt/cuDNN-7.1_9.1/lib64 12 | export CUDA_HOME=/opt/cuda-8.0.44/ 13 | 14 | export KERAS_BACKEND=tensorflow 15 | export CUDA_VISIBLE_DEVICES="" 16 | 17 | $PYTHON $@ 18 | 19 | -------------------------------------------------------------------------------- /util/sync_code_from_afs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | SERVERDIR=/path/on/GPU/machine/to/dc_tts_osw/ ## change this to e.g. /disk/scratch/.../dc_tts_osw 4 | SOURCE=/path/on/AFS/to/dc_tts_osw/ ## e.g. in your AFS home directory 5 | 6 | rsync -avzh --progress $SOURCE/ $SERVERDIR/ 7 | # --exclude='.git/' 8 | -------------------------------------------------------------------------------- /util/sync_output_to_afs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | DEST=~/synthetic_speech ## e.g. directory on AFS where examining audio is straightforward 4 | 5 | PATTERN=$1 ## substring of the config to sync 6 | 7 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/validation_epoch_*/*.wav $DEST 8 | rsync -av --relative ./work/*${PATTERN}*/train-ssrn/validation_epoch_*/*.wav $DEST 9 | rsync -av --relative ./work/*${PATTERN}*/train-t2m/alignment* $DEST 10 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.wav $DEST 11 | rsync -av --relative ./work/*${PATTERN}*/synth*/*/*.wav $DEST 12 | rsync -av --relative ./work/*${PATTERN}*/synth/*/*.png $DEST 13 | 14 | 15 | echo "Synced to: $DEST" --------------------------------------------------------------------------------
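## Example invocation of the sync script above (hypothetical pattern): passing a
## substring of a config name syncs that voice's validation and synthesis output:
##
##     ./util/sync_output_to_afs.sh lj_01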