├── README.md
├── compute_statistics.py
├── convert.py
├── dataset.py
├── extract_features.py
├── module.py
├── net.py
├── normalize_features.py
├── pwg
│   ├── egs
│   │   └── .gitkeep
│   └── parallel_wavegan
│       └── .gitkeep
├── recipes
│   ├── run_test.sh
│   ├── run_test_arctic_4spk.sh
│   ├── run_train.sh
│   └── run_train_arctic_4spk.sh
├── requirements.txt
└── train.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# StarGAN-VC

This repository provides an official PyTorch implementation of [StarGAN-VC](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc2/index.html).

StarGAN-VC is a nonparallel many-to-many voice conversion (VC) method using star generative adversarial networks (StarGAN). The current version performs VC by first modifying the mel-spectrogram of input speech of an arbitrary speaker in accordance with a target speaker index, and then generating a waveform from the modified mel-spectrogram using a speaker-independent neural vocoder (HiFi-GAN or Parallel WaveGAN).

Audio samples are available [here](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc2/index.html).

## Papers

- [Hirokazu Kameoka](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/index-e.html), [Takuhiro Kaneko](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/index.html), [Kou Tanaka](http://www.kecl.ntt.co.jp/people/tanaka.ko/index.html), and Nobukatsu Hojo, "**StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks**," in *Proc. 2018 IEEE Workshop on Spoken Language Technology ([SLT 2018](http://www.slt2018.org/))*, pp. 266-273, Dec. 2018. [**[Paper]**](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/publications/Kameoka2018SLT12_published.pdf)

- [Hirokazu Kameoka](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/index-e.html), [Takuhiro Kaneko](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/index.html), [Kou Tanaka](http://www.kecl.ntt.co.jp/people/tanaka.ko/index.html), and Nobukatsu Hojo, "**Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks**," *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, vol. 28, pp. 2982-2995, 2020. [**[Paper]**](https://ieeexplore.ieee.org/document/9256995)

## Preparation

#### Requirements

- See `requirements.txt`.

#### Dataset

1. Set up your training and test sets. The data structure should look like:

```bash
/path/to/dataset/training
├── spk_1
│   ├── utt1.wav
│   ...
├── spk_2
│   ├── utt1.wav
│   ...
└── spk_N
    ├── utt1.wav
    ...

/path/to/dataset/test
├── spk_1
│   ├── utt1.wav
│   ...
├── spk_2
│   ├── utt1.wav
│   ...
└── spk_N
    ├── utt1.wav
    ...
```
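Note that the subfolder names double as speaker labels: `train.py` sorts the subfolders and uses each speaker's position in the sorted list as its class index, and `convert.py` resolves target speakers the same way. A minimal sketch of this mapping (folder names below are placeholders):

```python
import os

def speaker_index_map(data_rootdir):
    """Map speaker folder names to the class indices used by train.py/convert.py."""
    spk_list = sorted(os.listdir(data_rootdir))
    return {spk: k for k, spk in enumerate(spk_list)}

# e.g., speaker_index_map('/path/to/dataset/training')
# -> {'spk_1': 0, 'spk_2': 1, ..., 'spk_N': N-1}
```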
#### Waveform generator

1. Place a copy of the directory `parallel_wavegan` from https://github.com/kan-bayashi/ParallelWaveGAN in `pwg/`.
2. HiFi-GAN models trained on several databases can be found [here](https://drive.google.com/drive/folders/1RvagKsKaCih0qhRP6XkSF07r3uNFhB5T?usp=sharing). Once these are downloaded, place them in `pwg/egs/`. Please contact me if you have any problems downloading.
3. Optionally, Parallel WaveGAN can be used instead for waveform generation. The trained models are available [here](https://drive.google.com/drive/folders/1zRYZ9dx16dONn1SEuO4wXjjgJHaYSKwb?usp=sharing). Once these are downloaded, place them in `pwg/egs/`.

## Main

#### Train

To run all stages of model training, execute:

```bash
./recipes/run_train.sh [-g gpu] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
```

- Options:

```bash
-g: GPU device (default: 0)
# -1 indicates CPU
-a: Generator architecture type ("conv" or "rnn")
# conv: 1D fully convolutional network (default)
# rnn: Bidirectional long short-term memory network
-l: Loss type ("cgan", "wgan", or "lsgan")
# cgan: Cross-entropy GAN
# wgan: Wasserstein GAN with the gradient penalty loss (default)
# lsgan: Least squares GAN
-s: Stage to start from (0 or 1)
# Stages 0 and 1 correspond to feature extraction and model training, respectively.
-e: Experiment name (default: "conv_wgan_exp1")
# This name will be used at test time to specify which trained model to load.
```

- Examples:

```bash
# To run the training from scratch with the default settings:
./recipes/run_train.sh

# To skip the feature extraction stage:
./recipes/run_train.sh -s 1

# To set the GPU device to, say, 0:
./recipes/run_train.sh -g 0

# To use a generator with a recurrent architecture:
./recipes/run_train.sh -a rnn -e rnn_wgan_exp1

# To use the cross-entropy adversarial loss:
./recipes/run_train.sh -l cgan -e conv_cgan_exp1

# To use the least-squares adversarial loss:
./recipes/run_train.sh -l lsgan -e conv_lsgan_exp1
```

See the other scripts in `recipes` for examples of training on different datasets.

To monitor the training process, use TensorBoard:

```bash
tensorboard [--logdir log_path]
```
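For reference, training with the default variables in `recipes/run_train.sh` (dataset name `mydataset`, snapshot interval 200 epochs) writes its artifacts roughly as follows; the epoch numbers below are illustrative:

```bash
model/mydataset/conv_wgan_exp1
├── model_config.json   # training configuration saved by train.py
├── 200.gen.pt          # generator checkpoint
├── 200.dis.pt          # discriminator checkpoint
└── ...
logs/mydataset/conv_wgan_exp1
├── train_conv_wgan_exp1.log
└── (TensorBoard event files)
```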
#### Test

To perform conversion, execute:

```bash
./recipes/run_test.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
```

- Options:

```bash
-g: GPU device (default: 0)
# -1 indicates CPU
-e: Experiment name (e.g., "conv_wgan_exp1")
-c: Model checkpoint to load (default: 0)
# 0 indicates the newest model
-v: Vocoder type ("hfg" or "pwg")
# hfg: HiFi-GAN (default)
# pwg: Parallel WaveGAN
```

- Examples:

```bash
# To perform conversion with the default settings:
./recipes/run_test.sh -g 0 -e conv_wgan_exp1

# To use Parallel WaveGAN as an alternative for waveform generation:
./recipes/run_test.sh -g 0 -e conv_wgan_exp1 -v pwg
```

## Citation

If you find this work useful for your research, please cite our papers.

```
@inproceedings{Kameoka2018SLT_StarGAN-VC,
  author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
  booktitle={Proc. 2018 IEEE Spoken Language Technology Workshop (SLT)},
  title={StarGAN-VC: Non-parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks},
  year={2018},
  pages={266--273}}

@article{Kameoka2020IEEETrans_StarGAN-VC,
  author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
  title={Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2982--2995},
  year={2020}}
```

## Author

Hirokazu Kameoka ([@kamepong](https://github.com/kamepong))

E-mail: kame.hirokazu@gmail.com

--------------------------------------------------------------------------------
/compute_statistics.py:
--------------------------------------------------------------------------------
import argparse
import logging
import os
import h5py
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
import pickle

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

def read_melspec(filepath):
    with h5py.File(filepath, "r") as f:
        melspec = f["melspec"][()]  # n_mels x n_frame
    return melspec

def compute_statistics(src, stat_filepath):
    melspec_scaler = StandardScaler()

    filepath_list = list(walk_files(src, '.h5'))
    for filepath in tqdm(filepath_list):
        melspec = read_melspec(filepath)
        melspec_scaler.partial_fit(melspec.T)

    with open(stat_filepath, mode='wb') as f:
        pickle.dump(melspec_scaler, f)
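# A minimal usage sketch (paths are the defaults below): the pickled
# StandardScaler written above is what normalize_features.py and convert.py
# reload for feature normalization, e.g.:
#   with open('./dump/arctic/stat.pkl', 'rb') as f:
#       melspec_scaler = pickle.load(f)
#   melspec_norm = melspec_scaler.transform(melspec.T).T  # scaler expects time on axis 0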
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', type=str, default='./dump/arctic/feat/train')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl')
    args = parser.parse_args()

    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    logging.basicConfig(level=logging.INFO, format=fmt, datefmt=datafmt)

    src = args.src
    stat_filepath = args.stat
    if not os.path.exists(os.path.dirname(stat_filepath)):
        os.makedirs(os.path.dirname(stat_filepath))

    compute_statistics(src, stat_filepath)

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
# Copyright 2021 Hirokazu Kameoka

import os
import argparse
import torch
import json
import numpy as np
import re
import pickle
from tqdm import tqdm
import yaml

import librosa
import soundfile as sf
from sklearn.preprocessing import StandardScaler

import net
from extract_features import logmelfilterbank

import sys
sys.path.append(os.path.abspath("pwg"))
from pwg.parallel_wavegan.utils import load_model
from pwg.parallel_wavegan.utils import read_hdf5

def audio_transform(wav_filepath, scaler, kwargs, device):

    trim_silence = kwargs['trim_silence']
    top_db = kwargs['top_db']
    flen = kwargs['flen']
    fshift = kwargs['fshift']
    fmin = kwargs['fmin']
    fmax = kwargs['fmax']
    num_mels = kwargs['num_mels']
    fs = kwargs['fs']

    audio, fs_ = sf.read(wav_filepath)
    if trim_silence:
        audio, _ = librosa.effects.trim(audio, top_db=top_db, frame_length=2048, hop_length=512)
    if fs != fs_:
        audio = librosa.resample(audio, fs_, fs)
    melspec_raw = logmelfilterbank(audio, fs, fft_size=flen, hop_size=fshift,
                                   fmin=fmin, fmax=fmax, num_mels=num_mels)
    melspec_raw = melspec_raw.astype(np.float32)  # n_frame x n_mels

    melspec_norm = scaler.transform(melspec_raw)
    melspec_norm = melspec_norm.T  # n_mels x n_frame

    return torch.tensor(melspec_norm[None]).to(device, dtype=torch.float)
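# Shape convention (as implemented above): audio_transform returns a tensor of
# shape (1, num_mels, n_frame), i.e., a batch of one normalized mel-spectrogram,
# which is what the generator consumes. As a rough illustration (assuming
# 16 kHz audio and the 8 ms frame shift implied by the default vocoder
# directory name), a 2-second utterance with 80 mel channels yields a tensor
# of about (1, 80, 250).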
def extract_num(s, p, ret=0):
    search = p.search(s)
    if search:
        return int(search.groups()[0])
    else:
        return ret

def listdir_ext(dirpath, ext):
    p = re.compile(r'(\d+)')
    out = []
    for file in sorted(os.listdir(dirpath), key=lambda s: extract_num(s, p)):
        if os.path.splitext(file)[1] == ext:
            out.append(file)
    return out

def find_newest_model_file(model_dir, tag):
    mfile_list = os.listdir(model_dir)
    checkpoint = max([int(os.path.splitext(os.path.splitext(mfile)[0])[0]) for mfile in mfile_list if mfile.endswith('.{}.pt'.format(tag))])
    return checkpoint


def synthesis(melspec, model_nv, nv_config, savepath, device):
    ## Parallel WaveGAN / HiFi-GAN
    melspec = torch.tensor(melspec, dtype=torch.float).to(device)
    x = model_nv.inference(melspec).view(-1)

    # save as PCM 16 bit wav file
    if not os.path.exists(os.path.dirname(savepath)):
        os.makedirs(os.path.dirname(savepath))
    sf.write(savepath, x.detach().cpu().clone().numpy(), nv_config["sampling_rate"], "PCM_16")

def main():
    parser = argparse.ArgumentParser(description='Testing StarGAN-VC')
    parser.add_argument('--gpu', '-g', type=int, default=-1, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('-i', '--input', type=str, default='/misc/raid58/kameoka.hirokazu/python/db/arctic/wav/test',
                        help='root data folder that contains the wav files of input speech')
    parser.add_argument('-o', '--out', type=str, default='./out/arctic',
                        help='root data folder where the wav files of the converted speech will be saved.')
    parser.add_argument('--dataconf', type=str, default='./dump/arctic/data_config.json')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl', help='stat file used for normalization')
    parser.add_argument('--model_rootdir', '-mdir', type=str, default='./model/arctic/', help='model file directory')
    parser.add_argument('--checkpoint', '-ckpt', type=int, default=0, help='model checkpoint to load (0 indicates the newest model)')
    parser.add_argument('--experiment_name', '-exp', default='experiment1', type=str, help='experiment name')
    parser.add_argument('--vocoder', '-voc', default='hifigan.v1', type=str,
                        help='neural vocoder type name (e.g., hifigan.v1, hifigan.v2, parallel_wavegan.v1)')
    parser.add_argument('--voc_dir', '-vdir', type=str, default='pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1',
                        help='directory of trained neural vocoder')
    args = parser.parse_args()

    # Set up GPU
    if torch.cuda.is_available() and args.gpu >= 0:
        device = torch.device('cuda:%d' % args.gpu)
    else:
        device = torch.device('cpu')
    if device.type == 'cuda':
        torch.cuda.set_device(device)

    input_dir = args.input
    data_config_path = args.dataconf
    model_config_path = os.path.join(args.model_rootdir, args.experiment_name, 'model_config.json')
    with open(data_config_path) as f:
        data_config = json.load(f)
    with open(model_config_path) as f:
        model_config = json.load(f)
    checkpoint = args.checkpoint

    num_mels = model_config['num_mels']
    arch_type = model_config['arch_type']
    loss_type = model_config['loss_type']
    n_spk = model_config['n_spk']
    trg_spk_list = model_config['spk_list']
    zdim = model_config['zdim']
    hdim = model_config['hdim']
    mdim = model_config['mdim']
    sdim = model_config['sdim']
    normtype = model_config['normtype']
    src_conditioning = model_config['src_conditioning']

    stat_filepath = args.stat
    melspec_scaler = StandardScaler()
    if os.path.exists(stat_filepath):
        with open(stat_filepath, mode='rb') as f:
            melspec_scaler = pickle.load(f)
        print('Loaded mel-spectrogram statistics successfully.')
    else:
        print('Stat file not found.')

    # Set up main model
    gen = (net.Generator1(num_mels, n_spk, zdim, hdim, sdim, normtype, src_conditioning) if arch_type == 'conv'
           else net.Generator2(num_mels, n_spk, zdim, hdim, sdim, src_conditioning=src_conditioning))
    dis = net.Discriminator1(num_mels, n_spk, mdim, normtype)
    models = {
        'gen': gen,
        'dis': dis
    }
    models['stargan'] = net.StarGAN(models['gen'], models['dis'], n_spk, loss_type)

    for tag in ['gen', 'dis']:
        model_dir = os.path.join(args.model_rootdir, args.experiment_name)
        vc_checkpoint_idx = find_newest_model_file(model_dir, tag) if checkpoint <= 0 else checkpoint
        mfilename = '{}.{}.pt'.format(vc_checkpoint_idx, tag)
        path = os.path.join(args.model_rootdir, args.experiment_name, mfilename)
        model_checkpoint = torch.load(path, map_location=device)
        models[tag].load_state_dict(model_checkpoint['model_state_dict'])
        print('{}: {}'.format(tag, os.path.abspath(path)))

    for tag in ['gen', 'dis']:
        #models[tag].to(device).eval()
        models[tag].to(device).train(mode=True)

    # Set up the neural vocoder
    vocoder = args.vocoder
    voc_dir = args.voc_dir
    voc_yaml_path = os.path.join(voc_dir, 'conf', '{}.yaml'.format(vocoder))
    checkpointlist = listdir_ext(
        os.path.join(voc_dir, 'exp', 'train_nodev_all_{}'.format(vocoder)), '.pkl')
    nv_checkpoint = os.path.join(voc_dir, 'exp',
                                 'train_nodev_all_{}'.format(vocoder),
                                 checkpointlist[-1])  # Find and use the newest checkpoint model.
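    # Note (assumption about filenames): listdir_ext sorts by the first integer
    # found in each filename, so vocoder checkpoints named like, e.g.,
    # 'checkpoint-400000steps.pkl' are ordered by step count and
    # checkpointlist[-1] picks the one with the most training steps.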
    print('vocoder: {}'.format(os.path.abspath(nv_checkpoint)))

    with open(voc_yaml_path) as f:
        nv_config = yaml.load(f, Loader=yaml.Loader)
    nv_config.update(vars(args))
    model_nv = load_model(nv_checkpoint, nv_config)
    model_nv.remove_weight_norm()
    model_nv = model_nv.eval().to(device)

    src_spk_list = sorted(os.listdir(input_dir))

    for i, src_spk in enumerate(src_spk_list):
        src_wav_dir = os.path.join(input_dir, src_spk)
        for j, trg_spk in enumerate(trg_spk_list):
            if src_spk != trg_spk:
                print('Converting {}2{}...'.format(src_spk, trg_spk))
                for n, src_wav_filename in enumerate(os.listdir(src_wav_dir)):
                    src_wav_filepath = os.path.join(src_wav_dir, src_wav_filename)
                    src_melspec = audio_transform(src_wav_filepath, melspec_scaler, data_config, device)
                    k_t = j
                    k_s = i if src_conditioning else None

                    conv_melspec = models['stargan'](src_melspec, k_t, k_s)

                    conv_melspec = conv_melspec[0,:,:].detach().cpu().clone().numpy()
                    conv_melspec = conv_melspec.T  # n_frames x n_mels

                    out_wavpath = os.path.join(args.out, args.experiment_name, '{}'.format(vc_checkpoint_idx), vocoder, '{}2{}'.format(src_spk, trg_spk), src_wav_filename)
                    synthesis(conv_melspec, model_nv, nv_config, out_wavpath, device)


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/dataset.py:
--------------------------------------------------------------------------------
# Copyright 2021 Hirokazu Kameoka

import os
import numpy as np
import torch
from torch.utils.data import Dataset
import h5py
import math
import random

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

class MultiDomain_Dataset(Dataset):
    def __init__(self, *feat_dirs):
        self.n_domain = len(feat_dirs)
        self.filenames_all = [[os.path.join(d,t) for t in sorted(os.listdir(d))] for d in feat_dirs]
        #self.filenames_all = [[t for t in walk_files(d, '.h5')] for d in feat_dirs]
        self.feat_dirs = feat_dirs

    def __len__(self):
        return min(len(f) for f in self.filenames_all)

    def __getitem__(self, idx):
        melspec_list = []
        for d in range(self.n_domain):
            with h5py.File(self.filenames_all[d][idx], "r") as f:
                melspec = f["melspec"][()]  # n_freq x n_time
            melspec_list.append(melspec)
        return melspec_list

def collate_fn(batch):
    #batch[b][s]: melspec (n_freq x n_frame)
    #b: sample index within the batch
    #s: speaker index

    batchsize = len(batch)
    n_spk = len(batch[0])
    melspec_list = [[batch[b][s] for b in range(batchsize)] for s in range(n_spk)]
    #melspec_list[s][b]: melspec (n_freq x n_frame)

    n_freq = melspec_list[0][0].shape[0]

    X_list = []
    for s in range(n_spk):
        maxlen = 0
        for b in range(batchsize):
            if maxlen < melspec_list[s][b].shape[1]:
                maxlen = melspec_list[s][b].shape[1]
        # ... (the remainder of this file, along with extract_features.py,
        # module.py, and the beginning of net.py, is missing from this dump)

--------------------------------------------------------------------------------
/net.py:
--------------------------------------------------------------------------------
# (the beginning of this file is missing from this dump; the code below is the
# tail of the StarGAN class defined here)

        if n_frame > n_frame_:
            x = nn.ReplicationPad1d((0, n_frame-n_frame_))(x)
        return self.gen(x, k_t, k_s)[:,:,0:n_frame_]

    def calc_advloss_g(self, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts):
        df_adv_ss = df_adv_ss.permute(0,2,1).reshape(-1,1)
        df_adv_st = df_adv_st.permute(0,2,1).reshape(-1,1)
        df_adv_tt = df_adv_tt.permute(0,2,1).reshape(-1,1)
        df_adv_ts = df_adv_ts.permute(0,2,1).reshape(-1,1)

        if self.loss_type=='wgan':
            # Wasserstein GAN with gradient penalty (WGAN-GP)
            AdvLoss_g = (
                torch.sum(-df_adv_ss) +
                torch.sum(-df_adv_st) +
                torch.sum(-df_adv_tt) +
                torch.sum(-df_adv_ts)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        elif self.loss_type=='lsgan':
            # Least squares GAN (LSGAN)
            AdvLoss_g = 0.5 * (
                torch.sum((df_adv_ss - torch.ones_like(df_adv_ss))**2) +
                torch.sum((df_adv_st - torch.ones_like(df_adv_st))**2) +
                torch.sum((df_adv_tt - torch.ones_like(df_adv_tt))**2) +
                torch.sum((df_adv_ts - torch.ones_like(df_adv_ts))**2)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        elif self.loss_type=='cgan':
            # Regular GAN with the sigmoid cross-entropy criterion (CGAN)
            AdvLoss_g = (
                F.binary_cross_entropy_with_logits(df_adv_ss, torch.ones_like(df_adv_ss), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_st, torch.ones_like(df_adv_st), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_tt, torch.ones_like(df_adv_tt), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_ts, torch.ones_like(df_adv_ts), reduction='sum')
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        return AdvLoss_g
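    # Sign conventions used above (per loss type): for WGAN the generator
    # maximizes the critic score on converted features (hence the minus signs);
    # for LSGAN it pulls the discriminator outputs toward 1; for CGAN it
    # minimizes the sigmoid cross-entropy against "real" labels. All three are
    # averaged over every discriminator output element.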
    def calc_clsloss_g(self, df_cls_ss, df_cls_st, df_cls_tt, df_cls_ts, k_s, k_t):
        device = df_cls_ss.device

        df_cls_ss = df_cls_ss.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_st = df_cls_st.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_tt = df_cls_tt.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_ts = df_cls_ts.permute(0,2,1).reshape(-1,self.n_spk)

        cf_ss = k_s*torch.ones(len(df_cls_ss), device=device, dtype=torch.long)
        cf_st = k_t*torch.ones(len(df_cls_st), device=device, dtype=torch.long)
        cf_tt = k_t*torch.ones(len(df_cls_tt), device=device, dtype=torch.long)
        cf_ts = k_s*torch.ones(len(df_cls_ts), device=device, dtype=torch.long)

        ClsLoss_g = (
            F.cross_entropy(df_cls_ss, cf_ss, reduction='sum') +
            F.cross_entropy(df_cls_st, cf_st, reduction='sum') +
            F.cross_entropy(df_cls_tt, cf_tt, reduction='sum') +
            F.cross_entropy(df_cls_ts, cf_ts, reduction='sum')
        ) / (df_cls_ss.numel() + df_cls_st.numel() + df_cls_tt.numel() + df_cls_ts.numel())

        return ClsLoss_g
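    # Note: the speaker classifier is trained on real features in
    # calc_clsloss_d below, while the loss above trains the generator so that
    # both reconstructed and converted features are classified as their
    # respective target speakers, in the spirit of the augmented-classifier
    # formulation of the TASLP 2020 paper.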
    def calc_advloss_d(self, x_s, x_t, xf_ts, xf_st, dr_adv_s, dr_adv_t, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts):
        device = x_s.device
        B_s = len(x_s)
        B_t = len(x_t)

        dr_adv_s = dr_adv_s.permute(0,2,1).reshape(-1,1)
        dr_adv_t = dr_adv_t.permute(0,2,1).reshape(-1,1)
        df_adv_ss = df_adv_ss.permute(0,2,1).reshape(-1,1)
        df_adv_st = df_adv_st.permute(0,2,1).reshape(-1,1)
        df_adv_tt = df_adv_tt.permute(0,2,1).reshape(-1,1)
        df_adv_ts = df_adv_ts.permute(0,2,1).reshape(-1,1)

        if self.loss_type=='wgan':
            # Wasserstein GAN with gradient penalty (WGAN-GP)
            AdvLoss_d_r = (
                torch.sum(-dr_adv_s) +
                torch.sum(-dr_adv_t)
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = (
                torch.sum(df_adv_ss) +
                torch.sum(df_adv_st) +
                torch.sum(df_adv_tt) +
                torch.sum(df_adv_ts)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        elif self.loss_type=='lsgan':
            # Least squares GAN (LSGAN)
            AdvLoss_d_r = 0.5 * (
                torch.sum((dr_adv_s - torch.ones_like(dr_adv_s))**2) +
                torch.sum((dr_adv_t - torch.ones_like(dr_adv_t))**2)
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = 0.5 * (
                torch.sum(df_adv_ss**2) +
                torch.sum(df_adv_st**2) +
                torch.sum(df_adv_tt**2) +
                torch.sum(df_adv_ts**2)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        elif self.loss_type=='cgan':
            # Regular GAN with the sigmoid cross-entropy criterion (CGAN)
            AdvLoss_d_r = (
                F.binary_cross_entropy_with_logits(dr_adv_s, torch.ones_like(dr_adv_s), reduction='sum') +
                F.binary_cross_entropy_with_logits(dr_adv_t, torch.ones_like(dr_adv_t), reduction='sum')
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = (
                F.binary_cross_entropy_with_logits(df_adv_ss, torch.zeros_like(df_adv_ss), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_st, torch.zeros_like(df_adv_st), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_tt, torch.zeros_like(df_adv_tt), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_ts, torch.zeros_like(df_adv_ts), reduction='sum')
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        # Gradient penalty loss (computed for every loss type; it is weighted by
        # w_grad in train.py)
        alpha_t = torch.rand(B_t, 1, 1, requires_grad=True).to(device)
        interpolates = alpha_t * x_t + ((1 - alpha_t) * xf_ts)
        interpolates = interpolates.to(device)
        disc_interpolates, _ = self.dis(interpolates)
        disc_interpolates = torch.sum(disc_interpolates)
        gradients = torch.autograd.grad(outputs=disc_interpolates,
                                        inputs=interpolates,
                                        grad_outputs=torch.ones(disc_interpolates.size()).to(device),
                                        create_graph=True, retain_graph=True, only_inputs=True)[0]
        gradnorm = torch.sqrt(torch.sum(gradients * gradients, (1, 2)))
        loss_gp_t = ((gradnorm - 1)**2).mean()

        alpha_s = torch.rand(B_s, 1, 1, requires_grad=True).to(device)
        interpolates = alpha_s * x_s + ((1 - alpha_s) * xf_st)
        interpolates = interpolates.to(device)
        disc_interpolates, _ = self.dis(interpolates)
        disc_interpolates = torch.sum(disc_interpolates)
        gradients = torch.autograd.grad(outputs=disc_interpolates,
                                        inputs=interpolates,
                                        grad_outputs=torch.ones(disc_interpolates.size()).to(device),
                                        create_graph=True, retain_graph=True, only_inputs=True)[0]
        gradnorm = torch.sqrt(torch.sum(gradients * gradients, (1, 2)))
        loss_gp_s = ((gradnorm - 1)**2).mean()

        GradLoss_d = loss_gp_s + loss_gp_t

        return AdvLoss_d, GradLoss_d

    def calc_clsloss_d(self, dr_cls_s, dr_cls_t, k_s, k_t):
        device = dr_cls_s.device

        dr_cls_s = dr_cls_s.permute(0,2,1).reshape(-1,self.n_spk)
        dr_cls_t = dr_cls_t.permute(0,2,1).reshape(-1,self.n_spk)

        cr_s = k_s*torch.ones(len(dr_cls_s), device=device, dtype=torch.long)
        cr_t = k_t*torch.ones(len(dr_cls_t), device=device, dtype=torch.long)

        ClsLoss_d = (
            F.cross_entropy(dr_cls_s, cr_s, reduction='sum') +
            F.cross_entropy(dr_cls_t, cr_t, reduction='sum')
        ) / (dr_cls_s.numel() + dr_cls_t.numel())

        return ClsLoss_d
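    # Naming convention for the generator outputs below: xf_st = x_s converted
    # to speaker t, xf_ts = x_t converted to speaker s, and xf_ss / xf_tt are
    # identity mappings used for the reconstruction loss.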
    def calc_gen_loss(self, x_s, x_t, k_s, k_t):
        # Generator outputs
        xf_ss = self.gen(x_s, k_s, k_s)
        xf_ts = self.gen(x_t, k_s, k_t)
        xf_tt = self.gen(x_t, k_t, k_t)
        xf_st = self.gen(x_s, k_t, k_s)

        # Discriminator outputs
        df_adv_ss, df_cls_ss = self.dis(xf_ss)
        df_adv_st, df_cls_st = self.dis(xf_st)
        df_adv_tt, df_cls_tt = self.dis(xf_tt)
        df_adv_ts, df_cls_ts = self.dis(xf_ts)

        # Adversarial loss
        AdvLoss_g = self.calc_advloss_g(df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts)

        # Classifier loss
        ClsLoss_g = self.calc_clsloss_g(df_cls_ss, df_cls_st, df_cls_tt, df_cls_ts, k_s, k_t)

        # Cycle-consistency loss
        CycLoss = (
            torch.sum(torch.abs(x_s - self.gen(xf_st, k_s, k_t))) +
            torch.sum(torch.abs(x_t - self.gen(xf_ts, k_t, k_s)))
        ) / (x_s.numel() + x_t.numel())

        # Reconstruction loss
        RecLoss = (
            torch.sum(torch.abs(x_s - xf_ss)) +
            torch.sum(torch.abs(x_t - xf_tt))
        ) / (x_s.numel() + x_t.numel())

        return AdvLoss_g, ClsLoss_g, CycLoss, RecLoss

    def calc_dis_loss(self, x_s, x_t, k_s, k_t):
        device = x_s.device

        # Generator outputs
        xf_ss = self.gen(x_s, k_s, k_s)
        xf_ts = self.gen(x_t, k_s, k_t)
        xf_tt = self.gen(x_t, k_t, k_t)
        xf_st = self.gen(x_s, k_t, k_s)

        # Discriminator outputs
        dr_adv_s, dr_cls_s = self.dis(x_s)
        dr_adv_t, dr_cls_t = self.dis(x_t)
        df_adv_ss, _ = self.dis(xf_ss)
        df_adv_st, _ = self.dis(xf_st)
        df_adv_tt, _ = self.dis(xf_tt)
        df_adv_ts, _ = self.dis(xf_ts)

        # Adversarial loss
        AdvLoss_d, GradLoss_d = self.calc_advloss_d(x_s, x_t, xf_ts, xf_st, dr_adv_s, dr_adv_t, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts)

        # Classifier loss
        ClsLoss_d = self.calc_clsloss_d(dr_cls_s, dr_cls_t, k_s, k_t)

        return AdvLoss_d, GradLoss_d, ClsLoss_d

--------------------------------------------------------------------------------
/normalize_features.py:
--------------------------------------------------------------------------------
import argparse
import joblib
import logging
import os

import h5py
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
import pickle

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

def melspec_transform(melspec, scaler):
    # melspec.shape: (n_freq, n_time)
    # scaler.transform assumes the first axis to be the time axis
    melspec = scaler.transform(melspec.T)
    melspec = melspec.T
    return melspec

def normalize_features(src_filepath, dst_filepath, melspec_transform):
    try:
        with h5py.File(src_filepath, "r") as f:
            melspec = f["melspec"][()]
        melspec = melspec_transform(melspec)

        if not os.path.exists(os.path.dirname(dst_filepath)):
            os.makedirs(os.path.dirname(dst_filepath), exist_ok=True)
        with h5py.File(dst_filepath, "w") as f:
            f.create_dataset("melspec", data=melspec)

        return melspec.shape

    except Exception:
        logging.info(f"{dst_filepath}...failed.")
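# Illustrative example (hypothetical array): for an 80 x T mel-spectrogram,
# melspec_transform standardizes each mel channel using the statistics fitted
# by compute_statistics.py, e.g.:
#   melspec = np.random.randn(80, 100).astype(np.float32)
#   normalized = melspec_transform(melspec, melspec_scaler)  # still 80 x 100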
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', type=str,
                        default='./dump/arctic/feat/train',
                        help='data folder that contains the raw features extracted from the training set')
    parser.add_argument('--dst', type=str, default='./dump/arctic/norm_feat/train',
                        help='data folder where the normalized features are stored')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl',
                        help='stat file used for normalization')
    parser.add_argument('--ext', type=str, default='.h5')
    args = parser.parse_args()

    src = args.src
    dst = args.dst
    ext = args.ext
    stat_filepath = args.stat

    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    logging.basicConfig(level=logging.INFO, format=fmt, datefmt=datafmt)

    melspec_scaler = StandardScaler()
    if os.path.exists(stat_filepath):
        with open(stat_filepath, mode='rb') as f:
            melspec_scaler = pickle.load(f)
        print('Loaded mel-spectrogram statistics successfully.')
    else:
        print('Stat file not found.')

    root = src
    fargs_list = [
        [
            f,
            f.replace(src, dst),
            lambda x: melspec_transform(x, melspec_scaler),
        ]
        for f in walk_files(root, ext)
    ]

    results = joblib.Parallel(n_jobs=16)(
        joblib.delayed(normalize_features)(*f) for f in tqdm(fargs_list)
    )

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/pwg/egs/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kamepong/StarGAN-VC/d68e10db5f29385bcb4a0473b94e3dc688b07bf6/pwg/egs/.gitkeep

--------------------------------------------------------------------------------
/pwg/parallel_wavegan/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kamepong/StarGAN-VC/d68e10db5f29385bcb4a0473b94e3dc688b07bf6/pwg/parallel_wavegan/.gitkeep

--------------------------------------------------------------------------------
/recipes/run_test.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_test.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
# Options:
#     -g: GPU device#
#     -e: Experiment name (e.g., "conv_exp1")
#     -c: Model checkpoint to load (0 indicates the newest model)
#     -v: Vocoder type ("hifigan.v1" or "parallel_wavegan.v1")

db_dir="/path/to/dataset/test"
dataset_name="mydataset"
gpu=0
checkpoint=0
vocoder_type="hifigan.v1"

while getopts "g:e:c:v:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        e ) exp_name=$OPTARG;;
        c ) checkpoint=$OPTARG;;
        v ) vocoder_type=$OPTARG;;
    esac
done

# If the -v option is given in abbreviated form, expand it
case ${vocoder_type} in
    "pwg" ) vocoder_type="parallel_wavegan.v1";;
    "hfg" ) vocoder_type="hifigan.v1";;
esac

echo "Experiment name: ${exp_name}, Vocoder: ${vocoder_type}"

dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
out_dir="./out/${dataset_name}"
model_dir="./model/${dataset_name}"
vocoder_dir="pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1"

python convert.py -g ${gpu} \
    --input ${db_dir} \
    --dataconf ${dconf_path} \
    --stat ${stat_path} \
    --out ${out_dir} \
    --model_rootdir ${model_dir} \
    --experiment_name ${exp_name} \
    --vocoder ${vocoder_type} \
    --voc_dir ${vocoder_dir} \
    --checkpoint ${checkpoint}
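# Converted waveforms are written by convert.py under
#   ${out_dir}/${exp_name}/<checkpoint_idx>/${vocoder_type}/<src_spk>2<trg_spk>/<utt>.wav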
--------------------------------------------------------------------------------
/recipes/run_test_arctic_4spk.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_test_arctic_4spk.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
# Options:
#     -g: GPU device#
#     -e: Experiment name (e.g., "conv_exp1")
#     -c: Model checkpoint to load (0 indicates the newest model)
#     -v: Vocoder type ("hifigan.v1" or "parallel_wavegan.v1")

db_dir="/misc/raid58/kameoka.hirokazu/db/arctic/wav/test"
dataset_name="arctic_4spk"
gpu=0
checkpoint=0
vocoder_type="hifigan.v1"

while getopts "g:e:c:v:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        e ) exp_name=$OPTARG;;
        c ) checkpoint=$OPTARG;;
        v ) vocoder_type=$OPTARG;;
    esac
done

# If the -v option is given in abbreviated form, expand it
case ${vocoder_type} in
    "pwg" ) vocoder_type="parallel_wavegan.v1";;
    "hfg" ) vocoder_type="hifigan.v1";;
esac

echo "Experiment name: ${exp_name}, Vocoder: ${vocoder_type}"

dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
out_dir="./out/${dataset_name}"
model_dir="./model/${dataset_name}"
vocoder_dir="pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1"

python convert.py -g ${gpu} \
    --input ${db_dir} \
    --dataconf ${dconf_path} \
    --stat ${stat_path} \
    --out ${out_dir} \
    --model_rootdir ${model_dir} \
    --experiment_name ${exp_name} \
    --vocoder ${vocoder_type} \
    --voc_dir ${vocoder_dir} \
    --checkpoint ${checkpoint}
--------------------------------------------------------------------------------
/recipes/run_train.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_train.sh [-g gpu] [-d db_dir] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
# Options:
#     -g: GPU device#
#     -d: Path to the dataset directory
#     -a: Architecture type ("conv" or "rnn")
#     -l: Loss type ("cgan", "wgan", or "lsgan")
#     -s: Stage to start (0 or 1)
#     -e: Experiment name (e.g., "exp1")

# Default values
db_dir="/path/to/dataset/training"
dataset_name="mydataset"
gpu=0
arch_type="conv"
loss_type="wgan"
start_stage=0
exp_name="conv_wgan_exp1"

while getopts "g:d:a:l:s:e:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        d ) db_dir=$OPTARG;;
        a ) arch_type=$OPTARG;;
        l ) loss_type=$OPTARG;;
        s ) start_stage=$OPTARG;;
        e ) exp_name=$OPTARG;;
    esac
done

feat_dir="./dump/${dataset_name}/feat/train"
dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
normfeat_dir="./dump/${dataset_name}/norm_feat/train"
model_dir="./model/${dataset_name}"
log_dir="./logs/${dataset_name}"

# Stage 0: Feature extraction
if [[ ${start_stage} -le 0 ]]; then
    python extract_features.py --src ${db_dir} --dst ${feat_dir} --conf ${dconf_path}
    python compute_statistics.py --src ${feat_dir} --stat ${stat_path}
    python normalize_features.py --src ${feat_dir} --dst ${normfeat_dir} --stat ${stat_path}
fi

# Stage 1: Model training
if [[ ${start_stage} -le 1 ]]; then
    python train.py -g ${gpu} \
        --data_rootdir ${normfeat_dir} \
        --model_rootdir ${model_dir} \
        --log_dir ${log_dir} \
        --arch_type ${arch_type} \
        --loss_type ${loss_type} \
        --experiment_name ${exp_name}
fi

--------------------------------------------------------------------------------
/recipes/run_train_arctic_4spk.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_train_arctic_4spk.sh [-g gpu] [-d db_dir] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
# Options:
#     -g: GPU device#
#     -d: Path to the dataset directory
#     -a: Architecture type ("conv" or "rnn")
#     -l: Loss type ("cgan", "wgan", or "lsgan")
#     -s: Stage to start (0 or 1)
#     -e: Experiment name (e.g., "exp1")

# Default values
db_dir="/misc/raid58/kameoka.hirokazu/db/arctic/wav/training"
dataset_name="arctic_4spk"
gpu=0
arch_type="conv"
loss_type="wgan"
start_stage=0
exp_name="conv_wgan_exp1"

while getopts "g:d:a:l:s:e:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        d ) db_dir=$OPTARG;;
        a ) arch_type=$OPTARG;;
        l ) loss_type=$OPTARG;;
        s ) start_stage=$OPTARG;;
        e ) exp_name=$OPTARG;;
    esac
done

feat_dir="./dump/${dataset_name}/feat/train"
dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
normfeat_dir="./dump/${dataset_name}/norm_feat/train"
model_dir="./model/${dataset_name}"
log_dir="./logs/${dataset_name}"

# Stage 0: Feature extraction
if [[ ${start_stage} -le 0 ]]; then
    python extract_features.py --src ${db_dir} --dst ${feat_dir} --conf ${dconf_path}
    python compute_statistics.py --src ${feat_dir} --stat ${stat_path}
    python normalize_features.py --src ${feat_dir} --dst ${normfeat_dir} --stat ${stat_path}
fi

# Stage 1: Model training
if [[ ${start_stage} -le 1 ]]; then
    python train.py -g ${gpu} \
        --data_rootdir ${normfeat_dir} \
        --model_rootdir ${model_dir} \
        --log_dir ${log_dir} \
        --arch_type ${arch_type} \
        --loss_type ${loss_type} \
        --experiment_name ${exp_name}
fi

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
apex==0.9.10dev
filelock==3.3.1
gdown==4.2.0
h5py==3.3.0
joblib==1.1.0
kaldiio==2.17.2
librosa==0.8.1
matplotlib==3.3.4
numpy==1.21.2
PyYAML==6.0
scikit_learn==1.0.1
scipy==1.7.1
six==1.16.0
SoundFile==0.10.3.post1
tensorboardX==2.4.1
tensorflow==2.7.0
torch==1.10.0
tqdm==4.62.3

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
import numpy as np
import os
import argparse
import json
import itertools
import logging
import warnings

import torch
from torch import optim
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader

from dataset import MultiDomain_Dataset, collate_fn
import net

def makedirs_if_not_exists(dir):
    if not os.path.exists(dir):
        os.makedirs(dir)

def comb(N, r):
    iterable = list(range(0, N))
    return list(itertools.combinations(iterable, r))

def Train(models, epochs, train_dataset, train_loader, optimizers, device, model_dir, log_path, config, snapshot=100, resume=0):
    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    if not os.path.exists(os.path.dirname(log_path)):
        os.makedirs(os.path.dirname(log_path))
    logging.basicConfig(filename=log_path, filemode='a', level=logging.INFO, format=fmt, datefmt=datafmt)
    writer = SummaryWriter(os.path.dirname(log_path))

    if not os.path.exists(model_dir):
        os.makedirs(model_dir)

    for tag in ['gen', 'dis']:
        checkpointpath = os.path.join(model_dir, '{}.{}.pt'.format(resume, tag))
        if os.path.exists(checkpointpath):
            checkpoint = torch.load(checkpointpath, map_location=device)
            models[tag].load_state_dict(checkpoint['model_state_dict'])
            optimizers[tag].load_state_dict(checkpoint['optimizer_state_dict'])
            print('{} loaded successfully.'.format(checkpointpath))

    w_adv = config['w_adv']
    w_grad = config['w_grad']
    w_cls = config['w_cls']
    w_cyc = config['w_cyc']
    w_rec = config['w_rec']
    gradient_clip = config['gradient_clip']

    print("===================================Training Started===================================")
    n_iter = 0
    for epoch in range(resume+1, epochs+1):
        b = 0
        for X_list in train_loader:
            n_spk = len(X_list)
            xin = []
            for s in range(n_spk):
                xin.append(torch.tensor(X_list[s]).to(device, dtype=torch.float))

            # List of speaker pairs
            spk_pair_list = comb(n_spk, 2)
            n_spk_pair = len(spk_pair_list)

            gen_loss_mean = 0
            dis_loss_mean = 0
            advloss_d_mean = 0
            gradloss_d_mean = 0
            advloss_g_mean = 0
            clsloss_d_mean = 0
            clsloss_g_mean = 0
            cycloss_mean = 0
            recloss_mean = 0
            # Iterate through all speaker pairs
            for m in range(n_spk_pair):
                s0 = spk_pair_list[m][0]
                s1 = spk_pair_list[m][1]

                AdvLoss_g, ClsLoss_g, CycLoss, RecLoss = models['stargan'].calc_gen_loss(xin[s0], xin[s1], s0, s1)
                gen_loss = (w_adv * AdvLoss_g + w_cls * ClsLoss_g + w_cyc * CycLoss + w_rec * RecLoss)

                models['gen'].zero_grad()
                gen_loss.backward()
                torch.nn.utils.clip_grad_norm_(models['gen'].parameters(), gradient_clip)
                optimizers['gen'].step()

                AdvLoss_d, GradLoss_d, ClsLoss_d = models['stargan'].calc_dis_loss(xin[s0], xin[s1], s0, s1)
                dis_loss = w_adv * AdvLoss_d + w_grad * GradLoss_d + w_cls * ClsLoss_d

                models['dis'].zero_grad()
                dis_loss.backward()
                torch.nn.utils.clip_grad_norm_(models['dis'].parameters(), gradient_clip)
                optimizers['dis'].step()
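                # Update order within each speaker pair: one generator step on
                # the weighted sum of adversarial/classifier/cycle/reconstruction
                # losses, then one discriminator step (adversarial + gradient
                # penalty + classifier terms). calc_dis_loss recomputes the
                # converted features after the generator update, so the
                # discriminator is trained on the refreshed outputs.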
                gen_loss_mean += gen_loss.item()
                dis_loss_mean += dis_loss.item()
                advloss_d_mean += AdvLoss_d.item()
                gradloss_d_mean += GradLoss_d.item()
                advloss_g_mean += AdvLoss_g.item()
                clsloss_d_mean += ClsLoss_d.item()
                clsloss_g_mean += ClsLoss_g.item()
                cycloss_mean += CycLoss.item()
                recloss_mean += RecLoss.item()

            gen_loss_mean /= n_spk_pair
            dis_loss_mean /= n_spk_pair
            advloss_d_mean /= n_spk_pair
            gradloss_d_mean /= n_spk_pair
            advloss_g_mean /= n_spk_pair
            clsloss_d_mean /= n_spk_pair
            clsloss_g_mean /= n_spk_pair
            cycloss_mean /= n_spk_pair
            recloss_mean /= n_spk_pair

            logging.info('epoch {}, mini-batch {}: AdvLoss_d={:.4f}, AdvLoss_g={:.4f}, GradLoss_d={:.4f}, ClsLoss_d={:.4f}, ClsLoss_g={:.4f}'
                         .format(epoch, b+1, w_adv*advloss_d_mean, w_adv*advloss_g_mean, w_grad*gradloss_d_mean, w_cls*clsloss_d_mean, w_cls*clsloss_g_mean))
            logging.info('epoch {}, mini-batch {}: CycLoss={:.4f}, RecLoss={:.4f}'.format(epoch, b+1, w_cyc*cycloss_mean, w_rec*recloss_mean))
            writer.add_scalars('Loss/Total_Loss', {'adv_loss_d': w_adv*advloss_d_mean,
                                                   'adv_loss_g': w_adv*advloss_g_mean,
                                                   'grad_loss_d': w_grad*gradloss_d_mean,
                                                   'cls_loss_d': w_cls*clsloss_d_mean,
                                                   'cls_loss_g': w_cls*clsloss_g_mean,
                                                   'cyc_loss': w_cyc*cycloss_mean,
                                                   'rec_loss': w_rec*recloss_mean}, n_iter)
            n_iter += 1
            b += 1

        if epoch % snapshot == 0:
            for tag in ['gen', 'dis']:
                print('save {} at {} epoch'.format(tag, epoch))
                torch.save({'epoch': epoch,
                            'model_state_dict': models[tag].state_dict(),
                            'optimizer_state_dict': optimizers[tag].state_dict()},
                           os.path.join(model_dir, '{}.{}.pt'.format(epoch, tag)))

    print("===================================Training Finished===================================")

def main():
    parser = argparse.ArgumentParser(description='StarGAN-VC')
    parser.add_argument('--gpu', '-g', type=int, default=-1, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('-ddir', '--data_rootdir', type=str, default='./dump/arctic/norm_feat/train',
                        help='root data folder that contains the normalized features')
    parser.add_argument('--epochs', '-epoch', default=2000, type=int, help='number of epochs to learn')
    parser.add_argument('--snapshot', '-snap', default=200, type=int, help='snapshot interval')
    parser.add_argument('--batch_size', '-batch', type=int, default=12, help='batch size')
    parser.add_argument('--num_mels', '-nm', type=int, default=80, help='number of mel channels')
    parser.add_argument('--arch_type', '-arc', default='conv', type=str, help='generator architecture type (conv or rnn)')
    parser.add_argument('--loss_type', '-los', default='wgan', type=str, help='type of adversarial loss (cgan, wgan, or lsgan)')
    parser.add_argument('--zdim', '-zd', type=int, default=16, help='dimension of bottleneck layer in generator')
    parser.add_argument('--hdim', '-hd', type=int, default=64, help='dimension of middle layers in generator')
    parser.add_argument('--mdim', '-md', type=int, default=32, help='dimension of middle layers in discriminator')
    parser.add_argument('--sdim', '-sd', type=int, default=16, help='dimension of speaker embedding')
    parser.add_argument('--lrate_g', '-lrg', default=0.0005, type=float, help='learning rate for G')
    parser.add_argument('--lrate_d', '-lrd', default=5e-6, type=float, help='learning rate for D/C')
    parser.add_argument('--gradient_clip', '-gclip', default=1.0, type=float, help='gradient clip')
    parser.add_argument('--w_adv', '-wa', default=1.0, type=float, help='weight on adversarial loss')
    parser.add_argument('--w_grad', '-wg', default=1.0, type=float, help='weight on gradient penalty loss')
    parser.add_argument('--w_cls', '-wcl', default=1.0, type=float, help='weight on classification loss')
    parser.add_argument('--w_cyc', '-wcy', default=1.0, type=float, help='weight on cycle consistency loss')
    parser.add_argument('--w_rec', '-wre', default=1.0, type=float, help='weight on reconstruction loss')
    parser.add_argument('--normtype', '-norm', default='IN', type=str, help='normalization type: LN, BN, or IN')
    parser.add_argument('--src_conditioning', '-srccon', default=0, type=int, help='with (1) or without (0) source conditioning')
    parser.add_argument('--resume', '-res', type=int, default=0, help='checkpoint to resume training from')
    parser.add_argument('--model_rootdir', '-mdir', type=str, default='./model/arctic/', help='model file directory')
    parser.add_argument('--log_dir', '-ldir', type=str, default='./logs/arctic/', help='log file directory')
    parser.add_argument('--experiment_name', '-exp', default='experiment1', type=str, help='experiment name')
    args = parser.parse_args()

    # Set up GPU
    if torch.cuda.is_available() and args.gpu >= 0:
        device = torch.device('cuda:%d' % args.gpu)
    else:
        device = torch.device('cpu')
    if device.type == 'cuda':
        torch.cuda.set_device(device)

    # Configuration for StarGAN
    num_mels = args.num_mels
    arch_type = args.arch_type
    loss_type = args.loss_type
    zdim = args.zdim
    hdim = args.hdim
    mdim = args.mdim
    sdim = args.sdim
    w_adv = args.w_adv
    w_grad = args.w_grad
    w_cls = args.w_cls
    w_cyc = args.w_cyc
    w_rec = args.w_rec
    lrate_g = args.lrate_g
    lrate_d = args.lrate_d
    gradient_clip = args.gradient_clip
    epochs = args.epochs
    batch_size = args.batch_size
    snapshot = args.snapshot
    resume = args.resume
    normtype = args.normtype
    src_conditioning = bool(args.src_conditioning)

    data_rootdir = args.data_rootdir
    spk_list = sorted(os.listdir(data_rootdir))
    n_spk = len(spk_list)
    melspec_dirs = [os.path.join(data_rootdir, spk) for spk in spk_list]

    model_config = {
        'num_mels': num_mels,
        'arch_type': arch_type,
        'loss_type': loss_type,
        'zdim': zdim,
        'hdim': hdim,
        'mdim': mdim,
        'sdim': sdim,
        'w_adv': w_adv,
        'w_grad': w_grad,
        'w_cls': w_cls,
        'w_cyc': w_cyc,
        'w_rec': w_rec,
        'lrate_g': lrate_g,
        'lrate_d': lrate_d,
        'gradient_clip': gradient_clip,
        'normtype': normtype,
        'epochs': epochs,
        'BatchSize': batch_size,
        'n_spk': n_spk,
        'spk_list': spk_list,
        'src_conditioning': src_conditioning
    }

    model_dir = os.path.join(args.model_rootdir, args.experiment_name)
    makedirs_if_not_exists(model_dir)
    log_path = os.path.join(args.log_dir, args.experiment_name, 'train_{}.log'.format(args.experiment_name))

    # Save configuration as a json file
    config_path = os.path.join(model_dir, 'model_config.json')
    with open(config_path, 'w') as outfile:
        json.dump(model_config, outfile, indent=4)
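    # At test time, convert.py reloads this model_config.json from
    # model/<dataset_name>/<experiment_name>/ to rebuild the same generator
    # before loading a checkpoint, so the file must be kept next to the .pt files.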
    if arch_type == 'conv':
        gen = net.Generator1(num_mels, n_spk, zdim, hdim, sdim, normtype, src_conditioning)
    elif arch_type == 'rnn':
        gen = net.Generator2(num_mels, n_spk, zdim, hdim, sdim, src_conditioning=src_conditioning)
    dis = net.Discriminator1(num_mels, n_spk, mdim, normtype)
    models = {
        'gen': gen,
        'dis': dis
    }
    models['stargan'] = net.StarGAN(models['gen'], models['dis'], n_spk, loss_type)

    optimizers = {
        'gen': optim.Adam(models['gen'].parameters(), lr=lrate_g, betas=(0.9, 0.999)),
        'dis': optim.Adam(models['dis'].parameters(), lr=lrate_d, betas=(0.5, 0.999))
    }

    for tag in ['gen', 'dis']:
        models[tag].to(device).train(mode=True)

    train_dataset = MultiDomain_Dataset(*melspec_dirs)
    train_loader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=0,
                              #num_workers=os.cpu_count(),
                              drop_last=True,
                              collate_fn=collate_fn)
    Train(models, epochs, train_dataset, train_loader, optimizers, device, model_dir, log_path, model_config, snapshot, resume)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------