├── README.md
├── compute_statistics.py
├── convert.py
├── dataset.py
├── extract_features.py
├── module.py
├── net.py
├── normalize_features.py
├── pwg
│   ├── egs
│   │   └── .gitkeep
│   └── parallel_wavegan
│       └── .gitkeep
├── recipes
│   ├── run_test.sh
│   ├── run_test_arctic_4spk.sh
│   ├── run_train.sh
│   └── run_train_arctic_4spk.sh
├── requirements.txt
└── train.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# StarGAN-VC

This repository provides an official PyTorch implementation of [StarGAN-VC](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc2/index.html).

StarGAN-VC is a nonparallel many-to-many voice conversion (VC) method using star generative adversarial networks (StarGAN). The current version performs VC by first modifying the mel-spectrogram of input speech of an arbitrary speaker in accordance with a target speaker index, and then generating a waveform from the modified mel-spectrogram using a speaker-independent neural vocoder (HiFi-GAN or Parallel WaveGAN).

Audio samples are available [here](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc2/index.html).

## Papers

- [Hirokazu Kameoka](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/index-e.html), [Takuhiro Kaneko](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/index.html), [Kou Tanaka](http://www.kecl.ntt.co.jp/people/tanaka.ko/index.html), and Nobukatsu Hojo, "**StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks**," in *Proc. 2018 IEEE Workshop on Spoken Language Technology ([SLT 2018](http://www.slt2018.org/))*, pp. 266-273, Dec. 2018. [**[Paper]**](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/publications/Kameoka2018SLT12_published.pdf)

- [Hirokazu Kameoka](http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/index-e.html), [Takuhiro Kaneko](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/index.html), [Kou Tanaka](http://www.kecl.ntt.co.jp/people/tanaka.ko/index.html), and Nobukatsu Hojo, "**Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks**," *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, vol. 28, pp. 2982-2995, 2020. [**[Paper]**](https://ieeexplore.ieee.org/document/9256995)

## Preparation

#### Requirements

- See `requirements.txt`.

#### Dataset

1. Set up your training and test sets. The data structure should look like:

```bash
/path/to/dataset/training
├── spk_1
│   ├── utt1.wav
│   ...
├── spk_2
│   ├── utt1.wav
│   ...
└── spk_N
    ├── utt1.wav
    ...

/path/to/dataset/test
├── spk_1
│   ├── utt1.wav
│   ...
├── spk_2
│   ├── utt1.wav
│   ...
└── spk_N
    ├── utt1.wav
    ...
```
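Note that the subfolder names double as speaker labels: `train.py` sorts the subfolders and uses each speaker's position in the sorted list as its class index, and `convert.py` resolves target speakers the same way. A minimal sketch of this mapping (folder names below are placeholders):

```python
import os

def speaker_index_map(data_rootdir):
    """Map speaker folder names to the class indices used by train.py/convert.py."""
    spk_list = sorted(os.listdir(data_rootdir))
    return {spk: k for k, spk in enumerate(spk_list)}

# e.g., speaker_index_map('/path/to/dataset/training')
# -> {'spk_1': 0, 'spk_2': 1, ..., 'spk_N': N-1}
```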
#### Waveform generator

1. Place a copy of the directory `parallel_wavegan` from https://github.com/kan-bayashi/ParallelWaveGAN in `pwg/`.
2. HiFi-GAN models trained on several databases can be found [here](https://drive.google.com/drive/folders/1RvagKsKaCih0qhRP6XkSF07r3uNFhB5T?usp=sharing). Once these are downloaded, place them in `pwg/egs/`. Please contact me if you have any problems downloading.
3. Optionally, Parallel WaveGAN can be used instead for waveform generation. The trained models are available [here](https://drive.google.com/drive/folders/1zRYZ9dx16dONn1SEuO4wXjjgJHaYSKwb?usp=sharing). Once these are downloaded, place them in `pwg/egs/`.

## Main

#### Train

To run all stages of model training, execute:

```bash
./recipes/run_train.sh [-g gpu] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
```

- Options:

```bash
-g: GPU device (default: 0)
# -1 indicates CPU
-a: Generator architecture type ("conv" or "rnn")
# conv: 1D fully convolutional network (default)
# rnn: Bidirectional long short-term memory network
-l: Loss type ("cgan", "wgan", or "lsgan")
# cgan: Cross-entropy GAN
# wgan: Wasserstein GAN with the gradient penalty loss (default)
# lsgan: Least squares GAN
-s: Stage to start from (0 or 1)
# Stages 0 and 1 correspond to feature extraction and model training, respectively.
-e: Experiment name (default: "conv_wgan_exp1")
# This name will be used at test time to specify which trained model to load.
```

- Examples:

```bash
# To run the training from scratch with the default settings:
./recipes/run_train.sh

# To skip the feature extraction stage:
./recipes/run_train.sh -s 1

# To set the GPU device to, say, 0:
./recipes/run_train.sh -g 0

# To use a generator with a recurrent architecture:
./recipes/run_train.sh -a rnn -e rnn_wgan_exp1

# To use the cross-entropy adversarial loss:
./recipes/run_train.sh -l cgan -e conv_cgan_exp1

# To use the least-squares adversarial loss:
./recipes/run_train.sh -l lsgan -e conv_lsgan_exp1
```

See the other scripts in `recipes` for examples of training on different datasets.

To monitor the training process, use TensorBoard:

```bash
tensorboard [--logdir log_path]
```
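For reference, training with the default variables in `recipes/run_train.sh` (dataset name `mydataset`, snapshot interval 200 epochs) writes its artifacts roughly as follows; the epoch numbers below are illustrative:

```bash
model/mydataset/conv_wgan_exp1
├── model_config.json   # training configuration saved by train.py
├── 200.gen.pt          # generator checkpoint
├── 200.dis.pt          # discriminator checkpoint
└── ...
logs/mydataset/conv_wgan_exp1
├── train_conv_wgan_exp1.log
└── (TensorBoard event files)
```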
#### Test

To perform conversion, execute:

```bash
./recipes/run_test.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
```

- Options:

```bash
-g: GPU device (default: 0)
# -1 indicates CPU
-e: Experiment name (e.g., "conv_wgan_exp1")
-c: Model checkpoint to load (default: 0)
# 0 indicates the newest model
-v: Vocoder type ("hfg" or "pwg")
# hfg: HiFi-GAN (default)
# pwg: Parallel WaveGAN
```

- Examples:

```bash
# To perform conversion with the default settings:
./recipes/run_test.sh -g 0 -e conv_wgan_exp1

# To use Parallel WaveGAN as an alternative for waveform generation:
./recipes/run_test.sh -g 0 -e conv_wgan_exp1 -v pwg
```

## Citation

If you find this work useful for your research, please cite our papers.

```
@inproceedings{Kameoka2018SLT_StarGAN-VC,
  author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
  booktitle={Proc. 2018 IEEE Spoken Language Technology Workshop (SLT)},
  title={StarGAN-VC: Non-parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks},
  year={2018},
  pages={266--273}}

@article{Kameoka2020IEEETrans_StarGAN-VC,
  author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
  title={Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2982--2995},
  year={2020}}
```

## Author

Hirokazu Kameoka ([@kamepong](https://github.com/kamepong))

E-mail: kame.hirokazu@gmail.com

--------------------------------------------------------------------------------
/compute_statistics.py:
--------------------------------------------------------------------------------
import argparse
import logging
import os
import h5py
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
import pickle

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

def read_melspec(filepath):
    with h5py.File(filepath, "r") as f:
        melspec = f["melspec"][()]  # n_mels x n_frame
    return melspec

def compute_statistics(src, stat_filepath):
    melspec_scaler = StandardScaler()

    filepath_list = list(walk_files(src, '.h5'))
    for filepath in tqdm(filepath_list):
        melspec = read_melspec(filepath)
        melspec_scaler.partial_fit(melspec.T)

    with open(stat_filepath, mode='wb') as f:
        pickle.dump(melspec_scaler, f)
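# A minimal usage sketch (paths are the defaults below): the pickled
# StandardScaler written above is what normalize_features.py and convert.py
# reload for feature normalization, e.g.:
#   with open('./dump/arctic/stat.pkl', 'rb') as f:
#       melspec_scaler = pickle.load(f)
#   melspec_norm = melspec_scaler.transform(melspec.T).T  # scaler expects time on axis 0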
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', type=str, default='./dump/arctic/feat/train')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl')
    args = parser.parse_args()

    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    logging.basicConfig(level=logging.INFO, format=fmt, datefmt=datafmt)

    src = args.src
    stat_filepath = args.stat
    if not os.path.exists(os.path.dirname(stat_filepath)):
        os.makedirs(os.path.dirname(stat_filepath))

    compute_statistics(src, stat_filepath)

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
# Copyright 2021 Hirokazu Kameoka

import os
import argparse
import torch
import json
import numpy as np
import re
import pickle
from tqdm import tqdm
import yaml

import librosa
import soundfile as sf
from sklearn.preprocessing import StandardScaler

import net
from extract_features import logmelfilterbank

import sys
sys.path.append(os.path.abspath("pwg"))
from pwg.parallel_wavegan.utils import load_model
from pwg.parallel_wavegan.utils import read_hdf5

def audio_transform(wav_filepath, scaler, kwargs, device):

    trim_silence = kwargs['trim_silence']
    top_db = kwargs['top_db']
    flen = kwargs['flen']
    fshift = kwargs['fshift']
    fmin = kwargs['fmin']
    fmax = kwargs['fmax']
    num_mels = kwargs['num_mels']
    fs = kwargs['fs']

    audio, fs_ = sf.read(wav_filepath)
    if trim_silence:
        audio, _ = librosa.effects.trim(audio, top_db=top_db, frame_length=2048, hop_length=512)
    if fs != fs_:
        audio = librosa.resample(audio, fs_, fs)
    melspec_raw = logmelfilterbank(audio, fs, fft_size=flen, hop_size=fshift,
                                   fmin=fmin, fmax=fmax, num_mels=num_mels)
    melspec_raw = melspec_raw.astype(np.float32)  # n_frame x n_mels

    melspec_norm = scaler.transform(melspec_raw)
    melspec_norm = melspec_norm.T  # n_mels x n_frame

    return torch.tensor(melspec_norm[None]).to(device, dtype=torch.float)
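# Shape convention (as implemented above): audio_transform returns a tensor of
# shape (1, num_mels, n_frame), i.e., a batch of one normalized mel-spectrogram,
# which is what the generator consumes. As a rough illustration (assuming
# 16 kHz audio and the 8 ms frame shift implied by the default vocoder
# directory name), a 2-second utterance with 80 mel channels yields a tensor
# of about (1, 80, 250).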
def extract_num(s, p, ret=0):
    search = p.search(s)
    if search:
        return int(search.groups()[0])
    else:
        return ret

def listdir_ext(dirpath, ext):
    p = re.compile(r'(\d+)')
    out = []
    for file in sorted(os.listdir(dirpath), key=lambda s: extract_num(s, p)):
        if os.path.splitext(file)[1] == ext:
            out.append(file)
    return out

def find_newest_model_file(model_dir, tag):
    mfile_list = os.listdir(model_dir)
    checkpoint = max([int(os.path.splitext(os.path.splitext(mfile)[0])[0]) for mfile in mfile_list if mfile.endswith('.{}.pt'.format(tag))])
    return checkpoint


def synthesis(melspec, model_nv, nv_config, savepath, device):
    ## Parallel WaveGAN / HiFi-GAN
    melspec = torch.tensor(melspec, dtype=torch.float).to(device)
    x = model_nv.inference(melspec).view(-1)

    # save as PCM 16 bit wav file
    if not os.path.exists(os.path.dirname(savepath)):
        os.makedirs(os.path.dirname(savepath))
    sf.write(savepath, x.detach().cpu().clone().numpy(), nv_config["sampling_rate"], "PCM_16")

def main():
    parser = argparse.ArgumentParser(description='Testing StarGAN-VC')
    parser.add_argument('--gpu', '-g', type=int, default=-1, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('-i', '--input', type=str, default='/misc/raid58/kameoka.hirokazu/python/db/arctic/wav/test',
                        help='root data folder that contains the wav files of input speech')
    parser.add_argument('-o', '--out', type=str, default='./out/arctic',
                        help='root data folder where the wav files of the converted speech will be saved.')
    parser.add_argument('--dataconf', type=str, default='./dump/arctic/data_config.json')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl', help='stat file used for normalization')
    parser.add_argument('--model_rootdir', '-mdir', type=str, default='./model/arctic/', help='model file directory')
    parser.add_argument('--checkpoint', '-ckpt', type=int, default=0, help='model checkpoint to load (0 indicates the newest model)')
    parser.add_argument('--experiment_name', '-exp', default='experiment1', type=str, help='experiment name')
    parser.add_argument('--vocoder', '-voc', default='hifigan.v1', type=str,
                        help='neural vocoder type name (e.g., hifigan.v1, hifigan.v2, parallel_wavegan.v1)')
    parser.add_argument('--voc_dir', '-vdir', type=str, default='pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1',
                        help='directory of trained neural vocoder')
    args = parser.parse_args()

    # Set up GPU
    if torch.cuda.is_available() and args.gpu >= 0:
        device = torch.device('cuda:%d' % args.gpu)
    else:
        device = torch.device('cpu')
    if device.type == 'cuda':
        torch.cuda.set_device(device)

    input_dir = args.input
    data_config_path = args.dataconf
    model_config_path = os.path.join(args.model_rootdir, args.experiment_name, 'model_config.json')
    with open(data_config_path) as f:
        data_config = json.load(f)
    with open(model_config_path) as f:
        model_config = json.load(f)
    checkpoint = args.checkpoint

    num_mels = model_config['num_mels']
    arch_type = model_config['arch_type']
    loss_type = model_config['loss_type']
    n_spk = model_config['n_spk']
    trg_spk_list = model_config['spk_list']
    zdim = model_config['zdim']
    hdim = model_config['hdim']
    mdim = model_config['mdim']
    sdim = model_config['sdim']
    normtype = model_config['normtype']
    src_conditioning = model_config['src_conditioning']

    stat_filepath = args.stat
    melspec_scaler = StandardScaler()
    if os.path.exists(stat_filepath):
        with open(stat_filepath, mode='rb') as f:
            melspec_scaler = pickle.load(f)
        print('Loaded mel-spectrogram statistics successfully.')
    else:
        print('Stat file not found.')

    # Set up main model
    gen = (net.Generator1(num_mels, n_spk, zdim, hdim, sdim, normtype, src_conditioning) if arch_type == 'conv'
           else net.Generator2(num_mels, n_spk, zdim, hdim, sdim, src_conditioning=src_conditioning))
    dis = net.Discriminator1(num_mels, n_spk, mdim, normtype)
    models = {
        'gen': gen,
        'dis': dis
    }
    models['stargan'] = net.StarGAN(models['gen'], models['dis'], n_spk, loss_type)

    for tag in ['gen', 'dis']:
        model_dir = os.path.join(args.model_rootdir, args.experiment_name)
        vc_checkpoint_idx = find_newest_model_file(model_dir, tag) if checkpoint <= 0 else checkpoint
        mfilename = '{}.{}.pt'.format(vc_checkpoint_idx, tag)
        path = os.path.join(args.model_rootdir, args.experiment_name, mfilename)
        model_checkpoint = torch.load(path, map_location=device)
        models[tag].load_state_dict(model_checkpoint['model_state_dict'])
        print('{}: {}'.format(tag, os.path.abspath(path)))

    for tag in ['gen', 'dis']:
        #models[tag].to(device).eval()
        models[tag].to(device).train(mode=True)

    # Set up the neural vocoder
    vocoder = args.vocoder
    voc_dir = args.voc_dir
    voc_yaml_path = os.path.join(voc_dir, 'conf', '{}.yaml'.format(vocoder))
    checkpointlist = listdir_ext(
        os.path.join(voc_dir, 'exp', 'train_nodev_all_{}'.format(vocoder)), '.pkl')
    nv_checkpoint = os.path.join(voc_dir, 'exp',
                                 'train_nodev_all_{}'.format(vocoder),
                                 checkpointlist[-1])  # Find and use the newest checkpoint model.
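    # Note (assumption about filenames): listdir_ext sorts by the first integer
    # found in each filename, so vocoder checkpoints named like, e.g.,
    # 'checkpoint-400000steps.pkl' are ordered by step count and
    # checkpointlist[-1] picks the one with the most training steps.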
    print('vocoder: {}'.format(os.path.abspath(nv_checkpoint)))

    with open(voc_yaml_path) as f:
        nv_config = yaml.load(f, Loader=yaml.Loader)
    nv_config.update(vars(args))
    model_nv = load_model(nv_checkpoint, nv_config)
    model_nv.remove_weight_norm()
    model_nv = model_nv.eval().to(device)

    src_spk_list = sorted(os.listdir(input_dir))

    for i, src_spk in enumerate(src_spk_list):
        src_wav_dir = os.path.join(input_dir, src_spk)
        for j, trg_spk in enumerate(trg_spk_list):
            if src_spk != trg_spk:
                print('Converting {}2{}...'.format(src_spk, trg_spk))
                for n, src_wav_filename in enumerate(os.listdir(src_wav_dir)):
                    src_wav_filepath = os.path.join(src_wav_dir, src_wav_filename)
                    src_melspec = audio_transform(src_wav_filepath, melspec_scaler, data_config, device)
                    k_t = j
                    k_s = i if src_conditioning else None

                    conv_melspec = models['stargan'](src_melspec, k_t, k_s)

                    conv_melspec = conv_melspec[0,:,:].detach().cpu().clone().numpy()
                    conv_melspec = conv_melspec.T  # n_frames x n_mels

                    out_wavpath = os.path.join(args.out, args.experiment_name, '{}'.format(vc_checkpoint_idx), vocoder, '{}2{}'.format(src_spk, trg_spk), src_wav_filename)
                    synthesis(conv_melspec, model_nv, nv_config, out_wavpath, device)


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/dataset.py:
--------------------------------------------------------------------------------
# Copyright 2021 Hirokazu Kameoka

import os
import numpy as np
import torch
from torch.utils.data import Dataset
import h5py
import math
import random

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

class MultiDomain_Dataset(Dataset):
    def __init__(self, *feat_dirs):
        self.n_domain = len(feat_dirs)
        self.filenames_all = [[os.path.join(d,t) for t in sorted(os.listdir(d))] for d in feat_dirs]
        #self.filenames_all = [[t for t in walk_files(d, '.h5')] for d in feat_dirs]
        self.feat_dirs = feat_dirs

    def __len__(self):
        return min(len(f) for f in self.filenames_all)

    def __getitem__(self, idx):
        melspec_list = []
        for d in range(self.n_domain):
            with h5py.File(self.filenames_all[d][idx], "r") as f:
                melspec = f["melspec"][()]  # n_freq x n_time
            melspec_list.append(melspec)
        return melspec_list

def collate_fn(batch):
    #batch[b][s]: melspec (n_freq x n_frame)
    #b: sample index within the batch
    #s: speaker index

    batchsize = len(batch)
    n_spk = len(batch[0])
    melspec_list = [[batch[b][s] for b in range(batchsize)] for s in range(n_spk)]
    #melspec_list[s][b]: melspec (n_freq x n_frame)

    n_freq = melspec_list[0][0].shape[0]

    X_list = []
    for s in range(n_spk):
        maxlen = 0
        for b in range(batchsize):
            if maxlen < melspec_list[s][b].shape[1]:
                maxlen = melspec_list[s][b].shape[1]
        # ... (the remainder of this file, along with extract_features.py,
        # module.py, and the beginning of net.py, is missing from this dump)

--------------------------------------------------------------------------------
/net.py:
--------------------------------------------------------------------------------
# (the beginning of this file is missing from this dump; the code below is the
# tail of the StarGAN class defined here)

        if n_frame > n_frame_:
            x = nn.ReplicationPad1d((0, n_frame-n_frame_))(x)
        return self.gen(x, k_t, k_s)[:,:,0:n_frame_]

    def calc_advloss_g(self, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts):
        df_adv_ss = df_adv_ss.permute(0,2,1).reshape(-1,1)
        df_adv_st = df_adv_st.permute(0,2,1).reshape(-1,1)
        df_adv_tt = df_adv_tt.permute(0,2,1).reshape(-1,1)
        df_adv_ts = df_adv_ts.permute(0,2,1).reshape(-1,1)

        if self.loss_type=='wgan':
            # Wasserstein GAN with gradient penalty (WGAN-GP)
            AdvLoss_g = (
                torch.sum(-df_adv_ss) +
                torch.sum(-df_adv_st) +
                torch.sum(-df_adv_tt) +
                torch.sum(-df_adv_ts)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        elif self.loss_type=='lsgan':
            # Least squares GAN (LSGAN)
            AdvLoss_g = 0.5 * (
                torch.sum((df_adv_ss - torch.ones_like(df_adv_ss))**2) +
                torch.sum((df_adv_st - torch.ones_like(df_adv_st))**2) +
                torch.sum((df_adv_tt - torch.ones_like(df_adv_tt))**2) +
                torch.sum((df_adv_ts - torch.ones_like(df_adv_ts))**2)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        elif self.loss_type=='cgan':
            # Regular GAN with the sigmoid cross-entropy criterion (CGAN)
            AdvLoss_g = (
                F.binary_cross_entropy_with_logits(df_adv_ss, torch.ones_like(df_adv_ss), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_st, torch.ones_like(df_adv_st), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_tt, torch.ones_like(df_adv_tt), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_ts, torch.ones_like(df_adv_ts), reduction='sum')
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())

        return AdvLoss_g
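    # Sign conventions used above (per loss type): for WGAN the generator
    # maximizes the critic score on converted features (hence the minus signs);
    # for LSGAN it pulls the discriminator outputs toward 1; for CGAN it
    # minimizes the sigmoid cross-entropy against "real" labels. All three are
    # averaged over every discriminator output element.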
    def calc_clsloss_g(self, df_cls_ss, df_cls_st, df_cls_tt, df_cls_ts, k_s, k_t):
        device = df_cls_ss.device

        df_cls_ss = df_cls_ss.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_st = df_cls_st.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_tt = df_cls_tt.permute(0,2,1).reshape(-1,self.n_spk)
        df_cls_ts = df_cls_ts.permute(0,2,1).reshape(-1,self.n_spk)

        cf_ss = k_s*torch.ones(len(df_cls_ss), device=device, dtype=torch.long)
        cf_st = k_t*torch.ones(len(df_cls_st), device=device, dtype=torch.long)
        cf_tt = k_t*torch.ones(len(df_cls_tt), device=device, dtype=torch.long)
        cf_ts = k_s*torch.ones(len(df_cls_ts), device=device, dtype=torch.long)

        ClsLoss_g = (
            F.cross_entropy(df_cls_ss, cf_ss, reduction='sum') +
            F.cross_entropy(df_cls_st, cf_st, reduction='sum') +
            F.cross_entropy(df_cls_tt, cf_tt, reduction='sum') +
            F.cross_entropy(df_cls_ts, cf_ts, reduction='sum')
        ) / (df_cls_ss.numel() + df_cls_st.numel() + df_cls_tt.numel() + df_cls_ts.numel())

        return ClsLoss_g
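    # Note: the speaker classifier is trained on real features in
    # calc_clsloss_d below, while the loss above trains the generator so that
    # both reconstructed and converted features are classified as their
    # respective target speakers, in the spirit of the augmented-classifier
    # formulation of the TASLP 2020 paper.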
    def calc_advloss_d(self, x_s, x_t, xf_ts, xf_st, dr_adv_s, dr_adv_t, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts):
        device = x_s.device
        B_s = len(x_s)
        B_t = len(x_t)

        dr_adv_s = dr_adv_s.permute(0,2,1).reshape(-1,1)
        dr_adv_t = dr_adv_t.permute(0,2,1).reshape(-1,1)
        df_adv_ss = df_adv_ss.permute(0,2,1).reshape(-1,1)
        df_adv_st = df_adv_st.permute(0,2,1).reshape(-1,1)
        df_adv_tt = df_adv_tt.permute(0,2,1).reshape(-1,1)
        df_adv_ts = df_adv_ts.permute(0,2,1).reshape(-1,1)

        if self.loss_type=='wgan':
            # Wasserstein GAN with gradient penalty (WGAN-GP)
            AdvLoss_d_r = (
                torch.sum(-dr_adv_s) +
                torch.sum(-dr_adv_t)
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = (
                torch.sum(df_adv_ss) +
                torch.sum(df_adv_st) +
                torch.sum(df_adv_tt) +
                torch.sum(df_adv_ts)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        elif self.loss_type=='lsgan':
            # Least squares GAN (LSGAN)
            AdvLoss_d_r = 0.5 * (
                torch.sum((dr_adv_s - torch.ones_like(dr_adv_s))**2) +
                torch.sum((dr_adv_t - torch.ones_like(dr_adv_t))**2)
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = 0.5 * (
                torch.sum(df_adv_ss**2) +
                torch.sum(df_adv_st**2) +
                torch.sum(df_adv_tt**2) +
                torch.sum(df_adv_ts**2)
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        elif self.loss_type=='cgan':
            # Regular GAN with the sigmoid cross-entropy criterion (CGAN)
            AdvLoss_d_r = (
                F.binary_cross_entropy_with_logits(dr_adv_s, torch.ones_like(dr_adv_s), reduction='sum') +
                F.binary_cross_entropy_with_logits(dr_adv_t, torch.ones_like(dr_adv_t), reduction='sum')
            ) / (dr_adv_s.numel() + dr_adv_t.numel())
            AdvLoss_d_f = (
                F.binary_cross_entropy_with_logits(df_adv_ss, torch.zeros_like(df_adv_ss), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_st, torch.zeros_like(df_adv_st), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_tt, torch.zeros_like(df_adv_tt), reduction='sum') +
                F.binary_cross_entropy_with_logits(df_adv_ts, torch.zeros_like(df_adv_ts), reduction='sum')
            ) / (df_adv_ss.numel() + df_adv_st.numel() + df_adv_tt.numel() + df_adv_ts.numel())
            AdvLoss_d = AdvLoss_d_r + AdvLoss_d_f

        # Gradient penalty loss (computed for every loss type; it is weighted by
        # w_grad in train.py)
        alpha_t = torch.rand(B_t, 1, 1, requires_grad=True).to(device)
        interpolates = alpha_t * x_t + ((1 - alpha_t) * xf_ts)
        interpolates = interpolates.to(device)
        disc_interpolates, _ = self.dis(interpolates)
        disc_interpolates = torch.sum(disc_interpolates)
        gradients = torch.autograd.grad(outputs=disc_interpolates,
                                        inputs=interpolates,
                                        grad_outputs=torch.ones(disc_interpolates.size()).to(device),
                                        create_graph=True, retain_graph=True, only_inputs=True)[0]
        gradnorm = torch.sqrt(torch.sum(gradients * gradients, (1, 2)))
        loss_gp_t = ((gradnorm - 1)**2).mean()

        alpha_s = torch.rand(B_s, 1, 1, requires_grad=True).to(device)
        interpolates = alpha_s * x_s + ((1 - alpha_s) * xf_st)
        interpolates = interpolates.to(device)
        disc_interpolates, _ = self.dis(interpolates)
        disc_interpolates = torch.sum(disc_interpolates)
        gradients = torch.autograd.grad(outputs=disc_interpolates,
                                        inputs=interpolates,
                                        grad_outputs=torch.ones(disc_interpolates.size()).to(device),
                                        create_graph=True, retain_graph=True, only_inputs=True)[0]
        gradnorm = torch.sqrt(torch.sum(gradients * gradients, (1, 2)))
        loss_gp_s = ((gradnorm - 1)**2).mean()

        GradLoss_d = loss_gp_s + loss_gp_t

        return AdvLoss_d, GradLoss_d

    def calc_clsloss_d(self, dr_cls_s, dr_cls_t, k_s, k_t):
        device = dr_cls_s.device

        dr_cls_s = dr_cls_s.permute(0,2,1).reshape(-1,self.n_spk)
        dr_cls_t = dr_cls_t.permute(0,2,1).reshape(-1,self.n_spk)

        cr_s = k_s*torch.ones(len(dr_cls_s), device=device, dtype=torch.long)
        cr_t = k_t*torch.ones(len(dr_cls_t), device=device, dtype=torch.long)

        ClsLoss_d = (
            F.cross_entropy(dr_cls_s, cr_s, reduction='sum') +
            F.cross_entropy(dr_cls_t, cr_t, reduction='sum')
        ) / (dr_cls_s.numel() + dr_cls_t.numel())

        return ClsLoss_d
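    # Naming convention for the generator outputs below: xf_st = x_s converted
    # to speaker t, xf_ts = x_t converted to speaker s, and xf_ss / xf_tt are
    # identity mappings used for the reconstruction loss.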
    def calc_gen_loss(self, x_s, x_t, k_s, k_t):
        # Generator outputs
        xf_ss = self.gen(x_s, k_s, k_s)
        xf_ts = self.gen(x_t, k_s, k_t)
        xf_tt = self.gen(x_t, k_t, k_t)
        xf_st = self.gen(x_s, k_t, k_s)

        # Discriminator outputs
        df_adv_ss, df_cls_ss = self.dis(xf_ss)
        df_adv_st, df_cls_st = self.dis(xf_st)
        df_adv_tt, df_cls_tt = self.dis(xf_tt)
        df_adv_ts, df_cls_ts = self.dis(xf_ts)

        # Adversarial loss
        AdvLoss_g = self.calc_advloss_g(df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts)

        # Classifier loss
        ClsLoss_g = self.calc_clsloss_g(df_cls_ss, df_cls_st, df_cls_tt, df_cls_ts, k_s, k_t)

        # Cycle-consistency loss
        CycLoss = (
            torch.sum(torch.abs(x_s - self.gen(xf_st, k_s, k_t))) +
            torch.sum(torch.abs(x_t - self.gen(xf_ts, k_t, k_s)))
        ) / (x_s.numel() + x_t.numel())

        # Reconstruction loss
        RecLoss = (
            torch.sum(torch.abs(x_s - xf_ss)) +
            torch.sum(torch.abs(x_t - xf_tt))
        ) / (x_s.numel() + x_t.numel())

        return AdvLoss_g, ClsLoss_g, CycLoss, RecLoss

    def calc_dis_loss(self, x_s, x_t, k_s, k_t):
        device = x_s.device

        # Generator outputs
        xf_ss = self.gen(x_s, k_s, k_s)
        xf_ts = self.gen(x_t, k_s, k_t)
        xf_tt = self.gen(x_t, k_t, k_t)
        xf_st = self.gen(x_s, k_t, k_s)

        # Discriminator outputs
        dr_adv_s, dr_cls_s = self.dis(x_s)
        dr_adv_t, dr_cls_t = self.dis(x_t)
        df_adv_ss, _ = self.dis(xf_ss)
        df_adv_st, _ = self.dis(xf_st)
        df_adv_tt, _ = self.dis(xf_tt)
        df_adv_ts, _ = self.dis(xf_ts)

        # Adversarial loss
        AdvLoss_d, GradLoss_d = self.calc_advloss_d(x_s, x_t, xf_ts, xf_st, dr_adv_s, dr_adv_t, df_adv_ss, df_adv_st, df_adv_tt, df_adv_ts)

        # Classifier loss
        ClsLoss_d = self.calc_clsloss_d(dr_cls_s, dr_cls_t, k_s, k_t)

        return AdvLoss_d, GradLoss_d, ClsLoss_d

--------------------------------------------------------------------------------
/normalize_features.py:
--------------------------------------------------------------------------------
import argparse
import joblib
import logging
import os

import h5py
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
import pickle

def walk_files(root, extension):
    for path, dirs, files in os.walk(root):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(path, file)

def melspec_transform(melspec, scaler):
    # melspec.shape: (n_freq, n_time)
    # scaler.transform assumes the first axis to be the time axis
    melspec = scaler.transform(melspec.T)
    melspec = melspec.T
    return melspec

def normalize_features(src_filepath, dst_filepath, melspec_transform):
    try:
        with h5py.File(src_filepath, "r") as f:
            melspec = f["melspec"][()]
        melspec = melspec_transform(melspec)

        if not os.path.exists(os.path.dirname(dst_filepath)):
            os.makedirs(os.path.dirname(dst_filepath), exist_ok=True)
        with h5py.File(dst_filepath, "w") as f:
            f.create_dataset("melspec", data=melspec)

        return melspec.shape

    except Exception:
        logging.info(f"{dst_filepath}...failed.")
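# Illustrative example (hypothetical array): for an 80 x T mel-spectrogram,
# melspec_transform standardizes each mel channel using the statistics fitted
# by compute_statistics.py, e.g.:
#   melspec = np.random.randn(80, 100).astype(np.float32)
#   normalized = melspec_transform(melspec, melspec_scaler)  # still 80 x 100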
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', type=str,
                        default='./dump/arctic/feat/train',
                        help='data folder that contains the raw features extracted from the training set')
    parser.add_argument('--dst', type=str, default='./dump/arctic/norm_feat/train',
                        help='data folder where the normalized features are stored')
    parser.add_argument('--stat', type=str, default='./dump/arctic/stat.pkl',
                        help='stat file used for normalization')
    parser.add_argument('--ext', type=str, default='.h5')
    args = parser.parse_args()

    src = args.src
    dst = args.dst
    ext = args.ext
    stat_filepath = args.stat

    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    logging.basicConfig(level=logging.INFO, format=fmt, datefmt=datafmt)

    melspec_scaler = StandardScaler()
    if os.path.exists(stat_filepath):
        with open(stat_filepath, mode='rb') as f:
            melspec_scaler = pickle.load(f)
        print('Loaded mel-spectrogram statistics successfully.')
    else:
        print('Stat file not found.')

    root = src
    fargs_list = [
        [
            f,
            f.replace(src, dst),
            lambda x: melspec_transform(x, melspec_scaler),
        ]
        for f in walk_files(root, ext)
    ]

    results = joblib.Parallel(n_jobs=16)(
        joblib.delayed(normalize_features)(*f) for f in tqdm(fargs_list)
    )

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/pwg/egs/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kamepong/StarGAN-VC/d68e10db5f29385bcb4a0473b94e3dc688b07bf6/pwg/egs/.gitkeep

--------------------------------------------------------------------------------
/pwg/parallel_wavegan/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kamepong/StarGAN-VC/d68e10db5f29385bcb4a0473b94e3dc688b07bf6/pwg/parallel_wavegan/.gitkeep

--------------------------------------------------------------------------------
/recipes/run_test.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_test.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
# Options:
#     -g: GPU device#
#     -e: Experiment name (e.g., "conv_exp1")
#     -c: Model checkpoint to load (0 indicates the newest model)
#     -v: Vocoder type ("hifigan.v1" or "parallel_wavegan.v1")

db_dir="/path/to/dataset/test"
dataset_name="mydataset"
gpu=0
checkpoint=0
vocoder_type="hifigan.v1"

while getopts "g:e:c:v:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        e ) exp_name=$OPTARG;;
        c ) checkpoint=$OPTARG;;
        v ) vocoder_type=$OPTARG;;
    esac
done

# If the -v option is given in abbreviated form, expand it
case ${vocoder_type} in
    "pwg" ) vocoder_type="parallel_wavegan.v1";;
    "hfg" ) vocoder_type="hifigan.v1";;
esac

echo "Experiment name: ${exp_name}, Vocoder: ${vocoder_type}"

dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
out_dir="./out/${dataset_name}"
model_dir="./model/${dataset_name}"
vocoder_dir="pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1"

python convert.py -g ${gpu} \
    --input ${db_dir} \
    --dataconf ${dconf_path} \
    --stat ${stat_path} \
    --out ${out_dir} \
    --model_rootdir ${model_dir} \
    --experiment_name ${exp_name} \
    --vocoder ${vocoder_type} \
    --voc_dir ${vocoder_dir} \
    --checkpoint ${checkpoint}
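# Converted waveforms are written by convert.py under
#   ${out_dir}/${exp_name}/<checkpoint_idx>/${vocoder_type}/<src_spk>2<trg_spk>/<utt>.wav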
--------------------------------------------------------------------------------
/recipes/run_test_arctic_4spk.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_test_arctic_4spk.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
# Options:
#     -g: GPU device#
#     -e: Experiment name (e.g., "conv_exp1")
#     -c: Model checkpoint to load (0 indicates the newest model)
#     -v: Vocoder type ("hifigan.v1" or "parallel_wavegan.v1")

db_dir="/misc/raid58/kameoka.hirokazu/db/arctic/wav/test"
dataset_name="arctic_4spk"
gpu=0
checkpoint=0
vocoder_type="hifigan.v1"

while getopts "g:e:c:v:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        e ) exp_name=$OPTARG;;
        c ) checkpoint=$OPTARG;;
        v ) vocoder_type=$OPTARG;;
    esac
done

# If the -v option is given in abbreviated form, expand it
case ${vocoder_type} in
    "pwg" ) vocoder_type="parallel_wavegan.v1";;
    "hfg" ) vocoder_type="hifigan.v1";;
esac

echo "Experiment name: ${exp_name}, Vocoder: ${vocoder_type}"

dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
out_dir="./out/${dataset_name}"
model_dir="./model/${dataset_name}"
vocoder_dir="pwg/egs/arctic_4spk_flen64ms_fshift8ms/voc1"

python convert.py -g ${gpu} \
    --input ${db_dir} \
    --dataconf ${dconf_path} \
    --stat ${stat_path} \
    --out ${out_dir} \
    --model_rootdir ${model_dir} \
    --experiment_name ${exp_name} \
    --vocoder ${vocoder_type} \
    --voc_dir ${vocoder_dir} \
    --checkpoint ${checkpoint}
--------------------------------------------------------------------------------
/recipes/run_train.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_train.sh [-g gpu] [-d db_dir] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
# Options:
#     -g: GPU device#
#     -d: Path to the dataset directory
#     -a: Architecture type ("conv" or "rnn")
#     -l: Loss type ("cgan", "wgan", or "lsgan")
#     -s: Stage to start (0 or 1)
#     -e: Experiment name (e.g., "exp1")

# Default values
db_dir="/path/to/dataset/training"
dataset_name="mydataset"
gpu=0
arch_type="conv"
loss_type="wgan"
start_stage=0
exp_name="conv_wgan_exp1"

while getopts "g:d:a:l:s:e:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        d ) db_dir=$OPTARG;;
        a ) arch_type=$OPTARG;;
        l ) loss_type=$OPTARG;;
        s ) start_stage=$OPTARG;;
        e ) exp_name=$OPTARG;;
    esac
done

feat_dir="./dump/${dataset_name}/feat/train"
dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
normfeat_dir="./dump/${dataset_name}/norm_feat/train"
model_dir="./model/${dataset_name}"
log_dir="./logs/${dataset_name}"

# Stage 0: Feature extraction
if [[ ${start_stage} -le 0 ]]; then
    python extract_features.py --src ${db_dir} --dst ${feat_dir} --conf ${dconf_path}
    python compute_statistics.py --src ${feat_dir} --stat ${stat_path}
    python normalize_features.py --src ${feat_dir} --dst ${normfeat_dir} --stat ${stat_path}
fi

# Stage 1: Model training
if [[ ${start_stage} -le 1 ]]; then
    python train.py -g ${gpu} \
        --data_rootdir ${normfeat_dir} \
        --model_rootdir ${model_dir} \
        --log_dir ${log_dir} \
        --arch_type ${arch_type} \
        --loss_type ${loss_type} \
        --experiment_name ${exp_name}
fi

--------------------------------------------------------------------------------
/recipes/run_train_arctic_4spk.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2021 Hirokazu Kameoka
#
# Usage:
# ./run_train_arctic_4spk.sh [-g gpu] [-d db_dir] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
# Options:
#     -g: GPU device#
#     -d: Path to the dataset directory
#     -a: Architecture type ("conv" or "rnn")
#     -l: Loss type ("cgan", "wgan", or "lsgan")
#     -s: Stage to start (0 or 1)
#     -e: Experiment name (e.g., "exp1")

# Default values
db_dir="/misc/raid58/kameoka.hirokazu/db/arctic/wav/training"
dataset_name="arctic_4spk"
gpu=0
arch_type="conv"
loss_type="wgan"
start_stage=0
exp_name="conv_wgan_exp1"

while getopts "g:d:a:l:s:e:" opt; do
    case $opt in
        g ) gpu=$OPTARG;;
        d ) db_dir=$OPTARG;;
        a ) arch_type=$OPTARG;;
        l ) loss_type=$OPTARG;;
        s ) start_stage=$OPTARG;;
        e ) exp_name=$OPTARG;;
    esac
done

feat_dir="./dump/${dataset_name}/feat/train"
dconf_path="./dump/${dataset_name}/data_config.json"
stat_path="./dump/${dataset_name}/stat.pkl"
normfeat_dir="./dump/${dataset_name}/norm_feat/train"
model_dir="./model/${dataset_name}"
log_dir="./logs/${dataset_name}"

# Stage 0: Feature extraction
if [[ ${start_stage} -le 0 ]]; then
    python extract_features.py --src ${db_dir} --dst ${feat_dir} --conf ${dconf_path}
    python compute_statistics.py --src ${feat_dir} --stat ${stat_path}
    python normalize_features.py --src ${feat_dir} --dst ${normfeat_dir} --stat ${stat_path}
fi

# Stage 1: Model training
if [[ ${start_stage} -le 1 ]]; then
    python train.py -g ${gpu} \
        --data_rootdir ${normfeat_dir} \
        --model_rootdir ${model_dir} \
        --log_dir ${log_dir} \
        --arch_type ${arch_type} \
        --loss_type ${loss_type} \
        --experiment_name ${exp_name}
fi

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
apex==0.9.10dev
filelock==3.3.1
gdown==4.2.0
h5py==3.3.0
joblib==1.1.0
kaldiio==2.17.2
librosa==0.8.1
matplotlib==3.3.4
numpy==1.21.2
PyYAML==6.0
scikit_learn==1.0.1
scipy==1.7.1
six==1.16.0
SoundFile==0.10.3.post1
tensorboardX==2.4.1
tensorflow==2.7.0
torch==1.10.0
tqdm==4.62.3

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
import numpy as np
import os
import argparse
import json
import itertools
import logging
import warnings

import torch
from torch import optim
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader

from dataset import MultiDomain_Dataset, collate_fn
import net

def makedirs_if_not_exists(dir):
    if not os.path.exists(dir):
        os.makedirs(dir)

def comb(N, r):
    iterable = list(range(0, N))
    return list(itertools.combinations(iterable, r))

def Train(models, epochs, train_dataset, train_loader, optimizers, device, model_dir, log_path, config, snapshot=100, resume=0):
    fmt = '%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s'
    datafmt = '%m/%d/%Y %I:%M:%S'
    if not os.path.exists(os.path.dirname(log_path)):
        os.makedirs(os.path.dirname(log_path))
    logging.basicConfig(filename=log_path, filemode='a', level=logging.INFO, format=fmt, datefmt=datafmt)
    writer = SummaryWriter(os.path.dirname(log_path))

    if not os.path.exists(model_dir):
        os.makedirs(model_dir)

    for tag in ['gen', 'dis']:
        checkpointpath = os.path.join(model_dir, '{}.{}.pt'.format(resume, tag))
        if os.path.exists(checkpointpath):
            checkpoint = torch.load(checkpointpath, map_location=device)
            models[tag].load_state_dict(checkpoint['model_state_dict'])
            optimizers[tag].load_state_dict(checkpoint['optimizer_state_dict'])
            print('{} loaded successfully.'.format(checkpointpath))

    w_adv = config['w_adv']
    w_grad = config['w_grad']
    w_cls = config['w_cls']
    w_cyc = config['w_cyc']
    w_rec = config['w_rec']
    gradient_clip = config['gradient_clip']

    print("===================================Training Started===================================")
    n_iter = 0
    for epoch in range(resume+1, epochs+1):
        b = 0
        for X_list in train_loader:
            n_spk = len(X_list)
            xin = []
            for s in range(n_spk):
                xin.append(torch.tensor(X_list[s]).to(device, dtype=torch.float))

            # List of speaker pairs
            spk_pair_list = comb(n_spk, 2)
            n_spk_pair = len(spk_pair_list)

            gen_loss_mean = 0
            dis_loss_mean = 0
            advloss_d_mean = 0
            gradloss_d_mean = 0
            advloss_g_mean = 0
            clsloss_d_mean = 0
            clsloss_g_mean = 0
            cycloss_mean = 0
            recloss_mean = 0
            # Iterate through all speaker pairs
            for m in range(n_spk_pair):
                s0 = spk_pair_list[m][0]
                s1 = spk_pair_list[m][1]

                AdvLoss_g, ClsLoss_g, CycLoss, RecLoss = models['stargan'].calc_gen_loss(xin[s0], xin[s1], s0, s1)
                gen_loss = (w_adv * AdvLoss_g + w_cls * ClsLoss_g + w_cyc * CycLoss + w_rec * RecLoss)

                models['gen'].zero_grad()
                gen_loss.backward()
                torch.nn.utils.clip_grad_norm_(models['gen'].parameters(), gradient_clip)
                optimizers['gen'].step()

                AdvLoss_d, GradLoss_d, ClsLoss_d = models['stargan'].calc_dis_loss(xin[s0], xin[s1], s0, s1)
                dis_loss = w_adv * AdvLoss_d + w_grad * GradLoss_d + w_cls * ClsLoss_d

                models['dis'].zero_grad()
                dis_loss.backward()
                torch.nn.utils.clip_grad_norm_(models['dis'].parameters(), gradient_clip)
                optimizers['dis'].step()
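                # Update order within each speaker pair: one generator step on
                # the weighted sum of adversarial/classifier/cycle/reconstruction
                # losses, then one discriminator step (adversarial + gradient
                # penalty + classifier terms). calc_dis_loss recomputes the
                # converted features after the generator update, so the
                # discriminator is trained on the refreshed outputs.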
                gen_loss_mean += gen_loss.item()
                dis_loss_mean += dis_loss.item()
                advloss_d_mean += AdvLoss_d.item()
                gradloss_d_mean += GradLoss_d.item()
                advloss_g_mean += AdvLoss_g.item()
                clsloss_d_mean += ClsLoss_d.item()
                clsloss_g_mean += ClsLoss_g.item()
                cycloss_mean += CycLoss.item()
                recloss_mean += RecLoss.item()

            gen_loss_mean /= n_spk_pair
            dis_loss_mean /= n_spk_pair
            advloss_d_mean /= n_spk_pair
            gradloss_d_mean /= n_spk_pair
            advloss_g_mean /= n_spk_pair
            clsloss_d_mean /= n_spk_pair
            clsloss_g_mean /= n_spk_pair
            cycloss_mean /= n_spk_pair
            recloss_mean /= n_spk_pair

            logging.info('epoch {}, mini-batch {}: AdvLoss_d={:.4f}, AdvLoss_g={:.4f}, GradLoss_d={:.4f}, ClsLoss_d={:.4f}, ClsLoss_g={:.4f}'
                         .format(epoch, b+1, w_adv*advloss_d_mean, w_adv*advloss_g_mean, w_grad*gradloss_d_mean, w_cls*clsloss_d_mean, w_cls*clsloss_g_mean))
            logging.info('epoch {}, mini-batch {}: CycLoss={:.4f}, RecLoss={:.4f}'.format(epoch, b+1, w_cyc*cycloss_mean, w_rec*recloss_mean))
            writer.add_scalars('Loss/Total_Loss', {'adv_loss_d': w_adv*advloss_d_mean,
                                                   'adv_loss_g': w_adv*advloss_g_mean,
                                                   'grad_loss_d': w_grad*gradloss_d_mean,
                                                   'cls_loss_d': w_cls*clsloss_d_mean,
                                                   'cls_loss_g': w_cls*clsloss_g_mean,
                                                   'cyc_loss': w_cyc*cycloss_mean,
                                                   'rec_loss': w_rec*recloss_mean}, n_iter)
            n_iter += 1
            b += 1

        if epoch % snapshot == 0:
            for tag in ['gen', 'dis']:
                print('save {} at {} epoch'.format(tag, epoch))
                torch.save({'epoch': epoch,
                            'model_state_dict': models[tag].state_dict(),
                            'optimizer_state_dict': optimizers[tag].state_dict()},
                           os.path.join(model_dir, '{}.{}.pt'.format(epoch, tag)))

    print("===================================Training Finished===================================")

def main():
    parser = argparse.ArgumentParser(description='StarGAN-VC')
    parser.add_argument('--gpu', '-g', type=int, default=-1, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('-ddir', '--data_rootdir', type=str, default='./dump/arctic/norm_feat/train',
                        help='root data folder that contains the normalized features')
    parser.add_argument('--epochs', '-epoch', default=2000, type=int, help='number of epochs to learn')
    parser.add_argument('--snapshot', '-snap', default=200, type=int, help='snapshot interval')
    parser.add_argument('--batch_size', '-batch', type=int, default=12, help='batch size')
    parser.add_argument('--num_mels', '-nm', type=int, default=80, help='number of mel channels')
    parser.add_argument('--arch_type', '-arc', default='conv', type=str, help='generator architecture type (conv or rnn)')
    parser.add_argument('--loss_type', '-los', default='wgan', type=str, help='type of adversarial loss (cgan, wgan, or lsgan)')
    parser.add_argument('--zdim', '-zd', type=int, default=16, help='dimension of bottleneck layer in generator')
    parser.add_argument('--hdim', '-hd', type=int, default=64, help='dimension of middle layers in generator')
    parser.add_argument('--mdim', '-md', type=int, default=32, help='dimension of middle layers in discriminator')
    parser.add_argument('--sdim', '-sd', type=int, default=16, help='dimension of speaker embedding')
    parser.add_argument('--lrate_g', '-lrg', default=0.0005, type=float, help='learning rate for G')
    parser.add_argument('--lrate_d', '-lrd', default=5e-6, type=float, help='learning rate for D/C')
    parser.add_argument('--gradient_clip', '-gclip', default=1.0, type=float, help='gradient clip')
    parser.add_argument('--w_adv', '-wa', default=1.0, type=float, help='weight on adversarial loss')
    parser.add_argument('--w_grad', '-wg', default=1.0, type=float, help='weight on gradient penalty loss')
    parser.add_argument('--w_cls', '-wcl', default=1.0, type=float, help='weight on classification loss')
    parser.add_argument('--w_cyc', '-wcy', default=1.0, type=float, help='weight on cycle consistency loss')
    parser.add_argument('--w_rec', '-wre', default=1.0, type=float, help='weight on reconstruction loss')
    parser.add_argument('--normtype', '-norm', default='IN', type=str, help='normalization type: LN, BN, or IN')
    parser.add_argument('--src_conditioning', '-srccon', default=0, type=int, help='with (1) or without (0) source conditioning')
    parser.add_argument('--resume', '-res', type=int, default=0, help='checkpoint to resume training from')
    parser.add_argument('--model_rootdir', '-mdir', type=str, default='./model/arctic/', help='model file directory')
    parser.add_argument('--log_dir', '-ldir', type=str, default='./logs/arctic/', help='log file directory')
    parser.add_argument('--experiment_name', '-exp', default='experiment1', type=str, help='experiment name')
    args = parser.parse_args()

    # Set up GPU
    if torch.cuda.is_available() and args.gpu >= 0:
        device = torch.device('cuda:%d' % args.gpu)
    else:
        device = torch.device('cpu')
    if device.type == 'cuda':
        torch.cuda.set_device(device)

    # Configuration for StarGAN
    num_mels = args.num_mels
    arch_type = args.arch_type
    loss_type = args.loss_type
    zdim = args.zdim
    hdim = args.hdim
    mdim = args.mdim
    sdim = args.sdim
    w_adv = args.w_adv
    w_grad = args.w_grad
    w_cls = args.w_cls
    w_cyc = args.w_cyc
    w_rec = args.w_rec
    lrate_g = args.lrate_g
    lrate_d = args.lrate_d
    gradient_clip = args.gradient_clip
    epochs = args.epochs
    batch_size = args.batch_size
    snapshot = args.snapshot
    resume = args.resume
    normtype = args.normtype
    src_conditioning = bool(args.src_conditioning)

    data_rootdir = args.data_rootdir
    spk_list = sorted(os.listdir(data_rootdir))
    n_spk = len(spk_list)
    melspec_dirs = [os.path.join(data_rootdir, spk) for spk in spk_list]

    model_config = {
        'num_mels': num_mels,
        'arch_type': arch_type,
        'loss_type': loss_type,
        'zdim': zdim,
        'hdim': hdim,
        'mdim': mdim,
        'sdim': sdim,
        'w_adv': w_adv,
        'w_grad': w_grad,
        'w_cls': w_cls,
        'w_cyc': w_cyc,
        'w_rec': w_rec,
        'lrate_g': lrate_g,
        'lrate_d': lrate_d,
        'gradient_clip': gradient_clip,
        'normtype': normtype,
        'epochs': epochs,
        'BatchSize': batch_size,
        'n_spk': n_spk,
        'spk_list': spk_list,
        'src_conditioning': src_conditioning
    }

    model_dir = os.path.join(args.model_rootdir, args.experiment_name)
    makedirs_if_not_exists(model_dir)
    log_path = os.path.join(args.log_dir, args.experiment_name, 'train_{}.log'.format(args.experiment_name))

    # Save configuration as a json file
    config_path = os.path.join(model_dir, 'model_config.json')
    with open(config_path, 'w') as outfile:
        json.dump(model_config, outfile, indent=4)
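    # At test time, convert.py reloads this model_config.json from
    # model/<dataset_name>/<experiment_name>/ to rebuild the same generator
    # before loading a checkpoint, so the file must be kept next to the .pt files.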
    if arch_type == 'conv':
        gen = net.Generator1(num_mels, n_spk, zdim, hdim, sdim, normtype, src_conditioning)
    elif arch_type == 'rnn':
        gen = net.Generator2(num_mels, n_spk, zdim, hdim, sdim, src_conditioning=src_conditioning)
    dis = net.Discriminator1(num_mels, n_spk, mdim, normtype)
    models = {
        'gen': gen,
        'dis': dis
    }
    models['stargan'] = net.StarGAN(models['gen'], models['dis'], n_spk, loss_type)

    optimizers = {
        'gen': optim.Adam(models['gen'].parameters(), lr=lrate_g, betas=(0.9, 0.999)),
        'dis': optim.Adam(models['dis'].parameters(), lr=lrate_d, betas=(0.5, 0.999))
    }

    for tag in ['gen', 'dis']:
        models[tag].to(device).train(mode=True)

    train_dataset = MultiDomain_Dataset(*melspec_dirs)
    train_loader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=0,
                              #num_workers=os.cpu_count(),
                              drop_last=True,
                              collate_fn=collate_fn)
    Train(models, epochs, train_dataset, train_loader, optimizers, device, model_dir, log_path, model_config, snapshot, resume)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------