├── README.md ├── Relative Attributes ├── 0013_000701.csv ├── 0013_Neutral_Angry.csv ├── BayesClass_RelAtt.m ├── BayesClass_RelAtt_unseen.m ├── Create_O_and_S_Mats_2D.m ├── GetTrainingSample_per_category.m ├── classlabel.csv ├── main.m ├── meanandvar_forcat.m ├── pre-processing.py ├── ranksvm_with_sim.m └── used_for_training_kun.csv ├── codes ├── acrnn_test.py ├── checkpoint │ └── checkpoint_5900 ├── distributed.py ├── hparams.py ├── hparams_1.py ├── hparams_update.py ├── inference.py ├── logger.py ├── logger_original.py ├── lstm_test.py ├── model │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── basic_layers.cpython-36.pyc │ │ ├── beam.cpython-36.pyc │ │ ├── decoder.cpython-36.pyc │ │ ├── layers.cpython-36.pyc │ │ ├── loss.cpython-36.pyc │ │ ├── lstm_test.cpython-36.pyc │ │ ├── model.cpython-36.pyc │ │ ├── penalties.cpython-36.pyc │ │ └── utils.cpython-36.pyc │ ├── basic_layers.py │ ├── beam.py │ ├── decoder.py │ ├── layers.py │ ├── loss.py │ ├── lstm_test.py │ ├── model.py │ ├── penalties.py │ ├── ser.py │ └── utils.py ├── multiproc.py ├── plotting_utils.py ├── reader │ ├── evaluation_spec_list.txt │ └── training_mel_list.txt ├── train.py └── train_ser.py ├── stage3_update.png └── train_ser.py /README.md: -------------------------------------------------------------------------------- 1 | # Emovox 2 | This is the implementation of the paper "Emotion Intensity and its Control for Emotional Voice Conversion". 3 | 4 | ![image info](./stage3_update.png) 5 | 6 | ## Database: 7 | We use the ESD database, an emotional speech database that can be downloaded here: https://hltsingapore.github.io/ESD/. In this paper, we choose speaker "0013" for all the experiments. To run the code, you first need to customize your data paths and generate phoneme transcriptions with Festival. More details can be found at https://github.com/jxzhanggg/nonparaSeq2seqVC_code. 8 | 9 | 10 | ## Step 1: Learning relative attributes 11 | 12 | ### 1) Extracting openSMILE features 13 | 14 | ```Bash 15 | python pre-processing.py 16 | ``` 17 | 18 | ### 2) Training the relative ranking function 19 | 20 | ```Matlab 21 | main.m 22 | ``` 23 | 24 | ## Step 2: Emotion recognizer training 25 | 26 | ```Bash 27 | python train_ser.py 28 | ``` 29 | 30 | ## Step 3: Emovox training 31 | 32 | ### 1) Style Pre-training 33 | 34 | You need to download the VCTK corpus and customize it accordingly, and then perform feature extraction: 35 | ```Bash 36 | $ cd reader 37 | $ python extract_features.py (please customize "path" and "kind", and edit the codes for "spec" or "mel-spec") 38 | $ python generate_list_mel.py 39 | ``` 40 | 41 | The pre-training procedure is the same as the pre-training in https://github.com/jxzhanggg/nonparaSeq2seqVC_code. You can download the pre-trained models from Stage I: Style Initialization here: https://drive.google.com/file/d/1oqk-PSREwpFNTyeREwcUry13WZ1LYl6U/view?usp=sharing. With the released pre-trained models, you can directly perform Stage II: Emotion Training. If you would like to pre-train it by yourself, you can try the following: 42 | ```Bash 43 | $ python train.py -l logdir \ 44 | -o outdir --n_gpus=1 --hparams=speaker_adversial_loss_w=20.,ce_loss=False,speaker_classifier_loss_w=0.1,contrastive_loss_w=30.
45 | ``` 46 | 47 | ### 2) Emotion training 48 | 49 | You need to download ESD corpus and customize it accordingly, and then perform feature extraction: 50 | ```Bash 51 | $ cd reader 52 | $ python extract.py (please customize "path" and "kind", and edit the codes for "spec" or "mel-spec") 53 | $ python generate_list_mel.py 54 | ``` 55 | 56 | ```Bash 57 | $ python train.py -l logdir \ 58 | -o outdir_emotion_IS --n_gpus=1 -c '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir/checkpoint_234000 (The path to your Pre-trained models from Stage I)' --warm_start 59 | ``` 60 | 61 | ## Step 4: Run-time conversion 62 | 63 | (1) Generate emotion embedding from the emotion encoder: 64 | 65 | Please remember to customize the paths in hparam.py... 66 | ```Bash 67 | $ cd conversion 68 | $ python inference_embedding.py -c '/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/outdir_emotion_update/checkpoint_3200 [YOUR EMOTION TRAINING CHECKPOINT]' --hparams speaker_A='Neutral',speaker_B='Happy',speaker_C='Sad',speaker_D='Angry',training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/reader/emotion_list/testing_mel_list.txt',SC_kernel_size=1 69 | ``` 70 | (2) Convert the source speech to the target emotion: [FOR EXAMPLE: convert emotion D to emotion A] 71 | ```Bash 72 | $ cd conversion 73 | $ python inference_A.py -c '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir_emotion_update/checkpoint_3200[YOUR EMOTION TRAINING CHECKPOINT]' --num 20 --hparams validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/reader/emotion_list/evaluation_mel_list.txt',SC_kernel_size=1 74 | ``` 75 | Please customize inference.py to generate your intended emotion type. 76 | 77 | 78 | ## Training log 79 | 80 | # Still under construction ... 81 | -------------------------------------------------------------------------------- /Relative Attributes/0013_000701.csv: -------------------------------------------------------------------------------- 1 | 
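The CSV that follows, 0013_000701.csv, is the raw openSMILE IS09 output for a single utterance (384 feature statistics plus the name/frameTime columns). main.m expects these per-utterance rows stacked into one matrix per speaker and emotion pair (e.g. 0013_Neutral_Angry.csv, 700 utterances x 384 features); the stacking script itself is not included in the repository, so the following is only a minimal sketch, assuming the output/<speaker>/<emotion>/ layout written by pre-processing.py (all paths and file names are placeholders):

```Matlab
% Minimal sketch (not part of the released code): stack the per-utterance
% IS09 feature vectors of two emotions into one matrix for main.m.
% The folder layout and file names are assumptions based on pre-processing.py.
out_root = '/Users/kun/Desktop/workspace/output/0013';   % assumed output root of pre-processing.py
emotions = {'Neutral', 'Angry'};                         % 350 utterances per emotion in ESD
features = [];
for e = 1:numel(emotions)
    files = dir(fullfile(out_root, emotions{e}, '*.csv'));
    for f = 1:numel(files)
        T = readtable(fullfile(files(f).folder, files(f).name), 'Delimiter', ';');
        features = [features; T{1, 3:end}];   % drop the name/frameTime columns
    end
end
csvwrite('0013_Neutral_Angry_features.csv', features);   % 700 x 384 numeric matrix
```

This sketch writes a purely numeric matrix; the pair files shipped with the repository keep a header row and two leading columns, which is why main.m reads them with csvread(..., 1, 2). Adjust the read offsets to whichever format you generate.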
name;frameTime;pcm_RMSenergy_sma_max;pcm_RMSenergy_sma_min;pcm_RMSenergy_sma_range;pcm_RMSenergy_sma_maxPos;pcm_RMSenergy_sma_minPos;pcm_RMSenergy_sma_amean;pcm_RMSenergy_sma_linregc1;pcm_RMSenergy_sma_linregc2;pcm_RMSenergy_sma_linregerrQ;pcm_RMSenergy_sma_stddev;pcm_RMSenergy_sma_skewness;pcm_RMSenergy_sma_kurtosis;pcm_fftMag_mfcc_sma[1]_max;pcm_fftMag_mfcc_sma[1]_min;pcm_fftMag_mfcc_sma[1]_range;pcm_fftMag_mfcc_sma[1]_maxPos;pcm_fftMag_mfcc_sma[1]_minPos;pcm_fftMag_mfcc_sma[1]_amean;pcm_fftMag_mfcc_sma[1]_linregc1;pcm_fftMag_mfcc_sma[1]_linregc2;pcm_fftMag_mfcc_sma[1]_linregerrQ;pcm_fftMag_mfcc_sma[1]_stddev;pcm_fftMag_mfcc_sma[1]_skewness;pcm_fftMag_mfcc_sma[1]_kurtosis;pcm_fftMag_mfcc_sma[2]_max;pcm_fftMag_mfcc_sma[2]_min;pcm_fftMag_mfcc_sma[2]_range;pcm_fftMag_mfcc_sma[2]_maxPos;pcm_fftMag_mfcc_sma[2]_minPos;pcm_fftMag_mfcc_sma[2]_amean;pcm_fftMag_mfcc_sma[2]_linregc1;pcm_fftMag_mfcc_sma[2]_linregc2;pcm_fftMag_mfcc_sma[2]_linregerrQ;pcm_fftMag_mfcc_sma[2]_stddev;pcm_fftMag_mfcc_sma[2]_skewness;pcm_fftMag_mfcc_sma[2]_kurtosis;pcm_fftMag_mfcc_sma[3]_max;pcm_fftMag_mfcc_sma[3]_min;pcm_fftMag_mfcc_sma[3]_range;pcm_fftMag_mfcc_sma[3]_maxPos;pcm_fftMag_mfcc_sma[3]_minPos;pcm_fftMag_mfcc_sma[3]_amean;pcm_fftMag_mfcc_sma[3]_linregc1;pcm_fftMag_mfcc_sma[3]_linregc2;pcm_fftMag_mfcc_sma[3]_linregerrQ;pcm_fftMag_mfcc_sma[3]_stddev;pcm_fftMag_mfcc_sma[3]_skewness;pcm_fftMag_mfcc_sma[3]_kurtosis;pcm_fftMag_mfcc_sma[4]_max;pcm_fftMag_mfcc_sma[4]_min;pcm_fftMag_mfcc_sma[4]_range;pcm_fftMag_mfcc_sma[4]_maxPos;pcm_fftMag_mfcc_sma[4]_minPos;pcm_fftMag_mfcc_sma[4]_amean;pcm_fftMag_mfcc_sma[4]_linregc1;pcm_fftMag_mfcc_sma[4]_linregc2;pcm_fftMag_mfcc_sma[4]_linregerrQ;pcm_fftMag_mfcc_sma[4]_stddev;pcm_fftMag_mfcc_sma[4]_skewness;pcm_fftMag_mfcc_sma[4]_kurtosis;pcm_fftMag_mfcc_sma[5]_max;pcm_fftMag_mfcc_sma[5]_min;pcm_fftMag_mfcc_sma[5]_range;pcm_fftMag_mfcc_sma[5]_maxPos;pcm_fftMag_mfcc_sma[5]_minPos;pcm_fftMag_mfcc_sma[5]_amean;pcm_fftMag_mfcc_sma[5]_linregc1;pcm_fftMag_mfcc_sma[5]_linregc2;pcm_fftMag_mfcc_sma[5]_linregerrQ;pcm_fftMag_mfcc_sma[5]_stddev;pcm_fftMag_mfcc_sma[5]_skewness;pcm_fftMag_mfcc_sma[5]_kurtosis;pcm_fftMag_mfcc_sma[6]_max;pcm_fftMag_mfcc_sma[6]_min;pcm_fftMag_mfcc_sma[6]_range;pcm_fftMag_mfcc_sma[6]_maxPos;pcm_fftMag_mfcc_sma[6]_minPos;pcm_fftMag_mfcc_sma[6]_amean;pcm_fftMag_mfcc_sma[6]_linregc1;pcm_fftMag_mfcc_sma[6]_linregc2;pcm_fftMag_mfcc_sma[6]_linregerrQ;pcm_fftMag_mfcc_sma[6]_stddev;pcm_fftMag_mfcc_sma[6]_skewness;pcm_fftMag_mfcc_sma[6]_kurtosis;pcm_fftMag_mfcc_sma[7]_max;pcm_fftMag_mfcc_sma[7]_min;pcm_fftMag_mfcc_sma[7]_range;pcm_fftMag_mfcc_sma[7]_maxPos;pcm_fftMag_mfcc_sma[7]_minPos;pcm_fftMag_mfcc_sma[7]_amean;pcm_fftMag_mfcc_sma[7]_linregc1;pcm_fftMag_mfcc_sma[7]_linregc2;pcm_fftMag_mfcc_sma[7]_linregerrQ;pcm_fftMag_mfcc_sma[7]_stddev;pcm_fftMag_mfcc_sma[7]_skewness;pcm_fftMag_mfcc_sma[7]_kurtosis;pcm_fftMag_mfcc_sma[8]_max;pcm_fftMag_mfcc_sma[8]_min;pcm_fftMag_mfcc_sma[8]_range;pcm_fftMag_mfcc_sma[8]_maxPos;pcm_fftMag_mfcc_sma[8]_minPos;pcm_fftMag_mfcc_sma[8]_amean;pcm_fftMag_mfcc_sma[8]_linregc1;pcm_fftMag_mfcc_sma[8]_linregc2;pcm_fftMag_mfcc_sma[8]_linregerrQ;pcm_fftMag_mfcc_sma[8]_stddev;pcm_fftMag_mfcc_sma[8]_skewness;pcm_fftMag_mfcc_sma[8]_kurtosis;pcm_fftMag_mfcc_sma[9]_max;pcm_fftMag_mfcc_sma[9]_min;pcm_fftMag_mfcc_sma[9]_range;pcm_fftMag_mfcc_sma[9]_maxPos;pcm_fftMag_mfcc_sma[9]_minPos;pcm_fftMag_mfcc_sma[9]_amean;pcm_fftMag_mfcc_sma[9]_linregc1;pcm_fftMag_mfcc_sma[9]_linregc2;pcm_fftMag_mfcc_sma[9]_linregerrQ;pcm_fftMag_mfcc_sma[9]_stddev;pcm_fftMag_mfcc_sma[9]_s
kewness;pcm_fftMag_mfcc_sma[9]_kurtosis;pcm_fftMag_mfcc_sma[10]_max;pcm_fftMag_mfcc_sma[10]_min;pcm_fftMag_mfcc_sma[10]_range;pcm_fftMag_mfcc_sma[10]_maxPos;pcm_fftMag_mfcc_sma[10]_minPos;pcm_fftMag_mfcc_sma[10]_amean;pcm_fftMag_mfcc_sma[10]_linregc1;pcm_fftMag_mfcc_sma[10]_linregc2;pcm_fftMag_mfcc_sma[10]_linregerrQ;pcm_fftMag_mfcc_sma[10]_stddev;pcm_fftMag_mfcc_sma[10]_skewness;pcm_fftMag_mfcc_sma[10]_kurtosis;pcm_fftMag_mfcc_sma[11]_max;pcm_fftMag_mfcc_sma[11]_min;pcm_fftMag_mfcc_sma[11]_range;pcm_fftMag_mfcc_sma[11]_maxPos;pcm_fftMag_mfcc_sma[11]_minPos;pcm_fftMag_mfcc_sma[11]_amean;pcm_fftMag_mfcc_sma[11]_linregc1;pcm_fftMag_mfcc_sma[11]_linregc2;pcm_fftMag_mfcc_sma[11]_linregerrQ;pcm_fftMag_mfcc_sma[11]_stddev;pcm_fftMag_mfcc_sma[11]_skewness;pcm_fftMag_mfcc_sma[11]_kurtosis;pcm_fftMag_mfcc_sma[12]_max;pcm_fftMag_mfcc_sma[12]_min;pcm_fftMag_mfcc_sma[12]_range;pcm_fftMag_mfcc_sma[12]_maxPos;pcm_fftMag_mfcc_sma[12]_minPos;pcm_fftMag_mfcc_sma[12]_amean;pcm_fftMag_mfcc_sma[12]_linregc1;pcm_fftMag_mfcc_sma[12]_linregc2;pcm_fftMag_mfcc_sma[12]_linregerrQ;pcm_fftMag_mfcc_sma[12]_stddev;pcm_fftMag_mfcc_sma[12]_skewness;pcm_fftMag_mfcc_sma[12]_kurtosis;pcm_zcr_sma_max;pcm_zcr_sma_min;pcm_zcr_sma_range;pcm_zcr_sma_maxPos;pcm_zcr_sma_minPos;pcm_zcr_sma_amean;pcm_zcr_sma_linregc1;pcm_zcr_sma_linregc2;pcm_zcr_sma_linregerrQ;pcm_zcr_sma_stddev;pcm_zcr_sma_skewness;pcm_zcr_sma_kurtosis;voiceProb_sma_max;voiceProb_sma_min;voiceProb_sma_range;voiceProb_sma_maxPos;voiceProb_sma_minPos;voiceProb_sma_amean;voiceProb_sma_linregc1;voiceProb_sma_linregc2;voiceProb_sma_linregerrQ;voiceProb_sma_stddev;voiceProb_sma_skewness;voiceProb_sma_kurtosis;F0_sma_max;F0_sma_min;F0_sma_range;F0_sma_maxPos;F0_sma_minPos;F0_sma_amean;F0_sma_linregc1;F0_sma_linregc2;F0_sma_linregerrQ;F0_sma_stddev;F0_sma_skewness;F0_sma_kurtosis;pcm_RMSenergy_sma_de_max;pcm_RMSenergy_sma_de_min;pcm_RMSenergy_sma_de_range;pcm_RMSenergy_sma_de_maxPos;pcm_RMSenergy_sma_de_minPos;pcm_RMSenergy_sma_de_amean;pcm_RMSenergy_sma_de_linregc1;pcm_RMSenergy_sma_de_linregc2;pcm_RMSenergy_sma_de_linregerrQ;pcm_RMSenergy_sma_de_stddev;pcm_RMSenergy_sma_de_skewness;pcm_RMSenergy_sma_de_kurtosis;pcm_fftMag_mfcc_sma_de[1]_max;pcm_fftMag_mfcc_sma_de[1]_min;pcm_fftMag_mfcc_sma_de[1]_range;pcm_fftMag_mfcc_sma_de[1]_maxPos;pcm_fftMag_mfcc_sma_de[1]_minPos;pcm_fftMag_mfcc_sma_de[1]_amean;pcm_fftMag_mfcc_sma_de[1]_linregc1;pcm_fftMag_mfcc_sma_de[1]_linregc2;pcm_fftMag_mfcc_sma_de[1]_linregerrQ;pcm_fftMag_mfcc_sma_de[1]_stddev;pcm_fftMag_mfcc_sma_de[1]_skewness;pcm_fftMag_mfcc_sma_de[1]_kurtosis;pcm_fftMag_mfcc_sma_de[2]_max;pcm_fftMag_mfcc_sma_de[2]_min;pcm_fftMag_mfcc_sma_de[2]_range;pcm_fftMag_mfcc_sma_de[2]_maxPos;pcm_fftMag_mfcc_sma_de[2]_minPos;pcm_fftMag_mfcc_sma_de[2]_amean;pcm_fftMag_mfcc_sma_de[2]_linregc1;pcm_fftMag_mfcc_sma_de[2]_linregc2;pcm_fftMag_mfcc_sma_de[2]_linregerrQ;pcm_fftMag_mfcc_sma_de[2]_stddev;pcm_fftMag_mfcc_sma_de[2]_skewness;pcm_fftMag_mfcc_sma_de[2]_kurtosis;pcm_fftMag_mfcc_sma_de[3]_max;pcm_fftMag_mfcc_sma_de[3]_min;pcm_fftMag_mfcc_sma_de[3]_range;pcm_fftMag_mfcc_sma_de[3]_maxPos;pcm_fftMag_mfcc_sma_de[3]_minPos;pcm_fftMag_mfcc_sma_de[3]_amean;pcm_fftMag_mfcc_sma_de[3]_linregc1;pcm_fftMag_mfcc_sma_de[3]_linregc2;pcm_fftMag_mfcc_sma_de[3]_linregerrQ;pcm_fftMag_mfcc_sma_de[3]_stddev;pcm_fftMag_mfcc_sma_de[3]_skewness;pcm_fftMag_mfcc_sma_de[3]_kurtosis;pcm_fftMag_mfcc_sma_de[4]_max;pcm_fftMag_mfcc_sma_de[4]_min;pcm_fftMag_mfcc_sma_de[4]_range;pcm_fftMag_mfcc_sma_de[4]_maxPos;pcm_fftMag_mfcc_sma_de[4]_minPos;pcm_fftMag_mfcc_sma_de[4]_ame
an;pcm_fftMag_mfcc_sma_de[4]_linregc1;pcm_fftMag_mfcc_sma_de[4]_linregc2;pcm_fftMag_mfcc_sma_de[4]_linregerrQ;pcm_fftMag_mfcc_sma_de[4]_stddev;pcm_fftMag_mfcc_sma_de[4]_skewness;pcm_fftMag_mfcc_sma_de[4]_kurtosis;pcm_fftMag_mfcc_sma_de[5]_max;pcm_fftMag_mfcc_sma_de[5]_min;pcm_fftMag_mfcc_sma_de[5]_range;pcm_fftMag_mfcc_sma_de[5]_maxPos;pcm_fftMag_mfcc_sma_de[5]_minPos;pcm_fftMag_mfcc_sma_de[5]_amean;pcm_fftMag_mfcc_sma_de[5]_linregc1;pcm_fftMag_mfcc_sma_de[5]_linregc2;pcm_fftMag_mfcc_sma_de[5]_linregerrQ;pcm_fftMag_mfcc_sma_de[5]_stddev;pcm_fftMag_mfcc_sma_de[5]_skewness;pcm_fftMag_mfcc_sma_de[5]_kurtosis;pcm_fftMag_mfcc_sma_de[6]_max;pcm_fftMag_mfcc_sma_de[6]_min;pcm_fftMag_mfcc_sma_de[6]_range;pcm_fftMag_mfcc_sma_de[6]_maxPos;pcm_fftMag_mfcc_sma_de[6]_minPos;pcm_fftMag_mfcc_sma_de[6]_amean;pcm_fftMag_mfcc_sma_de[6]_linregc1;pcm_fftMag_mfcc_sma_de[6]_linregc2;pcm_fftMag_mfcc_sma_de[6]_linregerrQ;pcm_fftMag_mfcc_sma_de[6]_stddev;pcm_fftMag_mfcc_sma_de[6]_skewness;pcm_fftMag_mfcc_sma_de[6]_kurtosis;pcm_fftMag_mfcc_sma_de[7]_max;pcm_fftMag_mfcc_sma_de[7]_min;pcm_fftMag_mfcc_sma_de[7]_range;pcm_fftMag_mfcc_sma_de[7]_maxPos;pcm_fftMag_mfcc_sma_de[7]_minPos;pcm_fftMag_mfcc_sma_de[7]_amean;pcm_fftMag_mfcc_sma_de[7]_linregc1;pcm_fftMag_mfcc_sma_de[7]_linregc2;pcm_fftMag_mfcc_sma_de[7]_linregerrQ;pcm_fftMag_mfcc_sma_de[7]_stddev;pcm_fftMag_mfcc_sma_de[7]_skewness;pcm_fftMag_mfcc_sma_de[7]_kurtosis;pcm_fftMag_mfcc_sma_de[8]_max;pcm_fftMag_mfcc_sma_de[8]_min;pcm_fftMag_mfcc_sma_de[8]_range;pcm_fftMag_mfcc_sma_de[8]_maxPos;pcm_fftMag_mfcc_sma_de[8]_minPos;pcm_fftMag_mfcc_sma_de[8]_amean;pcm_fftMag_mfcc_sma_de[8]_linregc1;pcm_fftMag_mfcc_sma_de[8]_linregc2;pcm_fftMag_mfcc_sma_de[8]_linregerrQ;pcm_fftMag_mfcc_sma_de[8]_stddev;pcm_fftMag_mfcc_sma_de[8]_skewness;pcm_fftMag_mfcc_sma_de[8]_kurtosis;pcm_fftMag_mfcc_sma_de[9]_max;pcm_fftMag_mfcc_sma_de[9]_min;pcm_fftMag_mfcc_sma_de[9]_range;pcm_fftMag_mfcc_sma_de[9]_maxPos;pcm_fftMag_mfcc_sma_de[9]_minPos;pcm_fftMag_mfcc_sma_de[9]_amean;pcm_fftMag_mfcc_sma_de[9]_linregc1;pcm_fftMag_mfcc_sma_de[9]_linregc2;pcm_fftMag_mfcc_sma_de[9]_linregerrQ;pcm_fftMag_mfcc_sma_de[9]_stddev;pcm_fftMag_mfcc_sma_de[9]_skewness;pcm_fftMag_mfcc_sma_de[9]_kurtosis;pcm_fftMag_mfcc_sma_de[10]_max;pcm_fftMag_mfcc_sma_de[10]_min;pcm_fftMag_mfcc_sma_de[10]_range;pcm_fftMag_mfcc_sma_de[10]_maxPos;pcm_fftMag_mfcc_sma_de[10]_minPos;pcm_fftMag_mfcc_sma_de[10]_amean;pcm_fftMag_mfcc_sma_de[10]_linregc1;pcm_fftMag_mfcc_sma_de[10]_linregc2;pcm_fftMag_mfcc_sma_de[10]_linregerrQ;pcm_fftMag_mfcc_sma_de[10]_stddev;pcm_fftMag_mfcc_sma_de[10]_skewness;pcm_fftMag_mfcc_sma_de[10]_kurtosis;pcm_fftMag_mfcc_sma_de[11]_max;pcm_fftMag_mfcc_sma_de[11]_min;pcm_fftMag_mfcc_sma_de[11]_range;pcm_fftMag_mfcc_sma_de[11]_maxPos;pcm_fftMag_mfcc_sma_de[11]_minPos;pcm_fftMag_mfcc_sma_de[11]_amean;pcm_fftMag_mfcc_sma_de[11]_linregc1;pcm_fftMag_mfcc_sma_de[11]_linregc2;pcm_fftMag_mfcc_sma_de[11]_linregerrQ;pcm_fftMag_mfcc_sma_de[11]_stddev;pcm_fftMag_mfcc_sma_de[11]_skewness;pcm_fftMag_mfcc_sma_de[11]_kurtosis;pcm_fftMag_mfcc_sma_de[12]_max;pcm_fftMag_mfcc_sma_de[12]_min;pcm_fftMag_mfcc_sma_de[12]_range;pcm_fftMag_mfcc_sma_de[12]_maxPos;pcm_fftMag_mfcc_sma_de[12]_minPos;pcm_fftMag_mfcc_sma_de[12]_amean;pcm_fftMag_mfcc_sma_de[12]_linregc1;pcm_fftMag_mfcc_sma_de[12]_linregc2;pcm_fftMag_mfcc_sma_de[12]_linregerrQ;pcm_fftMag_mfcc_sma_de[12]_stddev;pcm_fftMag_mfcc_sma_de[12]_skewness;pcm_fftMag_mfcc_sma_de[12]_kurtosis;pcm_zcr_sma_de_max;pcm_zcr_sma_de_min;pcm_zcr_sma_de_range;pcm_zcr_sma_de_maxPos;pcm_zcr_sma_de_minPos;pcm
_zcr_sma_de_amean;pcm_zcr_sma_de_linregc1;pcm_zcr_sma_de_linregc2;pcm_zcr_sma_de_linregerrQ;pcm_zcr_sma_de_stddev;pcm_zcr_sma_de_skewness;pcm_zcr_sma_de_kurtosis;voiceProb_sma_de_max;voiceProb_sma_de_min;voiceProb_sma_de_range;voiceProb_sma_de_maxPos;voiceProb_sma_de_minPos;voiceProb_sma_de_amean;voiceProb_sma_de_linregc1;voiceProb_sma_de_linregc2;voiceProb_sma_de_linregerrQ;voiceProb_sma_de_stddev;voiceProb_sma_de_skewness;voiceProb_sma_de_kurtosis;F0_sma_de_max;F0_sma_de_min;F0_sma_de_range;F0_sma_de_maxPos;F0_sma_de_minPos;F0_sma_de_amean;F0_sma_de_linregc1;F0_sma_de_linregc2;F0_sma_de_linregerrQ;F0_sma_de_stddev;F0_sma_de_skewness;F0_sma_de_kurtosis 2 | '0013_000701';0.000000;3.369850e-02;1.735465e-05;3.368114e-02;128;1;7.375638e-03;9.669271e-07;7.274111e-03;6.286126e-05;7.928728e-03;1.231967e+00;3.903731e+00;-4.289949e-01;-2.965781e+01;2.922882e+01;86;119;-1.156513e+01;-9.081191e-03;-1.061160e+01;4.861806e+01;6.994570e+00;-4.087515e-01;2.463248e+00;1.203464e+01;-2.046483e+01;3.249947e+01;118;128;-4.140470e+00;1.807947e-02;-6.038815e+00;3.540856e+01;6.051548e+00;-1.875589e-01;3.351907e+00;1.955154e+01;-1.310266e+01;3.265420e+01;175;65;1.407256e+00;4.926669e-02;-3.765746e+00;5.113353e+01;7.754900e+00;4.629310e-01;2.395122e+00;1.685727e+01;-1.567376e+01;3.253102e+01;86;65;-1.302843e+00;3.730469e-03;-1.694542e+00;5.699306e+01;7.552794e+00;4.020537e-01;2.447639e+00;4.721776e+00;-3.279273e+01;3.751450e+01;194;83;-1.143948e+01;3.856772e-03;-1.184444e+01;9.055809e+01;9.519101e+00;-3.643253e-01;1.896232e+00;2.670732e+00;-2.272833e+01;2.539906e+01;16;132;-7.037078e+00;-6.061968e-03;-6.400571e+00;2.522136e+01;5.035643e+00;-5.853465e-01;2.856317e+00;2.703540e+00;-3.120908e+01;3.391262e+01;29;41;-1.052643e+01;-3.276756e-03;-1.018237e+01;4.950270e+01;7.038646e+00;-6.248007e-01;3.172013e+00;2.142991e+01;-3.905996e+01;6.048987e+01;103;160;-6.943793e+00;-2.189371e-02;-4.644953e+00;1.432078e+02;1.204102e+01;-5.093850e-01;3.094743e+00;8.956097e+00;-2.533958e+01;3.429568e+01;197;89;-5.922431e+00;1.756681e-04;-5.940876e+00;4.861909e+01;6.972747e+00;-3.561003e-01;2.635875e+00;1.521178e+01;-1.677612e+01;3.198790e+01;126;164;-1.061801e+00;-3.271530e-02;2.373305e+00;3.841283e+01;6.510269e+00;6.487774e-02;3.377592e+00;1.704618e+01;-2.677855e+01;4.382473e+01;108;163;-4.708352e+00;-2.559226e-02;-2.021164e+00;7.319276e+01;8.696130e+00;-1.842328e-01;3.034668e+00;9.610009e+00;-1.632896e+01;2.593897e+01;203;81;-4.018893e+00;1.434067e-02;-5.524663e+00;3.414647e+01;5.908422e+00;-5.604835e-02;2.275285e+00;7.458333e-01;1.833333e-02;7.275000e-01;118;92;1.568918e-01;-9.117536e-04;2.526259e-01;2.999283e-02;1.818707e-01;1.855961e+00;5.140184e+00;8.273448e-01;1.171804e-01;7.101644e-01;74;17;3.982598e-01;1.266134e-04;3.849654e-01;3.866918e-02;1.967960e-01;4.884707e-01;2.201182e+00;3.117144e+02;0;3.117144e+02;170;0;5.710489e+01;9.207127e-03;5.613814e+01;1.014255e+04;1.007118e+02;1.458309e+00;3.494218e+00;6.368165e-03;-5.066753e-03;1.143492e-02;115;132;6.224651e-07;-1.965598e-06;2.070103e-04;2.720752e-06;1.653809e-03;2.595058e-01;5.316820e+00;6.426627e+00;-6.035265e+00;1.246189e+01;122;115;3.675509e-02;-1.927535e-03;2.391463e-01;2.008664e+00;1.422128e+00;1.150463e-01;8.981321e+00;3.935276e+00;-6.386956e+00;1.032223e+01;116;123;1.665787e-02;1.901351e-04;-3.306316e-03;2.443964e+00;1.563361e+00;-5.182502e-01;4.893812e+00;3.989394e+00;-4.547331e+00;8.536725e+00;89;113;8.128021e-03;-1.953440e-03;2.132392e-01;2.024084e+00;1.427670e+00;-2.068333e-01;3.657849e+00;6.170218e+00;-4.438015e+00;1.060823e+01;43;54;-3.623639e-03;1.559478e-05;-
5.261091e-03;3.224394e+00;1.795660e+00;7.156055e-01;4.212295e+00;3.811246e+00;-4.006615e+00;7.817862e+00;176;72;1.258776e-03;2.742066e-03;-2.866582e-01;2.325907e+00;1.534210e+00;9.629573e-02;3.240988e+00;5.007465e+00;-3.484175e+00;8.491640e+00;133;129;-6.047554e-03;1.326532e-03;-1.453335e-01;1.947959e+00;1.398030e+00;2.221200e-01;3.902068e+00;4.570399e+00;-5.748763e+00;1.031916e+01;43;34;7.832383e-04;2.075908e-03;-2.171871e-01;2.939625e+00;1.719190e+00;-2.689770e-01;3.820803e+00;6.299998e+00;-7.817480e+00;1.411748e+01;99;42;-6.132154e-03;1.437077e-03;-1.570253e-01;6.091033e+00;2.469553e+00;-4.642484e-01;3.740222e+00;4.501269e+00;-3.964148e+00;8.465417e+00;110;85;7.161017e-03;1.012676e-03;-9.916998e-02;2.205007e+00;1.486207e+00;-2.858680e-03;3.461223e+00;3.874099e+00;-3.938681e+00;7.812780e+00;44;54;9.058981e-03;2.090432e-04;-1.289056e-02;2.445675e+00;1.563917e+00;-1.241545e-01;2.777863e+00;3.896590e+00;-3.824652e+00;7.721242e+00;141;54;1.654875e-02;1.875372e-03;-1.803653e-01;2.567747e+00;1.606485e+00;2.015208e-02;2.676925e+00;4.173211e+00;-3.225763e+00;7.398974e+00;141;122;-2.564223e-03;1.262240e-03;-1.350994e-01;2.077329e+00;1.443343e+00;1.666264e-01;2.809902e+00;1.574167e-01;-1.438333e-01;3.012500e-01;115;122;-2.622433e-03;6.552591e-05;-9.502653e-03;1.249018e-03;3.556610e-02;-3.552915e-01;1.012584e+01;9.802486e-02;-1.201950e-01;2.182199e-01;157;171;-2.525804e-04;-5.859234e-05;5.899616e-03;1.182782e-03;3.457628e-02;-8.069830e-02;4.849043e+00;8.888889e+01;-8.888889e+01;1.777778e+02;36;39;-2.542378e-09;-1.539215e-02;1.616176e+00;5.662868e+02;2.381524e+01;-6.053687e-01;7.081641e+00 3 | -------------------------------------------------------------------------------- /Relative Attributes/BayesClass_RelAtt.m: -------------------------------------------------------------------------------- 1 | % Bayesian Classification of the Relative Attributes 2 | % Created by Joe Ellis for Reproducible Codes Class 3 | % This function takes in the means and Covariances Matrices of each class 4 | % and then classifies the variables based on their values 5 | 6 | function accuracy = BayesClass_RelAtt(predicts,ground_truth,means,Covariances,used_for_training,unseen) 7 | 8 | % Variables 9 | % predicts = the values that need to be predicted and classified these are 10 | % the relative predictions 11 | % ground_truth = the real class_labels they are a 2668 vector; 12 | % means = 1x6x8 matrix of the covariances and the means 13 | % Covariances = 6x6x8 matrix fo the covariances 14 | 15 | % This is for tracking the accuracy of the set up 16 | correct = 0; 17 | total = 0; 18 | 19 | % Now do a for loop for each of the predicts variables 20 | for j = 1:length(predicts) 21 | % We don't want to use the variables that are used for training so 22 | % let's skip those in test 23 | if used_for_training(j) == 0 24 | 25 | %{ 26 | % This is for debug purposes 27 | if ismember(ground_truth(j),unseen) == 1 28 | disp('This is an unseen variable, and is of class'); 29 | disp(ground_truth(j)); 30 | end 31 | %} 32 | 33 | % For each of the categories find the guassian probability of the 34 | % each variable and each point 35 | best_prob = 0; 36 | for k = 1:size(means,3) 37 | 38 | % Add a bit of value to the Covariances to insure they are 39 | % positive definite 40 | Cov_ex = Covariances(:,:,k) + eye(size(Covariances,1)).*.00001; 41 | prob = mvnpdf(predicts(j,:),means(:,:,k),Cov_ex); 42 | 43 | % Debug Purposes 44 | % let's calc the distance from the prediction values of the 45 | % ranking to the predicted means of the values 46 | %{ 47 | 
distance = pdist([predicts(j,:);means(:,:,k)],'euclidean'); 48 | disp('This is the class: '); 49 | disp(k); 50 | disp('This is the distance: ') 51 | disp(distance); 52 | disp('The predicted values'); 53 | disp(predicts(j,:)); 54 | disp('The mean values of this variable'); 55 | disp(means(:,:,k)); 56 | %} 57 | 58 | if prob > best_prob 59 | best_prob = prob; 60 | app_label = k; 61 | end 62 | end 63 | 64 | % Now see if the label is the same as the ground truth label; 65 | if ground_truth(j) == app_label; 66 | correct = correct + 1; 67 | end 68 | 69 | % Add to the total numbers of predicts that are analyzed 70 | total = total + 1; 71 | end 72 | end 73 | 74 | accuracy = correct/total; 75 | -------------------------------------------------------------------------------- /Relative Attributes/BayesClass_RelAtt_unseen.m: -------------------------------------------------------------------------------- 1 | % Bayesian Classification of the Relative Attributes 2 | % Created by Joe Ellis for Reproducible Codes Class 3 | % This function takes in the means and Covariances Matrices of each class 4 | % and then classifies the variables based on their values 5 | 6 | function accuracy = BayesClass_RelAtt_unseen(predicts,ground_truth,means,Covariances,used_for_training,unseen) 7 | 8 | % Variables 9 | % predicts = the values that need to be predicted and classified these are 10 | % the relative predictions 11 | % ground_truth = the real class_labels they are a 2668 vector; 12 | % means = 1x6x8 matrix of the covariances and the means 13 | % Covariances = 6x6x8 matrix fo the covariances 14 | 15 | % This is for tracking the accuracy of the set up 16 | correct = 0; 17 | total = 0; 18 | 19 | % Now do a for loop for each of the predicts variables 20 | for j = 1:length(predicts) 21 | % We don't want to use the variables that are used for training so 22 | % let's skip those in test 23 | if used_for_training(j) == 0 && ismember(ground_truth(j),unseen) == 1 24 | 25 | %{ 26 | % This is for debug purposes 27 | if ismember(ground_truth(j),unseen) == 1 28 | disp('This is an unseen variable, and is of class'); 29 | disp(ground_truth(j)); 30 | end 31 | %} 32 | 33 | % For each of the categories find the guassian probability of the 34 | % each variable and each point 35 | best_prob = 0; 36 | for k = 1:size(means,3) 37 | 38 | % Add a bit of value to the Covariances to insure they are 39 | % positive definite 40 | Cov_ex = Covariances(:,:,k) + eye(size(Covariances,1)).*.00001; 41 | prob = mvnpdf(predicts(j,:),means(:,:,k),Cov_ex); 42 | 43 | % Debug Purposes 44 | % let's calc the distance from the prediction values of the 45 | % ranking to the predicted means of the values 46 | %{ 47 | distance = pdist([predicts(j,:);means(:,:,k)],'euclidean'); 48 | disp('This is the class: '); 49 | disp(k); 50 | disp('This is the distance: ') 51 | disp(distance); 52 | disp('The predicted values'); 53 | disp(predicts(j,:)); 54 | disp('The mean values of this variable'); 55 | disp(means(:,:,k)); 56 | %} 57 | 58 | if prob > best_prob 59 | best_prob = prob; 60 | app_label = k; 61 | end 62 | end 63 | 64 | % Now see if the label is the same as the ground truth label; 65 | if ground_truth(j) == app_label; 66 | correct = correct + 1; 67 | end 68 | 69 | % Add to the total numbers of predicts that are analyzed 70 | total = total + 1; 71 | end 72 | end 73 | 74 | accuracy = correct/total; 75 | -------------------------------------------------------------------------------- /Relative Attributes/Create_O_and_S_Mats_2D.m: 
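The function below, Create_O_and_S_Mats_2D.m, builds the ordered-pair matrix O and the similar-pair matrix S that ranksvm_with_sim.m consumes. For orientation, a toy illustration of the expected format (invented data, not taken from the repository):

```Matlab
% Toy illustration only: 4 utterances, 1 attribute.
% Utterances 1-2 belong to the class ranked higher for this attribute,
% utterances 3-4 to the class ranked lower.
O = [ 1  0  -1   0 ;    % utterance 1 has more of the attribute than utterance 3
      0  1   0  -1 ];   % utterance 2 has more of the attribute than utterance 4
S = zeros(0, 4);        % no "similar" pairs in this toy setup
% Every row of O (and of S) holds exactly one +1 and one -1. ranksvm_with_sim
% then learns weights w such that X(i,:)*w > X(j,:)*w whenever O(row,i) = 1 and
% O(row,j) = -1, while rows of S push X(i,:)*w and X(j,:)*w to be close.
```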
-------------------------------------------------------------------------------- 1 | % This function takes in a matrix with attribute relations in numeric 2 | % categories, and the features extracted from the available test images, 3 | % class labels of all of the images, and the images that are used for training, 4 | % and outputs the O and S matrix used with rank_with_sim rank svm implementation for training 5 | % Created by Joe Ellis -- PhD Candidate Columbia University 6 | 7 | function [O,S] = Create_O_and_S_Mats(category_order,used_for_training,class_labels,num_classes,unseen,trainpics,att_combos) 8 | 9 | % INPUTS 10 | % category_order = the order of the relative attributes of each category. 11 | % used_for_training = A vector the length of the samples, a 1 denotes that 12 | % this sample should be used for training, and a 0 is a test image 13 | % class_labels = the class_labels of each sample 14 | % num_classes = the total number of classes 15 | % unseen = A vector containing the class labels that are unseen 16 | % trainpics = The number of pictures used for training 17 | % att_combos = The number of category pairs that will be used for training 18 | 19 | % OUTPUTS 20 | % O = The matrix that is used as input to the ranking function, this matrix 21 | % for each row in the matrix contains one 1 and -1 element used for 22 | % training. 23 | % S = The similarity matrix is the same as the O matrix, but contains 24 | % samples that have the same score for a given attribute 25 | 26 | % num_categories = 6; 27 | num_categories = 2; 28 | 29 | % This matrix holds the index of the training samples for each class 30 | train_by_class = zeros(num_classes,trainpics); 31 | 32 | % Set up the O and S Mats 33 | O = zeros((trainpics^2)*att_combos,length(class_labels),num_categories); 34 | S = zeros((trainpics^2)*att_combos,length(class_labels),num_categories); 35 | 36 | % Create the train_by_class matrix to create the o and s matrix for ranking 37 | % functions 38 | index = ones(1,num_classes); 39 | for j = 1:length(used_for_training) 40 | 41 | % pick out the images that are going to be used_for_training 42 | if used_for_training(j) == 1; 43 | switch class_labels(j) 44 | case 1 45 | train_by_class(1,index(1)) = j; 46 | index(1) = index(1) + 1; 47 | case 2 48 | train_by_class(2,index(2)) = j; 49 | index(2) = index(2) + 1; 50 | case 3 51 | train_by_class(3,index(3)) = j; 52 | index(3) = index(3) + 1; 53 | case 4 54 | train_by_class(4,index(4)) = j; 55 | index(4) = index(4) + 1; 56 | case 5 57 | train_by_class(5,index(5)) = j; 58 | index(5) = index(5) + 1; 59 | case 6 60 | train_by_class(6,index(6)) = j; 61 | index(6) = index(6) + 1; 62 | % case 7 63 | % train_by_class(7,index(7)) = j; 64 | % index(7) = index(7) + 1; 65 | % case 8 66 | % train_by_class(8,index(8)) = j; 67 | % index(8) = index(8) + 1; 68 | end 69 | end 70 | end 71 | 72 | % Now we have the train_by_class matrix which has the training images for 73 | % each seperate variable. 
Now we are going to write the code as to how we 74 | % are going to create the o matrix and s matrix 75 | 76 | % create the elements to index the o matrix and s matrix 77 | num_images = length(used_for_training); 78 | s_index = ones(1,num_categories); 79 | o_index = ones(1,num_categories); 80 | 81 | % Create the list of seen classes 82 | seen = []; 83 | seen_index = 1; 84 | for z = 1:num_classes 85 | if (ismember(z,unseen) == 0) 86 | seen(seen_index) = z; 87 | seen_index = seen_index + 1; 88 | end 89 | end 90 | 91 | % Now we need to get the mix of the 4 categories that should all be together 92 | % This section randomly assigns two seen categoires as the category pairs 93 | % for training. 94 | combo1 = floor(1+((rand(1,att_combos)).*length(seen))); 95 | combo2 = floor(1+((rand(1,att_combos)).*length(seen))); 96 | for z = 1:att_combos 97 | test_combos(z,1) = seen(combo1(z)); 98 | test_combos(z,2) = seen(combo2(z)); 99 | 100 | % We should not compare two categoires to each other, and this section 101 | % does not allow that to happen 102 | while test_combos(z,1) == test_combos(z,2) 103 | test_combos(z,2) = floor(1+((rand(1).*length(seen)))); 104 | end 105 | 106 | % We also don't want to choose the same combination twice. 107 | % This function will prevent that from happening by checking the 108 | % previous combos and making sure they are not the same as the current. 109 | r = 1; 110 | while r < z 111 | if ismember(0,(sort(test_combos(r,:)) == sort(test_combos(z,:)))) 112 | r = r + 1; 113 | else 114 | test_combos(z,1) = seen(floor(1+((rand(1).*length(seen))))); 115 | test_combos(z,2) = seen(floor(1+((rand(1).*length(seen))))); 116 | r = 1; 117 | 118 | % Make sure that we are not comparing the two values together. 119 | while test_combos(z,1) == test_combos(z,2) 120 | test_combos(z,2) = seen(floor(1+((rand(1).*length(seen))))); 121 | end 122 | end 123 | end 124 | 125 | 126 | 127 | end 128 | 129 | % Now loop through each attribute pairing that we have and generate the O 130 | % and S matrices. 
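% Note added for clarity (not original code): with the settings passed in from
% main.m (emotion_categories = 2, labeled_pairs = 1, num_unseen = 0), the random
% selection above can only produce the single category pair {1, 2}, so every run
% trains the ranker on comparisons between the two emotion classes.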
131 | 132 | % Now display which classes we are using for training 133 | disp('These are the category pairs for RankSVM training') 134 | disp(test_combos) 135 | for z = 1:size(test_combos,1) 136 | on_class = test_combos(z,1); 137 | compared_class = test_combos(z,2); 138 | 139 | % Do this for every attribute 140 | for l = 1:2 141 | % If the two relative comparisons are equal add this pairing to 142 | % the S matrix 143 | if category_order(l,on_class) == category_order(l,compared_class) 144 | % Now perform this for every training picture for each class 145 | for j = 1:trainpics 146 | for i = 1:trainpics 147 | S_row = zeros(1,num_images); 148 | S_row(train_by_class(on_class,j)) = 1; 149 | S_row(train_by_class(compared_class,i)) = -1; 150 | S(s_index(l),:,l) = S_row; 151 | s_index(l) = s_index(l) + 1; 152 | end 153 | end 154 | 155 | % If the relative comparison of the on_class is greater than 156 | % that of the compared class 157 | elseif category_order(l,on_class) > category_order(l,compared_class) 158 | % Now perform this for every training picture for each class 159 | for j = 1:trainpics 160 | for i = 1:trainpics 161 | O_row = zeros(1,num_images); 162 | O_row(train_by_class(on_class,j)) = 1; 163 | O_row(train_by_class(compared_class,i)) = -1; 164 | O(o_index(l),:,l) = O_row; 165 | o_index(l) = o_index(l) + 1; 166 | end 167 | end 168 | 169 | % If the relative comparison of the new class is greater than 170 | % that of the compared class 171 | elseif category_order(l,on_class) < category_order(l,compared_class) 172 | % Now perform this for every training picture for each class 173 | for j = 1:trainpics 174 | for i = 1:trainpics 175 | O_row = zeros(1,num_images); 176 | O_row(train_by_class(on_class,j)) = -1; 177 | O_row(train_by_class(compared_class,i)) = 1; 178 | O(o_index(l),:,l) = O_row; 179 | o_index(l) = o_index(l) + 1; 180 | end 181 | end 182 | 183 | end 184 | end 185 | end 186 | 187 | end -------------------------------------------------------------------------------- /Relative Attributes/GetTrainingSample_per_category.m: -------------------------------------------------------------------------------- 1 | % Seperate the examples that are used for training in comparison to the 2 | % seen and unseen variables 3 | % Created by Joe Ellis for the Reproducible Codes Class 4 | function Train_samples = GetTrainingSample_per_category(predicts,class_labels,used_for_training) 5 | 6 | % Variables 7 | % predicts = the values for each image that have been predicted using the 8 | % ranking algorithm devised 9 | % class_labels = the ground truth label of each class 10 | % used_for_training = If this image should be used in the training of the 11 | % model 12 | 13 | % Train Samples is a 3-D matrix of the training variables that will be used 14 | % to train the gaussian distributions of the material for what we are 15 | % doing. 
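% Note added for clarity (not original code): the preallocation below assumes at
% most 30 training samples per class, as in the original Relative Attributes
% setup, while main.m uses trainpics = 300. MATLAB grows the array automatically
% on assignment, so the code still works, but the 30 could be raised to match.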
16 | 17 | Train_samples = zeros(30,size(predicts,2),8); 18 | index = ones(1,8); 19 | 20 | % Set up the matrices for training 21 | for j = 1:length(predicts); 22 | if used_for_training(j) == 1; 23 | switch class_labels(j) 24 | case 1 25 | Train_samples(index(1),:,1) = predicts(j,:); 26 | index(1) = index(1) + 1; 27 | case 2 28 | Train_samples(index(2),:,2) = predicts(j,:); 29 | index(2) = index(2) + 1; 30 | case 3 31 | Train_samples(index(3),:,3) = predicts(j,:); 32 | index(3) = index(3) + 1; 33 | case 4 34 | Train_samples(index(4),:,4) = predicts(j,:); 35 | index(4) = index(4) + 1; 36 | case 5 37 | Train_samples(index(5),:,5) = predicts(j,:); 38 | index(5) = index(5) + 1; 39 | case 6 40 | Train_samples(index(6),:,6) = predicts(j,:); 41 | index(6) = index(6) + 1; 42 | case 7 43 | Train_samples(index(7),:,7) = predicts(j,:); 44 | index(7) = index(7) + 1; 45 | case 8 46 | Train_samples(index(8),:,8) = predicts(j,:); 47 | index(8) = index(8) + 1; 48 | end 49 | end 50 | end 51 | end 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /Relative Attributes/classlabel.csv: -------------------------------------------------------------------------------- 1 | 2 2 | 2 3 | 2 4 | 2 5 | 2 6 | 2 7 | 2 8 | 2 9 | 2 10 | 2 11 | 2 12 | 2 13 | 2 14 | 2 15 | 2 16 | 2 17 | 2 18 | 2 19 | 2 20 | 2 21 | 2 22 | 2 23 | 2 24 | 2 25 | 2 26 | 2 27 | 2 28 | 2 29 | 2 30 | 2 31 | 2 32 | 2 33 | 2 34 | 2 35 | 2 36 | 2 37 | 2 38 | 2 39 | 2 40 | 2 41 | 2 42 | 2 43 | 2 44 | 2 45 | 2 46 | 2 47 | 2 48 | 2 49 | 2 50 | 2 51 | 2 52 | 2 53 | 2 54 | 2 55 | 2 56 | 2 57 | 2 58 | 2 59 | 2 60 | 2 61 | 2 62 | 2 63 | 2 64 | 2 65 | 2 66 | 2 67 | 2 68 | 2 69 | 2 70 | 2 71 | 2 72 | 2 73 | 2 74 | 2 75 | 2 76 | 2 77 | 2 78 | 2 79 | 2 80 | 2 81 | 2 82 | 2 83 | 2 84 | 2 85 | 2 86 | 2 87 | 2 88 | 2 89 | 2 90 | 2 91 | 2 92 | 2 93 | 2 94 | 2 95 | 2 96 | 2 97 | 2 98 | 2 99 | 2 100 | 2 101 | 2 102 | 2 103 | 2 104 | 2 105 | 2 106 | 2 107 | 2 108 | 2 109 | 2 110 | 2 111 | 2 112 | 2 113 | 2 114 | 2 115 | 2 116 | 2 117 | 2 118 | 2 119 | 2 120 | 2 121 | 2 122 | 2 123 | 2 124 | 2 125 | 2 126 | 2 127 | 2 128 | 2 129 | 2 130 | 2 131 | 2 132 | 2 133 | 2 134 | 2 135 | 2 136 | 2 137 | 2 138 | 2 139 | 2 140 | 2 141 | 2 142 | 2 143 | 2 144 | 2 145 | 2 146 | 2 147 | 2 148 | 2 149 | 2 150 | 2 151 | 2 152 | 2 153 | 2 154 | 2 155 | 2 156 | 2 157 | 2 158 | 2 159 | 2 160 | 2 161 | 2 162 | 2 163 | 2 164 | 2 165 | 2 166 | 2 167 | 2 168 | 2 169 | 2 170 | 2 171 | 2 172 | 2 173 | 2 174 | 2 175 | 2 176 | 2 177 | 2 178 | 2 179 | 2 180 | 2 181 | 2 182 | 2 183 | 2 184 | 2 185 | 2 186 | 2 187 | 2 188 | 2 189 | 2 190 | 2 191 | 2 192 | 2 193 | 2 194 | 2 195 | 2 196 | 2 197 | 2 198 | 2 199 | 2 200 | 2 201 | 2 202 | 2 203 | 2 204 | 2 205 | 2 206 | 2 207 | 2 208 | 2 209 | 2 210 | 2 211 | 2 212 | 2 213 | 2 214 | 2 215 | 2 216 | 2 217 | 2 218 | 2 219 | 2 220 | 2 221 | 2 222 | 2 223 | 2 224 | 2 225 | 2 226 | 2 227 | 2 228 | 2 229 | 2 230 | 2 231 | 2 232 | 2 233 | 2 234 | 2 235 | 2 236 | 2 237 | 2 238 | 2 239 | 2 240 | 2 241 | 2 242 | 2 243 | 2 244 | 2 245 | 2 246 | 2 247 | 2 248 | 2 249 | 2 250 | 2 251 | 2 252 | 2 253 | 2 254 | 2 255 | 2 256 | 2 257 | 2 258 | 2 259 | 2 260 | 2 261 | 2 262 | 2 263 | 2 264 | 2 265 | 2 266 | 2 267 | 2 268 | 2 269 | 2 270 | 2 271 | 2 272 | 2 273 | 2 274 | 2 275 | 2 276 | 2 277 | 2 278 | 2 279 | 2 280 | 2 281 | 2 282 | 2 283 | 2 284 | 2 285 | 2 286 | 2 287 | 2 288 | 2 289 | 2 290 | 2 291 | 2 292 | 2 293 | 2 294 | 2 295 | 2 296 | 2 297 | 2 298 | 2 299 | 2 300 | 2 301 | 2 302 | 2 303 | 2 304 | 2 305 | 2 306 | 
2 307 | 2 308 | 2 309 | 2 310 | 2 311 | 2 312 | 2 313 | 2 314 | 2 315 | 2 316 | 2 317 | 2 318 | 2 319 | 2 320 | 2 321 | 2 322 | 2 323 | 2 324 | 2 325 | 2 326 | 2 327 | 2 328 | 2 329 | 2 330 | 2 331 | 2 332 | 2 333 | 2 334 | 2 335 | 2 336 | 2 337 | 2 338 | 2 339 | 2 340 | 2 341 | 2 342 | 2 343 | 2 344 | 2 345 | 2 346 | 2 347 | 2 348 | 2 349 | 2 350 | 2 351 | 1 352 | 1 353 | 1 354 | 1 355 | 1 356 | 1 357 | 1 358 | 1 359 | 1 360 | 1 361 | 1 362 | 1 363 | 1 364 | 1 365 | 1 366 | 1 367 | 1 368 | 1 369 | 1 370 | 1 371 | 1 372 | 1 373 | 1 374 | 1 375 | 1 376 | 1 377 | 1 378 | 1 379 | 1 380 | 1 381 | 1 382 | 1 383 | 1 384 | 1 385 | 1 386 | 1 387 | 1 388 | 1 389 | 1 390 | 1 391 | 1 392 | 1 393 | 1 394 | 1 395 | 1 396 | 1 397 | 1 398 | 1 399 | 1 400 | 1 401 | 1 402 | 1 403 | 1 404 | 1 405 | 1 406 | 1 407 | 1 408 | 1 409 | 1 410 | 1 411 | 1 412 | 1 413 | 1 414 | 1 415 | 1 416 | 1 417 | 1 418 | 1 419 | 1 420 | 1 421 | 1 422 | 1 423 | 1 424 | 1 425 | 1 426 | 1 427 | 1 428 | 1 429 | 1 430 | 1 431 | 1 432 | 1 433 | 1 434 | 1 435 | 1 436 | 1 437 | 1 438 | 1 439 | 1 440 | 1 441 | 1 442 | 1 443 | 1 444 | 1 445 | 1 446 | 1 447 | 1 448 | 1 449 | 1 450 | 1 451 | 1 452 | 1 453 | 1 454 | 1 455 | 1 456 | 1 457 | 1 458 | 1 459 | 1 460 | 1 461 | 1 462 | 1 463 | 1 464 | 1 465 | 1 466 | 1 467 | 1 468 | 1 469 | 1 470 | 1 471 | 1 472 | 1 473 | 1 474 | 1 475 | 1 476 | 1 477 | 1 478 | 1 479 | 1 480 | 1 481 | 1 482 | 1 483 | 1 484 | 1 485 | 1 486 | 1 487 | 1 488 | 1 489 | 1 490 | 1 491 | 1 492 | 1 493 | 1 494 | 1 495 | 1 496 | 1 497 | 1 498 | 1 499 | 1 500 | 1 501 | 1 502 | 1 503 | 1 504 | 1 505 | 1 506 | 1 507 | 1 508 | 1 509 | 1 510 | 1 511 | 1 512 | 1 513 | 1 514 | 1 515 | 1 516 | 1 517 | 1 518 | 1 519 | 1 520 | 1 521 | 1 522 | 1 523 | 1 524 | 1 525 | 1 526 | 1 527 | 1 528 | 1 529 | 1 530 | 1 531 | 1 532 | 1 533 | 1 534 | 1 535 | 1 536 | 1 537 | 1 538 | 1 539 | 1 540 | 1 541 | 1 542 | 1 543 | 1 544 | 1 545 | 1 546 | 1 547 | 1 548 | 1 549 | 1 550 | 1 551 | 1 552 | 1 553 | 1 554 | 1 555 | 1 556 | 1 557 | 1 558 | 1 559 | 1 560 | 1 561 | 1 562 | 1 563 | 1 564 | 1 565 | 1 566 | 1 567 | 1 568 | 1 569 | 1 570 | 1 571 | 1 572 | 1 573 | 1 574 | 1 575 | 1 576 | 1 577 | 1 578 | 1 579 | 1 580 | 1 581 | 1 582 | 1 583 | 1 584 | 1 585 | 1 586 | 1 587 | 1 588 | 1 589 | 1 590 | 1 591 | 1 592 | 1 593 | 1 594 | 1 595 | 1 596 | 1 597 | 1 598 | 1 599 | 1 600 | 1 601 | 1 602 | 1 603 | 1 604 | 1 605 | 1 606 | 1 607 | 1 608 | 1 609 | 1 610 | 1 611 | 1 612 | 1 613 | 1 614 | 1 615 | 1 616 | 1 617 | 1 618 | 1 619 | 1 620 | 1 621 | 1 622 | 1 623 | 1 624 | 1 625 | 1 626 | 1 627 | 1 628 | 1 629 | 1 630 | 1 631 | 1 632 | 1 633 | 1 634 | 1 635 | 1 636 | 1 637 | 1 638 | 1 639 | 1 640 | 1 641 | 1 642 | 1 643 | 1 644 | 1 645 | 1 646 | 1 647 | 1 648 | 1 649 | 1 650 | 1 651 | 1 652 | 1 653 | 1 654 | 1 655 | 1 656 | 1 657 | 1 658 | 1 659 | 1 660 | 1 661 | 1 662 | 1 663 | 1 664 | 1 665 | 1 666 | 1 667 | 1 668 | 1 669 | 1 670 | 1 671 | 1 672 | 1 673 | 1 674 | 1 675 | 1 676 | 1 677 | 1 678 | 1 679 | 1 680 | 1 681 | 1 682 | 1 683 | 1 684 | 1 685 | 1 686 | 1 687 | 1 688 | 1 689 | 1 690 | 1 691 | 1 692 | 1 693 | 1 694 | 1 695 | 1 696 | 1 697 | 1 698 | 1 699 | 1 700 | 1 701 | -------------------------------------------------------------------------------- /Relative Attributes/main.m: -------------------------------------------------------------------------------- 1 | % Script created to create the graphs that we want to create for the osr 2 | % dataset with the Relative attributes method 3 | 4 | % Train the ranking function should be right here 5 | % This portion 
of the code needs to have some ground truth data labeled and 6 | % the relative similarities finished 7 | 8 | % Clear all the data before running the script 9 | clear all; 10 | 11 | 12 | % These are the num of unseen classes and training images per class 13 | num_unseen = 0; 14 | trainpics = 300; %need to change to 300 Kun 15 | num_iter = 10; 16 | held_out_attributes = 0; 17 | emotion_categories = 2; 18 | num_attributes = 2; 19 | labeled_pairs = 1; 20 | looseness_constraint = 1; 21 | % This is the number of iterations we want to do 22 | accuracy = zeros(1,num_iter); 23 | 24 | %spk_list = {'0011','0012','0013','0014','0015','0016','0017','0018','0019','0020'} %[0011--0020] 25 | emo_list = {'Neutral', 'Happy', 'Angry', 'Sad'} 26 | spk_list = {'0013'} %[0011--0020] 27 | %emo_list = {'Happy', 'Angry', 'Sad'} 28 | for spk_num = 1:size(spk_list,2) 29 | for emo_num = 1:size(emo_list,2) 30 | spk_tag = spk_list(spk_num) 31 | emo_tag = emo_list(emo_num) 32 | spk_tag_str = string(spk_tag(1)) %0011 33 | emo_tag_str = string(emo_tag(1)) %Happy 34 | % output file (score) 35 | score_path = strcat(spk_tag_str, '_Surprise_' , emo_tag_str , '_Score.csv') %"0011_Neutral_Happy_Score.csv" 36 | fopen(score_path,'wt') 37 | % input file (feature extracted by OpenSmile) 38 | 39 | %osr_gist_Mat = csvread(strcat(spk_tag_str, '/', spk_tag_str, '_Neutral_', emo_tag_str, '.csv'),1,2); %"0011/0011_Neutral_Happy.csv" 40 | 41 | osr_gist_Mat = csvread(strcat(spk_tag_str, '_Surprise_', emo_tag_str, '.csv'),1,2); %"0011_Neutral_Happy.csv" %700x384 42 | 43 | % Debug 44 | % osr_gist_Mat = csvread('0010_Neutral_Angry.csv',1,2); 45 | 46 | 47 | used_for_training = csvread('used_for_training_kun.csv'); %debug Kun 48 | % class_names = {'Angry','Neutral'}; 49 | class_labels = csvread('classlabel.csv'); 50 | relative_ordering = [2 1; 3 1]; 51 | category_order = relative_ordering; 52 | 53 | osr_gist_Mat_normal = mapminmax(osr_gist_Mat',0,1); %normalization 54 | osr_gist_Mat = osr_gist_Mat_normal'; %384x700 55 | 56 | 57 | 58 | for iter = 1:num_iter 59 | 60 | % Create a random list of unseen images 61 | unseen = randperm(emotion_categories,num_unseen); 62 | [O,S] = Create_O_and_S_Mats_2D(category_order,used_for_training,class_labels,emotion_categories,unseen,trainpics,labeled_pairs); 63 | % Now we need to train the ranking function, but we have some values in the 64 | % matrices that will not correspond to the anything becuase some attributes 65 | % will have more nodes with similarity. 66 | 67 | weights = zeros(384,num_attributes); %384x2 68 | for l = 1:num_attributes 69 | 70 | % Find where each O and S matrix stops having values for each category 71 | % matrix section 72 | 73 | % Find when the O matrix for this dimension no longer has real values 74 | 75 | for j = 1:size(O,1) 76 | O_length = j; 77 | if ismember(1,O(j,:,l)) == 0; 78 | break; 79 | end 80 | end 81 | 82 | % Find when the S matrix for this dimension no longer has real values. 
83 | for j = 1:size(S,1) 84 | S_length = j; 85 | if ismember(1,S(j,:,l)) == 0; 86 | break; 87 | end 88 | end 89 | 90 | % Now set up the cost matrices both are initialized to 0.1 in the 91 | % Relative Attributes paper from 2011; 92 | Costs_for_O = .1*ones(O_length,1); 93 | Costs_for_S = .1*ones(S_length,1); 94 | 95 | if O_length > 1 96 | w = ranksvm_with_sim(osr_gist_Mat,O(1:O_length-1,:,l),S(1:S_length,:,l),Costs_for_O,Costs_for_S); 97 | %w = testrank(osr_gist_Mat,O(1:O_length-1,:,l),S(1:S_length,:,l),Costs_for_O,Costs_for_S); 98 | weights(:,l) = w*2; 99 | else 100 | % exit 101 | % Re-Do the ranking and start over, because we chose category pairs 102 | % that did not have the O matrix for a given attribute. 103 | 104 | % This function creates the O and S matrix used in the ranking algorithm 105 | [O,S] = Create_O_and_S_Mats(category_order,used_for_training,class_labels,3,unseen,trainpics,labeled_pairs); 106 | 107 | % initialize the weights matrix that will be learned for ranking 108 | % weights = zeros(384,6); 109 | weights = zeros(384,num_attributes); 110 | 111 | % re-do the creation of the O and S matrix 112 | l = 1; 113 | disp('We had to redo the O and S matrix ranking, Pairs chosen were all similar for an attribute'); 114 | end 115 | end 116 | 117 | % here we want to choose to take out some of the weights for each 118 | % attribute and also the category order 119 | if held_out_attributes ~= 0 120 | rand_atts = randperm(6,6-held_out_attributes); 121 | for j = 1:length(rand_atts); 122 | new_weights(:,j) = weights(:,rand_atts(j)); 123 | new_cat_order(j,:) = category_order(rand_atts(j),:); 124 | new_relative_att_predictor(:,j) = relative_att_predictor(:,rand_atts(j)); 125 | end 126 | else 127 | new_cat_order = category_order; 128 | new_weights = weights; 129 | % new_relative_att_predictor = relative_att_predictor; 130 | end 131 | 132 | 133 | % Get the predictions based on the outputs from rank svm 134 | % Use there trained data 135 | % relative_att_predictions = feat*new_relative_att_predictor; 136 | % Use my trained data 137 | relative_att_predictions = osr_gist_Mat*new_weights; 138 | 139 | % Seperate the training samples from the other training samples 140 | Train_samples = GetTrainingSample_per_category(relative_att_predictions,class_labels,used_for_training); 141 | 142 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 143 | % Debug %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 144 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 145 | % Calculate the means and covariances from the samples 146 | [means, Covariances] = meanandvar_forcat(Train_samples,[],new_cat_order,emotion_categories,looseness_constraint); 147 | 148 | % This is for debug to find the problem with the unseen scategories 149 | means_unseen = meanandvar_forcat(Train_samples,unseen,new_cat_order,emotion_categories,looseness_constraint); 150 | 151 | % This section will find the difference between the values of the means 152 | disp('The unseen values are') 153 | unseen 154 | disp('Actual Means'); 155 | means 156 | disp('Difference between the means'); 157 | disp(means_unseen - means); 158 | 159 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 160 | 161 | % Classify the predicted features from the system 162 | accuracy(iter) = BayesClass_RelAtt(relative_att_predictions,class_labels,means_unseen,Covariances,used_for_training,unseen); 163 | disp('unseen accuracy for means found'); 164 | disp(accuracy(iter)) 165 | 166 | other_acc = 
BayesClass_RelAtt_unseen(relative_att_predictions,class_labels,means_unseen,Covariances,used_for_training,unseen); 167 | disp('unseen accuracy for derived means') 168 | disp(other_acc); 169 | disp('The relative ordering of the attributes for each image'); 170 | % category_order 171 | end 172 | 173 | total_acc = mean(accuracy); 174 | 175 | relative_att_predictions_norm = normalize(relative_att_predictions(:,1),'range') 176 | csvwrite(score_path, relative_att_predictions_norm) % (350:700) 177 | disp('The accuracy of this calculation: '); 178 | disp(total_acc); 179 | disp('------------- ok -------------- '); 180 | end 181 | end 182 | -------------------------------------------------------------------------------- /Relative Attributes/meanandvar_forcat.m: -------------------------------------------------------------------------------- 1 | % Generate mean and covariance matrix for each categories relative scores. 2 | % Created by Joe Ellis for the Reproduction Code Class 3 | % Reproducing Relative Attributes 4 | 5 | function [means, Covariances] = meanandvar_forcat(Training_Samples,unseen,category_order,num_classes, looseness_constraint) 6 | 7 | % The looseness constraint should be the looseless-1 8 | looseness_constraint = looseness_constraint - 1; 9 | 10 | % variables 11 | % means = 2-d matrix each row is a mean of the labels should be 8x6 rows 12 | % Covariances = 3-d matrix. Should be 6x6x8 to finish this work. 13 | 14 | % means of the set ups 15 | % Create the list of seen categories 16 | seen = []; 17 | seen_index = 1; 18 | for z = 1:num_classes 19 | if ismember(z,unseen) == 0 20 | seen(seen_index) = z; 21 | seen_index = seen_index + 1; 22 | end 23 | end 24 | 25 | % now we have the seen categories, and we want to find the mean and 26 | % covariance of each of these values. 
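% Note added for clarity (not original code): each seen class c is modeled as a
% Gaussian over its relative-attribute scores,
%   means(1,:,c)       = mean(Training_Samples(:,:,c));   % 1 x num_attributes
%   Covariances(:,:,c) = cov(Training_Samples(:,:,c));    % num_attributes x num_attributes
% BayesClass_RelAtt.m later assigns a test score vector to the class with the
% highest mvnpdf value, adding a small ridge (1e-5 * eye) there to keep each
% covariance positive definite.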
27 | 28 | % set up the means and covariances that we want to find 29 | means = zeros(1,size(Training_Samples,2),size(Training_Samples,3)); 30 | Covariances = zeros(size(Training_Samples,2),size(Training_Samples,2),size(Training_Samples,3)); 31 | 32 | for k = 1:length(seen) 33 | 34 | % Get the seen variable index 35 | class = seen(k); 36 | 37 | % Find the means of the seen a 38 | means(:,:,class) = mean(Training_Samples(:,:,class)); 39 | 40 | % for loop to iterate over the sections of the training_samples 41 | Covariances(:,:,class) = cov(Training_Samples(:,:,class)); 42 | end 43 | 44 | % Now we need to find the average covariance mat for all the seen samples 45 | AVG_COV = sum(Covariances,3)/length(seen); 46 | 47 | 48 | % Now we need to set up the mean and covariance for all of the unseen 49 | % variables 50 | 51 | % Now we have to find the average distance between the means 52 | 53 | dm = zeros(1,size(category_order,1)); 54 | 55 | for j = 1:size(category_order,1) 56 | % This section finds the means and sorts the average distance between 57 | % the neightbors 58 | sorted_means = sort(nonzeros(means(1,j,:))); 59 | diff = 0; 60 | for z = 1:length(sorted_means)-1 61 | diff = diff + abs(sorted_means(z)-sorted_means(z+1)); 62 | end 63 | dm(j) = diff/(length(seen)-1); 64 | end 65 | 66 | disp('The differences between the elements for each attribute'); 67 | dm 68 | 69 | % We need to create a category ordering of only the categories available 70 | % not the unseen categories 71 | for j = 1:length(seen) 72 | there = seen(j); 73 | new_category_order(:,j) = category_order(:,there); 74 | end 75 | 76 | 77 | for k = 1:length(unseen) 78 | % This is the unseen class 79 | class = unseen(k); 80 | 81 | % now we have to go through every attribute within this section of the 82 | % code to figure out 83 | for j = 1:size(new_category_order,1) 84 | attr_rank = category_order(j,class); 85 | 86 | % Now get the max and min of that particular ranking 87 | [max_rank max_idx] = max(new_category_order(j,:)); 88 | [min_rank min_idx] = min(new_category_order(j,:)); 89 | 90 | %if ismember(attr_rank,new_category_order(j,:)) == 1; 91 | % vect = (attr_rank == new_category_order(j,:)); 92 | % idx = find(vect); 93 | % means(1,j,class) = means(1,j,seen(idx(1))); 94 | %disp(means(1,j,class)); 95 | 96 | 97 | if attr_rank > max_rank; 98 | % Do some stuff 99 | max_mean = means(1,j,seen(max_idx(1))); 100 | means(1,j,class) = max_mean + dm(j); 101 | %disp(means(1,j,class)); 102 | 103 | elseif attr_rank == max_rank 104 | % Do some stuff 105 | new_rank = attr_rank - 1; 106 | idx = find(new_category_order(j,:) == new_rank); 107 | if isempty(idx) == 0 108 | one_less_mean = means(1,j,seen(idx(1))); 109 | means(1,j,class) = one_less_mean + dm(j); 110 | else 111 | max_mean = means(1,j,seen(max_idx(1))); 112 | means(1,j,class) = max_mean; 113 | end 114 | 115 | 116 | elseif attr_rank < min_rank 117 | % Do some stuff 118 | % Now we have to find the average distances between the means 119 | min_mean = means(1,j,seen(min_idx(1))); 120 | means(1,j,class) = min_mean - dm(j); 121 | %disp(means(1,j,class)); 122 | 123 | elseif attr_rank == min_rank 124 | % Do some stuff 125 | % Now we have to find the average distances between the means 126 | new_rank = attr_rank + 1; 127 | idx = find(new_category_order(j,:) == new_rank); 128 | if isempty(idx) == 0 129 | one_more_mean = means(1,j,seen(idx(1))); 130 | means(1,j,class) = one_more_mean - dm(j); 131 | else 132 | min_mean = means(1,j,seen(min_idx(1))); 133 | means(1,j,class) = min_mean; 134 | end 135 | 136 
| else 137 | % Find the index of the elements one above and below those 138 | % elements 139 | row_vec = new_category_order(j,:); 140 | min_cand = row_vec < attr_rank - looseness_constraint; 141 | value = 0; 142 | min_use_index = 1; 143 | for a = 1:length(min_cand) 144 | if min_cand(a) == 1 145 | if row_vec(a) > value; 146 | min_use_index = a; 147 | value = row_vec(a); 148 | end 149 | end 150 | end 151 | lower_u = means(1,j,seen(min_use_index)); 152 | 153 | % Here we have the values for the max used 154 | max_cand = row_vec > attr_rank + looseness_constraint; 155 | value = 9; 156 | max_use_index = 1; 157 | for a = 1:length(max_cand) 158 | if max_cand(a) == 1 159 | if row_vec(a) < value; 160 | max_use_index = a; 161 | value = row_vec(a); 162 | end 163 | end 164 | end 165 | higher_u = means(1,j,seen(max_use_index)); 166 | 167 | % This solves for the mean for this class 168 | means(1,j,class) = (higher_u + lower_u)/2; 169 | %disp(means(1,j,class)); 170 | end 171 | 172 | % Give it the average covariance of all of the elements 173 | Covariances(:,:,class) = AVG_COV; 174 | end 175 | end 176 | 177 | 178 | % I need to see what is getting messed up here 179 | 180 | for k = 1:length(unseen) 181 | % Get the seen variable index 182 | class = unseen(k); 183 | % Find the means of the seen a 184 | %means(:,:,class) = mean(Training_Samples(:,:,class)); 185 | truemean = mean(Training_Samples(:,:,class)); 186 | % for loop to iterate over the sections of the training_samples 187 | %Covariances(:,:,class) = cov(Training_Samples(:,:,class)); 188 | 189 | end 190 | % matlab is returning non positive definite matrices for me. Therefore, I 191 | % need to add a bit to the diagonal. 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | -------------------------------------------------------------------------------- /Relative Attributes/pre-processing.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | root_path = '/Users/kun/Desktop/workspace/' # 4 | data_path = root_path +'esd_en/' # 待处理的音频路径 5 | 6 | count = 0 7 | 8 | opensmile_path = '/Users/kun/Desktop/workspace/open_smile/opensmile/build/progsrc/smilextract/SMILExtract' 9 | config_path = '/Users/kun/Desktop/workspace/open_smile/opensmile/config/is09-13/IS09_emotion.conf' 10 | 11 | error_list = [] 12 | 13 | for root, spk_dir, files in os.walk(data_path): 14 | # print('----') 15 | # print(root) 16 | print(spk_dir) 17 | # print(files) 18 | # exit() 19 | for spk in spk_dir: 20 | spk_out_dir = root_path + 'output/' + spk 21 | if not os.path.exists(spk_out_dir): 22 | os.mkdir(spk_out_dir) 23 | for root, emo_dir, files in os.walk(data_path + spk): 24 | print(emo_dir) 25 | for emo in emo_dir: 26 | for _, _, wavfiles in os.walk(data_path + spk + '/' + emo): 27 | for i in range(len(wavfiles)): 28 | #print(count) 29 | wavfile_path = data_path + spk + '/' + emo + '/' + wavfiles[i] 30 | print(wavfile_path) 31 | 32 | emo_out_dir = root_path + 'output/' + spk+ '/' + emo 33 | print(emo_out_dir) 34 | feature_path = emo_out_dir + '/' + wavfiles[i][:-4] + '.csv' 35 | print(feature_path) 36 | # path_remake(files[i]) 37 | try: 38 | if not os.path.exists(emo_out_dir): 39 | os.mkdir(emo_out_dir) 40 | os.system(opensmile_path + ' -C ' + config_path + ' -I ' + wavfile_path + ' -csvoutput ' + feature_path + ' -instname ' + feature_path[-15:-4]) 41 | except: 42 | error_list.append(wavfile_path) 43 | count += 1 44 | 
# exit() 45 | print('error num: ',count) # number of failed files; normally there should be none. 46 | print('error list: ', error_list) -------------------------------------------------------------------------------- /Relative Attributes/ranksvm_with_sim.m: -------------------------------------------------------------------------------- 1 | function w = ranksvm_with_sim(X_,OMat,SMat,Costs_for_O,Costs_for_S,w,opt) 2 | % W = RANKSVM(X,A,C,W,OPT) 3 | % Solves the Ranking SVM optimization problem in the primal (with quadratic 4 | % penalization of the training errors). 5 | % 6 | % X contains the training inputs and is an n x d matrix (n = number of points). 7 | % A is a sparse p x n matrix, where p is the number of preference pairs. 8 | % Each row of A should contain exactly one +1 and one -1 9 | % reflecting the indices of the points constituting the pairs. 10 | % C is a vector of training error penalizations (one for each preference pair). 11 | % 12 | % OPT is a structure containing the options (in brackets default values): 13 | % lin_cg: Find the Newton step, by linear conjugate gradients [0] 14 | % iter_max_Newton: Maximum number of Newton steps [20] 15 | % prec: Stopping criterion 16 | % cg_prec and cg_it: stopping criteria for the linear CG. 17 | 18 | % Copyright Olivier Chapelle, olivier.chapelle@tuebingen.mpg.de 19 | % Last modified 25/08/2006 20 | 21 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 22 | % This code was modified by Joe Ellis -- Columbia University DVMM Lab 23 | % https://sites.google.com/site/joeelliscolumbiauniversity/ 24 | % Code was modified to reproduce the results from the Relative Attributes 25 | % paper by Parikh and Grauman. The variables that are used here are described 26 | % below. 27 | % 28 | % Variables: 29 | % X contains the training inputs and is an n x d matrix (n = number of points). 30 | % OMat is the matrix that holds the ordered pairs. It is set up in the same way 31 | % that A is described above: a sparse p x n matrix, where p is the number 32 | % of preference pairs. Each row should contain exactly one +1 and 33 | % one -1 reflecting the indices of the points constituting the pair. 34 | % The +1 marks the sample that has more of the attribute, and the -1 35 | % marks the one that has less. 36 | % SMat is the matrix that holds the similar pairs. It is set up in the same way 37 | % that A is described above: a sparse p x n matrix, where p is the number 38 | % of preference pairs. Each row should contain exactly one +1 and 39 | % one -1 reflecting the indices of the points constituting the pair. 40 | % One member of each pair still has to carry the +1 and the other the -1 for 41 | % the implementation below to work.
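% A minimal illustrative example (toy data, for clarity): with n = 3 feature
% vectors where sample 1 is ranked above sample 2 and samples 2 and 3 are
% judged similar, the pair matrices could be built as
%   OMat = sparse(1, [1 2], [+1 -1], 1, 3);   % one ordered pair: 1 > 2
%   SMat = sparse(1, [2 3], [+1 -1], 1, 3);   % one similar pair: 2 ~ 3
% so that every row contains exactly one +1 and one -1 as described above.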
42 | % Costs_for_O this has the value of the cost for each preference pair 43 | % Costs_for_S this has the value of the cost for each preference pair 44 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 45 | 46 | 47 | oldO = OMat; 48 | oldS = SMat; 49 | 50 | if size(OMat,1) == 1 51 | OMat = []; 52 | end 53 | 54 | if size(SMat,1) == 1 55 | SMat = []; 56 | end 57 | 58 | global X A 59 | X = X_; A = OMat; S = SMat; % To avoid passing theses matrices as arguments to subfunctions 60 | 61 | if nargin < 7 % Assign the options to their default values 62 | opt = []; 63 | end; 64 | if ~isfield(opt,'lin_cg'), opt.lin_cg = 0; end; 65 | if ~isfield(opt,'iter_max_Newton'), opt.iter_max_Newton = 10; end; 66 | if ~isfield(opt,'prec'), opt.prec = 1e-7; end; 67 | if ~isfield(opt,'cg_prec'), opt.cg_prec = 1e-3; end; 68 | if ~isfield(opt,'cg_it'), opt.cg_it = 20; end; 69 | 70 | d = size(X,2); 71 | n = size([A;S],1); 72 | 73 | if (d*n>1e9) & (opt.lin_cg==0) 74 | warning('Large problem: you should consider trying the lin_cg option') 75 | end; 76 | 77 | if nargin<6 78 | w = zeros(d,1); 79 | end; 80 | iter = 0; 81 | %out = 1-A*(X*w); 82 | 83 | %%%%%%%%%% 84 | % This is the line that needs to be altered to change the way that 85 | % ranksvm works for my purposes 86 | % out1 is the same exact implementation as what was originally designed 87 | % out2 needs to take into account the pairs that are highly similar 88 | 89 | disp('The size of the X matrix') 90 | disp(size(X)) 91 | disp('The sixe of the w matrix') 92 | disp(size(w)) 93 | 94 | % This is the section of the code that sets up the work 95 | if size(oldO,1) == 1 96 | %out1 = 1-A*(X*w); 97 | out2 = -(S*(X*w)); 98 | out1 = []; 99 | elseif size(oldS,1) == 1 100 | out1 = 1-A*(X*w); 101 | %out2 = (S*(X*w)); 102 | out2 = []; 103 | else 104 | out1 = 1-A*(X*w); 105 | out2 = -(S*(X*w)); 106 | end 107 | 108 | % Now concatenate the vectors together. 109 | out = [out1; out2]; 110 | % We need to keep track of the pairs that are chosen for the experiments, 111 | % so thusly concatenate the two vectors together so they will be the same 112 | % for our experiments. 113 | A = [A;S]; 114 | % disp('The size of the A Matrix') 115 | % disp(size(A)) 116 | % C = [Costs_for_O;Costs_for_S]; 117 | 118 | C = 0.1*ones(size(A,1),1); 119 | while 1 120 | iter = iter + 1; 121 | if iter > opt.iter_max_Newton; 122 | warning(sprintf(['Maximum number of Newton steps reached.' ... 123 | 'Try larger lambda'])); 124 | break; 125 | end; 126 | 127 | [obj, grad, sv] = obj_fun_linear(w,C,out); 128 | 129 | % Compute the Newton direction either by linear CG 130 | % Advantage of linear CG when using sparse input: the Hessian 131 | % is never computed explicitly. 132 | if opt.lin_cg 133 | [step, foo, relres] = minres(@hess_vect_mult, -grad,... 134 | opt.cg_prec,opt.cg_it,[],[],[],sv,C); 135 | else 136 | Xsv = A(sv,:)*X; 137 | hess = eye(d) + Xsv'*(Xsv.*repmat(C(sv),1,d)); % Hessian 138 | step = - hess \ grad; % Newton direction 139 | relres = 0; 140 | end; 141 | 142 | % Do an exact line search 143 | [t,out] = line_search_linear(w,step,out,C); 144 | 145 | w = w + t*step; 146 | fprintf(['Iter = %d, Obj = %f, Nb of sv = %d, Newton decr = %.3f, ' ... 147 | 'Line search = %.3f, Lin CG acc = %.4f \n'],... 
148 | iter,obj,sum(sv),-step'*grad/2,t,relres); 149 | 150 | if -step'*grad < opt.prec * obj 151 | % Stop when the Newton decrement is small enough 152 | break; 153 | end; 154 | end; 155 | 156 | 157 | function [obj, grad, sv] = obj_fun_linear(w,C,out) 158 | % Compute the objective function, its gradient and the set of support vectors 159 | % Out is supposed to contain 1-A*X*w 160 | global X A 161 | out = max(0,out); 162 | obj = sum(C.*out.^2)/2 + w'*w/2; % L2 penalization of the errors 163 | grad = w - (((C.*out)'*A)*X)'; % Gradient 164 | sv = out>0; 165 | 166 | 167 | function y = hess_vect_mult(w,sv,C) 168 | % Compute the Hessian times a given vector x. 169 | global X A 170 | y = w; 171 | z = (C.*sv).*(A*(X*w)); % Computing X(sv,:)*x takes more time in Matlab :-( 172 | y = y + ((z'*A)*X)'; 173 | 174 | 175 | function [t,out] = line_search_linear(w,d,out,C) 176 | % From the current solution w, do a line search in the direction d by 177 | % 1D Newton minimization 178 | global X A 179 | t = 0; 180 | % Precompute some dots products 181 | Xd = A*(X*d); 182 | wd = w'*d; 183 | dd = d'*d; 184 | while 1 185 | out2 = out - t*Xd; % The new outputs after a step of length t 186 | sv = find(out2>0); 187 | g = wd + t*dd - (C(sv).*out2(sv))'*Xd(sv); % The gradient (along the line) 188 | h = dd + Xd(sv)'*(Xd(sv).*C(sv)); % The second derivative (along the line) 189 | t = t - g/h; % Take the 1D Newton step. Note that if d was an exact Newton 190 | % direction, t is 1 after the first iteration. 191 | if g^2/h < 1e-10, break; end; 192 | end; 193 | out = out2; 194 | -------------------------------------------------------------------------------- /Relative Attributes/used_for_training_kun.csv: -------------------------------------------------------------------------------- 1 | 0 2 | 0 3 | 0 4 | 0 5 | 0 6 | 0 7 | 0 8 | 0 9 | 0 10 | 0 11 | 0 12 | 0 13 | 0 14 | 0 15 | 0 16 | 0 17 | 0 18 | 0 19 | 0 20 | 0 21 | 0 22 | 0 23 | 0 24 | 0 25 | 0 26 | 0 27 | 0 28 | 0 29 | 0 30 | 0 31 | 0 32 | 0 33 | 0 34 | 0 35 | 0 36 | 0 37 | 0 38 | 0 39 | 0 40 | 0 41 | 0 42 | 0 43 | 0 44 | 0 45 | 0 46 | 0 47 | 0 48 | 0 49 | 0 50 | 0 51 | 1 52 | 1 53 | 1 54 | 1 55 | 1 56 | 1 57 | 1 58 | 1 59 | 1 60 | 1 61 | 1 62 | 1 63 | 1 64 | 1 65 | 1 66 | 1 67 | 1 68 | 1 69 | 1 70 | 1 71 | 1 72 | 1 73 | 1 74 | 1 75 | 1 76 | 1 77 | 1 78 | 1 79 | 1 80 | 1 81 | 1 82 | 1 83 | 1 84 | 1 85 | 1 86 | 1 87 | 1 88 | 1 89 | 1 90 | 1 91 | 1 92 | 1 93 | 1 94 | 1 95 | 1 96 | 1 97 | 1 98 | 1 99 | 1 100 | 1 101 | 1 102 | 1 103 | 1 104 | 1 105 | 1 106 | 1 107 | 1 108 | 1 109 | 1 110 | 1 111 | 1 112 | 1 113 | 1 114 | 1 115 | 1 116 | 1 117 | 1 118 | 1 119 | 1 120 | 1 121 | 1 122 | 1 123 | 1 124 | 1 125 | 1 126 | 1 127 | 1 128 | 1 129 | 1 130 | 1 131 | 1 132 | 1 133 | 1 134 | 1 135 | 1 136 | 1 137 | 1 138 | 1 139 | 1 140 | 1 141 | 1 142 | 1 143 | 1 144 | 1 145 | 1 146 | 1 147 | 1 148 | 1 149 | 1 150 | 1 151 | 1 152 | 1 153 | 1 154 | 1 155 | 1 156 | 1 157 | 1 158 | 1 159 | 1 160 | 1 161 | 1 162 | 1 163 | 1 164 | 1 165 | 1 166 | 1 167 | 1 168 | 1 169 | 1 170 | 1 171 | 1 172 | 1 173 | 1 174 | 1 175 | 1 176 | 1 177 | 1 178 | 1 179 | 1 180 | 1 181 | 1 182 | 1 183 | 1 184 | 1 185 | 1 186 | 1 187 | 1 188 | 1 189 | 1 190 | 1 191 | 1 192 | 1 193 | 1 194 | 1 195 | 1 196 | 1 197 | 1 198 | 1 199 | 1 200 | 1 201 | 1 202 | 1 203 | 1 204 | 1 205 | 1 206 | 1 207 | 1 208 | 1 209 | 1 210 | 1 211 | 1 212 | 1 213 | 1 214 | 1 215 | 1 216 | 1 217 | 1 218 | 1 219 | 1 220 | 1 221 | 1 222 | 1 223 | 1 224 | 1 225 | 1 226 | 1 227 | 1 228 | 1 229 | 1 230 | 1 231 | 1 232 | 1 233 | 1 234 | 1 235 | 1 
236 | 1 237 | 1 238 | 1 239 | 1 240 | 1 241 | 1 242 | 1 243 | 1 244 | 1 245 | 1 246 | 1 247 | 1 248 | 1 249 | 1 250 | 1 251 | 1 252 | 1 253 | 1 254 | 1 255 | 1 256 | 1 257 | 1 258 | 1 259 | 1 260 | 1 261 | 1 262 | 1 263 | 1 264 | 1 265 | 1 266 | 1 267 | 1 268 | 1 269 | 1 270 | 1 271 | 1 272 | 1 273 | 1 274 | 1 275 | 1 276 | 1 277 | 1 278 | 1 279 | 1 280 | 1 281 | 1 282 | 1 283 | 1 284 | 1 285 | 1 286 | 1 287 | 1 288 | 1 289 | 1 290 | 1 291 | 1 292 | 1 293 | 1 294 | 1 295 | 1 296 | 1 297 | 1 298 | 1 299 | 1 300 | 1 301 | 1 302 | 1 303 | 1 304 | 1 305 | 1 306 | 1 307 | 1 308 | 1 309 | 1 310 | 1 311 | 1 312 | 1 313 | 1 314 | 1 315 | 1 316 | 1 317 | 1 318 | 1 319 | 1 320 | 1 321 | 1 322 | 1 323 | 1 324 | 1 325 | 1 326 | 1 327 | 1 328 | 1 329 | 1 330 | 1 331 | 1 332 | 1 333 | 1 334 | 1 335 | 1 336 | 1 337 | 1 338 | 1 339 | 1 340 | 1 341 | 1 342 | 1 343 | 1 344 | 1 345 | 1 346 | 1 347 | 1 348 | 1 349 | 1 350 | 1 351 | 0 352 | 0 353 | 0 354 | 0 355 | 0 356 | 0 357 | 0 358 | 0 359 | 0 360 | 0 361 | 0 362 | 0 363 | 0 364 | 0 365 | 0 366 | 0 367 | 0 368 | 0 369 | 0 370 | 0 371 | 0 372 | 0 373 | 0 374 | 0 375 | 0 376 | 0 377 | 0 378 | 0 379 | 0 380 | 0 381 | 0 382 | 0 383 | 0 384 | 0 385 | 0 386 | 0 387 | 0 388 | 0 389 | 0 390 | 0 391 | 0 392 | 0 393 | 0 394 | 0 395 | 0 396 | 0 397 | 0 398 | 0 399 | 0 400 | 0 401 | 1 402 | 1 403 | 1 404 | 1 405 | 1 406 | 1 407 | 1 408 | 1 409 | 1 410 | 1 411 | 1 412 | 1 413 | 1 414 | 1 415 | 1 416 | 1 417 | 1 418 | 1 419 | 1 420 | 1 421 | 1 422 | 1 423 | 1 424 | 1 425 | 1 426 | 1 427 | 1 428 | 1 429 | 1 430 | 1 431 | 1 432 | 1 433 | 1 434 | 1 435 | 1 436 | 1 437 | 1 438 | 1 439 | 1 440 | 1 441 | 1 442 | 1 443 | 1 444 | 1 445 | 1 446 | 1 447 | 1 448 | 1 449 | 1 450 | 1 451 | 1 452 | 1 453 | 1 454 | 1 455 | 1 456 | 1 457 | 1 458 | 1 459 | 1 460 | 1 461 | 1 462 | 1 463 | 1 464 | 1 465 | 1 466 | 1 467 | 1 468 | 1 469 | 1 470 | 1 471 | 1 472 | 1 473 | 1 474 | 1 475 | 1 476 | 1 477 | 1 478 | 1 479 | 1 480 | 1 481 | 1 482 | 1 483 | 1 484 | 1 485 | 1 486 | 1 487 | 1 488 | 1 489 | 1 490 | 1 491 | 1 492 | 1 493 | 1 494 | 1 495 | 1 496 | 1 497 | 1 498 | 1 499 | 1 500 | 1 501 | 1 502 | 1 503 | 1 504 | 1 505 | 1 506 | 1 507 | 1 508 | 1 509 | 1 510 | 1 511 | 1 512 | 1 513 | 1 514 | 1 515 | 1 516 | 1 517 | 1 518 | 1 519 | 1 520 | 1 521 | 1 522 | 1 523 | 1 524 | 1 525 | 1 526 | 1 527 | 1 528 | 1 529 | 1 530 | 1 531 | 1 532 | 1 533 | 1 534 | 1 535 | 1 536 | 1 537 | 1 538 | 1 539 | 1 540 | 1 541 | 1 542 | 1 543 | 1 544 | 1 545 | 1 546 | 1 547 | 1 548 | 1 549 | 1 550 | 1 551 | 1 552 | 1 553 | 1 554 | 1 555 | 1 556 | 1 557 | 1 558 | 1 559 | 1 560 | 1 561 | 1 562 | 1 563 | 1 564 | 1 565 | 1 566 | 1 567 | 1 568 | 1 569 | 1 570 | 1 571 | 1 572 | 1 573 | 1 574 | 1 575 | 1 576 | 1 577 | 1 578 | 1 579 | 1 580 | 1 581 | 1 582 | 1 583 | 1 584 | 1 585 | 1 586 | 1 587 | 1 588 | 1 589 | 1 590 | 1 591 | 1 592 | 1 593 | 1 594 | 1 595 | 1 596 | 1 597 | 1 598 | 1 599 | 1 600 | 1 601 | 1 602 | 1 603 | 1 604 | 1 605 | 1 606 | 1 607 | 1 608 | 1 609 | 1 610 | 1 611 | 1 612 | 1 613 | 1 614 | 1 615 | 1 616 | 1 617 | 1 618 | 1 619 | 1 620 | 1 621 | 1 622 | 1 623 | 1 624 | 1 625 | 1 626 | 1 627 | 1 628 | 1 629 | 1 630 | 1 631 | 1 632 | 1 633 | 1 634 | 1 635 | 1 636 | 1 637 | 1 638 | 1 639 | 1 640 | 1 641 | 1 642 | 1 643 | 1 644 | 1 645 | 1 646 | 1 647 | 1 648 | 1 649 | 1 650 | 1 651 | 1 652 | 1 653 | 1 654 | 1 655 | 1 656 | 1 657 | 1 658 | 1 659 | 1 660 | 1 661 | 1 662 | 1 663 | 1 664 | 1 665 | 1 666 | 1 667 | 1 668 | 1 669 | 1 670 | 1 671 | 1 672 | 1 673 | 1 674 | 1 675 | 1 676 | 1 677 | 1 678 | 1 679 | 1 
680 | 1 681 | 1 682 | 1 683 | 1 684 | 1 685 | 1 686 | 1 687 | 1 688 | 1 689 | 1 690 | 1 691 | 1 692 | 1 693 | 1 694 | 1 695 | 1 696 | 1 697 | 1 698 | 1 699 | 1 700 | 1 701 | -------------------------------------------------------------------------------- /codes/acrnn_test.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import pdb 5 | 6 | class acrnn(nn.Module): 7 | def __init__(self, num_classes=4, is_training=True, 8 | L1=128, L2=256, cell_units=128, num_linear=768, 9 | p=10, time_step=800, F1=128, dropout_keep_prob=1): 10 | super(acrnn, self).__init__() 11 | 12 | self.num_classes = num_classes 13 | self.is_training = is_training 14 | self.L1 = L1 15 | self.L2 = L2 16 | self.cell_units = cell_units 17 | self.num_linear = num_linear 18 | self.p = p 19 | self.time_step = time_step 20 | self.F1 = F1 21 | self.dropout_prob = 1 - dropout_keep_prob 22 | 23 | # tf filter : [filter_height, filter_width, in_channels, out_channels] 24 | self.conv1 = nn.Conv2d(3, self.L1, (5, 3), padding=(2, 1)) # [5, 3, 3, 128] 25 | self.conv2 = nn.Conv2d(self.L1, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 26 | self.conv3 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 27 | self.conv4 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 28 | self.conv5 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 29 | self.conv6 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 30 | 31 | self.linear1 = nn.Linear(self.p*self.L2, self.num_linear) # [10*256, 768] 32 | self.bn = nn.BatchNorm1d(self.num_linear) 33 | 34 | #self.linear_em = nn.Linear(self.p, self.L2) 35 | 36 | self.relu = nn.LeakyReLU(0.01) 37 | self.dropout = nn.Dropout2d(p=self.dropout_prob) 38 | 39 | self.rnn = nn.LSTM(input_size=self.num_linear, hidden_size=self.cell_units, 40 | batch_first=True, num_layers=1, bidirectional=True) 41 | 42 | # for attention 43 | self.a_fc1 = nn.Linear(2*self.cell_units, 1) 44 | self.a_fc2 = nn.Linear(1, 1) 45 | self.sigmoid = nn.Sigmoid() 46 | self.softmax = nn.Softmax(dim=1) 47 | 48 | # fully connected layers 49 | self.fc1 = nn.Linear(2*self.cell_units, self.F1) # [2*128, 64] 50 | self.fc2 = nn.Linear(self.F1, self.num_classes) # [num_classes] 51 | 52 | 53 | def forward(self, x): 54 | 55 | layer1 = self.relu(self.conv1(x)) 56 | layer1 = F.max_pool2d(layer1, kernel_size=(2, 4), stride=(2, 4)) # [1,2,4,1], padding = 'valid' 57 | layer1 = self.dropout(layer1) 58 | 59 | layer2 = self.relu(self.conv2(layer1)) 60 | layer2 = self.dropout(layer2) 61 | 62 | layer3 = self.relu(self.conv3(layer2)) 63 | layer3 = self.dropout(layer3) 64 | 65 | layer4 = self.relu(self.conv4(layer3)) 66 | layer4 = self.dropout(layer4) 67 | 68 | layer5 = self.relu(self.conv5(layer4)) 69 | layer5 = self.dropout(layer5) 70 | 71 | layer6 = self.relu(self.conv6(layer5)) 72 | layer6 = self.dropout(layer6) 73 | 74 | # lstm 75 | layer6 = layer6.permute(0, 2, 3, 1) 76 | layer6 = layer6.reshape(-1, self.time_step, self.L2*self.p) # (-1, 150, 256*10) 77 | 78 | layer6 = layer6.reshape(-1, self.L2*self.p) # (1500, 2560) 79 | 80 | linear1 = self.relu(self.bn(self.linear1(layer6))) # [1500, 768] 81 | linear1 = linear1.reshape(-1, self.time_step, self.num_linear) # [10, 150, 768] 82 | em_bed_low = linear1 83 | 84 | outputs1, output_states1 = self.rnn(linear1) # outputs1 : [10, 150, 128] (B,T,D) 85 | 86 | # # attention 87 | v = self.sigmoid(self.a_fc1(outputs1)) 
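        # How the attention pooling below works: a_fc1 scores every BLSTM output
        # frame, a_fc2 plus the time-axis softmax turns those scores into the
        # weights `alphas`, and the weighted sum collapses the variable-length
        # sequence into one utterance-level vector that the fully connected
        # layers then classify into emotion categories.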
# (10, 150, 1) 88 | alphas = self.softmax(self.a_fc2(v).squeeze()) # (B,T) shape, alphas are attention weights (10,800) 89 | gru = (alphas.unsqueeze(2) * outputs1).sum(dim=1) # (B,D) (10,256) 90 | 91 | # # fc 92 | fully1 = self.relu(self.fc1(gru)) 93 | em_bed_high = fully1 94 | fully1 = self.dropout(fully1) 95 | Ylogits = self.fc2(fully1) 96 | Ylogits = self.softmax(Ylogits) 97 | 98 | return Ylogits, em_bed_low, em_bed_high 99 | -------------------------------------------------------------------------------- /codes/checkpoint/checkpoint_5900: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/checkpoint/checkpoint_5900 -------------------------------------------------------------------------------- /codes/distributed.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | from torch.nn.modules import Module 4 | from torch.autograd import Variable 5 | 6 | def _flatten_dense_tensors(tensors): 7 | """Flatten dense tensors into a contiguous 1D buffer. Assume tensors are of 8 | same dense type. 9 | Since inputs are dense, the resulting tensor will be a concatenated 1D 10 | buffer. Element-wise operation on this buffer will be equivalent to 11 | operating individually. 12 | Arguments: 13 | tensors (Iterable[Tensor]): dense tensors to flatten. 14 | Returns: 15 | A contiguous 1D buffer containing input tensors. 16 | """ 17 | if len(tensors) == 1: 18 | return tensors[0].contiguous().view(-1) 19 | flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0) 20 | return flat 21 | 22 | def _unflatten_dense_tensors(flat, tensors): 23 | """View a flat buffer using the sizes of tensors. Assume that tensors are of 24 | same dense type, and that flat is given by _flatten_dense_tensors. 25 | Arguments: 26 | flat (Tensor): flattened dense tensors to unflatten. 27 | tensors (Iterable[Tensor]): dense tensors whose sizes will be used to 28 | unflatten flat. 29 | Returns: 30 | Unflattened dense tensors with sizes same as tensors and values from 31 | flat. 32 | """ 33 | outputs = [] 34 | offset = 0 35 | for tensor in tensors: 36 | numel = tensor.numel() 37 | outputs.append(flat.narrow(0, offset, numel).view_as(tensor)) 38 | offset += numel 39 | return tuple(outputs) 40 | 41 | 42 | ''' 43 | This version of DistributedDataParallel is designed to be used in conjunction with the multiproc.py 44 | launcher included with this example. It assumes that your run is using multiprocess with 1 45 | GPU/process, that the model is on the correct device, and that torch.set_device has been 46 | used to set the device. 47 | Parameters are broadcasted to the other processes on initialization of DistributedDataParallel, 48 | and will be allreduced at the finish of the backward pass. 
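A minimal usage sketch (hypothetical model name, shown only to illustrate the
call pattern):

    model = MyModel().cuda()
    if hparams.distributed_run:
        model = apply_gradient_allreduce(model)  # or wrap with DistributedDataParallel(model)
    # training then proceeds as usual: parameters are broadcast from rank 0 when
    # the wrapper is created, and gradients are all-reduced after each backward().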
49 | ''' 50 | class DistributedDataParallel(Module): 51 | 52 | def __init__(self, module): 53 | super(DistributedDataParallel, self).__init__() 54 | #fallback for PyTorch 0.3 55 | if not hasattr(dist, '_backend'): 56 | self.warn_on_half = True 57 | else: 58 | self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 59 | 60 | self.module = module 61 | 62 | for p in list(self.module.state_dict().values()): 63 | if not torch.is_tensor(p): 64 | continue 65 | dist.broadcast(p, 0) 66 | 67 | def allreduce_params(): 68 | if(self.needs_reduction): 69 | self.needs_reduction = False 70 | buckets = {} 71 | for param in self.module.parameters(): 72 | if param.requires_grad and param.grad is not None: 73 | tp = type(param.data) 74 | if tp not in buckets: 75 | buckets[tp] = [] 76 | buckets[tp].append(param) 77 | if self.warn_on_half: 78 | if torch.cuda.HalfTensor in buckets: 79 | print(("WARNING: gloo dist backend for half parameters may be extremely slow." + 80 | " It is recommended to use the NCCL backend in this case. This currently requires" + 81 | "PyTorch built from top of tree master.")) 82 | self.warn_on_half = False 83 | 84 | for tp in buckets: 85 | bucket = buckets[tp] 86 | grads = [param.grad.data for param in bucket] 87 | coalesced = _flatten_dense_tensors(grads) 88 | dist.all_reduce(coalesced) 89 | coalesced /= dist.get_world_size() 90 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 91 | buf.copy_(synced) 92 | 93 | for param in list(self.module.parameters()): 94 | def allreduce_hook(*unused): 95 | param._execution_engine.queue_callback(allreduce_params) 96 | if param.requires_grad: 97 | param.register_hook(allreduce_hook) 98 | 99 | def forward(self, *inputs, **kwargs): 100 | self.needs_reduction = True 101 | return self.module(*inputs, **kwargs) 102 | 103 | ''' 104 | def _sync_buffers(self): 105 | buffers = list(self.module._all_buffers()) 106 | if len(buffers) > 0: 107 | # cross-node buffer sync 108 | flat_buffers = _flatten_dense_tensors(buffers) 109 | dist.broadcast(flat_buffers, 0) 110 | for buf, synced in zip(buffers, _unflatten_dense_tensors(flat_buffers, buffers)): 111 | buf.copy_(synced) 112 | def train(self, mode=True): 113 | # Clear NCCL communicator and CUDA event cache of the default group ID, 114 | # These cache will be recreated at the later call. This is currently a 115 | # work-around for a potential NCCL deadlock. 
116 | if dist._backend == dist.dist_backend.NCCL: 117 | dist._clear_group_cache() 118 | super(DistributedDataParallel, self).train(mode) 119 | self.module.train(mode) 120 | ''' 121 | ''' 122 | Modifies existing model to do gradient allreduce, but doesn't change class 123 | so you don't need "module" 124 | ''' 125 | def apply_gradient_allreduce(module): 126 | if not hasattr(dist, '_backend'): 127 | module.warn_on_half = True 128 | else: 129 | module.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 130 | 131 | for p in list(module.state_dict().values()): 132 | if not torch.is_tensor(p): 133 | continue 134 | dist.broadcast(p, 0) 135 | 136 | def allreduce_params(): 137 | if(module.needs_reduction): 138 | module.needs_reduction = False 139 | buckets = {} 140 | for param in module.parameters(): 141 | if param.requires_grad and param.grad is not None: 142 | tp = type(param.data) 143 | if tp not in buckets: 144 | buckets[tp] = [] 145 | buckets[tp].append(param) 146 | if module.warn_on_half: 147 | if torch.cuda.HalfTensor in buckets: 148 | print(("WARNING: gloo dist backend for half parameters may be extremely slow." + 149 | " It is recommended to use the NCCL backend in this case. This currently requires" + 150 | "PyTorch built from top of tree master.")) 151 | module.warn_on_half = False 152 | 153 | for tp in buckets: 154 | bucket = buckets[tp] 155 | grads = [param.grad.data for param in bucket] 156 | coalesced = _flatten_dense_tensors(grads) 157 | dist.all_reduce(coalesced) 158 | coalesced /= dist.get_world_size() 159 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 160 | buf.copy_(synced) 161 | 162 | for param in list(module.parameters()): 163 | def allreduce_hook(*unused): 164 | Variable._execution_engine.queue_callback(allreduce_params) 165 | if param.requires_grad: 166 | param.register_hook(allreduce_hook) 167 | 168 | def set_needs_reduction(self, input, output): 169 | self.needs_reduction = True 170 | 171 | module.register_forward_hook(set_needs_reduction) 172 | return module -------------------------------------------------------------------------------- /codes/hparams.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. 
Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=50, 12 | iters_per_checkpoint=100, 13 | seed=1234, 14 | dynamic_loss_scaling=True, 15 | distributed_run=False, 16 | dist_backend="nccl", 17 | dist_url="tcp://localhost:54321", 18 | cudnn_enabled=True, 19 | cudnn_benchmark=False, 20 | 21 | ################################ 22 | # Data Parameters # 23 | ################################ 24 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/training_mel_list.txt', 25 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/evaluation_mel_list.txt', 26 | #mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 27 | mel_mean_std = '/home/zhoukun/nonparaSeq2seqVC_code-master/0013/mel_mean_std.npy', 28 | ################################ 29 | # Data Parameters # 30 | ################################ 31 | n_mel_channels=80, 32 | n_spc_channels=1025, 33 | n_symbols=41, # 34 | pretrain_n_speakers=99, # 35 | 36 | n_speakers=4, # 37 | predict_spectrogram=False, 38 | 39 | ################################ 40 | # Model Parameters # 41 | ################################ 42 | 43 | symbols_embedding_dim=512, 44 | 45 | # Text Encoder parameters 46 | encoder_kernel_size=5, 47 | encoder_n_convolutions=3, 48 | encoder_embedding_dim=512, 49 | text_encoder_dropout=0.5, 50 | 51 | # Audio Encoder parameters 52 | spemb_input=False, 53 | n_frames_per_step_encoder=2, 54 | audio_encoder_hidden_dim=512, 55 | AE_attention_dim=128, 56 | AE_attention_location_n_filters=32, 57 | AE_attention_location_kernel_size=51, 58 | beam_width=10, 59 | 60 | # hidden activation 61 | # relu linear tanh 62 | hidden_activation='tanh', 63 | 64 | #Speaker Encoder parameters 65 | speaker_encoder_hidden_dim=256, 66 | speaker_encoder_dropout=0.2, 67 | #speaker_embedding_dim=128, 68 | speaker_embedding_dim=64, 69 | 70 | 71 | #Speaker Classifier parameters 72 | SC_hidden_dim=512, 73 | SC_n_convolutions=3, 74 | SC_kernel_size=1, 75 | 76 | # Decoder parameters 77 | feed_back_last=True, 78 | n_frames_per_step_decoder=2, 79 | decoder_rnn_dim=512, 80 | prenet_dim=[256,256], 81 | max_decoder_steps=1000, 82 | stop_threshold=0.5, 83 | 84 | # Attention parameters 85 | attention_rnn_dim=512, 86 | attention_dim=128, 87 | 88 | # Location Layer parameters 89 | attention_location_n_filters=32, 90 | attention_location_kernel_size=17, 91 | 92 | # PostNet parameters 93 | postnet_n_convolutions=5, 94 | postnet_dim=512, 95 | postnet_kernel_size=5, 96 | postnet_dropout=0.5, 97 | 98 | ################################ 99 | # Optimization Hyperparameters # 100 | ################################ 101 | use_saved_learning_rate=False, 102 | #learning_rate=1e-2, 103 | learning_rate=1e-3, 104 | #weight_decay=1e-6, 105 | weight_decay=1e-4, 106 | grad_clip_thresh=5.0, 107 | #batch_size=32, 108 | #batch_size=16, 109 | batch_size = 8, 110 | warmup = 7, 111 | decay_rate = 0.5, 112 | decay_every = 7, 113 | 114 | 115 | 116 | ser_loss_w = 1, 117 | emo_loss_w = 1, 118 | contrastive_loss_w=30.0, 119 | #contrastive_loss_w=0.0, 120 | speaker_encoder_loss_w=1.0, 121 | text_classifier_loss_w=1.0, 122 | #text_classifier_loss_w=0.0, 123 | speaker_adversial_loss_w=20., 124 | speaker_classifier_loss_w=0.1, 125 | ce_loss=False 126 | ) 127 | 128 | if hparams_string: 129 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 130 | 
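    # Usage note (illustrative): the value passed on the command line via
    # --hparams is a comma-separated list of key=value overrides, e.g.
    #   hparams = create_hparams('batch_size=8,SC_kernel_size=1')
    # The parse() call below applies those overrides on top of the defaults
    # defined above, so command-line settings take precedence.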
hparams.parse(hparams_string) 131 | 132 | if verbose: 133 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 134 | 135 | return hparams 136 | -------------------------------------------------------------------------------- /codes/hparams_1.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=200, 12 | iters_per_checkpoint=1000, 13 | seed=1234, 14 | distributed_run=False, 15 | dist_backend="nccl", 16 | dist_url="tcp://localhost:54321", 17 | cudnn_enabled=True, 18 | cudnn_benchmark=False, 19 | 20 | ################################ 21 | # Data Parameters # 22 | ################################ 23 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/training_mel_list.txt', 24 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/evaluation_mel_list.txt', 25 | mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 26 | 27 | ################################ 28 | # Data Parameters # 29 | ################################ 30 | n_mel_channels=80, 31 | n_spc_channels=1025, 32 | n_symbols=41, # 33 | n_speakers=99, # 34 | predict_spectrogram=False, 35 | 36 | ################################ 37 | # Model Parameters # 38 | ################################ 39 | 40 | symbols_embedding_dim=512, 41 | 42 | # Text Encoder parameters 43 | encoder_kernel_size=5, 44 | encoder_n_convolutions=3, 45 | encoder_embedding_dim=512, 46 | text_encoder_dropout=0.5, 47 | 48 | # Audio Encoder parameters 49 | spemb_input=False, 50 | n_frames_per_step_encoder=2, 51 | audio_encoder_hidden_dim=512, 52 | AE_attention_dim=128, 53 | AE_attention_location_n_filters=32, 54 | AE_attention_location_kernel_size=51, 55 | beam_width=10, 56 | 57 | # hidden activation 58 | # relu linear tanh 59 | hidden_activation='tanh', 60 | 61 | #Speaker Encoder parameters 62 | speaker_encoder_hidden_dim=256, 63 | speaker_encoder_dropout=0.2, 64 | speaker_embedding_dim=128, 65 | 66 | 67 | #Speaker Classifier parameters 68 | SC_hidden_dim=512, 69 | SC_n_convolutions=3, 70 | SC_kernel_size=1, 71 | 72 | # Decoder parameters 73 | feed_back_last=True, 74 | n_frames_per_step_decoder=2, 75 | decoder_rnn_dim=512, 76 | prenet_dim=[256,256], 77 | max_decoder_steps=1000, 78 | stop_threshold=0.5, 79 | 80 | # Attention parameters 81 | attention_rnn_dim=512, 82 | attention_dim=128, 83 | 84 | # Location Layer parameters 85 | attention_location_n_filters=32, 86 | attention_location_kernel_size=17, 87 | 88 | # PostNet parameters 89 | postnet_n_convolutions=5, 90 | postnet_dim=512, 91 | postnet_kernel_size=5, 92 | postnet_dropout=0.5, 93 | 94 | ################################ 95 | # Optimization Hyperparameters # 96 | ################################ 97 | use_saved_learning_rate=False, 98 | learning_rate=1e-3, 99 | weight_decay=1e-6, 100 | grad_clip_thresh=5.0, 101 | batch_size=32, 102 | 103 | contrastive_loss_w=30.0, 104 | speaker_encoder_loss_w=1.0, 105 | text_classifier_loss_w=1.0, 106 | speaker_adversial_loss_w=20., 107 | speaker_classifier_loss_w=0.1, 108 | ce_loss=False 109 | ) 110 | 111 | if hparams_string: 112 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 113 | 
hparams.parse(hparams_string) 114 | 115 | if verbose: 116 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 117 | 118 | return hparams 119 | -------------------------------------------------------------------------------- /codes/hparams_update.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=100, 12 | iters_per_checkpoint=100, 13 | seed=1234, 14 | dynamic_loss_scaling=True, 15 | distributed_run=False, 16 | dist_backend="nccl", 17 | dist_url="tcp://localhost:54321", 18 | cudnn_enabled=True, 19 | cudnn_benchmark=False, 20 | 21 | ################################ 22 | # Data Parameters # 23 | ################################ 24 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/training_mel_list.txt', 25 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/evaluation_mel_list.txt', 26 | #mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 27 | mel_mean_std = '/home/zhoukun/nonparaSeq2seqVC_code-master/0013/mel_mean_std.npy', 28 | ################################ 29 | # Data Parameters # 30 | ################################ 31 | n_mel_channels=80, 32 | n_spc_channels=1025, 33 | n_symbols=41, # 34 | pretrain_n_speakers=99, # 35 | 36 | n_speakers=5, # 37 | predict_spectrogram=False, 38 | 39 | ################################ 40 | # Model Parameters # 41 | ################################ 42 | 43 | symbols_embedding_dim=512, 44 | 45 | # Text Encoder parameters 46 | encoder_kernel_size=5, 47 | encoder_n_convolutions=3, 48 | encoder_embedding_dim=512, 49 | text_encoder_dropout=0.5, 50 | 51 | # Audio Encoder parameters 52 | spemb_input=False, 53 | n_frames_per_step_encoder=2, 54 | audio_encoder_hidden_dim=512, 55 | AE_attention_dim=128, 56 | AE_attention_location_n_filters=32, 57 | AE_attention_location_kernel_size=51, 58 | beam_width=10, 59 | 60 | # hidden activation 61 | # relu linear tanh 62 | hidden_activation='tanh', 63 | 64 | #Speaker Encoder parameters 65 | speaker_encoder_hidden_dim=256, 66 | speaker_encoder_dropout=0.2, 67 | speaker_embedding_dim=128, 68 | 69 | 70 | #Speaker Classifier parameters 71 | SC_hidden_dim=512, 72 | SC_n_convolutions=3, 73 | SC_kernel_size=1, 74 | 75 | # Decoder parameters 76 | feed_back_last=True, 77 | n_frames_per_step_decoder=2, 78 | decoder_rnn_dim=512, 79 | prenet_dim=[256,256], 80 | max_decoder_steps=1000, 81 | stop_threshold=0.5, 82 | 83 | # Attention parameters 84 | attention_rnn_dim=512, 85 | attention_dim=128, 86 | 87 | # Location Layer parameters 88 | attention_location_n_filters=32, 89 | attention_location_kernel_size=17, 90 | 91 | # PostNet parameters 92 | postnet_n_convolutions=5, 93 | postnet_dim=512, 94 | postnet_kernel_size=5, 95 | postnet_dropout=0.5, 96 | 97 | ################################ 98 | # Optimization Hyperparameters # 99 | ################################ 100 | use_saved_learning_rate=False, 101 | learning_rate=1e-3, 102 | weight_decay=1e-6, 103 | grad_clip_thresh=5.0, 104 | batch_size=64, 105 | #batch_size = 8, 106 | warmup = 7, 107 | decay_rate = 0.5, 108 | decay_every = 7, 109 | 110 | 111 | 112 | 113 | contrastive_loss_w=30.0, 114 | 
#speaker_encoder_loss_w=1.0, 115 | speaker_encoder_loss_w=5.0, 116 | text_classifier_loss_w=5.0, 117 | speaker_adversial_loss_w=20., 118 | #speaker_classifier_loss_w=0.1, 119 | speaker_classifier_loss_w=5.0, 120 | ce_loss=False 121 | ) 122 | 123 | if hparams_string: 124 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 125 | hparams.parse(hparams_string) 126 | 127 | if verbose: 128 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 129 | 130 | return hparams 131 | -------------------------------------------------------------------------------- /codes/inference.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | 5 | 6 | import os 7 | import librosa 8 | import numpy as np 9 | import torch 10 | from torch.utils.data import DataLoader 11 | 12 | from reader import TextMelIDLoader, TextMelIDCollate, id2ph, id2sp 13 | from hparams import create_hparams 14 | from model import Parrot, lcm 15 | from train import load_model 16 | import scipy.io.wavfile 17 | 18 | 19 | ########### Configuration ########### 20 | hparams = create_hparams() 21 | 22 | #generation list 23 | 24 | #hlist = '/home/jxzhang/Documents/DataSets/VCTK/list/hold_english.list' #unseen speakers list 25 | tlist = '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/evaluation_mel_list.txt' #seen speakers list 26 | 27 | # use seen (tlist) or unseen list (hlist) 28 | test_list = tlist 29 | checkpoint_path='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir/checkpoint_234000' 30 | # TTS or VC task? 31 | input_text=False 32 | # number of utterances for generation 33 | NUM=10 34 | ISMEL=(not hparams.predict_spectrogram) 35 | ##################################### 36 | 37 | def plot_data(data, fn, figsize=(12, 4)): 38 | fig, axes = plt.subplots(1, len(data), figsize=figsize) 39 | for i in range(len(data)): 40 | if len(data) == 1: 41 | ax = axes 42 | else: 43 | ax = axes[i] 44 | g = ax.imshow(data[i], aspect='auto', origin='lower', 45 | interpolation='none') 46 | 47 | plt.colorbar(g, ax=ax) 48 | plt.savefig(fn) 49 | 50 | 51 | model = load_model(hparams) 52 | 53 | model.load_state_dict(torch.load(checkpoint_path)['state_dict']) 54 | _ = model.eval() 55 | 56 | test_set = TextMelIDLoader(test_list, hparams.mel_mean_std, shuffle=True) 57 | sample_list = test_set.file_path_list 58 | collate_fn = TextMelIDCollate(lcm(hparams.n_frames_per_step_encoder, 59 | hparams.n_frames_per_step_decoder)) 60 | 61 | test_loader = DataLoader(test_set, num_workers=1, shuffle=False, 62 | sampler=None, 63 | batch_size=1, pin_memory=False, 64 | drop_last=True, collate_fn=collate_fn) 65 | 66 | 67 | 68 | task = 'tts' if input_text else 'vc' 69 | path_save = os.path.join(checkpoint_path.replace('checkpoint', 'test'), task) 70 | path_save += '_seen' if test_list == tlist else '_unseen' 71 | if not os.path.exists(path_save): 72 | os.makedirs(path_save) 73 | 74 | print(path_save) 75 | 76 | def recover_wav(mel, wav_path, ismel=False, 77 | n_fft=2048, win_length=800,hop_length=200): 78 | 79 | if ismel: 80 | mean, std = np.load(hparams.mel_mean_std) 81 | else: 82 | mean, std = np.load(hparams.mel_mean_std.replace('mel','spec')) 83 | 84 | mean = mean[:,None] 85 | std = std[:,None] 86 | mel = 1.2 * mel * std + mean 87 | mel = np.exp(mel) 88 | 89 | if ismel: 90 | filters = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80) 91 | inv_filters = np.linalg.pinv(filters) 92 | spec = np.dot(inv_filters, mel) 93 | 
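    # When `ismel` is true, the 80-band mel spectrogram is first mapped back to
    # an approximate linear-frequency magnitude spectrogram via the pseudo-inverse
    # of the mel filterbank computed above; the Griffin-Lim loop below then
    # estimates the phase so a waveform can be synthesized.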
else: 94 | spec = mel 95 | 96 | def _griffin_lim(stftm_matrix, shape, max_iter=50): 97 | y = np.random.random(shape) 98 | for i in range(max_iter): 99 | stft_matrix = librosa.core.stft(y, n_fft=n_fft, win_length=win_length, hop_length=hop_length) 100 | stft_matrix = stftm_matrix * stft_matrix / np.abs(stft_matrix) 101 | y = librosa.core.istft(stft_matrix, win_length=win_length, hop_length=hop_length) 102 | return y 103 | 104 | shape = spec.shape[1] * hop_length - hop_length + 1 105 | 106 | y = _griffin_lim(spec, shape) 107 | scipy.io.wavfile.write(wav_path, 16000, y) 108 | return y 109 | 110 | 111 | text_input, mel, spec, speaker_id = test_set[0] 112 | reference_mel = mel.cuda().unsqueeze(0) 113 | ref_sp = id2sp[speaker_id.item()] 114 | 115 | def levenshteinDistance(s1, s2): 116 | if len(s1) > len(s2): 117 | s1, s2 = s2, s1 118 | 119 | distances = list(range(len(s1) + 1)) 120 | for i2, c2 in enumerate(s2): 121 | distances_ = [i2+1] 122 | for i1, c1 in enumerate(s1): 123 | if c1 == c2: 124 | distances_.append(distances[i1]) 125 | else: 126 | distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) 127 | distances = distances_ 128 | return distances[-1] 129 | 130 | with torch.no_grad(): 131 | 132 | errs = 0 133 | totalphs = 0 134 | 135 | for i, batch in enumerate(test_loader): 136 | if i == NUM: 137 | break 138 | 139 | sample_id = sample_list[i].split('/')[-1][9:17] 140 | print(('%d index %s, decoding ...'%(i,sample_id))) 141 | 142 | x, y = model.parse_batch(batch) 143 | predicted_mel, post_output, predicted_stop, alignments, \ 144 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments, \ 145 | speaker_id = model.inference(x, input_text, reference_mel, hparams.beam_width) 146 | 147 | post_output = post_output.data.cpu().numpy()[0] 148 | alignments = alignments.data.cpu().numpy()[0].T 149 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy()[0].T 150 | 151 | text_hidden = text_hidden.data.cpu().numpy()[0].T #-> [hidden_dim, max_text_len] 152 | audio_seq2seq_hidden = audio_seq2seq_hidden.data.cpu().numpy()[0].T 153 | audio_seq2seq_phids = audio_seq2seq_phids.data.cpu().numpy()[0] # [T + 1] 154 | speaker_id = speaker_id.data.cpu().numpy()[0] # scalar 155 | 156 | task = 'TTS' if input_text else 'VC' 157 | 158 | recover_wav(post_output, 159 | os.path.join(path_save, 'Wav_%s_ref_%s_%s.wav'%(sample_id, ref_sp, task)), 160 | ismel=ISMEL) 161 | 162 | post_output_path = os.path.join(path_save, 'Mel_%s_ref_%s_%s.npy'%(sample_id, ref_sp, task)) 163 | np.save(post_output_path, post_output) 164 | 165 | plot_data([alignments, audio_seq2seq_alignments], 166 | os.path.join(path_save, 'Ali_%s_ref_%s_%s.pdf'%(sample_id, ref_sp, task))) 167 | 168 | plot_data([np.hstack([text_hidden, audio_seq2seq_hidden])], 169 | os.path.join(path_save, 'Hid_%s_ref_%s_%s.pdf'%(sample_id, ref_sp, task))) 170 | 171 | audio_seq2seq_phids = [id2ph[id] for id in audio_seq2seq_phids[:-1]] 172 | target_text = y[0].data.cpu().numpy()[0] 173 | target_text = [id2ph[id] for id in target_text[:]] 174 | 175 | print('Sounds like %s, Decoded text is '%(id2sp[speaker_id])) 176 | 177 | print(audio_seq2seq_phids) 178 | print(target_text) 179 | 180 | err = levenshteinDistance(audio_seq2seq_phids, target_text) 181 | print(err, len(target_text)) 182 | 183 | errs += err 184 | totalphs += len(target_text) 185 | 186 | print(float(errs)/float(totalphs)) 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /codes/logger.py: 
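(Note on the file below: ParrotLogger extends tensorboardX's SummaryWriter, and a training loop is expected to call something along the lines of logger.log_training(total_loss, losses, accuracies, grad_norm, learning_rate, duration, iteration), where `losses` holds the ten individual loss terms and `accuracies` the four classifier accuracies written out as scalars below; the variable names here are illustrative.)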
-------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import torch.nn.functional as F 4 | from tensorboardX import SummaryWriter 5 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy, plot_alignment 6 | from plotting_utils import plot_gate_outputs_to_numpy 7 | 8 | 9 | class ParrotLogger(SummaryWriter): 10 | def __init__(self, logdir, ali_path='ali'): 11 | super(ParrotLogger, self).__init__(logdir) 12 | ali_path = os.path.join(logdir, ali_path) 13 | if not os.path.exists(ali_path): 14 | os.makedirs(ali_path) 15 | self.ali_path = ali_path 16 | 17 | def log_training(self, reduced_loss, reduced_losses, reduced_acces, grad_norm, learning_rate, duration, 18 | iteration): 19 | 20 | self.add_scalar("training.loss", reduced_loss, iteration) 21 | self.add_scalar("training.loss.recon", reduced_losses[0], iteration) 22 | self.add_scalar("training.loss.recon_post", reduced_losses[1], iteration) 23 | self.add_scalar("training.loss.stop", reduced_losses[2], iteration) 24 | self.add_scalar("training.loss.contr", reduced_losses[3], iteration) 25 | self.add_scalar("training.loss.spenc", reduced_losses[4], iteration) 26 | self.add_scalar("training.loss.spcla", reduced_losses[5], iteration) 27 | self.add_scalar("training.loss.texcl", reduced_losses[6], iteration) 28 | self.add_scalar("training.loss.spadv", reduced_losses[7], iteration) 29 | self.add_scalar("training.loss.serloss", reduced_losses[8], iteration) 30 | self.add_scalar("training.loss.emobloss", reduced_losses[9], iteration) 31 | 32 | self.add_scalar("grad.norm", grad_norm, iteration) 33 | self.add_scalar("learning.rate", learning_rate, iteration) 34 | self.add_scalar("duration", duration, iteration) 35 | 36 | 37 | self.add_scalar('training.acc.spenc', reduced_acces[0], iteration) 38 | self.add_scalar('training.acc.spcla', reduced_acces[1], iteration) 39 | self.add_scalar('training.acc.texcl', reduced_acces[2], iteration) 40 | self.add_scalar('training.acc.seracc', reduced_acces[3], iteration) 41 | 42 | def log_validation(self, reduced_loss, reduced_losses, reduced_acces, model, y, y_pred, iteration, task): 43 | 44 | self.add_scalar('validation.loss.%s'%task, reduced_loss, iteration) 45 | self.add_scalar("validation.loss.%s.recon"%task, reduced_losses[0], iteration) 46 | self.add_scalar("validation.loss.%s.recon_post"%task, reduced_losses[1], iteration) 47 | self.add_scalar("validation.loss.%s.stop"%task, reduced_losses[2], iteration) 48 | self.add_scalar("validation.loss.%s.contr"%task, reduced_losses[3], iteration) 49 | self.add_scalar("validation.loss.%s.spenc"%task, reduced_losses[4], iteration) 50 | self.add_scalar("validation.loss.%s.spcla"%task, reduced_losses[5], iteration) 51 | self.add_scalar("validation.loss.%s.texcl"%task, reduced_losses[6], iteration) 52 | self.add_scalar("validation.loss.%s.spadv"%task, reduced_losses[7], iteration) 53 | self.add_scalar("validation.loss.%s.serloss"%task, reduced_losses[8], iteration) 54 | self.add_scalar("validation.loss.%s.emobloss"%task, reduced_losses[9], iteration) 55 | 56 | 57 | self.add_scalar('validation.acc.%s.spenc'%task, reduced_acces[0], iteration) 58 | self.add_scalar('validation.acc.%s.spcla'%task, reduced_acces[1], iteration) 59 | self.add_scalar('validatoin.acc.%s.texcl'%task, reduced_acces[2], iteration) 60 | self.add_scalar('validatoin.acc.%s.seracc'%task, reduced_acces[3], iteration) 61 | 62 | predicted_mel, post_output, predicted_stop, alignments, \ 63 | text_hidden, mel_hidden, 
text_logit_from_mel_hidden, \ 64 | audio_seq2seq_alignments, \ 65 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 66 | text_lengths, mel_lengths, speaker_embedding = y_pred 67 | 68 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = y 69 | 70 | stop_target = stop_target.reshape(stop_target.size(0), -1, int(stop_target.size(1)/predicted_stop.size(1))) 71 | stop_target = stop_target[:,:,0] 72 | 73 | # plot distribution of parameters 74 | #for tag, value in model.named_parameters(): 75 | # tag = tag.replace('.', '/') 76 | # self.add_histogram(tag, value.data.cpu().numpy(), iteration) 77 | 78 | # plot alignment, mel target and predicted, stop target and predicted 79 | idx = random.randint(0, alignments.size(0) - 1) 80 | 81 | alignments = alignments.data.cpu().numpy() 82 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy() 83 | 84 | self.add_image( 85 | "%s.alignment"%task, 86 | plot_alignment_to_numpy(alignments[idx].T), 87 | iteration, dataformats='HWC') 88 | 89 | # plot more alignments 90 | plot_alignment(alignments[:4], self.ali_path+'/step-%d-%s.pdf'%(iteration, task)) 91 | 92 | self.add_image( 93 | "%s.audio_seq2seq_alignment"%task, 94 | plot_alignment_to_numpy(audio_seq2seq_alignments[idx].T), 95 | iteration, dataformats='HWC') 96 | 97 | self.add_image( 98 | "%s.mel_target"%task, 99 | plot_spectrogram_to_numpy(mel_target[idx].data.cpu().numpy()), 100 | iteration, dataformats='HWC') 101 | 102 | self.add_image( 103 | "%s.mel_predicted"%task, 104 | plot_spectrogram_to_numpy(predicted_mel[idx].data.cpu().numpy()), 105 | iteration, dataformats='HWC') 106 | 107 | self.add_image( 108 | "%s.spc_target"%task, 109 | plot_spectrogram_to_numpy(spc_target[idx].data.cpu().numpy()), 110 | iteration, dataformats='HWC') 111 | 112 | self.add_image( 113 | "%s.post_predicted"%task, 114 | plot_spectrogram_to_numpy(post_output[idx].data.cpu().numpy()), 115 | iteration, dataformats='HWC') 116 | 117 | self.add_image( 118 | "%s.stop"%task, 119 | plot_gate_outputs_to_numpy( 120 | stop_target[idx].data.cpu().numpy(), 121 | F.sigmoid(predicted_stop[idx]).data.cpu().numpy()), 122 | iteration, dataformats='HWC') 123 | -------------------------------------------------------------------------------- /codes/logger_original.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import torch.nn.functional as F 4 | from tensorboardX import SummaryWriter 5 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy, plot_alignment 6 | from plotting_utils import plot_gate_outputs_to_numpy 7 | 8 | 9 | class ParrotLogger(SummaryWriter): 10 | def __init__(self, logdir, ali_path='ali'): 11 | super(ParrotLogger, self).__init__(logdir) 12 | ali_path = os.path.join(logdir, ali_path) 13 | if not os.path.exists(ali_path): 14 | os.makedirs(ali_path) 15 | self.ali_path = ali_path 16 | 17 | def log_training(self, reduced_loss, reduced_losses, reduced_acces, grad_norm, learning_rate, duration, 18 | iteration): 19 | 20 | self.add_scalar("training.loss", reduced_loss, iteration) 21 | self.add_scalar("training.loss.recon", reduced_losses[0], iteration) 22 | self.add_scalar("training.loss.recon_post", reduced_losses[1], iteration) 23 | self.add_scalar("training.loss.stop", reduced_losses[2], iteration) 24 | self.add_scalar("training.loss.contr", reduced_losses[3], iteration) 25 | self.add_scalar("training.loss.spenc", reduced_losses[4], iteration) 26 | self.add_scalar("training.loss.spcla", 
reduced_losses[5], iteration) 27 | self.add_scalar("training.loss.texcl", reduced_losses[6], iteration) 28 | self.add_scalar("training.loss.spadv", reduced_losses[7], iteration) 29 | 30 | self.add_scalar("grad.norm", grad_norm, iteration) 31 | self.add_scalar("learning.rate", learning_rate, iteration) 32 | self.add_scalar("duration", duration, iteration) 33 | 34 | 35 | self.add_scalar('training.acc.spenc', reduced_acces[0], iteration) 36 | self.add_scalar('training.acc.spcla', reduced_acces[1], iteration) 37 | self.add_scalar('training.acc.texcl', reduced_acces[2], iteration) 38 | 39 | def log_validation(self, reduced_loss, reduced_losses, reduced_acces, model, y, y_pred, iteration, task): 40 | 41 | self.add_scalar('validation.loss.%s'%task, reduced_loss, iteration) 42 | self.add_scalar("validation.loss.%s.recon"%task, reduced_losses[0], iteration) 43 | self.add_scalar("validation.loss.%s.recon_post"%task, reduced_losses[1], iteration) 44 | self.add_scalar("validation.loss.%s.stop"%task, reduced_losses[2], iteration) 45 | self.add_scalar("validation.loss.%s.contr"%task, reduced_losses[3], iteration) 46 | self.add_scalar("validation.loss.%s.spenc"%task, reduced_losses[4], iteration) 47 | self.add_scalar("validation.loss.%s.spcla"%task, reduced_losses[5], iteration) 48 | self.add_scalar("validation.loss.%s.texcl"%task, reduced_losses[6], iteration) 49 | self.add_scalar("validation.loss.%s.spadv"%task, reduced_losses[7], iteration) 50 | 51 | self.add_scalar('validation.acc.%s.spenc'%task, reduced_acces[0], iteration) 52 | self.add_scalar('validation.acc.%s.spcla'%task, reduced_acces[1], iteration) 53 | self.add_scalar('validatoin.acc.%s.texcl'%task, reduced_acces[2], iteration) 54 | 55 | predicted_mel, post_output, predicted_stop, alignments, \ 56 | text_hidden, mel_hidden, text_logit_from_mel_hidden, \ 57 | audio_seq2seq_alignments, \ 58 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 59 | text_lengths, mel_lengths = y_pred 60 | 61 | text_target, mel_target, spc_target, speaker_target, stop_target = y 62 | 63 | stop_target = stop_target.reshape(stop_target.size(0), -1, int(stop_target.size(1)/predicted_stop.size(1))) 64 | stop_target = stop_target[:,:,0] 65 | 66 | # plot distribution of parameters 67 | #for tag, value in model.named_parameters(): 68 | # tag = tag.replace('.', '/') 69 | # self.add_histogram(tag, value.data.cpu().numpy(), iteration) 70 | 71 | # plot alignment, mel target and predicted, stop target and predicted 72 | idx = random.randint(0, alignments.size(0) - 1) 73 | 74 | alignments = alignments.data.cpu().numpy() 75 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy() 76 | 77 | self.add_image( 78 | "%s.alignment"%task, 79 | plot_alignment_to_numpy(alignments[idx].T), 80 | iteration, dataformats='HWC') 81 | 82 | # plot more alignments 83 | plot_alignment(alignments[:4], self.ali_path+'/step-%d-%s.pdf'%(iteration, task)) 84 | 85 | self.add_image( 86 | "%s.audio_seq2seq_alignment"%task, 87 | plot_alignment_to_numpy(audio_seq2seq_alignments[idx].T), 88 | iteration, dataformats='HWC') 89 | 90 | self.add_image( 91 | "%s.mel_target"%task, 92 | plot_spectrogram_to_numpy(mel_target[idx].data.cpu().numpy()), 93 | iteration, dataformats='HWC') 94 | 95 | self.add_image( 96 | "%s.mel_predicted"%task, 97 | plot_spectrogram_to_numpy(predicted_mel[idx].data.cpu().numpy()), 98 | iteration, dataformats='HWC') 99 | 100 | self.add_image( 101 | "%s.spc_target"%task, 102 | plot_spectrogram_to_numpy(spc_target[idx].data.cpu().numpy()), 103 | iteration, 
dataformats='HWC') 104 | 105 | self.add_image( 106 | "%s.post_predicted"%task, 107 | plot_spectrogram_to_numpy(post_output[idx].data.cpu().numpy()), 108 | iteration, dataformats='HWC') 109 | 110 | self.add_image( 111 | "%s.stop"%task, 112 | plot_gate_outputs_to_numpy( 113 | stop_target[idx].data.cpu().numpy(), 114 | F.sigmoid(predicted_stop[idx]).data.cpu().numpy()), 115 | iteration, dataformats='HWC') 116 | -------------------------------------------------------------------------------- /codes/lstm_test.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | #import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import matplotlib.pyplot as plt 7 | import torch 8 | import torch.nn as nn 9 | from sklearn.metrics import confusion_matrix 10 | import pickle 11 | 12 | class TimeDistributed(nn.Module): 13 | def __init__(self, module): 14 | super(TimeDistributed, self).__init__() 15 | self.module = module 16 | 17 | def forward(self, x): 18 | 19 | if len(x.size()) <= 2: 20 | return self.module(x) 21 | # squash samples and timesteps into a single axis 22 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 23 | x_reshape = x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 24 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 25 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 26 | else: # (samples,timesteps,inp1,inp2,inp3) 27 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 28 | 29 | y = self.module(x_reshape) 30 | 31 | # we have to reshape Y 32 | if len(x.size()) == 3: 33 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 34 | elif len(x.size()) == 4: 35 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 36 | else: 37 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 38 | y.size(3)) # (samples, timesteps, out1,out2, out3) 39 | return y 40 | 41 | class HybridModel(nn.Module): 42 | def __init__(self,num_emotions): 43 | super().__init__() 44 | # conv block 45 | self.conv2Dblock = nn.Sequential( 46 | # 1. conv block 47 | TimeDistributed(nn.Conv2d(in_channels=1, 48 | out_channels=16, 49 | kernel_size=3, 50 | stride=1, 51 | padding=1 52 | )), 53 | TimeDistributed(nn.BatchNorm2d(16)), 54 | TimeDistributed(nn.ReLU()), 55 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 56 | TimeDistributed(nn.Dropout(p=0.3)), 57 | # 2. conv block 58 | TimeDistributed(nn.Conv2d(in_channels=16, 59 | out_channels=32, 60 | kernel_size=3, 61 | stride=1, 62 | padding=1 63 | )), 64 | TimeDistributed(nn.BatchNorm2d(32)), 65 | TimeDistributed(nn.ReLU()), 66 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 67 | TimeDistributed(nn.Dropout(p=0.3)), 68 | # 3. 
conv block 69 | TimeDistributed(nn.Conv2d(in_channels=32, 70 | out_channels=64, 71 | kernel_size=3, 72 | stride=1, 73 | padding=1 74 | )), 75 | TimeDistributed(nn.BatchNorm2d(64)), 76 | TimeDistributed(nn.ReLU()), 77 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 78 | TimeDistributed(nn.Dropout(p=0.3)) 79 | ) 80 | # LSTM block 81 | hidden_size = 32 82 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 83 | self.dropout_lstm = nn.Dropout(p=0.4) 84 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 85 | # Linear softmax layer 86 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 87 | def forward(self,x): 88 | conv_embedding = self.conv2Dblock(x) 89 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 90 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 91 | lstm_embedding = self.dropout_lstm(lstm_embedding) 92 | # lstm_embedding (batch, time, hidden_size*2) 93 | batch_size,T,_ = lstm_embedding.shape 94 | attention_weights = [None]*T 95 | for t in range(T): 96 | embedding = lstm_embedding[:,t,:] 97 | attention_weights[t] = self.attention_linear(embedding) 98 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 99 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 100 | attention = torch.squeeze(attention, 1) 101 | output_logits = self.out_linear(attention) 102 | output_softmax = nn.functional.softmax(output_logits,dim=1) 103 | return output_logits, output_softmax, attention 104 | -------------------------------------------------------------------------------- /codes/model/__init__.py: -------------------------------------------------------------------------------- 1 | from .model import Parrot 2 | from .loss import ParrotLoss 3 | from .utils import lcm,gcd -------------------------------------------------------------------------------- /codes/model/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/basic_layers.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/basic_layers.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/beam.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/beam.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/decoder.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/decoder.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/layers.cpython-36.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/layers.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/loss.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/loss.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/lstm_test.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/lstm_test.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/model.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/model.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/penalties.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/penalties.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/utils.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/utils.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/basic_layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.nn import functional as F 4 | 5 | 6 | def tile(x, count, dim=0): 7 | """ 8 | Tiles x on dimension dim count times. 
9 | """ 10 | perm = list(range(len(x.size()))) 11 | if dim != 0: 12 | perm[0], perm[dim] = perm[dim], perm[0] 13 | x = x.permute(perm).contiguous() 14 | out_size = list(x.size()) 15 | out_size[0] *= count 16 | batch = x.size(0) 17 | x = x.view(batch, -1) \ 18 | .transpose(0, 1) \ 19 | .repeat(count, 1) \ 20 | .transpose(0, 1) \ 21 | .contiguous() \ 22 | .view(*out_size) 23 | if dim != 0: 24 | x = x.permute(perm).contiguous() 25 | return x 26 | 27 | 28 | def sort_batch(data, lengths): 29 | ''' 30 | sort data by length 31 | sorted_data[initial_index] == data 32 | ''' 33 | sorted_lengths, sorted_index = lengths.sort(0, descending=True) 34 | sorted_data = data[sorted_index] 35 | _, initial_index = sorted_index.sort(0, descending=False) 36 | 37 | return sorted_data, sorted_lengths, initial_index 38 | 39 | 40 | class LinearNorm(torch.nn.Module): 41 | def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'): 42 | super(LinearNorm, self).__init__() 43 | self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias) 44 | 45 | torch.nn.init.xavier_uniform_( 46 | self.linear_layer.weight, 47 | gain=torch.nn.init.calculate_gain(w_init_gain)) 48 | 49 | def forward(self, x): 50 | return self.linear_layer(x) 51 | 52 | 53 | class ConvNorm(torch.nn.Module): 54 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 55 | padding=None, dilation=1, bias=True, w_init_gain='linear', param=None): 56 | super(ConvNorm, self).__init__() 57 | if padding is None: 58 | assert(kernel_size % 2 == 1) 59 | padding = int(dilation * (kernel_size - 1) / 2) 60 | 61 | self.conv = torch.nn.Conv1d(in_channels, out_channels, 62 | kernel_size=kernel_size, stride=stride, 63 | padding=padding, dilation=dilation, 64 | bias=bias) 65 | 66 | torch.nn.init.xavier_uniform_( 67 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain, param=param)) 68 | 69 | def forward(self, signal): 70 | conv_signal = self.conv(signal) 71 | return conv_signal 72 | 73 | 74 | class Prenet(nn.Module): 75 | def __init__(self, in_dim, sizes): 76 | super(Prenet, self).__init__() 77 | in_sizes = [in_dim] + sizes[:-1] 78 | self.layers = nn.ModuleList( 79 | [LinearNorm(in_size, out_size, bias=False) 80 | for (in_size, out_size) in zip(in_sizes, sizes)]) 81 | 82 | def forward(self, x): 83 | for linear in self.layers: 84 | x = F.dropout(F.relu(linear(x)), p=0.5, training=True) 85 | return x 86 | 87 | 88 | class LocationLayer(nn.Module): 89 | def __init__(self, attention_n_filters, attention_kernel_size, 90 | attention_dim): 91 | super(LocationLayer, self).__init__() 92 | padding = int((attention_kernel_size - 1) / 2) 93 | self.location_conv = ConvNorm(2, attention_n_filters, 94 | kernel_size=attention_kernel_size, 95 | padding=padding, bias=False, stride=1, 96 | dilation=1) 97 | self.location_dense = LinearNorm(attention_n_filters, attention_dim, 98 | bias=False, w_init_gain='tanh') 99 | 100 | def forward(self, attention_weights_cat): 101 | processed_attention = self.location_conv(attention_weights_cat) 102 | processed_attention = processed_attention.transpose(1, 2) 103 | processed_attention = self.location_dense(processed_attention) 104 | return processed_attention 105 | 106 | 107 | class Attention(nn.Module): 108 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 109 | attention_location_n_filters, attention_location_kernel_size): 110 | super(Attention, self).__init__() 111 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 112 | bias=False, w_init_gain='tanh') 113 | self.memory_layer = 
LinearNorm(embedding_dim, attention_dim, bias=False, 114 | w_init_gain='tanh') 115 | self.v = LinearNorm(attention_dim, 1, bias=False) 116 | self.location_layer = LocationLayer(attention_location_n_filters, 117 | attention_location_kernel_size, 118 | attention_dim) 119 | self.score_mask_value = -float("inf") 120 | 121 | def get_alignment_energies(self, query, processed_memory, 122 | attention_weights_cat): 123 | """ 124 | PARAMS 125 | ------ 126 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 127 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 128 | attention_weights_cat: cumulative and prev. att weights (B, 2, max_time) 129 | RETURNS 130 | ------- 131 | alignment (batch, max_time) 132 | """ 133 | 134 | processed_query = self.query_layer(query.unsqueeze(1)) 135 | processed_attention_weights = self.location_layer(attention_weights_cat) 136 | energies = self.v(torch.tanh( 137 | processed_query + processed_attention_weights + processed_memory)) 138 | 139 | energies = energies.squeeze(-1) 140 | return energies 141 | 142 | def forward(self, attention_hidden_state, memory, processed_memory, 143 | attention_weights_cat, mask): 144 | """ 145 | PARAMS 146 | ------ 147 | attention_hidden_state: attention rnn last output 148 | memory: encoder outputs 149 | processed_memory: processed encoder outputs 150 | attention_weights_cat: previous and cummulative attention weights 151 | mask: binary mask for padded data 152 | """ 153 | alignment = self.get_alignment_energies( 154 | attention_hidden_state, processed_memory, attention_weights_cat) 155 | 156 | if mask is not None: 157 | alignment.data.masked_fill_(mask, self.score_mask_value) 158 | 159 | attention_weights = F.softmax(alignment, dim=1) 160 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 161 | attention_context = attention_context.squeeze(1) 162 | 163 | return attention_context, attention_weights 164 | 165 | 166 | class ForwardAttentionV2(nn.Module): 167 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 168 | attention_location_n_filters, attention_location_kernel_size): 169 | super(ForwardAttentionV2, self).__init__() 170 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 171 | bias=False, w_init_gain='tanh') 172 | self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False, 173 | w_init_gain='tanh') 174 | self.v = LinearNorm(attention_dim, 1, bias=False) 175 | self.location_layer = LocationLayer(attention_location_n_filters, 176 | attention_location_kernel_size, 177 | attention_dim) 178 | self.score_mask_value = -float(1e20) 179 | 180 | def get_alignment_energies(self, query, processed_memory, 181 | attention_weights_cat): 182 | """ 183 | PARAMS 184 | ------ 185 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 186 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 187 | attention_weights_cat: prev. 
and cumulative att weights (B, 2, max_time) 188 | RETURNS 189 | ------- 190 | alignment (batch, max_time) 191 | """ 192 | 193 | processed_query = self.query_layer(query.unsqueeze(1)) 194 | processed_attention_weights = self.location_layer(attention_weights_cat) 195 | energies = self.v(torch.tanh( 196 | processed_query + processed_attention_weights + processed_memory)) 197 | 198 | energies = energies.squeeze(-1) 199 | return energies 200 | 201 | def forward(self, attention_hidden_state, memory, processed_memory, 202 | attention_weights_cat, mask, log_alpha): 203 | """ 204 | PARAMS 205 | ------ 206 | attention_hidden_state: attention rnn last output 207 | memory: encoder outputs 208 | processed_memory: processed encoder outputs 209 | attention_weights_cat: previous and cummulative attention weights 210 | mask: binary mask for padded data 211 | """ 212 | log_energy = self.get_alignment_energies( 213 | attention_hidden_state, processed_memory, attention_weights_cat) 214 | 215 | #log_energy = 216 | 217 | if mask is not None: 218 | log_energy.data.masked_fill_(mask, self.score_mask_value) 219 | 220 | #attention_weights = F.softmax(alignment, dim=1) 221 | 222 | #content_score = log_energy.unsqueeze(1) #[B, MAX_TIME] -> [B, 1, MAX_TIME] 223 | #log_alpha = log_alpha.unsqueeze(2) #[B, MAX_TIME] -> [B, MAX_TIME, 1] 224 | 225 | #log_total_score = log_alpha + content_score 226 | 227 | #previous_attention_weights = attention_weights_cat[:,0,:] 228 | 229 | log_alpha_shift_padded = [] 230 | max_time = log_energy.size(1) 231 | for sft in range(2): 232 | shifted = log_alpha[:,:max_time-sft] 233 | shift_padded = F.pad(shifted, (sft,0), 'constant', self.score_mask_value) 234 | log_alpha_shift_padded.append(shift_padded.unsqueeze(2)) 235 | 236 | biased = torch.logsumexp(torch.cat(log_alpha_shift_padded,2), 2) 237 | 238 | log_alpha_new = biased + log_energy 239 | 240 | attention_weights = F.softmax(log_alpha_new, dim=1) 241 | 242 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 243 | attention_context = attention_context.squeeze(1) 244 | 245 | return attention_context, attention_weights, log_alpha_new -------------------------------------------------------------------------------- /codes/model/beam.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | import torch 4 | from .penalties import PenaltyBuilder 5 | 6 | 7 | 8 | class Beam(object): 9 | """ 10 | ''' 11 | adapt from opennmt 12 | https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/beam.py 13 | ''' 14 | 15 | Class for managing the internals of the beam search process. 16 | Takes care of beams, back pointers, and scores. 17 | Args: 18 | size (int): beam size 19 | pad, bos, eos (int): indices of padding, beginning, and ending. 20 | n_best (int): nbest size to use 21 | cuda (bool): use gpu 22 | global_scorer (:obj:`GlobalScorer`) 23 | """ 24 | 25 | def __init__(self, size, pad, bos, eos, 26 | n_best=1, cuda=False, 27 | global_scorer=None, 28 | min_length=0, 29 | stepwise_penalty=False, 30 | block_ngram_repeat=0, 31 | exclusion_tokens=set()): 32 | 33 | self.size = size 34 | self.tt = torch.cuda if cuda else torch 35 | 36 | # The score for each translation on the beam. 37 | self.scores = self.tt.FloatTensor(size).zero_() 38 | self.all_scores = [] 39 | 40 | # The backpointers at each time-step. 41 | self.prev_ks = [] 42 | 43 | # The outputs at each time-step. 
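# Note: next_ys[t] stores the token chosen by each of the `size` beams at step t;
# the search is seeded with a single BOS in slot 0 and padding in the remaining slots.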
44 | self.next_ys = [self.tt.LongTensor(size) 45 | .fill_(pad)] 46 | self.next_ys[0][0] = bos 47 | 48 | # Has EOS topped the beam yet. 49 | self._eos = eos 50 | self.eos_top = False 51 | 52 | # The attentions (matrix) for each time. 53 | self.attn = [] 54 | self.hidden = [] 55 | 56 | # Time and k pair for finished. 57 | self.finished = [] 58 | self.n_best = n_best 59 | 60 | # Information for global scoring. 61 | self.global_scorer = global_scorer 62 | self.global_state = {} 63 | 64 | # Minimum prediction length 65 | self.min_length = min_length 66 | 67 | # Apply Penalty at every step 68 | self.stepwise_penalty = stepwise_penalty 69 | self.block_ngram_repeat = block_ngram_repeat 70 | self.exclusion_tokens = exclusion_tokens 71 | 72 | def get_current_state(self): 73 | "Get the outputs for the current timestep." 74 | return self.next_ys[-1] 75 | 76 | def get_current_origin(self): 77 | "Get the backpointers for the current timestep." 78 | return self.prev_ks[-1] 79 | 80 | def advance(self, word_probs, attn_out, hidden): 81 | """ 82 | Given prob over words for every last beam `wordLk` and attention 83 | `attn_out`: Compute and update the beam search. 84 | Parameters: 85 | * `word_probs`- probs of advancing from the last step (K x words) 86 | * `attn_out`- attention at the last step 87 | Returns: True if beam search is complete. 88 | """ 89 | num_words = word_probs.size(1) 90 | if self.stepwise_penalty: 91 | self.global_scorer.update_score(self, attn_out) 92 | # force the output to be longer than self.min_length 93 | cur_len = len(self.next_ys) 94 | if cur_len < self.min_length: 95 | for k in range(len(word_probs)): 96 | word_probs[k][self._eos] = -1e20 97 | # Sum the previous scores. 98 | if len(self.prev_ks) > 0: 99 | beam_scores = word_probs + self.scores.unsqueeze(1) 100 | # Don't let EOS have children. 
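# Note: any beam whose latest token is EOS is frozen here with a score of -1e20 so it
# cannot be extended further; it is still recorded as a finished hypothesis later in
# advance() when it reaches the top of the beam.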
101 | for i in range(self.next_ys[-1].size(0)): 102 | if self.next_ys[-1][i] == self._eos: 103 | beam_scores[i] = -1e20 104 | 105 | # Block ngram repeats 106 | if self.block_ngram_repeat > 0: 107 | ngrams = [] 108 | le = len(self.next_ys) 109 | for j in range(self.next_ys[-1].size(0)): 110 | hyp, _ = self.get_hyp(le - 1, j) 111 | ngrams = set() 112 | fail = False 113 | gram = [] 114 | for i in range(le - 1): 115 | # Last n tokens, n = block_ngram_repeat 116 | gram = (gram + 117 | [hyp[i].item()])[-self.block_ngram_repeat:] 118 | # Skip the blocking if it is in the exclusion list 119 | if set(gram) & self.exclusion_tokens: 120 | continue 121 | if tuple(gram) in ngrams: 122 | fail = True 123 | ngrams.add(tuple(gram)) 124 | if fail: 125 | beam_scores[j] = -10e20 126 | else: 127 | beam_scores = word_probs[0] 128 | flat_beam_scores = beam_scores.view(-1) 129 | best_scores, best_scores_id = flat_beam_scores.topk(self.size, 0, 130 | True, True) 131 | 132 | self.all_scores.append(self.scores) 133 | self.scores = best_scores 134 | 135 | # best_scores_id is flattened beam x word array, so calculate which 136 | # word and beam each score came from 137 | prev_k = best_scores_id / num_words 138 | self.prev_ks.append(prev_k) 139 | self.next_ys.append((best_scores_id - prev_k * num_words)) 140 | self.attn.append(attn_out.index_select(0, prev_k)) 141 | self.hidden.append(hidden.index_select(0, prev_k)) 142 | self.global_scorer.update_global_state(self) 143 | 144 | for i in range(self.next_ys[-1].size(0)): 145 | if self.next_ys[-1][i] == self._eos: 146 | global_scores = self.global_scorer.score(self, self.scores) 147 | s = global_scores[i] 148 | self.finished.append((s, len(self.next_ys) - 1, i)) 149 | 150 | # End condition is when top-of-beam is EOS and no global score. 151 | if self.next_ys[-1][0] == self._eos: 152 | self.all_scores.append(self.scores) 153 | self.eos_top = True 154 | 155 | def done(self): 156 | return self.eos_top and len(self.finished) >= self.n_best 157 | 158 | def sort_finished(self, minimum=None): 159 | if minimum is not None: 160 | i = 0 161 | # Add from beam until we have minimum outputs. 162 | while len(self.finished) < minimum: 163 | global_scores = self.global_scorer.score(self, self.scores) 164 | s = global_scores[i] 165 | self.finished.append((s, len(self.next_ys) - 1, i)) 166 | i += 1 167 | 168 | self.finished.sort(key=lambda a: -a[0]) 169 | scores = [sc for sc, _, _ in self.finished] 170 | ks = [(t, k) for _, t, k in self.finished] 171 | return scores, ks 172 | 173 | def get_hyp(self, timestep, k): 174 | """ 175 | Walk back to construct the full hypothesis. 176 | """ 177 | hyp, attn, hidden = [], [], [] 178 | for j in range(len(self.prev_ks[:timestep]) - 1, -1, -1): 179 | hyp.append(self.next_ys[j + 1][k]) 180 | attn.append(self.attn[j][k]) 181 | hidden.append(self.hidden[j][k]) 182 | k = self.prev_ks[j][k] 183 | return torch.stack(hyp[::-1]), torch.stack(attn[::-1]), torch.stack(hidden[::-1]) 184 | 185 | 186 | class GNMTGlobalScorer(object): 187 | """ 188 | NMT re-ranking score from 189 | "Google's Neural Machine Translation System" :cite:`wu2016google` 190 | Args: 191 | alpha (float): length parameter 192 | beta (float): coverage parameter 193 | """ 194 | 195 | def __init__(self, opt=None): 196 | self.alpha = 0. 197 | self.beta = 0. 
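# Note: with alpha = beta = 0 and the builder below configured as ('none', 'avg'),
# no coverage penalty is applied and hypothesis scores are simply divided by the
# length of the hypothesis so far (see penalties.py).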
198 | penalty_builder = PenaltyBuilder('none', 199 | 'avg') 200 | # Term will be subtracted from probability 201 | self.cov_penalty = penalty_builder.coverage_penalty() 202 | # Probability will be divided by this 203 | self.length_penalty = penalty_builder.length_penalty() 204 | 205 | def score(self, beam, logprobs): 206 | """ 207 | Rescores a prediction based on penalty functions 208 | """ 209 | normalized_probs = self.length_penalty(beam, 210 | logprobs, 211 | self.alpha) 212 | if not beam.stepwise_penalty: 213 | penalty = self.cov_penalty(beam, 214 | beam.global_state["coverage"], 215 | self.beta) 216 | normalized_probs -= penalty 217 | 218 | return normalized_probs 219 | 220 | def update_score(self, beam, attn): 221 | """ 222 | Function to update scores of a Beam that is not finished 223 | """ 224 | if "prev_penalty" in list(beam.global_state.keys()): 225 | beam.scores.add_(beam.global_state["prev_penalty"]) 226 | penalty = self.cov_penalty(beam, 227 | beam.global_state["coverage"] + attn, 228 | self.beta) 229 | beam.scores.sub_(penalty) 230 | 231 | def update_global_state(self, beam): 232 | "Keeps the coverage vector as sum of attentions" 233 | if len(beam.prev_ks) == 1: 234 | beam.global_state["prev_penalty"] = beam.scores.clone().fill_(0.0) 235 | beam.global_state["coverage"] = beam.attn[-1] 236 | self.cov_total = beam.attn[-1].sum(1) 237 | else: 238 | self.cov_total += torch.min(beam.attn[-1], 239 | beam.global_state['coverage']).sum(1) 240 | beam.global_state["coverage"] = beam.global_state["coverage"] \ 241 | .index_select(0, beam.prev_ks[-1]).add(beam.attn[-1]) 242 | 243 | prev_penalty = self.cov_penalty(beam, 244 | beam.global_state["coverage"], 245 | self.beta) 246 | beam.global_state["prev_penalty"] = prev_penalty -------------------------------------------------------------------------------- /codes/model/decoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | from torch import nn 4 | from torch.nn import functional as F 5 | from .basic_layers import ConvNorm, LinearNorm, ForwardAttentionV2, Prenet 6 | from .utils import get_mask_from_lengths 7 | 8 | 9 | class Decoder(nn.Module): 10 | def __init__(self, hparams): 11 | super(Decoder, self).__init__() 12 | self.n_mel_channels = hparams.n_mel_channels 13 | self.n_frames_per_step = hparams.n_frames_per_step_decoder 14 | #self.hidden_cat_dim = hparams.encoder_embedding_dim + hparams.speaker_embedding_dim 15 | self.hidden_cat_dim = hparams.encoder_embedding_dim + 128 16 | self.attention_rnn_dim = hparams.attention_rnn_dim 17 | self.decoder_rnn_dim = hparams.decoder_rnn_dim 18 | self.prenet_dim = hparams.prenet_dim 19 | self.max_decoder_steps = hparams.max_decoder_steps 20 | self.stop_threshold = hparams.stop_threshold 21 | self.feed_back_last = hparams.feed_back_last 22 | 23 | if hparams.feed_back_last: 24 | prenet_input_dim = hparams.n_mel_channels 25 | else: 26 | prenet_input_dim = hparams.n_mel_channels * hparams.n_frames_per_step_decoder 27 | 28 | self.prenet = Prenet( 29 | prenet_input_dim , 30 | hparams.prenet_dim) 31 | 32 | self.attention_rnn = nn.LSTMCell( 33 | hparams.prenet_dim[-1] + self.hidden_cat_dim, 34 | hparams.attention_rnn_dim) 35 | 36 | self.attention_layer = ForwardAttentionV2( 37 | hparams.attention_rnn_dim, 38 | self.hidden_cat_dim, 39 | hparams.attention_dim, hparams.attention_location_n_filters, 40 | hparams.attention_location_kernel_size) 41 | 42 | self.decoder_rnn = nn.LSTMCell( 43 | self.hidden_cat_dim + 
hparams.attention_rnn_dim, 44 | hparams.decoder_rnn_dim) 45 | 46 | self.linear_projection = LinearNorm( 47 | self.hidden_cat_dim + hparams.decoder_rnn_dim, 48 | hparams.n_mel_channels * hparams.n_frames_per_step_decoder) 49 | 50 | self.stop_layer = LinearNorm( 51 | self.hidden_cat_dim + hparams.decoder_rnn_dim, 1, 52 | bias=True, w_init_gain='sigmoid') 53 | 54 | def get_go_frame(self, memory): 55 | """ Gets all zeros frames to use as first decoder input 56 | PARAMS 57 | ------ 58 | memory: decoder outputs 59 | RETURNS 60 | ------- 61 | decoder_input: all zeros frames 62 | """ 63 | B = memory.size(0) 64 | if self.feed_back_last: 65 | input_dim = self.n_mel_channels 66 | else: 67 | input_dim = self.n_mel_channels * self.n_frames_per_step 68 | 69 | decoder_input = Variable(memory.data.new( 70 | B, input_dim).zero_()) 71 | return decoder_input 72 | 73 | def initialize_decoder_states(self, memory, mask): 74 | """ Initializes attention rnn states, decoder rnn states, attention 75 | weights, attention cumulative weights, attention context, stores memory 76 | and stores processed memory 77 | PARAMS 78 | ------ 79 | memory: Encoder outputs 80 | mask: Mask for padded data if training, expects None for inference 81 | """ 82 | B = memory.size(0) 83 | MAX_TIME = memory.size(1) 84 | 85 | self.attention_hidden = Variable(memory.data.new( 86 | B, self.attention_rnn_dim).zero_()) 87 | self.attention_cell = Variable(memory.data.new( 88 | B, self.attention_rnn_dim).zero_()) 89 | 90 | self.decoder_hidden = Variable(memory.data.new( 91 | B, self.decoder_rnn_dim).zero_()) 92 | self.decoder_cell = Variable(memory.data.new( 93 | B, self.decoder_rnn_dim).zero_()) 94 | 95 | self.attention_weights = Variable(memory.data.new( 96 | B, MAX_TIME).zero_()) 97 | self.attention_weights_cum = Variable(memory.data.new( 98 | B, MAX_TIME).zero_()) 99 | self.attention_context = Variable(memory.data.new( 100 | B, self.hidden_cat_dim).zero_()) 101 | 102 | self.log_alpha = Variable(memory.data.new(B, MAX_TIME).fill_(-float(1e20))) 103 | self.log_alpha[:, 0].fill_(0.) 104 | 105 | self.memory = memory 106 | self.processed_memory = self.attention_layer.memory_layer(memory) 107 | self.mask = mask 108 | 109 | def parse_decoder_inputs(self, decoder_inputs): 110 | """ Prepares decoder inputs, i.e. mel outputs 111 | PARAMS 112 | ------ 113 | decoder_inputs: inputs used for teacher-forced training, i.e. 
mel-specs 114 | RETURNS 115 | ------- 116 | inputs: processed decoder inputs 117 | """ 118 | # (B, n_mel_channels, T_out) -> (B, T_out, n_mel_channels) 119 | decoder_inputs = decoder_inputs.transpose(1, 2) 120 | decoder_inputs = decoder_inputs.reshape( 121 | decoder_inputs.size(0), 122 | int(decoder_inputs.size(1)/self.n_frames_per_step), -1) 123 | # (B, T_out, n_mel_channels) -> (T_out, B, n_mel_channels) 124 | decoder_inputs = decoder_inputs.transpose(0, 1) 125 | if self.feed_back_last: 126 | decoder_inputs = decoder_inputs[:,:,-self.n_mel_channels:] 127 | 128 | return decoder_inputs 129 | 130 | def parse_decoder_outputs(self, mel_outputs, stop_outputs, alignments): 131 | """ Prepares decoder outputs for output 132 | PARAMS 133 | ------ 134 | mel_outputs: 135 | stop_outputs: stop output energies 136 | alignments: 137 | RETURNS 138 | ------- 139 | mel_outputs: 140 | stop_outpust: stop output energies 141 | alignments: 142 | """ 143 | # (T_out, B, MAX_TIME) -> (B, T_out, MAX_TIME) 144 | alignments = torch.stack(alignments).transpose(0, 1) 145 | # (T_out, B) -> (B, T_out) 146 | if alignments.size(0) == 1: 147 | stop_outputs = torch.stack(stop_outputs).unsqueeze(0) 148 | else: 149 | stop_outputs = torch.stack(stop_outputs).transpose(0, 1) 150 | stop_outputs = stop_outputs.contiguous() 151 | # (T_out, B, n_mel_channels) -> (B, T_out, n_mel_channels) 152 | mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous() 153 | # decouple frames per step 154 | mel_outputs = mel_outputs.view( 155 | mel_outputs.size(0), -1, self.n_mel_channels) 156 | # (B, T_out, n_mel_channels) -> (B, n_mel_channels, T_out) 157 | mel_outputs = mel_outputs.transpose(1, 2) 158 | 159 | return mel_outputs, stop_outputs, alignments 160 | 161 | def attend(self, decoder_input): 162 | cell_input = torch.cat((decoder_input, self.attention_context), -1) 163 | self.attention_hidden, self.attention_cell = self.attention_rnn( 164 | cell_input, (self.attention_hidden, self.attention_cell)) 165 | 166 | attention_weights_cat = torch.cat( 167 | (self.attention_weights.unsqueeze(1), 168 | self.attention_weights_cum.unsqueeze(1)), dim=1) 169 | 170 | self.attention_context, self.attention_weights, self.log_alpha = self.attention_layer( 171 | self.attention_hidden, self.memory, self.processed_memory, 172 | attention_weights_cat, self.mask, self.log_alpha) 173 | 174 | self.attention_weights_cum += self.attention_weights 175 | 176 | decoder_rnn_input = torch.cat( 177 | (self.attention_hidden, self.attention_context), -1) 178 | 179 | return decoder_rnn_input, self.attention_context, self.attention_weights 180 | 181 | def decode(self, decoder_input): 182 | 183 | self.decoder_hidden, self.decoder_cell = self.decoder_rnn( 184 | decoder_input, (self.decoder_hidden, self.decoder_cell)) 185 | 186 | return self.decoder_hidden 187 | 188 | def forward(self, memory, decoder_inputs, memory_lengths): 189 | """ Decoder forward pass for training 190 | PARAMS 191 | ------ 192 | memory: Encoder outputs [B, encoder_max_time, hidden_dim] 193 | decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs [B, mel_bin, T] 194 | memory_lengths: Encoder output lengths for attention masking. 
[B] 195 | RETURNS 196 | ------- 197 | mel_outputs: mel outputs from the decoder [B, mel_bin, T] 198 | stop_outputs: stop outputs from the decoder [B, T/r] 199 | alignments: sequence of attention weights from the decoder [B, T/r, encoder_max_time] 200 | """ 201 | 202 | decoder_input = self.get_go_frame(memory).unsqueeze(0) 203 | decoder_inputs = self.parse_decoder_inputs(decoder_inputs) 204 | decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0) 205 | decoder_inputs = self.prenet(decoder_inputs) # [T/r + 1, B, prenet_dim ] 206 | 207 | self.initialize_decoder_states( 208 | memory, mask=~get_mask_from_lengths(memory_lengths)) 209 | 210 | mel_outputs, stop_outputs, alignments = [], [], [] 211 | while len(mel_outputs) < decoder_inputs.size(0) - 1: 212 | decoder_input = decoder_inputs[len(mel_outputs)] 213 | 214 | decoder_rnn_input, context, attention_weights = self.attend(decoder_input) 215 | 216 | decoder_rnn_output = self.decode(decoder_rnn_input) 217 | 218 | decoder_hidden_attention_context = torch.cat( 219 | (decoder_rnn_output, context), dim=1) 220 | 221 | mel_output = self.linear_projection(decoder_hidden_attention_context) 222 | stop_output = self.stop_layer(decoder_hidden_attention_context) 223 | 224 | mel_outputs += [mel_output.squeeze(1)] #? perhaps don't need squeeze 225 | stop_outputs += [stop_output.squeeze()] 226 | alignments += [attention_weights] 227 | 228 | mel_outputs, stop_outputs, alignments = self.parse_decoder_outputs( 229 | mel_outputs, stop_outputs, alignments) 230 | 231 | return mel_outputs, stop_outputs, alignments 232 | 233 | def inference(self, memory): 234 | """ Decoder inference 235 | PARAMS 236 | ------ 237 | memory: Encoder outputs 238 | RETURNS 239 | ------- 240 | mel_outputs: mel outputs from the decoder 241 | stop_outputs: stop outputs from the decoder 242 | alignments: sequence of attention weights from the decoder 243 | """ 244 | decoder_input = self.get_go_frame(memory) 245 | 246 | self.initialize_decoder_states(memory, mask=None) 247 | 248 | mel_outputs, stop_outputs, alignments = [], [], [] 249 | while True: 250 | decoder_input = self.prenet(decoder_input) 251 | 252 | decoder_input_final, context, alignment = self.attend(decoder_input) 253 | 254 | #mel_output, stop_output, alignment = self.decode(decoder_input) 255 | decoder_rnn_output = self.decode(decoder_input_final) 256 | decoder_hidden_attention_context = torch.cat( 257 | (decoder_rnn_output, context), dim=1) 258 | 259 | mel_output = self.linear_projection(decoder_hidden_attention_context) 260 | stop_output = self.stop_layer(decoder_hidden_attention_context) 261 | 262 | mel_outputs += [mel_output.squeeze(1)] 263 | stop_outputs += [stop_output] 264 | alignments += [alignment] 265 | 266 | 267 | if torch.sigmoid(stop_output.data) > self.stop_threshold: 268 | break 269 | elif len(mel_outputs) == self.max_decoder_steps: 270 | print("Warning! 
Reached max decoder steps") 271 | break 272 | 273 | if self.feed_back_last: 274 | decoder_input = mel_output[:,-self.n_mel_channels:] 275 | else: 276 | decoder_input = mel_output 277 | 278 | mel_outputs, stop_outputs, alignments = self.parse_decoder_outputs( 279 | mel_outputs, stop_outputs, alignments) 280 | 281 | return mel_outputs, stop_outputs, alignments -------------------------------------------------------------------------------- /codes/model/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.nn import functional as F 4 | from .utils import get_mask_from_lengths 5 | #from train_ser import process_mel, process_post_output, perform_SER 6 | import python_speech_features as ps 7 | 8 | from sklearn.metrics import recall_score as recall 9 | from sklearn.metrics import confusion_matrix as confusion 10 | import numpy as np 11 | from lstm_test import HybridModel 12 | 13 | class ParrotLoss(nn.Module): 14 | def __init__(self, hparams): 15 | super(ParrotLoss, self).__init__() 16 | self.hidden_dim = hparams.encoder_embedding_dim 17 | self.ce_loss = hparams.ce_loss 18 | 19 | self.L1Loss = nn.L1Loss(reduction='none') 20 | self.MSELoss = nn.MSELoss(reduction='none') 21 | self.BCEWithLogitsLoss = nn.BCEWithLogitsLoss(reduction='none') 22 | self.CrossEntropyLoss = nn.CrossEntropyLoss(reduction='none') 23 | self.n_frames_per_step = hparams.n_frames_per_step_decoder 24 | self.eos = hparams.n_symbols 25 | self.predict_spectrogram = hparams.predict_spectrogram 26 | 27 | self.contr_w = hparams.contrastive_loss_w 28 | self.spenc_w = hparams.speaker_encoder_loss_w 29 | self.texcl_w = hparams.text_classifier_loss_w 30 | self.spadv_w = hparams.speaker_adversial_loss_w 31 | self.spcla_w = hparams.speaker_classifier_loss_w 32 | self.serloss_w = hparams.ser_loss_w 33 | self.emoloss_w = hparams.emo_loss_w 34 | 35 | 36 | 37 | 38 | 39 | def perform_SER(self, x, target, device): 40 | def splitIntoChunks(mel_spec,win_size,stride): 41 | mel_spec = mel_spec.T 42 | t = mel_spec.shape[1] 43 | num_of_chunks = int(t/stride) 44 | chunks = [] 45 | for i in range(num_of_chunks): 46 | chunk = mel_spec[:,i*stride:i*stride+win_size] 47 | if chunk.shape[1] == win_size: 48 | chunks.append(chunk) 49 | return np.stack(chunks,axis=0) 50 | 51 | def loss_fnc(predictions, targets): 52 | return nn.CrossEntropyLoss()(input=predictions,target=targets) 53 | 54 | def make_validate_fnc(model,loss_fnc): 55 | def validate(X,Y): 56 | with torch.no_grad(): 57 | model.eval() 58 | output_logits, output_softmax, emotion_embedding = model(X) 59 | predictions = torch.argmax(output_softmax,dim=1) 60 | a = torch.sum(Y==predictions).cpu().detach().numpy() 61 | b = int(len(Y)) 62 | acc = a/b 63 | #accuracy = torch.sum(Y==predictions)/float(len(Y)) 64 | accuracy = torch.tensor(acc,device=device).float() 65 | loss = loss_fnc(output_logits,Y) 66 | return loss.item(), accuracy, predictions, emotion_embedding 67 | return validate 68 | 69 | x = x.cpu().detach().numpy() 70 | x = x.astype(np.float64) 71 | target = target.cpu().detach().numpy() 72 | 73 | 74 | mel_test_chunked = [] 75 | for mel_spec in x: 76 | mel_spec = mel_spec.T 77 | time = mel_spec.shape[0] 78 | if time <= 500: 79 | mel_spec = np.pad(mel_spec, ((0, 500 - time), (0, 0)), 'constant', constant_values=0) 80 | else: 81 | mel_spec = mel_spec[:500,:] 82 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 83 | mel_test_chunked.append(chunks) 84 | 85 | X_test = np.stack(mel_test_chunked,axis=0) 86 | 
X_test = np.expand_dims(X_test,2) 87 | b,t,c,h,w = X_test.shape 88 | X_test = np.reshape(X_test, newshape=(b,-1)) 89 | X_test = np.reshape(X_test, newshape=(b,t,c,h,w)) 90 | 91 | 92 | Y_test = target.reshape(-1) 93 | #Y_test = Y_test.astype('int8') 94 | 95 | #LOAD_PATH = '/home/zhoukun/SER/lstm_ser/models/cnn_attention_lstm_model_64.pt' 96 | LOAD_PATH = '/home/zhoukun/SER/lstm_ser/models/cnn_attention_lstm_model_64_update_best.pt' 97 | model = HybridModel(num_emotions=4).to(device) 98 | model.load_state_dict(torch.load(LOAD_PATH, map_location=torch.device(device))) 99 | validate = make_validate_fnc(model,loss_fnc) 100 | X_test_tensor = torch.tensor(X_test,device=device).float() 101 | Y_test_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 102 | test_loss, test_acc, predictions, emotion_embedding = validate(X_test_tensor,Y_test_tensor) 103 | test_loss = torch.tensor(test_loss, device=device).float() 104 | 105 | return test_loss, test_acc, emotion_embedding 106 | 107 | 108 | 109 | def parse_targets(self, targets, text_lengths): 110 | ''' 111 | text_target [batch_size, text_len] 112 | mel_target [batch_size, mel_bins, T] 113 | spc_target [batch_size, spc_bins, T] 114 | speaker_target [batch_size] 115 | stop_target [batch_size, T] 116 | ''' 117 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = targets 118 | 119 | B = stop_target.size(0) 120 | stop_target = stop_target.reshape(B, -1, self.n_frames_per_step) 121 | stop_target = stop_target[:, :, 0] 122 | 123 | #padded = torch.tensor(text_target.data.new(B,1).zero_()) 124 | padded = text_target.data.new(B,1).zero_().clone().detach() 125 | text_target = torch.cat((text_target, padded), dim=-1) 126 | 127 | # adding the ending token for target 128 | for bid in range(B): 129 | text_target[bid, text_lengths[bid].item()] = self.eos 130 | 131 | return text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding 132 | 133 | def forward(self, model_outputs, targets, input_text, eps=1e-5): 134 | 135 | ''' 136 | predicted_mel [batch_size, mel_bins, T] 137 | predicted_stop [batch_size, T/r] 138 | alignment 139 | when input_text==True [batch_size, T/r, max_text_len] 140 | when input_text==False [batch_size, T/r, T/r] 141 | text_hidden [B, max_text_len, hidden_dim] 142 | mel_hidden [B, max_text_len, hidden_dim] 143 | text_logit_from_mel_hidden [B, max_text_len+1, n_symbols+1] 144 | speaker_logit_from_mel [B, n_speakers] 145 | speaker_logit_from_mel_hidden [B, max_text_len, n_speakers] 146 | text_lengths [B,] 147 | mel_lengths [B,] 148 | ''' 149 | predicted_mel, post_output, predicted_stop, alignments,\ 150 | text_hidden, mel_hidden, text_logit_from_mel_hidden, \ 151 | audio_seq2seq_alignments, \ 152 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 153 | text_lengths, mel_lengths,speaker_embedding = model_outputs 154 | 155 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = self.parse_targets(targets, text_lengths) 156 | 157 | #perform SER: 158 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 159 | 160 | ser_loss, ser_acc, emotion_embedding = self.perform_SER(post_output,speaker_target,device) 161 | 162 | 163 | 164 | ## get masks ## 165 | mel_mask = get_mask_from_lengths(mel_lengths, mel_target.size(2)).unsqueeze(1).expand(-1, mel_target.size(1), -1).float() 166 | spc_mask = get_mask_from_lengths(mel_lengths, mel_target.size(2)).unsqueeze(1).expand(-1, spc_target.size(1), -1).float() 167 | 168 | mel_step_lengths = 
torch.ceil(mel_lengths.float() / self.n_frames_per_step).long() 169 | stop_mask = get_mask_from_lengths(mel_step_lengths, 170 | int(mel_target.size(2)/self.n_frames_per_step)).float() # [B, T/r] 171 | text_mask = get_mask_from_lengths(text_lengths).float() 172 | text_mask_plus_one = get_mask_from_lengths(text_lengths + 1).float() 173 | 174 | # reconstruction loss # 175 | recon_loss = torch.sum(self.L1Loss(predicted_mel, mel_target) * mel_mask) / torch.sum(mel_mask) 176 | 177 | if self.predict_spectrogram: 178 | recon_loss_post = (self.L1Loss(post_output, spc_target) * spc_mask).sum() / spc_mask.sum() 179 | else: 180 | recon_loss_post = (self.L1Loss(post_output, mel_target) * mel_mask).sum() / torch.sum(mel_mask) 181 | 182 | stop_loss = torch.sum(self.BCEWithLogitsLoss(predicted_stop, stop_target) * stop_mask) / torch.sum(stop_mask) 183 | 184 | 185 | if self.contr_w == 0.: 186 | contrast_loss = torch.tensor(0.).cuda() 187 | else: 188 | # contrastive mask # 189 | contrast_mask1 = get_mask_from_lengths(text_lengths).unsqueeze(2).expand(-1, -1, mel_hidden.size(1)) # [B, text_len] -> [B, text_len, T/r] 190 | contrast_mask2 = get_mask_from_lengths(text_lengths).unsqueeze(1).expand(-1, text_hidden.size(1), -1) # [B, T/r] -> [B, text_len, T/r] 191 | contrast_mask = (contrast_mask1 & contrast_mask2).float() 192 | text_hidden_normed = text_hidden / (torch.norm(text_hidden, dim=2, keepdim=True) + eps) 193 | mel_hidden_normed = mel_hidden / (torch.norm(mel_hidden, dim=2, keepdim=True) + eps) 194 | 195 | # (x - y) ** 2 = x ** 2 + y ** 2 - 2xy 196 | distance_matrix_xx = torch.sum(text_hidden_normed ** 2, dim=2, keepdim=True) #[batch_size, text_len, 1] 197 | distance_matrix_yy = torch.sum(mel_hidden_normed ** 2, dim=2) 198 | distance_matrix_yy = distance_matrix_yy.unsqueeze(1) #[batch_size, 1, text_len] 199 | 200 | #[batch_size, text_len, text_len] 201 | distance_matrix_xy = torch.bmm(text_hidden_normed, torch.transpose(mel_hidden_normed, 1, 2)) 202 | distance_matrix = distance_matrix_xx + distance_matrix_yy - 2 * distance_matrix_xy 203 | 204 | TTEXT = distance_matrix.size(1) 205 | hard_alignments = torch.eye(TTEXT).cuda() 206 | contrast_loss = hard_alignments * distance_matrix + \ 207 | (1. - hard_alignments) * torch.max(1. 
- distance_matrix, torch.zeros_like(distance_matrix)) 208 | 209 | contrast_loss = torch.sum(contrast_loss * contrast_mask) / torch.sum(contrast_mask) 210 | 211 | n_speakers = speaker_logit_from_mel_hidden.size(2) 212 | TTEXT = speaker_logit_from_mel_hidden.size(1) 213 | n_symbols_plus_one = text_logit_from_mel_hidden.size(2) 214 | 215 | # speaker classification loss # 216 | speaker_encoder_loss = nn.CrossEntropyLoss()(speaker_logit_from_mel, speaker_target) 217 | _, predicted_speaker = torch.max(speaker_logit_from_mel,dim=1) 218 | speaker_encoder_acc = ((predicted_speaker == speaker_target).float()).sum() / float(speaker_target.size(0)) 219 | 220 | speaker_logit_flatten = speaker_logit_from_mel_hidden.reshape(-1, n_speakers) # -> [B* TTEXT, n_speakers] 221 | _, predicted_speaker = torch.max(speaker_logit_flatten, dim=1) 222 | speaker_target_flatten = speaker_target.unsqueeze(1).expand(-1, TTEXT).reshape(-1) 223 | speaker_classification_acc = ((predicted_speaker == speaker_target_flatten).float() * text_mask.reshape(-1)).sum() / text_mask.sum() 224 | loss = self.CrossEntropyLoss(speaker_logit_flatten, speaker_target_flatten) 225 | 226 | speaker_classification_loss = torch.sum(loss * text_mask.reshape(-1)) / torch.sum(text_mask) 227 | 228 | # text classification loss # 229 | text_logit_flatten = text_logit_from_mel_hidden.reshape(-1, n_symbols_plus_one) 230 | text_target_flatten = text_target.reshape(-1) 231 | _, predicted_text = torch.max(text_logit_flatten, dim=1) 232 | text_classification_acc = ((predicted_text == text_target_flatten).float()*text_mask_plus_one.reshape(-1)).sum()/text_mask_plus_one.sum() 233 | loss = self.CrossEntropyLoss(text_logit_flatten, text_target_flatten) 234 | text_classification_loss = torch.sum(loss * text_mask_plus_one.reshape(-1)) / torch.sum(text_mask_plus_one) 235 | 236 | # speaker adversival loss # 237 | flatten_target = 1. 
/ n_speakers * torch.ones_like(speaker_logit_flatten) 238 | loss = self.MSELoss(F.softmax(speaker_logit_flatten, dim=1), flatten_target) 239 | mask = text_mask.unsqueeze(2).expand(-1,-1, n_speakers).reshape(-1, n_speakers) 240 | 241 | #new = torch.cat([speaker_embedding,strength_embedding],-1) 242 | #m = torch.nn.Linear(256,128).to('cuda') 243 | #speaker_embedding = m(new) 244 | 245 | # RMSE loss 246 | emotion_embedding_loss = torch.sqrt(torch.mean(self.MSELoss(emotion_embedding, speaker_embedding)) + eps) 247 | 248 | if self.ce_loss: 249 | speaker_adversial_loss = - speaker_classification_loss 250 | else: 251 | speaker_adversial_loss = torch.sum(loss * mask) / torch.sum(mask) 252 | 253 | loss_list = [recon_loss, recon_loss_post, stop_loss, 254 | contrast_loss, speaker_encoder_loss, speaker_classification_loss, 255 | text_classification_loss, speaker_adversial_loss,ser_loss, emotion_embedding_loss] 256 | 257 | acc_list = [speaker_encoder_acc, speaker_classification_acc, text_classification_acc, ser_acc] 258 | 259 | 260 | combined_loss1 = recon_loss + recon_loss_post + stop_loss + self.contr_w * contrast_loss + \ 261 | self.spenc_w * speaker_encoder_loss + self.texcl_w * text_classification_loss + \ 262 | self.spadv_w * speaker_adversial_loss + self.serloss_w * ser_loss + self.emoloss_w * emotion_embedding_loss 263 | 264 | combined_loss2 = self.spcla_w * speaker_classification_loss + self.serloss_w * ser_loss + self.emoloss_w * emotion_embedding_loss 265 | 266 | return loss_list, acc_list, combined_loss1, combined_loss2 267 | 268 | -------------------------------------------------------------------------------- /codes/model/lstm_test.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import matplotlib.pyplot as plt 7 | import torch 8 | import torch.nn as nn 9 | from sklearn.metrics import confusion_matrix 10 | import pickle 11 | 12 | class TimeDistributed(nn.Module): 13 | def __init__(self, module): 14 | super(TimeDistributed, self).__init__() 15 | self.module = module 16 | 17 | def forward(self, x): 18 | 19 | if len(x.size()) <= 2: 20 | return self.module(x) 21 | # squash samples and timesteps into a single axis 22 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 23 | x_reshape = x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 24 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 25 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 26 | else: # (samples,timesteps,inp1,inp2,inp3) 27 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 28 | 29 | y = self.module(x_reshape) 30 | 31 | # we have to reshape Y 32 | if len(x.size()) == 3: 33 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 34 | elif len(x.size()) == 4: 35 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 36 | else: 37 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 38 | y.size(3)) # (samples, timesteps, out1,out2, out3) 39 | return y 40 | 41 | class HybridModel(nn.Module): 42 | def __init__(self,num_emotions): 43 | super().__init__() 44 | # conv block 45 | self.conv2Dblock = nn.Sequential( 46 | # 1. 
conv block 47 | TimeDistributed(nn.Conv2d(in_channels=1, 48 | out_channels=16, 49 | kernel_size=3, 50 | stride=1, 51 | padding=1 52 | )), 53 | TimeDistributed(nn.BatchNorm2d(16)), 54 | TimeDistributed(nn.ReLU()), 55 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 56 | TimeDistributed(nn.Dropout(p=0.3)), 57 | # 2. conv block 58 | TimeDistributed(nn.Conv2d(in_channels=16, 59 | out_channels=32, 60 | kernel_size=3, 61 | stride=1, 62 | padding=1 63 | )), 64 | TimeDistributed(nn.BatchNorm2d(32)), 65 | TimeDistributed(nn.ReLU()), 66 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 67 | TimeDistributed(nn.Dropout(p=0.3)), 68 | # 3. conv block 69 | TimeDistributed(nn.Conv2d(in_channels=32, 70 | out_channels=64, 71 | kernel_size=3, 72 | stride=1, 73 | padding=1 74 | )), 75 | TimeDistributed(nn.BatchNorm2d(64)), 76 | TimeDistributed(nn.ReLU()), 77 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 78 | TimeDistributed(nn.Dropout(p=0.3)) 79 | ) 80 | # LSTM block 81 | hidden_size = 32 82 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 83 | self.dropout_lstm = nn.Dropout(p=0.4) 84 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 85 | # Linear softmax layer 86 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 87 | def forward(self,x): 88 | conv_embedding = self.conv2Dblock(x) 89 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 90 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 91 | lstm_embedding = self.dropout_lstm(lstm_embedding) 92 | # lstm_embedding (batch, time, hidden_size*2) 93 | batch_size,T,_ = lstm_embedding.shape 94 | attention_weights = [None]*T 95 | for t in range(T): 96 | embedding = lstm_embedding[:,t,:] 97 | attention_weights[t] = self.attention_linear(embedding) 98 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 99 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 100 | attention = torch.squeeze(attention, 1) 101 | output_logits = self.out_linear(attention) 102 | output_softmax = nn.functional.softmax(output_logits,dim=1) 103 | return output_logits, output_softmax, attention 104 | -------------------------------------------------------------------------------- /codes/model/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.autograd import Variable 4 | from math import sqrt 5 | from .utils import to_gpu 6 | from .decoder import Decoder 7 | from .layers import SpeakerClassifier, SpeakerEncoder, AudioSeq2seq, TextEncoder, PostNet, MergeNet 8 | from .basic_layers import LinearNorm 9 | 10 | class Parrot(nn.Module): 11 | def __init__(self, hparams): 12 | super(Parrot, self).__init__() 13 | 14 | #print hparams 15 | # plus 16 | self.embedding = nn.Embedding( 17 | hparams.n_symbols + 1, hparams.symbols_embedding_dim) 18 | std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim)) 19 | val = sqrt(3.0) * std 20 | 21 | self.sos = hparams.n_symbols 22 | 23 | self.embedding.weight.data.uniform_(-val, val) 24 | 25 | self.text_encoder = TextEncoder(hparams) 26 | 27 | self.audio_seq2seq = AudioSeq2seq(hparams) 28 | 29 | self.merge_net = MergeNet(hparams) 30 | 31 | self.speaker_encoder = SpeakerEncoder(hparams) 32 | 33 | self.speaker_classifier = SpeakerClassifier(hparams) 34 | 35 | self.decoder = 
Decoder(hparams) 36 | 37 | self.postnet = PostNet(hparams) 38 | 39 | self.spemb_input = hparams.spemb_input 40 | 41 | self.strength_projection = LinearNorm(1,64) 42 | 43 | def grouped_parameters(self,): 44 | 45 | params_group1 = [p for p in self.embedding.parameters()] 46 | params_group1.extend([p for p in self.text_encoder.parameters()]) 47 | params_group1.extend([p for p in self.audio_seq2seq.parameters()]) 48 | 49 | params_group1.extend([p for p in self.speaker_encoder.parameters()]) 50 | params_group1.extend([p for p in self.merge_net.parameters()]) 51 | params_group1.extend([p for p in self.decoder.parameters()]) 52 | params_group1.extend([p for p in self.postnet.parameters()]) 53 | 54 | #for pn, p in self.audio_seq2seq.name_parameters(): 55 | # p.requires_grad = False 56 | 57 | #for pn, p in self.text_encoder.name_parameters(): 58 | # p.requires_grad = False 59 | 60 | #for pn, p in self.decoder.name_parameters(): 61 | # p.requires_grad = False 62 | 63 | return params_group1, [p for p in self.speaker_classifier.parameters()] 64 | 65 | def parse_batch(self, batch): 66 | text_input_padded, mel_padded, spc_padded, speaker_id, \ 67 | text_lengths, mel_lengths, stop_token_padded, strength_embedding = batch 68 | 69 | text_input_padded = to_gpu(text_input_padded).long() 70 | mel_padded = to_gpu(mel_padded).float() 71 | spc_padded = to_gpu(spc_padded).float() 72 | speaker_id = to_gpu(speaker_id).long() 73 | text_lengths = to_gpu(text_lengths).long() 74 | mel_lengths = to_gpu(mel_lengths).long() 75 | stop_token_padded = to_gpu(stop_token_padded).float() 76 | strength_embedding = to_gpu(strength_embedding).float() 77 | 78 | return ((text_input_padded, mel_padded, text_lengths, mel_lengths, strength_embedding), 79 | (text_input_padded, mel_padded, spc_padded, speaker_id, stop_token_padded,strength_embedding)) 80 | 81 | 82 | def forward(self, inputs, input_text): 83 | ''' 84 | text_input_padded [batch_size, max_text_len] 85 | mel_padded [batch_size, mel_bins, max_mel_len] 86 | text_lengths [batch_size] 87 | mel_lengths [batch_size] 88 | 89 | # 90 | predicted_mel [batch_size, mel_bins, T] 91 | predicted_stop [batch_size, T/r] 92 | alignment input_text==True [batch_size, T/r, max_text_len] or input_text==False [batch_size, T/r, T/r] 93 | text_hidden [B, max_text_len, hidden_dim] 94 | mel_hidden [B, T/r, hidden_dim] 95 | spearker_logit_from_mel [B, n_speakers] 96 | speaker_logit_from_mel_hidden [B, T/r, n_speakers] 97 | text_logit_from_mel_hidden [B, T/r, n_symbols] 98 | 99 | ''' 100 | 101 | text_input_padded, mel_padded, text_lengths, mel_lengths, strength_embedding = inputs 102 | 103 | text_input_embedded = self.embedding(text_input_padded.long()).transpose(1, 2) # -> [B, text_embedding_dim, max_text_len] 104 | text_hidden = self.text_encoder(text_input_embedded, text_lengths) # -> [B, max_text_len, hidden_dim] 105 | 106 | B = text_input_padded.size(0) 107 | start_embedding = Variable(text_input_padded.data.new(B,).fill_(self.sos)) 108 | start_embedding = self.embedding(start_embedding) 109 | 110 | # -> [B, n_speakers], [B, speaker_embedding_dim] 111 | speaker_logit_from_mel, speaker_embedding = self.speaker_encoder(mel_padded, mel_lengths) 112 | 113 | if self.spemb_input: 114 | T = mel_padded.size(2) 115 | audio_input = torch.cat([mel_padded, 116 | speaker_embedding.detach().unsqueeze(2).expand(-1, -1, T)], 1) 117 | else: 118 | audio_input = mel_padded 119 | 120 | audio_seq2seq_hidden, audio_seq2seq_logit, audio_seq2seq_alignments = self.audio_seq2seq( 121 | audio_input, mel_lengths, 
text_input_embedded, start_embedding) 122 | audio_seq2seq_hidden= audio_seq2seq_hidden[:,:-1, :] # -> [B, text_len, hidden_dim] 123 | 124 | 125 | speaker_logit_from_mel_hidden = self.speaker_classifier(audio_seq2seq_hidden) # -> [B, text_len, n_speakers] 126 | 127 | if input_text: 128 | hidden = self.merge_net(text_hidden, text_lengths) 129 | else: 130 | hidden = self.merge_net(audio_seq2seq_hidden, text_lengths) 131 | 132 | L = hidden.size(1) 133 | 134 | strength_embedding = self.strength_projection(strength_embedding) 135 | 136 | #project_2 = LinearNorm(128,112).to('cuda') 137 | #speaker_embedding_post = project_2(speaker_embedding) 138 | 139 | output = torch.cat([speaker_embedding,strength_embedding],-1) 140 | 141 | 142 | 143 | hidden = torch.cat([hidden, output.detach().unsqueeze(1).expand(-1, L, -1)], -1) 144 | 145 | predicted_mel, predicted_stop, alignments = self.decoder(hidden, mel_padded, text_lengths) 146 | 147 | post_output = self.postnet(predicted_mel) 148 | 149 | outputs = [predicted_mel, post_output, predicted_stop, alignments, 150 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_logit, audio_seq2seq_alignments, 151 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, 152 | text_lengths, mel_lengths,speaker_embedding] 153 | 154 | return outputs 155 | 156 | 157 | def inference(self, inputs, input_text, mel_reference, beam_width): 158 | ''' 159 | decode the audio sequence from input 160 | inputs x 161 | input_text True or False 162 | mel_reference [1, mel_bins, T] 163 | ''' 164 | text_input_padded, mel_padded, text_lengths, mel_lengths = inputs 165 | text_input_embedded = self.embedding(text_input_padded.long()).transpose(1, 2) 166 | text_hidden = self.text_encoder.inference(text_input_embedded) 167 | 168 | B = text_input_padded.size(0) # B should be 1 169 | start_embedding = Variable(text_input_padded.data.new(B,).fill_(self.sos)) 170 | start_embedding = self.embedding(start_embedding) # [1, embedding_dim] 171 | 172 | #-> [B, text_len+1, hidden_dim] [B, text_len+1, n_symbols] [B, text_len+1, T/r] 173 | speaker_id, speaker_embedding = self.speaker_encoder.inference(mel_reference) 174 | 175 | if self.spemb_input: 176 | T = mel_padded.size(2) 177 | audio_input = torch.cat([mel_padded, 178 | speaker_embedding.detach().unsqueeze(2).expand(-1, -1, T)], 1) 179 | else: 180 | audio_input = mel_padded 181 | 182 | audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments = self.audio_seq2seq.inference_beam( 183 | audio_input, start_embedding, self.embedding, beam_width=beam_width) 184 | audio_seq2seq_hidden= audio_seq2seq_hidden[:,:-1, :] # -> [B, text_len, hidden_dim] 185 | 186 | # -> [B, n_speakers], [B, speaker_embedding_dim] 187 | 188 | if input_text: 189 | hidden = self.merge_net.inference(text_hidden) 190 | else: 191 | hidden = self.merge_net.inference(audio_seq2seq_hidden) 192 | 193 | L = hidden.size(1) 194 | 195 | strength_embedding = self.strength_projection(strength_embedding) 196 | 197 | output = torch.cat([speaker_embedding,strength_embedding],-1) 198 | 199 | hidden = torch.cat([hidden, output.detach().unsqueeze(1).expand(-1, L, -1)], -1) 200 | 201 | predicted_mel, predicted_stop, alignments = self.decoder.inference(hidden) 202 | 203 | post_output = self.postnet(predicted_mel) 204 | 205 | return (predicted_mel, post_output, predicted_stop, alignments, 206 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments, 207 | speaker_id) 208 | 209 | 210 | -------------------------------------------------------------------------------- 
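Note on the intensity conditioning in model.py: `Parrot` projects a one-dimensional emotion-strength score through `strength_projection = LinearNorm(1, 64)`, concatenates it with the utterance-level emotion/speaker embedding, and broadcasts the result over every step of the merged hidden sequence before the decoder. The snippet below is a minimal, self-contained sketch of only that conditioning step; the batch size, sequence length, 512-dim hidden size, and 64-dim emotion embedding are illustrative assumptions (the 64-dim embedding is chosen so the concatenation contributes the extra 128 dims the decoder expects), and `nn.Linear` stands in for the repo's `LinearNorm`.

```Python
import torch
from torch import nn

# Assumed, illustrative sizes only.
B, L = 4, 37                                   # batch size, length of the hidden sequence
emotion_embedding = torch.randn(B, 64)         # utterance-level embedding from the encoder
strength = torch.rand(B, 1)                    # relative-attribute intensity score
hidden = torch.randn(B, L, 512)                # merged text/audio hidden sequence

strength_projection = nn.Linear(1, 64)         # plays the role of LinearNorm(1, 64) in Parrot
conditioning = torch.cat([emotion_embedding, strength_projection(strength)], dim=-1)  # [B, 128]

# Broadcast the conditioning vector over every step of the hidden sequence,
# as Parrot.forward does before calling the decoder.
decoder_input = torch.cat([hidden, conditioning.unsqueeze(1).expand(-1, L, -1)], dim=-1)
print(decoder_input.shape)                     # torch.Size([4, 37, 640])
```

The extra 128 dimensions contributed by `conditioning` correspond to the decoder's `self.hidden_cat_dim = hparams.encoder_embedding_dim + 128` in decoder.py.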
/codes/model/penalties.py: -------------------------------------------------------------------------------- 1 | 2 | import torch 3 | 4 | 5 | class PenaltyBuilder(object): 6 | """ 7 | Returns the Length and Coverage Penalty function for Beam Search. 8 | Args: 9 | length_pen (str): option name of length pen 10 | cov_pen (str): option name of cov pen 11 | """ 12 | 13 | def __init__(self, cov_pen, length_pen): 14 | self.length_pen = length_pen 15 | self.cov_pen = cov_pen 16 | 17 | def coverage_penalty(self): 18 | if self.cov_pen == "wu": 19 | return self.coverage_wu 20 | elif self.cov_pen == "summary": 21 | return self.coverage_summary 22 | else: 23 | return self.coverage_none 24 | 25 | def length_penalty(self): 26 | if self.length_pen == "wu": 27 | return self.length_wu 28 | elif self.length_pen == "avg": 29 | return self.length_average 30 | else: 31 | return self.length_none 32 | 33 | """ 34 | Below are all the different penalty terms implemented so far 35 | """ 36 | 37 | def coverage_wu(self, beam, cov, beta=0.): 38 | """ 39 | NMT coverage re-ranking score from 40 | "Google's Neural Machine Translation System" :cite:`wu2016google`. 41 | """ 42 | penalty = -torch.min(cov, cov.clone().fill_(1.0)).log().sum(1) 43 | return beta * penalty 44 | 45 | def coverage_summary(self, beam, cov, beta=0.): 46 | """ 47 | Our summary penalty. 48 | """ 49 | penalty = torch.max(cov, cov.clone().fill_(1.0)).sum(1) 50 | penalty -= cov.size(1) 51 | return beta * penalty 52 | 53 | def coverage_none(self, beam, cov, beta=0.): 54 | """ 55 | returns zero as penalty 56 | """ 57 | return beam.scores.clone().fill_(0.0) 58 | 59 | def length_wu(self, beam, logprobs, alpha=0.): 60 | """ 61 | NMT length re-ranking score from 62 | "Google's Neural Machine Translation System" :cite:`wu2016google`. 63 | """ 64 | 65 | modifier = (((5 + len(beam.next_ys)) ** alpha) / 66 | ((5 + 1) ** alpha)) 67 | return (logprobs / modifier) 68 | 69 | def length_average(self, beam, logprobs, alpha=0.): 70 | """ 71 | Returns the average probability of tokens in a sequence. 72 | """ 73 | return logprobs / len(beam.next_ys) 74 | 75 | def length_none(self, beam, logprobs, alpha=0., beta=0.): 76 | """ 77 | Returns unmodified scores. 
78 | """ 79 | return logprobs -------------------------------------------------------------------------------- /codes/model/ser.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import pdb 5 | 6 | class acrnn(nn.Module): 7 | def __init__(self, num_classes=4, is_training=True, 8 | L1=128, L2=256, cell_units=128, num_linear=768, 9 | p=10, time_step=800, F1=128, dropout_keep_prob=1): 10 | super(acrnn, self).__init__() 11 | 12 | self.num_classes = num_classes 13 | self.is_training = is_training 14 | self.L1 = L1 15 | self.L2 = L2 16 | self.cell_units = cell_units 17 | self.num_linear = num_linear 18 | self.p = p 19 | self.time_step = time_step 20 | self.F1 = F1 21 | self.dropout_prob = 1 - dropout_keep_prob 22 | 23 | # tf filter : [filter_height, filter_width, in_channels, out_channels] 24 | self.conv1 = nn.Conv2d(3, self.L1, (5, 3), padding=(2, 1)) # [5, 3, 3, 128] 25 | self.conv2 = nn.Conv2d(self.L1, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 26 | self.conv3 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 27 | self.conv4 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 28 | self.conv5 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 29 | self.conv6 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 30 | 31 | self.linear1 = nn.Linear(self.p*self.L2, self.num_linear) # [10*256, 768] 32 | self.bn = nn.BatchNorm1d(self.num_linear) 33 | 34 | #self.linear_em = nn.Linear(self.p, self.L2) 35 | 36 | self.relu = nn.LeakyReLU(0.01) 37 | self.dropout = nn.Dropout2d(p=self.dropout_prob) 38 | 39 | self.rnn = nn.LSTM(input_size=self.num_linear, hidden_size=self.cell_units, 40 | batch_first=True, num_layers=1, bidirectional=True) 41 | 42 | # for attention 43 | self.a_fc1 = nn.Linear(2*self.cell_units, 1) 44 | self.a_fc2 = nn.Linear(1, 1) 45 | self.sigmoid = nn.Sigmoid() 46 | self.softmax = nn.Softmax(dim=1) 47 | 48 | # fully connected layers 49 | self.fc1 = nn.Linear(2*self.cell_units, self.F1) # [2*128, 64] 50 | self.fc2 = nn.Linear(self.F1, self.num_classes) # [num_classes] 51 | 52 | 53 | def forward(self, x): 54 | 55 | layer1 = self.relu(self.conv1(x)) 56 | layer1 = F.max_pool2d(layer1, kernel_size=(2, 4), stride=(2, 4)) # [1,2,4,1], padding = 'valid' 57 | layer1 = self.dropout(layer1) 58 | 59 | layer2 = self.relu(self.conv2(layer1)) 60 | layer2 = self.dropout(layer2) 61 | 62 | layer3 = self.relu(self.conv3(layer2)) 63 | layer3 = self.dropout(layer3) 64 | 65 | layer4 = self.relu(self.conv4(layer3)) 66 | layer4 = self.dropout(layer4) 67 | 68 | layer5 = self.relu(self.conv5(layer4)) 69 | layer5 = self.dropout(layer5) 70 | 71 | layer6 = self.relu(self.conv6(layer5)) 72 | layer6 = self.dropout(layer6) 73 | 74 | # lstm 75 | layer6 = layer6.permute(0, 2, 3, 1) 76 | layer6 = layer6.reshape(-1, self.time_step, self.L2*self.p) # (-1, 150, 256*10) 77 | 78 | layer6 = layer6.reshape(-1, self.L2*self.p) # (1500, 2560) 79 | 80 | linear1 = self.relu(self.bn(self.linear1(layer6))) # [1500, 768] 81 | linear1 = linear1.reshape(-1, self.time_step, self.num_linear) # [10, 150, 768] 82 | em_bed_low = linear1 83 | 84 | outputs1, output_states1 = self.rnn(linear1) # outputs1 : [10, 150, 128] (B,T,D) 85 | 86 | # # attention 87 | v = self.sigmoid(self.a_fc1(outputs1)) # (10, 150, 1) 88 | alphas = self.softmax(self.a_fc2(v).squeeze()) # (B,T) shape, alphas are attention weights (10,800) 89 | gru = 
(alphas.unsqueeze(2) * outputs1).sum(dim=1) # (B,D) (10,256) 90 | 91 | # # fc 92 | fully1 = self.relu(self.fc1(gru)) 93 | em_bed_high = fully1 94 | fully1 = self.dropout(fully1) 95 | Ylogits = self.fc2(fully1) 96 | Ylogits = self.softmax(Ylogits) 97 | 98 | return Ylogits, em_bed_low, em_bed_high 99 | -------------------------------------------------------------------------------- /codes/model/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def gcd(a,b): 6 | a, b = (a, b) if a >=b else (b, a) 7 | if a%b == 0: 8 | return b 9 | else : 10 | return gcd(b,a%b) 11 | 12 | def lcm(a,b): 13 | return a*b//gcd(a,b) 14 | 15 | 16 | if __name__ == "__main__": 17 | print(lcm(3,2)) 18 | 19 | def get_mask_from_lengths(lengths, max_len=None): 20 | if max_len is None: 21 | max_len = torch.max(lengths).item() 22 | ids = torch.arange(0, max_len, out=torch.cuda.LongTensor(max_len)) 23 | #print ids 24 | mask = (ids < lengths.unsqueeze(1)).byte() 25 | return mask 26 | 27 | def to_gpu(x): 28 | x = x.contiguous() 29 | 30 | if torch.cuda.is_available(): 31 | x = x.cuda(non_blocking=True) 32 | return torch.autograd.Variable(x) 33 | 34 | def test_mask(): 35 | lengths = torch.IntTensor([3,5,4]) 36 | print(torch.ceil(lengths.float() / 2)) 37 | 38 | data = torch.FloatTensor(3, 5, 2) # [B, T, D] 39 | data.fill_(1.) 40 | m = get_mask_from_lengths(lengths.cuda(), data.size(1)) 41 | print(m) 42 | m = m.unsqueeze(2).expand(-1,-1,data.size(2)).float() 43 | print(m) 44 | 45 | print(torch.sum(data.cuda() * m) / torch.sum(m)) 46 | 47 | 48 | def test_loss(): 49 | data1 = torch.FloatTensor(3, 5, 2) 50 | data1.fill_(1.) 51 | data2 = torch.FloatTensor(3, 5, 2) 52 | data2.fill_(2.) 53 | data2[0,0,0] = 1000 54 | 55 | l = torch.nn.L1Loss(reduction='none')(data1,data2) 56 | print(l) 57 | 58 | 59 | #if __name__ == '__main__': 60 | # test_mask() -------------------------------------------------------------------------------- /codes/multiproc.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | import sys 4 | import subprocess 5 | 6 | argslist = list(sys.argv)[1:] 7 | num_gpus = torch.cuda.device_count() 8 | argslist.append('--n_gpus={}'.format(num_gpus)) 9 | workers = [] 10 | job_id = time.strftime("%Y_%m_%d-%H%M%S") 11 | argslist.append("--group_name=group_{}".format(job_id)) 12 | 13 | for i in range(num_gpus): 14 | argslist.append('--rank={}'.format(i)) 15 | stdout = None if i == 0 else open("logs/{}_GPU_{}.log".format(job_id, i), 16 | "w") 17 | print(argslist) 18 | p = subprocess.Popen([str(sys.executable)]+argslist, stdout=stdout) 19 | workers.append(p) 20 | argslist = argslist[:-1] 21 | 22 | for p in workers: 23 | p.wait() 24 | -------------------------------------------------------------------------------- /codes/plotting_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | import numpy as np 5 | 6 | 7 | def save_figure_to_numpy(fig): 8 | # save it to a numpy array. 
9 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 10 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 11 | return data 12 | 13 | def plot_alignment(alignment, fn): 14 | # [4, encoder_step, decoder_step] 15 | fig, axes = plt.subplots(2, 2) 16 | for i in range(2): 17 | for j in range(2): 18 | g = axes[i][j].imshow(alignment[i*2+j,:,:].T, 19 | aspect='auto', origin='lower', 20 | interpolation='none') 21 | plt.colorbar(g, ax=axes[i][j]) 22 | 23 | plt.savefig(fn) 24 | plt.close() 25 | return fn 26 | 27 | 28 | def plot_alignment_to_numpy(alignment, info=None): 29 | fig, ax = plt.subplots(figsize=(6, 4)) 30 | im = ax.imshow(alignment, aspect='auto', origin='lower', 31 | interpolation='none') 32 | fig.colorbar(im, ax=ax) 33 | xlabel = 'Decoder timestep' 34 | if info is not None: 35 | xlabel += '\n\n' + info 36 | plt.xlabel(xlabel) 37 | plt.ylabel('Encoder timestep') 38 | plt.tight_layout() 39 | 40 | fig.canvas.draw() 41 | data = save_figure_to_numpy(fig) 42 | plt.close() 43 | return data 44 | 45 | 46 | def plot_spectrogram_to_numpy(spectrogram): 47 | fig, ax = plt.subplots(figsize=(12, 3)) 48 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 49 | interpolation='none') 50 | plt.colorbar(im, ax=ax) 51 | plt.xlabel("Frames") 52 | plt.ylabel("Channels") 53 | plt.tight_layout() 54 | 55 | fig.canvas.draw() 56 | data = save_figure_to_numpy(fig) 57 | plt.close() 58 | return data 59 | 60 | 61 | def plot_gate_outputs_to_numpy(gate_targets, gate_outputs): 62 | fig, ax = plt.subplots(figsize=(12, 3)) 63 | ax.scatter(list(range(len(gate_targets))), gate_targets, alpha=0.5, 64 | color='green', marker='+', s=1, label='target') 65 | ax.scatter(list(range(len(gate_outputs))), gate_outputs, alpha=0.5, 66 | color='red', marker='.', s=1, label='predicted') 67 | 68 | plt.xlabel("Frames (Green target, Red predicted)") 69 | plt.ylabel("Gate State") 70 | plt.tight_layout() 71 | 72 | fig.canvas.draw() 73 | data = save_figure_to_numpy(fig) 74 | plt.close() 75 | return data 76 | -------------------------------------------------------------------------------- /codes/train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import argparse 4 | import math 5 | from numpy import finfo 6 | import numpy as np 7 | 8 | import torch 9 | from distributed import apply_gradient_allreduce 10 | import torch.distributed as dist 11 | from torch.utils.data.distributed import DistributedSampler 12 | from torch.utils.data import DataLoader 13 | 14 | from model import Parrot, ParrotLoss, lcm 15 | from reader import TextMelIDLoader, TextMelIDCollate 16 | from logger import ParrotLogger 17 | from hparams import create_hparams 18 | 19 | 20 | def batchnorm_to_float(module): 21 | """Converts batch norm modules to FP32""" 22 | if isinstance(module, torch.nn.modules.batchnorm._BatchNorm): 23 | module.float() 24 | for child in module.children(): 25 | batchnorm_to_float(child) 26 | return module 27 | 28 | 29 | def reduce_tensor(tensor, n_gpus): 30 | rt = tensor.clone() 31 | dist.all_reduce(rt, op=dist.reduce_op.SUM) 32 | rt /= n_gpus 33 | return rt 34 | 35 | 36 | def init_distributed(hparams, n_gpus, rank, group_name): 37 | assert torch.cuda.is_available(), "Distributed mode requires CUDA." 38 | print("Initializing Distributed") 39 | 40 | # Set cuda device so everything is done on the right GPU. 
41 | torch.cuda.set_device(rank % torch.cuda.device_count()) 42 | 43 | # Initialize distributed communication 44 | dist.init_process_group( 45 | backend=hparams.dist_backend, init_method=hparams.dist_url, 46 | world_size=n_gpus, rank=rank, group_name=group_name) 47 | 48 | print("Done initializing distributed") 49 | 50 | 51 | def prepare_dataloaders(hparams): 52 | # Get data, data loaders and collate function ready 53 | trainset = TextMelIDLoader(hparams.training_list, hparams.mel_mean_std) 54 | valset = TextMelIDLoader(hparams.validation_list, hparams.mel_mean_std) 55 | collate_fn = TextMelIDCollate(lcm(hparams.n_frames_per_step_encoder, 56 | hparams.n_frames_per_step_decoder)) 57 | 58 | train_sampler = DistributedSampler(trainset) \ 59 | if hparams.distributed_run else None 60 | 61 | train_loader = DataLoader(trainset, num_workers=1, shuffle=True, 62 | sampler=train_sampler, 63 | batch_size=hparams.batch_size, pin_memory=False, 64 | drop_last=True, collate_fn=collate_fn) 65 | return train_loader, valset, collate_fn 66 | 67 | 68 | def prepare_directories_and_logger(output_directory, log_directory, rank): 69 | if rank == 0: 70 | if not os.path.isdir(output_directory): 71 | os.makedirs(output_directory) 72 | os.chmod(output_directory, 0o775) 73 | logger = ParrotLogger(os.path.join(output_directory, log_directory)) 74 | else: 75 | logger = None 76 | return logger 77 | 78 | 79 | def load_model(hparams): 80 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 81 | model = Parrot(hparams).to(device) 82 | if hparams.distributed_run: 83 | model = apply_gradient_allreduce(model) 84 | 85 | return model 86 | 87 | 88 | def warm_start_model(checkpoint_path, model): 89 | assert os.path.isfile(checkpoint_path) 90 | print(("Warm starting model from checkpoint '{}'".format(checkpoint_path))) 91 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 92 | new_state_dict = {} 93 | new_state_dict = {} 94 | for k, v in checkpoint_dict['state_dict'].items(): 95 | if k not in ['speaker_encoder.projection2.linear_layer.weight', 96 | 'speaker_encoder.projection2.linear_layer.bias', 97 | 'speaker_classifier.projection.linear_layer.weight', 98 | 'speaker_classifier.projection.linear_layer.bias']: 99 | new_state_dict[k] = v 100 | else: 101 | s = v.size() 102 | if len(s) == 2: 103 | new_state_dict[k] = torch.nn.init.normal_(torch.empty((5, s[1]))) 104 | else: 105 | new_state_dict[k] = torch.nn.init.normal_(torch.empty(5)) 106 | #new_state_dict[k].weight.requires_grad = False 107 | #new_state_dict[k].bias.requires_grad = False 108 | 109 | if 'text_encoder' in k: 110 | new_state_dict[k].requires_grad = False 111 | if 'audio_seq2seq.encoder' in k: 112 | new_state_dict[k].requires_grad = False 113 | 114 | 115 | model.load_state_dict(new_state_dict,strict=False) 116 | #model.load_state_dict(torch.load(checkpoint_path)['state_dict'], strict=False) 117 | return model 118 | 119 | 120 | def load_checkpoint(checkpoint_path, model, optimizer_main, optimizer_sc): 121 | assert os.path.isfile(checkpoint_path) 122 | print(("Loading checkpoint '{}'".format(checkpoint_path))) 123 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 124 | model.load_state_dict(checkpoint_dict['state_dict']) 125 | optimizer_main.load_state_dict(checkpoint_dict['optimizer_main']) 126 | optimizer_sc.load_state_dict(checkpoint_dict['optimizer_sc']) 127 | learning_rate = checkpoint_dict['learning_rate'] 128 | iteration = checkpoint_dict['iteration'] 129 | print(("Loaded checkpoint '{}' from iteration {}" .format( 130 | 
checkpoint_path, iteration))) 131 | return model, optimizer_main, optimizer_sc, learning_rate, iteration 132 | 133 | 134 | def save_checkpoint(model, optimizer_main, optimizer_sc, learning_rate, iteration, filepath): 135 | print(("Saving model and optimizer state at iteration {} to {}".format( 136 | iteration, filepath))) 137 | torch.save({'iteration': iteration, 138 | 'state_dict': model.state_dict(), 139 | 'optimizer_main': optimizer_main.state_dict(), 140 | 'optimizer_sc': optimizer_sc.state_dict(), 141 | 'learning_rate': learning_rate}, filepath) 142 | 143 | 144 | def validate(model, criterion, valset, iteration, batch_size, n_gpus, 145 | collate_fn, logger, distributed_run, rank): 146 | """Handles all the validation scoring and printing""" 147 | model.eval() 148 | with torch.no_grad(): 149 | val_sampler = DistributedSampler(valset) if distributed_run else None 150 | val_loader = DataLoader(valset, sampler=val_sampler, num_workers=1, 151 | shuffle=False, batch_size=batch_size, 152 | drop_last=True, 153 | pin_memory=False, collate_fn=collate_fn) 154 | 155 | val_loss_tts, val_loss_vc = 0.0, 0.0 156 | reduced_val_tts_losses, reduced_val_vc_losses = np.zeros([8], dtype=np.float32), np.zeros([8], dtype=np.float32) 157 | reduced_val_tts_acces, reduced_val_vc_acces = np.zeros([3], dtype=np.float32), np.zeros([3], dtype=np.float32) 158 | 159 | for i, batch in enumerate(val_loader): 160 | 161 | x, y = model.parse_batch(batch) 162 | 163 | if i%2 == 0: 164 | y_pred = model(x, True) 165 | else: 166 | y_pred = model(x, False) 167 | 168 | losses, acces, l_main, l_sc = criterion(y_pred, y, False) 169 | if distributed_run: 170 | reduced_val_losses = [] 171 | reduced_val_acces = [] 172 | 173 | for l in losses: 174 | reduced_val_losses.append(reduce_tensor(l.data, n_gpus).item()) 175 | for a in acces: 176 | reduced_val_acces.append(reduce_tensor(a.data, n_gpus).item()) 177 | 178 | l_main = reduce_tensor(l_main.data, n_gpus).item() 179 | l_sc = reduce_tensor(l_sc.data, n_gpus).item() 180 | else: 181 | reduced_val_losses = [l.item() for l in losses] 182 | reduced_val_acces = [a.item() for a in acces] 183 | l_main = l_main.item() 184 | l_sc = l_sc.item() 185 | 186 | if i%2 == 0: 187 | val_loss_tts += l_main + l_sc 188 | y_tts = y 189 | y_tts_pred = y_pred 190 | reduced_val_tts_losses += np.array(reduced_val_losses) 191 | reduced_val_tts_acces += np.array(reduced_val_acces) 192 | else: 193 | val_loss_vc += l_main + l_sc 194 | y_vc = y 195 | y_vc_pred = y_pred 196 | reduced_val_vc_losses += np.array(reduced_val_losses) 197 | reduced_val_vc_acces += np.array(reduced_val_acces) 198 | 199 | if i % 2 == 0: 200 | num_tts = i / 2 + 1 201 | num_vc = i / 2 202 | else: 203 | num_tts = (i + 1) / 2 204 | num_vc = (i + 1) / 2 205 | 206 | val_loss_tts = val_loss_tts / num_tts 207 | val_loss_vc = val_loss_vc / num_vc 208 | reduced_val_tts_acces = reduced_val_tts_acces / num_tts 209 | reduced_val_vc_acces = reduced_val_vc_acces / num_vc 210 | reduced_val_tts_losses = reduced_val_tts_losses / num_tts 211 | reduced_val_vc_losses = reduced_val_vc_losses / num_vc 212 | 213 | model.train() 214 | if rank == 0: 215 | print(("Validation loss {}: TTS {:9f} VC {:9f}".format(iteration, val_loss_tts, val_loss_vc))) 216 | logger.log_validation(val_loss_tts, reduced_val_tts_losses, reduced_val_tts_acces, model, y_tts, y_tts_pred, iteration, 'tts') 217 | logger.log_validation(val_loss_vc, reduced_val_vc_losses, reduced_val_vc_acces, model, y_vc, y_vc_pred, iteration, 'vc') 218 | 219 | 220 | 221 | def train(output_directory, 
log_directory, checkpoint_path, warm_start, n_gpus, 222 | rank, group_name, hparams): 223 | 224 | """Training and validation logging results to tensorboard and stdout 225 | Params 226 | ------ 227 | output_directory (string): directory to save checkpoints 228 | log_directory (string) directory to save tensorboard logs 229 | checkpoint_path(string): checkpoint path 230 | n_gpus (int): number of gpus 231 | rank (int): rank of current gpu 232 | hparams (object): comma separated list of "name=value" pairs. 233 | """ 234 | 235 | if hparams.distributed_run: 236 | init_distributed(hparams, n_gpus, rank, group_name) 237 | 238 | torch.manual_seed(hparams.seed) 239 | torch.cuda.manual_seed(hparams.seed) 240 | 241 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 242 | model = load_model(hparams) 243 | model = model.to(device) 244 | 245 | learning_rate = hparams.learning_rate 246 | 247 | parameters_main, parameters_sc = model.grouped_parameters() 248 | 249 | optimizer_main = torch.optim.Adam(parameters_main, lr=learning_rate, 250 | weight_decay=hparams.weight_decay) 251 | optimizer_sc = torch.optim.Adam(parameters_sc, lr=learning_rate, 252 | weight_decay=hparams.weight_decay) 253 | 254 | if hparams.distributed_run: 255 | model = apply_gradient_allreduce(model) 256 | 257 | criterion = ParrotLoss(hparams).cuda() 258 | 259 | logger = prepare_directories_and_logger( 260 | output_directory, log_directory, rank) 261 | 262 | train_loader, valset, collate_fn = prepare_dataloaders(hparams) 263 | 264 | # Load checkpoint if one exists 265 | iteration = 0 266 | epoch_offset = 0 267 | if checkpoint_path is not None: 268 | if warm_start: 269 | model = warm_start_model(checkpoint_path, model) 270 | else: 271 | model, optimizer_main, optimizer_sc, _learning_rate, iteration = load_checkpoint( 272 | checkpoint_path, model, optimizer_main, optimizer_sc) 273 | if hparams.use_saved_learning_rate: 274 | learning_rate = _learning_rate 275 | iteration += 1 # next iteration is iteration + 1 276 | epoch_offset = max(0, int(iteration / len(train_loader))) 277 | 278 | model.train() 279 | # ================ MAIN TRAINNIG LOOP! 
=================== 280 | for epoch in range(epoch_offset, hparams.epochs): 281 | print(("Epoch: {}".format(epoch))) 282 | if epoch > hparams.warmup: 283 | learning_rate = hparams.learning_rate * hparams.decay_rate ** ((epoch - hparams.warmup) // hparams.decay_every + 1) 284 | 285 | for i, batch in enumerate(train_loader): 286 | 287 | start = time.time() 288 | 289 | for param_group in optimizer_main.param_groups: 290 | param_group['lr'] = learning_rate 291 | 292 | for param_group in optimizer_sc.param_groups: 293 | param_group['lr'] = learning_rate 294 | 295 | 296 | 297 | model.zero_grad() 298 | x, y = model.parse_batch(batch) 299 | 300 | if i % 2 == 0: 301 | y_pred = model(x, True) 302 | losses, acces, l_main, l_sc = criterion(y_pred, y, True) 303 | else: 304 | y_pred = model(x, False) 305 | losses, acces, l_main, l_sc = criterion(y_pred, y, False) 306 | 307 | if hparams.distributed_run: 308 | reduced_losses = [] 309 | for l in losses: 310 | reduced_losses.append(reduce_tensor(l.data, n_gpus).item()) 311 | reduced_acces = [] 312 | for a in acces: 313 | reduced_acces.append(reduce_tensor(a.data, n_gpus).item()) 314 | redl_main = reduce_tensor(l_main.data, n_gpus).item() 315 | redl_sc = reduce_tensor(l_sc.data, n_gpus).item() 316 | else: 317 | reduced_losses = [l.item() for l in losses] 318 | reduced_acces = [a.item() for a in acces] 319 | redl_main = l_main.item() 320 | redl_sc = l_sc.item() 321 | 322 | for p in parameters_sc: 323 | p.requires_grad_(requires_grad=False) 324 | 325 | l_main.backward(retain_graph=True) 326 | grad_norm_main = torch.nn.utils.clip_grad_norm_( 327 | parameters_main, hparams.grad_clip_thresh) 328 | 329 | optimizer_main.step() 330 | 331 | for p in parameters_sc: 332 | p.requires_grad_(requires_grad=True) 333 | for p in parameters_main: 334 | p.requires_grad_(requires_grad=False) 335 | 336 | 337 | l_sc.backward() 338 | grad_norm_sc = torch.nn.utils.clip_grad_norm_( 339 | parameters_sc, hparams.grad_clip_thresh) 340 | 341 | 342 | optimizer_sc.step() 343 | 344 | for p in parameters_main: 345 | p.requires_grad_(requires_grad=True) 346 | 347 | if not math.isnan(redl_main) and rank == 0: 348 | 349 | duration = time.time() - start 350 | task = 'TTS' if i%2 == 0 else 'VC' 351 | print(("Train {} {} {:.6f} Grad Norm {:.6f} {:.2f}s/it".format( 352 | task, iteration, redl_main+redl_sc, grad_norm_main, duration))) 353 | logger.log_training( 354 | redl_main+redl_sc, reduced_losses, reduced_acces, grad_norm_main, learning_rate, duration, iteration) 355 | 356 | if (iteration % hparams.iters_per_checkpoint == 0): 357 | validate(model, criterion, valset, iteration, 358 | hparams.batch_size, n_gpus, collate_fn, logger, 359 | hparams.distributed_run, rank) 360 | if rank == 0: 361 | checkpoint_path = os.path.join( 362 | output_directory, "checkpoint_{}".format(iteration)) 363 | save_checkpoint(model, optimizer_main, optimizer_sc, learning_rate, iteration, 364 | checkpoint_path) 365 | 366 | iteration += 1 367 | 368 | 369 | if __name__ == '__main__': 370 | 371 | parser = argparse.ArgumentParser() 372 | parser.add_argument('-o', '--output_directory', type=str, 373 | help='directory to save checkpoints') 374 | parser.add_argument('-l', '--log_directory', type=str, 375 | help='directory to save tensorboard logs') 376 | parser.add_argument('-c', '--checkpoint_path', type=str, default=None, 377 | required=False, help='checkpoint path') 378 | parser.add_argument('--warm_start', action='store_true', 379 | help='load the model only (warm start)') 380 | parser.add_argument('--n_gpus', type=int, 
default=1, 381 | required=False, help='number of gpus') 382 | parser.add_argument('--rank', type=int, default=0, 383 | required=False, help='rank of current gpu') 384 | parser.add_argument('--group_name', type=str, default='group_name', 385 | required=False, help='Distributed group name') 386 | parser.add_argument('--hparams', type=str, 387 | required=False, help='comma separated name=value pairs') 388 | 389 | args = parser.parse_args() 390 | hparams = create_hparams(args.hparams) 391 | 392 | torch.backends.cudnn.enabled = hparams.cudnn_enabled 393 | torch.backends.cudnn.benchmark = hparams.cudnn_benchmark 394 | 395 | print(("Distributed Run:", hparams.distributed_run)) 396 | print(("cuDNN Enabled:", hparams.cudnn_enabled)) 397 | print(("cuDNN Benchmark:", hparams.cudnn_benchmark)) 398 | 399 | train(args.output_directory, args.log_directory, args.checkpoint_path, 400 | args.warm_start, args.n_gpus, args.rank, args.group_name, hparams) -------------------------------------------------------------------------------- /stage3_update.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/stage3_update.png -------------------------------------------------------------------------------- /train_ser.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import IPython 7 | from IPython.display import Audio 8 | from IPython.display import Image 9 | import matplotlib.pyplot as plt 10 | import torch 11 | import torch.nn as nn 12 | from sklearn.preprocessing import StandardScaler 13 | from sklearn.metrics import confusion_matrix 14 | import seaborn as sn 15 | import pickle 16 | from torch.utils.tensorboard import SummaryWriter 17 | 18 | def addAWGN(signal, num_bits=16, augmented_num=2, snr_low=15, snr_high=30): 19 | signal_len = len(signal) 20 | # Generate White Gaussian noise 21 | noise = np.random.normal(size=(augmented_num, signal_len)) 22 | # Normalize signal and noise 23 | norm_constant = 2.0**(num_bits-1) 24 | signal_norm = signal / norm_constant 25 | noise_norm = noise / norm_constant 26 | # Compute signal and noise power 27 | s_power = np.sum(signal_norm ** 2) / signal_len 28 | n_power = np.sum(noise_norm ** 2, axis=1) / signal_len 29 | # Random SNR: Uniform [15, 30] in dB 30 | target_snr = np.random.randint(snr_low, snr_high) 31 | # Compute K (covariance matrix) for each noise 32 | K = np.sqrt((s_power / n_power) * 10 ** (- target_snr / 10)) 33 | K = np.ones((signal_len, augmented_num)) * K 34 | # Generate noisy signal 35 | return signal + K.T * noise 36 | 37 | def getMELspectrogram(audio, sample_rate): 38 | mel_spec = librosa.feature.melspectrogram(y=audio, 39 | sr=sample_rate, 40 | n_fft=2048, 41 | #n_fft=1024, 42 | #win_length = 512, 43 | win_length = 500, 44 | window='hamming', 45 | hop_length = 200, 46 | n_mels=80, 47 | fmax=None 48 | ) 49 | mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max) 50 | return mel_spec_db 51 | 52 | 53 | # BATCH FIRST TimeDistributed layer 54 | class TimeDistributed(nn.Module): 55 | def __init__(self, module): 56 | super(TimeDistributed, self).__init__() 57 | self.module = module 58 | 59 | def forward(self, x): 60 | 61 | if len(x.size()) <= 2: 62 | return self.module(x) 63 | # squash samples and timesteps into a single axis 64 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 65 | x_reshape = 
x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 66 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 67 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 68 | else: # (samples,timesteps,inp1,inp2,inp3) 69 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 70 | 71 | y = self.module(x_reshape) 72 | 73 | # we have to reshape Y 74 | if len(x.size()) == 3: 75 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 76 | elif len(x.size()) == 4: 77 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 78 | else: 79 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 80 | y.size(3)) # (samples, timesteps, out1,out2, out3) 81 | return y 82 | 83 | class HybridModel(nn.Module): 84 | def __init__(self,num_emotions): 85 | super().__init__() 86 | # conv block 87 | self.conv2Dblock = nn.Sequential( 88 | # 1. conv block 89 | TimeDistributed(nn.Conv2d(in_channels=1, 90 | out_channels=16, 91 | kernel_size=3, 92 | stride=1, 93 | padding=1 94 | )), 95 | TimeDistributed(nn.BatchNorm2d(16)), 96 | TimeDistributed(nn.ReLU()), 97 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 98 | TimeDistributed(nn.Dropout(p=0.3)), 99 | # 2. conv block 100 | TimeDistributed(nn.Conv2d(in_channels=16, 101 | out_channels=32, 102 | kernel_size=3, 103 | stride=1, 104 | padding=1 105 | )), 106 | TimeDistributed(nn.BatchNorm2d(32)), 107 | TimeDistributed(nn.ReLU()), 108 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 109 | TimeDistributed(nn.Dropout(p=0.3)), 110 | # 3. conv block 111 | TimeDistributed(nn.Conv2d(in_channels=32, 112 | out_channels=64, 113 | kernel_size=3, 114 | stride=1, 115 | padding=1 116 | )), 117 | TimeDistributed(nn.BatchNorm2d(64)), 118 | TimeDistributed(nn.ReLU()), 119 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 120 | TimeDistributed(nn.Dropout(p=0.3)) 121 | ) 122 | # LSTM block 123 | hidden_size = 32 124 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 125 | self.dropout_lstm = nn.Dropout(p=0.4) 126 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 127 | # Linear softmax layer 128 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 129 | def forward(self,x): 130 | conv_embedding = self.conv2Dblock(x) 131 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 132 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 133 | lstm_embedding = self.dropout_lstm(lstm_embedding) 134 | # lstm_embedding (batch, time, hidden_size*2) 135 | batch_size,T,_ = lstm_embedding.shape 136 | attention_weights = [None]*T 137 | for t in range(T): 138 | embedding = lstm_embedding[:,t,:] 139 | attention_weights[t] = self.attention_linear(embedding) 140 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 141 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 142 | attention = torch.squeeze(attention, 1) 143 | output_logits = self.out_linear(attention) 144 | output_softmax = nn.functional.softmax(output_logits,dim=1) 145 | return output_logits, output_softmax 146 | 147 | def splitIntoChunks(mel_spec,win_size,stride): 148 | mel_spec = mel_spec.T 149 | t = mel_spec.shape[1] 150 | num_of_chunks = int(t/stride) 151 | chunks = [] 152 | for i in 
range(num_of_chunks): 153 | chunk = mel_spec[:,i*stride:i*stride+win_size] 154 | if chunk.shape[1] == win_size: 155 | chunks.append(chunk) 156 | return np.stack(chunks,axis=0) 157 | 158 | def loss_fnc(predictions, targets): 159 | return nn.CrossEntropyLoss()(input=predictions,target=targets) 160 | 161 | def make_train_step(model, loss_fnc, optimizer): 162 | def train_step(X,Y): 163 | # set model to train mode 164 | model.train() 165 | # forward pass 166 | output_logits, output_softmax = model(X) 167 | predictions = torch.argmax(output_softmax,dim=1) 168 | accuracy = torch.sum(Y==predictions)/float(len(Y)) 169 | # compute loss 170 | loss = loss_fnc(output_logits, Y) 171 | # compute gradients 172 | loss.backward() 173 | # update parameters and zero gradients 174 | optimizer.step() 175 | optimizer.zero_grad() 176 | return loss.item(), accuracy*100 177 | return train_step 178 | 179 | def make_validate_fnc(model,loss_fnc): 180 | def validate(X,Y): 181 | with torch.no_grad(): 182 | model.eval() 183 | output_logits, output_softmax = model(X) 184 | predictions = torch.argmax(output_softmax,dim=1) 185 | accuracy = torch.sum(Y==predictions)/float(len(Y)) 186 | loss = loss_fnc(output_logits,Y) 187 | return loss.item(), accuracy*100, predictions 188 | return validate 189 | 190 | def load_data(in_dir): 191 | f = open(in_dir,'rb') 192 | train_data_norm, test_data_norm, train_label, test_label = pickle.load(f) 193 | return train_data_norm, test_data_norm, train_label, test_label 194 | 195 | EMOTIONS = {0:'angry', 1:'happy',2:'sad',3:'neutral'} 196 | data_path = '../transformer_ser/data_500_final.pkl' 197 | model_name = 'cnn_attention_lstm_model_64_update.pt' 198 | model_name_1 = 'cnn_attention_lstm_model_64_update_best.pt' 199 | X_train, X_test, Y_train, Y_test = load_data(data_path) 200 | 201 | Y_train = Y_train.reshape(-1) 202 | Y_test = Y_test.reshape(-1) 203 | 204 | Y_train = Y_train.astype('int8') 205 | Y_test = Y_test.astype('int8') 206 | 207 | print(f'X_train:{X_train.shape}, Y_train:{Y_train.shape}', flush=True) 208 | print(f'X_test:{X_test.shape}, Y_test:{Y_test.shape}', flush=True) 209 | 210 | 211 | # get chunks 212 | # train set 213 | mel_train_chunked = [] 214 | for mel_spec in X_train: 215 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 216 | mel_train_chunked.append(chunks) 217 | print("Number of chunks is {}".format(chunks.shape[0])) 218 | # test set 219 | mel_test_chunked = [] 220 | for mel_spec in X_test: 221 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 222 | mel_test_chunked.append(chunks) 223 | print("Number of chunks is {}".format(chunks.shape[0])) 224 | 225 | 226 | X_train = np.stack(mel_train_chunked,axis=0) 227 | X_train = np.expand_dims(X_train,2) 228 | print('Shape of X_train: ',X_train.shape) 229 | X_test = np.stack(mel_test_chunked,axis=0) 230 | X_test = np.expand_dims(X_test,2) 231 | print('Shape of X_test: ',X_test.shape) 232 | 233 | b,t,c,h,w = X_train.shape 234 | X_train = np.reshape(X_train, newshape=(b,-1)) 235 | X_train = np.reshape(X_train, newshape=(b,t,c,h,w)) 236 | 237 | b,t,c,h,w = X_test.shape 238 | X_test = np.reshape(X_test, newshape=(b,-1)) 239 | X_test = np.reshape(X_test, newshape=(b,t,c,h,w)) 240 | 241 | # X_train = np.expand_dims(X_train,1) #[4257,1,800,80] 242 | # X_test = np.expand_dims(X_test,1) 243 | # 244 | # b,c,h,w = X_train.shape 245 | # X_train = np.reshape(X_train, newshape=(b,-1)) 246 | # #X_train = scaler.fit_transform(X_train) 247 | # X_train = np.reshape(X_train, newshape=(b,c,w,h)) #[4257,1,80,800] 248 | # 249 | # 
b,c,h,w = X_test.shape 250 | # X_test = np.reshape(X_test, newshape=(b,-1)) 251 | # #X_test = scaler.transform(X_test) 252 | # X_test = np.reshape(X_test, newshape=(b,c,w,h)) 253 | SAVE_PATH = os.path.join(os.getcwd(),'models') 254 | os.makedirs('models',exist_ok=True) 255 | 256 | tb = SummaryWriter() 257 | 258 | EPOCHS=200 259 | DATASET_SIZE = X_train.shape[0] 260 | BATCH_SIZE = 32 261 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 262 | print('Selected device is {}'.format(device)) 263 | model = HybridModel(num_emotions=len(EMOTIONS)).to(device) 264 | print('Number of trainable params: ',sum(p.numel() for p in model.parameters())) 265 | OPTIMIZER = torch.optim.SGD(model.parameters(),lr=0.01, weight_decay=1e-3, momentum=0.8) 266 | 267 | train_step = make_train_step(model, loss_fnc, optimizer=OPTIMIZER) 268 | validate = make_validate_fnc(model,loss_fnc) 269 | losses=[] 270 | val_losses = [] 271 | for epoch in range(EPOCHS): 272 | # schuffle data 273 | ind = np.random.permutation(DATASET_SIZE) 274 | X_train = X_train[ind,:,:,:] 275 | Y_train = Y_train[ind] 276 | epoch_acc = 0 277 | epoch_loss = 0 278 | iters = int(DATASET_SIZE / BATCH_SIZE) 279 | for i in range(iters): 280 | batch_start = i * BATCH_SIZE 281 | batch_end = min(batch_start + BATCH_SIZE, DATASET_SIZE) 282 | actual_batch_size = batch_end-batch_start 283 | X = X_train[batch_start:batch_end,:,:,:] 284 | Y = Y_train[batch_start:batch_end] 285 | X_tensor = torch.tensor(X,device=device).float() 286 | Y_tensor = torch.tensor(Y, dtype=torch.long,device=device) 287 | loss, acc = train_step(X_tensor,Y_tensor) 288 | epoch_acc += acc*actual_batch_size/DATASET_SIZE 289 | epoch_loss += loss*actual_batch_size/DATASET_SIZE 290 | print(f"\r Epoch {epoch}: iteration {i}/{iters}",end='') 291 | X_val_tensor = torch.tensor(X_test,device=device).float() 292 | Y_val_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 293 | val_loss, val_acc, predictions = validate(X_val_tensor,Y_val_tensor) 294 | losses.append(epoch_loss) 295 | val_losses.append(val_loss) 296 | tb.add_scalar("Training Loss", epoch_loss, epoch) 297 | tb.add_scalar("Training Accuracy", epoch_acc, epoch) 298 | tb.add_scalar("Validation Loss", val_loss, epoch) 299 | tb.add_scalar("Validation Accuracy", val_acc, epoch) 300 | print('') 301 | print(f"Epoch {epoch} --> loss:{epoch_loss:.4f}, acc:{epoch_acc:.2f}%, val_loss:{val_loss:.4f}, val_acc:{val_acc:.2f}%", flush=True) 302 | if val_acc > 96.5: 303 | torch.save(model.state_dict(), os.path.join(SAVE_PATH, model_name_1), _use_new_zipfile_serialization=False) 304 | 305 | tb.flush() 306 | SAVE_PATH = os.path.join(os.getcwd(),'models') 307 | os.makedirs('models',exist_ok=True) 308 | torch.save(model.state_dict(),os.path.join(SAVE_PATH,model_name),_use_new_zipfile_serialization=False) 309 | print('Model is saved to {}'.format(os.path.join(SAVE_PATH,model_name))) 310 | 311 | LOAD_PATH = os.path.join(os.getcwd(),'models') 312 | model = HybridModel(len(EMOTIONS)) 313 | model.load_state_dict(torch.load(os.path.join(LOAD_PATH,model_name))) 314 | print('Model is loaded from {}'.format(os.path.join(LOAD_PATH,model_name))) 315 | 316 | X_test_tensor = torch.tensor(X_test,device=device).float() 317 | Y_test_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 318 | test_loss, test_acc, predictions = validate(X_test_tensor,Y_test_tensor) 319 | print(f'Test loss is {test_loss:.3f}') 320 | print(f'Test accuracy is {test_acc:.2f}%') 321 | 322 | predictions = predictions.cpu().numpy() 323 | cm = confusion_matrix(Y_test, 
predictions) 324 | names = [EMOTIONS[ind] for ind in range(len(EMOTIONS))] 325 | df_cm = pd.DataFrame(cm, index=names, columns=names) 326 | # plt.figure(figsize=(10,7)) 327 | sn.set(font_scale=1.4) # for label size 328 | sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size 329 | plt.savefig('confusion_matrix.png') # save before show(), otherwise the figure may be cleared and an empty image written 330 | plt.show() 331 | 332 | --------------------------------------------------------------------------------
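For reference, a minimal sketch of how the trained SER model above could be applied to a single utterance, reusing `getMELspectrogram`, `splitIntoChunks`, `HybridModel` and `EMOTIONS` exactly as they are defined in train_ser.py (e.g. by appending this to the end of that script). The wav path and the 16 kHz sample rate are placeholders, the chunking parameters match the ones used during training, and the clip is assumed to be at least 128 mel frames long; this is an illustrative sketch, not part of the released code.

```Python
# Appended after the evaluation code in train_ser.py; relies on the names defined there.
import numpy as np
import torch
import librosa

def predict_emotion(wav_path, model, device, sample_rate=16000):
    """Return the predicted emotion label for one utterance (assumes >= 128 mel frames)."""
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    mel = getMELspectrogram(audio, sr)                        # [80, T]
    chunks = splitIntoChunks(mel.T, win_size=128, stride=64)  # [num_chunks, 80, 128]
    # add batch and channel axes -> [1, num_chunks, 1, 80, 128]
    x = torch.tensor(chunks[np.newaxis, :, np.newaxis, :, :], device=device).float()
    model.eval()
    with torch.no_grad():
        _, softmax = model(x)
    return EMOTIONS[int(torch.argmax(softmax, dim=1).item())]

# Example (hypothetical path):
# print(predict_emotion('0013_000701.wav', model, device))
```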