├── README.md ├── Relative Attributes ├── 0013_000701.csv ├── 0013_Neutral_Angry.csv ├── BayesClass_RelAtt.m ├── BayesClass_RelAtt_unseen.m ├── Create_O_and_S_Mats_2D.m ├── GetTrainingSample_per_category.m ├── classlabel.csv ├── main.m ├── meanandvar_forcat.m ├── pre-processing.py ├── ranksvm_with_sim.m └── used_for_training_kun.csv ├── codes ├── acrnn_test.py ├── checkpoint │ └── checkpoint_5900 ├── distributed.py ├── hparams.py ├── hparams_1.py ├── hparams_update.py ├── inference.py ├── logger.py ├── logger_original.py ├── lstm_test.py ├── model │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── basic_layers.cpython-36.pyc │ │ ├── beam.cpython-36.pyc │ │ ├── decoder.cpython-36.pyc │ │ ├── layers.cpython-36.pyc │ │ ├── loss.cpython-36.pyc │ │ ├── lstm_test.cpython-36.pyc │ │ ├── model.cpython-36.pyc │ │ ├── penalties.cpython-36.pyc │ │ └── utils.cpython-36.pyc │ ├── basic_layers.py │ ├── beam.py │ ├── decoder.py │ ├── layers.py │ ├── loss.py │ ├── lstm_test.py │ ├── model.py │ ├── penalties.py │ ├── ser.py │ └── utils.py ├── multiproc.py ├── plotting_utils.py ├── reader │ ├── evaluation_spec_list.txt │ └── training_mel_list.txt ├── train.py └── train_ser.py ├── stage3_update.png └── train_ser.py /README.md: -------------------------------------------------------------------------------- 1 | # Emovox 2 | This is the implementation of the paper "Emotion Intensity and its Control for Emotional Voice Conversion". 3 | 4 | ![image info](./stage3_update.png) 5 | 6 | ## Database: 7 | We use the ESD database, an emotional speech database that can be downloaded here: https://hltsingapore.github.io/ESD/. In this paper, we choose speaker "0013" for all the experiments. To run the code, you first need to customize your data paths and generate phoneme transcriptions with Festival. More details can be found at https://github.com/jxzhanggg/nonparaSeq2seqVC_code. 8 | 9 | 10 | ## Step 1: Learning relative attributes 11 | 12 | ### 1) Extracting openSMILE features 13 | 14 | ```Bash 15 | python pre-processing.py 16 | ``` 17 | 18 | ### 2) Training the relative ranking function 19 | 20 | ```Matlab 21 | main.m 22 | ``` 23 | 24 | ## Step 2: Emotion recognizer training 25 | 26 | ```Bash 27 | python train_ser.py 28 | ``` 29 | 30 | ## Step 3: Emovox training 31 | 32 | ### 1) Style Pre-training 33 | 34 | You need to download the VCTK corpus and customize it accordingly, and then perform feature extraction: 35 | ```Bash 36 | $ cd reader 37 | $ python extract_features.py (please customize "path" and "kind", and edit the codes for "spec" or "mel-spec") 38 | $ python generate_list_mel.py 39 | ``` 40 | 41 | The pre-training procedure is the same as the pre-training in https://github.com/jxzhanggg/nonparaSeq2seqVC_code. You can download the pre-trained models from Stage I: Style Initialization here: https://drive.google.com/file/d/1oqk-PSREwpFNTyeREwcUry13WZ1LYl6U/view?usp=sharing. With the released pre-trained models, you can directly perform Stage II: Emotion Training. If you would like to pre-train it by yourself, you can try the following: 42 | ```Bash 43 | $ python train.py -l logdir \ 44 | -o outdir --n_gpus=1 --hparams=speaker_adversial_loss_w=20.,ce_loss=False,speaker_classifier_loss_w=0.1,contrastive_loss_w=30.
45 | ``` 46 | 47 | ### 2) Emotion training 48 | 49 | You need to download ESD corpus and customize it accordingly, and then perform feature extraction: 50 | ```Bash 51 | $ cd reader 52 | $ python extract.py (please customize "path" and "kind", and edit the codes for "spec" or "mel-spec") 53 | $ python generate_list_mel.py 54 | ``` 55 | 56 | ```Bash 57 | $ python train.py -l logdir \ 58 | -o outdir_emotion_IS --n_gpus=1 -c '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir/checkpoint_234000 (The path to your Pre-trained models from Stage I)' --warm_start 59 | ``` 60 | 61 | ## Step 4: Run-time conversion 62 | 63 | (1) Generate emotion embedding from the emotion encoder: 64 | 65 | Please remember to customize the paths in hparam.py... 66 | ```Bash 67 | $ cd conversion 68 | $ python inference_embedding.py -c '/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/outdir_emotion_update/checkpoint_3200 [YOUR EMOTION TRAINING CHECKPOINT]' --hparams speaker_A='Neutral',speaker_B='Happy',speaker_C='Sad',speaker_D='Angry',training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/reader/emotion_list/testing_mel_list.txt',SC_kernel_size=1 69 | ``` 70 | (2) Convert the source speech to the target emotion: [FOR EXAMPLE: convert emotion D to emotion A] 71 | ```Bash 72 | $ cd conversion 73 | $ python inference_A.py -c '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir_emotion_update/checkpoint_3200[YOUR EMOTION TRAINING CHECKPOINT]' --num 20 --hparams validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/fine-tune/reader/emotion_list/evaluation_mel_list.txt',SC_kernel_size=1 74 | ``` 75 | Please customize inference.py to generate your intended emotion type. 76 | 77 | 78 | ## Training log 79 | 80 | # Still under construction ... 81 | -------------------------------------------------------------------------------- /Relative Attributes/0013_000701.csv: -------------------------------------------------------------------------------- 1 | 
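The CSV that follows, 0013_000701.csv, is the raw openSMILE IS09 output for a single utterance (384 feature statistics plus the name/frameTime columns). main.m expects these per-utterance rows stacked into one matrix per speaker and emotion pair (e.g. 0013_Neutral_Angry.csv, 700 utterances x 384 features); the stacking script itself is not included in the repository, so the following is only a minimal sketch, assuming the output/<speaker>/<emotion>/ layout written by pre-processing.py (all paths and file names are placeholders):

```Matlab
% Minimal sketch (not part of the released code): stack the per-utterance
% IS09 feature vectors of two emotions into one matrix for main.m.
% The folder layout and file names are assumptions based on pre-processing.py.
out_root = '/Users/kun/Desktop/workspace/output/0013';   % assumed output root of pre-processing.py
emotions = {'Neutral', 'Angry'};                         % 350 utterances per emotion in ESD
features = [];
for e = 1:numel(emotions)
    files = dir(fullfile(out_root, emotions{e}, '*.csv'));
    for f = 1:numel(files)
        T = readtable(fullfile(files(f).folder, files(f).name), 'Delimiter', ';');
        features = [features; T{1, 3:end}];   % drop the name/frameTime columns
    end
end
csvwrite('0013_Neutral_Angry_features.csv', features);   % 700 x 384 numeric matrix
```

This sketch writes a purely numeric matrix; the pair files shipped with the repository keep a header row and two leading columns, which is why main.m reads them with csvread(..., 1, 2). Adjust the read offsets to whichever format you generate.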
name;frameTime;pcm_RMSenergy_sma_max;pcm_RMSenergy_sma_min;pcm_RMSenergy_sma_range;pcm_RMSenergy_sma_maxPos;pcm_RMSenergy_sma_minPos;pcm_RMSenergy_sma_amean;pcm_RMSenergy_sma_linregc1;pcm_RMSenergy_sma_linregc2;pcm_RMSenergy_sma_linregerrQ;pcm_RMSenergy_sma_stddev;pcm_RMSenergy_sma_skewness;pcm_RMSenergy_sma_kurtosis;pcm_fftMag_mfcc_sma[1]_max;pcm_fftMag_mfcc_sma[1]_min;pcm_fftMag_mfcc_sma[1]_range;pcm_fftMag_mfcc_sma[1]_maxPos;pcm_fftMag_mfcc_sma[1]_minPos;pcm_fftMag_mfcc_sma[1]_amean;pcm_fftMag_mfcc_sma[1]_linregc1;pcm_fftMag_mfcc_sma[1]_linregc2;pcm_fftMag_mfcc_sma[1]_linregerrQ;pcm_fftMag_mfcc_sma[1]_stddev;pcm_fftMag_mfcc_sma[1]_skewness;pcm_fftMag_mfcc_sma[1]_kurtosis;pcm_fftMag_mfcc_sma[2]_max;pcm_fftMag_mfcc_sma[2]_min;pcm_fftMag_mfcc_sma[2]_range;pcm_fftMag_mfcc_sma[2]_maxPos;pcm_fftMag_mfcc_sma[2]_minPos;pcm_fftMag_mfcc_sma[2]_amean;pcm_fftMag_mfcc_sma[2]_linregc1;pcm_fftMag_mfcc_sma[2]_linregc2;pcm_fftMag_mfcc_sma[2]_linregerrQ;pcm_fftMag_mfcc_sma[2]_stddev;pcm_fftMag_mfcc_sma[2]_skewness;pcm_fftMag_mfcc_sma[2]_kurtosis;pcm_fftMag_mfcc_sma[3]_max;pcm_fftMag_mfcc_sma[3]_min;pcm_fftMag_mfcc_sma[3]_range;pcm_fftMag_mfcc_sma[3]_maxPos;pcm_fftMag_mfcc_sma[3]_minPos;pcm_fftMag_mfcc_sma[3]_amean;pcm_fftMag_mfcc_sma[3]_linregc1;pcm_fftMag_mfcc_sma[3]_linregc2;pcm_fftMag_mfcc_sma[3]_linregerrQ;pcm_fftMag_mfcc_sma[3]_stddev;pcm_fftMag_mfcc_sma[3]_skewness;pcm_fftMag_mfcc_sma[3]_kurtosis;pcm_fftMag_mfcc_sma[4]_max;pcm_fftMag_mfcc_sma[4]_min;pcm_fftMag_mfcc_sma[4]_range;pcm_fftMag_mfcc_sma[4]_maxPos;pcm_fftMag_mfcc_sma[4]_minPos;pcm_fftMag_mfcc_sma[4]_amean;pcm_fftMag_mfcc_sma[4]_linregc1;pcm_fftMag_mfcc_sma[4]_linregc2;pcm_fftMag_mfcc_sma[4]_linregerrQ;pcm_fftMag_mfcc_sma[4]_stddev;pcm_fftMag_mfcc_sma[4]_skewness;pcm_fftMag_mfcc_sma[4]_kurtosis;pcm_fftMag_mfcc_sma[5]_max;pcm_fftMag_mfcc_sma[5]_min;pcm_fftMag_mfcc_sma[5]_range;pcm_fftMag_mfcc_sma[5]_maxPos;pcm_fftMag_mfcc_sma[5]_minPos;pcm_fftMag_mfcc_sma[5]_amean;pcm_fftMag_mfcc_sma[5]_linregc1;pcm_fftMag_mfcc_sma[5]_linregc2;pcm_fftMag_mfcc_sma[5]_linregerrQ;pcm_fftMag_mfcc_sma[5]_stddev;pcm_fftMag_mfcc_sma[5]_skewness;pcm_fftMag_mfcc_sma[5]_kurtosis;pcm_fftMag_mfcc_sma[6]_max;pcm_fftMag_mfcc_sma[6]_min;pcm_fftMag_mfcc_sma[6]_range;pcm_fftMag_mfcc_sma[6]_maxPos;pcm_fftMag_mfcc_sma[6]_minPos;pcm_fftMag_mfcc_sma[6]_amean;pcm_fftMag_mfcc_sma[6]_linregc1;pcm_fftMag_mfcc_sma[6]_linregc2;pcm_fftMag_mfcc_sma[6]_linregerrQ;pcm_fftMag_mfcc_sma[6]_stddev;pcm_fftMag_mfcc_sma[6]_skewness;pcm_fftMag_mfcc_sma[6]_kurtosis;pcm_fftMag_mfcc_sma[7]_max;pcm_fftMag_mfcc_sma[7]_min;pcm_fftMag_mfcc_sma[7]_range;pcm_fftMag_mfcc_sma[7]_maxPos;pcm_fftMag_mfcc_sma[7]_minPos;pcm_fftMag_mfcc_sma[7]_amean;pcm_fftMag_mfcc_sma[7]_linregc1;pcm_fftMag_mfcc_sma[7]_linregc2;pcm_fftMag_mfcc_sma[7]_linregerrQ;pcm_fftMag_mfcc_sma[7]_stddev;pcm_fftMag_mfcc_sma[7]_skewness;pcm_fftMag_mfcc_sma[7]_kurtosis;pcm_fftMag_mfcc_sma[8]_max;pcm_fftMag_mfcc_sma[8]_min;pcm_fftMag_mfcc_sma[8]_range;pcm_fftMag_mfcc_sma[8]_maxPos;pcm_fftMag_mfcc_sma[8]_minPos;pcm_fftMag_mfcc_sma[8]_amean;pcm_fftMag_mfcc_sma[8]_linregc1;pcm_fftMag_mfcc_sma[8]_linregc2;pcm_fftMag_mfcc_sma[8]_linregerrQ;pcm_fftMag_mfcc_sma[8]_stddev;pcm_fftMag_mfcc_sma[8]_skewness;pcm_fftMag_mfcc_sma[8]_kurtosis;pcm_fftMag_mfcc_sma[9]_max;pcm_fftMag_mfcc_sma[9]_min;pcm_fftMag_mfcc_sma[9]_range;pcm_fftMag_mfcc_sma[9]_maxPos;pcm_fftMag_mfcc_sma[9]_minPos;pcm_fftMag_mfcc_sma[9]_amean;pcm_fftMag_mfcc_sma[9]_linregc1;pcm_fftMag_mfcc_sma[9]_linregc2;pcm_fftMag_mfcc_sma[9]_linregerrQ;pcm_fftMag_mfcc_sma[9]_stddev;pcm_fftMag_mfcc_sma[9]_s
kewness;pcm_fftMag_mfcc_sma[9]_kurtosis;pcm_fftMag_mfcc_sma[10]_max;pcm_fftMag_mfcc_sma[10]_min;pcm_fftMag_mfcc_sma[10]_range;pcm_fftMag_mfcc_sma[10]_maxPos;pcm_fftMag_mfcc_sma[10]_minPos;pcm_fftMag_mfcc_sma[10]_amean;pcm_fftMag_mfcc_sma[10]_linregc1;pcm_fftMag_mfcc_sma[10]_linregc2;pcm_fftMag_mfcc_sma[10]_linregerrQ;pcm_fftMag_mfcc_sma[10]_stddev;pcm_fftMag_mfcc_sma[10]_skewness;pcm_fftMag_mfcc_sma[10]_kurtosis;pcm_fftMag_mfcc_sma[11]_max;pcm_fftMag_mfcc_sma[11]_min;pcm_fftMag_mfcc_sma[11]_range;pcm_fftMag_mfcc_sma[11]_maxPos;pcm_fftMag_mfcc_sma[11]_minPos;pcm_fftMag_mfcc_sma[11]_amean;pcm_fftMag_mfcc_sma[11]_linregc1;pcm_fftMag_mfcc_sma[11]_linregc2;pcm_fftMag_mfcc_sma[11]_linregerrQ;pcm_fftMag_mfcc_sma[11]_stddev;pcm_fftMag_mfcc_sma[11]_skewness;pcm_fftMag_mfcc_sma[11]_kurtosis;pcm_fftMag_mfcc_sma[12]_max;pcm_fftMag_mfcc_sma[12]_min;pcm_fftMag_mfcc_sma[12]_range;pcm_fftMag_mfcc_sma[12]_maxPos;pcm_fftMag_mfcc_sma[12]_minPos;pcm_fftMag_mfcc_sma[12]_amean;pcm_fftMag_mfcc_sma[12]_linregc1;pcm_fftMag_mfcc_sma[12]_linregc2;pcm_fftMag_mfcc_sma[12]_linregerrQ;pcm_fftMag_mfcc_sma[12]_stddev;pcm_fftMag_mfcc_sma[12]_skewness;pcm_fftMag_mfcc_sma[12]_kurtosis;pcm_zcr_sma_max;pcm_zcr_sma_min;pcm_zcr_sma_range;pcm_zcr_sma_maxPos;pcm_zcr_sma_minPos;pcm_zcr_sma_amean;pcm_zcr_sma_linregc1;pcm_zcr_sma_linregc2;pcm_zcr_sma_linregerrQ;pcm_zcr_sma_stddev;pcm_zcr_sma_skewness;pcm_zcr_sma_kurtosis;voiceProb_sma_max;voiceProb_sma_min;voiceProb_sma_range;voiceProb_sma_maxPos;voiceProb_sma_minPos;voiceProb_sma_amean;voiceProb_sma_linregc1;voiceProb_sma_linregc2;voiceProb_sma_linregerrQ;voiceProb_sma_stddev;voiceProb_sma_skewness;voiceProb_sma_kurtosis;F0_sma_max;F0_sma_min;F0_sma_range;F0_sma_maxPos;F0_sma_minPos;F0_sma_amean;F0_sma_linregc1;F0_sma_linregc2;F0_sma_linregerrQ;F0_sma_stddev;F0_sma_skewness;F0_sma_kurtosis;pcm_RMSenergy_sma_de_max;pcm_RMSenergy_sma_de_min;pcm_RMSenergy_sma_de_range;pcm_RMSenergy_sma_de_maxPos;pcm_RMSenergy_sma_de_minPos;pcm_RMSenergy_sma_de_amean;pcm_RMSenergy_sma_de_linregc1;pcm_RMSenergy_sma_de_linregc2;pcm_RMSenergy_sma_de_linregerrQ;pcm_RMSenergy_sma_de_stddev;pcm_RMSenergy_sma_de_skewness;pcm_RMSenergy_sma_de_kurtosis;pcm_fftMag_mfcc_sma_de[1]_max;pcm_fftMag_mfcc_sma_de[1]_min;pcm_fftMag_mfcc_sma_de[1]_range;pcm_fftMag_mfcc_sma_de[1]_maxPos;pcm_fftMag_mfcc_sma_de[1]_minPos;pcm_fftMag_mfcc_sma_de[1]_amean;pcm_fftMag_mfcc_sma_de[1]_linregc1;pcm_fftMag_mfcc_sma_de[1]_linregc2;pcm_fftMag_mfcc_sma_de[1]_linregerrQ;pcm_fftMag_mfcc_sma_de[1]_stddev;pcm_fftMag_mfcc_sma_de[1]_skewness;pcm_fftMag_mfcc_sma_de[1]_kurtosis;pcm_fftMag_mfcc_sma_de[2]_max;pcm_fftMag_mfcc_sma_de[2]_min;pcm_fftMag_mfcc_sma_de[2]_range;pcm_fftMag_mfcc_sma_de[2]_maxPos;pcm_fftMag_mfcc_sma_de[2]_minPos;pcm_fftMag_mfcc_sma_de[2]_amean;pcm_fftMag_mfcc_sma_de[2]_linregc1;pcm_fftMag_mfcc_sma_de[2]_linregc2;pcm_fftMag_mfcc_sma_de[2]_linregerrQ;pcm_fftMag_mfcc_sma_de[2]_stddev;pcm_fftMag_mfcc_sma_de[2]_skewness;pcm_fftMag_mfcc_sma_de[2]_kurtosis;pcm_fftMag_mfcc_sma_de[3]_max;pcm_fftMag_mfcc_sma_de[3]_min;pcm_fftMag_mfcc_sma_de[3]_range;pcm_fftMag_mfcc_sma_de[3]_maxPos;pcm_fftMag_mfcc_sma_de[3]_minPos;pcm_fftMag_mfcc_sma_de[3]_amean;pcm_fftMag_mfcc_sma_de[3]_linregc1;pcm_fftMag_mfcc_sma_de[3]_linregc2;pcm_fftMag_mfcc_sma_de[3]_linregerrQ;pcm_fftMag_mfcc_sma_de[3]_stddev;pcm_fftMag_mfcc_sma_de[3]_skewness;pcm_fftMag_mfcc_sma_de[3]_kurtosis;pcm_fftMag_mfcc_sma_de[4]_max;pcm_fftMag_mfcc_sma_de[4]_min;pcm_fftMag_mfcc_sma_de[4]_range;pcm_fftMag_mfcc_sma_de[4]_maxPos;pcm_fftMag_mfcc_sma_de[4]_minPos;pcm_fftMag_mfcc_sma_de[4]_ame
an;pcm_fftMag_mfcc_sma_de[4]_linregc1;pcm_fftMag_mfcc_sma_de[4]_linregc2;pcm_fftMag_mfcc_sma_de[4]_linregerrQ;pcm_fftMag_mfcc_sma_de[4]_stddev;pcm_fftMag_mfcc_sma_de[4]_skewness;pcm_fftMag_mfcc_sma_de[4]_kurtosis;pcm_fftMag_mfcc_sma_de[5]_max;pcm_fftMag_mfcc_sma_de[5]_min;pcm_fftMag_mfcc_sma_de[5]_range;pcm_fftMag_mfcc_sma_de[5]_maxPos;pcm_fftMag_mfcc_sma_de[5]_minPos;pcm_fftMag_mfcc_sma_de[5]_amean;pcm_fftMag_mfcc_sma_de[5]_linregc1;pcm_fftMag_mfcc_sma_de[5]_linregc2;pcm_fftMag_mfcc_sma_de[5]_linregerrQ;pcm_fftMag_mfcc_sma_de[5]_stddev;pcm_fftMag_mfcc_sma_de[5]_skewness;pcm_fftMag_mfcc_sma_de[5]_kurtosis;pcm_fftMag_mfcc_sma_de[6]_max;pcm_fftMag_mfcc_sma_de[6]_min;pcm_fftMag_mfcc_sma_de[6]_range;pcm_fftMag_mfcc_sma_de[6]_maxPos;pcm_fftMag_mfcc_sma_de[6]_minPos;pcm_fftMag_mfcc_sma_de[6]_amean;pcm_fftMag_mfcc_sma_de[6]_linregc1;pcm_fftMag_mfcc_sma_de[6]_linregc2;pcm_fftMag_mfcc_sma_de[6]_linregerrQ;pcm_fftMag_mfcc_sma_de[6]_stddev;pcm_fftMag_mfcc_sma_de[6]_skewness;pcm_fftMag_mfcc_sma_de[6]_kurtosis;pcm_fftMag_mfcc_sma_de[7]_max;pcm_fftMag_mfcc_sma_de[7]_min;pcm_fftMag_mfcc_sma_de[7]_range;pcm_fftMag_mfcc_sma_de[7]_maxPos;pcm_fftMag_mfcc_sma_de[7]_minPos;pcm_fftMag_mfcc_sma_de[7]_amean;pcm_fftMag_mfcc_sma_de[7]_linregc1;pcm_fftMag_mfcc_sma_de[7]_linregc2;pcm_fftMag_mfcc_sma_de[7]_linregerrQ;pcm_fftMag_mfcc_sma_de[7]_stddev;pcm_fftMag_mfcc_sma_de[7]_skewness;pcm_fftMag_mfcc_sma_de[7]_kurtosis;pcm_fftMag_mfcc_sma_de[8]_max;pcm_fftMag_mfcc_sma_de[8]_min;pcm_fftMag_mfcc_sma_de[8]_range;pcm_fftMag_mfcc_sma_de[8]_maxPos;pcm_fftMag_mfcc_sma_de[8]_minPos;pcm_fftMag_mfcc_sma_de[8]_amean;pcm_fftMag_mfcc_sma_de[8]_linregc1;pcm_fftMag_mfcc_sma_de[8]_linregc2;pcm_fftMag_mfcc_sma_de[8]_linregerrQ;pcm_fftMag_mfcc_sma_de[8]_stddev;pcm_fftMag_mfcc_sma_de[8]_skewness;pcm_fftMag_mfcc_sma_de[8]_kurtosis;pcm_fftMag_mfcc_sma_de[9]_max;pcm_fftMag_mfcc_sma_de[9]_min;pcm_fftMag_mfcc_sma_de[9]_range;pcm_fftMag_mfcc_sma_de[9]_maxPos;pcm_fftMag_mfcc_sma_de[9]_minPos;pcm_fftMag_mfcc_sma_de[9]_amean;pcm_fftMag_mfcc_sma_de[9]_linregc1;pcm_fftMag_mfcc_sma_de[9]_linregc2;pcm_fftMag_mfcc_sma_de[9]_linregerrQ;pcm_fftMag_mfcc_sma_de[9]_stddev;pcm_fftMag_mfcc_sma_de[9]_skewness;pcm_fftMag_mfcc_sma_de[9]_kurtosis;pcm_fftMag_mfcc_sma_de[10]_max;pcm_fftMag_mfcc_sma_de[10]_min;pcm_fftMag_mfcc_sma_de[10]_range;pcm_fftMag_mfcc_sma_de[10]_maxPos;pcm_fftMag_mfcc_sma_de[10]_minPos;pcm_fftMag_mfcc_sma_de[10]_amean;pcm_fftMag_mfcc_sma_de[10]_linregc1;pcm_fftMag_mfcc_sma_de[10]_linregc2;pcm_fftMag_mfcc_sma_de[10]_linregerrQ;pcm_fftMag_mfcc_sma_de[10]_stddev;pcm_fftMag_mfcc_sma_de[10]_skewness;pcm_fftMag_mfcc_sma_de[10]_kurtosis;pcm_fftMag_mfcc_sma_de[11]_max;pcm_fftMag_mfcc_sma_de[11]_min;pcm_fftMag_mfcc_sma_de[11]_range;pcm_fftMag_mfcc_sma_de[11]_maxPos;pcm_fftMag_mfcc_sma_de[11]_minPos;pcm_fftMag_mfcc_sma_de[11]_amean;pcm_fftMag_mfcc_sma_de[11]_linregc1;pcm_fftMag_mfcc_sma_de[11]_linregc2;pcm_fftMag_mfcc_sma_de[11]_linregerrQ;pcm_fftMag_mfcc_sma_de[11]_stddev;pcm_fftMag_mfcc_sma_de[11]_skewness;pcm_fftMag_mfcc_sma_de[11]_kurtosis;pcm_fftMag_mfcc_sma_de[12]_max;pcm_fftMag_mfcc_sma_de[12]_min;pcm_fftMag_mfcc_sma_de[12]_range;pcm_fftMag_mfcc_sma_de[12]_maxPos;pcm_fftMag_mfcc_sma_de[12]_minPos;pcm_fftMag_mfcc_sma_de[12]_amean;pcm_fftMag_mfcc_sma_de[12]_linregc1;pcm_fftMag_mfcc_sma_de[12]_linregc2;pcm_fftMag_mfcc_sma_de[12]_linregerrQ;pcm_fftMag_mfcc_sma_de[12]_stddev;pcm_fftMag_mfcc_sma_de[12]_skewness;pcm_fftMag_mfcc_sma_de[12]_kurtosis;pcm_zcr_sma_de_max;pcm_zcr_sma_de_min;pcm_zcr_sma_de_range;pcm_zcr_sma_de_maxPos;pcm_zcr_sma_de_minPos;pcm
_zcr_sma_de_amean;pcm_zcr_sma_de_linregc1;pcm_zcr_sma_de_linregc2;pcm_zcr_sma_de_linregerrQ;pcm_zcr_sma_de_stddev;pcm_zcr_sma_de_skewness;pcm_zcr_sma_de_kurtosis;voiceProb_sma_de_max;voiceProb_sma_de_min;voiceProb_sma_de_range;voiceProb_sma_de_maxPos;voiceProb_sma_de_minPos;voiceProb_sma_de_amean;voiceProb_sma_de_linregc1;voiceProb_sma_de_linregc2;voiceProb_sma_de_linregerrQ;voiceProb_sma_de_stddev;voiceProb_sma_de_skewness;voiceProb_sma_de_kurtosis;F0_sma_de_max;F0_sma_de_min;F0_sma_de_range;F0_sma_de_maxPos;F0_sma_de_minPos;F0_sma_de_amean;F0_sma_de_linregc1;F0_sma_de_linregc2;F0_sma_de_linregerrQ;F0_sma_de_stddev;F0_sma_de_skewness;F0_sma_de_kurtosis 2 | '0013_000701';0.000000;3.369850e-02;1.735465e-05;3.368114e-02;128;1;7.375638e-03;9.669271e-07;7.274111e-03;6.286126e-05;7.928728e-03;1.231967e+00;3.903731e+00;-4.289949e-01;-2.965781e+01;2.922882e+01;86;119;-1.156513e+01;-9.081191e-03;-1.061160e+01;4.861806e+01;6.994570e+00;-4.087515e-01;2.463248e+00;1.203464e+01;-2.046483e+01;3.249947e+01;118;128;-4.140470e+00;1.807947e-02;-6.038815e+00;3.540856e+01;6.051548e+00;-1.875589e-01;3.351907e+00;1.955154e+01;-1.310266e+01;3.265420e+01;175;65;1.407256e+00;4.926669e-02;-3.765746e+00;5.113353e+01;7.754900e+00;4.629310e-01;2.395122e+00;1.685727e+01;-1.567376e+01;3.253102e+01;86;65;-1.302843e+00;3.730469e-03;-1.694542e+00;5.699306e+01;7.552794e+00;4.020537e-01;2.447639e+00;4.721776e+00;-3.279273e+01;3.751450e+01;194;83;-1.143948e+01;3.856772e-03;-1.184444e+01;9.055809e+01;9.519101e+00;-3.643253e-01;1.896232e+00;2.670732e+00;-2.272833e+01;2.539906e+01;16;132;-7.037078e+00;-6.061968e-03;-6.400571e+00;2.522136e+01;5.035643e+00;-5.853465e-01;2.856317e+00;2.703540e+00;-3.120908e+01;3.391262e+01;29;41;-1.052643e+01;-3.276756e-03;-1.018237e+01;4.950270e+01;7.038646e+00;-6.248007e-01;3.172013e+00;2.142991e+01;-3.905996e+01;6.048987e+01;103;160;-6.943793e+00;-2.189371e-02;-4.644953e+00;1.432078e+02;1.204102e+01;-5.093850e-01;3.094743e+00;8.956097e+00;-2.533958e+01;3.429568e+01;197;89;-5.922431e+00;1.756681e-04;-5.940876e+00;4.861909e+01;6.972747e+00;-3.561003e-01;2.635875e+00;1.521178e+01;-1.677612e+01;3.198790e+01;126;164;-1.061801e+00;-3.271530e-02;2.373305e+00;3.841283e+01;6.510269e+00;6.487774e-02;3.377592e+00;1.704618e+01;-2.677855e+01;4.382473e+01;108;163;-4.708352e+00;-2.559226e-02;-2.021164e+00;7.319276e+01;8.696130e+00;-1.842328e-01;3.034668e+00;9.610009e+00;-1.632896e+01;2.593897e+01;203;81;-4.018893e+00;1.434067e-02;-5.524663e+00;3.414647e+01;5.908422e+00;-5.604835e-02;2.275285e+00;7.458333e-01;1.833333e-02;7.275000e-01;118;92;1.568918e-01;-9.117536e-04;2.526259e-01;2.999283e-02;1.818707e-01;1.855961e+00;5.140184e+00;8.273448e-01;1.171804e-01;7.101644e-01;74;17;3.982598e-01;1.266134e-04;3.849654e-01;3.866918e-02;1.967960e-01;4.884707e-01;2.201182e+00;3.117144e+02;0;3.117144e+02;170;0;5.710489e+01;9.207127e-03;5.613814e+01;1.014255e+04;1.007118e+02;1.458309e+00;3.494218e+00;6.368165e-03;-5.066753e-03;1.143492e-02;115;132;6.224651e-07;-1.965598e-06;2.070103e-04;2.720752e-06;1.653809e-03;2.595058e-01;5.316820e+00;6.426627e+00;-6.035265e+00;1.246189e+01;122;115;3.675509e-02;-1.927535e-03;2.391463e-01;2.008664e+00;1.422128e+00;1.150463e-01;8.981321e+00;3.935276e+00;-6.386956e+00;1.032223e+01;116;123;1.665787e-02;1.901351e-04;-3.306316e-03;2.443964e+00;1.563361e+00;-5.182502e-01;4.893812e+00;3.989394e+00;-4.547331e+00;8.536725e+00;89;113;8.128021e-03;-1.953440e-03;2.132392e-01;2.024084e+00;1.427670e+00;-2.068333e-01;3.657849e+00;6.170218e+00;-4.438015e+00;1.060823e+01;43;54;-3.623639e-03;1.559478e-05;-
5.261091e-03;3.224394e+00;1.795660e+00;7.156055e-01;4.212295e+00;3.811246e+00;-4.006615e+00;7.817862e+00;176;72;1.258776e-03;2.742066e-03;-2.866582e-01;2.325907e+00;1.534210e+00;9.629573e-02;3.240988e+00;5.007465e+00;-3.484175e+00;8.491640e+00;133;129;-6.047554e-03;1.326532e-03;-1.453335e-01;1.947959e+00;1.398030e+00;2.221200e-01;3.902068e+00;4.570399e+00;-5.748763e+00;1.031916e+01;43;34;7.832383e-04;2.075908e-03;-2.171871e-01;2.939625e+00;1.719190e+00;-2.689770e-01;3.820803e+00;6.299998e+00;-7.817480e+00;1.411748e+01;99;42;-6.132154e-03;1.437077e-03;-1.570253e-01;6.091033e+00;2.469553e+00;-4.642484e-01;3.740222e+00;4.501269e+00;-3.964148e+00;8.465417e+00;110;85;7.161017e-03;1.012676e-03;-9.916998e-02;2.205007e+00;1.486207e+00;-2.858680e-03;3.461223e+00;3.874099e+00;-3.938681e+00;7.812780e+00;44;54;9.058981e-03;2.090432e-04;-1.289056e-02;2.445675e+00;1.563917e+00;-1.241545e-01;2.777863e+00;3.896590e+00;-3.824652e+00;7.721242e+00;141;54;1.654875e-02;1.875372e-03;-1.803653e-01;2.567747e+00;1.606485e+00;2.015208e-02;2.676925e+00;4.173211e+00;-3.225763e+00;7.398974e+00;141;122;-2.564223e-03;1.262240e-03;-1.350994e-01;2.077329e+00;1.443343e+00;1.666264e-01;2.809902e+00;1.574167e-01;-1.438333e-01;3.012500e-01;115;122;-2.622433e-03;6.552591e-05;-9.502653e-03;1.249018e-03;3.556610e-02;-3.552915e-01;1.012584e+01;9.802486e-02;-1.201950e-01;2.182199e-01;157;171;-2.525804e-04;-5.859234e-05;5.899616e-03;1.182782e-03;3.457628e-02;-8.069830e-02;4.849043e+00;8.888889e+01;-8.888889e+01;1.777778e+02;36;39;-2.542378e-09;-1.539215e-02;1.616176e+00;5.662868e+02;2.381524e+01;-6.053687e-01;7.081641e+00 3 | -------------------------------------------------------------------------------- /Relative Attributes/BayesClass_RelAtt.m: -------------------------------------------------------------------------------- 1 | % Bayesian Classification of the Relative Attributes 2 | % Created by Joe Ellis for Reproducible Codes Class 3 | % This function takes in the means and Covariances Matrices of each class 4 | % and then classifies the variables based on their values 5 | 6 | function accuracy = BayesClass_RelAtt(predicts,ground_truth,means,Covariances,used_for_training,unseen) 7 | 8 | % Variables 9 | % predicts = the values that need to be predicted and classified these are 10 | % the relative predictions 11 | % ground_truth = the real class_labels they are a 2668 vector; 12 | % means = 1x6x8 matrix of the covariances and the means 13 | % Covariances = 6x6x8 matrix fo the covariances 14 | 15 | % This is for tracking the accuracy of the set up 16 | correct = 0; 17 | total = 0; 18 | 19 | % Now do a for loop for each of the predicts variables 20 | for j = 1:length(predicts) 21 | % We don't want to use the variables that are used for training so 22 | % let's skip those in test 23 | if used_for_training(j) == 0 24 | 25 | %{ 26 | % This is for debug purposes 27 | if ismember(ground_truth(j),unseen) == 1 28 | disp('This is an unseen variable, and is of class'); 29 | disp(ground_truth(j)); 30 | end 31 | %} 32 | 33 | % For each of the categories find the guassian probability of the 34 | % each variable and each point 35 | best_prob = 0; 36 | for k = 1:size(means,3) 37 | 38 | % Add a bit of value to the Covariances to insure they are 39 | % positive definite 40 | Cov_ex = Covariances(:,:,k) + eye(size(Covariances,1)).*.00001; 41 | prob = mvnpdf(predicts(j,:),means(:,:,k),Cov_ex); 42 | 43 | % Debug Purposes 44 | % let's calc the distance from the prediction values of the 45 | % ranking to the predicted means of the values 46 | %{ 47 | 
distance = pdist([predicts(j,:);means(:,:,k)],'euclidean'); 48 | disp('This is the class: '); 49 | disp(k); 50 | disp('This is the distance: ') 51 | disp(distance); 52 | disp('The predicted values'); 53 | disp(predicts(j,:)); 54 | disp('The mean values of this variable'); 55 | disp(means(:,:,k)); 56 | %} 57 | 58 | if prob > best_prob 59 | best_prob = prob; 60 | app_label = k; 61 | end 62 | end 63 | 64 | % Now see if the label is the same as the ground truth label; 65 | if ground_truth(j) == app_label; 66 | correct = correct + 1; 67 | end 68 | 69 | % Add to the total numbers of predicts that are analyzed 70 | total = total + 1; 71 | end 72 | end 73 | 74 | accuracy = correct/total; 75 | -------------------------------------------------------------------------------- /Relative Attributes/BayesClass_RelAtt_unseen.m: -------------------------------------------------------------------------------- 1 | % Bayesian Classification of the Relative Attributes 2 | % Created by Joe Ellis for Reproducible Codes Class 3 | % This function takes in the means and Covariances Matrices of each class 4 | % and then classifies the variables based on their values 5 | 6 | function accuracy = BayesClass_RelAtt_unseen(predicts,ground_truth,means,Covariances,used_for_training,unseen) 7 | 8 | % Variables 9 | % predicts = the values that need to be predicted and classified these are 10 | % the relative predictions 11 | % ground_truth = the real class_labels they are a 2668 vector; 12 | % means = 1x6x8 matrix of the covariances and the means 13 | % Covariances = 6x6x8 matrix fo the covariances 14 | 15 | % This is for tracking the accuracy of the set up 16 | correct = 0; 17 | total = 0; 18 | 19 | % Now do a for loop for each of the predicts variables 20 | for j = 1:length(predicts) 21 | % We don't want to use the variables that are used for training so 22 | % let's skip those in test 23 | if used_for_training(j) == 0 && ismember(ground_truth(j),unseen) == 1 24 | 25 | %{ 26 | % This is for debug purposes 27 | if ismember(ground_truth(j),unseen) == 1 28 | disp('This is an unseen variable, and is of class'); 29 | disp(ground_truth(j)); 30 | end 31 | %} 32 | 33 | % For each of the categories find the guassian probability of the 34 | % each variable and each point 35 | best_prob = 0; 36 | for k = 1:size(means,3) 37 | 38 | % Add a bit of value to the Covariances to insure they are 39 | % positive definite 40 | Cov_ex = Covariances(:,:,k) + eye(size(Covariances,1)).*.00001; 41 | prob = mvnpdf(predicts(j,:),means(:,:,k),Cov_ex); 42 | 43 | % Debug Purposes 44 | % let's calc the distance from the prediction values of the 45 | % ranking to the predicted means of the values 46 | %{ 47 | distance = pdist([predicts(j,:);means(:,:,k)],'euclidean'); 48 | disp('This is the class: '); 49 | disp(k); 50 | disp('This is the distance: ') 51 | disp(distance); 52 | disp('The predicted values'); 53 | disp(predicts(j,:)); 54 | disp('The mean values of this variable'); 55 | disp(means(:,:,k)); 56 | %} 57 | 58 | if prob > best_prob 59 | best_prob = prob; 60 | app_label = k; 61 | end 62 | end 63 | 64 | % Now see if the label is the same as the ground truth label; 65 | if ground_truth(j) == app_label; 66 | correct = correct + 1; 67 | end 68 | 69 | % Add to the total numbers of predicts that are analyzed 70 | total = total + 1; 71 | end 72 | end 73 | 74 | accuracy = correct/total; 75 | -------------------------------------------------------------------------------- /Relative Attributes/Create_O_and_S_Mats_2D.m: 
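The function below, Create_O_and_S_Mats_2D.m, builds the ordered-pair matrix O and the similar-pair matrix S that ranksvm_with_sim.m consumes. For orientation, a toy illustration of the expected format (invented data, not taken from the repository):

```Matlab
% Toy illustration only: 4 utterances, 1 attribute.
% Utterances 1-2 belong to the class ranked higher for this attribute,
% utterances 3-4 to the class ranked lower.
O = [ 1  0  -1   0 ;    % utterance 1 has more of the attribute than utterance 3
      0  1   0  -1 ];   % utterance 2 has more of the attribute than utterance 4
S = zeros(0, 4);        % no "similar" pairs in this toy setup
% Every row of O (and of S) holds exactly one +1 and one -1. ranksvm_with_sim
% then learns weights w such that X(i,:)*w > X(j,:)*w whenever O(row,i) = 1 and
% O(row,j) = -1, while rows of S push X(i,:)*w and X(j,:)*w to be close.
```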
-------------------------------------------------------------------------------- 1 | % This function takes in a matrix with attribute relations in numeric 2 | % categories, and the features extracted from the available test images, 3 | % class labels of all of the images, and the images that are used for training, 4 | % and outputs the O and S matrix used with rank_with_sim rank svm implementation for training 5 | % Created by Joe Ellis -- PhD Candidate Columbia University 6 | 7 | function [O,S] = Create_O_and_S_Mats(category_order,used_for_training,class_labels,num_classes,unseen,trainpics,att_combos) 8 | 9 | % INPUTS 10 | % category_order = the order of the relative attributes of each category. 11 | % used_for_training = A vector the length of the samples, a 1 denotes that 12 | % this sample should be used for training, and a 0 is a test image 13 | % class_labels = the class_labels of each sample 14 | % num_classes = the total number of classes 15 | % unseen = A vector containing the class labels that are unseen 16 | % trainpics = The number of pictures used for training 17 | % att_combos = The number of category pairs that will be used for training 18 | 19 | % OUTPUTS 20 | % O = The matrix that is used as input to the ranking function, this matrix 21 | % for each row in the matrix contains one 1 and -1 element used for 22 | % training. 23 | % S = The similarity matrix is the same as the O matrix, but contains 24 | % samples that have the same score for a given attribute 25 | 26 | % num_categories = 6; 27 | num_categories = 2; 28 | 29 | % This matrix holds the index of the training samples for each class 30 | train_by_class = zeros(num_classes,trainpics); 31 | 32 | % Set up the O and S Mats 33 | O = zeros((trainpics^2)*att_combos,length(class_labels),num_categories); 34 | S = zeros((trainpics^2)*att_combos,length(class_labels),num_categories); 35 | 36 | % Create the train_by_class matrix to create the o and s matrix for ranking 37 | % functions 38 | index = ones(1,num_classes); 39 | for j = 1:length(used_for_training) 40 | 41 | % pick out the images that are going to be used_for_training 42 | if used_for_training(j) == 1; 43 | switch class_labels(j) 44 | case 1 45 | train_by_class(1,index(1)) = j; 46 | index(1) = index(1) + 1; 47 | case 2 48 | train_by_class(2,index(2)) = j; 49 | index(2) = index(2) + 1; 50 | case 3 51 | train_by_class(3,index(3)) = j; 52 | index(3) = index(3) + 1; 53 | case 4 54 | train_by_class(4,index(4)) = j; 55 | index(4) = index(4) + 1; 56 | case 5 57 | train_by_class(5,index(5)) = j; 58 | index(5) = index(5) + 1; 59 | case 6 60 | train_by_class(6,index(6)) = j; 61 | index(6) = index(6) + 1; 62 | % case 7 63 | % train_by_class(7,index(7)) = j; 64 | % index(7) = index(7) + 1; 65 | % case 8 66 | % train_by_class(8,index(8)) = j; 67 | % index(8) = index(8) + 1; 68 | end 69 | end 70 | end 71 | 72 | % Now we have the train_by_class matrix which has the training images for 73 | % each seperate variable. 
Now we are going to write the code as to how we 74 | % are going to create the o matrix and s matrix 75 | 76 | % create the elements to index the o matrix and s matrix 77 | num_images = length(used_for_training); 78 | s_index = ones(1,num_categories); 79 | o_index = ones(1,num_categories); 80 | 81 | % Create the list of seen classes 82 | seen = []; 83 | seen_index = 1; 84 | for z = 1:num_classes 85 | if (ismember(z,unseen) == 0) 86 | seen(seen_index) = z; 87 | seen_index = seen_index + 1; 88 | end 89 | end 90 | 91 | % Now we need to get the mix of the 4 categories that should all be together 92 | % This section randomly assigns two seen categoires as the category pairs 93 | % for training. 94 | combo1 = floor(1+((rand(1,att_combos)).*length(seen))); 95 | combo2 = floor(1+((rand(1,att_combos)).*length(seen))); 96 | for z = 1:att_combos 97 | test_combos(z,1) = seen(combo1(z)); 98 | test_combos(z,2) = seen(combo2(z)); 99 | 100 | % We should not compare two categoires to each other, and this section 101 | % does not allow that to happen 102 | while test_combos(z,1) == test_combos(z,2) 103 | test_combos(z,2) = floor(1+((rand(1).*length(seen)))); 104 | end 105 | 106 | % We also don't want to choose the same combination twice. 107 | % This function will prevent that from happening by checking the 108 | % previous combos and making sure they are not the same as the current. 109 | r = 1; 110 | while r < z 111 | if ismember(0,(sort(test_combos(r,:)) == sort(test_combos(z,:)))) 112 | r = r + 1; 113 | else 114 | test_combos(z,1) = seen(floor(1+((rand(1).*length(seen))))); 115 | test_combos(z,2) = seen(floor(1+((rand(1).*length(seen))))); 116 | r = 1; 117 | 118 | % Make sure that we are not comparing the two values together. 119 | while test_combos(z,1) == test_combos(z,2) 120 | test_combos(z,2) = seen(floor(1+((rand(1).*length(seen))))); 121 | end 122 | end 123 | end 124 | 125 | 126 | 127 | end 128 | 129 | % Now loop through each attribute pairing that we have and generate the O 130 | % and S matrices. 
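% Note added for clarity (not original code): with the settings passed in from
% main.m (emotion_categories = 2, labeled_pairs = 1, num_unseen = 0), the random
% selection above can only produce the single category pair {1, 2}, so every run
% trains the ranker on comparisons between the two emotion classes.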
131 | 132 | % Now display which classes we are using for training 133 | disp('These are the category pairs for RankSVM training') 134 | disp(test_combos) 135 | for z = 1:size(test_combos,1) 136 | on_class = test_combos(z,1); 137 | compared_class = test_combos(z,2); 138 | 139 | % Do this for every attribute 140 | for l = 1:2 141 | % If the two relative comparisons are equal add this pairing to 142 | % the S matrix 143 | if category_order(l,on_class) == category_order(l,compared_class) 144 | % Now perform this for every training picture for each class 145 | for j = 1:trainpics 146 | for i = 1:trainpics 147 | S_row = zeros(1,num_images); 148 | S_row(train_by_class(on_class,j)) = 1; 149 | S_row(train_by_class(compared_class,i)) = -1; 150 | S(s_index(l),:,l) = S_row; 151 | s_index(l) = s_index(l) + 1; 152 | end 153 | end 154 | 155 | % If the relative comparison of the on_class is greater than 156 | % that of the compared class 157 | elseif category_order(l,on_class) > category_order(l,compared_class) 158 | % Now perform this for every training picture for each class 159 | for j = 1:trainpics 160 | for i = 1:trainpics 161 | O_row = zeros(1,num_images); 162 | O_row(train_by_class(on_class,j)) = 1; 163 | O_row(train_by_class(compared_class,i)) = -1; 164 | O(o_index(l),:,l) = O_row; 165 | o_index(l) = o_index(l) + 1; 166 | end 167 | end 168 | 169 | % If the relative comparison of the new class is greater than 170 | % that of the compared class 171 | elseif category_order(l,on_class) < category_order(l,compared_class) 172 | % Now perform this for every training picture for each class 173 | for j = 1:trainpics 174 | for i = 1:trainpics 175 | O_row = zeros(1,num_images); 176 | O_row(train_by_class(on_class,j)) = -1; 177 | O_row(train_by_class(compared_class,i)) = 1; 178 | O(o_index(l),:,l) = O_row; 179 | o_index(l) = o_index(l) + 1; 180 | end 181 | end 182 | 183 | end 184 | end 185 | end 186 | 187 | end -------------------------------------------------------------------------------- /Relative Attributes/GetTrainingSample_per_category.m: -------------------------------------------------------------------------------- 1 | % Seperate the examples that are used for training in comparison to the 2 | % seen and unseen variables 3 | % Created by Joe Ellis for the Reproducible Codes Class 4 | function Train_samples = GetTrainingSample_per_category(predicts,class_labels,used_for_training) 5 | 6 | % Variables 7 | % predicts = the values for each image that have been predicted using the 8 | % ranking algorithm devised 9 | % class_labels = the ground truth label of each class 10 | % used_for_training = If this image should be used in the training of the 11 | % model 12 | 13 | % Train Samples is a 3-D matrix of the training variables that will be used 14 | % to train the gaussian distributions of the material for what we are 15 | % doing. 
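% Note added for clarity (not original code): the preallocation below assumes at
% most 30 training samples per class, as in the original Relative Attributes
% setup, while main.m uses trainpics = 300. MATLAB grows the array automatically
% on assignment, so the code still works, but the 30 could be raised to match.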
16 | 17 | Train_samples = zeros(30,size(predicts,2),8); 18 | index = ones(1,8); 19 | 20 | % Set up the matrices for training 21 | for j = 1:length(predicts); 22 | if used_for_training(j) == 1; 23 | switch class_labels(j) 24 | case 1 25 | Train_samples(index(1),:,1) = predicts(j,:); 26 | index(1) = index(1) + 1; 27 | case 2 28 | Train_samples(index(2),:,2) = predicts(j,:); 29 | index(2) = index(2) + 1; 30 | case 3 31 | Train_samples(index(3),:,3) = predicts(j,:); 32 | index(3) = index(3) + 1; 33 | case 4 34 | Train_samples(index(4),:,4) = predicts(j,:); 35 | index(4) = index(4) + 1; 36 | case 5 37 | Train_samples(index(5),:,5) = predicts(j,:); 38 | index(5) = index(5) + 1; 39 | case 6 40 | Train_samples(index(6),:,6) = predicts(j,:); 41 | index(6) = index(6) + 1; 42 | case 7 43 | Train_samples(index(7),:,7) = predicts(j,:); 44 | index(7) = index(7) + 1; 45 | case 8 46 | Train_samples(index(8),:,8) = predicts(j,:); 47 | index(8) = index(8) + 1; 48 | end 49 | end 50 | end 51 | end 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /Relative Attributes/classlabel.csv: -------------------------------------------------------------------------------- 1 | 2 2 | 2 3 | 2 4 | 2 5 | 2 6 | 2 7 | 2 8 | 2 9 | 2 10 | 2 11 | 2 12 | 2 13 | 2 14 | 2 15 | 2 16 | 2 17 | 2 18 | 2 19 | 2 20 | 2 21 | 2 22 | 2 23 | 2 24 | 2 25 | 2 26 | 2 27 | 2 28 | 2 29 | 2 30 | 2 31 | 2 32 | 2 33 | 2 34 | 2 35 | 2 36 | 2 37 | 2 38 | 2 39 | 2 40 | 2 41 | 2 42 | 2 43 | 2 44 | 2 45 | 2 46 | 2 47 | 2 48 | 2 49 | 2 50 | 2 51 | 2 52 | 2 53 | 2 54 | 2 55 | 2 56 | 2 57 | 2 58 | 2 59 | 2 60 | 2 61 | 2 62 | 2 63 | 2 64 | 2 65 | 2 66 | 2 67 | 2 68 | 2 69 | 2 70 | 2 71 | 2 72 | 2 73 | 2 74 | 2 75 | 2 76 | 2 77 | 2 78 | 2 79 | 2 80 | 2 81 | 2 82 | 2 83 | 2 84 | 2 85 | 2 86 | 2 87 | 2 88 | 2 89 | 2 90 | 2 91 | 2 92 | 2 93 | 2 94 | 2 95 | 2 96 | 2 97 | 2 98 | 2 99 | 2 100 | 2 101 | 2 102 | 2 103 | 2 104 | 2 105 | 2 106 | 2 107 | 2 108 | 2 109 | 2 110 | 2 111 | 2 112 | 2 113 | 2 114 | 2 115 | 2 116 | 2 117 | 2 118 | 2 119 | 2 120 | 2 121 | 2 122 | 2 123 | 2 124 | 2 125 | 2 126 | 2 127 | 2 128 | 2 129 | 2 130 | 2 131 | 2 132 | 2 133 | 2 134 | 2 135 | 2 136 | 2 137 | 2 138 | 2 139 | 2 140 | 2 141 | 2 142 | 2 143 | 2 144 | 2 145 | 2 146 | 2 147 | 2 148 | 2 149 | 2 150 | 2 151 | 2 152 | 2 153 | 2 154 | 2 155 | 2 156 | 2 157 | 2 158 | 2 159 | 2 160 | 2 161 | 2 162 | 2 163 | 2 164 | 2 165 | 2 166 | 2 167 | 2 168 | 2 169 | 2 170 | 2 171 | 2 172 | 2 173 | 2 174 | 2 175 | 2 176 | 2 177 | 2 178 | 2 179 | 2 180 | 2 181 | 2 182 | 2 183 | 2 184 | 2 185 | 2 186 | 2 187 | 2 188 | 2 189 | 2 190 | 2 191 | 2 192 | 2 193 | 2 194 | 2 195 | 2 196 | 2 197 | 2 198 | 2 199 | 2 200 | 2 201 | 2 202 | 2 203 | 2 204 | 2 205 | 2 206 | 2 207 | 2 208 | 2 209 | 2 210 | 2 211 | 2 212 | 2 213 | 2 214 | 2 215 | 2 216 | 2 217 | 2 218 | 2 219 | 2 220 | 2 221 | 2 222 | 2 223 | 2 224 | 2 225 | 2 226 | 2 227 | 2 228 | 2 229 | 2 230 | 2 231 | 2 232 | 2 233 | 2 234 | 2 235 | 2 236 | 2 237 | 2 238 | 2 239 | 2 240 | 2 241 | 2 242 | 2 243 | 2 244 | 2 245 | 2 246 | 2 247 | 2 248 | 2 249 | 2 250 | 2 251 | 2 252 | 2 253 | 2 254 | 2 255 | 2 256 | 2 257 | 2 258 | 2 259 | 2 260 | 2 261 | 2 262 | 2 263 | 2 264 | 2 265 | 2 266 | 2 267 | 2 268 | 2 269 | 2 270 | 2 271 | 2 272 | 2 273 | 2 274 | 2 275 | 2 276 | 2 277 | 2 278 | 2 279 | 2 280 | 2 281 | 2 282 | 2 283 | 2 284 | 2 285 | 2 286 | 2 287 | 2 288 | 2 289 | 2 290 | 2 291 | 2 292 | 2 293 | 2 294 | 2 295 | 2 296 | 2 297 | 2 298 | 2 299 | 2 300 | 2 301 | 2 302 | 2 303 | 2 304 | 2 305 | 2 306 | 
2 307 | 2 308 | 2 309 | 2 310 | 2 311 | 2 312 | 2 313 | 2 314 | 2 315 | 2 316 | 2 317 | 2 318 | 2 319 | 2 320 | 2 321 | 2 322 | 2 323 | 2 324 | 2 325 | 2 326 | 2 327 | 2 328 | 2 329 | 2 330 | 2 331 | 2 332 | 2 333 | 2 334 | 2 335 | 2 336 | 2 337 | 2 338 | 2 339 | 2 340 | 2 341 | 2 342 | 2 343 | 2 344 | 2 345 | 2 346 | 2 347 | 2 348 | 2 349 | 2 350 | 2 351 | 1 352 | 1 353 | 1 354 | 1 355 | 1 356 | 1 357 | 1 358 | 1 359 | 1 360 | 1 361 | 1 362 | 1 363 | 1 364 | 1 365 | 1 366 | 1 367 | 1 368 | 1 369 | 1 370 | 1 371 | 1 372 | 1 373 | 1 374 | 1 375 | 1 376 | 1 377 | 1 378 | 1 379 | 1 380 | 1 381 | 1 382 | 1 383 | 1 384 | 1 385 | 1 386 | 1 387 | 1 388 | 1 389 | 1 390 | 1 391 | 1 392 | 1 393 | 1 394 | 1 395 | 1 396 | 1 397 | 1 398 | 1 399 | 1 400 | 1 401 | 1 402 | 1 403 | 1 404 | 1 405 | 1 406 | 1 407 | 1 408 | 1 409 | 1 410 | 1 411 | 1 412 | 1 413 | 1 414 | 1 415 | 1 416 | 1 417 | 1 418 | 1 419 | 1 420 | 1 421 | 1 422 | 1 423 | 1 424 | 1 425 | 1 426 | 1 427 | 1 428 | 1 429 | 1 430 | 1 431 | 1 432 | 1 433 | 1 434 | 1 435 | 1 436 | 1 437 | 1 438 | 1 439 | 1 440 | 1 441 | 1 442 | 1 443 | 1 444 | 1 445 | 1 446 | 1 447 | 1 448 | 1 449 | 1 450 | 1 451 | 1 452 | 1 453 | 1 454 | 1 455 | 1 456 | 1 457 | 1 458 | 1 459 | 1 460 | 1 461 | 1 462 | 1 463 | 1 464 | 1 465 | 1 466 | 1 467 | 1 468 | 1 469 | 1 470 | 1 471 | 1 472 | 1 473 | 1 474 | 1 475 | 1 476 | 1 477 | 1 478 | 1 479 | 1 480 | 1 481 | 1 482 | 1 483 | 1 484 | 1 485 | 1 486 | 1 487 | 1 488 | 1 489 | 1 490 | 1 491 | 1 492 | 1 493 | 1 494 | 1 495 | 1 496 | 1 497 | 1 498 | 1 499 | 1 500 | 1 501 | 1 502 | 1 503 | 1 504 | 1 505 | 1 506 | 1 507 | 1 508 | 1 509 | 1 510 | 1 511 | 1 512 | 1 513 | 1 514 | 1 515 | 1 516 | 1 517 | 1 518 | 1 519 | 1 520 | 1 521 | 1 522 | 1 523 | 1 524 | 1 525 | 1 526 | 1 527 | 1 528 | 1 529 | 1 530 | 1 531 | 1 532 | 1 533 | 1 534 | 1 535 | 1 536 | 1 537 | 1 538 | 1 539 | 1 540 | 1 541 | 1 542 | 1 543 | 1 544 | 1 545 | 1 546 | 1 547 | 1 548 | 1 549 | 1 550 | 1 551 | 1 552 | 1 553 | 1 554 | 1 555 | 1 556 | 1 557 | 1 558 | 1 559 | 1 560 | 1 561 | 1 562 | 1 563 | 1 564 | 1 565 | 1 566 | 1 567 | 1 568 | 1 569 | 1 570 | 1 571 | 1 572 | 1 573 | 1 574 | 1 575 | 1 576 | 1 577 | 1 578 | 1 579 | 1 580 | 1 581 | 1 582 | 1 583 | 1 584 | 1 585 | 1 586 | 1 587 | 1 588 | 1 589 | 1 590 | 1 591 | 1 592 | 1 593 | 1 594 | 1 595 | 1 596 | 1 597 | 1 598 | 1 599 | 1 600 | 1 601 | 1 602 | 1 603 | 1 604 | 1 605 | 1 606 | 1 607 | 1 608 | 1 609 | 1 610 | 1 611 | 1 612 | 1 613 | 1 614 | 1 615 | 1 616 | 1 617 | 1 618 | 1 619 | 1 620 | 1 621 | 1 622 | 1 623 | 1 624 | 1 625 | 1 626 | 1 627 | 1 628 | 1 629 | 1 630 | 1 631 | 1 632 | 1 633 | 1 634 | 1 635 | 1 636 | 1 637 | 1 638 | 1 639 | 1 640 | 1 641 | 1 642 | 1 643 | 1 644 | 1 645 | 1 646 | 1 647 | 1 648 | 1 649 | 1 650 | 1 651 | 1 652 | 1 653 | 1 654 | 1 655 | 1 656 | 1 657 | 1 658 | 1 659 | 1 660 | 1 661 | 1 662 | 1 663 | 1 664 | 1 665 | 1 666 | 1 667 | 1 668 | 1 669 | 1 670 | 1 671 | 1 672 | 1 673 | 1 674 | 1 675 | 1 676 | 1 677 | 1 678 | 1 679 | 1 680 | 1 681 | 1 682 | 1 683 | 1 684 | 1 685 | 1 686 | 1 687 | 1 688 | 1 689 | 1 690 | 1 691 | 1 692 | 1 693 | 1 694 | 1 695 | 1 696 | 1 697 | 1 698 | 1 699 | 1 700 | 1 701 | -------------------------------------------------------------------------------- /Relative Attributes/main.m: -------------------------------------------------------------------------------- 1 | % Script created to create the graphs that we want to create for the osr 2 | % dataset with the Relative attributes method 3 | 4 | % Train the ranking function should be right here 5 | % This portion 
of the code needs to have some ground truth data labeled and 6 | % the relative similarities finished 7 | 8 | % Clear all the data before running the script 9 | clear all; 10 | 11 | 12 | % These are the num of unseen classes and training images per class 13 | num_unseen = 0; 14 | trainpics = 300; %need to change to 300 Kun 15 | num_iter = 10; 16 | held_out_attributes = 0; 17 | emotion_categories = 2; 18 | num_attributes = 2; 19 | labeled_pairs = 1; 20 | looseness_constraint = 1; 21 | % This is the number of iterations we want to do 22 | accuracy = zeros(1,num_iter); 23 | 24 | %spk_list = {'0011','0012','0013','0014','0015','0016','0017','0018','0019','0020'} %[0011--0020] 25 | emo_list = {'Neutral', 'Happy', 'Angry', 'Sad'} 26 | spk_list = {'0013'} %[0011--0020] 27 | %emo_list = {'Happy', 'Angry', 'Sad'} 28 | for spk_num = 1:size(spk_list,2) 29 | for emo_num = 1:size(emo_list,2) 30 | spk_tag = spk_list(spk_num) 31 | emo_tag = emo_list(emo_num) 32 | spk_tag_str = string(spk_tag(1)) %0011 33 | emo_tag_str = string(emo_tag(1)) %Happy 34 | % output file (score) 35 | score_path = strcat(spk_tag_str, '_Surprise_' , emo_tag_str , '_Score.csv') %"0011_Neutral_Happy_Score.csv" 36 | fopen(score_path,'wt') 37 | % input file (feature extracted by OpenSmile) 38 | 39 | %osr_gist_Mat = csvread(strcat(spk_tag_str, '/', spk_tag_str, '_Neutral_', emo_tag_str, '.csv'),1,2); %"0011/0011_Neutral_Happy.csv" 40 | 41 | osr_gist_Mat = csvread(strcat(spk_tag_str, '_Surprise_', emo_tag_str, '.csv'),1,2); %"0011_Neutral_Happy.csv" %700x384 42 | 43 | % Debug 44 | % osr_gist_Mat = csvread('0010_Neutral_Angry.csv',1,2); 45 | 46 | 47 | used_for_training = csvread('used_for_training_kun.csv'); %debug Kun 48 | % class_names = {'Angry','Neutral'}; 49 | class_labels = csvread('classlabel.csv'); 50 | relative_ordering = [2 1; 3 1]; 51 | category_order = relative_ordering; 52 | 53 | osr_gist_Mat_normal = mapminmax(osr_gist_Mat',0,1); %normalization 54 | osr_gist_Mat = osr_gist_Mat_normal'; %384x700 55 | 56 | 57 | 58 | for iter = 1:num_iter 59 | 60 | % Create a random list of unseen images 61 | unseen = randperm(emotion_categories,num_unseen); 62 | [O,S] = Create_O_and_S_Mats_2D(category_order,used_for_training,class_labels,emotion_categories,unseen,trainpics,labeled_pairs); 63 | % Now we need to train the ranking function, but we have some values in the 64 | % matrices that will not correspond to the anything becuase some attributes 65 | % will have more nodes with similarity. 66 | 67 | weights = zeros(384,num_attributes); %384x2 68 | for l = 1:num_attributes 69 | 70 | % Find where each O and S matrix stops having values for each category 71 | % matrix section 72 | 73 | % Find when the O matrix for this dimension no longer has real values 74 | 75 | for j = 1:size(O,1) 76 | O_length = j; 77 | if ismember(1,O(j,:,l)) == 0; 78 | break; 79 | end 80 | end 81 | 82 | % Find when the S matrix for this dimension no longer has real values. 
83 | for j = 1:size(S,1) 84 | S_length = j; 85 | if ismember(1,S(j,:,l)) == 0; 86 | break; 87 | end 88 | end 89 | 90 | % Now set up the cost matrices both are initialized to 0.1 in the 91 | % Relative Attributes paper from 2011; 92 | Costs_for_O = .1*ones(O_length,1); 93 | Costs_for_S = .1*ones(S_length,1); 94 | 95 | if O_length > 1 96 | w = ranksvm_with_sim(osr_gist_Mat,O(1:O_length-1,:,l),S(1:S_length,:,l),Costs_for_O,Costs_for_S); 97 | %w = testrank(osr_gist_Mat,O(1:O_length-1,:,l),S(1:S_length,:,l),Costs_for_O,Costs_for_S); 98 | weights(:,l) = w*2; 99 | else 100 | % exit 101 | % Re-Do the ranking and start over, because we chose category pairs 102 | % that did not have the O matrix for a given attribute. 103 | 104 | % This function creates the O and S matrix used in the ranking algorithm 105 | [O,S] = Create_O_and_S_Mats(category_order,used_for_training,class_labels,3,unseen,trainpics,labeled_pairs); 106 | 107 | % initialize the weights matrix that will be learned for ranking 108 | % weights = zeros(384,6); 109 | weights = zeros(384,num_attributes); 110 | 111 | % re-do the creation of the O and S matrix 112 | l = 1; 113 | disp('We had to redo the O and S matrix ranking, Pairs chosen were all similar for an attribute'); 114 | end 115 | end 116 | 117 | % here we want to choose to take out some of the weights for each 118 | % attribute and also the category order 119 | if held_out_attributes ~= 0 120 | rand_atts = randperm(6,6-held_out_attributes); 121 | for j = 1:length(rand_atts); 122 | new_weights(:,j) = weights(:,rand_atts(j)); 123 | new_cat_order(j,:) = category_order(rand_atts(j),:); 124 | new_relative_att_predictor(:,j) = relative_att_predictor(:,rand_atts(j)); 125 | end 126 | else 127 | new_cat_order = category_order; 128 | new_weights = weights; 129 | % new_relative_att_predictor = relative_att_predictor; 130 | end 131 | 132 | 133 | % Get the predictions based on the outputs from rank svm 134 | % Use there trained data 135 | % relative_att_predictions = feat*new_relative_att_predictor; 136 | % Use my trained data 137 | relative_att_predictions = osr_gist_Mat*new_weights; 138 | 139 | % Seperate the training samples from the other training samples 140 | Train_samples = GetTrainingSample_per_category(relative_att_predictions,class_labels,used_for_training); 141 | 142 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 143 | % Debug %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 144 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 145 | % Calculate the means and covariances from the samples 146 | [means, Covariances] = meanandvar_forcat(Train_samples,[],new_cat_order,emotion_categories,looseness_constraint); 147 | 148 | % This is for debug to find the problem with the unseen scategories 149 | means_unseen = meanandvar_forcat(Train_samples,unseen,new_cat_order,emotion_categories,looseness_constraint); 150 | 151 | % This section will find the difference between the values of the means 152 | disp('The unseen values are') 153 | unseen 154 | disp('Actual Means'); 155 | means 156 | disp('Difference between the means'); 157 | disp(means_unseen - means); 158 | 159 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 160 | 161 | % Classify the predicted features from the system 162 | accuracy(iter) = BayesClass_RelAtt(relative_att_predictions,class_labels,means_unseen,Covariances,used_for_training,unseen); 163 | disp('unseen accuracy for means found'); 164 | disp(accuracy(iter)) 165 | 166 | other_acc = 
BayesClass_RelAtt_unseen(relative_att_predictions,class_labels,means_unseen,Covariances,used_for_training,unseen); 167 | disp('unseen accuracy for derived means') 168 | disp(other_acc); 169 | disp('The relative ordering of the attributes for each image'); 170 | % category_order 171 | end 172 | 173 | total_acc = mean(accuracy); 174 | 175 | relative_att_predictions_norm = normalize(relative_att_predictions(:,1),'range') 176 | csvwrite(score_path, relative_att_predictions_norm) % (350:700) 177 | disp('The accuracy of this calculation: '); 178 | disp(total_acc); 179 | disp('------------- ok -------------- '); 180 | end 181 | end 182 | -------------------------------------------------------------------------------- /Relative Attributes/meanandvar_forcat.m: -------------------------------------------------------------------------------- 1 | % Generate mean and covariance matrix for each categories relative scores. 2 | % Created by Joe Ellis for the Reproduction Code Class 3 | % Reproducing Relative Attributes 4 | 5 | function [means, Covariances] = meanandvar_forcat(Training_Samples,unseen,category_order,num_classes, looseness_constraint) 6 | 7 | % The looseness constraint should be the looseless-1 8 | looseness_constraint = looseness_constraint - 1; 9 | 10 | % variables 11 | % means = 2-d matrix each row is a mean of the labels should be 8x6 rows 12 | % Covariances = 3-d matrix. Should be 6x6x8 to finish this work. 13 | 14 | % means of the set ups 15 | % Create the list of seen categories 16 | seen = []; 17 | seen_index = 1; 18 | for z = 1:num_classes 19 | if ismember(z,unseen) == 0 20 | seen(seen_index) = z; 21 | seen_index = seen_index + 1; 22 | end 23 | end 24 | 25 | % now we have the seen categories, and we want to find the mean and 26 | % covariance of each of these values. 
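% Note added for clarity (not original code): each seen class c is modeled as a
% Gaussian over its relative-attribute scores,
%   means(1,:,c)       = mean(Training_Samples(:,:,c));   % 1 x num_attributes
%   Covariances(:,:,c) = cov(Training_Samples(:,:,c));    % num_attributes x num_attributes
% BayesClass_RelAtt.m later assigns a test score vector to the class with the
% highest mvnpdf value, adding a small ridge (1e-5 * eye) there to keep each
% covariance positive definite.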
27 | 28 | % set up the means and covariances that we want to find 29 | means = zeros(1,size(Training_Samples,2),size(Training_Samples,3)); 30 | Covariances = zeros(size(Training_Samples,2),size(Training_Samples,2),size(Training_Samples,3)); 31 | 32 | for k = 1:length(seen) 33 | 34 | % Get the seen variable index 35 | class = seen(k); 36 | 37 | % Find the means of the seen a 38 | means(:,:,class) = mean(Training_Samples(:,:,class)); 39 | 40 | % for loop to iterate over the sections of the training_samples 41 | Covariances(:,:,class) = cov(Training_Samples(:,:,class)); 42 | end 43 | 44 | % Now we need to find the average covariance mat for all the seen samples 45 | AVG_COV = sum(Covariances,3)/length(seen); 46 | 47 | 48 | % Now we need to set up the mean and covariance for all of the unseen 49 | % variables 50 | 51 | % Now we have to find the average distance between the means 52 | 53 | dm = zeros(1,size(category_order,1)); 54 | 55 | for j = 1:size(category_order,1) 56 | % This section finds the means and sorts the average distance between 57 | % the neightbors 58 | sorted_means = sort(nonzeros(means(1,j,:))); 59 | diff = 0; 60 | for z = 1:length(sorted_means)-1 61 | diff = diff + abs(sorted_means(z)-sorted_means(z+1)); 62 | end 63 | dm(j) = diff/(length(seen)-1); 64 | end 65 | 66 | disp('The differences between the elements for each attribute'); 67 | dm 68 | 69 | % We need to create a category ordering of only the categories available 70 | % not the unseen categories 71 | for j = 1:length(seen) 72 | there = seen(j); 73 | new_category_order(:,j) = category_order(:,there); 74 | end 75 | 76 | 77 | for k = 1:length(unseen) 78 | % This is the unseen class 79 | class = unseen(k); 80 | 81 | % now we have to go through every attribute within this section of the 82 | % code to figure out 83 | for j = 1:size(new_category_order,1) 84 | attr_rank = category_order(j,class); 85 | 86 | % Now get the max and min of that particular ranking 87 | [max_rank max_idx] = max(new_category_order(j,:)); 88 | [min_rank min_idx] = min(new_category_order(j,:)); 89 | 90 | %if ismember(attr_rank,new_category_order(j,:)) == 1; 91 | % vect = (attr_rank == new_category_order(j,:)); 92 | % idx = find(vect); 93 | % means(1,j,class) = means(1,j,seen(idx(1))); 94 | %disp(means(1,j,class)); 95 | 96 | 97 | if attr_rank > max_rank; 98 | % Do some stuff 99 | max_mean = means(1,j,seen(max_idx(1))); 100 | means(1,j,class) = max_mean + dm(j); 101 | %disp(means(1,j,class)); 102 | 103 | elseif attr_rank == max_rank 104 | % Do some stuff 105 | new_rank = attr_rank - 1; 106 | idx = find(new_category_order(j,:) == new_rank); 107 | if isempty(idx) == 0 108 | one_less_mean = means(1,j,seen(idx(1))); 109 | means(1,j,class) = one_less_mean + dm(j); 110 | else 111 | max_mean = means(1,j,seen(max_idx(1))); 112 | means(1,j,class) = max_mean; 113 | end 114 | 115 | 116 | elseif attr_rank < min_rank 117 | % Do some stuff 118 | % Now we have to find the average distances between the means 119 | min_mean = means(1,j,seen(min_idx(1))); 120 | means(1,j,class) = min_mean - dm(j); 121 | %disp(means(1,j,class)); 122 | 123 | elseif attr_rank == min_rank 124 | % Do some stuff 125 | % Now we have to find the average distances between the means 126 | new_rank = attr_rank + 1; 127 | idx = find(new_category_order(j,:) == new_rank); 128 | if isempty(idx) == 0 129 | one_more_mean = means(1,j,seen(idx(1))); 130 | means(1,j,class) = one_more_mean - dm(j); 131 | else 132 | min_mean = means(1,j,seen(min_idx(1))); 133 | means(1,j,class) = min_mean; 134 | end 135 | 136 
| else 137 | % Find the index of the elements one above and below those 138 | % elements 139 | row_vec = new_category_order(j,:); 140 | min_cand = row_vec < attr_rank - looseness_constraint; 141 | value = 0; 142 | min_use_index = 1; 143 | for a = 1:length(min_cand) 144 | if min_cand(a) == 1 145 | if row_vec(a) > value; 146 | min_use_index = a; 147 | value = row_vec(a); 148 | end 149 | end 150 | end 151 | lower_u = means(1,j,seen(min_use_index)); 152 | 153 | % Here we have the values for the max used 154 | max_cand = row_vec > attr_rank + looseness_constraint; 155 | value = 9; 156 | max_use_index = 1; 157 | for a = 1:length(max_cand) 158 | if max_cand(a) == 1 159 | if row_vec(a) < value; 160 | max_use_index = a; 161 | value = row_vec(a); 162 | end 163 | end 164 | end 165 | higher_u = means(1,j,seen(max_use_index)); 166 | 167 | % This solves for the mean for this class 168 | means(1,j,class) = (higher_u + lower_u)/2; 169 | %disp(means(1,j,class)); 170 | end 171 | 172 | % Give it the average covariance of all of the elements 173 | Covariances(:,:,class) = AVG_COV; 174 | end 175 | end 176 | 177 | 178 | % I need to see what is getting messed up here 179 | 180 | for k = 1:length(unseen) 181 | % Get the seen variable index 182 | class = unseen(k); 183 | % Find the means of the seen a 184 | %means(:,:,class) = mean(Training_Samples(:,:,class)); 185 | truemean = mean(Training_Samples(:,:,class)); 186 | % for loop to iterate over the sections of the training_samples 187 | %Covariances(:,:,class) = cov(Training_Samples(:,:,class)); 188 | 189 | end 190 | % matlab is returning non positive definite matrices for me. Therefore, I 191 | % need to add a bit to the diagonal. 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | -------------------------------------------------------------------------------- /Relative Attributes/pre-processing.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | root_path = '/Users/kun/Desktop/workspace/' # 4 | data_path = root_path +'esd_en/' # 待处理的音频路径 5 | 6 | count = 0 7 | 8 | opensmile_path = '/Users/kun/Desktop/workspace/open_smile/opensmile/build/progsrc/smilextract/SMILExtract' 9 | config_path = '/Users/kun/Desktop/workspace/open_smile/opensmile/config/is09-13/IS09_emotion.conf' 10 | 11 | error_list = [] 12 | 13 | for root, spk_dir, files in os.walk(data_path): 14 | # print('----') 15 | # print(root) 16 | print(spk_dir) 17 | # print(files) 18 | # exit() 19 | for spk in spk_dir: 20 | spk_out_dir = root_path + 'output/' + spk 21 | if not os.path.exists(spk_out_dir): 22 | os.mkdir(spk_out_dir) 23 | for root, emo_dir, files in os.walk(data_path + spk): 24 | print(emo_dir) 25 | for emo in emo_dir: 26 | for _, _, wavfiles in os.walk(data_path + spk + '/' + emo): 27 | for i in range(len(wavfiles)): 28 | #print(count) 29 | wavfile_path = data_path + spk + '/' + emo + '/' + wavfiles[i] 30 | print(wavfile_path) 31 | 32 | emo_out_dir = root_path + 'output/' + spk+ '/' + emo 33 | print(emo_out_dir) 34 | feature_path = emo_out_dir + '/' + wavfiles[i][:-4] + '.csv' 35 | print(feature_path) 36 | # path_remake(files[i]) 37 | try: 38 | if not os.path.exists(emo_out_dir): 39 | os.mkdir(emo_out_dir) 40 | os.system(opensmile_path + ' -C ' + config_path + ' -I ' + wavfile_path + ' -csvoutput ' + feature_path + ' -instname ' + feature_path[-15:-4]) 41 | except: 42 | error_list.append(wavfile_path) 43 | count += 1 44 | 
# exit() 45 | print('error num: ',count) # number of failed files; normally there should be none. 46 | print('error list: ', error_list) -------------------------------------------------------------------------------- /Relative Attributes/ranksvm_with_sim.m: -------------------------------------------------------------------------------- 1 | function w = ranksvm_with_sim(X_,OMat,SMat,Costs_for_O,Costs_for_S,w,opt) 2 | % W = RANKSVM(X,A,C,W,OPT) 3 | % Solves the Ranking SVM optimization problem in the primal (with quadratic 4 | % penalization of the training errors). 5 | % 6 | % X contains the training inputs and is an n x d matrix (n = number of points). 7 | % A is a sparse p x n matrix, where p is the number of preference pairs. 8 | % Each row of A should contain exactly one +1 and one -1 9 | % reflecting the indices of the points constituting the pairs. 10 | % C is a vector of training error penalizations (one for each preference pair). 11 | % 12 | % OPT is a structure containing the options (in brackets default values): 13 | % lin_cg: Find the Newton step, by linear conjugate gradients [0] 14 | % iter_max_Newton: Maximum number of Newton steps [20] 15 | % prec: Stopping criterion 16 | % cg_prec and cg_it: stopping criteria for the linear CG. 17 | 18 | % Copyright Olivier Chapelle, olivier.chapelle@tuebingen.mpg.de 19 | % Last modified 25/08/2006 20 | 21 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 22 | % This code was modified by Joe Ellis -- Columbia University DVMM Lab 23 | % https://sites.google.com/site/joeelliscolumbiauniversity/ 24 | % Code was modified to reproduce the results from the Relative Attributes 25 | % paper by Parikh and Grauman. The variables that are used here are described 26 | % below. 27 | % 28 | % Variables: 29 | % X contains the training inputs and is an n x d matrix (n = number of points). 30 | % OMat is the matrix that holds the ordered pairs. It is set up in the same way 31 | % that A is described above: a sparse p x n matrix, where p is the number 32 | % of preference pairs. Each row should contain exactly one +1 and 33 | % one -1 reflecting the indices of the points constituting the pair. 34 | % The +1 marks the sample that has more of the attribute, and the -1 35 | % marks the one that has less. 36 | % SMat is the matrix that holds the similar pairs. It is set up in the same way 37 | % that A is described above: a sparse p x n matrix, where p is the number 38 | % of preference pairs. Each row should contain exactly one +1 and 39 | % one -1 reflecting the indices of the points constituting the pair. 40 | % One member of each pair still has to carry the +1 and the other the -1 for 41 | % the implementation below to work.
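% A minimal illustrative example (toy data, for clarity): with n = 3 feature
% vectors where sample 1 is ranked above sample 2 and samples 2 and 3 are
% judged similar, the pair matrices could be built as
%   OMat = sparse(1, [1 2], [+1 -1], 1, 3);   % one ordered pair: 1 > 2
%   SMat = sparse(1, [2 3], [+1 -1], 1, 3);   % one similar pair: 2 ~ 3
% so that every row contains exactly one +1 and one -1 as described above.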
42 | % Costs_for_O this has the value of the cost for each preference pair 43 | % Costs_for_S this has the value of the cost for each preference pair 44 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 45 | 46 | 47 | oldO = OMat; 48 | oldS = SMat; 49 | 50 | if size(OMat,1) == 1 51 | OMat = []; 52 | end 53 | 54 | if size(SMat,1) == 1 55 | SMat = []; 56 | end 57 | 58 | global X A 59 | X = X_; A = OMat; S = SMat; % To avoid passing theses matrices as arguments to subfunctions 60 | 61 | if nargin < 7 % Assign the options to their default values 62 | opt = []; 63 | end; 64 | if ~isfield(opt,'lin_cg'), opt.lin_cg = 0; end; 65 | if ~isfield(opt,'iter_max_Newton'), opt.iter_max_Newton = 10; end; 66 | if ~isfield(opt,'prec'), opt.prec = 1e-7; end; 67 | if ~isfield(opt,'cg_prec'), opt.cg_prec = 1e-3; end; 68 | if ~isfield(opt,'cg_it'), opt.cg_it = 20; end; 69 | 70 | d = size(X,2); 71 | n = size([A;S],1); 72 | 73 | if (d*n>1e9) & (opt.lin_cg==0) 74 | warning('Large problem: you should consider trying the lin_cg option') 75 | end; 76 | 77 | if nargin<6 78 | w = zeros(d,1); 79 | end; 80 | iter = 0; 81 | %out = 1-A*(X*w); 82 | 83 | %%%%%%%%%% 84 | % This is the line that needs to be altered to change the way that 85 | % ranksvm works for my purposes 86 | % out1 is the same exact implementation as what was originally designed 87 | % out2 needs to take into account the pairs that are highly similar 88 | 89 | disp('The size of the X matrix') 90 | disp(size(X)) 91 | disp('The sixe of the w matrix') 92 | disp(size(w)) 93 | 94 | % This is the section of the code that sets up the work 95 | if size(oldO,1) == 1 96 | %out1 = 1-A*(X*w); 97 | out2 = -(S*(X*w)); 98 | out1 = []; 99 | elseif size(oldS,1) == 1 100 | out1 = 1-A*(X*w); 101 | %out2 = (S*(X*w)); 102 | out2 = []; 103 | else 104 | out1 = 1-A*(X*w); 105 | out2 = -(S*(X*w)); 106 | end 107 | 108 | % Now concatenate the vectors together. 109 | out = [out1; out2]; 110 | % We need to keep track of the pairs that are chosen for the experiments, 111 | % so thusly concatenate the two vectors together so they will be the same 112 | % for our experiments. 113 | A = [A;S]; 114 | % disp('The size of the A Matrix') 115 | % disp(size(A)) 116 | % C = [Costs_for_O;Costs_for_S]; 117 | 118 | C = 0.1*ones(size(A,1),1); 119 | while 1 120 | iter = iter + 1; 121 | if iter > opt.iter_max_Newton; 122 | warning(sprintf(['Maximum number of Newton steps reached.' ... 123 | 'Try larger lambda'])); 124 | break; 125 | end; 126 | 127 | [obj, grad, sv] = obj_fun_linear(w,C,out); 128 | 129 | % Compute the Newton direction either by linear CG 130 | % Advantage of linear CG when using sparse input: the Hessian 131 | % is never computed explicitly. 132 | if opt.lin_cg 133 | [step, foo, relres] = minres(@hess_vect_mult, -grad,... 134 | opt.cg_prec,opt.cg_it,[],[],[],sv,C); 135 | else 136 | Xsv = A(sv,:)*X; 137 | hess = eye(d) + Xsv'*(Xsv.*repmat(C(sv),1,d)); % Hessian 138 | step = - hess \ grad; % Newton direction 139 | relres = 0; 140 | end; 141 | 142 | % Do an exact line search 143 | [t,out] = line_search_linear(w,step,out,C); 144 | 145 | w = w + t*step; 146 | fprintf(['Iter = %d, Obj = %f, Nb of sv = %d, Newton decr = %.3f, ' ... 147 | 'Line search = %.3f, Lin CG acc = %.4f \n'],... 
148 | iter,obj,sum(sv),-step'*grad/2,t,relres); 149 | 150 | if -step'*grad < opt.prec * obj 151 | % Stop when the Newton decrement is small enough 152 | break; 153 | end; 154 | end; 155 | 156 | 157 | function [obj, grad, sv] = obj_fun_linear(w,C,out) 158 | % Compute the objective function, its gradient and the set of support vectors 159 | % Out is supposed to contain 1-A*X*w 160 | global X A 161 | out = max(0,out); 162 | obj = sum(C.*out.^2)/2 + w'*w/2; % L2 penalization of the errors 163 | grad = w - (((C.*out)'*A)*X)'; % Gradient 164 | sv = out>0; 165 | 166 | 167 | function y = hess_vect_mult(w,sv,C) 168 | % Compute the Hessian times a given vector x. 169 | global X A 170 | y = w; 171 | z = (C.*sv).*(A*(X*w)); % Computing X(sv,:)*x takes more time in Matlab :-( 172 | y = y + ((z'*A)*X)'; 173 | 174 | 175 | function [t,out] = line_search_linear(w,d,out,C) 176 | % From the current solution w, do a line search in the direction d by 177 | % 1D Newton minimization 178 | global X A 179 | t = 0; 180 | % Precompute some dots products 181 | Xd = A*(X*d); 182 | wd = w'*d; 183 | dd = d'*d; 184 | while 1 185 | out2 = out - t*Xd; % The new outputs after a step of length t 186 | sv = find(out2>0); 187 | g = wd + t*dd - (C(sv).*out2(sv))'*Xd(sv); % The gradient (along the line) 188 | h = dd + Xd(sv)'*(Xd(sv).*C(sv)); % The second derivative (along the line) 189 | t = t - g/h; % Take the 1D Newton step. Note that if d was an exact Newton 190 | % direction, t is 1 after the first iteration. 191 | if g^2/h < 1e-10, break; end; 192 | end; 193 | out = out2; 194 | -------------------------------------------------------------------------------- /Relative Attributes/used_for_training_kun.csv: -------------------------------------------------------------------------------- 1 | 0 2 | 0 3 | 0 4 | 0 5 | 0 6 | 0 7 | 0 8 | 0 9 | 0 10 | 0 11 | 0 12 | 0 13 | 0 14 | 0 15 | 0 16 | 0 17 | 0 18 | 0 19 | 0 20 | 0 21 | 0 22 | 0 23 | 0 24 | 0 25 | 0 26 | 0 27 | 0 28 | 0 29 | 0 30 | 0 31 | 0 32 | 0 33 | 0 34 | 0 35 | 0 36 | 0 37 | 0 38 | 0 39 | 0 40 | 0 41 | 0 42 | 0 43 | 0 44 | 0 45 | 0 46 | 0 47 | 0 48 | 0 49 | 0 50 | 0 51 | 1 52 | 1 53 | 1 54 | 1 55 | 1 56 | 1 57 | 1 58 | 1 59 | 1 60 | 1 61 | 1 62 | 1 63 | 1 64 | 1 65 | 1 66 | 1 67 | 1 68 | 1 69 | 1 70 | 1 71 | 1 72 | 1 73 | 1 74 | 1 75 | 1 76 | 1 77 | 1 78 | 1 79 | 1 80 | 1 81 | 1 82 | 1 83 | 1 84 | 1 85 | 1 86 | 1 87 | 1 88 | 1 89 | 1 90 | 1 91 | 1 92 | 1 93 | 1 94 | 1 95 | 1 96 | 1 97 | 1 98 | 1 99 | 1 100 | 1 101 | 1 102 | 1 103 | 1 104 | 1 105 | 1 106 | 1 107 | 1 108 | 1 109 | 1 110 | 1 111 | 1 112 | 1 113 | 1 114 | 1 115 | 1 116 | 1 117 | 1 118 | 1 119 | 1 120 | 1 121 | 1 122 | 1 123 | 1 124 | 1 125 | 1 126 | 1 127 | 1 128 | 1 129 | 1 130 | 1 131 | 1 132 | 1 133 | 1 134 | 1 135 | 1 136 | 1 137 | 1 138 | 1 139 | 1 140 | 1 141 | 1 142 | 1 143 | 1 144 | 1 145 | 1 146 | 1 147 | 1 148 | 1 149 | 1 150 | 1 151 | 1 152 | 1 153 | 1 154 | 1 155 | 1 156 | 1 157 | 1 158 | 1 159 | 1 160 | 1 161 | 1 162 | 1 163 | 1 164 | 1 165 | 1 166 | 1 167 | 1 168 | 1 169 | 1 170 | 1 171 | 1 172 | 1 173 | 1 174 | 1 175 | 1 176 | 1 177 | 1 178 | 1 179 | 1 180 | 1 181 | 1 182 | 1 183 | 1 184 | 1 185 | 1 186 | 1 187 | 1 188 | 1 189 | 1 190 | 1 191 | 1 192 | 1 193 | 1 194 | 1 195 | 1 196 | 1 197 | 1 198 | 1 199 | 1 200 | 1 201 | 1 202 | 1 203 | 1 204 | 1 205 | 1 206 | 1 207 | 1 208 | 1 209 | 1 210 | 1 211 | 1 212 | 1 213 | 1 214 | 1 215 | 1 216 | 1 217 | 1 218 | 1 219 | 1 220 | 1 221 | 1 222 | 1 223 | 1 224 | 1 225 | 1 226 | 1 227 | 1 228 | 1 229 | 1 230 | 1 231 | 1 232 | 1 233 | 1 234 | 1 235 | 1 
236 | 1 237 | 1 238 | 1 239 | 1 240 | 1 241 | 1 242 | 1 243 | 1 244 | 1 245 | 1 246 | 1 247 | 1 248 | 1 249 | 1 250 | 1 251 | 1 252 | 1 253 | 1 254 | 1 255 | 1 256 | 1 257 | 1 258 | 1 259 | 1 260 | 1 261 | 1 262 | 1 263 | 1 264 | 1 265 | 1 266 | 1 267 | 1 268 | 1 269 | 1 270 | 1 271 | 1 272 | 1 273 | 1 274 | 1 275 | 1 276 | 1 277 | 1 278 | 1 279 | 1 280 | 1 281 | 1 282 | 1 283 | 1 284 | 1 285 | 1 286 | 1 287 | 1 288 | 1 289 | 1 290 | 1 291 | 1 292 | 1 293 | 1 294 | 1 295 | 1 296 | 1 297 | 1 298 | 1 299 | 1 300 | 1 301 | 1 302 | 1 303 | 1 304 | 1 305 | 1 306 | 1 307 | 1 308 | 1 309 | 1 310 | 1 311 | 1 312 | 1 313 | 1 314 | 1 315 | 1 316 | 1 317 | 1 318 | 1 319 | 1 320 | 1 321 | 1 322 | 1 323 | 1 324 | 1 325 | 1 326 | 1 327 | 1 328 | 1 329 | 1 330 | 1 331 | 1 332 | 1 333 | 1 334 | 1 335 | 1 336 | 1 337 | 1 338 | 1 339 | 1 340 | 1 341 | 1 342 | 1 343 | 1 344 | 1 345 | 1 346 | 1 347 | 1 348 | 1 349 | 1 350 | 1 351 | 0 352 | 0 353 | 0 354 | 0 355 | 0 356 | 0 357 | 0 358 | 0 359 | 0 360 | 0 361 | 0 362 | 0 363 | 0 364 | 0 365 | 0 366 | 0 367 | 0 368 | 0 369 | 0 370 | 0 371 | 0 372 | 0 373 | 0 374 | 0 375 | 0 376 | 0 377 | 0 378 | 0 379 | 0 380 | 0 381 | 0 382 | 0 383 | 0 384 | 0 385 | 0 386 | 0 387 | 0 388 | 0 389 | 0 390 | 0 391 | 0 392 | 0 393 | 0 394 | 0 395 | 0 396 | 0 397 | 0 398 | 0 399 | 0 400 | 0 401 | 1 402 | 1 403 | 1 404 | 1 405 | 1 406 | 1 407 | 1 408 | 1 409 | 1 410 | 1 411 | 1 412 | 1 413 | 1 414 | 1 415 | 1 416 | 1 417 | 1 418 | 1 419 | 1 420 | 1 421 | 1 422 | 1 423 | 1 424 | 1 425 | 1 426 | 1 427 | 1 428 | 1 429 | 1 430 | 1 431 | 1 432 | 1 433 | 1 434 | 1 435 | 1 436 | 1 437 | 1 438 | 1 439 | 1 440 | 1 441 | 1 442 | 1 443 | 1 444 | 1 445 | 1 446 | 1 447 | 1 448 | 1 449 | 1 450 | 1 451 | 1 452 | 1 453 | 1 454 | 1 455 | 1 456 | 1 457 | 1 458 | 1 459 | 1 460 | 1 461 | 1 462 | 1 463 | 1 464 | 1 465 | 1 466 | 1 467 | 1 468 | 1 469 | 1 470 | 1 471 | 1 472 | 1 473 | 1 474 | 1 475 | 1 476 | 1 477 | 1 478 | 1 479 | 1 480 | 1 481 | 1 482 | 1 483 | 1 484 | 1 485 | 1 486 | 1 487 | 1 488 | 1 489 | 1 490 | 1 491 | 1 492 | 1 493 | 1 494 | 1 495 | 1 496 | 1 497 | 1 498 | 1 499 | 1 500 | 1 501 | 1 502 | 1 503 | 1 504 | 1 505 | 1 506 | 1 507 | 1 508 | 1 509 | 1 510 | 1 511 | 1 512 | 1 513 | 1 514 | 1 515 | 1 516 | 1 517 | 1 518 | 1 519 | 1 520 | 1 521 | 1 522 | 1 523 | 1 524 | 1 525 | 1 526 | 1 527 | 1 528 | 1 529 | 1 530 | 1 531 | 1 532 | 1 533 | 1 534 | 1 535 | 1 536 | 1 537 | 1 538 | 1 539 | 1 540 | 1 541 | 1 542 | 1 543 | 1 544 | 1 545 | 1 546 | 1 547 | 1 548 | 1 549 | 1 550 | 1 551 | 1 552 | 1 553 | 1 554 | 1 555 | 1 556 | 1 557 | 1 558 | 1 559 | 1 560 | 1 561 | 1 562 | 1 563 | 1 564 | 1 565 | 1 566 | 1 567 | 1 568 | 1 569 | 1 570 | 1 571 | 1 572 | 1 573 | 1 574 | 1 575 | 1 576 | 1 577 | 1 578 | 1 579 | 1 580 | 1 581 | 1 582 | 1 583 | 1 584 | 1 585 | 1 586 | 1 587 | 1 588 | 1 589 | 1 590 | 1 591 | 1 592 | 1 593 | 1 594 | 1 595 | 1 596 | 1 597 | 1 598 | 1 599 | 1 600 | 1 601 | 1 602 | 1 603 | 1 604 | 1 605 | 1 606 | 1 607 | 1 608 | 1 609 | 1 610 | 1 611 | 1 612 | 1 613 | 1 614 | 1 615 | 1 616 | 1 617 | 1 618 | 1 619 | 1 620 | 1 621 | 1 622 | 1 623 | 1 624 | 1 625 | 1 626 | 1 627 | 1 628 | 1 629 | 1 630 | 1 631 | 1 632 | 1 633 | 1 634 | 1 635 | 1 636 | 1 637 | 1 638 | 1 639 | 1 640 | 1 641 | 1 642 | 1 643 | 1 644 | 1 645 | 1 646 | 1 647 | 1 648 | 1 649 | 1 650 | 1 651 | 1 652 | 1 653 | 1 654 | 1 655 | 1 656 | 1 657 | 1 658 | 1 659 | 1 660 | 1 661 | 1 662 | 1 663 | 1 664 | 1 665 | 1 666 | 1 667 | 1 668 | 1 669 | 1 670 | 1 671 | 1 672 | 1 673 | 1 674 | 1 675 | 1 676 | 1 677 | 1 678 | 1 679 | 1 
680 | 1 681 | 1 682 | 1 683 | 1 684 | 1 685 | 1 686 | 1 687 | 1 688 | 1 689 | 1 690 | 1 691 | 1 692 | 1 693 | 1 694 | 1 695 | 1 696 | 1 697 | 1 698 | 1 699 | 1 700 | 1 701 | -------------------------------------------------------------------------------- /codes/acrnn_test.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import pdb 5 | 6 | class acrnn(nn.Module): 7 | def __init__(self, num_classes=4, is_training=True, 8 | L1=128, L2=256, cell_units=128, num_linear=768, 9 | p=10, time_step=800, F1=128, dropout_keep_prob=1): 10 | super(acrnn, self).__init__() 11 | 12 | self.num_classes = num_classes 13 | self.is_training = is_training 14 | self.L1 = L1 15 | self.L2 = L2 16 | self.cell_units = cell_units 17 | self.num_linear = num_linear 18 | self.p = p 19 | self.time_step = time_step 20 | self.F1 = F1 21 | self.dropout_prob = 1 - dropout_keep_prob 22 | 23 | # tf filter : [filter_height, filter_width, in_channels, out_channels] 24 | self.conv1 = nn.Conv2d(3, self.L1, (5, 3), padding=(2, 1)) # [5, 3, 3, 128] 25 | self.conv2 = nn.Conv2d(self.L1, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 26 | self.conv3 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 27 | self.conv4 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 28 | self.conv5 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 29 | self.conv6 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 30 | 31 | self.linear1 = nn.Linear(self.p*self.L2, self.num_linear) # [10*256, 768] 32 | self.bn = nn.BatchNorm1d(self.num_linear) 33 | 34 | #self.linear_em = nn.Linear(self.p, self.L2) 35 | 36 | self.relu = nn.LeakyReLU(0.01) 37 | self.dropout = nn.Dropout2d(p=self.dropout_prob) 38 | 39 | self.rnn = nn.LSTM(input_size=self.num_linear, hidden_size=self.cell_units, 40 | batch_first=True, num_layers=1, bidirectional=True) 41 | 42 | # for attention 43 | self.a_fc1 = nn.Linear(2*self.cell_units, 1) 44 | self.a_fc2 = nn.Linear(1, 1) 45 | self.sigmoid = nn.Sigmoid() 46 | self.softmax = nn.Softmax(dim=1) 47 | 48 | # fully connected layers 49 | self.fc1 = nn.Linear(2*self.cell_units, self.F1) # [2*128, 64] 50 | self.fc2 = nn.Linear(self.F1, self.num_classes) # [num_classes] 51 | 52 | 53 | def forward(self, x): 54 | 55 | layer1 = self.relu(self.conv1(x)) 56 | layer1 = F.max_pool2d(layer1, kernel_size=(2, 4), stride=(2, 4)) # [1,2,4,1], padding = 'valid' 57 | layer1 = self.dropout(layer1) 58 | 59 | layer2 = self.relu(self.conv2(layer1)) 60 | layer2 = self.dropout(layer2) 61 | 62 | layer3 = self.relu(self.conv3(layer2)) 63 | layer3 = self.dropout(layer3) 64 | 65 | layer4 = self.relu(self.conv4(layer3)) 66 | layer4 = self.dropout(layer4) 67 | 68 | layer5 = self.relu(self.conv5(layer4)) 69 | layer5 = self.dropout(layer5) 70 | 71 | layer6 = self.relu(self.conv6(layer5)) 72 | layer6 = self.dropout(layer6) 73 | 74 | # lstm 75 | layer6 = layer6.permute(0, 2, 3, 1) 76 | layer6 = layer6.reshape(-1, self.time_step, self.L2*self.p) # (-1, 150, 256*10) 77 | 78 | layer6 = layer6.reshape(-1, self.L2*self.p) # (1500, 2560) 79 | 80 | linear1 = self.relu(self.bn(self.linear1(layer6))) # [1500, 768] 81 | linear1 = linear1.reshape(-1, self.time_step, self.num_linear) # [10, 150, 768] 82 | em_bed_low = linear1 83 | 84 | outputs1, output_states1 = self.rnn(linear1) # outputs1 : [10, 150, 128] (B,T,D) 85 | 86 | # # attention 87 | v = self.sigmoid(self.a_fc1(outputs1)) 
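        # How the attention pooling below works: a_fc1 scores every BLSTM output
        # frame, a_fc2 plus the time-axis softmax turns those scores into the
        # weights `alphas`, and the weighted sum collapses the variable-length
        # sequence into one utterance-level vector that the fully connected
        # layers then classify into emotion categories.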
# (10, 150, 1) 88 | alphas = self.softmax(self.a_fc2(v).squeeze()) # (B,T) shape, alphas are attention weights (10,800) 89 | gru = (alphas.unsqueeze(2) * outputs1).sum(dim=1) # (B,D) (10,256) 90 | 91 | # # fc 92 | fully1 = self.relu(self.fc1(gru)) 93 | em_bed_high = fully1 94 | fully1 = self.dropout(fully1) 95 | Ylogits = self.fc2(fully1) 96 | Ylogits = self.softmax(Ylogits) 97 | 98 | return Ylogits, em_bed_low, em_bed_high 99 | -------------------------------------------------------------------------------- /codes/checkpoint/checkpoint_5900: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/checkpoint/checkpoint_5900 -------------------------------------------------------------------------------- /codes/distributed.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | from torch.nn.modules import Module 4 | from torch.autograd import Variable 5 | 6 | def _flatten_dense_tensors(tensors): 7 | """Flatten dense tensors into a contiguous 1D buffer. Assume tensors are of 8 | same dense type. 9 | Since inputs are dense, the resulting tensor will be a concatenated 1D 10 | buffer. Element-wise operation on this buffer will be equivalent to 11 | operating individually. 12 | Arguments: 13 | tensors (Iterable[Tensor]): dense tensors to flatten. 14 | Returns: 15 | A contiguous 1D buffer containing input tensors. 16 | """ 17 | if len(tensors) == 1: 18 | return tensors[0].contiguous().view(-1) 19 | flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0) 20 | return flat 21 | 22 | def _unflatten_dense_tensors(flat, tensors): 23 | """View a flat buffer using the sizes of tensors. Assume that tensors are of 24 | same dense type, and that flat is given by _flatten_dense_tensors. 25 | Arguments: 26 | flat (Tensor): flattened dense tensors to unflatten. 27 | tensors (Iterable[Tensor]): dense tensors whose sizes will be used to 28 | unflatten flat. 29 | Returns: 30 | Unflattened dense tensors with sizes same as tensors and values from 31 | flat. 32 | """ 33 | outputs = [] 34 | offset = 0 35 | for tensor in tensors: 36 | numel = tensor.numel() 37 | outputs.append(flat.narrow(0, offset, numel).view_as(tensor)) 38 | offset += numel 39 | return tuple(outputs) 40 | 41 | 42 | ''' 43 | This version of DistributedDataParallel is designed to be used in conjunction with the multiproc.py 44 | launcher included with this example. It assumes that your run is using multiprocess with 1 45 | GPU/process, that the model is on the correct device, and that torch.set_device has been 46 | used to set the device. 47 | Parameters are broadcasted to the other processes on initialization of DistributedDataParallel, 48 | and will be allreduced at the finish of the backward pass. 
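A minimal usage sketch (hypothetical model name, shown only to illustrate the
call pattern):

    model = MyModel().cuda()
    if hparams.distributed_run:
        model = apply_gradient_allreduce(model)  # or wrap with DistributedDataParallel(model)
    # training then proceeds as usual: parameters are broadcast from rank 0 when
    # the wrapper is created, and gradients are all-reduced after each backward().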
49 | ''' 50 | class DistributedDataParallel(Module): 51 | 52 | def __init__(self, module): 53 | super(DistributedDataParallel, self).__init__() 54 | #fallback for PyTorch 0.3 55 | if not hasattr(dist, '_backend'): 56 | self.warn_on_half = True 57 | else: 58 | self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 59 | 60 | self.module = module 61 | 62 | for p in list(self.module.state_dict().values()): 63 | if not torch.is_tensor(p): 64 | continue 65 | dist.broadcast(p, 0) 66 | 67 | def allreduce_params(): 68 | if(self.needs_reduction): 69 | self.needs_reduction = False 70 | buckets = {} 71 | for param in self.module.parameters(): 72 | if param.requires_grad and param.grad is not None: 73 | tp = type(param.data) 74 | if tp not in buckets: 75 | buckets[tp] = [] 76 | buckets[tp].append(param) 77 | if self.warn_on_half: 78 | if torch.cuda.HalfTensor in buckets: 79 | print(("WARNING: gloo dist backend for half parameters may be extremely slow." + 80 | " It is recommended to use the NCCL backend in this case. This currently requires" + 81 | "PyTorch built from top of tree master.")) 82 | self.warn_on_half = False 83 | 84 | for tp in buckets: 85 | bucket = buckets[tp] 86 | grads = [param.grad.data for param in bucket] 87 | coalesced = _flatten_dense_tensors(grads) 88 | dist.all_reduce(coalesced) 89 | coalesced /= dist.get_world_size() 90 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 91 | buf.copy_(synced) 92 | 93 | for param in list(self.module.parameters()): 94 | def allreduce_hook(*unused): 95 | param._execution_engine.queue_callback(allreduce_params) 96 | if param.requires_grad: 97 | param.register_hook(allreduce_hook) 98 | 99 | def forward(self, *inputs, **kwargs): 100 | self.needs_reduction = True 101 | return self.module(*inputs, **kwargs) 102 | 103 | ''' 104 | def _sync_buffers(self): 105 | buffers = list(self.module._all_buffers()) 106 | if len(buffers) > 0: 107 | # cross-node buffer sync 108 | flat_buffers = _flatten_dense_tensors(buffers) 109 | dist.broadcast(flat_buffers, 0) 110 | for buf, synced in zip(buffers, _unflatten_dense_tensors(flat_buffers, buffers)): 111 | buf.copy_(synced) 112 | def train(self, mode=True): 113 | # Clear NCCL communicator and CUDA event cache of the default group ID, 114 | # These cache will be recreated at the later call. This is currently a 115 | # work-around for a potential NCCL deadlock. 
116 | if dist._backend == dist.dist_backend.NCCL: 117 | dist._clear_group_cache() 118 | super(DistributedDataParallel, self).train(mode) 119 | self.module.train(mode) 120 | ''' 121 | ''' 122 | Modifies existing model to do gradient allreduce, but doesn't change class 123 | so you don't need "module" 124 | ''' 125 | def apply_gradient_allreduce(module): 126 | if not hasattr(dist, '_backend'): 127 | module.warn_on_half = True 128 | else: 129 | module.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 130 | 131 | for p in list(module.state_dict().values()): 132 | if not torch.is_tensor(p): 133 | continue 134 | dist.broadcast(p, 0) 135 | 136 | def allreduce_params(): 137 | if(module.needs_reduction): 138 | module.needs_reduction = False 139 | buckets = {} 140 | for param in module.parameters(): 141 | if param.requires_grad and param.grad is not None: 142 | tp = type(param.data) 143 | if tp not in buckets: 144 | buckets[tp] = [] 145 | buckets[tp].append(param) 146 | if module.warn_on_half: 147 | if torch.cuda.HalfTensor in buckets: 148 | print(("WARNING: gloo dist backend for half parameters may be extremely slow." + 149 | " It is recommended to use the NCCL backend in this case. This currently requires" + 150 | "PyTorch built from top of tree master.")) 151 | module.warn_on_half = False 152 | 153 | for tp in buckets: 154 | bucket = buckets[tp] 155 | grads = [param.grad.data for param in bucket] 156 | coalesced = _flatten_dense_tensors(grads) 157 | dist.all_reduce(coalesced) 158 | coalesced /= dist.get_world_size() 159 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 160 | buf.copy_(synced) 161 | 162 | for param in list(module.parameters()): 163 | def allreduce_hook(*unused): 164 | Variable._execution_engine.queue_callback(allreduce_params) 165 | if param.requires_grad: 166 | param.register_hook(allreduce_hook) 167 | 168 | def set_needs_reduction(self, input, output): 169 | self.needs_reduction = True 170 | 171 | module.register_forward_hook(set_needs_reduction) 172 | return module -------------------------------------------------------------------------------- /codes/hparams.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. 
Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=50, 12 | iters_per_checkpoint=100, 13 | seed=1234, 14 | dynamic_loss_scaling=True, 15 | distributed_run=False, 16 | dist_backend="nccl", 17 | dist_url="tcp://localhost:54321", 18 | cudnn_enabled=True, 19 | cudnn_benchmark=False, 20 | 21 | ################################ 22 | # Data Parameters # 23 | ################################ 24 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/training_mel_list.txt', 25 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/evaluation_mel_list.txt', 26 | #mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 27 | mel_mean_std = '/home/zhoukun/nonparaSeq2seqVC_code-master/0013/mel_mean_std.npy', 28 | ################################ 29 | # Data Parameters # 30 | ################################ 31 | n_mel_channels=80, 32 | n_spc_channels=1025, 33 | n_symbols=41, # 34 | pretrain_n_speakers=99, # 35 | 36 | n_speakers=4, # 37 | predict_spectrogram=False, 38 | 39 | ################################ 40 | # Model Parameters # 41 | ################################ 42 | 43 | symbols_embedding_dim=512, 44 | 45 | # Text Encoder parameters 46 | encoder_kernel_size=5, 47 | encoder_n_convolutions=3, 48 | encoder_embedding_dim=512, 49 | text_encoder_dropout=0.5, 50 | 51 | # Audio Encoder parameters 52 | spemb_input=False, 53 | n_frames_per_step_encoder=2, 54 | audio_encoder_hidden_dim=512, 55 | AE_attention_dim=128, 56 | AE_attention_location_n_filters=32, 57 | AE_attention_location_kernel_size=51, 58 | beam_width=10, 59 | 60 | # hidden activation 61 | # relu linear tanh 62 | hidden_activation='tanh', 63 | 64 | #Speaker Encoder parameters 65 | speaker_encoder_hidden_dim=256, 66 | speaker_encoder_dropout=0.2, 67 | #speaker_embedding_dim=128, 68 | speaker_embedding_dim=64, 69 | 70 | 71 | #Speaker Classifier parameters 72 | SC_hidden_dim=512, 73 | SC_n_convolutions=3, 74 | SC_kernel_size=1, 75 | 76 | # Decoder parameters 77 | feed_back_last=True, 78 | n_frames_per_step_decoder=2, 79 | decoder_rnn_dim=512, 80 | prenet_dim=[256,256], 81 | max_decoder_steps=1000, 82 | stop_threshold=0.5, 83 | 84 | # Attention parameters 85 | attention_rnn_dim=512, 86 | attention_dim=128, 87 | 88 | # Location Layer parameters 89 | attention_location_n_filters=32, 90 | attention_location_kernel_size=17, 91 | 92 | # PostNet parameters 93 | postnet_n_convolutions=5, 94 | postnet_dim=512, 95 | postnet_kernel_size=5, 96 | postnet_dropout=0.5, 97 | 98 | ################################ 99 | # Optimization Hyperparameters # 100 | ################################ 101 | use_saved_learning_rate=False, 102 | #learning_rate=1e-2, 103 | learning_rate=1e-3, 104 | #weight_decay=1e-6, 105 | weight_decay=1e-4, 106 | grad_clip_thresh=5.0, 107 | #batch_size=32, 108 | #batch_size=16, 109 | batch_size = 8, 110 | warmup = 7, 111 | decay_rate = 0.5, 112 | decay_every = 7, 113 | 114 | 115 | 116 | ser_loss_w = 1, 117 | emo_loss_w = 1, 118 | contrastive_loss_w=30.0, 119 | #contrastive_loss_w=0.0, 120 | speaker_encoder_loss_w=1.0, 121 | text_classifier_loss_w=1.0, 122 | #text_classifier_loss_w=0.0, 123 | speaker_adversial_loss_w=20., 124 | speaker_classifier_loss_w=0.1, 125 | ce_loss=False 126 | ) 127 | 128 | if hparams_string: 129 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 130 | 
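    # Usage note (illustrative): the value passed on the command line via
    # --hparams is a comma-separated list of key=value overrides, e.g.
    #   hparams = create_hparams('batch_size=8,SC_kernel_size=1')
    # The parse() call below applies those overrides on top of the defaults
    # defined above, so command-line settings take precedence.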
hparams.parse(hparams_string) 131 | 132 | if verbose: 133 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 134 | 135 | return hparams 136 | -------------------------------------------------------------------------------- /codes/hparams_1.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=200, 12 | iters_per_checkpoint=1000, 13 | seed=1234, 14 | distributed_run=False, 15 | dist_backend="nccl", 16 | dist_url="tcp://localhost:54321", 17 | cudnn_enabled=True, 18 | cudnn_benchmark=False, 19 | 20 | ################################ 21 | # Data Parameters # 22 | ################################ 23 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/training_mel_list.txt', 24 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/evaluation_mel_list.txt', 25 | mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 26 | 27 | ################################ 28 | # Data Parameters # 29 | ################################ 30 | n_mel_channels=80, 31 | n_spc_channels=1025, 32 | n_symbols=41, # 33 | n_speakers=99, # 34 | predict_spectrogram=False, 35 | 36 | ################################ 37 | # Model Parameters # 38 | ################################ 39 | 40 | symbols_embedding_dim=512, 41 | 42 | # Text Encoder parameters 43 | encoder_kernel_size=5, 44 | encoder_n_convolutions=3, 45 | encoder_embedding_dim=512, 46 | text_encoder_dropout=0.5, 47 | 48 | # Audio Encoder parameters 49 | spemb_input=False, 50 | n_frames_per_step_encoder=2, 51 | audio_encoder_hidden_dim=512, 52 | AE_attention_dim=128, 53 | AE_attention_location_n_filters=32, 54 | AE_attention_location_kernel_size=51, 55 | beam_width=10, 56 | 57 | # hidden activation 58 | # relu linear tanh 59 | hidden_activation='tanh', 60 | 61 | #Speaker Encoder parameters 62 | speaker_encoder_hidden_dim=256, 63 | speaker_encoder_dropout=0.2, 64 | speaker_embedding_dim=128, 65 | 66 | 67 | #Speaker Classifier parameters 68 | SC_hidden_dim=512, 69 | SC_n_convolutions=3, 70 | SC_kernel_size=1, 71 | 72 | # Decoder parameters 73 | feed_back_last=True, 74 | n_frames_per_step_decoder=2, 75 | decoder_rnn_dim=512, 76 | prenet_dim=[256,256], 77 | max_decoder_steps=1000, 78 | stop_threshold=0.5, 79 | 80 | # Attention parameters 81 | attention_rnn_dim=512, 82 | attention_dim=128, 83 | 84 | # Location Layer parameters 85 | attention_location_n_filters=32, 86 | attention_location_kernel_size=17, 87 | 88 | # PostNet parameters 89 | postnet_n_convolutions=5, 90 | postnet_dim=512, 91 | postnet_kernel_size=5, 92 | postnet_dropout=0.5, 93 | 94 | ################################ 95 | # Optimization Hyperparameters # 96 | ################################ 97 | use_saved_learning_rate=False, 98 | learning_rate=1e-3, 99 | weight_decay=1e-6, 100 | grad_clip_thresh=5.0, 101 | batch_size=32, 102 | 103 | contrastive_loss_w=30.0, 104 | speaker_encoder_loss_w=1.0, 105 | text_classifier_loss_w=1.0, 106 | speaker_adversial_loss_w=20., 107 | speaker_classifier_loss_w=0.1, 108 | ce_loss=False 109 | ) 110 | 111 | if hparams_string: 112 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 113 | 
hparams.parse(hparams_string) 114 | 115 | if verbose: 116 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 117 | 118 | return hparams 119 | -------------------------------------------------------------------------------- /codes/hparams_update.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | #from text import symbols 3 | 4 | def create_hparams(hparams_string=None, verbose=False): 5 | """Create model hyperparameters. Parse nondefault from given string.""" 6 | 7 | hparams = tf.contrib.training.HParams( 8 | ################################ 9 | # Experiment Parameters # 10 | ################################ 11 | epochs=100, 12 | iters_per_checkpoint=100, 13 | seed=1234, 14 | dynamic_loss_scaling=True, 15 | distributed_run=False, 16 | dist_backend="nccl", 17 | dist_url="tcp://localhost:54321", 18 | cudnn_enabled=True, 19 | cudnn_benchmark=False, 20 | 21 | ################################ 22 | # Data Parameters # 23 | ################################ 24 | training_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/training_mel_list.txt', 25 | validation_list='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/emotion_list/evaluation_mel_list.txt', 26 | #mel_mean_std='/data07/zhoukun/VCTK-Corpus/mel_mean_std.npy', 27 | mel_mean_std = '/home/zhoukun/nonparaSeq2seqVC_code-master/0013/mel_mean_std.npy', 28 | ################################ 29 | # Data Parameters # 30 | ################################ 31 | n_mel_channels=80, 32 | n_spc_channels=1025, 33 | n_symbols=41, # 34 | pretrain_n_speakers=99, # 35 | 36 | n_speakers=5, # 37 | predict_spectrogram=False, 38 | 39 | ################################ 40 | # Model Parameters # 41 | ################################ 42 | 43 | symbols_embedding_dim=512, 44 | 45 | # Text Encoder parameters 46 | encoder_kernel_size=5, 47 | encoder_n_convolutions=3, 48 | encoder_embedding_dim=512, 49 | text_encoder_dropout=0.5, 50 | 51 | # Audio Encoder parameters 52 | spemb_input=False, 53 | n_frames_per_step_encoder=2, 54 | audio_encoder_hidden_dim=512, 55 | AE_attention_dim=128, 56 | AE_attention_location_n_filters=32, 57 | AE_attention_location_kernel_size=51, 58 | beam_width=10, 59 | 60 | # hidden activation 61 | # relu linear tanh 62 | hidden_activation='tanh', 63 | 64 | #Speaker Encoder parameters 65 | speaker_encoder_hidden_dim=256, 66 | speaker_encoder_dropout=0.2, 67 | speaker_embedding_dim=128, 68 | 69 | 70 | #Speaker Classifier parameters 71 | SC_hidden_dim=512, 72 | SC_n_convolutions=3, 73 | SC_kernel_size=1, 74 | 75 | # Decoder parameters 76 | feed_back_last=True, 77 | n_frames_per_step_decoder=2, 78 | decoder_rnn_dim=512, 79 | prenet_dim=[256,256], 80 | max_decoder_steps=1000, 81 | stop_threshold=0.5, 82 | 83 | # Attention parameters 84 | attention_rnn_dim=512, 85 | attention_dim=128, 86 | 87 | # Location Layer parameters 88 | attention_location_n_filters=32, 89 | attention_location_kernel_size=17, 90 | 91 | # PostNet parameters 92 | postnet_n_convolutions=5, 93 | postnet_dim=512, 94 | postnet_kernel_size=5, 95 | postnet_dropout=0.5, 96 | 97 | ################################ 98 | # Optimization Hyperparameters # 99 | ################################ 100 | use_saved_learning_rate=False, 101 | learning_rate=1e-3, 102 | weight_decay=1e-6, 103 | grad_clip_thresh=5.0, 104 | batch_size=64, 105 | #batch_size = 8, 106 | warmup = 7, 107 | decay_rate = 0.5, 108 | decay_every = 7, 109 | 110 | 111 | 112 | 113 | contrastive_loss_w=30.0, 114 | 
#speaker_encoder_loss_w=1.0, 115 | speaker_encoder_loss_w=5.0, 116 | text_classifier_loss_w=5.0, 117 | speaker_adversial_loss_w=20., 118 | #speaker_classifier_loss_w=0.1, 119 | speaker_classifier_loss_w=5.0, 120 | ce_loss=False 121 | ) 122 | 123 | if hparams_string: 124 | tf.logging.info('Parsing command line hparams: %s', hparams_string) 125 | hparams.parse(hparams_string) 126 | 127 | if verbose: 128 | tf.logging.info('Final parsed hparams: %s', list(hparams.values())) 129 | 130 | return hparams 131 | -------------------------------------------------------------------------------- /codes/inference.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | 5 | 6 | import os 7 | import librosa 8 | import numpy as np 9 | import torch 10 | from torch.utils.data import DataLoader 11 | 12 | from reader import TextMelIDLoader, TextMelIDCollate, id2ph, id2sp 13 | from hparams import create_hparams 14 | from model import Parrot, lcm 15 | from train import load_model 16 | import scipy.io.wavfile 17 | 18 | 19 | ########### Configuration ########### 20 | hparams = create_hparams() 21 | 22 | #generation list 23 | 24 | #hlist = '/home/jxzhang/Documents/DataSets/VCTK/list/hold_english.list' #unseen speakers list 25 | tlist = '/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/reader/evaluation_mel_list.txt' #seen speakers list 26 | 27 | # use seen (tlist) or unseen list (hlist) 28 | test_list = tlist 29 | checkpoint_path='/home/zhoukun/nonparaSeq2seqVC_code-master/pre-train/outdir/checkpoint_234000' 30 | # TTS or VC task? 31 | input_text=False 32 | # number of utterances for generation 33 | NUM=10 34 | ISMEL=(not hparams.predict_spectrogram) 35 | ##################################### 36 | 37 | def plot_data(data, fn, figsize=(12, 4)): 38 | fig, axes = plt.subplots(1, len(data), figsize=figsize) 39 | for i in range(len(data)): 40 | if len(data) == 1: 41 | ax = axes 42 | else: 43 | ax = axes[i] 44 | g = ax.imshow(data[i], aspect='auto', origin='lower', 45 | interpolation='none') 46 | 47 | plt.colorbar(g, ax=ax) 48 | plt.savefig(fn) 49 | 50 | 51 | model = load_model(hparams) 52 | 53 | model.load_state_dict(torch.load(checkpoint_path)['state_dict']) 54 | _ = model.eval() 55 | 56 | test_set = TextMelIDLoader(test_list, hparams.mel_mean_std, shuffle=True) 57 | sample_list = test_set.file_path_list 58 | collate_fn = TextMelIDCollate(lcm(hparams.n_frames_per_step_encoder, 59 | hparams.n_frames_per_step_decoder)) 60 | 61 | test_loader = DataLoader(test_set, num_workers=1, shuffle=False, 62 | sampler=None, 63 | batch_size=1, pin_memory=False, 64 | drop_last=True, collate_fn=collate_fn) 65 | 66 | 67 | 68 | task = 'tts' if input_text else 'vc' 69 | path_save = os.path.join(checkpoint_path.replace('checkpoint', 'test'), task) 70 | path_save += '_seen' if test_list == tlist else '_unseen' 71 | if not os.path.exists(path_save): 72 | os.makedirs(path_save) 73 | 74 | print(path_save) 75 | 76 | def recover_wav(mel, wav_path, ismel=False, 77 | n_fft=2048, win_length=800,hop_length=200): 78 | 79 | if ismel: 80 | mean, std = np.load(hparams.mel_mean_std) 81 | else: 82 | mean, std = np.load(hparams.mel_mean_std.replace('mel','spec')) 83 | 84 | mean = mean[:,None] 85 | std = std[:,None] 86 | mel = 1.2 * mel * std + mean 87 | mel = np.exp(mel) 88 | 89 | if ismel: 90 | filters = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80) 91 | inv_filters = np.linalg.pinv(filters) 92 | spec = np.dot(inv_filters, mel) 93 | 
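    # When `ismel` is true, the 80-band mel spectrogram is first mapped back to
    # an approximate linear-frequency magnitude spectrogram via the pseudo-inverse
    # of the mel filterbank computed above; the Griffin-Lim loop below then
    # estimates the phase so a waveform can be synthesized.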
else: 94 | spec = mel 95 | 96 | def _griffin_lim(stftm_matrix, shape, max_iter=50): 97 | y = np.random.random(shape) 98 | for i in range(max_iter): 99 | stft_matrix = librosa.core.stft(y, n_fft=n_fft, win_length=win_length, hop_length=hop_length) 100 | stft_matrix = stftm_matrix * stft_matrix / np.abs(stft_matrix) 101 | y = librosa.core.istft(stft_matrix, win_length=win_length, hop_length=hop_length) 102 | return y 103 | 104 | shape = spec.shape[1] * hop_length - hop_length + 1 105 | 106 | y = _griffin_lim(spec, shape) 107 | scipy.io.wavfile.write(wav_path, 16000, y) 108 | return y 109 | 110 | 111 | text_input, mel, spec, speaker_id = test_set[0] 112 | reference_mel = mel.cuda().unsqueeze(0) 113 | ref_sp = id2sp[speaker_id.item()] 114 | 115 | def levenshteinDistance(s1, s2): 116 | if len(s1) > len(s2): 117 | s1, s2 = s2, s1 118 | 119 | distances = list(range(len(s1) + 1)) 120 | for i2, c2 in enumerate(s2): 121 | distances_ = [i2+1] 122 | for i1, c1 in enumerate(s1): 123 | if c1 == c2: 124 | distances_.append(distances[i1]) 125 | else: 126 | distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) 127 | distances = distances_ 128 | return distances[-1] 129 | 130 | with torch.no_grad(): 131 | 132 | errs = 0 133 | totalphs = 0 134 | 135 | for i, batch in enumerate(test_loader): 136 | if i == NUM: 137 | break 138 | 139 | sample_id = sample_list[i].split('/')[-1][9:17] 140 | print(('%d index %s, decoding ...'%(i,sample_id))) 141 | 142 | x, y = model.parse_batch(batch) 143 | predicted_mel, post_output, predicted_stop, alignments, \ 144 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments, \ 145 | speaker_id = model.inference(x, input_text, reference_mel, hparams.beam_width) 146 | 147 | post_output = post_output.data.cpu().numpy()[0] 148 | alignments = alignments.data.cpu().numpy()[0].T 149 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy()[0].T 150 | 151 | text_hidden = text_hidden.data.cpu().numpy()[0].T #-> [hidden_dim, max_text_len] 152 | audio_seq2seq_hidden = audio_seq2seq_hidden.data.cpu().numpy()[0].T 153 | audio_seq2seq_phids = audio_seq2seq_phids.data.cpu().numpy()[0] # [T + 1] 154 | speaker_id = speaker_id.data.cpu().numpy()[0] # scalar 155 | 156 | task = 'TTS' if input_text else 'VC' 157 | 158 | recover_wav(post_output, 159 | os.path.join(path_save, 'Wav_%s_ref_%s_%s.wav'%(sample_id, ref_sp, task)), 160 | ismel=ISMEL) 161 | 162 | post_output_path = os.path.join(path_save, 'Mel_%s_ref_%s_%s.npy'%(sample_id, ref_sp, task)) 163 | np.save(post_output_path, post_output) 164 | 165 | plot_data([alignments, audio_seq2seq_alignments], 166 | os.path.join(path_save, 'Ali_%s_ref_%s_%s.pdf'%(sample_id, ref_sp, task))) 167 | 168 | plot_data([np.hstack([text_hidden, audio_seq2seq_hidden])], 169 | os.path.join(path_save, 'Hid_%s_ref_%s_%s.pdf'%(sample_id, ref_sp, task))) 170 | 171 | audio_seq2seq_phids = [id2ph[id] for id in audio_seq2seq_phids[:-1]] 172 | target_text = y[0].data.cpu().numpy()[0] 173 | target_text = [id2ph[id] for id in target_text[:]] 174 | 175 | print('Sounds like %s, Decoded text is '%(id2sp[speaker_id])) 176 | 177 | print(audio_seq2seq_phids) 178 | print(target_text) 179 | 180 | err = levenshteinDistance(audio_seq2seq_phids, target_text) 181 | print(err, len(target_text)) 182 | 183 | errs += err 184 | totalphs += len(target_text) 185 | 186 | print(float(errs)/float(totalphs)) 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /codes/logger.py: 
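(Note on the file below: ParrotLogger extends tensorboardX's SummaryWriter, and a training loop is expected to call something along the lines of logger.log_training(total_loss, losses, accuracies, grad_norm, learning_rate, duration, iteration), where `losses` holds the ten individual loss terms and `accuracies` the four classifier accuracies written out as scalars below; the variable names here are illustrative.)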
-------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import torch.nn.functional as F 4 | from tensorboardX import SummaryWriter 5 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy, plot_alignment 6 | from plotting_utils import plot_gate_outputs_to_numpy 7 | 8 | 9 | class ParrotLogger(SummaryWriter): 10 | def __init__(self, logdir, ali_path='ali'): 11 | super(ParrotLogger, self).__init__(logdir) 12 | ali_path = os.path.join(logdir, ali_path) 13 | if not os.path.exists(ali_path): 14 | os.makedirs(ali_path) 15 | self.ali_path = ali_path 16 | 17 | def log_training(self, reduced_loss, reduced_losses, reduced_acces, grad_norm, learning_rate, duration, 18 | iteration): 19 | 20 | self.add_scalar("training.loss", reduced_loss, iteration) 21 | self.add_scalar("training.loss.recon", reduced_losses[0], iteration) 22 | self.add_scalar("training.loss.recon_post", reduced_losses[1], iteration) 23 | self.add_scalar("training.loss.stop", reduced_losses[2], iteration) 24 | self.add_scalar("training.loss.contr", reduced_losses[3], iteration) 25 | self.add_scalar("training.loss.spenc", reduced_losses[4], iteration) 26 | self.add_scalar("training.loss.spcla", reduced_losses[5], iteration) 27 | self.add_scalar("training.loss.texcl", reduced_losses[6], iteration) 28 | self.add_scalar("training.loss.spadv", reduced_losses[7], iteration) 29 | self.add_scalar("training.loss.serloss", reduced_losses[8], iteration) 30 | self.add_scalar("training.loss.emobloss", reduced_losses[9], iteration) 31 | 32 | self.add_scalar("grad.norm", grad_norm, iteration) 33 | self.add_scalar("learning.rate", learning_rate, iteration) 34 | self.add_scalar("duration", duration, iteration) 35 | 36 | 37 | self.add_scalar('training.acc.spenc', reduced_acces[0], iteration) 38 | self.add_scalar('training.acc.spcla', reduced_acces[1], iteration) 39 | self.add_scalar('training.acc.texcl', reduced_acces[2], iteration) 40 | self.add_scalar('training.acc.seracc', reduced_acces[3], iteration) 41 | 42 | def log_validation(self, reduced_loss, reduced_losses, reduced_acces, model, y, y_pred, iteration, task): 43 | 44 | self.add_scalar('validation.loss.%s'%task, reduced_loss, iteration) 45 | self.add_scalar("validation.loss.%s.recon"%task, reduced_losses[0], iteration) 46 | self.add_scalar("validation.loss.%s.recon_post"%task, reduced_losses[1], iteration) 47 | self.add_scalar("validation.loss.%s.stop"%task, reduced_losses[2], iteration) 48 | self.add_scalar("validation.loss.%s.contr"%task, reduced_losses[3], iteration) 49 | self.add_scalar("validation.loss.%s.spenc"%task, reduced_losses[4], iteration) 50 | self.add_scalar("validation.loss.%s.spcla"%task, reduced_losses[5], iteration) 51 | self.add_scalar("validation.loss.%s.texcl"%task, reduced_losses[6], iteration) 52 | self.add_scalar("validation.loss.%s.spadv"%task, reduced_losses[7], iteration) 53 | self.add_scalar("validation.loss.%s.serloss"%task, reduced_losses[8], iteration) 54 | self.add_scalar("validation.loss.%s.emobloss"%task, reduced_losses[9], iteration) 55 | 56 | 57 | self.add_scalar('validation.acc.%s.spenc'%task, reduced_acces[0], iteration) 58 | self.add_scalar('validation.acc.%s.spcla'%task, reduced_acces[1], iteration) 59 | self.add_scalar('validatoin.acc.%s.texcl'%task, reduced_acces[2], iteration) 60 | self.add_scalar('validatoin.acc.%s.seracc'%task, reduced_acces[3], iteration) 61 | 62 | predicted_mel, post_output, predicted_stop, alignments, \ 63 | text_hidden, mel_hidden, 
text_logit_from_mel_hidden, \ 64 | audio_seq2seq_alignments, \ 65 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 66 | text_lengths, mel_lengths, speaker_embedding = y_pred 67 | 68 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = y 69 | 70 | stop_target = stop_target.reshape(stop_target.size(0), -1, int(stop_target.size(1)/predicted_stop.size(1))) 71 | stop_target = stop_target[:,:,0] 72 | 73 | # plot distribution of parameters 74 | #for tag, value in model.named_parameters(): 75 | # tag = tag.replace('.', '/') 76 | # self.add_histogram(tag, value.data.cpu().numpy(), iteration) 77 | 78 | # plot alignment, mel target and predicted, stop target and predicted 79 | idx = random.randint(0, alignments.size(0) - 1) 80 | 81 | alignments = alignments.data.cpu().numpy() 82 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy() 83 | 84 | self.add_image( 85 | "%s.alignment"%task, 86 | plot_alignment_to_numpy(alignments[idx].T), 87 | iteration, dataformats='HWC') 88 | 89 | # plot more alignments 90 | plot_alignment(alignments[:4], self.ali_path+'/step-%d-%s.pdf'%(iteration, task)) 91 | 92 | self.add_image( 93 | "%s.audio_seq2seq_alignment"%task, 94 | plot_alignment_to_numpy(audio_seq2seq_alignments[idx].T), 95 | iteration, dataformats='HWC') 96 | 97 | self.add_image( 98 | "%s.mel_target"%task, 99 | plot_spectrogram_to_numpy(mel_target[idx].data.cpu().numpy()), 100 | iteration, dataformats='HWC') 101 | 102 | self.add_image( 103 | "%s.mel_predicted"%task, 104 | plot_spectrogram_to_numpy(predicted_mel[idx].data.cpu().numpy()), 105 | iteration, dataformats='HWC') 106 | 107 | self.add_image( 108 | "%s.spc_target"%task, 109 | plot_spectrogram_to_numpy(spc_target[idx].data.cpu().numpy()), 110 | iteration, dataformats='HWC') 111 | 112 | self.add_image( 113 | "%s.post_predicted"%task, 114 | plot_spectrogram_to_numpy(post_output[idx].data.cpu().numpy()), 115 | iteration, dataformats='HWC') 116 | 117 | self.add_image( 118 | "%s.stop"%task, 119 | plot_gate_outputs_to_numpy( 120 | stop_target[idx].data.cpu().numpy(), 121 | F.sigmoid(predicted_stop[idx]).data.cpu().numpy()), 122 | iteration, dataformats='HWC') 123 | -------------------------------------------------------------------------------- /codes/logger_original.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import torch.nn.functional as F 4 | from tensorboardX import SummaryWriter 5 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy, plot_alignment 6 | from plotting_utils import plot_gate_outputs_to_numpy 7 | 8 | 9 | class ParrotLogger(SummaryWriter): 10 | def __init__(self, logdir, ali_path='ali'): 11 | super(ParrotLogger, self).__init__(logdir) 12 | ali_path = os.path.join(logdir, ali_path) 13 | if not os.path.exists(ali_path): 14 | os.makedirs(ali_path) 15 | self.ali_path = ali_path 16 | 17 | def log_training(self, reduced_loss, reduced_losses, reduced_acces, grad_norm, learning_rate, duration, 18 | iteration): 19 | 20 | self.add_scalar("training.loss", reduced_loss, iteration) 21 | self.add_scalar("training.loss.recon", reduced_losses[0], iteration) 22 | self.add_scalar("training.loss.recon_post", reduced_losses[1], iteration) 23 | self.add_scalar("training.loss.stop", reduced_losses[2], iteration) 24 | self.add_scalar("training.loss.contr", reduced_losses[3], iteration) 25 | self.add_scalar("training.loss.spenc", reduced_losses[4], iteration) 26 | self.add_scalar("training.loss.spcla", 
reduced_losses[5], iteration) 27 | self.add_scalar("training.loss.texcl", reduced_losses[6], iteration) 28 | self.add_scalar("training.loss.spadv", reduced_losses[7], iteration) 29 | 30 | self.add_scalar("grad.norm", grad_norm, iteration) 31 | self.add_scalar("learning.rate", learning_rate, iteration) 32 | self.add_scalar("duration", duration, iteration) 33 | 34 | 35 | self.add_scalar('training.acc.spenc', reduced_acces[0], iteration) 36 | self.add_scalar('training.acc.spcla', reduced_acces[1], iteration) 37 | self.add_scalar('training.acc.texcl', reduced_acces[2], iteration) 38 | 39 | def log_validation(self, reduced_loss, reduced_losses, reduced_acces, model, y, y_pred, iteration, task): 40 | 41 | self.add_scalar('validation.loss.%s'%task, reduced_loss, iteration) 42 | self.add_scalar("validation.loss.%s.recon"%task, reduced_losses[0], iteration) 43 | self.add_scalar("validation.loss.%s.recon_post"%task, reduced_losses[1], iteration) 44 | self.add_scalar("validation.loss.%s.stop"%task, reduced_losses[2], iteration) 45 | self.add_scalar("validation.loss.%s.contr"%task, reduced_losses[3], iteration) 46 | self.add_scalar("validation.loss.%s.spenc"%task, reduced_losses[4], iteration) 47 | self.add_scalar("validation.loss.%s.spcla"%task, reduced_losses[5], iteration) 48 | self.add_scalar("validation.loss.%s.texcl"%task, reduced_losses[6], iteration) 49 | self.add_scalar("validation.loss.%s.spadv"%task, reduced_losses[7], iteration) 50 | 51 | self.add_scalar('validation.acc.%s.spenc'%task, reduced_acces[0], iteration) 52 | self.add_scalar('validation.acc.%s.spcla'%task, reduced_acces[1], iteration) 53 | self.add_scalar('validatoin.acc.%s.texcl'%task, reduced_acces[2], iteration) 54 | 55 | predicted_mel, post_output, predicted_stop, alignments, \ 56 | text_hidden, mel_hidden, text_logit_from_mel_hidden, \ 57 | audio_seq2seq_alignments, \ 58 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 59 | text_lengths, mel_lengths = y_pred 60 | 61 | text_target, mel_target, spc_target, speaker_target, stop_target = y 62 | 63 | stop_target = stop_target.reshape(stop_target.size(0), -1, int(stop_target.size(1)/predicted_stop.size(1))) 64 | stop_target = stop_target[:,:,0] 65 | 66 | # plot distribution of parameters 67 | #for tag, value in model.named_parameters(): 68 | # tag = tag.replace('.', '/') 69 | # self.add_histogram(tag, value.data.cpu().numpy(), iteration) 70 | 71 | # plot alignment, mel target and predicted, stop target and predicted 72 | idx = random.randint(0, alignments.size(0) - 1) 73 | 74 | alignments = alignments.data.cpu().numpy() 75 | audio_seq2seq_alignments = audio_seq2seq_alignments.data.cpu().numpy() 76 | 77 | self.add_image( 78 | "%s.alignment"%task, 79 | plot_alignment_to_numpy(alignments[idx].T), 80 | iteration, dataformats='HWC') 81 | 82 | # plot more alignments 83 | plot_alignment(alignments[:4], self.ali_path+'/step-%d-%s.pdf'%(iteration, task)) 84 | 85 | self.add_image( 86 | "%s.audio_seq2seq_alignment"%task, 87 | plot_alignment_to_numpy(audio_seq2seq_alignments[idx].T), 88 | iteration, dataformats='HWC') 89 | 90 | self.add_image( 91 | "%s.mel_target"%task, 92 | plot_spectrogram_to_numpy(mel_target[idx].data.cpu().numpy()), 93 | iteration, dataformats='HWC') 94 | 95 | self.add_image( 96 | "%s.mel_predicted"%task, 97 | plot_spectrogram_to_numpy(predicted_mel[idx].data.cpu().numpy()), 98 | iteration, dataformats='HWC') 99 | 100 | self.add_image( 101 | "%s.spc_target"%task, 102 | plot_spectrogram_to_numpy(spc_target[idx].data.cpu().numpy()), 103 | iteration, 
dataformats='HWC') 104 | 105 | self.add_image( 106 | "%s.post_predicted"%task, 107 | plot_spectrogram_to_numpy(post_output[idx].data.cpu().numpy()), 108 | iteration, dataformats='HWC') 109 | 110 | self.add_image( 111 | "%s.stop"%task, 112 | plot_gate_outputs_to_numpy( 113 | stop_target[idx].data.cpu().numpy(), 114 | F.sigmoid(predicted_stop[idx]).data.cpu().numpy()), 115 | iteration, dataformats='HWC') 116 | -------------------------------------------------------------------------------- /codes/lstm_test.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | #import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import matplotlib.pyplot as plt 7 | import torch 8 | import torch.nn as nn 9 | from sklearn.metrics import confusion_matrix 10 | import pickle 11 | 12 | class TimeDistributed(nn.Module): 13 | def __init__(self, module): 14 | super(TimeDistributed, self).__init__() 15 | self.module = module 16 | 17 | def forward(self, x): 18 | 19 | if len(x.size()) <= 2: 20 | return self.module(x) 21 | # squash samples and timesteps into a single axis 22 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 23 | x_reshape = x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 24 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 25 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 26 | else: # (samples,timesteps,inp1,inp2,inp3) 27 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 28 | 29 | y = self.module(x_reshape) 30 | 31 | # we have to reshape Y 32 | if len(x.size()) == 3: 33 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 34 | elif len(x.size()) == 4: 35 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 36 | else: 37 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 38 | y.size(3)) # (samples, timesteps, out1,out2, out3) 39 | return y 40 | 41 | class HybridModel(nn.Module): 42 | def __init__(self,num_emotions): 43 | super().__init__() 44 | # conv block 45 | self.conv2Dblock = nn.Sequential( 46 | # 1. conv block 47 | TimeDistributed(nn.Conv2d(in_channels=1, 48 | out_channels=16, 49 | kernel_size=3, 50 | stride=1, 51 | padding=1 52 | )), 53 | TimeDistributed(nn.BatchNorm2d(16)), 54 | TimeDistributed(nn.ReLU()), 55 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 56 | TimeDistributed(nn.Dropout(p=0.3)), 57 | # 2. conv block 58 | TimeDistributed(nn.Conv2d(in_channels=16, 59 | out_channels=32, 60 | kernel_size=3, 61 | stride=1, 62 | padding=1 63 | )), 64 | TimeDistributed(nn.BatchNorm2d(32)), 65 | TimeDistributed(nn.ReLU()), 66 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 67 | TimeDistributed(nn.Dropout(p=0.3)), 68 | # 3. 
conv block 69 | TimeDistributed(nn.Conv2d(in_channels=32, 70 | out_channels=64, 71 | kernel_size=3, 72 | stride=1, 73 | padding=1 74 | )), 75 | TimeDistributed(nn.BatchNorm2d(64)), 76 | TimeDistributed(nn.ReLU()), 77 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 78 | TimeDistributed(nn.Dropout(p=0.3)) 79 | ) 80 | # LSTM block 81 | hidden_size = 32 82 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 83 | self.dropout_lstm = nn.Dropout(p=0.4) 84 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 85 | # Linear softmax layer 86 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 87 | def forward(self,x): 88 | conv_embedding = self.conv2Dblock(x) 89 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 90 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 91 | lstm_embedding = self.dropout_lstm(lstm_embedding) 92 | # lstm_embedding (batch, time, hidden_size*2) 93 | batch_size,T,_ = lstm_embedding.shape 94 | attention_weights = [None]*T 95 | for t in range(T): 96 | embedding = lstm_embedding[:,t,:] 97 | attention_weights[t] = self.attention_linear(embedding) 98 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 99 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 100 | attention = torch.squeeze(attention, 1) 101 | output_logits = self.out_linear(attention) 102 | output_softmax = nn.functional.softmax(output_logits,dim=1) 103 | return output_logits, output_softmax, attention 104 | -------------------------------------------------------------------------------- /codes/model/__init__.py: -------------------------------------------------------------------------------- 1 | from .model import Parrot 2 | from .loss import ParrotLoss 3 | from .utils import lcm,gcd -------------------------------------------------------------------------------- /codes/model/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/basic_layers.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/basic_layers.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/beam.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/beam.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/decoder.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/decoder.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/layers.cpython-36.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/layers.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/loss.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/loss.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/lstm_test.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/lstm_test.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/model.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/model.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/penalties.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/penalties.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/__pycache__/utils.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/codes/model/__pycache__/utils.cpython-36.pyc -------------------------------------------------------------------------------- /codes/model/basic_layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.nn import functional as F 4 | 5 | 6 | def tile(x, count, dim=0): 7 | """ 8 | Tiles x on dimension dim count times. 
9 | """ 10 | perm = list(range(len(x.size()))) 11 | if dim != 0: 12 | perm[0], perm[dim] = perm[dim], perm[0] 13 | x = x.permute(perm).contiguous() 14 | out_size = list(x.size()) 15 | out_size[0] *= count 16 | batch = x.size(0) 17 | x = x.view(batch, -1) \ 18 | .transpose(0, 1) \ 19 | .repeat(count, 1) \ 20 | .transpose(0, 1) \ 21 | .contiguous() \ 22 | .view(*out_size) 23 | if dim != 0: 24 | x = x.permute(perm).contiguous() 25 | return x 26 | 27 | 28 | def sort_batch(data, lengths): 29 | ''' 30 | sort data by length 31 | sorted_data[initial_index] == data 32 | ''' 33 | sorted_lengths, sorted_index = lengths.sort(0, descending=True) 34 | sorted_data = data[sorted_index] 35 | _, initial_index = sorted_index.sort(0, descending=False) 36 | 37 | return sorted_data, sorted_lengths, initial_index 38 | 39 | 40 | class LinearNorm(torch.nn.Module): 41 | def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'): 42 | super(LinearNorm, self).__init__() 43 | self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias) 44 | 45 | torch.nn.init.xavier_uniform_( 46 | self.linear_layer.weight, 47 | gain=torch.nn.init.calculate_gain(w_init_gain)) 48 | 49 | def forward(self, x): 50 | return self.linear_layer(x) 51 | 52 | 53 | class ConvNorm(torch.nn.Module): 54 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 55 | padding=None, dilation=1, bias=True, w_init_gain='linear', param=None): 56 | super(ConvNorm, self).__init__() 57 | if padding is None: 58 | assert(kernel_size % 2 == 1) 59 | padding = int(dilation * (kernel_size - 1) / 2) 60 | 61 | self.conv = torch.nn.Conv1d(in_channels, out_channels, 62 | kernel_size=kernel_size, stride=stride, 63 | padding=padding, dilation=dilation, 64 | bias=bias) 65 | 66 | torch.nn.init.xavier_uniform_( 67 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain, param=param)) 68 | 69 | def forward(self, signal): 70 | conv_signal = self.conv(signal) 71 | return conv_signal 72 | 73 | 74 | class Prenet(nn.Module): 75 | def __init__(self, in_dim, sizes): 76 | super(Prenet, self).__init__() 77 | in_sizes = [in_dim] + sizes[:-1] 78 | self.layers = nn.ModuleList( 79 | [LinearNorm(in_size, out_size, bias=False) 80 | for (in_size, out_size) in zip(in_sizes, sizes)]) 81 | 82 | def forward(self, x): 83 | for linear in self.layers: 84 | x = F.dropout(F.relu(linear(x)), p=0.5, training=True) 85 | return x 86 | 87 | 88 | class LocationLayer(nn.Module): 89 | def __init__(self, attention_n_filters, attention_kernel_size, 90 | attention_dim): 91 | super(LocationLayer, self).__init__() 92 | padding = int((attention_kernel_size - 1) / 2) 93 | self.location_conv = ConvNorm(2, attention_n_filters, 94 | kernel_size=attention_kernel_size, 95 | padding=padding, bias=False, stride=1, 96 | dilation=1) 97 | self.location_dense = LinearNorm(attention_n_filters, attention_dim, 98 | bias=False, w_init_gain='tanh') 99 | 100 | def forward(self, attention_weights_cat): 101 | processed_attention = self.location_conv(attention_weights_cat) 102 | processed_attention = processed_attention.transpose(1, 2) 103 | processed_attention = self.location_dense(processed_attention) 104 | return processed_attention 105 | 106 | 107 | class Attention(nn.Module): 108 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 109 | attention_location_n_filters, attention_location_kernel_size): 110 | super(Attention, self).__init__() 111 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 112 | bias=False, w_init_gain='tanh') 113 | self.memory_layer = 
LinearNorm(embedding_dim, attention_dim, bias=False, 114 | w_init_gain='tanh') 115 | self.v = LinearNorm(attention_dim, 1, bias=False) 116 | self.location_layer = LocationLayer(attention_location_n_filters, 117 | attention_location_kernel_size, 118 | attention_dim) 119 | self.score_mask_value = -float("inf") 120 | 121 | def get_alignment_energies(self, query, processed_memory, 122 | attention_weights_cat): 123 | """ 124 | PARAMS 125 | ------ 126 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 127 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 128 | attention_weights_cat: cumulative and prev. att weights (B, 2, max_time) 129 | RETURNS 130 | ------- 131 | alignment (batch, max_time) 132 | """ 133 | 134 | processed_query = self.query_layer(query.unsqueeze(1)) 135 | processed_attention_weights = self.location_layer(attention_weights_cat) 136 | energies = self.v(torch.tanh( 137 | processed_query + processed_attention_weights + processed_memory)) 138 | 139 | energies = energies.squeeze(-1) 140 | return energies 141 | 142 | def forward(self, attention_hidden_state, memory, processed_memory, 143 | attention_weights_cat, mask): 144 | """ 145 | PARAMS 146 | ------ 147 | attention_hidden_state: attention rnn last output 148 | memory: encoder outputs 149 | processed_memory: processed encoder outputs 150 | attention_weights_cat: previous and cummulative attention weights 151 | mask: binary mask for padded data 152 | """ 153 | alignment = self.get_alignment_energies( 154 | attention_hidden_state, processed_memory, attention_weights_cat) 155 | 156 | if mask is not None: 157 | alignment.data.masked_fill_(mask, self.score_mask_value) 158 | 159 | attention_weights = F.softmax(alignment, dim=1) 160 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 161 | attention_context = attention_context.squeeze(1) 162 | 163 | return attention_context, attention_weights 164 | 165 | 166 | class ForwardAttentionV2(nn.Module): 167 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 168 | attention_location_n_filters, attention_location_kernel_size): 169 | super(ForwardAttentionV2, self).__init__() 170 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 171 | bias=False, w_init_gain='tanh') 172 | self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False, 173 | w_init_gain='tanh') 174 | self.v = LinearNorm(attention_dim, 1, bias=False) 175 | self.location_layer = LocationLayer(attention_location_n_filters, 176 | attention_location_kernel_size, 177 | attention_dim) 178 | self.score_mask_value = -float(1e20) 179 | 180 | def get_alignment_energies(self, query, processed_memory, 181 | attention_weights_cat): 182 | """ 183 | PARAMS 184 | ------ 185 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 186 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 187 | attention_weights_cat: prev. 
and cumulative att weights (B, 2, max_time) 188 | RETURNS 189 | ------- 190 | alignment (batch, max_time) 191 | """ 192 | 193 | processed_query = self.query_layer(query.unsqueeze(1)) 194 | processed_attention_weights = self.location_layer(attention_weights_cat) 195 | energies = self.v(torch.tanh( 196 | processed_query + processed_attention_weights + processed_memory)) 197 | 198 | energies = energies.squeeze(-1) 199 | return energies 200 | 201 | def forward(self, attention_hidden_state, memory, processed_memory, 202 | attention_weights_cat, mask, log_alpha): 203 | """ 204 | PARAMS 205 | ------ 206 | attention_hidden_state: attention rnn last output 207 | memory: encoder outputs 208 | processed_memory: processed encoder outputs 209 | attention_weights_cat: previous and cummulative attention weights 210 | mask: binary mask for padded data 211 | """ 212 | log_energy = self.get_alignment_energies( 213 | attention_hidden_state, processed_memory, attention_weights_cat) 214 | 215 | #log_energy = 216 | 217 | if mask is not None: 218 | log_energy.data.masked_fill_(mask, self.score_mask_value) 219 | 220 | #attention_weights = F.softmax(alignment, dim=1) 221 | 222 | #content_score = log_energy.unsqueeze(1) #[B, MAX_TIME] -> [B, 1, MAX_TIME] 223 | #log_alpha = log_alpha.unsqueeze(2) #[B, MAX_TIME] -> [B, MAX_TIME, 1] 224 | 225 | #log_total_score = log_alpha + content_score 226 | 227 | #previous_attention_weights = attention_weights_cat[:,0,:] 228 | 229 | log_alpha_shift_padded = [] 230 | max_time = log_energy.size(1) 231 | for sft in range(2): 232 | shifted = log_alpha[:,:max_time-sft] 233 | shift_padded = F.pad(shifted, (sft,0), 'constant', self.score_mask_value) 234 | log_alpha_shift_padded.append(shift_padded.unsqueeze(2)) 235 | 236 | biased = torch.logsumexp(torch.cat(log_alpha_shift_padded,2), 2) 237 | 238 | log_alpha_new = biased + log_energy 239 | 240 | attention_weights = F.softmax(log_alpha_new, dim=1) 241 | 242 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 243 | attention_context = attention_context.squeeze(1) 244 | 245 | return attention_context, attention_weights, log_alpha_new -------------------------------------------------------------------------------- /codes/model/beam.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | import torch 4 | from .penalties import PenaltyBuilder 5 | 6 | 7 | 8 | class Beam(object): 9 | """ 10 | ''' 11 | adapt from opennmt 12 | https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/beam.py 13 | ''' 14 | 15 | Class for managing the internals of the beam search process. 16 | Takes care of beams, back pointers, and scores. 17 | Args: 18 | size (int): beam size 19 | pad, bos, eos (int): indices of padding, beginning, and ending. 20 | n_best (int): nbest size to use 21 | cuda (bool): use gpu 22 | global_scorer (:obj:`GlobalScorer`) 23 | """ 24 | 25 | def __init__(self, size, pad, bos, eos, 26 | n_best=1, cuda=False, 27 | global_scorer=None, 28 | min_length=0, 29 | stepwise_penalty=False, 30 | block_ngram_repeat=0, 31 | exclusion_tokens=set()): 32 | 33 | self.size = size 34 | self.tt = torch.cuda if cuda else torch 35 | 36 | # The score for each translation on the beam. 37 | self.scores = self.tt.FloatTensor(size).zero_() 38 | self.all_scores = [] 39 | 40 | # The backpointers at each time-step. 41 | self.prev_ks = [] 42 | 43 | # The outputs at each time-step. 
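# Note: next_ys[t] stores the token chosen by each of the `size` beams at step t;
# the search is seeded with a single BOS in slot 0 and padding in the remaining slots.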
44 | self.next_ys = [self.tt.LongTensor(size) 45 | .fill_(pad)] 46 | self.next_ys[0][0] = bos 47 | 48 | # Has EOS topped the beam yet. 49 | self._eos = eos 50 | self.eos_top = False 51 | 52 | # The attentions (matrix) for each time. 53 | self.attn = [] 54 | self.hidden = [] 55 | 56 | # Time and k pair for finished. 57 | self.finished = [] 58 | self.n_best = n_best 59 | 60 | # Information for global scoring. 61 | self.global_scorer = global_scorer 62 | self.global_state = {} 63 | 64 | # Minimum prediction length 65 | self.min_length = min_length 66 | 67 | # Apply Penalty at every step 68 | self.stepwise_penalty = stepwise_penalty 69 | self.block_ngram_repeat = block_ngram_repeat 70 | self.exclusion_tokens = exclusion_tokens 71 | 72 | def get_current_state(self): 73 | "Get the outputs for the current timestep." 74 | return self.next_ys[-1] 75 | 76 | def get_current_origin(self): 77 | "Get the backpointers for the current timestep." 78 | return self.prev_ks[-1] 79 | 80 | def advance(self, word_probs, attn_out, hidden): 81 | """ 82 | Given prob over words for every last beam `wordLk` and attention 83 | `attn_out`: Compute and update the beam search. 84 | Parameters: 85 | * `word_probs`- probs of advancing from the last step (K x words) 86 | * `attn_out`- attention at the last step 87 | Returns: True if beam search is complete. 88 | """ 89 | num_words = word_probs.size(1) 90 | if self.stepwise_penalty: 91 | self.global_scorer.update_score(self, attn_out) 92 | # force the output to be longer than self.min_length 93 | cur_len = len(self.next_ys) 94 | if cur_len < self.min_length: 95 | for k in range(len(word_probs)): 96 | word_probs[k][self._eos] = -1e20 97 | # Sum the previous scores. 98 | if len(self.prev_ks) > 0: 99 | beam_scores = word_probs + self.scores.unsqueeze(1) 100 | # Don't let EOS have children. 
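# Note: any beam whose latest token is EOS is frozen here with a score of -1e20 so it
# cannot be extended further; it is still recorded as a finished hypothesis later in
# advance() when it reaches the top of the beam.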
101 | for i in range(self.next_ys[-1].size(0)): 102 | if self.next_ys[-1][i] == self._eos: 103 | beam_scores[i] = -1e20 104 | 105 | # Block ngram repeats 106 | if self.block_ngram_repeat > 0: 107 | ngrams = [] 108 | le = len(self.next_ys) 109 | for j in range(self.next_ys[-1].size(0)): 110 | hyp, _ = self.get_hyp(le - 1, j) 111 | ngrams = set() 112 | fail = False 113 | gram = [] 114 | for i in range(le - 1): 115 | # Last n tokens, n = block_ngram_repeat 116 | gram = (gram + 117 | [hyp[i].item()])[-self.block_ngram_repeat:] 118 | # Skip the blocking if it is in the exclusion list 119 | if set(gram) & self.exclusion_tokens: 120 | continue 121 | if tuple(gram) in ngrams: 122 | fail = True 123 | ngrams.add(tuple(gram)) 124 | if fail: 125 | beam_scores[j] = -10e20 126 | else: 127 | beam_scores = word_probs[0] 128 | flat_beam_scores = beam_scores.view(-1) 129 | best_scores, best_scores_id = flat_beam_scores.topk(self.size, 0, 130 | True, True) 131 | 132 | self.all_scores.append(self.scores) 133 | self.scores = best_scores 134 | 135 | # best_scores_id is flattened beam x word array, so calculate which 136 | # word and beam each score came from 137 | prev_k = best_scores_id / num_words 138 | self.prev_ks.append(prev_k) 139 | self.next_ys.append((best_scores_id - prev_k * num_words)) 140 | self.attn.append(attn_out.index_select(0, prev_k)) 141 | self.hidden.append(hidden.index_select(0, prev_k)) 142 | self.global_scorer.update_global_state(self) 143 | 144 | for i in range(self.next_ys[-1].size(0)): 145 | if self.next_ys[-1][i] == self._eos: 146 | global_scores = self.global_scorer.score(self, self.scores) 147 | s = global_scores[i] 148 | self.finished.append((s, len(self.next_ys) - 1, i)) 149 | 150 | # End condition is when top-of-beam is EOS and no global score. 151 | if self.next_ys[-1][0] == self._eos: 152 | self.all_scores.append(self.scores) 153 | self.eos_top = True 154 | 155 | def done(self): 156 | return self.eos_top and len(self.finished) >= self.n_best 157 | 158 | def sort_finished(self, minimum=None): 159 | if minimum is not None: 160 | i = 0 161 | # Add from beam until we have minimum outputs. 162 | while len(self.finished) < minimum: 163 | global_scores = self.global_scorer.score(self, self.scores) 164 | s = global_scores[i] 165 | self.finished.append((s, len(self.next_ys) - 1, i)) 166 | i += 1 167 | 168 | self.finished.sort(key=lambda a: -a[0]) 169 | scores = [sc for sc, _, _ in self.finished] 170 | ks = [(t, k) for _, t, k in self.finished] 171 | return scores, ks 172 | 173 | def get_hyp(self, timestep, k): 174 | """ 175 | Walk back to construct the full hypothesis. 176 | """ 177 | hyp, attn, hidden = [], [], [] 178 | for j in range(len(self.prev_ks[:timestep]) - 1, -1, -1): 179 | hyp.append(self.next_ys[j + 1][k]) 180 | attn.append(self.attn[j][k]) 181 | hidden.append(self.hidden[j][k]) 182 | k = self.prev_ks[j][k] 183 | return torch.stack(hyp[::-1]), torch.stack(attn[::-1]), torch.stack(hidden[::-1]) 184 | 185 | 186 | class GNMTGlobalScorer(object): 187 | """ 188 | NMT re-ranking score from 189 | "Google's Neural Machine Translation System" :cite:`wu2016google` 190 | Args: 191 | alpha (float): length parameter 192 | beta (float): coverage parameter 193 | """ 194 | 195 | def __init__(self, opt=None): 196 | self.alpha = 0. 197 | self.beta = 0. 
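# Note: with alpha = beta = 0 and the builder below configured as ('none', 'avg'),
# no coverage penalty is applied and hypothesis scores are simply divided by the
# length of the hypothesis so far (see penalties.py).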
198 | penalty_builder = PenaltyBuilder('none', 199 | 'avg') 200 | # Term will be subtracted from probability 201 | self.cov_penalty = penalty_builder.coverage_penalty() 202 | # Probability will be divided by this 203 | self.length_penalty = penalty_builder.length_penalty() 204 | 205 | def score(self, beam, logprobs): 206 | """ 207 | Rescores a prediction based on penalty functions 208 | """ 209 | normalized_probs = self.length_penalty(beam, 210 | logprobs, 211 | self.alpha) 212 | if not beam.stepwise_penalty: 213 | penalty = self.cov_penalty(beam, 214 | beam.global_state["coverage"], 215 | self.beta) 216 | normalized_probs -= penalty 217 | 218 | return normalized_probs 219 | 220 | def update_score(self, beam, attn): 221 | """ 222 | Function to update scores of a Beam that is not finished 223 | """ 224 | if "prev_penalty" in list(beam.global_state.keys()): 225 | beam.scores.add_(beam.global_state["prev_penalty"]) 226 | penalty = self.cov_penalty(beam, 227 | beam.global_state["coverage"] + attn, 228 | self.beta) 229 | beam.scores.sub_(penalty) 230 | 231 | def update_global_state(self, beam): 232 | "Keeps the coverage vector as sum of attentions" 233 | if len(beam.prev_ks) == 1: 234 | beam.global_state["prev_penalty"] = beam.scores.clone().fill_(0.0) 235 | beam.global_state["coverage"] = beam.attn[-1] 236 | self.cov_total = beam.attn[-1].sum(1) 237 | else: 238 | self.cov_total += torch.min(beam.attn[-1], 239 | beam.global_state['coverage']).sum(1) 240 | beam.global_state["coverage"] = beam.global_state["coverage"] \ 241 | .index_select(0, beam.prev_ks[-1]).add(beam.attn[-1]) 242 | 243 | prev_penalty = self.cov_penalty(beam, 244 | beam.global_state["coverage"], 245 | self.beta) 246 | beam.global_state["prev_penalty"] = prev_penalty -------------------------------------------------------------------------------- /codes/model/decoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | from torch import nn 4 | from torch.nn import functional as F 5 | from .basic_layers import ConvNorm, LinearNorm, ForwardAttentionV2, Prenet 6 | from .utils import get_mask_from_lengths 7 | 8 | 9 | class Decoder(nn.Module): 10 | def __init__(self, hparams): 11 | super(Decoder, self).__init__() 12 | self.n_mel_channels = hparams.n_mel_channels 13 | self.n_frames_per_step = hparams.n_frames_per_step_decoder 14 | #self.hidden_cat_dim = hparams.encoder_embedding_dim + hparams.speaker_embedding_dim 15 | self.hidden_cat_dim = hparams.encoder_embedding_dim + 128 16 | self.attention_rnn_dim = hparams.attention_rnn_dim 17 | self.decoder_rnn_dim = hparams.decoder_rnn_dim 18 | self.prenet_dim = hparams.prenet_dim 19 | self.max_decoder_steps = hparams.max_decoder_steps 20 | self.stop_threshold = hparams.stop_threshold 21 | self.feed_back_last = hparams.feed_back_last 22 | 23 | if hparams.feed_back_last: 24 | prenet_input_dim = hparams.n_mel_channels 25 | else: 26 | prenet_input_dim = hparams.n_mel_channels * hparams.n_frames_per_step_decoder 27 | 28 | self.prenet = Prenet( 29 | prenet_input_dim , 30 | hparams.prenet_dim) 31 | 32 | self.attention_rnn = nn.LSTMCell( 33 | hparams.prenet_dim[-1] + self.hidden_cat_dim, 34 | hparams.attention_rnn_dim) 35 | 36 | self.attention_layer = ForwardAttentionV2( 37 | hparams.attention_rnn_dim, 38 | self.hidden_cat_dim, 39 | hparams.attention_dim, hparams.attention_location_n_filters, 40 | hparams.attention_location_kernel_size) 41 | 42 | self.decoder_rnn = nn.LSTMCell( 43 | self.hidden_cat_dim + 
hparams.attention_rnn_dim, 44 | hparams.decoder_rnn_dim) 45 | 46 | self.linear_projection = LinearNorm( 47 | self.hidden_cat_dim + hparams.decoder_rnn_dim, 48 | hparams.n_mel_channels * hparams.n_frames_per_step_decoder) 49 | 50 | self.stop_layer = LinearNorm( 51 | self.hidden_cat_dim + hparams.decoder_rnn_dim, 1, 52 | bias=True, w_init_gain='sigmoid') 53 | 54 | def get_go_frame(self, memory): 55 | """ Gets all zeros frames to use as first decoder input 56 | PARAMS 57 | ------ 58 | memory: decoder outputs 59 | RETURNS 60 | ------- 61 | decoder_input: all zeros frames 62 | """ 63 | B = memory.size(0) 64 | if self.feed_back_last: 65 | input_dim = self.n_mel_channels 66 | else: 67 | input_dim = self.n_mel_channels * self.n_frames_per_step 68 | 69 | decoder_input = Variable(memory.data.new( 70 | B, input_dim).zero_()) 71 | return decoder_input 72 | 73 | def initialize_decoder_states(self, memory, mask): 74 | """ Initializes attention rnn states, decoder rnn states, attention 75 | weights, attention cumulative weights, attention context, stores memory 76 | and stores processed memory 77 | PARAMS 78 | ------ 79 | memory: Encoder outputs 80 | mask: Mask for padded data if training, expects None for inference 81 | """ 82 | B = memory.size(0) 83 | MAX_TIME = memory.size(1) 84 | 85 | self.attention_hidden = Variable(memory.data.new( 86 | B, self.attention_rnn_dim).zero_()) 87 | self.attention_cell = Variable(memory.data.new( 88 | B, self.attention_rnn_dim).zero_()) 89 | 90 | self.decoder_hidden = Variable(memory.data.new( 91 | B, self.decoder_rnn_dim).zero_()) 92 | self.decoder_cell = Variable(memory.data.new( 93 | B, self.decoder_rnn_dim).zero_()) 94 | 95 | self.attention_weights = Variable(memory.data.new( 96 | B, MAX_TIME).zero_()) 97 | self.attention_weights_cum = Variable(memory.data.new( 98 | B, MAX_TIME).zero_()) 99 | self.attention_context = Variable(memory.data.new( 100 | B, self.hidden_cat_dim).zero_()) 101 | 102 | self.log_alpha = Variable(memory.data.new(B, MAX_TIME).fill_(-float(1e20))) 103 | self.log_alpha[:, 0].fill_(0.) 104 | 105 | self.memory = memory 106 | self.processed_memory = self.attention_layer.memory_layer(memory) 107 | self.mask = mask 108 | 109 | def parse_decoder_inputs(self, decoder_inputs): 110 | """ Prepares decoder inputs, i.e. mel outputs 111 | PARAMS 112 | ------ 113 | decoder_inputs: inputs used for teacher-forced training, i.e. 
mel-specs 114 | RETURNS 115 | ------- 116 | inputs: processed decoder inputs 117 | """ 118 | # (B, n_mel_channels, T_out) -> (B, T_out, n_mel_channels) 119 | decoder_inputs = decoder_inputs.transpose(1, 2) 120 | decoder_inputs = decoder_inputs.reshape( 121 | decoder_inputs.size(0), 122 | int(decoder_inputs.size(1)/self.n_frames_per_step), -1) 123 | # (B, T_out, n_mel_channels) -> (T_out, B, n_mel_channels) 124 | decoder_inputs = decoder_inputs.transpose(0, 1) 125 | if self.feed_back_last: 126 | decoder_inputs = decoder_inputs[:,:,-self.n_mel_channels:] 127 | 128 | return decoder_inputs 129 | 130 | def parse_decoder_outputs(self, mel_outputs, stop_outputs, alignments): 131 | """ Prepares decoder outputs for output 132 | PARAMS 133 | ------ 134 | mel_outputs: 135 | stop_outputs: stop output energies 136 | alignments: 137 | RETURNS 138 | ------- 139 | mel_outputs: 140 | stop_outpust: stop output energies 141 | alignments: 142 | """ 143 | # (T_out, B, MAX_TIME) -> (B, T_out, MAX_TIME) 144 | alignments = torch.stack(alignments).transpose(0, 1) 145 | # (T_out, B) -> (B, T_out) 146 | if alignments.size(0) == 1: 147 | stop_outputs = torch.stack(stop_outputs).unsqueeze(0) 148 | else: 149 | stop_outputs = torch.stack(stop_outputs).transpose(0, 1) 150 | stop_outputs = stop_outputs.contiguous() 151 | # (T_out, B, n_mel_channels) -> (B, T_out, n_mel_channels) 152 | mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous() 153 | # decouple frames per step 154 | mel_outputs = mel_outputs.view( 155 | mel_outputs.size(0), -1, self.n_mel_channels) 156 | # (B, T_out, n_mel_channels) -> (B, n_mel_channels, T_out) 157 | mel_outputs = mel_outputs.transpose(1, 2) 158 | 159 | return mel_outputs, stop_outputs, alignments 160 | 161 | def attend(self, decoder_input): 162 | cell_input = torch.cat((decoder_input, self.attention_context), -1) 163 | self.attention_hidden, self.attention_cell = self.attention_rnn( 164 | cell_input, (self.attention_hidden, self.attention_cell)) 165 | 166 | attention_weights_cat = torch.cat( 167 | (self.attention_weights.unsqueeze(1), 168 | self.attention_weights_cum.unsqueeze(1)), dim=1) 169 | 170 | self.attention_context, self.attention_weights, self.log_alpha = self.attention_layer( 171 | self.attention_hidden, self.memory, self.processed_memory, 172 | attention_weights_cat, self.mask, self.log_alpha) 173 | 174 | self.attention_weights_cum += self.attention_weights 175 | 176 | decoder_rnn_input = torch.cat( 177 | (self.attention_hidden, self.attention_context), -1) 178 | 179 | return decoder_rnn_input, self.attention_context, self.attention_weights 180 | 181 | def decode(self, decoder_input): 182 | 183 | self.decoder_hidden, self.decoder_cell = self.decoder_rnn( 184 | decoder_input, (self.decoder_hidden, self.decoder_cell)) 185 | 186 | return self.decoder_hidden 187 | 188 | def forward(self, memory, decoder_inputs, memory_lengths): 189 | """ Decoder forward pass for training 190 | PARAMS 191 | ------ 192 | memory: Encoder outputs [B, encoder_max_time, hidden_dim] 193 | decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs [B, mel_bin, T] 194 | memory_lengths: Encoder output lengths for attention masking. 
[B] 195 | RETURNS 196 | ------- 197 | mel_outputs: mel outputs from the decoder [B, mel_bin, T] 198 | stop_outputs: stop outputs from the decoder [B, T/r] 199 | alignments: sequence of attention weights from the decoder [B, T/r, encoder_max_time] 200 | """ 201 | 202 | decoder_input = self.get_go_frame(memory).unsqueeze(0) 203 | decoder_inputs = self.parse_decoder_inputs(decoder_inputs) 204 | decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0) 205 | decoder_inputs = self.prenet(decoder_inputs) # [T/r + 1, B, prenet_dim ] 206 | 207 | self.initialize_decoder_states( 208 | memory, mask=~get_mask_from_lengths(memory_lengths)) 209 | 210 | mel_outputs, stop_outputs, alignments = [], [], [] 211 | while len(mel_outputs) < decoder_inputs.size(0) - 1: 212 | decoder_input = decoder_inputs[len(mel_outputs)] 213 | 214 | decoder_rnn_input, context, attention_weights = self.attend(decoder_input) 215 | 216 | decoder_rnn_output = self.decode(decoder_rnn_input) 217 | 218 | decoder_hidden_attention_context = torch.cat( 219 | (decoder_rnn_output, context), dim=1) 220 | 221 | mel_output = self.linear_projection(decoder_hidden_attention_context) 222 | stop_output = self.stop_layer(decoder_hidden_attention_context) 223 | 224 | mel_outputs += [mel_output.squeeze(1)] #? perhaps don't need squeeze 225 | stop_outputs += [stop_output.squeeze()] 226 | alignments += [attention_weights] 227 | 228 | mel_outputs, stop_outputs, alignments = self.parse_decoder_outputs( 229 | mel_outputs, stop_outputs, alignments) 230 | 231 | return mel_outputs, stop_outputs, alignments 232 | 233 | def inference(self, memory): 234 | """ Decoder inference 235 | PARAMS 236 | ------ 237 | memory: Encoder outputs 238 | RETURNS 239 | ------- 240 | mel_outputs: mel outputs from the decoder 241 | stop_outputs: stop outputs from the decoder 242 | alignments: sequence of attention weights from the decoder 243 | """ 244 | decoder_input = self.get_go_frame(memory) 245 | 246 | self.initialize_decoder_states(memory, mask=None) 247 | 248 | mel_outputs, stop_outputs, alignments = [], [], [] 249 | while True: 250 | decoder_input = self.prenet(decoder_input) 251 | 252 | decoder_input_final, context, alignment = self.attend(decoder_input) 253 | 254 | #mel_output, stop_output, alignment = self.decode(decoder_input) 255 | decoder_rnn_output = self.decode(decoder_input_final) 256 | decoder_hidden_attention_context = torch.cat( 257 | (decoder_rnn_output, context), dim=1) 258 | 259 | mel_output = self.linear_projection(decoder_hidden_attention_context) 260 | stop_output = self.stop_layer(decoder_hidden_attention_context) 261 | 262 | mel_outputs += [mel_output.squeeze(1)] 263 | stop_outputs += [stop_output] 264 | alignments += [alignment] 265 | 266 | 267 | if torch.sigmoid(stop_output.data) > self.stop_threshold: 268 | break 269 | elif len(mel_outputs) == self.max_decoder_steps: 270 | print("Warning! 
Reached max decoder steps") 271 | break 272 | 273 | if self.feed_back_last: 274 | decoder_input = mel_output[:,-self.n_mel_channels:] 275 | else: 276 | decoder_input = mel_output 277 | 278 | mel_outputs, stop_outputs, alignments = self.parse_decoder_outputs( 279 | mel_outputs, stop_outputs, alignments) 280 | 281 | return mel_outputs, stop_outputs, alignments -------------------------------------------------------------------------------- /codes/model/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.nn import functional as F 4 | from .utils import get_mask_from_lengths 5 | #from train_ser import process_mel, process_post_output, perform_SER 6 | import python_speech_features as ps 7 | 8 | from sklearn.metrics import recall_score as recall 9 | from sklearn.metrics import confusion_matrix as confusion 10 | import numpy as np 11 | from lstm_test import HybridModel 12 | 13 | class ParrotLoss(nn.Module): 14 | def __init__(self, hparams): 15 | super(ParrotLoss, self).__init__() 16 | self.hidden_dim = hparams.encoder_embedding_dim 17 | self.ce_loss = hparams.ce_loss 18 | 19 | self.L1Loss = nn.L1Loss(reduction='none') 20 | self.MSELoss = nn.MSELoss(reduction='none') 21 | self.BCEWithLogitsLoss = nn.BCEWithLogitsLoss(reduction='none') 22 | self.CrossEntropyLoss = nn.CrossEntropyLoss(reduction='none') 23 | self.n_frames_per_step = hparams.n_frames_per_step_decoder 24 | self.eos = hparams.n_symbols 25 | self.predict_spectrogram = hparams.predict_spectrogram 26 | 27 | self.contr_w = hparams.contrastive_loss_w 28 | self.spenc_w = hparams.speaker_encoder_loss_w 29 | self.texcl_w = hparams.text_classifier_loss_w 30 | self.spadv_w = hparams.speaker_adversial_loss_w 31 | self.spcla_w = hparams.speaker_classifier_loss_w 32 | self.serloss_w = hparams.ser_loss_w 33 | self.emoloss_w = hparams.emo_loss_w 34 | 35 | 36 | 37 | 38 | 39 | def perform_SER(self, x, target, device): 40 | def splitIntoChunks(mel_spec,win_size,stride): 41 | mel_spec = mel_spec.T 42 | t = mel_spec.shape[1] 43 | num_of_chunks = int(t/stride) 44 | chunks = [] 45 | for i in range(num_of_chunks): 46 | chunk = mel_spec[:,i*stride:i*stride+win_size] 47 | if chunk.shape[1] == win_size: 48 | chunks.append(chunk) 49 | return np.stack(chunks,axis=0) 50 | 51 | def loss_fnc(predictions, targets): 52 | return nn.CrossEntropyLoss()(input=predictions,target=targets) 53 | 54 | def make_validate_fnc(model,loss_fnc): 55 | def validate(X,Y): 56 | with torch.no_grad(): 57 | model.eval() 58 | output_logits, output_softmax, emotion_embedding = model(X) 59 | predictions = torch.argmax(output_softmax,dim=1) 60 | a = torch.sum(Y==predictions).cpu().detach().numpy() 61 | b = int(len(Y)) 62 | acc = a/b 63 | #accuracy = torch.sum(Y==predictions)/float(len(Y)) 64 | accuracy = torch.tensor(acc,device=device).float() 65 | loss = loss_fnc(output_logits,Y) 66 | return loss.item(), accuracy, predictions, emotion_embedding 67 | return validate 68 | 69 | x = x.cpu().detach().numpy() 70 | x = x.astype(np.float64) 71 | target = target.cpu().detach().numpy() 72 | 73 | 74 | mel_test_chunked = [] 75 | for mel_spec in x: 76 | mel_spec = mel_spec.T 77 | time = mel_spec.shape[0] 78 | if time <= 500: 79 | mel_spec = np.pad(mel_spec, ((0, 500 - time), (0, 0)), 'constant', constant_values=0) 80 | else: 81 | mel_spec = mel_spec[:500,:] 82 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 83 | mel_test_chunked.append(chunks) 84 | 85 | X_test = np.stack(mel_test_chunked,axis=0) 86 | 
X_test = np.expand_dims(X_test,2) 87 | b,t,c,h,w = X_test.shape 88 | X_test = np.reshape(X_test, newshape=(b,-1)) 89 | X_test = np.reshape(X_test, newshape=(b,t,c,h,w)) 90 | 91 | 92 | Y_test = target.reshape(-1) 93 | #Y_test = Y_test.astype('int8') 94 | 95 | #LOAD_PATH = '/home/zhoukun/SER/lstm_ser/models/cnn_attention_lstm_model_64.pt' 96 | LOAD_PATH = '/home/zhoukun/SER/lstm_ser/models/cnn_attention_lstm_model_64_update_best.pt' 97 | model = HybridModel(num_emotions=4).to(device) 98 | model.load_state_dict(torch.load(LOAD_PATH, map_location=torch.device(device))) 99 | validate = make_validate_fnc(model,loss_fnc) 100 | X_test_tensor = torch.tensor(X_test,device=device).float() 101 | Y_test_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 102 | test_loss, test_acc, predictions, emotion_embedding = validate(X_test_tensor,Y_test_tensor) 103 | test_loss = torch.tensor(test_loss, device=device).float() 104 | 105 | return test_loss, test_acc, emotion_embedding 106 | 107 | 108 | 109 | def parse_targets(self, targets, text_lengths): 110 | ''' 111 | text_target [batch_size, text_len] 112 | mel_target [batch_size, mel_bins, T] 113 | spc_target [batch_size, spc_bins, T] 114 | speaker_target [batch_size] 115 | stop_target [batch_size, T] 116 | ''' 117 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = targets 118 | 119 | B = stop_target.size(0) 120 | stop_target = stop_target.reshape(B, -1, self.n_frames_per_step) 121 | stop_target = stop_target[:, :, 0] 122 | 123 | #padded = torch.tensor(text_target.data.new(B,1).zero_()) 124 | padded = text_target.data.new(B,1).zero_().clone().detach() 125 | text_target = torch.cat((text_target, padded), dim=-1) 126 | 127 | # adding the ending token for target 128 | for bid in range(B): 129 | text_target[bid, text_lengths[bid].item()] = self.eos 130 | 131 | return text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding 132 | 133 | def forward(self, model_outputs, targets, input_text, eps=1e-5): 134 | 135 | ''' 136 | predicted_mel [batch_size, mel_bins, T] 137 | predicted_stop [batch_size, T/r] 138 | alignment 139 | when input_text==True [batch_size, T/r, max_text_len] 140 | when input_text==False [batch_size, T/r, T/r] 141 | text_hidden [B, max_text_len, hidden_dim] 142 | mel_hidden [B, max_text_len, hidden_dim] 143 | text_logit_from_mel_hidden [B, max_text_len+1, n_symbols+1] 144 | speaker_logit_from_mel [B, n_speakers] 145 | speaker_logit_from_mel_hidden [B, max_text_len, n_speakers] 146 | text_lengths [B,] 147 | mel_lengths [B,] 148 | ''' 149 | predicted_mel, post_output, predicted_stop, alignments,\ 150 | text_hidden, mel_hidden, text_logit_from_mel_hidden, \ 151 | audio_seq2seq_alignments, \ 152 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, \ 153 | text_lengths, mel_lengths,speaker_embedding = model_outputs 154 | 155 | text_target, mel_target, spc_target, speaker_target, stop_target, strength_embedding = self.parse_targets(targets, text_lengths) 156 | 157 | #perform SER: 158 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 159 | 160 | ser_loss, ser_acc, emotion_embedding = self.perform_SER(post_output,speaker_target,device) 161 | 162 | 163 | 164 | ## get masks ## 165 | mel_mask = get_mask_from_lengths(mel_lengths, mel_target.size(2)).unsqueeze(1).expand(-1, mel_target.size(1), -1).float() 166 | spc_mask = get_mask_from_lengths(mel_lengths, mel_target.size(2)).unsqueeze(1).expand(-1, spc_target.size(1), -1).float() 167 | 168 | mel_step_lengths = 
torch.ceil(mel_lengths.float() / self.n_frames_per_step).long() 169 | stop_mask = get_mask_from_lengths(mel_step_lengths, 170 | int(mel_target.size(2)/self.n_frames_per_step)).float() # [B, T/r] 171 | text_mask = get_mask_from_lengths(text_lengths).float() 172 | text_mask_plus_one = get_mask_from_lengths(text_lengths + 1).float() 173 | 174 | # reconstruction loss # 175 | recon_loss = torch.sum(self.L1Loss(predicted_mel, mel_target) * mel_mask) / torch.sum(mel_mask) 176 | 177 | if self.predict_spectrogram: 178 | recon_loss_post = (self.L1Loss(post_output, spc_target) * spc_mask).sum() / spc_mask.sum() 179 | else: 180 | recon_loss_post = (self.L1Loss(post_output, mel_target) * mel_mask).sum() / torch.sum(mel_mask) 181 | 182 | stop_loss = torch.sum(self.BCEWithLogitsLoss(predicted_stop, stop_target) * stop_mask) / torch.sum(stop_mask) 183 | 184 | 185 | if self.contr_w == 0.: 186 | contrast_loss = torch.tensor(0.).cuda() 187 | else: 188 | # contrastive mask # 189 | contrast_mask1 = get_mask_from_lengths(text_lengths).unsqueeze(2).expand(-1, -1, mel_hidden.size(1)) # [B, text_len] -> [B, text_len, T/r] 190 | contrast_mask2 = get_mask_from_lengths(text_lengths).unsqueeze(1).expand(-1, text_hidden.size(1), -1) # [B, T/r] -> [B, text_len, T/r] 191 | contrast_mask = (contrast_mask1 & contrast_mask2).float() 192 | text_hidden_normed = text_hidden / (torch.norm(text_hidden, dim=2, keepdim=True) + eps) 193 | mel_hidden_normed = mel_hidden / (torch.norm(mel_hidden, dim=2, keepdim=True) + eps) 194 | 195 | # (x - y) ** 2 = x ** 2 + y ** 2 - 2xy 196 | distance_matrix_xx = torch.sum(text_hidden_normed ** 2, dim=2, keepdim=True) #[batch_size, text_len, 1] 197 | distance_matrix_yy = torch.sum(mel_hidden_normed ** 2, dim=2) 198 | distance_matrix_yy = distance_matrix_yy.unsqueeze(1) #[batch_size, 1, text_len] 199 | 200 | #[batch_size, text_len, text_len] 201 | distance_matrix_xy = torch.bmm(text_hidden_normed, torch.transpose(mel_hidden_normed, 1, 2)) 202 | distance_matrix = distance_matrix_xx + distance_matrix_yy - 2 * distance_matrix_xy 203 | 204 | TTEXT = distance_matrix.size(1) 205 | hard_alignments = torch.eye(TTEXT).cuda() 206 | contrast_loss = hard_alignments * distance_matrix + \ 207 | (1. - hard_alignments) * torch.max(1. 
- distance_matrix, torch.zeros_like(distance_matrix)) 208 | 209 | contrast_loss = torch.sum(contrast_loss * contrast_mask) / torch.sum(contrast_mask) 210 | 211 | n_speakers = speaker_logit_from_mel_hidden.size(2) 212 | TTEXT = speaker_logit_from_mel_hidden.size(1) 213 | n_symbols_plus_one = text_logit_from_mel_hidden.size(2) 214 | 215 | # speaker classification loss # 216 | speaker_encoder_loss = nn.CrossEntropyLoss()(speaker_logit_from_mel, speaker_target) 217 | _, predicted_speaker = torch.max(speaker_logit_from_mel,dim=1) 218 | speaker_encoder_acc = ((predicted_speaker == speaker_target).float()).sum() / float(speaker_target.size(0)) 219 | 220 | speaker_logit_flatten = speaker_logit_from_mel_hidden.reshape(-1, n_speakers) # -> [B* TTEXT, n_speakers] 221 | _, predicted_speaker = torch.max(speaker_logit_flatten, dim=1) 222 | speaker_target_flatten = speaker_target.unsqueeze(1).expand(-1, TTEXT).reshape(-1) 223 | speaker_classification_acc = ((predicted_speaker == speaker_target_flatten).float() * text_mask.reshape(-1)).sum() / text_mask.sum() 224 | loss = self.CrossEntropyLoss(speaker_logit_flatten, speaker_target_flatten) 225 | 226 | speaker_classification_loss = torch.sum(loss * text_mask.reshape(-1)) / torch.sum(text_mask) 227 | 228 | # text classification loss # 229 | text_logit_flatten = text_logit_from_mel_hidden.reshape(-1, n_symbols_plus_one) 230 | text_target_flatten = text_target.reshape(-1) 231 | _, predicted_text = torch.max(text_logit_flatten, dim=1) 232 | text_classification_acc = ((predicted_text == text_target_flatten).float()*text_mask_plus_one.reshape(-1)).sum()/text_mask_plus_one.sum() 233 | loss = self.CrossEntropyLoss(text_logit_flatten, text_target_flatten) 234 | text_classification_loss = torch.sum(loss * text_mask_plus_one.reshape(-1)) / torch.sum(text_mask_plus_one) 235 | 236 | # speaker adversival loss # 237 | flatten_target = 1. 
/ n_speakers * torch.ones_like(speaker_logit_flatten) 238 | loss = self.MSELoss(F.softmax(speaker_logit_flatten, dim=1), flatten_target) 239 | mask = text_mask.unsqueeze(2).expand(-1,-1, n_speakers).reshape(-1, n_speakers) 240 | 241 | #new = torch.cat([speaker_embedding,strength_embedding],-1) 242 | #m = torch.nn.Linear(256,128).to('cuda') 243 | #speaker_embedding = m(new) 244 | 245 | # RMSE loss 246 | emotion_embedding_loss = torch.sqrt(torch.mean(self.MSELoss(emotion_embedding, speaker_embedding)) + eps) 247 | 248 | if self.ce_loss: 249 | speaker_adversial_loss = - speaker_classification_loss 250 | else: 251 | speaker_adversial_loss = torch.sum(loss * mask) / torch.sum(mask) 252 | 253 | loss_list = [recon_loss, recon_loss_post, stop_loss, 254 | contrast_loss, speaker_encoder_loss, speaker_classification_loss, 255 | text_classification_loss, speaker_adversial_loss,ser_loss, emotion_embedding_loss] 256 | 257 | acc_list = [speaker_encoder_acc, speaker_classification_acc, text_classification_acc, ser_acc] 258 | 259 | 260 | combined_loss1 = recon_loss + recon_loss_post + stop_loss + self.contr_w * contrast_loss + \ 261 | self.spenc_w * speaker_encoder_loss + self.texcl_w * text_classification_loss + \ 262 | self.spadv_w * speaker_adversial_loss + self.serloss_w * ser_loss + self.emoloss_w * emotion_embedding_loss 263 | 264 | combined_loss2 = self.spcla_w * speaker_classification_loss + self.serloss_w * ser_loss + self.emoloss_w * emotion_embedding_loss 265 | 266 | return loss_list, acc_list, combined_loss1, combined_loss2 267 | 268 | -------------------------------------------------------------------------------- /codes/model/lstm_test.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import matplotlib.pyplot as plt 7 | import torch 8 | import torch.nn as nn 9 | from sklearn.metrics import confusion_matrix 10 | import pickle 11 | 12 | class TimeDistributed(nn.Module): 13 | def __init__(self, module): 14 | super(TimeDistributed, self).__init__() 15 | self.module = module 16 | 17 | def forward(self, x): 18 | 19 | if len(x.size()) <= 2: 20 | return self.module(x) 21 | # squash samples and timesteps into a single axis 22 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 23 | x_reshape = x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 24 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 25 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 26 | else: # (samples,timesteps,inp1,inp2,inp3) 27 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 28 | 29 | y = self.module(x_reshape) 30 | 31 | # we have to reshape Y 32 | if len(x.size()) == 3: 33 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 34 | elif len(x.size()) == 4: 35 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 36 | else: 37 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 38 | y.size(3)) # (samples, timesteps, out1,out2, out3) 39 | return y 40 | 41 | class HybridModel(nn.Module): 42 | def __init__(self,num_emotions): 43 | super().__init__() 44 | # conv block 45 | self.conv2Dblock = nn.Sequential( 46 | # 1. 
conv block 47 | TimeDistributed(nn.Conv2d(in_channels=1, 48 | out_channels=16, 49 | kernel_size=3, 50 | stride=1, 51 | padding=1 52 | )), 53 | TimeDistributed(nn.BatchNorm2d(16)), 54 | TimeDistributed(nn.ReLU()), 55 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 56 | TimeDistributed(nn.Dropout(p=0.3)), 57 | # 2. conv block 58 | TimeDistributed(nn.Conv2d(in_channels=16, 59 | out_channels=32, 60 | kernel_size=3, 61 | stride=1, 62 | padding=1 63 | )), 64 | TimeDistributed(nn.BatchNorm2d(32)), 65 | TimeDistributed(nn.ReLU()), 66 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 67 | TimeDistributed(nn.Dropout(p=0.3)), 68 | # 3. conv block 69 | TimeDistributed(nn.Conv2d(in_channels=32, 70 | out_channels=64, 71 | kernel_size=3, 72 | stride=1, 73 | padding=1 74 | )), 75 | TimeDistributed(nn.BatchNorm2d(64)), 76 | TimeDistributed(nn.ReLU()), 77 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 78 | TimeDistributed(nn.Dropout(p=0.3)) 79 | ) 80 | # LSTM block 81 | hidden_size = 32 82 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 83 | self.dropout_lstm = nn.Dropout(p=0.4) 84 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 85 | # Linear softmax layer 86 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 87 | def forward(self,x): 88 | conv_embedding = self.conv2Dblock(x) 89 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 90 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 91 | lstm_embedding = self.dropout_lstm(lstm_embedding) 92 | # lstm_embedding (batch, time, hidden_size*2) 93 | batch_size,T,_ = lstm_embedding.shape 94 | attention_weights = [None]*T 95 | for t in range(T): 96 | embedding = lstm_embedding[:,t,:] 97 | attention_weights[t] = self.attention_linear(embedding) 98 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 99 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 100 | attention = torch.squeeze(attention, 1) 101 | output_logits = self.out_linear(attention) 102 | output_softmax = nn.functional.softmax(output_logits,dim=1) 103 | return output_logits, output_softmax, attention 104 | -------------------------------------------------------------------------------- /codes/model/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.autograd import Variable 4 | from math import sqrt 5 | from .utils import to_gpu 6 | from .decoder import Decoder 7 | from .layers import SpeakerClassifier, SpeakerEncoder, AudioSeq2seq, TextEncoder, PostNet, MergeNet 8 | from .basic_layers import LinearNorm 9 | 10 | class Parrot(nn.Module): 11 | def __init__(self, hparams): 12 | super(Parrot, self).__init__() 13 | 14 | #print hparams 15 | # plus 16 | self.embedding = nn.Embedding( 17 | hparams.n_symbols + 1, hparams.symbols_embedding_dim) 18 | std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim)) 19 | val = sqrt(3.0) * std 20 | 21 | self.sos = hparams.n_symbols 22 | 23 | self.embedding.weight.data.uniform_(-val, val) 24 | 25 | self.text_encoder = TextEncoder(hparams) 26 | 27 | self.audio_seq2seq = AudioSeq2seq(hparams) 28 | 29 | self.merge_net = MergeNet(hparams) 30 | 31 | self.speaker_encoder = SpeakerEncoder(hparams) 32 | 33 | self.speaker_classifier = SpeakerClassifier(hparams) 34 | 35 | self.decoder = 
Decoder(hparams) 36 | 37 | self.postnet = PostNet(hparams) 38 | 39 | self.spemb_input = hparams.spemb_input 40 | 41 | self.strength_projection = LinearNorm(1,64) 42 | 43 | def grouped_parameters(self,): 44 | 45 | params_group1 = [p for p in self.embedding.parameters()] 46 | params_group1.extend([p for p in self.text_encoder.parameters()]) 47 | params_group1.extend([p for p in self.audio_seq2seq.parameters()]) 48 | 49 | params_group1.extend([p for p in self.speaker_encoder.parameters()]) 50 | params_group1.extend([p for p in self.merge_net.parameters()]) 51 | params_group1.extend([p for p in self.decoder.parameters()]) 52 | params_group1.extend([p for p in self.postnet.parameters()]) 53 | 54 | #for pn, p in self.audio_seq2seq.name_parameters(): 55 | # p.requires_grad = False 56 | 57 | #for pn, p in self.text_encoder.name_parameters(): 58 | # p.requires_grad = False 59 | 60 | #for pn, p in self.decoder.name_parameters(): 61 | # p.requires_grad = False 62 | 63 | return params_group1, [p for p in self.speaker_classifier.parameters()] 64 | 65 | def parse_batch(self, batch): 66 | text_input_padded, mel_padded, spc_padded, speaker_id, \ 67 | text_lengths, mel_lengths, stop_token_padded, strength_embedding = batch 68 | 69 | text_input_padded = to_gpu(text_input_padded).long() 70 | mel_padded = to_gpu(mel_padded).float() 71 | spc_padded = to_gpu(spc_padded).float() 72 | speaker_id = to_gpu(speaker_id).long() 73 | text_lengths = to_gpu(text_lengths).long() 74 | mel_lengths = to_gpu(mel_lengths).long() 75 | stop_token_padded = to_gpu(stop_token_padded).float() 76 | strength_embedding = to_gpu(strength_embedding).float() 77 | 78 | return ((text_input_padded, mel_padded, text_lengths, mel_lengths, strength_embedding), 79 | (text_input_padded, mel_padded, spc_padded, speaker_id, stop_token_padded,strength_embedding)) 80 | 81 | 82 | def forward(self, inputs, input_text): 83 | ''' 84 | text_input_padded [batch_size, max_text_len] 85 | mel_padded [batch_size, mel_bins, max_mel_len] 86 | text_lengths [batch_size] 87 | mel_lengths [batch_size] 88 | 89 | # 90 | predicted_mel [batch_size, mel_bins, T] 91 | predicted_stop [batch_size, T/r] 92 | alignment input_text==True [batch_size, T/r, max_text_len] or input_text==False [batch_size, T/r, T/r] 93 | text_hidden [B, max_text_len, hidden_dim] 94 | mel_hidden [B, T/r, hidden_dim] 95 | spearker_logit_from_mel [B, n_speakers] 96 | speaker_logit_from_mel_hidden [B, T/r, n_speakers] 97 | text_logit_from_mel_hidden [B, T/r, n_symbols] 98 | 99 | ''' 100 | 101 | text_input_padded, mel_padded, text_lengths, mel_lengths, strength_embedding = inputs 102 | 103 | text_input_embedded = self.embedding(text_input_padded.long()).transpose(1, 2) # -> [B, text_embedding_dim, max_text_len] 104 | text_hidden = self.text_encoder(text_input_embedded, text_lengths) # -> [B, max_text_len, hidden_dim] 105 | 106 | B = text_input_padded.size(0) 107 | start_embedding = Variable(text_input_padded.data.new(B,).fill_(self.sos)) 108 | start_embedding = self.embedding(start_embedding) 109 | 110 | # -> [B, n_speakers], [B, speaker_embedding_dim] 111 | speaker_logit_from_mel, speaker_embedding = self.speaker_encoder(mel_padded, mel_lengths) 112 | 113 | if self.spemb_input: 114 | T = mel_padded.size(2) 115 | audio_input = torch.cat([mel_padded, 116 | speaker_embedding.detach().unsqueeze(2).expand(-1, -1, T)], 1) 117 | else: 118 | audio_input = mel_padded 119 | 120 | audio_seq2seq_hidden, audio_seq2seq_logit, audio_seq2seq_alignments = self.audio_seq2seq( 121 | audio_input, mel_lengths, 
text_input_embedded, start_embedding) 122 | audio_seq2seq_hidden= audio_seq2seq_hidden[:,:-1, :] # -> [B, text_len, hidden_dim] 123 | 124 | 125 | speaker_logit_from_mel_hidden = self.speaker_classifier(audio_seq2seq_hidden) # -> [B, text_len, n_speakers] 126 | 127 | if input_text: 128 | hidden = self.merge_net(text_hidden, text_lengths) 129 | else: 130 | hidden = self.merge_net(audio_seq2seq_hidden, text_lengths) 131 | 132 | L = hidden.size(1) 133 | 134 | strength_embedding = self.strength_projection(strength_embedding) 135 | 136 | #project_2 = LinearNorm(128,112).to('cuda') 137 | #speaker_embedding_post = project_2(speaker_embedding) 138 | 139 | output = torch.cat([speaker_embedding,strength_embedding],-1) 140 | 141 | 142 | 143 | hidden = torch.cat([hidden, output.detach().unsqueeze(1).expand(-1, L, -1)], -1) 144 | 145 | predicted_mel, predicted_stop, alignments = self.decoder(hidden, mel_padded, text_lengths) 146 | 147 | post_output = self.postnet(predicted_mel) 148 | 149 | outputs = [predicted_mel, post_output, predicted_stop, alignments, 150 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_logit, audio_seq2seq_alignments, 151 | speaker_logit_from_mel, speaker_logit_from_mel_hidden, 152 | text_lengths, mel_lengths,speaker_embedding] 153 | 154 | return outputs 155 | 156 | 157 | def inference(self, inputs, input_text, mel_reference, beam_width): 158 | ''' 159 | decode the audio sequence from input 160 | inputs x 161 | input_text True or False 162 | mel_reference [1, mel_bins, T] 163 | ''' 164 | text_input_padded, mel_padded, text_lengths, mel_lengths = inputs 165 | text_input_embedded = self.embedding(text_input_padded.long()).transpose(1, 2) 166 | text_hidden = self.text_encoder.inference(text_input_embedded) 167 | 168 | B = text_input_padded.size(0) # B should be 1 169 | start_embedding = Variable(text_input_padded.data.new(B,).fill_(self.sos)) 170 | start_embedding = self.embedding(start_embedding) # [1, embedding_dim] 171 | 172 | #-> [B, text_len+1, hidden_dim] [B, text_len+1, n_symbols] [B, text_len+1, T/r] 173 | speaker_id, speaker_embedding = self.speaker_encoder.inference(mel_reference) 174 | 175 | if self.spemb_input: 176 | T = mel_padded.size(2) 177 | audio_input = torch.cat([mel_padded, 178 | speaker_embedding.detach().unsqueeze(2).expand(-1, -1, T)], 1) 179 | else: 180 | audio_input = mel_padded 181 | 182 | audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments = self.audio_seq2seq.inference_beam( 183 | audio_input, start_embedding, self.embedding, beam_width=beam_width) 184 | audio_seq2seq_hidden= audio_seq2seq_hidden[:,:-1, :] # -> [B, text_len, hidden_dim] 185 | 186 | # -> [B, n_speakers], [B, speaker_embedding_dim] 187 | 188 | if input_text: 189 | hidden = self.merge_net.inference(text_hidden) 190 | else: 191 | hidden = self.merge_net.inference(audio_seq2seq_hidden) 192 | 193 | L = hidden.size(1) 194 | 195 | strength_embedding = self.strength_projection(strength_embedding) 196 | 197 | output = torch.cat([speaker_embedding,strength_embedding],-1) 198 | 199 | hidden = torch.cat([hidden, output.detach().unsqueeze(1).expand(-1, L, -1)], -1) 200 | 201 | predicted_mel, predicted_stop, alignments = self.decoder.inference(hidden) 202 | 203 | post_output = self.postnet(predicted_mel) 204 | 205 | return (predicted_mel, post_output, predicted_stop, alignments, 206 | text_hidden, audio_seq2seq_hidden, audio_seq2seq_phids, audio_seq2seq_alignments, 207 | speaker_id) 208 | 209 | 210 | -------------------------------------------------------------------------------- 
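Note on the intensity conditioning in model.py: `Parrot` projects a one-dimensional emotion-strength score through `strength_projection = LinearNorm(1, 64)`, concatenates it with the utterance-level emotion/speaker embedding, and broadcasts the result over every step of the merged hidden sequence before the decoder. The snippet below is a minimal, self-contained sketch of only that conditioning step; the batch size, sequence length, 512-dim hidden size, and 64-dim emotion embedding are illustrative assumptions (the 64-dim embedding is chosen so the concatenation contributes the extra 128 dims the decoder expects), and `nn.Linear` stands in for the repo's `LinearNorm`.

```Python
import torch
from torch import nn

# Assumed, illustrative sizes only.
B, L = 4, 37                                   # batch size, length of the hidden sequence
emotion_embedding = torch.randn(B, 64)         # utterance-level embedding from the encoder
strength = torch.rand(B, 1)                    # relative-attribute intensity score
hidden = torch.randn(B, L, 512)                # merged text/audio hidden sequence

strength_projection = nn.Linear(1, 64)         # plays the role of LinearNorm(1, 64) in Parrot
conditioning = torch.cat([emotion_embedding, strength_projection(strength)], dim=-1)  # [B, 128]

# Broadcast the conditioning vector over every step of the hidden sequence,
# as Parrot.forward does before calling the decoder.
decoder_input = torch.cat([hidden, conditioning.unsqueeze(1).expand(-1, L, -1)], dim=-1)
print(decoder_input.shape)                     # torch.Size([4, 37, 640])
```

The extra 128 dimensions contributed by `conditioning` correspond to the decoder's `self.hidden_cat_dim = hparams.encoder_embedding_dim + 128` in decoder.py.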
/codes/model/penalties.py: -------------------------------------------------------------------------------- 1 | 2 | import torch 3 | 4 | 5 | class PenaltyBuilder(object): 6 | """ 7 | Returns the Length and Coverage Penalty function for Beam Search. 8 | Args: 9 | length_pen (str): option name of length pen 10 | cov_pen (str): option name of cov pen 11 | """ 12 | 13 | def __init__(self, cov_pen, length_pen): 14 | self.length_pen = length_pen 15 | self.cov_pen = cov_pen 16 | 17 | def coverage_penalty(self): 18 | if self.cov_pen == "wu": 19 | return self.coverage_wu 20 | elif self.cov_pen == "summary": 21 | return self.coverage_summary 22 | else: 23 | return self.coverage_none 24 | 25 | def length_penalty(self): 26 | if self.length_pen == "wu": 27 | return self.length_wu 28 | elif self.length_pen == "avg": 29 | return self.length_average 30 | else: 31 | return self.length_none 32 | 33 | """ 34 | Below are all the different penalty terms implemented so far 35 | """ 36 | 37 | def coverage_wu(self, beam, cov, beta=0.): 38 | """ 39 | NMT coverage re-ranking score from 40 | "Google's Neural Machine Translation System" :cite:`wu2016google`. 41 | """ 42 | penalty = -torch.min(cov, cov.clone().fill_(1.0)).log().sum(1) 43 | return beta * penalty 44 | 45 | def coverage_summary(self, beam, cov, beta=0.): 46 | """ 47 | Our summary penalty. 48 | """ 49 | penalty = torch.max(cov, cov.clone().fill_(1.0)).sum(1) 50 | penalty -= cov.size(1) 51 | return beta * penalty 52 | 53 | def coverage_none(self, beam, cov, beta=0.): 54 | """ 55 | returns zero as penalty 56 | """ 57 | return beam.scores.clone().fill_(0.0) 58 | 59 | def length_wu(self, beam, logprobs, alpha=0.): 60 | """ 61 | NMT length re-ranking score from 62 | "Google's Neural Machine Translation System" :cite:`wu2016google`. 63 | """ 64 | 65 | modifier = (((5 + len(beam.next_ys)) ** alpha) / 66 | ((5 + 1) ** alpha)) 67 | return (logprobs / modifier) 68 | 69 | def length_average(self, beam, logprobs, alpha=0.): 70 | """ 71 | Returns the average probability of tokens in a sequence. 72 | """ 73 | return logprobs / len(beam.next_ys) 74 | 75 | def length_none(self, beam, logprobs, alpha=0., beta=0.): 76 | """ 77 | Returns unmodified scores. 
78 | """ 79 | return logprobs -------------------------------------------------------------------------------- /codes/model/ser.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import pdb 5 | 6 | class acrnn(nn.Module): 7 | def __init__(self, num_classes=4, is_training=True, 8 | L1=128, L2=256, cell_units=128, num_linear=768, 9 | p=10, time_step=800, F1=128, dropout_keep_prob=1): 10 | super(acrnn, self).__init__() 11 | 12 | self.num_classes = num_classes 13 | self.is_training = is_training 14 | self.L1 = L1 15 | self.L2 = L2 16 | self.cell_units = cell_units 17 | self.num_linear = num_linear 18 | self.p = p 19 | self.time_step = time_step 20 | self.F1 = F1 21 | self.dropout_prob = 1 - dropout_keep_prob 22 | 23 | # tf filter : [filter_height, filter_width, in_channels, out_channels] 24 | self.conv1 = nn.Conv2d(3, self.L1, (5, 3), padding=(2, 1)) # [5, 3, 3, 128] 25 | self.conv2 = nn.Conv2d(self.L1, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 26 | self.conv3 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 27 | self.conv4 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 256, 256] 28 | self.conv5 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 29 | self.conv6 = nn.Conv2d(self.L2, self.L2, (5, 3), padding=(2, 1)) # [5, 3, 128, 256] 30 | 31 | self.linear1 = nn.Linear(self.p*self.L2, self.num_linear) # [10*256, 768] 32 | self.bn = nn.BatchNorm1d(self.num_linear) 33 | 34 | #self.linear_em = nn.Linear(self.p, self.L2) 35 | 36 | self.relu = nn.LeakyReLU(0.01) 37 | self.dropout = nn.Dropout2d(p=self.dropout_prob) 38 | 39 | self.rnn = nn.LSTM(input_size=self.num_linear, hidden_size=self.cell_units, 40 | batch_first=True, num_layers=1, bidirectional=True) 41 | 42 | # for attention 43 | self.a_fc1 = nn.Linear(2*self.cell_units, 1) 44 | self.a_fc2 = nn.Linear(1, 1) 45 | self.sigmoid = nn.Sigmoid() 46 | self.softmax = nn.Softmax(dim=1) 47 | 48 | # fully connected layers 49 | self.fc1 = nn.Linear(2*self.cell_units, self.F1) # [2*128, 64] 50 | self.fc2 = nn.Linear(self.F1, self.num_classes) # [num_classes] 51 | 52 | 53 | def forward(self, x): 54 | 55 | layer1 = self.relu(self.conv1(x)) 56 | layer1 = F.max_pool2d(layer1, kernel_size=(2, 4), stride=(2, 4)) # [1,2,4,1], padding = 'valid' 57 | layer1 = self.dropout(layer1) 58 | 59 | layer2 = self.relu(self.conv2(layer1)) 60 | layer2 = self.dropout(layer2) 61 | 62 | layer3 = self.relu(self.conv3(layer2)) 63 | layer3 = self.dropout(layer3) 64 | 65 | layer4 = self.relu(self.conv4(layer3)) 66 | layer4 = self.dropout(layer4) 67 | 68 | layer5 = self.relu(self.conv5(layer4)) 69 | layer5 = self.dropout(layer5) 70 | 71 | layer6 = self.relu(self.conv6(layer5)) 72 | layer6 = self.dropout(layer6) 73 | 74 | # lstm 75 | layer6 = layer6.permute(0, 2, 3, 1) 76 | layer6 = layer6.reshape(-1, self.time_step, self.L2*self.p) # (-1, 150, 256*10) 77 | 78 | layer6 = layer6.reshape(-1, self.L2*self.p) # (1500, 2560) 79 | 80 | linear1 = self.relu(self.bn(self.linear1(layer6))) # [1500, 768] 81 | linear1 = linear1.reshape(-1, self.time_step, self.num_linear) # [10, 150, 768] 82 | em_bed_low = linear1 83 | 84 | outputs1, output_states1 = self.rnn(linear1) # outputs1 : [10, 150, 128] (B,T,D) 85 | 86 | # # attention 87 | v = self.sigmoid(self.a_fc1(outputs1)) # (10, 150, 1) 88 | alphas = self.softmax(self.a_fc2(v).squeeze()) # (B,T) shape, alphas are attention weights (10,800) 89 | gru = 
(alphas.unsqueeze(2) * outputs1).sum(dim=1) # (B,D) (10,256) 90 | 91 | # # fc 92 | fully1 = self.relu(self.fc1(gru)) 93 | em_bed_high = fully1 94 | fully1 = self.dropout(fully1) 95 | Ylogits = self.fc2(fully1) 96 | Ylogits = self.softmax(Ylogits) 97 | 98 | return Ylogits, em_bed_low, em_bed_high 99 | -------------------------------------------------------------------------------- /codes/model/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def gcd(a,b): 6 | a, b = (a, b) if a >=b else (b, a) 7 | if a%b == 0: 8 | return b 9 | else : 10 | return gcd(b,a%b) 11 | 12 | def lcm(a,b): 13 | return a*b//gcd(a,b) 14 | 15 | 16 | if __name__ == "__main__": 17 | print(lcm(3,2)) 18 | 19 | def get_mask_from_lengths(lengths, max_len=None): 20 | if max_len is None: 21 | max_len = torch.max(lengths).item() 22 | ids = torch.arange(0, max_len, out=torch.cuda.LongTensor(max_len)) 23 | #print ids 24 | mask = (ids < lengths.unsqueeze(1)).byte() 25 | return mask 26 | 27 | def to_gpu(x): 28 | x = x.contiguous() 29 | 30 | if torch.cuda.is_available(): 31 | x = x.cuda(non_blocking=True) 32 | return torch.autograd.Variable(x) 33 | 34 | def test_mask(): 35 | lengths = torch.IntTensor([3,5,4]) 36 | print(torch.ceil(lengths.float() / 2)) 37 | 38 | data = torch.FloatTensor(3, 5, 2) # [B, T, D] 39 | data.fill_(1.) 40 | m = get_mask_from_lengths(lengths.cuda(), data.size(1)) 41 | print(m) 42 | m = m.unsqueeze(2).expand(-1,-1,data.size(2)).float() 43 | print(m) 44 | 45 | print(torch.sum(data.cuda() * m) / torch.sum(m)) 46 | 47 | 48 | def test_loss(): 49 | data1 = torch.FloatTensor(3, 5, 2) 50 | data1.fill_(1.) 51 | data2 = torch.FloatTensor(3, 5, 2) 52 | data2.fill_(2.) 53 | data2[0,0,0] = 1000 54 | 55 | l = torch.nn.L1Loss(reduction='none')(data1,data2) 56 | print(l) 57 | 58 | 59 | #if __name__ == '__main__': 60 | # test_mask() -------------------------------------------------------------------------------- /codes/multiproc.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | import sys 4 | import subprocess 5 | 6 | argslist = list(sys.argv)[1:] 7 | num_gpus = torch.cuda.device_count() 8 | argslist.append('--n_gpus={}'.format(num_gpus)) 9 | workers = [] 10 | job_id = time.strftime("%Y_%m_%d-%H%M%S") 11 | argslist.append("--group_name=group_{}".format(job_id)) 12 | 13 | for i in range(num_gpus): 14 | argslist.append('--rank={}'.format(i)) 15 | stdout = None if i == 0 else open("logs/{}_GPU_{}.log".format(job_id, i), 16 | "w") 17 | print(argslist) 18 | p = subprocess.Popen([str(sys.executable)]+argslist, stdout=stdout) 19 | workers.append(p) 20 | argslist = argslist[:-1] 21 | 22 | for p in workers: 23 | p.wait() 24 | -------------------------------------------------------------------------------- /codes/plotting_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | import numpy as np 5 | 6 | 7 | def save_figure_to_numpy(fig): 8 | # save it to a numpy array. 
9 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 10 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 11 | return data 12 | 13 | def plot_alignment(alignment, fn): 14 | # [4, encoder_step, decoder_step] 15 | fig, axes = plt.subplots(2, 2) 16 | for i in range(2): 17 | for j in range(2): 18 | g = axes[i][j].imshow(alignment[i*2+j,:,:].T, 19 | aspect='auto', origin='lower', 20 | interpolation='none') 21 | plt.colorbar(g, ax=axes[i][j]) 22 | 23 | plt.savefig(fn) 24 | plt.close() 25 | return fn 26 | 27 | 28 | def plot_alignment_to_numpy(alignment, info=None): 29 | fig, ax = plt.subplots(figsize=(6, 4)) 30 | im = ax.imshow(alignment, aspect='auto', origin='lower', 31 | interpolation='none') 32 | fig.colorbar(im, ax=ax) 33 | xlabel = 'Decoder timestep' 34 | if info is not None: 35 | xlabel += '\n\n' + info 36 | plt.xlabel(xlabel) 37 | plt.ylabel('Encoder timestep') 38 | plt.tight_layout() 39 | 40 | fig.canvas.draw() 41 | data = save_figure_to_numpy(fig) 42 | plt.close() 43 | return data 44 | 45 | 46 | def plot_spectrogram_to_numpy(spectrogram): 47 | fig, ax = plt.subplots(figsize=(12, 3)) 48 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 49 | interpolation='none') 50 | plt.colorbar(im, ax=ax) 51 | plt.xlabel("Frames") 52 | plt.ylabel("Channels") 53 | plt.tight_layout() 54 | 55 | fig.canvas.draw() 56 | data = save_figure_to_numpy(fig) 57 | plt.close() 58 | return data 59 | 60 | 61 | def plot_gate_outputs_to_numpy(gate_targets, gate_outputs): 62 | fig, ax = plt.subplots(figsize=(12, 3)) 63 | ax.scatter(list(range(len(gate_targets))), gate_targets, alpha=0.5, 64 | color='green', marker='+', s=1, label='target') 65 | ax.scatter(list(range(len(gate_outputs))), gate_outputs, alpha=0.5, 66 | color='red', marker='.', s=1, label='predicted') 67 | 68 | plt.xlabel("Frames (Green target, Red predicted)") 69 | plt.ylabel("Gate State") 70 | plt.tight_layout() 71 | 72 | fig.canvas.draw() 73 | data = save_figure_to_numpy(fig) 74 | plt.close() 75 | return data 76 | -------------------------------------------------------------------------------- /codes/train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import argparse 4 | import math 5 | from numpy import finfo 6 | import numpy as np 7 | 8 | import torch 9 | from distributed import apply_gradient_allreduce 10 | import torch.distributed as dist 11 | from torch.utils.data.distributed import DistributedSampler 12 | from torch.utils.data import DataLoader 13 | 14 | from model import Parrot, ParrotLoss, lcm 15 | from reader import TextMelIDLoader, TextMelIDCollate 16 | from logger import ParrotLogger 17 | from hparams import create_hparams 18 | 19 | 20 | def batchnorm_to_float(module): 21 | """Converts batch norm modules to FP32""" 22 | if isinstance(module, torch.nn.modules.batchnorm._BatchNorm): 23 | module.float() 24 | for child in module.children(): 25 | batchnorm_to_float(child) 26 | return module 27 | 28 | 29 | def reduce_tensor(tensor, n_gpus): 30 | rt = tensor.clone() 31 | dist.all_reduce(rt, op=dist.reduce_op.SUM) 32 | rt /= n_gpus 33 | return rt 34 | 35 | 36 | def init_distributed(hparams, n_gpus, rank, group_name): 37 | assert torch.cuda.is_available(), "Distributed mode requires CUDA." 38 | print("Initializing Distributed") 39 | 40 | # Set cuda device so everything is done on the right GPU. 
41 | torch.cuda.set_device(rank % torch.cuda.device_count()) 42 | 43 | # Initialize distributed communication 44 | dist.init_process_group( 45 | backend=hparams.dist_backend, init_method=hparams.dist_url, 46 | world_size=n_gpus, rank=rank, group_name=group_name) 47 | 48 | print("Done initializing distributed") 49 | 50 | 51 | def prepare_dataloaders(hparams): 52 | # Get data, data loaders and collate function ready 53 | trainset = TextMelIDLoader(hparams.training_list, hparams.mel_mean_std) 54 | valset = TextMelIDLoader(hparams.validation_list, hparams.mel_mean_std) 55 | collate_fn = TextMelIDCollate(lcm(hparams.n_frames_per_step_encoder, 56 | hparams.n_frames_per_step_decoder)) 57 | 58 | train_sampler = DistributedSampler(trainset) \ 59 | if hparams.distributed_run else None 60 | 61 | train_loader = DataLoader(trainset, num_workers=1, shuffle=True, 62 | sampler=train_sampler, 63 | batch_size=hparams.batch_size, pin_memory=False, 64 | drop_last=True, collate_fn=collate_fn) 65 | return train_loader, valset, collate_fn 66 | 67 | 68 | def prepare_directories_and_logger(output_directory, log_directory, rank): 69 | if rank == 0: 70 | if not os.path.isdir(output_directory): 71 | os.makedirs(output_directory) 72 | os.chmod(output_directory, 0o775) 73 | logger = ParrotLogger(os.path.join(output_directory, log_directory)) 74 | else: 75 | logger = None 76 | return logger 77 | 78 | 79 | def load_model(hparams): 80 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 81 | model = Parrot(hparams).to(device) 82 | if hparams.distributed_run: 83 | model = apply_gradient_allreduce(model) 84 | 85 | return model 86 | 87 | 88 | def warm_start_model(checkpoint_path, model): 89 | assert os.path.isfile(checkpoint_path) 90 | print(("Warm starting model from checkpoint '{}'".format(checkpoint_path))) 91 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 92 | new_state_dict = {} 93 | new_state_dict = {} 94 | for k, v in checkpoint_dict['state_dict'].items(): 95 | if k not in ['speaker_encoder.projection2.linear_layer.weight', 96 | 'speaker_encoder.projection2.linear_layer.bias', 97 | 'speaker_classifier.projection.linear_layer.weight', 98 | 'speaker_classifier.projection.linear_layer.bias']: 99 | new_state_dict[k] = v 100 | else: 101 | s = v.size() 102 | if len(s) == 2: 103 | new_state_dict[k] = torch.nn.init.normal_(torch.empty((5, s[1]))) 104 | else: 105 | new_state_dict[k] = torch.nn.init.normal_(torch.empty(5)) 106 | #new_state_dict[k].weight.requires_grad = False 107 | #new_state_dict[k].bias.requires_grad = False 108 | 109 | if 'text_encoder' in k: 110 | new_state_dict[k].requires_grad = False 111 | if 'audio_seq2seq.encoder' in k: 112 | new_state_dict[k].requires_grad = False 113 | 114 | 115 | model.load_state_dict(new_state_dict,strict=False) 116 | #model.load_state_dict(torch.load(checkpoint_path)['state_dict'], strict=False) 117 | return model 118 | 119 | 120 | def load_checkpoint(checkpoint_path, model, optimizer_main, optimizer_sc): 121 | assert os.path.isfile(checkpoint_path) 122 | print(("Loading checkpoint '{}'".format(checkpoint_path))) 123 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 124 | model.load_state_dict(checkpoint_dict['state_dict']) 125 | optimizer_main.load_state_dict(checkpoint_dict['optimizer_main']) 126 | optimizer_sc.load_state_dict(checkpoint_dict['optimizer_sc']) 127 | learning_rate = checkpoint_dict['learning_rate'] 128 | iteration = checkpoint_dict['iteration'] 129 | print(("Loaded checkpoint '{}' from iteration {}" .format( 130 | 
checkpoint_path, iteration))) 131 | return model, optimizer_main, optimizer_sc, learning_rate, iteration 132 | 133 | 134 | def save_checkpoint(model, optimizer_main, optimizer_sc, learning_rate, iteration, filepath): 135 | print(("Saving model and optimizer state at iteration {} to {}".format( 136 | iteration, filepath))) 137 | torch.save({'iteration': iteration, 138 | 'state_dict': model.state_dict(), 139 | 'optimizer_main': optimizer_main.state_dict(), 140 | 'optimizer_sc': optimizer_sc.state_dict(), 141 | 'learning_rate': learning_rate}, filepath) 142 | 143 | 144 | def validate(model, criterion, valset, iteration, batch_size, n_gpus, 145 | collate_fn, logger, distributed_run, rank): 146 | """Handles all the validation scoring and printing""" 147 | model.eval() 148 | with torch.no_grad(): 149 | val_sampler = DistributedSampler(valset) if distributed_run else None 150 | val_loader = DataLoader(valset, sampler=val_sampler, num_workers=1, 151 | shuffle=False, batch_size=batch_size, 152 | drop_last=True, 153 | pin_memory=False, collate_fn=collate_fn) 154 | 155 | val_loss_tts, val_loss_vc = 0.0, 0.0 156 | reduced_val_tts_losses, reduced_val_vc_losses = np.zeros([8], dtype=np.float32), np.zeros([8], dtype=np.float32) 157 | reduced_val_tts_acces, reduced_val_vc_acces = np.zeros([3], dtype=np.float32), np.zeros([3], dtype=np.float32) 158 | 159 | for i, batch in enumerate(val_loader): 160 | 161 | x, y = model.parse_batch(batch) 162 | 163 | if i%2 == 0: 164 | y_pred = model(x, True) 165 | else: 166 | y_pred = model(x, False) 167 | 168 | losses, acces, l_main, l_sc = criterion(y_pred, y, False) 169 | if distributed_run: 170 | reduced_val_losses = [] 171 | reduced_val_acces = [] 172 | 173 | for l in losses: 174 | reduced_val_losses.append(reduce_tensor(l.data, n_gpus).item()) 175 | for a in acces: 176 | reduced_val_acces.append(reduce_tensor(a.data, n_gpus).item()) 177 | 178 | l_main = reduce_tensor(l_main.data, n_gpus).item() 179 | l_sc = reduce_tensor(l_sc.data, n_gpus).item() 180 | else: 181 | reduced_val_losses = [l.item() for l in losses] 182 | reduced_val_acces = [a.item() for a in acces] 183 | l_main = l_main.item() 184 | l_sc = l_sc.item() 185 | 186 | if i%2 == 0: 187 | val_loss_tts += l_main + l_sc 188 | y_tts = y 189 | y_tts_pred = y_pred 190 | reduced_val_tts_losses += np.array(reduced_val_losses) 191 | reduced_val_tts_acces += np.array(reduced_val_acces) 192 | else: 193 | val_loss_vc += l_main + l_sc 194 | y_vc = y 195 | y_vc_pred = y_pred 196 | reduced_val_vc_losses += np.array(reduced_val_losses) 197 | reduced_val_vc_acces += np.array(reduced_val_acces) 198 | 199 | if i % 2 == 0: 200 | num_tts = i / 2 + 1 201 | num_vc = i / 2 202 | else: 203 | num_tts = (i + 1) / 2 204 | num_vc = (i + 1) / 2 205 | 206 | val_loss_tts = val_loss_tts / num_tts 207 | val_loss_vc = val_loss_vc / num_vc 208 | reduced_val_tts_acces = reduced_val_tts_acces / num_tts 209 | reduced_val_vc_acces = reduced_val_vc_acces / num_vc 210 | reduced_val_tts_losses = reduced_val_tts_losses / num_tts 211 | reduced_val_vc_losses = reduced_val_vc_losses / num_vc 212 | 213 | model.train() 214 | if rank == 0: 215 | print(("Validation loss {}: TTS {:9f} VC {:9f}".format(iteration, val_loss_tts, val_loss_vc))) 216 | logger.log_validation(val_loss_tts, reduced_val_tts_losses, reduced_val_tts_acces, model, y_tts, y_tts_pred, iteration, 'tts') 217 | logger.log_validation(val_loss_vc, reduced_val_vc_losses, reduced_val_vc_acces, model, y_vc, y_vc_pred, iteration, 'vc') 218 | 219 | 220 | 221 | def train(output_directory, 
log_directory, checkpoint_path, warm_start, n_gpus, 222 | rank, group_name, hparams): 223 | 224 | """Training and validation logging results to tensorboard and stdout 225 | Params 226 | ------ 227 | output_directory (string): directory to save checkpoints 228 | log_directory (string) directory to save tensorboard logs 229 | checkpoint_path(string): checkpoint path 230 | n_gpus (int): number of gpus 231 | rank (int): rank of current gpu 232 | hparams (object): comma separated list of "name=value" pairs. 233 | """ 234 | 235 | if hparams.distributed_run: 236 | init_distributed(hparams, n_gpus, rank, group_name) 237 | 238 | torch.manual_seed(hparams.seed) 239 | torch.cuda.manual_seed(hparams.seed) 240 | 241 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 242 | model = load_model(hparams) 243 | model = model.to(device) 244 | 245 | learning_rate = hparams.learning_rate 246 | 247 | parameters_main, parameters_sc = model.grouped_parameters() 248 | 249 | optimizer_main = torch.optim.Adam(parameters_main, lr=learning_rate, 250 | weight_decay=hparams.weight_decay) 251 | optimizer_sc = torch.optim.Adam(parameters_sc, lr=learning_rate, 252 | weight_decay=hparams.weight_decay) 253 | 254 | if hparams.distributed_run: 255 | model = apply_gradient_allreduce(model) 256 | 257 | criterion = ParrotLoss(hparams).cuda() 258 | 259 | logger = prepare_directories_and_logger( 260 | output_directory, log_directory, rank) 261 | 262 | train_loader, valset, collate_fn = prepare_dataloaders(hparams) 263 | 264 | # Load checkpoint if one exists 265 | iteration = 0 266 | epoch_offset = 0 267 | if checkpoint_path is not None: 268 | if warm_start: 269 | model = warm_start_model(checkpoint_path, model) 270 | else: 271 | model, optimizer_main, optimizer_sc, _learning_rate, iteration = load_checkpoint( 272 | checkpoint_path, model, optimizer_main, optimizer_sc) 273 | if hparams.use_saved_learning_rate: 274 | learning_rate = _learning_rate 275 | iteration += 1 # next iteration is iteration + 1 276 | epoch_offset = max(0, int(iteration / len(train_loader))) 277 | 278 | model.train() 279 | # ================ MAIN TRAINNIG LOOP! 
=================== 280 | for epoch in range(epoch_offset, hparams.epochs): 281 | print(("Epoch: {}".format(epoch))) 282 | if epoch > hparams.warmup: 283 | learning_rate = hparams.learning_rate * hparams.decay_rate ** ((epoch - hparams.warmup) // hparams.decay_every + 1) 284 | 285 | for i, batch in enumerate(train_loader): 286 | 287 | start = time.time() 288 | 289 | for param_group in optimizer_main.param_groups: 290 | param_group['lr'] = learning_rate 291 | 292 | for param_group in optimizer_sc.param_groups: 293 | param_group['lr'] = learning_rate 294 | 295 | 296 | 297 | model.zero_grad() 298 | x, y = model.parse_batch(batch) 299 | 300 | if i % 2 == 0: 301 | y_pred = model(x, True) 302 | losses, acces, l_main, l_sc = criterion(y_pred, y, True) 303 | else: 304 | y_pred = model(x, False) 305 | losses, acces, l_main, l_sc = criterion(y_pred, y, False) 306 | 307 | if hparams.distributed_run: 308 | reduced_losses = [] 309 | for l in losses: 310 | reduced_losses.append(reduce_tensor(l.data, n_gpus).item()) 311 | reduced_acces = [] 312 | for a in acces: 313 | reduced_acces.append(reduce_tensor(a.data, n_gpus).item()) 314 | redl_main = reduce_tensor(l_main.data, n_gpus).item() 315 | redl_sc = reduce_tensor(l_sc.data, n_gpus).item() 316 | else: 317 | reduced_losses = [l.item() for l in losses] 318 | reduced_acces = [a.item() for a in acces] 319 | redl_main = l_main.item() 320 | redl_sc = l_sc.item() 321 | 322 | for p in parameters_sc: 323 | p.requires_grad_(requires_grad=False) 324 | 325 | l_main.backward(retain_graph=True) 326 | grad_norm_main = torch.nn.utils.clip_grad_norm_( 327 | parameters_main, hparams.grad_clip_thresh) 328 | 329 | optimizer_main.step() 330 | 331 | for p in parameters_sc: 332 | p.requires_grad_(requires_grad=True) 333 | for p in parameters_main: 334 | p.requires_grad_(requires_grad=False) 335 | 336 | 337 | l_sc.backward() 338 | grad_norm_sc = torch.nn.utils.clip_grad_norm_( 339 | parameters_sc, hparams.grad_clip_thresh) 340 | 341 | 342 | optimizer_sc.step() 343 | 344 | for p in parameters_main: 345 | p.requires_grad_(requires_grad=True) 346 | 347 | if not math.isnan(redl_main) and rank == 0: 348 | 349 | duration = time.time() - start 350 | task = 'TTS' if i%2 == 0 else 'VC' 351 | print(("Train {} {} {:.6f} Grad Norm {:.6f} {:.2f}s/it".format( 352 | task, iteration, redl_main+redl_sc, grad_norm_main, duration))) 353 | logger.log_training( 354 | redl_main+redl_sc, reduced_losses, reduced_acces, grad_norm_main, learning_rate, duration, iteration) 355 | 356 | if (iteration % hparams.iters_per_checkpoint == 0): 357 | validate(model, criterion, valset, iteration, 358 | hparams.batch_size, n_gpus, collate_fn, logger, 359 | hparams.distributed_run, rank) 360 | if rank == 0: 361 | checkpoint_path = os.path.join( 362 | output_directory, "checkpoint_{}".format(iteration)) 363 | save_checkpoint(model, optimizer_main, optimizer_sc, learning_rate, iteration, 364 | checkpoint_path) 365 | 366 | iteration += 1 367 | 368 | 369 | if __name__ == '__main__': 370 | 371 | parser = argparse.ArgumentParser() 372 | parser.add_argument('-o', '--output_directory', type=str, 373 | help='directory to save checkpoints') 374 | parser.add_argument('-l', '--log_directory', type=str, 375 | help='directory to save tensorboard logs') 376 | parser.add_argument('-c', '--checkpoint_path', type=str, default=None, 377 | required=False, help='checkpoint path') 378 | parser.add_argument('--warm_start', action='store_true', 379 | help='load the model only (warm start)') 380 | parser.add_argument('--n_gpus', type=int, 
default=1, 381 | required=False, help='number of gpus') 382 | parser.add_argument('--rank', type=int, default=0, 383 | required=False, help='rank of current gpu') 384 | parser.add_argument('--group_name', type=str, default='group_name', 385 | required=False, help='Distributed group name') 386 | parser.add_argument('--hparams', type=str, 387 | required=False, help='comma separated name=value pairs') 388 | 389 | args = parser.parse_args() 390 | hparams = create_hparams(args.hparams) 391 | 392 | torch.backends.cudnn.enabled = hparams.cudnn_enabled 393 | torch.backends.cudnn.benchmark = hparams.cudnn_benchmark 394 | 395 | print(("Distributed Run:", hparams.distributed_run)) 396 | print(("cuDNN Enabled:", hparams.cudnn_enabled)) 397 | print(("cuDNN Benchmark:", hparams.cudnn_benchmark)) 398 | 399 | train(args.output_directory, args.log_directory, args.checkpoint_path, 400 | args.warm_start, args.n_gpus, args.rank, args.group_name, hparams) -------------------------------------------------------------------------------- /stage3_update.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KunZhou9646/Emovox/972db246eb7b63e5eaf532d46710af3c9ec11eba/stage3_update.png -------------------------------------------------------------------------------- /train_ser.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | import librosa 5 | import librosa.display 6 | import IPython 7 | from IPython.display import Audio 8 | from IPython.display import Image 9 | import matplotlib.pyplot as plt 10 | import torch 11 | import torch.nn as nn 12 | from sklearn.preprocessing import StandardScaler 13 | from sklearn.metrics import confusion_matrix 14 | import seaborn as sn 15 | import pickle 16 | from torch.utils.tensorboard import SummaryWriter 17 | 18 | def addAWGN(signal, num_bits=16, augmented_num=2, snr_low=15, snr_high=30): 19 | signal_len = len(signal) 20 | # Generate White Gaussian noise 21 | noise = np.random.normal(size=(augmented_num, signal_len)) 22 | # Normalize signal and noise 23 | norm_constant = 2.0**(num_bits-1) 24 | signal_norm = signal / norm_constant 25 | noise_norm = noise / norm_constant 26 | # Compute signal and noise power 27 | s_power = np.sum(signal_norm ** 2) / signal_len 28 | n_power = np.sum(noise_norm ** 2, axis=1) / signal_len 29 | # Random SNR: Uniform [15, 30] in dB 30 | target_snr = np.random.randint(snr_low, snr_high) 31 | # Compute K (covariance matrix) for each noise 32 | K = np.sqrt((s_power / n_power) * 10 ** (- target_snr / 10)) 33 | K = np.ones((signal_len, augmented_num)) * K 34 | # Generate noisy signal 35 | return signal + K.T * noise 36 | 37 | def getMELspectrogram(audio, sample_rate): 38 | mel_spec = librosa.feature.melspectrogram(y=audio, 39 | sr=sample_rate, 40 | n_fft=2048, 41 | #n_fft=1024, 42 | #win_length = 512, 43 | win_length = 500, 44 | window='hamming', 45 | hop_length = 200, 46 | n_mels=80, 47 | fmax=None 48 | ) 49 | mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max) 50 | return mel_spec_db 51 | 52 | 53 | # BATCH FIRST TimeDistributed layer 54 | class TimeDistributed(nn.Module): 55 | def __init__(self, module): 56 | super(TimeDistributed, self).__init__() 57 | self.module = module 58 | 59 | def forward(self, x): 60 | 61 | if len(x.size()) <= 2: 62 | return self.module(x) 63 | # squash samples and timesteps into a single axis 64 | elif len(x.size()) == 3: # (samples, timesteps, inp1) 65 | x_reshape = 
x.contiguous().view(-1, x.size(2)) # (samples * timesteps, inp1) 66 | elif len(x.size()) == 4: # (samples,timesteps,inp1,inp2) 67 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3)) # (samples*timesteps,inp1,inp2) 68 | else: # (samples,timesteps,inp1,inp2,inp3) 69 | x_reshape = x.contiguous().view(-1, x.size(2), x.size(3), x.size(4)) # (samples*timesteps,inp1,inp2,inp3) 70 | 71 | y = self.module(x_reshape) 72 | 73 | # we have to reshape Y 74 | if len(x.size()) == 3: 75 | y = y.contiguous().view(x.size(0), -1, y.size(1)) # (samples, timesteps, out1) 76 | elif len(x.size()) == 4: 77 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2)) # (samples, timesteps, out1,out2) 78 | else: 79 | y = y.contiguous().view(x.size(0), -1, y.size(1), y.size(2), 80 | y.size(3)) # (samples, timesteps, out1,out2, out3) 81 | return y 82 | 83 | class HybridModel(nn.Module): 84 | def __init__(self,num_emotions): 85 | super().__init__() 86 | # conv block 87 | self.conv2Dblock = nn.Sequential( 88 | # 1. conv block 89 | TimeDistributed(nn.Conv2d(in_channels=1, 90 | out_channels=16, 91 | kernel_size=3, 92 | stride=1, 93 | padding=1 94 | )), 95 | TimeDistributed(nn.BatchNorm2d(16)), 96 | TimeDistributed(nn.ReLU()), 97 | TimeDistributed(nn.MaxPool2d(kernel_size=2, stride=2)), 98 | TimeDistributed(nn.Dropout(p=0.3)), 99 | # 2. conv block 100 | TimeDistributed(nn.Conv2d(in_channels=16, 101 | out_channels=32, 102 | kernel_size=3, 103 | stride=1, 104 | padding=1 105 | )), 106 | TimeDistributed(nn.BatchNorm2d(32)), 107 | TimeDistributed(nn.ReLU()), 108 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 109 | TimeDistributed(nn.Dropout(p=0.3)), 110 | # 3. conv block 111 | TimeDistributed(nn.Conv2d(in_channels=32, 112 | out_channels=64, 113 | kernel_size=3, 114 | stride=1, 115 | padding=1 116 | )), 117 | TimeDistributed(nn.BatchNorm2d(64)), 118 | TimeDistributed(nn.ReLU()), 119 | TimeDistributed(nn.MaxPool2d(kernel_size=4, stride=4)), 120 | TimeDistributed(nn.Dropout(p=0.3)) 121 | ) 122 | # LSTM block 123 | hidden_size = 32 124 | self.lstm = nn.LSTM(input_size=512,hidden_size=hidden_size,bidirectional=True, batch_first=True) 125 | self.dropout_lstm = nn.Dropout(p=0.4) 126 | self.attention_linear = nn.Linear(2*hidden_size,1) # 2*hidden_size for the 2 outputs of bidir LSTM 127 | # Linear softmax layer 128 | self.out_linear = nn.Linear(2*hidden_size,num_emotions) 129 | def forward(self,x): 130 | conv_embedding = self.conv2Dblock(x) 131 | conv_embedding = torch.flatten(conv_embedding, start_dim=2) # do not flatten batch dimension and time 132 | lstm_embedding, (h,c) = self.lstm(conv_embedding) 133 | lstm_embedding = self.dropout_lstm(lstm_embedding) 134 | # lstm_embedding (batch, time, hidden_size*2) 135 | batch_size,T,_ = lstm_embedding.shape 136 | attention_weights = [None]*T 137 | for t in range(T): 138 | embedding = lstm_embedding[:,t,:] 139 | attention_weights[t] = self.attention_linear(embedding) 140 | attention_weights_norm = nn.functional.softmax(torch.stack(attention_weights,-1),dim=-1) 141 | attention = torch.bmm(attention_weights_norm,lstm_embedding) # (Bx1xT)*(B,T,hidden_size*2)=(B,1,2*hidden_size) 142 | attention = torch.squeeze(attention, 1) 143 | output_logits = self.out_linear(attention) 144 | output_softmax = nn.functional.softmax(output_logits,dim=1) 145 | return output_logits, output_softmax 146 | 147 | def splitIntoChunks(mel_spec,win_size,stride): 148 | mel_spec = mel_spec.T 149 | t = mel_spec.shape[1] 150 | num_of_chunks = int(t/stride) 151 | chunks = [] 152 | for i in 
range(num_of_chunks): 153 | chunk = mel_spec[:,i*stride:i*stride+win_size] 154 | if chunk.shape[1] == win_size: 155 | chunks.append(chunk) 156 | return np.stack(chunks,axis=0) 157 | 158 | def loss_fnc(predictions, targets): 159 | return nn.CrossEntropyLoss()(input=predictions,target=targets) 160 | 161 | def make_train_step(model, loss_fnc, optimizer): 162 | def train_step(X,Y): 163 | # set model to train mode 164 | model.train() 165 | # forward pass 166 | output_logits, output_softmax = model(X) 167 | predictions = torch.argmax(output_softmax,dim=1) 168 | accuracy = torch.sum(Y==predictions)/float(len(Y)) 169 | # compute loss 170 | loss = loss_fnc(output_logits, Y) 171 | # compute gradients 172 | loss.backward() 173 | # update parameters and zero gradients 174 | optimizer.step() 175 | optimizer.zero_grad() 176 | return loss.item(), accuracy*100 177 | return train_step 178 | 179 | def make_validate_fnc(model,loss_fnc): 180 | def validate(X,Y): 181 | with torch.no_grad(): 182 | model.eval() 183 | output_logits, output_softmax = model(X) 184 | predictions = torch.argmax(output_softmax,dim=1) 185 | accuracy = torch.sum(Y==predictions)/float(len(Y)) 186 | loss = loss_fnc(output_logits,Y) 187 | return loss.item(), accuracy*100, predictions 188 | return validate 189 | 190 | def load_data(in_dir): 191 | f = open(in_dir,'rb') 192 | train_data_norm, test_data_norm, train_label, test_label = pickle.load(f) 193 | return train_data_norm, test_data_norm, train_label, test_label 194 | 195 | EMOTIONS = {0:'angry', 1:'happy',2:'sad',3:'neutral'} 196 | data_path = '../transformer_ser/data_500_final.pkl' 197 | model_name = 'cnn_attention_lstm_model_64_update.pt' 198 | model_name_1 = 'cnn_attention_lstm_model_64_update_best.pt' 199 | X_train, X_test, Y_train, Y_test = load_data(data_path) 200 | 201 | Y_train = Y_train.reshape(-1) 202 | Y_test = Y_test.reshape(-1) 203 | 204 | Y_train = Y_train.astype('int8') 205 | Y_test = Y_test.astype('int8') 206 | 207 | print(f'X_train:{X_train.shape}, Y_train:{Y_train.shape}', flush=True) 208 | print(f'X_test:{X_test.shape}, Y_test:{Y_test.shape}', flush=True) 209 | 210 | 211 | # get chunks 212 | # train set 213 | mel_train_chunked = [] 214 | for mel_spec in X_train: 215 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 216 | mel_train_chunked.append(chunks) 217 | print("Number of chunks is {}".format(chunks.shape[0])) 218 | # test set 219 | mel_test_chunked = [] 220 | for mel_spec in X_test: 221 | chunks = splitIntoChunks(mel_spec, win_size=128,stride=64) 222 | mel_test_chunked.append(chunks) 223 | print("Number of chunks is {}".format(chunks.shape[0])) 224 | 225 | 226 | X_train = np.stack(mel_train_chunked,axis=0) 227 | X_train = np.expand_dims(X_train,2) 228 | print('Shape of X_train: ',X_train.shape) 229 | X_test = np.stack(mel_test_chunked,axis=0) 230 | X_test = np.expand_dims(X_test,2) 231 | print('Shape of X_test: ',X_test.shape) 232 | 233 | b,t,c,h,w = X_train.shape 234 | X_train = np.reshape(X_train, newshape=(b,-1)) 235 | X_train = np.reshape(X_train, newshape=(b,t,c,h,w)) 236 | 237 | b,t,c,h,w = X_test.shape 238 | X_test = np.reshape(X_test, newshape=(b,-1)) 239 | X_test = np.reshape(X_test, newshape=(b,t,c,h,w)) 240 | 241 | # X_train = np.expand_dims(X_train,1) #[4257,1,800,80] 242 | # X_test = np.expand_dims(X_test,1) 243 | # 244 | # b,c,h,w = X_train.shape 245 | # X_train = np.reshape(X_train, newshape=(b,-1)) 246 | # #X_train = scaler.fit_transform(X_train) 247 | # X_train = np.reshape(X_train, newshape=(b,c,w,h)) #[4257,1,80,800] 248 | # 249 | # 
b,c,h,w = X_test.shape 250 | # X_test = np.reshape(X_test, newshape=(b,-1)) 251 | # #X_test = scaler.transform(X_test) 252 | # X_test = np.reshape(X_test, newshape=(b,c,w,h)) 253 | SAVE_PATH = os.path.join(os.getcwd(),'models') 254 | os.makedirs('models',exist_ok=True) 255 | 256 | tb = SummaryWriter() 257 | 258 | EPOCHS=200 259 | DATASET_SIZE = X_train.shape[0] 260 | BATCH_SIZE = 32 261 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 262 | print('Selected device is {}'.format(device)) 263 | model = HybridModel(num_emotions=len(EMOTIONS)).to(device) 264 | print('Number of trainable params: ',sum(p.numel() for p in model.parameters())) 265 | OPTIMIZER = torch.optim.SGD(model.parameters(),lr=0.01, weight_decay=1e-3, momentum=0.8) 266 | 267 | train_step = make_train_step(model, loss_fnc, optimizer=OPTIMIZER) 268 | validate = make_validate_fnc(model,loss_fnc) 269 | losses=[] 270 | val_losses = [] 271 | for epoch in range(EPOCHS): 272 | # schuffle data 273 | ind = np.random.permutation(DATASET_SIZE) 274 | X_train = X_train[ind,:,:,:] 275 | Y_train = Y_train[ind] 276 | epoch_acc = 0 277 | epoch_loss = 0 278 | iters = int(DATASET_SIZE / BATCH_SIZE) 279 | for i in range(iters): 280 | batch_start = i * BATCH_SIZE 281 | batch_end = min(batch_start + BATCH_SIZE, DATASET_SIZE) 282 | actual_batch_size = batch_end-batch_start 283 | X = X_train[batch_start:batch_end,:,:,:] 284 | Y = Y_train[batch_start:batch_end] 285 | X_tensor = torch.tensor(X,device=device).float() 286 | Y_tensor = torch.tensor(Y, dtype=torch.long,device=device) 287 | loss, acc = train_step(X_tensor,Y_tensor) 288 | epoch_acc += acc*actual_batch_size/DATASET_SIZE 289 | epoch_loss += loss*actual_batch_size/DATASET_SIZE 290 | print(f"\r Epoch {epoch}: iteration {i}/{iters}",end='') 291 | X_val_tensor = torch.tensor(X_test,device=device).float() 292 | Y_val_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 293 | val_loss, val_acc, predictions = validate(X_val_tensor,Y_val_tensor) 294 | losses.append(epoch_loss) 295 | val_losses.append(val_loss) 296 | tb.add_scalar("Training Loss", epoch_loss, epoch) 297 | tb.add_scalar("Training Accuracy", epoch_acc, epoch) 298 | tb.add_scalar("Validation Loss", val_loss, epoch) 299 | tb.add_scalar("Validation Accuracy", val_acc, epoch) 300 | print('') 301 | print(f"Epoch {epoch} --> loss:{epoch_loss:.4f}, acc:{epoch_acc:.2f}%, val_loss:{val_loss:.4f}, val_acc:{val_acc:.2f}%", flush=True) 302 | if val_acc > 96.5: 303 | torch.save(model.state_dict(), os.path.join(SAVE_PATH, model_name_1), _use_new_zipfile_serialization=False) 304 | 305 | tb.flush() 306 | SAVE_PATH = os.path.join(os.getcwd(),'models') 307 | os.makedirs('models',exist_ok=True) 308 | torch.save(model.state_dict(),os.path.join(SAVE_PATH,model_name),_use_new_zipfile_serialization=False) 309 | print('Model is saved to {}'.format(os.path.join(SAVE_PATH,model_name))) 310 | 311 | LOAD_PATH = os.path.join(os.getcwd(),'models') 312 | model = HybridModel(len(EMOTIONS)) 313 | model.load_state_dict(torch.load(os.path.join(LOAD_PATH,model_name))) 314 | print('Model is loaded from {}'.format(os.path.join(LOAD_PATH,model_name))) 315 | 316 | X_test_tensor = torch.tensor(X_test,device=device).float() 317 | Y_test_tensor = torch.tensor(Y_test,dtype=torch.long,device=device) 318 | test_loss, test_acc, predictions = validate(X_test_tensor,Y_test_tensor) 319 | print(f'Test loss is {test_loss:.3f}') 320 | print(f'Test accuracy is {test_acc:.2f}%') 321 | 322 | predictions = predictions.cpu().numpy() 323 | cm = confusion_matrix(Y_test, 
predictions) 324 | names = [EMOTIONS[ind] for ind in range(len(EMOTIONS))] 325 | df_cm = pd.DataFrame(cm, index=names, columns=names) 326 | # plt.figure(figsize=(10,7)) 327 | sn.set(font_scale=1.4) # for label size 328 | sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size 329 | plt.savefig('confusion_matrix.png') # save before show(), otherwise the figure may be cleared and an empty image written 330 | plt.show() 331 | 332 | --------------------------------------------------------------------------------
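For reference, a minimal sketch of how the trained SER model above could be applied to a single utterance, reusing `getMELspectrogram`, `splitIntoChunks`, `HybridModel` and `EMOTIONS` exactly as they are defined in train_ser.py (e.g. by appending this to the end of that script). The wav path and the 16 kHz sample rate are placeholders, the chunking parameters match the ones used during training, and the clip is assumed to be at least 128 mel frames long; this is an illustrative sketch, not part of the released code.

```Python
# Appended after the evaluation code in train_ser.py; relies on the names defined there.
import numpy as np
import torch
import librosa

def predict_emotion(wav_path, model, device, sample_rate=16000):
    """Return the predicted emotion label for one utterance (assumes >= 128 mel frames)."""
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    mel = getMELspectrogram(audio, sr)                        # [80, T]
    chunks = splitIntoChunks(mel.T, win_size=128, stride=64)  # [num_chunks, 80, 128]
    # add batch and channel axes -> [1, num_chunks, 1, 80, 128]
    x = torch.tensor(chunks[np.newaxis, :, np.newaxis, :, :], device=device).float()
    model.eval()
    with torch.no_grad():
        _, softmax = model(x)
    return EMOTIONS[int(torch.argmax(softmax, dim=1).item())]

# Example (hypothetical path):
# print(predict_emotion('0013_000701.wav', model, device))
```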