├── README.md
├── embed_verif.py
├── embeddings
│   ├── AMS_128Dropout.npy
│   └── LM128Dropout.npy
├── figures
│   ├── Spectrogram.png
│   ├── extendedSpectrogram.png
│   ├── reverseSpectrogram.png
│   ├── rocVox_ROC.jpeg
│   ├── rocVox_pairs.jpeg
│   ├── roc_ROC.jpeg
│   └── roc_pairs.jpeg
├── logs
│   └── LM
│       ├── LM_Ident_512D.log
│       └── ResNet-20_512D.log
├── prototxt
│   ├── AMS-20_modified.prototxt
│   ├── LogisticMargin.prototxt
│   ├── LogisticMargin_solver.prototxt
│   ├── ResNet-20.prototxt
│   └── ResNet-20_solver.prototxt
├── result
│   └── LogisticMargin
│       └── 512D
│           └── Ident
│               └── LM_512D_30_iter_60000.caffemodel
├── roc.py
├── roc_vox.py
├── sample.wav
├── sample_reverse.wav
├── test_ident.py
└── train_aug.py

/README.md:
--------------------------------------------------------------------------------
# Unified Hypersphere Embedding for Speaker Recognition
By Mahdi Hajibabaei and Dengxin Dai

### Introduction

This repository contains the code and instructions needed to replicate the experiments of the paper [Unified Hypersphere Embedding for Speaker Recognition](https://arxiv.org/abs/1807.08312).

Note: In late 2018, the collectors of the dataset changed its structure, which is no longer compatible with the parse_list function. If you wish to use the pipeline with the newly structured dataset, write a function that creates *_set and *_label for training, validation and testing.

In this work, we first train a ResNet-20 with the typical softmax and cross-entropy loss function and then fine-tune the network with more discriminative loss functions such as A-Softmax, AM-Softmax and logistic margin.

### Requirements
1. The [*VoxCeleb*](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) dataset and the lists of the [dataset split](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/iden_split.txt) and [verification evaluation pairs](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test.txt)
2. [Sphereface's Caffe build](https://github.com/wy1iu/sphereface/tree/master/tools/caffe-sphereface)
3. [AM-Softmax's Caffe build](https://github.com/happynear/caffe-windows/tree/504d8a85f552e988fabff88b026f2c31cb778329)
4. [Anaconda](https://anaconda.org/anaconda/python) or a similar distribution that includes NumPy, SciPy and scikit-learn.

### Setup

1. Request the audio files from *Nagrani et al.* and extract the wav files to a directory that will be referred to as *base_address*.

2. Follow the instructions to *make* Caffe and Pycaffe for each of the aforementioned builds.
Add Pycaffe's path to the *PYTHONPATH* environment variable by copying the following line into your .bashrc:

    export PYTHONPATH={PATH_TO_CAFFE}/python

3. Clone this repository.

### An Important Note on Repetition and Time-reversion

Usually, when training models for visual object recognition, we horizontally flip images before feeding them to the model. Horizontally flipping an image does not change its label; it increases the number of independent training examples and can improve the generalization power of the resulting model. Furthermore, during inference the features of the flipped and original images can be averaged to improve prediction accuracy. Similarly to object recognition, we horizontally flip the spectrogram of each recording (i.e., reverse it in time) with a probability of 50% during both training and testing. You can observe the spectrograms of a recording before (top) and after (bottom) time-reversion in the figures below.

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/Spectrogram.png)

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/reverseSpectrogram.png)

Furthermore, time-reversion and repetition are different from adding environmental noise or convolving the recordings with a room impulse response, and they can be used in addition to those methods.
**In order to use repetition and time-reversion along with the addition of noise and room impulse responses, append the time-reverse of the recording to itself BEFORE feeding it to the pipeline that adds the environmental noise and room impulse response to the recording, as sketched below.**
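For illustration, a minimal sketch of that ordering (plain NumPy on a 1-D waveform; `add_noise_and_rir` stands in for whatever external augmentation pipeline you use and is not part of this repository):

    import numpy as np

    def repeat_and_reverse(signal):
        # Append the time-reversed recording to the recording itself,
        # BEFORE any noise/RIR augmentation is applied.
        return np.append(signal, signal[::-1])

    # augmented = add_noise_and_rir(repeat_and_reverse(signal))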
### Training with augmentation

Note: Since the training, validation and testing splits for verification and identification intersect, you need to specify which task you want to train for by setting *mode* to either 'identification' or 'verification'. If you want to use this pipeline to train on another dataset, please modify the parse_list function so that it appends the address of each sample to *_set and its label to *_label.
Note: If you are not using Sun Grid Engine, set allocated_GPU to the id of the GPU that you wish to use.

1. Comment out the following block of code in *train_aug.py*, which initializes the network's coefficients from *net_weights*, and train the network from scratch with the softmax and cross-entropy loss function by setting the argument of caffe.SGDSolver to "prototxt/ResNet-20_solver.prototxt" and executing the *train_aug.py* script.

    solver = caffe.SGDSolver("prototxt/LogisticMargin_solver.prototxt")
    net_weights='result/ResNet-20_512D_iter_61600.caffemodel'
    print("The network will be initialized with %s" % (net_weights))
    solver.net.copy_from(net_weights)

After training is finished, the trained network's coefficients and the state of the solver are stored in the *result* directory. We will use the network coefficients (e.g. result/ResNet-20_512D_iter_61600.caffemodel) to initialize the network for training with the more discriminative loss functions.

2. Uncomment the aforementioned block of code and make sure *net_weights* is set to the address of the previous caffemodel. Run *train_aug.py* again; this time "LM_512D_30_iter_61600.caffemodel" will be saved to the *result* directory.

3. If you have chosen the identification mode, execute the *test_ident.py* script. Before executing it, set the variable *net_weights* to the caffemodel that you wish to evaluate and *net_prototxt* to the prototxt file describing the structure of the network of interest. Run the script; at the end of execution, the top-1 and top-5 accuracies are printed to the terminal, similar to the message below:

    The top1 accuracy on test set is 0.9447

    The top5 accuracy on test set is 0.9830

Important note: in the recent face recognition literature, the similarity of two face images is evaluated by first projecting each image into an embedding space by feeding it to a CNN; a similarity score is then assigned to each pair of images based on the cosine similarity of their embeddings. As a result, we first need to embed each sample into a relatively low-dimensional embedding space (by executing embed_verif.py), and then we can use the cosine similarity of these embeddings to evaluate the odds of two utterances belonging to the same person, as sketched below.
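A minimal sketch of this scoring step (plain NumPy; the helper name is ours, not a function from this repository):

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity between two embedding vectors; a higher score
        # means higher odds that the two utterances share a speaker.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))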
4. If you wish to evaluate the verification accuracy of a model trained for the task of verification, you first need to extract the embeddings of the utterances within the test set. To do so, open *embed_verif.py* and set *net_weights* to the caffemodel that you wish to evaluate and *net_prototxt* to the prototxt describing the structure of the network of interest. Remember to set *embedding_file* to a proper name and directory for storing the resulting embeddings. After executing embed_verif.py, the message "Embeddings of test utterances are stored in ..." will be printed to the terminal.

To evaluate the Equal Error Rate (EER) and the minimum of the detection cost function (DCF) on the pairs selected by Nagrani et al., set *embedding_file* in roc_vox.py to the address of the embeddings that you wish to evaluate and execute the script. Two figures will be displayed. The first one shows the separation between positive pairs and negative pairs:

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/rocVox_pairs.jpeg)

The second figure shows the ROC of the embeddings:

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/rocVox_ROC.jpeg)

The EER and the minimum of the detection cost function (DCF) are printed to the console afterwards.

If you wish to evaluate the verification accuracy of any trained model on all possible (11.9 million) pairs within the verification test set, set *embedding_file* in roc.py to the address of the evaluated embeddings and run the script. As with the evaluation on the few pre-selected pairs, two figures are shown:

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/roc_pairs.jpeg)

![picture](https://github.com/MahdiHajibabaei/unified-embedding/blob/master/figures/roc_ROC.jpeg)
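For reference, the two reported quantities can be computed from a vector of pair scores and 0/1 ground-truth labels along the following lines (a sketch using scikit-learn; the cost parameters `p_target=0.01`, `c_miss=1.0` and `c_fa=1.0` are assumptions in the spirit of the VoxCeleb protocol, not values read from this repository):

    import numpy as np
    from sklearn.metrics import roc_curve

    def eer_and_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
        # EER and minimum detection cost from similarity scores and 0/1 labels.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        eer_index = np.nanargmin(np.abs(fnr - fpr))
        eer = (fnr[eer_index] + fpr[eer_index]) / 2.0
        dcf = c_miss * fnr * p_target + c_fa * fpr * (1.0 - p_target)
        return eer, dcf.min()

    # eer, min_dcf = eer_and_min_dcf(scores, labels)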
### Training and/or evaluating without augmentation

If you wish to compare the prediction accuracy and performance of models trained and/or evaluated without the repetition and time-reversion augmentation, replace the following lines:

    extended_signal=np.append(signal,signal)
    beginning=int((len(signal))*np.random.random_sample())
    signal = extended_signal[beginning:beginning+48241]
    if (np.int(np.random.random_sample()*2)==1):
        signal= signal[::-1]

with:

    beginning=int((len(signal)-48241)*np.random.random_sample())
    signal = signal[beginning:beginning+48241]

in train_aug.py if you wish to eliminate augmentation in the training phase, and in embed_verif.py or test_ident.py if you wish to eliminate augmentation when evaluating the verification and identification accuracies, respectively. Keep in mind that 48241 samples represent 3.015 seconds of recording (48240 samples at 16 kHz) plus one extra sample to compensate for the receptive field of the pre-emphasis filter.

### Effect of dropout on verification accuracy

Applying dropout to the penultimate layer of the CNN improved the verification accuracy but deteriorated the identification accuracy. If you wish to apply dropout during training, just uncomment the following lines in the prototxt of the network's structure:

    layer {
      name: "drop6"
      type: "Dropout"
      bottom: "res4_3p"
      top: "res4_3p"
      dropout_param {
        dropout_ratio: 0.5
      }
    }

### Future works

At the time of running the experiments in this work, *VoxCeleb* was the largest publicly available dataset. However, not long after, the much larger [*VoxCeleb2*](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) dataset, with more speakers and more statistically sound evaluations, was released; it would be really interesting to see how much improvement the suggested loss functions and augmentation would yield on it.

There is also an ongoing National Institute of Standards and Technology Speaker Recognition Evaluation (NIST SRE) challenge that lists *VoxCeleb* and *VoxCeleb2* as its official training datasets. It would be interesting to see how much improvement the suggested loss functions and augmentation would bring there as well.

### Citation

If you plan to use the repetition and time-reversion augmentation, please consider citing my paper:

    @article{hajibabaei2018unified,
      title={Unified Hypersphere Embedding for Speaker Recognition},
      author={Hajibabaei, Mahdi and Dai, Dengxin},
      journal={arXiv preprint arXiv:1807.08312},
      year={2018}
    }

And if you plan to use the logistic margin loss function, please cite the original AM-Softmax paper (with the BibTeX given below) along with my paper.

    @article{wang2018additive,
      title={Additive margin softmax for face verification},
      author={Wang, Feng and Cheng, Jian and Liu, Weiyang and Liu, Haijun},
      journal={IEEE Signal Processing Letters},
      volume={25},
      number={7},
      pages={926--930},
      year={2018},
      publisher={IEEE}
    }

--------------------------------------------------------------------------------
/embed_verif.py:
--------------------------------------------------------------------------------
import numpy as np
import caffe
print("Caffe successfully imported!")
import scipy.io.wavfile
import re
import os.path

wav_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/Identification_split.txt'
base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
embedding_file="embeddings/LM_512D.npy"

pre_emphasis = 0.97
frame_size = 0.025
frame_stride = 0.01
NFFT = 512
BATCH_SIZE=50
train_set=[]
train_label=[]
validation_set=[]
validation_label=[]
test_set=[]
test_label=[]
spectrogram_batch=np.empty([BATCH_SIZE,1,300,257],dtype=float) # 257 = (NFFT/2)+1 frequency bins
label_batch=np.empty([BATCH_SIZE,1,1,1],dtype=float)
number_of_crops=50

def crop_inference(batch_index):

    for i in range(0,BATCH_SIZE):
        sample_index=batch_index*BATCH_SIZE+i
        if (sample_index>=len(test_set)):
            continue
        fileName=test_set[sample_index] # Samples are taken from the test set in order
        sample_rate, signal = scipy.io.wavfile.read(fileName)
        extended_signal=np.append(signal,signal)
        beginning=int((len(signal))*np.random.random_sample())
        signal = extended_signal[beginning:beginning+48241] # Number of samples plus one, because we need to apply a pre-emphasis filter with a receptive field of two
        if (np.int(np.random.random_sample()*2)==1):
            signal= signal[::-1]
        signal=signal-np.mean(signal)
        emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

        frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
        signal_length = len(emphasized_signal)
        frame_length = int(round(frame_length))
        frame_step = int(round(frame_step))
        num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame

        pad_signal_length = num_frames * frame_step + frame_length
        z = np.zeros((pad_signal_length - signal_length))
        pad_signal = np.append(emphasized_signal, z) # Pad the signal so that all frames have an equal number of samples without truncating any samples from the original signal

        indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
        frames = pad_signal[indices.astype(np.int32, copy=False)]

        frames *= np.hamming(frame_length)
        mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT

        label_batch[i,0,0,0]=test_label[sample_index]
        spectrogram_batch[i,0,:,:]= (mag_frames - mag_frames.mean(axis=0)) / mag_frames.std(axis=0)


def evaluate_embeddings(net):
    print("Embedding test utterances into the embedding space ...")
    test_set_size=len(test_set)
    for i in range(0,(test_set_size//BATCH_SIZE)+1):
        for j in range(0,number_of_crops):
            crop_inference(i)
            net.blobs['data'].data[...] =spectrogram_batch
            net.blobs['label'].data[...] =label_batch
            net.forward()
            if j==0:
                batchPoolAverage=net.blobs['res4_3p'].data.copy() # copy: the blob's memory is overwritten by the next forward pass
            else:
                batchPoolAverage=batchPoolAverage+net.blobs['res4_3p'].data
        batchPoolAverage=batchPoolAverage/number_of_crops
        net.blobs['res4_3p'].data[...] = batchPoolAverage
        net.forward(start='fc5')#, end='fc7')

        if i==0 :
            embeddings=net.blobs['fc5'].data.copy() # copy, for the same reason as above
        else:
            embeddings=np.append(embeddings,net.blobs['fc5'].data,axis=0)
        print("Batch #%d of utterances is evaluated"% (i))
    embeddings=embeddings[0:test_set_size,:]
    np.save(embedding_file,embeddings)
    print("Embeddings of test utterances are stored in %s" % (embedding_file))




if __name__ == '__main__':

    allocated_GPU= int(os.environ['SGE_GPU'])
    print("The embedding extraction will be executed on GPU #%d" % (allocated_GPU))
    caffe.set_device(allocated_GPU)
    caffe.set_mode_gpu()
    net_weights='result/LM_512D_30_iter_61600.caffemodel'
    net_prototxt='prototxt/LogisticMargin.prototxt'
    net = caffe.Net(net_prototxt,net_weights,caffe.TEST)
    print("The network will be initialized with coefficients from %s" % (net_weights))

    input_file = open(wav_list,'r')
    identity=0
    split_index=1
    for line in input_file:
        parsed_line=re.split(r'[ \n]+', line)
        utterance_address=base_address+parsed_line[1]
        prev_split_index=split_index
        split_index=int(parsed_line[0])
        if (prev_split_index>split_index): # Utterances are listed per speaker; a drop in the split index marks the next identity
            identity=identity+1
        if (os.path.isfile(utterance_address)==False): # The file does not exist
            continue
        elif (parsed_line[1][0]=='E'): # Speakers whose names start with 'E' form the verification test set
            test_set.append(utterance_address)
            test_label.append(identity)
        elif (split_index==2):
            validation_set.append(utterance_address)
            validation_label.append(identity)
        else : # split_index==1 or 3
            train_set.append(utterance_address)
            train_label.append(identity)

    print("The size of training set: %d" % (len(train_set)))
    print("The size of validation set: %d" % (len(validation_set)))
    print("The size of test set: %d" % (len(test_set)))
    print("Number of identities: %d" % (identity+1))
    evaluate_embeddings(net)
--------------------------------------------------------------------------------
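Note: as mentioned in the README, the parsing loop in `__main__` above is incompatible with the post-2018 dataset layout. A minimal sketch of a replacement, assuming lines of the form `<split> <speaker>/<video>/<file>.wav` where the first path component names the speaker (this format, the helper name and the split-to-set mapping are our assumptions, not part of this repository):

    import os.path

    def parse_list_new(wav_list, base_address):
        # Build *_set/*_label lists, assigning each speaker an integer label
        # in order of first appearance; split 1 = train, 2 = validation, 3 = test.
        train_set, train_label = [], []
        validation_set, validation_label = [], []
        test_set, test_label = [], []
        speaker_ids = {}
        with open(wav_list) as split_file:
            for line in split_file:
                if not line.strip():
                    continue
                split_index, rel_path = line.split()
                utterance_address = base_address + rel_path
                if not os.path.isfile(utterance_address):
                    continue
                label = speaker_ids.setdefault(rel_path.split('/')[0], len(speaker_ids))
                if split_index == '1':
                    train_set.append(utterance_address)
                    train_label.append(label)
                elif split_index == '2':
                    validation_set.append(utterance_address)
                    validation_label.append(label)
                else:
                    test_set.append(utterance_address)
                    test_label.append(label)
        return (train_set, train_label, validation_set, validation_label,
                test_set, test_label)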
/embeddings/AMS_128Dropout.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/embeddings/AMS_128Dropout.npy -------------------------------------------------------------------------------- /embeddings/LM128Dropout.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/embeddings/LM128Dropout.npy -------------------------------------------------------------------------------- /figures/Spectrogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/Spectrogram.png -------------------------------------------------------------------------------- /figures/extendedSpectrogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/extendedSpectrogram.png -------------------------------------------------------------------------------- /figures/reverseSpectrogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/reverseSpectrogram.png -------------------------------------------------------------------------------- /figures/rocVox_ROC.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/rocVox_ROC.jpeg -------------------------------------------------------------------------------- /figures/rocVox_pairs.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/rocVox_pairs.jpeg -------------------------------------------------------------------------------- /figures/roc_ROC.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/roc_ROC.jpeg -------------------------------------------------------------------------------- /figures/roc_pairs.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/figures/roc_pairs.jpeg -------------------------------------------------------------------------------- /prototxt/AMS-20_modified.prototxt: -------------------------------------------------------------------------------- 1 | name: "Sphere-20" 2 | 3 | input: "data" 4 | input_dim: 50 5 | input_dim: 1 6 | input_dim: 300 7 | input_dim: 257 8 | 9 | input: "label" 10 | input_dim: 50 11 | input_dim: 1 12 | input_dim: 1 13 | input_dim: 1 14 | 15 | ############## CNN Architecture ############### 16 | layer { 17 | name: "conv1_1" 18 | type: "Convolution" 19 | bottom: "data" 20 | top: "conv1_1" 21 | param { 22 | lr_mult: 1 23 | decay_mult: 1 24 | } 25 | param { 26 | lr_mult: 2 27 | decay_mult: 0 28 | } 29 | convolution_param 
{ 30 | num_output: 64 31 | kernel_size: 3 32 | stride: 2 33 | pad: 1 34 | weight_filler { 35 | type: "xavier" 36 | } 37 | bias_filler { 38 | type: "constant" 39 | value: 0 40 | } 41 | } 42 | } 43 | layer { 44 | name: "relu1_1" 45 | type: "PReLU" 46 | bottom: "conv1_1" 47 | top: "conv1_1" 48 | } 49 | layer { 50 | name: "conv1_2" 51 | type: "Convolution" 52 | bottom: "conv1_1" 53 | top: "conv1_2" 54 | param { 55 | lr_mult: 1 56 | decay_mult: 1 57 | } 58 | param { 59 | lr_mult: 0 60 | decay_mult: 0 61 | } 62 | convolution_param { 63 | num_output: 64 64 | kernel_size: 3 65 | stride: 1 66 | pad: 1 67 | weight_filler { 68 | type: "gaussian" 69 | std: 0.01 70 | } 71 | bias_filler { 72 | type: "constant" 73 | value: 0 74 | } 75 | } 76 | } 77 | layer { 78 | name: "relu1_2" 79 | type: "PReLU" 80 | bottom: "conv1_2" 81 | top: "conv1_2" 82 | } 83 | layer { 84 | name: "conv1_3" 85 | type: "Convolution" 86 | bottom: "conv1_2" 87 | top: "conv1_3" 88 | param { 89 | lr_mult: 1 90 | decay_mult: 1 91 | } 92 | param { 93 | lr_mult: 0 94 | decay_mult: 0 95 | } 96 | convolution_param { 97 | num_output: 64 98 | kernel_size: 3 99 | stride: 1 100 | pad: 1 101 | weight_filler { 102 | type: "gaussian" 103 | std: 0.01 104 | } 105 | bias_filler { 106 | type: "constant" 107 | value: 0 108 | } 109 | } 110 | } 111 | layer { 112 | name: "relu1_3" 113 | type: "PReLU" 114 | bottom: "conv1_3" 115 | top: "conv1_3" 116 | } 117 | layer { 118 | name: "res1_3" 119 | type: "Eltwise" 120 | bottom: "conv1_1" 121 | bottom: "conv1_3" 122 | top: "res1_3" 123 | eltwise_param { 124 | operation: 1 125 | } 126 | } 127 | layer { 128 | name: "conv2_1" 129 | type: "Convolution" 130 | bottom: "res1_3" 131 | top: "conv2_1" 132 | param { 133 | lr_mult: 1 134 | decay_mult: 1 135 | } 136 | param { 137 | lr_mult: 2 138 | decay_mult: 0 139 | } 140 | convolution_param { 141 | num_output: 128 142 | kernel_size: 3 143 | stride: 2 144 | pad: 1 145 | weight_filler { 146 | type: "xavier" 147 | } 148 | bias_filler { 149 | type: "constant" 150 | value: 0 151 | } 152 | } 153 | } 154 | layer { 155 | name: "relu2_1" 156 | type: "PReLU" 157 | bottom: "conv2_1" 158 | top: "conv2_1" 159 | } 160 | layer { 161 | name: "conv2_2" 162 | type: "Convolution" 163 | bottom: "conv2_1" 164 | top: "conv2_2" 165 | param { 166 | lr_mult: 1 167 | decay_mult: 1 168 | } 169 | param { 170 | lr_mult: 0 171 | decay_mult: 0 172 | } 173 | convolution_param { 174 | num_output: 128 175 | kernel_size: 3 176 | stride: 1 177 | pad: 1 178 | weight_filler { 179 | type: "gaussian" 180 | std: 0.01 181 | } 182 | bias_filler { 183 | type: "constant" 184 | value: 0 185 | } 186 | } 187 | } 188 | layer { 189 | name: "relu2_2" 190 | type: "PReLU" 191 | bottom: "conv2_2" 192 | top: "conv2_2" 193 | } 194 | layer { 195 | name: "conv2_3" 196 | type: "Convolution" 197 | bottom: "conv2_2" 198 | top: "conv2_3" 199 | param { 200 | lr_mult: 1 201 | decay_mult: 1 202 | } 203 | param { 204 | lr_mult: 0 205 | decay_mult: 0 206 | } 207 | convolution_param { 208 | num_output: 128 209 | kernel_size: 3 210 | stride: 1 211 | pad: 1 212 | weight_filler { 213 | type: "gaussian" 214 | std: 0.01 215 | } 216 | bias_filler { 217 | type: "constant" 218 | value: 0 219 | } 220 | } 221 | } 222 | layer { 223 | name: "relu2_3" 224 | type: "PReLU" 225 | bottom: "conv2_3" 226 | top: "conv2_3" 227 | } 228 | layer { 229 | name: "res2_3" 230 | type: "Eltwise" 231 | bottom: "conv2_1" 232 | bottom: "conv2_3" 233 | top: "res2_3" 234 | eltwise_param { 235 | operation: 1 236 | } 237 | } 238 | layer { 239 | name: "conv2_4" 240 | type: 
"Convolution" 241 | bottom: "res2_3" 242 | top: "conv2_4" 243 | param { 244 | lr_mult: 1 245 | decay_mult: 1 246 | } 247 | param { 248 | lr_mult: 0 249 | decay_mult: 0 250 | } 251 | convolution_param { 252 | num_output: 128 253 | kernel_size: 3 254 | stride: 1 255 | pad: 1 256 | weight_filler { 257 | type: "gaussian" 258 | std: 0.01 259 | } 260 | bias_filler { 261 | type: "constant" 262 | value: 0 263 | } 264 | } 265 | } 266 | layer { 267 | name: "relu2_4" 268 | type: "PReLU" 269 | bottom: "conv2_4" 270 | top: "conv2_4" 271 | } 272 | layer { 273 | name: "conv2_5" 274 | type: "Convolution" 275 | bottom: "conv2_4" 276 | top: "conv2_5" 277 | param { 278 | lr_mult: 1 279 | decay_mult: 1 280 | } 281 | param { 282 | lr_mult: 0 283 | decay_mult: 0 284 | } 285 | convolution_param { 286 | num_output: 128 287 | kernel_size: 3 288 | stride: 1 289 | pad: 1 290 | weight_filler { 291 | type: "gaussian" 292 | std: 0.01 293 | } 294 | bias_filler { 295 | type: "constant" 296 | value: 0 297 | } 298 | } 299 | } 300 | layer { 301 | name: "relu2_5" 302 | type: "PReLU" 303 | bottom: "conv2_5" 304 | top: "conv2_5" 305 | } 306 | layer { 307 | name: "res2_5" 308 | type: "Eltwise" 309 | bottom: "res2_3" 310 | bottom: "conv2_5" 311 | top: "res2_5" 312 | eltwise_param { 313 | operation: 1 314 | } 315 | } 316 | layer { 317 | name: "conv3_1" 318 | type: "Convolution" 319 | bottom: "res2_5" 320 | top: "conv3_1" 321 | param { 322 | lr_mult: 1 323 | decay_mult: 1 324 | } 325 | param { 326 | lr_mult: 2 327 | decay_mult: 0 328 | } 329 | convolution_param { 330 | num_output: 256 331 | kernel_size: 3 332 | stride: 2 333 | pad: 1 334 | weight_filler { 335 | type: "xavier" 336 | } 337 | bias_filler { 338 | type: "constant" 339 | value: 0 340 | } 341 | } 342 | } 343 | layer { 344 | name: "relu3_1" 345 | type: "PReLU" 346 | bottom: "conv3_1" 347 | top: "conv3_1" 348 | } 349 | layer { 350 | name: "conv3_2" 351 | type: "Convolution" 352 | bottom: "conv3_1" 353 | top: "conv3_2" 354 | param { 355 | lr_mult: 1 356 | decay_mult: 1 357 | } 358 | param { 359 | lr_mult: 0 360 | decay_mult: 0 361 | } 362 | convolution_param { 363 | num_output: 256 364 | kernel_size: 3 365 | stride: 1 366 | pad: 1 367 | weight_filler { 368 | type: "gaussian" 369 | std: 0.01 370 | } 371 | bias_filler { 372 | type: "constant" 373 | value: 0 374 | } 375 | } 376 | } 377 | layer { 378 | name: "relu3_2" 379 | type: "PReLU" 380 | bottom: "conv3_2" 381 | top: "conv3_2" 382 | } 383 | layer { 384 | name: "conv3_3" 385 | type: "Convolution" 386 | bottom: "conv3_2" 387 | top: "conv3_3" 388 | param { 389 | lr_mult: 1 390 | decay_mult: 1 391 | } 392 | param { 393 | lr_mult: 0 394 | decay_mult: 0 395 | } 396 | convolution_param { 397 | num_output: 256 398 | kernel_size: 3 399 | stride: 1 400 | pad: 1 401 | weight_filler { 402 | type: "gaussian" 403 | std: 0.01 404 | } 405 | bias_filler { 406 | type: "constant" 407 | value: 0 408 | } 409 | } 410 | } 411 | layer { 412 | name: "relu3_3" 413 | type: "PReLU" 414 | bottom: "conv3_3" 415 | top: "conv3_3" 416 | } 417 | layer { 418 | name: "res3_3" 419 | type: "Eltwise" 420 | bottom: "conv3_1" 421 | bottom: "conv3_3" 422 | top: "res3_3" 423 | eltwise_param { 424 | operation: 1 425 | } 426 | } 427 | layer { 428 | name: "conv3_4" 429 | type: "Convolution" 430 | bottom: "res3_3" 431 | top: "conv3_4" 432 | param { 433 | lr_mult: 1 434 | decay_mult: 1 435 | } 436 | param { 437 | lr_mult: 0 438 | decay_mult: 0 439 | } 440 | convolution_param { 441 | num_output: 256 442 | kernel_size: 3 443 | stride: 1 444 | pad: 1 445 | weight_filler { 
446 | type: "gaussian" 447 | std: 0.01 448 | } 449 | bias_filler { 450 | type: "constant" 451 | value: 0 452 | } 453 | } 454 | } 455 | layer { 456 | name: "relu3_4" 457 | type: "PReLU" 458 | bottom: "conv3_4" 459 | top: "conv3_4" 460 | } 461 | layer { 462 | name: "conv3_5" 463 | type: "Convolution" 464 | bottom: "conv3_4" 465 | top: "conv3_5" 466 | param { 467 | lr_mult: 1 468 | decay_mult: 1 469 | } 470 | param { 471 | lr_mult: 0 472 | decay_mult: 0 473 | } 474 | convolution_param { 475 | num_output: 256 476 | kernel_size: 3 477 | stride: 1 478 | pad: 1 479 | weight_filler { 480 | type: "gaussian" 481 | std: 0.01 482 | } 483 | bias_filler { 484 | type: "constant" 485 | value: 0 486 | } 487 | } 488 | } 489 | layer { 490 | name: "relu3_5" 491 | type: "PReLU" 492 | bottom: "conv3_5" 493 | top: "conv3_5" 494 | } 495 | layer { 496 | name: "res3_5" 497 | type: "Eltwise" 498 | bottom: "res3_3" 499 | bottom: "conv3_5" 500 | top: "res3_5" 501 | eltwise_param { 502 | operation: 1 503 | } 504 | } 505 | layer { 506 | name: "conv3_6" 507 | type: "Convolution" 508 | bottom: "res3_5" 509 | top: "conv3_6" 510 | param { 511 | lr_mult: 1 512 | decay_mult: 1 513 | } 514 | param { 515 | lr_mult: 0 516 | decay_mult: 0 517 | } 518 | convolution_param { 519 | num_output: 256 520 | kernel_size: 3 521 | stride: 1 522 | pad: 1 523 | weight_filler { 524 | type: "gaussian" 525 | std: 0.01 526 | } 527 | bias_filler { 528 | type: "constant" 529 | value: 0 530 | } 531 | } 532 | } 533 | layer { 534 | name: "relu3_6" 535 | type: "PReLU" 536 | bottom: "conv3_6" 537 | top: "conv3_6" 538 | } 539 | layer { 540 | name: "conv3_7" 541 | type: "Convolution" 542 | bottom: "conv3_6" 543 | top: "conv3_7" 544 | param { 545 | lr_mult: 1 546 | decay_mult: 1 547 | } 548 | param { 549 | lr_mult: 0 550 | decay_mult: 0 551 | } 552 | convolution_param { 553 | num_output: 256 554 | kernel_size: 3 555 | stride: 1 556 | pad: 1 557 | weight_filler { 558 | type: "gaussian" 559 | std: 0.01 560 | } 561 | bias_filler { 562 | type: "constant" 563 | value: 0 564 | } 565 | } 566 | } 567 | layer { 568 | name: "relu3_7" 569 | type: "PReLU" 570 | bottom: "conv3_7" 571 | top: "conv3_7" 572 | } 573 | layer { 574 | name: "res3_7" 575 | type: "Eltwise" 576 | bottom: "res3_5" 577 | bottom: "conv3_7" 578 | top: "res3_7" 579 | eltwise_param { 580 | operation: 1 581 | } 582 | } 583 | layer { 584 | name: "conv3_8" 585 | type: "Convolution" 586 | bottom: "res3_7" 587 | top: "conv3_8" 588 | param { 589 | lr_mult: 1 590 | decay_mult: 1 591 | } 592 | param { 593 | lr_mult: 0 594 | decay_mult: 0 595 | } 596 | convolution_param { 597 | num_output: 256 598 | kernel_size: 3 599 | stride: 1 600 | pad: 1 601 | weight_filler { 602 | type: "gaussian" 603 | std: 0.01 604 | } 605 | bias_filler { 606 | type: "constant" 607 | value: 0 608 | } 609 | } 610 | } 611 | layer { 612 | name: "relu3_8" 613 | type: "PReLU" 614 | bottom: "conv3_8" 615 | top: "conv3_8" 616 | } 617 | layer { 618 | name: "conv3_9" 619 | type: "Convolution" 620 | bottom: "conv3_8" 621 | top: "conv3_9" 622 | param { 623 | lr_mult: 1 624 | decay_mult: 1 625 | } 626 | param { 627 | lr_mult: 0 628 | decay_mult: 0 629 | } 630 | convolution_param { 631 | num_output: 256 632 | kernel_size: 3 633 | stride: 1 634 | pad: 1 635 | weight_filler { 636 | type: "gaussian" 637 | std: 0.01 638 | } 639 | bias_filler { 640 | type: "constant" 641 | value: 0 642 | } 643 | } 644 | } 645 | layer { 646 | name: "relu3_9" 647 | type: "PReLU" 648 | bottom: "conv3_9" 649 | top: "conv3_9" 650 | } 651 | layer { 652 | name: "res3_9" 653 | 
type: "Eltwise" 654 | bottom: "res3_7" 655 | bottom: "conv3_9" 656 | top: "res3_9" 657 | eltwise_param { 658 | operation: 1 659 | } 660 | } 661 | layer { 662 | name: "conv4_1" 663 | type: "Convolution" 664 | bottom: "res3_9" 665 | top: "conv4_1" 666 | param { 667 | lr_mult: 1 668 | decay_mult: 1 669 | } 670 | param { 671 | lr_mult: 2 672 | decay_mult: 0 673 | } 674 | convolution_param { 675 | num_output: 512 676 | kernel_size: 3 677 | stride: 2 678 | pad: 1 679 | weight_filler { 680 | type: "xavier" 681 | } 682 | bias_filler { 683 | type: "constant" 684 | value: 0 685 | } 686 | } 687 | } 688 | layer { 689 | name: "relu4_1" 690 | type: "PReLU" 691 | bottom: "conv4_1" 692 | top: "conv4_1" 693 | } 694 | layer { 695 | name: "conv4_2" 696 | type: "Convolution" 697 | bottom: "conv4_1" 698 | top: "conv4_2" 699 | param { 700 | lr_mult: 1 701 | decay_mult: 1 702 | } 703 | param { 704 | lr_mult: 0 705 | decay_mult: 0 706 | } 707 | convolution_param { 708 | num_output: 512 709 | kernel_size: 3 710 | stride: 1 711 | pad: 1 712 | weight_filler { 713 | type: "gaussian" 714 | std: 0.01 715 | } 716 | bias_filler { 717 | type: "constant" 718 | value: 0 719 | } 720 | } 721 | } 722 | layer { 723 | name: "relu4_2" 724 | type: "PReLU" 725 | bottom: "conv4_2" 726 | top: "conv4_2" 727 | } 728 | layer { 729 | name: "conv4_3" 730 | type: "Convolution" 731 | bottom: "conv4_2" 732 | top: "conv4_3" 733 | param { 734 | lr_mult: 1 735 | decay_mult: 1 736 | } 737 | param { 738 | lr_mult: 0 739 | decay_mult: 0 740 | } 741 | convolution_param { 742 | num_output: 512 743 | kernel_size: 3 744 | stride: 1 745 | pad: 1 746 | weight_filler { 747 | type: "gaussian" 748 | std: 0.01 749 | } 750 | bias_filler { 751 | type: "constant" 752 | value: 0 753 | } 754 | } 755 | } 756 | layer { 757 | name: "relu4_3" 758 | type: "PReLU" 759 | bottom: "conv4_3" 760 | top: "conv4_3" 761 | } 762 | layer { 763 | name: "res4_3" 764 | type: "Eltwise" 765 | bottom: "conv4_1" 766 | bottom: "conv4_3" 767 | top: "res4_3" 768 | eltwise_param { 769 | operation: 1 770 | } 771 | } 772 | 773 | layer { 774 | name: "pool1" 775 | type: "Pooling" 776 | bottom: "res4_3" 777 | top: "res4_3p" 778 | pooling_param { 779 | pool: AVE 780 | kernel_w :1 781 | kernel_h :19 # pool over whole time 782 | } 783 | } 784 | 785 | layer { 786 | name: "fc5" 787 | type: "InnerProduct" 788 | bottom: "res4_3p" 789 | top: "fc5" 790 | param { 791 | lr_mult: 2 792 | decay_mult: 0 793 | } 794 | param { 795 | lr_mult: 1 796 | decay_mult: 0 797 | } 798 | inner_product_param { 799 | num_output: 128 800 | weight_filler { 801 | type: "xavier" 802 | } 803 | bias_filler { 804 | type: "constant" 805 | value: 0 806 | } 807 | } 808 | } 809 | ############### A-Softmax Loss ############## 810 | layer { 811 | name: "norm1" 812 | type: "Normalize" 813 | bottom: "fc5" 814 | top: "norm1" 815 | } 816 | layer { 817 | name: "fc6_12" 818 | type: "InnerProduct" 819 | bottom: "norm1" 820 | top: "fc6" 821 | param { 822 | lr_mult: 1 823 | } 824 | inner_product_param{ 825 | num_output: 1251 826 | normalize: true 827 | weight_filler { 828 | type: "xavier" 829 | } 830 | bias_term: false 831 | } 832 | } 833 | 834 | layer { 835 | name: "fc6_scale" 836 | type: "Scale" 837 | bottom: "fc6" 838 | top: "fc6_scale" 839 | param { 840 | lr_mult: 1 841 | decay_mult: 0 842 | } 843 | scale_param { 844 | filler { 845 | value: 100 846 | } 847 | bias_term: true 848 | } 849 | } 850 | 851 | layer { 852 | name: "label_scale_margin" 853 | type: "LabelSpecificAdd" 854 | bottom: "fc6_scale" 855 | bottom: "label" 856 | top: 
"fc6_scale_margin" 857 | label_specific_add_param { 858 | bias: -50 859 | } 860 | } 861 | 862 | layer { 863 | name: "softmax_loss" 864 | type: "SoftmaxWithLoss" 865 | bottom: "fc6_scale_margin" 866 | bottom: "label" 867 | top: "softmax_loss" 868 | } 869 | 870 | 871 | layer { 872 | name: "top1Accuracy" 873 | type: "Accuracy" 874 | bottom: "fc6" 875 | bottom: "label" 876 | top: "top1Accuracy" 877 | } 878 | 879 | 880 | layer { 881 | name: "top5Accuracy" 882 | type: "Accuracy" 883 | bottom: "fc6" 884 | bottom: "label" 885 | top: "top5Accuracy" 886 | accuracy_param { 887 | top_k: 5 888 | } 889 | 890 | } 891 | 892 | 893 | 894 | -------------------------------------------------------------------------------- /prototxt/LogisticMargin.prototxt: -------------------------------------------------------------------------------- 1 | name: "LogisticMargin" 2 | 3 | input: "data" 4 | input_dim: 50 5 | input_dim: 1 6 | input_dim: 300 7 | input_dim: 257 8 | 9 | input: "label" 10 | input_dim: 50 11 | input_dim: 1 12 | input_dim: 1 13 | input_dim: 1 14 | 15 | ############## CNN Architecture ############### 16 | layer { 17 | name: "conv1_1" 18 | type: "Convolution" 19 | bottom: "data" 20 | top: "conv1_1" 21 | param { 22 | lr_mult: 1 23 | decay_mult: 1 24 | } 25 | param { 26 | lr_mult: 2 27 | decay_mult: 0 28 | } 29 | convolution_param { 30 | num_output: 64 31 | kernel_size: 3 32 | stride: 2 33 | pad: 1 34 | weight_filler { 35 | type: "xavier" 36 | } 37 | bias_filler { 38 | type: "constant" 39 | value: 0 40 | } 41 | } 42 | } 43 | layer { 44 | name: "relu1_1" 45 | type: "PReLU" 46 | bottom: "conv1_1" 47 | top: "conv1_1" 48 | } 49 | layer { 50 | name: "conv1_2" 51 | type: "Convolution" 52 | bottom: "conv1_1" 53 | top: "conv1_2" 54 | param { 55 | lr_mult: 1 56 | decay_mult: 1 57 | } 58 | param { 59 | lr_mult: 0 60 | decay_mult: 0 61 | } 62 | convolution_param { 63 | num_output: 64 64 | kernel_size: 3 65 | stride: 1 66 | pad: 1 67 | weight_filler { 68 | type: "gaussian" 69 | std: 0.01 70 | } 71 | bias_filler { 72 | type: "constant" 73 | value: 0 74 | } 75 | } 76 | } 77 | layer { 78 | name: "relu1_2" 79 | type: "PReLU" 80 | bottom: "conv1_2" 81 | top: "conv1_2" 82 | } 83 | layer { 84 | name: "conv1_3" 85 | type: "Convolution" 86 | bottom: "conv1_2" 87 | top: "conv1_3" 88 | param { 89 | lr_mult: 1 90 | decay_mult: 1 91 | } 92 | param { 93 | lr_mult: 0 94 | decay_mult: 0 95 | } 96 | convolution_param { 97 | num_output: 64 98 | kernel_size: 3 99 | stride: 1 100 | pad: 1 101 | weight_filler { 102 | type: "gaussian" 103 | std: 0.01 104 | } 105 | bias_filler { 106 | type: "constant" 107 | value: 0 108 | } 109 | } 110 | } 111 | layer { 112 | name: "relu1_3" 113 | type: "PReLU" 114 | bottom: "conv1_3" 115 | top: "conv1_3" 116 | } 117 | layer { 118 | name: "res1_3" 119 | type: "Eltwise" 120 | bottom: "conv1_1" 121 | bottom: "conv1_3" 122 | top: "res1_3" 123 | eltwise_param { 124 | operation: 1 125 | } 126 | } 127 | layer { 128 | name: "conv2_1" 129 | type: "Convolution" 130 | bottom: "res1_3" 131 | top: "conv2_1" 132 | param { 133 | lr_mult: 1 134 | decay_mult: 1 135 | } 136 | param { 137 | lr_mult: 2 138 | decay_mult: 0 139 | } 140 | convolution_param { 141 | num_output: 128 142 | kernel_size: 3 143 | stride: 2 144 | pad: 1 145 | weight_filler { 146 | type: "xavier" 147 | } 148 | bias_filler { 149 | type: "constant" 150 | value: 0 151 | } 152 | } 153 | } 154 | layer { 155 | name: "relu2_1" 156 | type: "PReLU" 157 | bottom: "conv2_1" 158 | top: "conv2_1" 159 | } 160 | layer { 161 | name: "conv2_2" 162 | type: "Convolution" 
163 | bottom: "conv2_1" 164 | top: "conv2_2" 165 | param { 166 | lr_mult: 1 167 | decay_mult: 1 168 | } 169 | param { 170 | lr_mult: 0 171 | decay_mult: 0 172 | } 173 | convolution_param { 174 | num_output: 128 175 | kernel_size: 3 176 | stride: 1 177 | pad: 1 178 | weight_filler { 179 | type: "gaussian" 180 | std: 0.01 181 | } 182 | bias_filler { 183 | type: "constant" 184 | value: 0 185 | } 186 | } 187 | } 188 | layer { 189 | name: "relu2_2" 190 | type: "PReLU" 191 | bottom: "conv2_2" 192 | top: "conv2_2" 193 | } 194 | layer { 195 | name: "conv2_3" 196 | type: "Convolution" 197 | bottom: "conv2_2" 198 | top: "conv2_3" 199 | param { 200 | lr_mult: 1 201 | decay_mult: 1 202 | } 203 | param { 204 | lr_mult: 0 205 | decay_mult: 0 206 | } 207 | convolution_param { 208 | num_output: 128 209 | kernel_size: 3 210 | stride: 1 211 | pad: 1 212 | weight_filler { 213 | type: "gaussian" 214 | std: 0.01 215 | } 216 | bias_filler { 217 | type: "constant" 218 | value: 0 219 | } 220 | } 221 | } 222 | layer { 223 | name: "relu2_3" 224 | type: "PReLU" 225 | bottom: "conv2_3" 226 | top: "conv2_3" 227 | } 228 | layer { 229 | name: "res2_3" 230 | type: "Eltwise" 231 | bottom: "conv2_1" 232 | bottom: "conv2_3" 233 | top: "res2_3" 234 | eltwise_param { 235 | operation: 1 236 | } 237 | } 238 | layer { 239 | name: "conv2_4" 240 | type: "Convolution" 241 | bottom: "res2_3" 242 | top: "conv2_4" 243 | param { 244 | lr_mult: 1 245 | decay_mult: 1 246 | } 247 | param { 248 | lr_mult: 0 249 | decay_mult: 0 250 | } 251 | convolution_param { 252 | num_output: 128 253 | kernel_size: 3 254 | stride: 1 255 | pad: 1 256 | weight_filler { 257 | type: "gaussian" 258 | std: 0.01 259 | } 260 | bias_filler { 261 | type: "constant" 262 | value: 0 263 | } 264 | } 265 | } 266 | layer { 267 | name: "relu2_4" 268 | type: "PReLU" 269 | bottom: "conv2_4" 270 | top: "conv2_4" 271 | } 272 | layer { 273 | name: "conv2_5" 274 | type: "Convolution" 275 | bottom: "conv2_4" 276 | top: "conv2_5" 277 | param { 278 | lr_mult: 1 279 | decay_mult: 1 280 | } 281 | param { 282 | lr_mult: 0 283 | decay_mult: 0 284 | } 285 | convolution_param { 286 | num_output: 128 287 | kernel_size: 3 288 | stride: 1 289 | pad: 1 290 | weight_filler { 291 | type: "gaussian" 292 | std: 0.01 293 | } 294 | bias_filler { 295 | type: "constant" 296 | value: 0 297 | } 298 | } 299 | } 300 | layer { 301 | name: "relu2_5" 302 | type: "PReLU" 303 | bottom: "conv2_5" 304 | top: "conv2_5" 305 | } 306 | layer { 307 | name: "res2_5" 308 | type: "Eltwise" 309 | bottom: "res2_3" 310 | bottom: "conv2_5" 311 | top: "res2_5" 312 | eltwise_param { 313 | operation: 1 314 | } 315 | } 316 | layer { 317 | name: "conv3_1" 318 | type: "Convolution" 319 | bottom: "res2_5" 320 | top: "conv3_1" 321 | param { 322 | lr_mult: 1 323 | decay_mult: 1 324 | } 325 | param { 326 | lr_mult: 2 327 | decay_mult: 0 328 | } 329 | convolution_param { 330 | num_output: 256 331 | kernel_size: 3 332 | stride: 2 333 | pad: 1 334 | weight_filler { 335 | type: "xavier" 336 | } 337 | bias_filler { 338 | type: "constant" 339 | value: 0 340 | } 341 | } 342 | } 343 | layer { 344 | name: "relu3_1" 345 | type: "PReLU" 346 | bottom: "conv3_1" 347 | top: "conv3_1" 348 | } 349 | layer { 350 | name: "conv3_2" 351 | type: "Convolution" 352 | bottom: "conv3_1" 353 | top: "conv3_2" 354 | param { 355 | lr_mult: 1 356 | decay_mult: 1 357 | } 358 | param { 359 | lr_mult: 0 360 | decay_mult: 0 361 | } 362 | convolution_param { 363 | num_output: 256 364 | kernel_size: 3 365 | stride: 1 366 | pad: 1 367 | weight_filler { 368 | type: 
"gaussian" 369 | std: 0.01 370 | } 371 | bias_filler { 372 | type: "constant" 373 | value: 0 374 | } 375 | } 376 | } 377 | layer { 378 | name: "relu3_2" 379 | type: "PReLU" 380 | bottom: "conv3_2" 381 | top: "conv3_2" 382 | } 383 | layer { 384 | name: "conv3_3" 385 | type: "Convolution" 386 | bottom: "conv3_2" 387 | top: "conv3_3" 388 | param { 389 | lr_mult: 1 390 | decay_mult: 1 391 | } 392 | param { 393 | lr_mult: 0 394 | decay_mult: 0 395 | } 396 | convolution_param { 397 | num_output: 256 398 | kernel_size: 3 399 | stride: 1 400 | pad: 1 401 | weight_filler { 402 | type: "gaussian" 403 | std: 0.01 404 | } 405 | bias_filler { 406 | type: "constant" 407 | value: 0 408 | } 409 | } 410 | } 411 | layer { 412 | name: "relu3_3" 413 | type: "PReLU" 414 | bottom: "conv3_3" 415 | top: "conv3_3" 416 | } 417 | layer { 418 | name: "res3_3" 419 | type: "Eltwise" 420 | bottom: "conv3_1" 421 | bottom: "conv3_3" 422 | top: "res3_3" 423 | eltwise_param { 424 | operation: 1 425 | } 426 | } 427 | layer { 428 | name: "conv3_4" 429 | type: "Convolution" 430 | bottom: "res3_3" 431 | top: "conv3_4" 432 | param { 433 | lr_mult: 1 434 | decay_mult: 1 435 | } 436 | param { 437 | lr_mult: 0 438 | decay_mult: 0 439 | } 440 | convolution_param { 441 | num_output: 256 442 | kernel_size: 3 443 | stride: 1 444 | pad: 1 445 | weight_filler { 446 | type: "gaussian" 447 | std: 0.01 448 | } 449 | bias_filler { 450 | type: "constant" 451 | value: 0 452 | } 453 | } 454 | } 455 | layer { 456 | name: "relu3_4" 457 | type: "PReLU" 458 | bottom: "conv3_4" 459 | top: "conv3_4" 460 | } 461 | layer { 462 | name: "conv3_5" 463 | type: "Convolution" 464 | bottom: "conv3_4" 465 | top: "conv3_5" 466 | param { 467 | lr_mult: 1 468 | decay_mult: 1 469 | } 470 | param { 471 | lr_mult: 0 472 | decay_mult: 0 473 | } 474 | convolution_param { 475 | num_output: 256 476 | kernel_size: 3 477 | stride: 1 478 | pad: 1 479 | weight_filler { 480 | type: "gaussian" 481 | std: 0.01 482 | } 483 | bias_filler { 484 | type: "constant" 485 | value: 0 486 | } 487 | } 488 | } 489 | layer { 490 | name: "relu3_5" 491 | type: "PReLU" 492 | bottom: "conv3_5" 493 | top: "conv3_5" 494 | } 495 | layer { 496 | name: "res3_5" 497 | type: "Eltwise" 498 | bottom: "res3_3" 499 | bottom: "conv3_5" 500 | top: "res3_5" 501 | eltwise_param { 502 | operation: 1 503 | } 504 | } 505 | layer { 506 | name: "conv3_6" 507 | type: "Convolution" 508 | bottom: "res3_5" 509 | top: "conv3_6" 510 | param { 511 | lr_mult: 1 512 | decay_mult: 1 513 | } 514 | param { 515 | lr_mult: 0 516 | decay_mult: 0 517 | } 518 | convolution_param { 519 | num_output: 256 520 | kernel_size: 3 521 | stride: 1 522 | pad: 1 523 | weight_filler { 524 | type: "gaussian" 525 | std: 0.01 526 | } 527 | bias_filler { 528 | type: "constant" 529 | value: 0 530 | } 531 | } 532 | } 533 | layer { 534 | name: "relu3_6" 535 | type: "PReLU" 536 | bottom: "conv3_6" 537 | top: "conv3_6" 538 | } 539 | layer { 540 | name: "conv3_7" 541 | type: "Convolution" 542 | bottom: "conv3_6" 543 | top: "conv3_7" 544 | param { 545 | lr_mult: 1 546 | decay_mult: 1 547 | } 548 | param { 549 | lr_mult: 0 550 | decay_mult: 0 551 | } 552 | convolution_param { 553 | num_output: 256 554 | kernel_size: 3 555 | stride: 1 556 | pad: 1 557 | weight_filler { 558 | type: "gaussian" 559 | std: 0.01 560 | } 561 | bias_filler { 562 | type: "constant" 563 | value: 0 564 | } 565 | } 566 | } 567 | layer { 568 | name: "relu3_7" 569 | type: "PReLU" 570 | bottom: "conv3_7" 571 | top: "conv3_7" 572 | } 573 | layer { 574 | name: "res3_7" 575 | type: 
"Eltwise" 576 | bottom: "res3_5" 577 | bottom: "conv3_7" 578 | top: "res3_7" 579 | eltwise_param { 580 | operation: 1 581 | } 582 | } 583 | layer { 584 | name: "conv3_8" 585 | type: "Convolution" 586 | bottom: "res3_7" 587 | top: "conv3_8" 588 | param { 589 | lr_mult: 1 590 | decay_mult: 1 591 | } 592 | param { 593 | lr_mult: 0 594 | decay_mult: 0 595 | } 596 | convolution_param { 597 | num_output: 256 598 | kernel_size: 3 599 | stride: 1 600 | pad: 1 601 | weight_filler { 602 | type: "gaussian" 603 | std: 0.01 604 | } 605 | bias_filler { 606 | type: "constant" 607 | value: 0 608 | } 609 | } 610 | } 611 | layer { 612 | name: "relu3_8" 613 | type: "PReLU" 614 | bottom: "conv3_8" 615 | top: "conv3_8" 616 | } 617 | layer { 618 | name: "conv3_9" 619 | type: "Convolution" 620 | bottom: "conv3_8" 621 | top: "conv3_9" 622 | param { 623 | lr_mult: 1 624 | decay_mult: 1 625 | } 626 | param { 627 | lr_mult: 0 628 | decay_mult: 0 629 | } 630 | convolution_param { 631 | num_output: 256 632 | kernel_size: 3 633 | stride: 1 634 | pad: 1 635 | weight_filler { 636 | type: "gaussian" 637 | std: 0.01 638 | } 639 | bias_filler { 640 | type: "constant" 641 | value: 0 642 | } 643 | } 644 | } 645 | layer { 646 | name: "relu3_9" 647 | type: "PReLU" 648 | bottom: "conv3_9" 649 | top: "conv3_9" 650 | } 651 | layer { 652 | name: "res3_9" 653 | type: "Eltwise" 654 | bottom: "res3_7" 655 | bottom: "conv3_9" 656 | top: "res3_9" 657 | eltwise_param { 658 | operation: 1 659 | } 660 | } 661 | layer { 662 | name: "conv4_1" 663 | type: "Convolution" 664 | bottom: "res3_9" 665 | top: "conv4_1" 666 | param { 667 | lr_mult: 1 668 | decay_mult: 1 669 | } 670 | param { 671 | lr_mult: 2 672 | decay_mult: 0 673 | } 674 | convolution_param { 675 | num_output: 512 676 | kernel_size: 3 677 | stride: 2 678 | pad: 1 679 | weight_filler { 680 | type: "xavier" 681 | } 682 | bias_filler { 683 | type: "constant" 684 | value: 0 685 | } 686 | } 687 | } 688 | layer { 689 | name: "relu4_1" 690 | type: "PReLU" 691 | bottom: "conv4_1" 692 | top: "conv4_1" 693 | } 694 | layer { 695 | name: "conv4_2" 696 | type: "Convolution" 697 | bottom: "conv4_1" 698 | top: "conv4_2" 699 | param { 700 | lr_mult: 1 701 | decay_mult: 1 702 | } 703 | param { 704 | lr_mult: 0 705 | decay_mult: 0 706 | } 707 | convolution_param { 708 | num_output: 512 709 | kernel_size: 3 710 | stride: 1 711 | pad: 1 712 | weight_filler { 713 | type: "gaussian" 714 | std: 0.01 715 | } 716 | bias_filler { 717 | type: "constant" 718 | value: 0 719 | } 720 | } 721 | } 722 | layer { 723 | name: "relu4_2" 724 | type: "PReLU" 725 | bottom: "conv4_2" 726 | top: "conv4_2" 727 | } 728 | layer { 729 | name: "conv4_3" 730 | type: "Convolution" 731 | bottom: "conv4_2" 732 | top: "conv4_3" 733 | param { 734 | lr_mult: 1 735 | decay_mult: 1 736 | } 737 | param { 738 | lr_mult: 0 739 | decay_mult: 0 740 | } 741 | convolution_param { 742 | num_output: 512 743 | kernel_size: 3 744 | stride: 1 745 | pad: 1 746 | weight_filler { 747 | type: "gaussian" 748 | std: 0.01 749 | } 750 | bias_filler { 751 | type: "constant" 752 | value: 0 753 | } 754 | } 755 | } 756 | layer { 757 | name: "relu4_3" 758 | type: "PReLU" 759 | bottom: "conv4_3" 760 | top: "conv4_3" 761 | } 762 | layer { 763 | name: "res4_3" 764 | type: "Eltwise" 765 | bottom: "conv4_1" 766 | bottom: "conv4_3" 767 | top: "res4_3" 768 | eltwise_param { 769 | operation: 1 770 | } 771 | } 772 | 773 | layer { 774 | name: "pool1" 775 | type: "Pooling" 776 | bottom: "res4_3" 777 | top: "res4_3p" 778 | pooling_param { 779 | pool: AVE 780 | kernel_w :1 
781 | kernel_h :19 # pool over whole time 782 | } 783 | } 784 | 785 | #layer { 786 | # name: "drop6" 787 | # type: "Dropout" 788 | # bottom: "res4_3p" 789 | # top: "res4_3p" 790 | # dropout_param { 791 | # dropout_ratio: 0.5 792 | # } 793 | #} 794 | 795 | 796 | layer { 797 | name: "fc5" 798 | type: "InnerProduct" 799 | bottom: "res4_3p" 800 | top: "fc5" 801 | param { 802 | lr_mult: 2 803 | decay_mult: 0 804 | } 805 | param { 806 | lr_mult: 1 807 | decay_mult: 0 808 | } 809 | inner_product_param { 810 | num_output: 512 811 | weight_filler { 812 | type: "xavier" 813 | } 814 | bias_filler { 815 | type: "constant" 816 | value: 0 817 | } 818 | } 819 | } 820 | ############### A-Softmax Loss ############## 821 | layer { 822 | name: "norm1" 823 | type: "Normalize" 824 | bottom: "fc5" 825 | top: "norm1" 826 | } 827 | 828 | 829 | layer { 830 | name: "fc6" 831 | type: "InnerProduct" 832 | bottom: "norm1" 833 | top: "fc6" 834 | param { 835 | lr_mult: 1 836 | decay_mult: 1 837 | } 838 | param { 839 | lr_mult: 2 840 | decay_mult: 0 841 | } 842 | inner_product_param { 843 | num_output: 1251 844 | weight_filler { 845 | type: "xavier" 846 | } 847 | bias_filler { 848 | type: "constant" 849 | value: 0 850 | } 851 | } 852 | } 853 | 854 | layer { 855 | name: "fc6_scale" 856 | type: "Scale" 857 | bottom: "fc6" 858 | top: "fc6_scale" 859 | param { 860 | lr_mult: 1 861 | decay_mult: 0 862 | } 863 | scale_param { 864 | filler { 865 | value: 50 866 | } 867 | bias_term: true 868 | } 869 | } 870 | 871 | layer { 872 | name: "label_scale_margin" 873 | type: "LabelSpecificAdd" 874 | bottom: "fc6_scale" 875 | bottom: "label" 876 | top: "fc6_scale_margin" 877 | label_specific_add_param { 878 | bias: -30 879 | } 880 | } 881 | 882 | layer { 883 | name: "softmax_loss" 884 | type: "SoftmaxWithLoss" 885 | bottom: "fc6_scale_margin" 886 | bottom: "label" 887 | top: "softmax_loss" 888 | } 889 | 890 | 891 | layer { 892 | name: "top1Accuracy" 893 | type: "Accuracy" 894 | bottom: "fc6" 895 | bottom: "label" 896 | top: "top1Accuracy" 897 | } 898 | 899 | 900 | layer { 901 | name: "top5Accuracy" 902 | type: "Accuracy" 903 | bottom: "fc6" 904 | bottom: "label" 905 | top: "top5Accuracy" 906 | accuracy_param { 907 | top_k: 5 908 | } 909 | 910 | } 911 | 912 | 913 | 914 | -------------------------------------------------------------------------------- /prototxt/LogisticMargin_solver.prototxt: -------------------------------------------------------------------------------- 1 | 2 | net: "prototxt/LogisticMargin.prototxt" 3 | 4 | base_lr: 0.005 5 | lr_policy: "step" 6 | gamma: 0.75 7 | iter_size: 1 8 | 9 | stepsize: 2800 10 | max_iter: 61600 11 | 12 | display: 100 13 | momentum: 0.93 14 | weight_decay: 0.0005 15 | snapshot: 5600 16 | snapshot_prefix: "result/LM_512D_30" 17 | 18 | solver_mode: GPU 19 | -------------------------------------------------------------------------------- /prototxt/ResNet-20.prototxt: -------------------------------------------------------------------------------- 1 | name: "ResNet-20" 2 | 3 | input: "data" 4 | input_dim: 50 5 | input_dim: 1 6 | input_dim: 300 7 | input_dim: 257 8 | 9 | input: "label" 10 | input_dim: 50 11 | input_dim: 1 12 | input_dim: 1 13 | input_dim: 1 14 | 15 | ############## CNN Architecture ############### 16 | layer { 17 | name: "conv1_1" 18 | type: "Convolution" 19 | bottom: "data" 20 | top: "conv1_1" 21 | param { 22 | lr_mult: 1 23 | decay_mult: 1 24 | } 25 | param { 26 | lr_mult: 2 27 | decay_mult: 0 28 | } 29 | convolution_param { 30 | num_output: 64 31 | kernel_size: 3 32 | stride: 2 33 
| pad: 1 34 | weight_filler { 35 | type: "xavier" 36 | } 37 | bias_filler { 38 | type: "constant" 39 | value: 0 40 | } 41 | } 42 | } 43 | layer { 44 | name: "relu1_1" 45 | type: "PReLU" 46 | bottom: "conv1_1" 47 | top: "conv1_1" 48 | } 49 | layer { 50 | name: "conv1_2" 51 | type: "Convolution" 52 | bottom: "conv1_1" 53 | top: "conv1_2" 54 | param { 55 | lr_mult: 1 56 | decay_mult: 1 57 | } 58 | param { 59 | lr_mult: 0 60 | decay_mult: 0 61 | } 62 | convolution_param { 63 | num_output: 64 64 | kernel_size: 3 65 | stride: 1 66 | pad: 1 67 | weight_filler { 68 | type: "gaussian" 69 | std: 0.01 70 | } 71 | bias_filler { 72 | type: "constant" 73 | value: 0 74 | } 75 | } 76 | } 77 | layer { 78 | name: "relu1_2" 79 | type: "PReLU" 80 | bottom: "conv1_2" 81 | top: "conv1_2" 82 | } 83 | layer { 84 | name: "conv1_3" 85 | type: "Convolution" 86 | bottom: "conv1_2" 87 | top: "conv1_3" 88 | param { 89 | lr_mult: 1 90 | decay_mult: 1 91 | } 92 | param { 93 | lr_mult: 0 94 | decay_mult: 0 95 | } 96 | convolution_param { 97 | num_output: 64 98 | kernel_size: 3 99 | stride: 1 100 | pad: 1 101 | weight_filler { 102 | type: "gaussian" 103 | std: 0.01 104 | } 105 | bias_filler { 106 | type: "constant" 107 | value: 0 108 | } 109 | } 110 | } 111 | layer { 112 | name: "relu1_3" 113 | type: "PReLU" 114 | bottom: "conv1_3" 115 | top: "conv1_3" 116 | } 117 | layer { 118 | name: "res1_3" 119 | type: "Eltwise" 120 | bottom: "conv1_1" 121 | bottom: "conv1_3" 122 | top: "res1_3" 123 | eltwise_param { 124 | operation: 1 125 | } 126 | } 127 | layer { 128 | name: "conv2_1" 129 | type: "Convolution" 130 | bottom: "res1_3" 131 | top: "conv2_1" 132 | param { 133 | lr_mult: 1 134 | decay_mult: 1 135 | } 136 | param { 137 | lr_mult: 2 138 | decay_mult: 0 139 | } 140 | convolution_param { 141 | num_output: 128 142 | kernel_size: 3 143 | stride: 2 144 | pad: 1 145 | weight_filler { 146 | type: "xavier" 147 | } 148 | bias_filler { 149 | type: "constant" 150 | value: 0 151 | } 152 | } 153 | } 154 | layer { 155 | name: "relu2_1" 156 | type: "PReLU" 157 | bottom: "conv2_1" 158 | top: "conv2_1" 159 | } 160 | layer { 161 | name: "conv2_2" 162 | type: "Convolution" 163 | bottom: "conv2_1" 164 | top: "conv2_2" 165 | param { 166 | lr_mult: 1 167 | decay_mult: 1 168 | } 169 | param { 170 | lr_mult: 0 171 | decay_mult: 0 172 | } 173 | convolution_param { 174 | num_output: 128 175 | kernel_size: 3 176 | stride: 1 177 | pad: 1 178 | weight_filler { 179 | type: "gaussian" 180 | std: 0.01 181 | } 182 | bias_filler { 183 | type: "constant" 184 | value: 0 185 | } 186 | } 187 | } 188 | layer { 189 | name: "relu2_2" 190 | type: "PReLU" 191 | bottom: "conv2_2" 192 | top: "conv2_2" 193 | } 194 | layer { 195 | name: "conv2_3" 196 | type: "Convolution" 197 | bottom: "conv2_2" 198 | top: "conv2_3" 199 | param { 200 | lr_mult: 1 201 | decay_mult: 1 202 | } 203 | param { 204 | lr_mult: 0 205 | decay_mult: 0 206 | } 207 | convolution_param { 208 | num_output: 128 209 | kernel_size: 3 210 | stride: 1 211 | pad: 1 212 | weight_filler { 213 | type: "gaussian" 214 | std: 0.01 215 | } 216 | bias_filler { 217 | type: "constant" 218 | value: 0 219 | } 220 | } 221 | } 222 | layer { 223 | name: "relu2_3" 224 | type: "PReLU" 225 | bottom: "conv2_3" 226 | top: "conv2_3" 227 | } 228 | layer { 229 | name: "res2_3" 230 | type: "Eltwise" 231 | bottom: "conv2_1" 232 | bottom: "conv2_3" 233 | top: "res2_3" 234 | eltwise_param { 235 | operation: 1 236 | } 237 | } 238 | layer { 239 | name: "conv2_4" 240 | type: "Convolution" 241 | bottom: "res2_3" 242 | top: "conv2_4" 243 | 
param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 128
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu2_4"
  type: "PReLU"
  bottom: "conv2_4"
  top: "conv2_4"
}
layer {
  name: "conv2_5"
  type: "Convolution"
  bottom: "conv2_4"
  top: "conv2_5"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 128
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu2_5"
  type: "PReLU"
  bottom: "conv2_5"
  top: "conv2_5"
}
layer {
  name: "res2_5"
  type: "Eltwise"
  bottom: "res2_3"
  bottom: "conv2_5"
  top: "res2_5"
  eltwise_param {
    operation: 1
  }
}
layer {
  name: "conv3_1"
  type: "Convolution"
  bottom: "res2_5"
  top: "conv3_1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 2
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_1"
  type: "PReLU"
  bottom: "conv3_1"
  top: "conv3_1"
}
layer {
  name: "conv3_2"
  type: "Convolution"
  bottom: "conv3_1"
  top: "conv3_2"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_2"
  type: "PReLU"
  bottom: "conv3_2"
  top: "conv3_2"
}
layer {
  name: "conv3_3"
  type: "Convolution"
  bottom: "conv3_2"
  top: "conv3_3"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_3"
  type: "PReLU"
  bottom: "conv3_3"
  top: "conv3_3"
}
layer {
  name: "res3_3"
  type: "Eltwise"
  bottom: "conv3_1"
  bottom: "conv3_3"
  top: "res3_3"
  eltwise_param {
    operation: 1
  }
}
layer {
  name: "conv3_4"
  type: "Convolution"
  bottom: "res3_3"
  top: "conv3_4"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_4"
  type: "PReLU"
  bottom: "conv3_4"
  top: "conv3_4"
}
layer {
  name: "conv3_5"
  type: "Convolution"
  bottom: "conv3_4"
  top: "conv3_5"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_5"
  type: "PReLU"
  bottom: "conv3_5"
  top: "conv3_5"
}
layer {
  name: "res3_5"
  type: "Eltwise"
  bottom: "res3_3"
  bottom: "conv3_5"
  top: "res3_5"
  eltwise_param {
    operation: 1
  }
}
layer {
  name: "conv3_6"
  type: "Convolution"
  bottom: "res3_5"
  top: "conv3_6"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_6"
  type: "PReLU"
  bottom: "conv3_6"
  top: "conv3_6"
}
layer {
  name: "conv3_7"
  type: "Convolution"
  bottom: "conv3_6"
  top: "conv3_7"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_7"
  type: "PReLU"
  bottom: "conv3_7"
  top: "conv3_7"
}
layer {
  name: "res3_7"
  type: "Eltwise"
  bottom: "res3_5"
  bottom: "conv3_7"
  top: "res3_7"
  eltwise_param {
    operation: 1
  }
}
layer {
  name: "conv3_8"
  type: "Convolution"
  bottom: "res3_7"
  top: "conv3_8"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_8"
  type: "PReLU"
  bottom: "conv3_8"
  top: "conv3_8"
}
layer {
  name: "conv3_9"
  type: "Convolution"
  bottom: "conv3_8"
  top: "conv3_9"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3_9"
  type: "PReLU"
  bottom: "conv3_9"
  top: "conv3_9"
}
layer {
  name: "res3_9"
  type: "Eltwise"
  bottom: "res3_7"
  bottom: "conv3_9"
  top: "res3_9"
  eltwise_param {
    operation: 1
  }
}
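# Note: "operation: 1" in the Eltwise layers above is Caffe's numeric value
# for the SUM mode, so each res*_* blob is the elementwise sum of a shortcut
# input and the output of two stacked 3x3 convolutions, i.e. the residual
# connections of the ResNet-20 architecture.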
layer {
  name: "conv4_1"
  type: "Convolution"
  bottom: "res3_9"
  top: "conv4_1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 512
    kernel_size: 3
    stride: 2
    pad: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu4_1"
  type: "PReLU"
  bottom: "conv4_1"
  top: "conv4_1"
}
layer {
  name: "conv4_2"
  type: "Convolution"
  bottom: "conv4_1"
  top: "conv4_2"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 512
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu4_2"
  type: "PReLU"
  bottom: "conv4_2"
  top: "conv4_2"
}
layer {
  name: "conv4_3"
  type: "Convolution"
  bottom: "conv4_2"
  top: "conv4_3"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 512
    kernel_size: 3
    stride: 1
    pad: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu4_3"
  type: "PReLU"
  bottom: "conv4_3"
  top: "conv4_3"
}
layer {
  name: "res4_3"
  type: "Eltwise"
  bottom: "conv4_1"
  bottom: "conv4_3"
  top: "res4_3"
  eltwise_param {
    operation: 1
  }
}

layer {
  name: "pool1"
  type: "Pooling"
  bottom: "res4_3"
  top: "res4_3p"
  pooling_param {
    pool: AVE
    kernel_w: 1
    kernel_h: 19 # pool over the whole time axis
  }
}

#layer {
#  name: "drop6"
#  type: "Dropout"
#  bottom: "res4_3p"
#  top: "res4_3p"
#  dropout_param {
#    dropout_ratio: 0.5
#  }
#}

layer {
  name: "fc5"
  type: "InnerProduct"
  bottom: "res4_3p"
  top: "fc5"
  param {
    lr_mult: 2
    decay_mult: 2
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 512
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
############### Softmax Loss ##############
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "fc5"
  top: "fc6"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1251
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "softmax_loss"
  type: "SoftmaxWithLoss"
  bottom: "fc6"
  bottom: "label"
  top: "softmax_loss"
}

layer {
  name: "top1Accuracy"
  type: "Accuracy"
  bottom: "fc6"
  bottom: "label"
  top: "top1Accuracy"
}

layer {
  name: "top5Accuracy"
  type: "Accuracy"
  bottom: "fc6"
  bottom: "label"
  top: "top5Accuracy"
  accuracy_param {
    top_k: 5
  }
}
--------------------------------------------------------------------------------
/prototxt/ResNet-20_solver.prototxt:
--------------------------------------------------------------------------------
net: "prototxt/ResNet-20.prototxt"

base_lr: 0.05
lr_policy: "step"
gamma: 0.75
iter_size: 1

stepsize: 2800
max_iter: 61600

display: 100
average_loss: 100
momentum: 0.93
weight_decay: 0.0005
snapshot: 61600
snapshot_prefix: "result/ResNet-20_512D"

solver_mode: GPU
--------------------------------------------------------------------------------
/result/LogisticMargin/512D/Ident/LM_512D_30_iter_60000.caffemodel:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/result/LogisticMargin/512D/Ident/LM_512D_30_iter_60000.caffemodel
--------------------------------------------------------------------------------
/roc.py:
--------------------------------------------------------------------------------
import numpy as np
import re
import os.path
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

wav_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/Identification_split.txt'
base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
embedding_file="embeddings/LM_512D.npy"

if __name__ == '__main__':

    test_set=[]
    test_label=[]

    input_file = open(wav_list,'r')
    identity=0
    split_index=1
    for line in input_file:
        parsed_line=re.split(r'[ \n]+', line)
        utterance_address=base_address+parsed_line[1]
        prev_split_index=split_index
        split_index=int(parsed_line[0])
        if (prev_split_index>split_index):
            identity=identity+1
        if (os.path.isfile(utterance_address)==False):
            continue
        elif (parsed_line[1][0]=='E'):
            test_set.append(utterance_address)
            test_label.append(identity)

    print("The size of test set: %d" % (len(test_set)))
    print("Number of identities: %d" % (identity+1))
    roc_size=4872
    embeddings=np.load(embedding_file)
    embeddings=np.asarray(embeddings[0:roc_size,:])
    embeddings=embeddings / np.sqrt(np.sum(np.square(embeddings), axis=1))[:,None] # L2-normalize each embedding
    score_matrix=[]
    label_matrix=[]
    true_match=[]
    false_match=[]

    for i in range(0,roc_size):
        for j in range(0,i): # j < i, so each unordered pair is scored exactly once
            score=np.dot(embeddings[i,:],embeddings[j,:])
            if (test_label[i]==test_label[j]):
                score_matrix.append(score)
                label_matrix.append(1)
                true_match.append(score)
            else:
                score_matrix.append(score)
                label_matrix.append(0)
                false_match.append(score)
    print("There are %d positive and %d negative pairs" % (len(true_match),len(false_match)))

    bins = np.linspace(-1.2, 1.2, 240)
    plt.hist(true_match, bins, density=True, facecolor='g', alpha=0.75, label='positive pairs')
    plt.hist(false_match, bins, density=True, label='negative pairs')
    plt.title('Similarity score of positive and negative pairs')
    plt.legend(loc='upper right')
    plt.show()

    fpr, tpr, _ = roc_curve(label_matrix,score_matrix)
    intersection=abs(1-tpr-fpr)
    DCF2=100*(0.01*(1-tpr)+0.99*fpr)
    DCF3=1000*(0.001*(1-tpr)+0.999*fpr)
    print("EER= %.2f DCF0.01= %.3f DCF0.001= %.3f" %(100*fpr[np.argmin(intersection)],np.min(DCF2),np.min(DCF3)))
    plt.figure()
    plt.semilogx(fpr, tpr, label='ROC')
    plt.grid(True)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
--------------------------------------------------------------------------------
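Note: the nested loop in roc.py scores all roc_size*(roc_size-1)/2, roughly 11.9 million, pairs in pure Python. A vectorized sketch of the same computation, assuming the rows of embeddings are already L2-normalized as in the script (the function name is ours):

    import numpy as np

    def pairwise_scores(embeddings, labels):
        # With L2-normalized rows, cosine similarity of every pair is just
        # the Gram matrix (about 190 MB in float64 for 4872 embeddings).
        similarity = np.dot(embeddings, embeddings.T)
        # Keep each unordered pair once: strictly lower triangle, i > j.
        i, j = np.tril_indices(len(labels), k=-1)
        labels = np.asarray(labels)
        return similarity[i, j], (labels[i] == labels[j]).astype(int)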
/roc_vox.py:
--------------------------------------------------------------------------------
import numpy as np
import re
import os.path
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

wav_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/Identification_split.txt'
base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
embedding_file="embeddings/LM_512D.npy"

if __name__ == '__main__':

    test_set=[]
    test_label=[]
    input_file = open(wav_list,'r')
    identity=0
    split_index=1
    for line in input_file:
        parsed_line=re.split(r'[ \n]+', line)
        utterance_address=parsed_line[1]
        prev_split_index=split_index
        split_index=int(parsed_line[0])
        if (prev_split_index>split_index):
            identity=identity+1
        if (os.path.isfile(base_address+utterance_address)==False):
            continue
        elif (parsed_line[1][0]=='E'):
            test_set.append(utterance_address)
            test_label.append(identity)

    print("The size of test set: %d" % (len(test_set)))
    print("Number of identities: %d" % (identity+1))
    roc_size=4872
    embeddings=np.load(embedding_file)
    embeddings=np.asarray(embeddings[0:roc_size,:])
    embeddings=embeddings / np.sqrt(np.sum(np.square(embeddings), axis=1))[:,None] # L2-normalize each embedding
    score_matrix=[]
    label_matrix=[]
    true_match=[]
    false_match=[]

    verif_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_test.txt'
    base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
    input_file = open(verif_list,'r')
    for line in input_file:
        parsed_line=re.split(r'[ \n]+', line)
        utterance_address1=parsed_line[1]
        utterance_address2=parsed_line[2]
        try:
            index1 = test_set.index(utterance_address1)
            index2 = test_set.index(utterance_address2)
        except ValueError: # skip pairs whose utterances are missing from the test set
            continue
        if ((index1>=roc_size) or (index2>=roc_size)):
            continue
        embedding1=embeddings[index1,:]
        embedding2=embeddings[index2,:]
        score=np.dot(embedding1,embedding2)
        score_matrix.append(score)
        if parsed_line[0]=='1':
            label_matrix.append(1)
            true_match.append(score)
        elif parsed_line[0]=='0':
            label_matrix.append(0)
            false_match.append(score)

    print("There are %d positive pairs and %d negative pairs" % (len(true_match),len(false_match)))

    bins = np.linspace(-1.2, 1.2, 240)
    plt.hist(true_match, bins, density=True, facecolor='g', alpha=0.75, label='positive pairs')
    plt.hist(false_match, bins, density=True, label='negative pairs')
    plt.title('Similarity score of positive and negative pairs')
    plt.legend(loc='upper right')
    plt.show()

    fpr, tpr, _ = roc_curve(label_matrix,score_matrix)
    intersection=abs(1-tpr-fpr)
    plt.figure()
    plt.semilogx(fpr, tpr, label='ROC')
    DCF2=100*(0.01*(1-tpr)+0.99*fpr)
    DCF3=1000*(0.001*(1-tpr)+0.999*fpr)
    print("EER= %.2f DCF0.01= %.3f DCF0.001= %.3f" %(100*fpr[np.argmin(intersection)],np.min(DCF2),np.min(DCF3)))
    plt.grid(True)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
--------------------------------------------------------------------------------
/sample.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/sample.wav
--------------------------------------------------------------------------------
/sample_reverse.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MahdiHajibabaei/unified-embedding/9e8f5f1cf9bca699fa51ed714ea3d2b6e490684b/sample_reverse.wav
--------------------------------------------------------------------------------
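Note: roc.py and roc_vox.py share the same metric computation. A sketch of the EER approximation and minimum detection cost they both use, factored into one helper (the function name is ours):

    import numpy as np
    from sklearn.metrics import roc_curve

    def eer_and_min_dcf(labels, scores, p_target=0.01, scale=100):
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        # EER: the point where miss and false-alarm rates cross, approximated
        # on the discrete ROC exactly as in the scripts above.
        eer = 100 * fpr[np.argmin(np.abs(fnr - fpr))]
        # Detection cost: weighted sum of the two error rates, minimized over
        # all thresholds (scale and p_target follow the scripts' DCF0.01).
        min_dcf = np.min(scale * (p_target * fnr + (1 - p_target) * fpr))
        return eer, min_dcf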
/test_ident.py:
--------------------------------------------------------------------------------
import numpy as np
import caffe
print "Caffe successfully imported!"
import scipy.io.wavfile
import re
import os.path

wav_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/Identification_split.txt'
base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
pre_emphasis = 0.97
frame_size = 0.025
frame_stride = 0.01
NFFT = 512
BATCH_SIZE=50
train_set=[]
train_label=[]
validation_set=[]
validation_label=[]
test_set=[]
test_label=[]
spectrogram_batch=np.empty([BATCH_SIZE,1,300,257],dtype=float) # 300 frames x (NFFT/2 + 1) = 257 frequency bins
label_batch=np.empty([BATCH_SIZE,1,1,1],dtype=float)
number_of_crops=50

def crop_inference(batch_index):
    for i in range(0,BATCH_SIZE):
        sample_index=batch_index*BATCH_SIZE+i
        file_name=test_set[sample_index] # Samples are taken sequentially from the test set
        sample_rate, signal = scipy.io.wavfile.read(file_name)
        extendedSignal=np.append(signal,signal)
        beginning=int((len(signal))*np.random.random_sample())
        signal = extendedSignal[beginning:beginning+48241] # Number of samples plus one because we need to apply pre-emphasis filter with receptive field of two
        if (np.int(np.random.random_sample()*2)==1):
            signal= signal[::-1]
        signal=signal-np.mean(signal)
        emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

        frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
        signal_length = len(emphasized_signal)
        frame_length = int(round(frame_length))
        frame_step = int(round(frame_step))
        num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame

        pad_signal_length = num_frames * frame_step + frame_length
        z = np.zeros((pad_signal_length - signal_length))
        pad_signal = np.append(emphasized_signal, z) # Pad the signal so all frames have an equal number of samples without truncating any samples from the original signal

        indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
        frames = pad_signal[indices.astype(np.int32, copy=False)]

        frames *= np.hamming(frame_length)
        mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT

        label_batch[i,0,0,0]=test_label[sample_index]
        spectrogram_batch[i,0,:,:]= (mag_frames - mag_frames.mean(axis=0)) / mag_frames.std(axis=0)


def test_accuracy(net):

    test_set_size=len(test_set)
    print("Start of testing process on %d samples" % (test_set_size))
    top1Accuracy=0
    top5Accuracy=0
    for i in range(0,test_set_size/BATCH_SIZE):
        for j in range(0,number_of_crops):
            crop_inference(i)
            net.blobs['data'].data[...] =spectrogram_batch
            net.blobs['label'].data[...] =label_batch
            net.forward()
            if j==0:
                batch_pool_average=net.blobs['res4_3p'].data
            else:
                batch_pool_average=batch_pool_average+net.blobs['res4_3p'].data
        batch_pool_average=batch_pool_average/number_of_crops
        net.blobs['res4_3p'].data[...] = batch_pool_average
        net.forward(start='fc5') # rerun only the classifier layers on the crop-averaged features
        top1Accuracy=top1Accuracy+net.blobs['top1Accuracy'].data
        top5Accuracy=top5Accuracy+net.blobs['top5Accuracy'].data
        print("Batch #%d of testing is evaluated" % (i))
    top1Accuracy=top1Accuracy*BATCH_SIZE/test_set_size
    top5Accuracy=top5Accuracy*BATCH_SIZE/test_set_size
    print("The top1 accuracy on test set is %.4f" % (top1Accuracy))
    print("The top5 accuracy on test set is %.4f" % (top5Accuracy))


if __name__ == '__main__':

    allocated_GPU= int(os.environ['SGE_GPU'])
    print("The evaluation will be executed on GPU #%d" % (allocated_GPU))
    caffe.set_device(allocated_GPU)
    caffe.set_mode_gpu()
    net_weights='result/LM_512D_30_iter_61600.caffemodel'
    net_prototxt='prototxt/LogisticMargin.prototxt'
    net = caffe.Net(net_prototxt,net_weights,caffe.TEST)
    print("The network will be initialized with coefficients from %s" % (net_weights))

    input_file = open(wav_list,'r')
    identity=0
    split_index=1
    for line in input_file:
        parsed_line=re.split(r'[ \n]+', line)
        utterance_address=base_address+parsed_line[1]
        prev_split_index=split_index
        split_index=int(parsed_line[0])
        if (prev_split_index>split_index):
            identity=identity+1
        if (os.path.isfile(utterance_address)==False): # The file does not exist
            continue
        elif (split_index==1):
            train_set.append(utterance_address)
            train_label.append(identity)
        elif (split_index==2):
            validation_set.append(utterance_address)
            validation_label.append(identity)
        elif (split_index==3):
            test_set.append(utterance_address)
            test_label.append(identity)
    print("The size of training set: %d" % (len(train_set)))
    print("The size of validation set: %d" % (len(validation_set)))
    print("The size of test set: %d" % (len(test_set)))
    print("Number of identities: %d" % (identity+1))
    test_accuracy(net)
--------------------------------------------------------------------------------
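Note: test_ident.py implements fifty-crop evaluation: the pooled 'res4_3p' features of number_of_crops random crops are averaged before the classifier is re-run from 'fc5'. A minimal sketch of that test-time averaging, where embed_crop is a hypothetical stand-in for the forward pass up to the pooling layer:

    import numpy as np

    def averaged_features(crops, embed_crop):
        # embed_crop: hypothetical callable mapping one spectrogram crop to
        # its pooled feature vector (the 'res4_3p' blob in test_ident.py).
        features = np.stack([embed_crop(crop) for crop in crops])
        return features.mean(axis=0) # average over crops, as in test_accuracy()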
/train_aug.py:
--------------------------------------------------------------------------------
import numpy as np
import caffe
print "Caffe successfully imported!"
import scipy.io.wavfile
import re
import os.path

wav_list='/scratch_net/biwidl09/hmahdi/VoxCeleb/Identification_split.txt'
base_address='/scratch_net/biwidl09/hmahdi/VoxCeleb/voxceleb1_wav/'
pre_emphasis = 0.97
frame_size = 0.025
frame_stride = 0.01
NFFT = 512
BATCH_SIZE=50
train_set=[]
train_label=[]
validation_set=[]
validation_label=[]
test_set=[]
test_label=[]
spectrogram_batch=np.empty([BATCH_SIZE,1,300,257],dtype=float) # 300 frames x (NFFT/2 + 1) = 257 frequency bins
label_batch=np.empty([BATCH_SIZE,1,1,1],dtype=float)
MAX_ITERATIONS=61600
TEST_INTERVAL=1000

# Spectrogram extractor, courtesy of Haytham Fayek
def spectrogram_extractor(file_name,i):

    sample_rate, signal = scipy.io.wavfile.read(file_name)
    extended_signal=np.append(signal,signal)
    beginning=int((len(signal))*np.random.random_sample())
    signal = extended_signal[beginning:beginning+48241]
    if (np.int(np.random.random_sample()*2)==1):
        signal= signal[::-1]
    signal=signal-np.mean(signal)
    emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate
    signal_length = len(emphasized_signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))

    pad_signal_length = num_frames * frame_step + frame_length
    z = np.zeros((pad_signal_length - signal_length))
    pad_signal = np.append(emphasized_signal, z)

    indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    frames = pad_signal[indices.astype(np.int32, copy=False)]

    frames *= np.hamming(frame_length)
    mag_frames = np.absolute(np.fft.rfft(frames, NFFT))

    spectrogram_batch[i,0,:,:]= (mag_frames - mag_frames.mean(axis=0)) / mag_frames.std(axis=0)

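# Sanity check on the crop and spectrogram geometry (assuming 16 kHz audio,
# which the 48241-sample crop implies, roughly 3.015 s):
#   frame_length = 0.025 * 16000 = 400 samples
#   frame_step   = 0.010 * 16000 = 160 samples
#   num_frames   = ceil((48241 - 400) / 160) = 300
#   rfft bins    = NFFT / 2 + 1 = 257
# which matches spectrogram_batch's shape of [BATCH_SIZE, 1, 300, 257].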
# This function parses the file list given by Nagrani et al.
def parse_list(mode):

    input_file = open(wav_list,'r')
    identity=0
    split_index=1
    if (mode=='verification'):
        for line in input_file:
            parsed_line=re.split(r'[ \n]+', line)
            utterance_address=base_address+parsed_line[1]
            prev_split_index=split_index
            split_index=int(parsed_line[0])
            if (prev_split_index>split_index):
                identity=identity+1
            if (os.path.isfile(utterance_address)==False):
                continue
            elif (parsed_line[1][0]=='E'):
                test_set.append(utterance_address)
                test_label.append(identity)
            elif (split_index==2):
                validation_set.append(utterance_address)
                validation_label.append(identity)
            else: # split_index == 1
                train_set.append(utterance_address)
                train_label.append(identity)

    elif (mode=='identification'):
        for line in input_file:
            parsed_line=re.split(r'[ \n]+', line)
            utterance_address=base_address+parsed_line[1]
            prev_split_index=split_index
            split_index=int(parsed_line[0])
            if (prev_split_index>split_index):
                identity=identity+1
            if (os.path.isfile(utterance_address)==False):
                continue
            elif (split_index==1):
                train_set.append(utterance_address)
                train_label.append(identity)
            elif (split_index==2):
                validation_set.append(utterance_address)
                validation_label.append(identity)
            elif (split_index==3):
                test_set.append(utterance_address)
                test_label.append(identity)
    else:
        print "Parsing failed due to undefined mode..."
        return
    print("The size of training set: %d" % (len(train_set)))
    print("The size of validation set: %d" % (len(validation_set)))
    print("The size of test set: %d" % (len(test_set)))
    print("Number of identities: %d" % (identity+1))

def validate_accuracy(solver):
    print "Start of validation process.."
    top1Accuracy=0
    top5Accuracy=0
    validation_set_size=len(validation_set)
    for i in range(0,validation_set_size/BATCH_SIZE):
        for j in range(0,BATCH_SIZE):
            sample_index=i*BATCH_SIZE+j
            file_name=validation_set[sample_index]
            label_batch[j,0,0,0]=validation_label[sample_index]
            spectrogram_extractor(file_name,j)
        solver.net.blobs['data'].data[...] =spectrogram_batch
        solver.net.blobs['label'].data[...] =label_batch
        solver.net.forward()
        top1Accuracy=top1Accuracy+solver.net.blobs['top1Accuracy'].data
        top5Accuracy=top5Accuracy+solver.net.blobs['top5Accuracy'].data
    top1Accuracy=top1Accuracy*BATCH_SIZE/validation_set_size
    top5Accuracy=top5Accuracy*BATCH_SIZE/validation_set_size
    print("The top1 accuracy on validation set is %.4f" % (top1Accuracy))
    print("The top5 accuracy on validation set is %.4f" % (top5Accuracy))
if __name__ == '__main__':

    mode='identification'
    allocated_GPU= int(os.environ['SGE_GPU'])
    print("The %s training will be executed on GPU #%d" % (mode,allocated_GPU))
    caffe.set_device(allocated_GPU)
    caffe.set_mode_gpu()
    solver = caffe.SGDSolver("prototxt/ResNet-20_solver.prototxt") # Network with the typical softmax and cross-entropy loss function
    # Keep the following block commented out when training from scratch;
    # uncomment it to fine-tune with the logistic margin loss function.
    '''
    solver = caffe.SGDSolver("prototxt/LogisticMargin_solver.prototxt") # Network with the logistic margin loss function
    net_weights='result/ResNet-20_512D_iter_61600.caffemodel'
    print("The network will be initialized with %s" % (net_weights))
    solver.net.copy_from(net_weights)
    '''
    parse_list(mode)
    train_set_size=len(train_set)

    for j in range(0,MAX_ITERATIONS):
        for i in range(0,BATCH_SIZE):
            sample_index=np.int(np.random.random_sample()*train_set_size)
            file_name=train_set[sample_index]
            spectrogram_extractor(file_name,i)
            label_batch[i,0,0,0]=train_label[sample_index]
        solver.net.blobs['data'].data[...] =spectrogram_batch
        solver.net.blobs['label'].data[...] =label_batch
        solver.step(1)
        if (np.mod(j,TEST_INTERVAL)==0):
            validate_accuracy(solver)
--------------------------------------------------------------------------------
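Note: the repetition and time-reversion augmentation performed inside spectrogram_extractor can also be applied to a raw waveform on its own; a minimal sketch under the same constants as train_aug.py (the function name is ours):

    import numpy as np

    def augment_waveform(signal, crop_length=48241):
        # Repetition: append the recording to itself so a crop can wrap around.
        extended = np.append(signal, signal)
        start = int(len(signal) * np.random.random_sample())
        crop = extended[start:start + crop_length]
        # Time-reversion: reverse the crop with probability 0.5.
        if np.random.random_sample() < 0.5:
            crop = crop[::-1]
        return crop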