├── CONTRIBUTING.md ├── LICENSE ├── NNs ├── README.md ├── helping_functions.py └── lstm.py ├── README.md ├── Short_Programming_Project ├── README.md └── university-groningen-short.pdf ├── classifiers ├── README.md ├── SVMs │ ├── README.md │ ├── mfcc_pca_feature.py │ ├── svm_balancedSampleNumber_greedySearch.py │ ├── svm_default.py │ ├── svm_keeping_supportVectors.py │ └── svm_multiclass.py ├── dimensionality_reduction │ ├── README.md │ ├── graph_spectral_analysis&spectral_clustering_default.py │ ├── kpca_lda_knn_equalizeClasses.py │ ├── kpca_lda_knn_multiclass.py │ └── pca_kpca_from-skratch.py ├── gmm.py ├── gmm_healthy_captured.py ├── knn.py ├── leave_one_out.py ├── logisticRegression.py └── simpleNeuralNetwork.py ├── feature_extraction_techniques ├── README.md ├── lpc.py ├── mfcc.py ├── mfcc_pca.py ├── mgca.py ├── plp.py └── readFiles.py └── speech_features ├── README.md ├── gmm_mfcc_0.txt ├── gmm_mfcc_1.txt ├── gmm_test_mfcc.txt ├── lpc_featuresLR.txt ├── mel_generalized_features.txt ├── mfcc_featuresLR.txt ├── plp_features.txt └── plp_featuresRASTA.txt /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Issues for contribution 2 | 3 | - Feature extraction techniques implemented from scratch (LPC, MFCC, PLP), for comparing the results with the already existing implementations. 4 | 5 | - Variational Autoencoders (VAEs) for finding a lower-dimensional representation of the extracted features. Furthermore, we can produce artificial samples using VAEs; with this procedure we can overcome the obstacle caused by the imbalanced dataset. 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Emmanouil Gionanidis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /NNs/README.md: -------------------------------------------------------------------------------- 1 | **helping_functions.py** 2 | Implements helper functions to split the data and to build the time-series format (t, t+1). The number of previous values that we take into consideration in order to predict the next value in the time series is defined by us. It also contains preprocessing functions, such as filling in missing values and, mainly, shaping the data into the appropriate format for feeding different types of neural networks.
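For illustration, here is a minimal sketch of the (t, t+1) windowing idea described above; `make_windows` is a hypothetical name, not the function defined in helping_functions.py:

```python
import numpy as np

def make_windows(series, time_step=1):
    #each input row holds the previous `time_step` values of the series,
    #and the matching target is the value that immediately follows them
    X, y = [], []
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])
        y.append(series[i + time_step])
    return np.array(X), np.array(y)

#example: with time_step=2, [10, 12, 11, 13, 15] becomes
#X = [[10, 12], [12, 11], [11, 13]] and y = [11, 13, 15]
X, y = make_windows([10, 12, 11, 13, 15], time_step=2)
```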
3 | 4 | **lstm.py** 5 | A Long Short-Term Memory (LSTM) neural network for predicting values in time series. This implementation follows on from our first approach for dealing with sequential data. That first approach was a Recurrent Neural Network (RNN), which is known to have problems with short-term and long-term memory. This happens because the gradient vanishes as we move from the last layers back to the first ones, with the outcome that the first layers of the network barely learn at all. We can overcome this obstacle with architectures that provide long- and short-term memory, because previous events are important and we need longer dependencies in our data. For this reason we implement the LSTM and the GRU (Gated Recurrent Unit), which realise exactly this mechanism of giving memory to the network. 6 | -------------------------------------------------------------------------------- /NNs/helping_functions.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | #libraries 4 | import pandas as pd 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import math 8 | import keras.models 9 | import keras.layers 10 | import sklearn.preprocessing 11 | from sklearn.metrics import mean_squared_error 12 | import keras.optimizers 13 | 14 | 15 | '''First we have to define our problem. We use an LSTM because we want to take advantage of its strong point, which is long-short term memory, contrary to RNNs, which we know have drawbacks concerning the gradient in the first layers, so we may have a leakage of previous information. For long time series, or long sequences in general, we use LSTMs and GRUs, but for short sequences RNNs are efficient as well.''' 16 | 17 | 18 | #------------------- Initialization 19 | 20 | 21 | #make this function for reading the input data, shaping it into the format you want, and visualizing it 22 | def initialization(): 23 | 24 | print('Remember: for small datasets a large training percentage can leave the testing set with zero elements. \n ') 25 | 26 | 27 | #introduce some randomness into our procedure (fixed seed, for reproducibility) 28 | np.random.seed(7) 29 | 30 | #determine the path of the file that you want to read from 31 | #-----------------------------------------------------------------> Define the path 32 | path = '' 33 | 34 | #print the two destinations that constitute the trip: split the string based on '/' and print the last word, stripping the last four characters '.txt' 35 | print('Running for: ',path.split('/')[6][:-4]) 36 | 37 | #define the names of the DataFrame columns 38 | names=["x","y"] 39 | 40 | #read the file defined by the path as a pandas DataFrame with the aforementioned columns; the target is the ticket_price, and we can control what we feed our model with via usecols 41 | timeserie = pd.read_csv(path, names = names, engine='python', index_col=None, usecols = ["x","y"]) 42 | #visualize the DataFrame, and check the dimensions 43 | print('Dataframe: \n') 44 | print(timeserie) 45 | print(timeserie.shape) 46 | print('\n') 47 | 48 | #plot how the price evolves through time (definition of time: day, month, year etc) 49 | #plt.plot(timeserie) 50 | #plt.show() 51 | 52 | 53 | #return the file as a DataFrame 54 | return timeserie 55 | 56 | 57 | 58 | 59 | 60 | #------------------ Split Data training/testing 61 | 62 | #split data into training and testing subsets 63 | def split_data(dataset, training_size): 64 | 65 | #translate the training size into our number of elements 66 | train_size = int(len(dataset) * training_size) 67 | test_size = len(dataset) - train_size 68 | 69 | #take the corresponding parts of the dataset 70 | train_samples, test_samples = dataset[0:train_size,:], dataset[train_size: len(dataset),:] 71 | 72 | return train_samples, test_samples 73 | 74 | 75 | 76 | #------------------------------------- Format Dataset 77 | 78 | 79 | '''We are making this function because we want to change the format of our data; we are going to implement regression, so as to predict the next value of a time series.
This means that we are goint to have a specific time, let's say t, and we are predicting what is happening in the time t+1, so we need to model our dataset in order to implement this ideology''' 80 | def format_dataset(dataset, time_step): 81 | 82 | #time_step defines how many times you want to look back 83 | 84 | #define as dataT the time t, and as dataT_1 the time t+1 85 | dataT, dataT_1 = [], [] 86 | 87 | #iterate all the dataset and make the format based on the time step 88 | for i in range(len(dataset)-time_step-1): 89 | 90 | #in time t put the current element 91 | dataT.append(dataset[i:(i+time_step), 0]) 92 | 93 | #in the time t+1 put the next element of the element that we append in the list dataT 94 | dataT_1.append(dataset[i + time_step, 0]) 95 | 96 | #repeat this procedure, following the element that we append in the array dataT_1 97 | 98 | 99 | return np.array(dataT), np.array(dataT_1) 100 | 101 | 102 | 103 | #---------------- Preprocessing 104 | 105 | 106 | #we use this function in order to do the preprocessing staff, normalize, and maybe another procedures #that we want to implement for making a proper format to our data 107 | def preprocessing(dataset, time_step, training_size): 108 | 109 | #take only tha information of the dataframe and not the indexes or the columns names 110 | dataset = dataset.values 111 | 112 | #convert them to floats which is more suitable for feeding a neural netowork 113 | dataset = dataset.astype('float32') 114 | 115 | #first we are going to scale our data because LSTMs are sensitive to the unscaled input data and we are goint to see this in action 116 | #scaling in range [0,1] 117 | scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(0,1)) 118 | 119 | 120 | #----------------------------------------------------------------> fit all the dataset 121 | 122 | dataset = scaler.fit_transform(dataset) 123 | 124 | #and then split 125 | train_samples_scaled, test_samples_scaled = split_data(dataset) 126 | 127 | #----------------------------------------------------------------> fit only training 128 | 129 | #first we have to split our data into test and training 130 | train_samples, test_samples = split_data(dataset, training_size) 131 | 132 | print('Samples chosen for the training procedure: \n') 133 | print(train_samples) 134 | print(train_samples.shape) 135 | print('\n') 136 | print('Samples chose for testing: \n') 137 | print(test_samples) 138 | print(test_samples.shape) 139 | print('\n') 140 | 141 | ''' 142 | scaler.fit(train_samples) 143 | 144 | #transform both training and testing data based on the information of the training only because we want our model to work only for one sample for testing as input 145 | train_samples_scaled = scaler.transform(train_samples) 146 | 147 | test_samples_scaled = scaler.transform(test_samples)''' 148 | 149 | #visualize the scaled data 150 | print('Scaled samples for the training procedure: \n') 151 | print(train_samples_scaled) 152 | print(train_samples_scaled.shape) 153 | print('\n') 154 | print('Scaled samples for testing: \n') 155 | print(test_samples_scaled) 156 | print(test_samples_scaled.shape) 157 | print('\n') 158 | 159 | #------------------------------------------------------------------> end with scaling 160 | 161 | #timeserie format for both testing and training sets 162 | 163 | #define the time step, e.g t+5 we want time_step=5 164 | time_step=1 165 | 166 | trainT, trainT_1 = format_dataset(train_samples_scaled, time_step) 167 | 168 | testT, testT_1 = format_dataset(test_samples_scaled, time_step) 
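#note on the shapes produced above: with time_step=1, a scaled series [s0, s1, s2, ...]
#is turned by format_dataset into
#    trainT   = [[s0], [s1], [s2], ...]   (inputs at time t)
#    trainT_1 = [ s1 ,  s2 ,  s3 , ...]   (targets at time t+1)
#so trainT currently has shape (n_samples, time_step); the reshape further below inserts
#the middle 'time steps' axis that the Keras LSTM expects, giving (n_samples, 1, time_step)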
169 | 170 | print('Previous train data shape: ') 171 | print(trainT) 172 | print('\n') 173 | 174 | #bacause the LSTM waits our input to be in the format below 175 | #[samples, time steps, features] 176 | #we need to transform it in order to fit this prerequisite 177 | 178 | #------------------------------------------------------------> [samples, time steps, features] format 179 | #we are formating only the training set not the testing, because the testing in going to be just apredicted single valuew 180 | trainT = np.reshape(trainT, (trainT.shape[0], 1, trainT.shape[1])) 181 | 182 | testT = np.reshape(testT, (testT.shape[0], 1, testT.shape[1])) 183 | 184 | print('Current train data shape ready to feed LSTM model: ') 185 | print(trainT) 186 | print('\n') 187 | 188 | 189 | #return the sets of training and testing ready for the LSTM model 190 | return trainT, trainT_1, testT, testT_1, time_step, scaler, dataset 191 | 192 | -------------------------------------------------------------------------------- /NNs/lstm.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | #libraries 4 | import pandas as pd 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import math 8 | import keras.models 9 | import keras.layers 10 | import sklearn.preprocessing 11 | from sklearn.metrics import mean_squared_error 12 | import keras.optimizers 13 | import helping_functions 14 | 15 | 16 | #--------------------------------------- RNN with LSTM layer 17 | 18 | 19 | def LSTM(trainT, trainT_1, testT, testT_1, time_step, scaler, dataset): 20 | 21 | #create the model 22 | 23 | #we choose the Sequential because we want to stack the layers, put them in a row 24 | model = keras.models.Sequential() 25 | 26 | #we add a LSTM layer 27 | #---> with 4 neurons or units 28 | #---> determine the input dimension based on the time_step because the input is going to be our previous values and the output will be only the predicted values 29 | #---> dropout: choose the percent to drop of the linear transformation of the reccurent state 30 | #---> implementation: choose if you want to stack the operation into larger number of smaller dot productes or the inverse 31 | #---> recurrent_dropout: the dropout of the recurrent state 32 | 33 | model.add(keras.layers.LSTM(128, input_shape=(1, time_step), use_bias=True, unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1,return_sequences=True)) 34 | 35 | #another one LSTM layer 36 | model.add(keras.layers.LSTM(64, input_shape=(1, time_step), return_sequences=False)) 37 | 38 | model.add(keras.layers.Dense(16,init='uniform',activation='relu')) 39 | 40 | #just a densenly connected layer with 1 neuron/unit, as an output, that makes the single value prediction 41 | model.add(keras.layers.Dense(1, activation='sigmoid')) 42 | 43 | #we use the RSME to validate the performance of our model, and the Adam optimizer for updating the network weights 44 | 45 | #Optimizers to use 46 | 47 | #----> Stochastic Gradient Descent - SGD 48 | #----> RMSProp 49 | #----> Adagrad 50 | #----> Adam 51 | model.compile(loss='mean_squared_error', optimizer=keras.optimizers.Adam(lr=0.001)) 52 | 53 | #feed our model 54 | results = model.fit(trainT, trainT_1, epochs=300, batch_size=1, verbose=1,validation_data=(testT, testT_1)) 55 | 56 | 57 | #------------------ 
Make the predictions 58 | train_predict = model.predict(trainT) 59 | test_predict = model.predict(testT) 60 | 61 | #inverse the prediction in order to suit the format euros per time moment, for calculating the RMSE 62 | train_predict = scaler.inverse_transform(train_predict) 63 | trainT_1 = scaler.inverse_transform([trainT_1]) 64 | test_predict = scaler.inverse_transform(test_predict) 65 | testT_1 = scaler.inverse_transform([testT_1]) 66 | 67 | #now we can calculate the RMSE 68 | 69 | train_score = math.sqrt(mean_squared_error(trainT_1[0], train_predict[:,0])) 70 | print('RMSE training: %.2f' % (train_score)) 71 | 72 | test_score = math.sqrt(mean_squared_error(testT_1[0], test_predict[:,0])) 73 | print('RMSE testing: %.2f'% (test_score)) 74 | 75 | Visualize(train_predict, test_predict, dataset, time_step, scaler, results) 76 | 77 | 78 | 79 | #-------------- Visualize the pridictions 80 | def Visualize(train_predict, test_predict, dataset, time_step, scaler, results): 81 | 82 | #initialize the array for testing and training 83 | train_predict_plot = np.empty_like(dataset) 84 | train_predict_plot[:, :] = np.nan 85 | 86 | test_predict_plot = np.empty_like(dataset) 87 | test_predict_plot[:, :] = np.nan 88 | 89 | 90 | #we have to shift the train predictions in order to plot them correctly 91 | train_predict_plot[time_step:len(train_predict)+time_step, :] = train_predict 92 | 93 | #we have to shift the test predictions in order to plot them correctly 94 | test_predict_plot[len(train_predict)+(time_step*2)+1:len(dataset)-1,:] = test_predict 95 | 96 | 97 | #plot baseline and the predictions in the same plot 98 | plt.figure(1) 99 | plt.title('Predictions from training and testins sets') 100 | plt.plot(scaler.inverse_transform(dataset)) 101 | print(train_predict_plot) 102 | print(test_predict_plot) 103 | plt.plot(train_predict_plot) 104 | plt.plot(test_predict_plot) 105 | plt.legend(['Dataset','Train phase prediction','Test phase prediction']) 106 | 107 | plt.figure(2) 108 | plt.title('Train loss curve') 109 | plt.plot(results.history['loss']) 110 | plt.plot(results.history['val_loss']) 111 | plt.legend(['train loss','test loss']) 112 | 113 | #show the plots 114 | plt.show() 115 | 116 | 117 | 118 | #------------------- Main procedure 119 | 120 | def main(): 121 | dataset = helping_functions.initialization() 122 | 123 | training_size = 0.60 124 | time_step = 1 125 | 126 | trainT, trainT_1, testT, testT_1, time_step, scaler, dataset = helping_functions.preprocessing(dataset, time_step, training_size) 127 | 128 | LSTM(trainT, trainT_1, testT, testT_1, time_step, scaler, dataset) 129 | 130 | 131 | main() 132 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Speech-Signal-Processing-and-Classification 2 | 3 | #### Aristotle University of Thessaloniki - University of Groningen 4 | 5 | Abstract of my thesis conducted during the 7-8th semester. " Two-class classification problems by analyzing the speech signal. " 6 | 7 | 8 | 9 | Front-end speech processing aims at extracting proper features from short- term segments of a speech utterance, known as frames. It is a pre-requisite step toward any pattern recognition problem employing speech or audio (e.g., music). Here, we are interesting in voice disorder classification. 
That is, to develop two-class classifiers which can discriminate between utterances of a subject suffering from, say, vocal fold paralysis and utterances of a healthy subject. The mathematical modeling of the speech production system in humans suggests that an all-pole system function is justified [1-3]. As a consequence, linear prediction coefficients (LPCs) constitute a first choice for modeling the magnitude of the short-term spectrum of speech. LPC-derived cepstral coefficients are guaranteed to discriminate between the system (e.g., vocal tract) contribution and that of the excitation. Taking into account the characteristics of the human ear, the mel-frequency cepstral coefficients (MFCCs) emerged as descriptive features of the speech spectral envelope. Similarly to MFCCs, the perceptual linear prediction coefficients (PLPs) could also be derived. The aforementioned, so to speak, traditional features will be tested against agnostic features extracted by convolutional neural networks (CNNs) (e.g., auto-encoders) [4]. Additionally, as concerns the SVM algorithm, the dimensionality reduction step took place using algorithms such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the kernel form of PCA (KernelPCA). In the multiclass implementation the use of KPCA followed by LDA was essential, while in the binary classification we compare the use of PCA and KernelPCA. For experimental purposes, Graph Spectral Analysis (Isomap, LLE) was used for dimensionality reduction, followed by Spectral Clustering in order to investigate subsets. 11 | The pattern recognition step will be based on Gaussian Mixture Model based classifiers, K-nearest neighbor classifiers, Bayes classifiers, as well as Deep Neural Networks. At the application level, a library for feature extraction and classification in Python will be developed. Credible publicly available resources will be used toward achieving our goal, such as KALDI. Comparisons will be made against [5-7]. 12 | 13 | [1] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Up- 14 | per Saddle River, N.J.: Pearson Education-Prentice Hall, 2001. 15 | 16 | [2] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Pro- 17 | cessing of Speech Signals. New York, N.Y.: Wiley-IEEE, 1999. 18 | 19 | [3] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital 20 | Speech Processing. Upper Saddle River, N.J.: Pearson Education-Prentice 21 | Hall, 2011. 22 | 23 | [4] Wei-Ning Hsu, Yu Zhang, and James R. Glass, "Unsupervised Domain 24 | Adaptation for Robust Speech Recognition via Variational Autoencoder- 25 | Based Data Augmentation", 2017, http://arxiv.org/abs/1707.06265. 26 | 27 | [5] C. Kotropoulos and G. R. Arce, "Linear discriminant classifier with re- 28 | ject option for the detection of vocal fold paralysis and vocal fold edema", 29 | EURASIP Advances in Signal Processing, vol. 2009, article ID 203790, 13 30 | pages, 2009 (DOI:10.1155/2009/203790). 31 | 32 | [6] E. Ziogas and C. Kotropoulos, "Detection of vocal fold paralysis and 33 | edema using linear discriminant classifiers", in Proc. 4th Panhellenic Ar- 34 | tificial Intelligence Conf. (SETN-06), Heraklion, Greece, vol. LNAI 3966, 35 | pp. 454-464, May 19-20, 2006. 36 | 37 | [7] M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic de- 38 | tection of vocal fold paralysis and edema", in Proc. 8th Int. Conf. Spoken 39 | Language Processing (INTERSPEECH 2004), Jeju, Korea, pp. 537-540, Oc- 40 | tober, 2004.
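As a rough illustration of the pipeline described above (frame-level MFCC extraction followed by a two-class classifier), here is a minimal sketch using `python_speech_features` and scikit-learn, the libraries used throughout this repository; the file names are placeholders and the SVM hyperparameters simply mirror those used in `classifiers/SVMs/svm_balancedSampleNumber_greedySearch.py`:

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def utterance_features(path):
    #13 MFCCs per frame, averaged over all frames -> one fixed-length vector per utterance
    rate, signal = wav.read(path)
    return np.mean(mfcc(signal, rate), axis=0)

#placeholder corpus: replace with the real .wav paths and labels (0 = healthy, 1 = pathological)
files = [('healthy_001.wav', 0), ('healthy_002.wav', 0),
         ('paralysis_001.wav', 1), ('paralysis_002.wav', 1)]

X = np.array([utterance_features(path) for path, _ in files])
y = np.array([label for _, label in files])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel='rbf', C=1000, gamma=1e-7).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```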
41 | -------------------------------------------------------------------------------- /Short_Programming_Project/README.md: -------------------------------------------------------------------------------- 1 | University of Groningen 2 | 3 | 4 | Short Programming Project on gender classification based on voice signals (.wav files) 5 | 6 | -------------------------------------------------------------------------------- /Short_Programming_Project/university-groningen-short.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gionanide/Speech_Signal_Processing_and_Classification/4e250f1f7de9c17263be1f309441f4c9bd34f4be/Short_Programming_Project/university-groningen-short.pdf -------------------------------------------------------------------------------- /classifiers/README.md: -------------------------------------------------------------------------------- 1 | Classifier implementations, first for the gender task and then for the healthy-or-not cases. 2 | 3 | - Gaussian Mixture Models 4 | - K-nearest Neighbours 5 | - Logistic Regression 6 | - Support Vector Machine 7 | - Linear Discriminant Analysis 8 | - Decision Tree Classifier 9 | - GaussianNB 10 | - Neural Networks 11 | 12 | Further elaboration took place for: 13 | - GMMs 14 | - KNN 15 | - LR 16 | - SVM 17 | -------------------------------------------------------------------------------- /classifiers/SVMs/README.md: -------------------------------------------------------------------------------- 1 | Implementation of the SVM algorithm for classification. **svm_default.py** uses only the default parameters to initialize the procedure. 2 | 3 | This folder contains variations concerning the training and evaluation methods of the SVM algorithm, with experimental results for different kernel functions and different parameter values. **svm_balancedSampleNumber_greedySearch.py** uses a training method with a balanced training set, where the balance refers to the number of samples of each class. 4 | 5 | One example of such training uses the divided parts and keeps only the samples that are support vectors in every iteration, continuing this procedure until the class with more samples has been iterated over completely. The last variation uses greedy (grid) search to calculate the kernel parameters. 6 | 7 | In the script **svm_keeping_supportVectors.py** the above experiment takes place. As a first approach we train our model taking all the samples from class0 and dividing them accordingly just to balance our data, and we continue this procedure until there are no more units of class0 samples. From each iteration we keep all the support vectors, which contain samples from both classes. We remove the duplicates and delete all the samples that came from class1, so we obtain a dataframe containing all the support vectors from class0. We then train our model with all the samples from class1 and only those class0 samples that were support vectors, and we repeat this procedure. In the end the number of samples from class0 becomes smaller than the number of samples from class1, and when it drops below half the number of class1 samples we stop. Because we scale the data while keeping the support vectors, we have to unscale them: if we fed the classifier the already-scaled support vectors they would be scaled again, so we unscale the class0 support vectors that we kept.
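A condensed sketch of this loop, for orientation only; it is simplified with respect to the actual script (the scaling/unscaling and the train/test split are omitted), and `reduce_class0_by_support_vectors` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

def reduce_class0_by_support_vectors(class0, class1):
    #keep shrinking the majority class (class0) until it drops below half of class1
    while len(class0) >= len(class1) / 2:
        kept = []
        #split class0 into units of roughly len(class1) samples each
        for unit in np.array_split(class0, max(len(class0) // len(class1), 1)):
            data = pd.concat([unit, class1])
            X, y = data.drop('Label', axis=1).values, data['Label'].values
            svm = SVC(kernel='rbf', C=10, gamma=1).fit(X, y)
            sv_rows = data.iloc[svm.support_]            #samples that became support vectors
            kept.append(sv_rows[sv_rows['Label'] == 0])  #keep only the class0 ones
        class0 = pd.concat(kept).drop_duplicates()
    return class0
```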
8 | 9 | In general because class0 has 6 times more samples than class1 in order to reduce the amount of samples of class0 we try this procedure taking the support vectors and then the support vectors of support vectors and goes on. 10 | 11 | Furthermore in the script **svm_multiclass.py** we try to classify a dataset of three classes. 12 | -------------------------------------------------------------------------------- /classifiers/SVMs/mfcc_pca_feature.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | from python_speech_features import mfcc 4 | import scipy.io.wavfile as wavv 5 | import os 6 | from sklearn.decomposition import IncrementalPCA, PCA 7 | import sys 8 | import pandas as pd 9 | from sklearn import model_selection 10 | from sklearn.svm import SVC # support vectors for classification 11 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 12 | from sklearn.model_selection import cross_val_score, GridSearchCV 13 | import timeit 14 | import numpy as np 15 | import itertools 16 | from sklearn.preprocessing import MinMaxScaler 17 | 18 | ''' 19 | We read the input file, we take the rate of the signal and the signal and then the mfcc feature extraction is on. 20 | N numpy array with size of the number of frames, each row has one feature vector. 21 | ''' 22 | def mfcc_features_extraction(wav): 23 | inputWav,wav = readWavFile(wav) 24 | print inputWav 25 | rate,signal = wavv.read(inputWav) 26 | mfcc_features = mfcc(signal,rate) 27 | return mfcc_features,wav 28 | 29 | ''' 30 | Make a numpy array with length the number of mfcc features, 31 | for one input take the sum of all frames in a specific feature and divide them with the number of frames. Because we extract 13 features 32 | from every frame now we have to add them and take the mean of them in order to describe the sample. 33 | ''' 34 | def mean_features(mfcc_features,wav,folder,general_feature_list,general_label_list): 35 | #here we are taking all the mfccs from every frame and we are not taking the average of them, instead we 36 | #are taking PCA in order to reduce the dimension of our data 37 | 38 | if (folder=='HC'): 39 | #map the lists, in the first position of the general_label_list it will be the label 40 | #of the sample which is in the first position in the list general_feature_list 41 | #and we are making this in order to write the sample to the file with the right labels 42 | general_label_list.append(0) 43 | elif(folder == 'PD'): 44 | general_label_list.append(1) 45 | 46 | #initialize the flattend list 47 | flattend_list = [] 48 | 49 | #flat the list, for every frame take the 13 features and put them in one array 50 | for sublist in mfcc_features: 51 | for feature in sublist: 52 | flattend_list.append(feature) 53 | 54 | #check if a sample has les length than the length we determine 55 | if(len(flattend_list)<12800): 56 | print len(flattend_list) 57 | print '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' 58 | 59 | #make the list of lists as a numpy array in order just one sample 1x(number_of_frames x features) 60 | #so for every sample we have all the features from all the frames in a single row. 
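#with the default mfcc() settings (13 coefficients per frame, 10 ms frame step) a few seconds of
#speech gives a row of several thousand values; classifyPHC() later truncates every row to its
#first 12800 values so that all samples end up with the same fixed length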
61 | #here we append in a list of lists the samples, we want to fill this list with all the samples 62 | general_feature_list.append(flattend_list) 63 | 64 | #in this function we just filling the two lists one with the features and one with the labels 65 | 66 | #sys.exit() 67 | 68 | 69 | 70 | def readWavFile(wav): 71 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 72 | inputWav = 'PATH_TO_WAV'+wav 73 | return inputWav,wav 74 | 75 | 76 | 77 | 78 | ''' 79 | write in a txt file the output vectors of every sample 80 | ''' 81 | def writeFeatures(general_feature_list,general_label_list,wav,folder): 82 | 83 | f = open('PATH_TO_SAMPLES','a') 84 | 85 | 86 | #we have to iterato all the general_feature_list 87 | for x in range(len(general_feature_list)): 88 | #append the last element before you write to the file because it is the label 89 | 90 | print len(general_feature_list[x]) 91 | 92 | #write it to the file after you append it 93 | np.savetxt(f,general_feature_list[x],newline=",") 94 | #write the label 95 | f.write(str(general_label_list[x])) 96 | #and change line 97 | f.write('\n') 98 | 99 | 100 | ''' 101 | if i want to keep only the gender (male,female) 102 | wav = wav.split('/')[1].split('-')[1], this is only for male,female classification 103 | wav = wav.split('/')[1].split('-')[0], this is for edema,paralysis classification 104 | wav.split('/')[1], for healthy,parkinson classification 105 | ''' 106 | 107 | def makeFormat(folder): 108 | if (folder=='HC'): 109 | wav='0' 110 | elif(folder == 'PD'): 111 | wav='1' 112 | return wav 113 | 114 | 115 | ''' 116 | def readCases(): 117 | - now we want to take all the file names of a directory and them read them accordingly 118 | 119 | healthyCases = os.listdir('PATH_TO_WAV') 120 | parkinsonCases = os.listdir('PATH_TO_WAV') 121 | 122 | return healthyCases , parkinsonCases 123 | ''' 124 | 125 | 'takes the csv file and split the label from the features' 126 | def splitData(data): 127 | # Split-out the set in two different arrayste 128 | array = data.values 129 | #features array contains only the features of the samples 130 | features = array[:,0:12800] 131 | #labels array contains only the lables of the samples 132 | labels = array[:,12800] 133 | 134 | return features,labels 135 | 136 | ''' 137 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 138 | than class1, particularly it is 9 to 1.''' 139 | def equalizeClasses(data): 140 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 141 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 142 | class1 = data.loc[data['Label'] == 1] 143 | 144 | 145 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 146 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 147 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 148 | 149 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 150 | 151 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 152 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 153 | #class0 = class0.sample(frac=1) 154 | 155 | #samples array for training taking the balance 
number of samples for the shuffled dataFrame 156 | newClass0 = class0.sample(n=balance) 157 | 158 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 159 | newData = pd.concat([newClass0, class1]) 160 | 161 | #return the new balanced(number of samples from each class) dataFrame 162 | return newData 163 | 164 | 165 | 166 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 167 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 168 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 169 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 170 | the best results. This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 171 | def paramTuning(features_train, labels_train, nfolds): 172 | #using the training data and define the number of folds 173 | #determine the range of the Cs range you want to search 174 | Cs = [1000, 10010,10000, 10060, 100000, 1000000] 175 | 176 | #determine the range of the gammas range you want to search 177 | gammas = [0.00001, 0.0001, 0.005, 0.003 ,0.001, 0.01, 0.1] 178 | 179 | #make the dictioanry 180 | param_grid = {'C': Cs, 'gamma': gammas} 181 | 182 | #start the greedy search using all the matching sets from above 183 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 184 | 185 | #fit your training data 186 | grid_search.fit(features_train, labels_train) 187 | 188 | #visualize the best couple of parameters 189 | print grid_search.best_params_ 190 | 191 | 192 | 193 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 194 | def classifyPHC(general_feature_list,general_label_list): 195 | #because we took features and labels seperatly we have to put them in the same list 196 | #and because for every signal we have different frames we took the first 12800 features 197 | for x in range(len(general_feature_list)): 198 | general_feature_list[x] = general_feature_list[x][:12800] 199 | general_feature_list[x].append(general_label_list[x]) 200 | 201 | #here because we have to make the dataframe again because the inputs are two lists 202 | headers = [] 203 | #we initialize the headers/features 204 | for x in range(1,12801): 205 | headers.append('Feature'+str(x)) 206 | headers.append('Label') 207 | 208 | print len(general_feature_list) 209 | print len(general_feature_list[0]) 210 | 211 | #build the dataframe 212 | data = pd.DataFrame(general_feature_list,columns=headers) 213 | 214 | #equalize classes 215 | data = equalizeClasses(data) 216 | 217 | #data = equalizeClasses(data) 218 | features,labels = splitData(data) 219 | 220 | #determine the training and testing size in the range of 1, 1 = 100% 221 | validation_size = 0.2 222 | 223 | #here we are splitting our data based on the validation_size into training and testing data 224 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 225 | test_size=validation_size) 226 | 227 | 228 | #determine the pca, and determine the dimension you want to end up 229 | pca = PCA(n_components=500) 230 | 231 | #fit only the features train 232 | pca.fit(features_train) 233 | 234 | #dimensionality reduction of features 
train 235 | features_train = pca.transform(features_train) 236 | 237 | #dimensionality reduction of fatures validation 238 | features_validation = pca.transform(features_validation) 239 | 240 | 241 | #normalize data in the range [-1,1] 242 | scaler = MinMaxScaler(feature_range=(-1, 1)) 243 | #fit only th training data in order to find the margin and then test to data without normalize them 244 | scaler.fit(features_train) 245 | 246 | features_train = scaler.transform(features_train) 247 | 248 | #trnasform the validation features without fitting them 249 | features_validation = scaler.transform(features_validation) 250 | 251 | #we can see the shapes of the array just to check 252 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 253 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 254 | 255 | 256 | #take the best couple of parameters from the procedure of greedy search 257 | #paramTuning(features_train, labels_train, 5) 258 | 259 | #we initialize our model 260 | svm = SVC(kernel='rbf',C=1000,gamma=1e-05,decision_function_shape='ovr') 261 | #svm = NearestNeighbors(n_neighbors=5) 262 | 263 | 264 | 265 | #train our model with the data that we previously precessed 266 | svm.fit(features_train,labels_train) 267 | 268 | #now test our model with the test data 269 | predicted_labels = svm.predict(features_validation) 270 | accuracy = accuracy_score(labels_validation, predicted_labels) 271 | print 'Classification accuracy: ',accuracy*100,'\n' 272 | 273 | #see the accuracy in training procedure 274 | predicted_labels_train = svm.predict(features_train) 275 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 276 | print 'Training accuracy: ',accuracy_train*100,'\n' 277 | 278 | #confusion matrix to illustrate the faulty classification of each class 279 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 280 | print 'Confusion matrix: \n',conf_matrix,'\n' 281 | print 'Support class 0 class 1:' 282 | #calculate the support of each class 283 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 284 | 285 | #calculate the accuracy of each class 286 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 287 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 288 | 289 | #see the inside details of the classification 290 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 291 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 292 | 293 | 294 | #try 5-fold cross validation 295 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 296 | print 'cross validation scores for 5-fold',scores,'\n' 297 | print 'parameters of the model: \n',svm.get_params(),'\n' 298 | 299 | print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 300 | 301 | return svm.support_vectors_ 302 | 303 | ''' 304 | read all the files from both directories based on the keyboard input HC for healthy cases, PD fro parkinson disease 305 | ''' 306 | def mainParkinson(): 307 | general_feature_list = [] 308 | general_label_list = [] 309 | folder = raw_input('Give the name of the folder that you want to read data: ') 310 | if(folder == 'PD'): 311 | healthyCases = os.listdir(PATH) 312 | for x in healthyCases: 313 | wav = 
'/'+folder+'/'+str(x) 314 | mfcc_features,inputWav = mfcc_features_extraction(wav) 315 | mean_features(mfcc_features,inputWav,folder,general_feature_list,general_label_list) 316 | folder = raw_input('Give the name of the folder that you want to read data: ') 317 | if(folder == 'HC'): 318 | parkinsonCases = os.listdir(PATH) 319 | for x in parkinsonCases: 320 | wav = '/'+folder+'/'+str(x) 321 | mfcc_features,inputWav = mfcc_features_extraction(wav) 322 | mean_features(mfcc_features,inputWav,folder,general_feature_list,general_label_list) 323 | #print general_feature_list, general_label_list 324 | #writeFeatures(general_feature_list,general_label_list,wav,folder) 325 | classifyPHC(general_feature_list,general_label_list) 326 | 327 | ''' 328 | main function, this example is for male,female classification 329 | given an input from the keyboard that determines the name of the File from which we want to read the samples, and 330 | the number of the samples that we want to read 331 | ''' 332 | def mainMaleFemale(): 333 | folder = raw_input('Give the name of the folder that you want to read data: ') 334 | amount = raw_input('Give the number of samples in the specific folder: ') 335 | for x in range(1,int(amount)+1): 336 | wav = '/'+folder+'/'+str(x)+'.wav' 337 | print wav 338 | mfcc_features,inputWav = mfcc_features_extraction(wav) 339 | mean_features(mfcc_features,inputWav,folder) 340 | 341 | 342 | 343 | def main(): 344 | #calculate the time 345 | import time 346 | start_time = time.time() 347 | 348 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 349 | mainParkinson() 350 | 351 | time = time.time()-start_time 352 | print 'time: ',time 353 | 354 | 355 | main() 356 | 357 | 358 | 359 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_balancedSampleNumber_greedySearch.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | 9 | 10 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 11 | the form feature1.........feature13,Label''' 12 | def readFile(): 13 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 14 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 15 | 'Feature10','Feature11','Feature12','Feature13','Label'] 16 | 17 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 
18 | path = 'PATH_TO_SAMPLES.txt' 19 | #read file in csv format 20 | data = pd.read_csv(path,names=names ) 21 | 22 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 23 | return data 24 | 25 | 'takes the csv file and split the label from the features' 26 | def splitData(data): 27 | # Split-out the set in two different arrayste 28 | array = data.values 29 | #features array contains only the features of the samples 30 | features = array[:,0:13] 31 | #labels array contains only the lables of the samples 32 | labels = array[:,13] 33 | 34 | return features,labels 35 | 36 | ''' 37 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 38 | than class1, particularly it is 9 to 1.''' 39 | def equalizeClasses(data): 40 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 41 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 42 | class1 = data.loc[data['Label'] == 1] 43 | 44 | 45 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 46 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 47 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 48 | 49 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 50 | 51 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 52 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 53 | class0 = class0.sample(frac=1) 54 | 55 | #samples array for training taking the balance number of samples for the shuffled dataFrame 56 | newClass0 = class0.sample(n=balance) 57 | 58 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 59 | newData = pd.concat([newClass0, class1]) 60 | 61 | #return the new balanced(number of samples from each class) dataFrame 62 | return newData 63 | 64 | 65 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 66 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 67 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 68 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 69 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 70 | def paramTuning(features_train, labels_train, nfolds): 71 | #using the training data and define the number of folds 72 | #determine the range of the Cs range you want to search 73 | Cs = [1000, 10000, 10000, 1000000] 74 | 75 | #determine the range of the gammas range you want to search 76 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 77 | 78 | #make the dictioanry 79 | param_grid = {'C': Cs, 'gamma': gammas} 80 | 81 | #start the greedy search using all the matching sets from above 82 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 83 | 84 | #fit your training data 85 | grid_search.fit(features_train, labels_train) 86 | 87 | #visualize the best couple of parameters 88 | return grid_search.best_params_ 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 99 | def classifyPHC(): 100 | data = readFile() 101 | data = equalizeClasses(data) 102 | features,labels = splitData(data) 103 | 104 | #determine the training and testing size in the range of 1, 1 = 100% 105 | validation_size = 0.2 106 | 107 | #here we are splitting our data based on the validation_size into training and testing data 108 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 109 | test_size=validation_size) 110 | 111 | #we can see the shapes of the array just to check 112 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 113 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 114 | 115 | #take the best couple of parameters from the procedure of greedy search 116 | #C_best, gamma_best = paramTuning(features_train, labels_train, 5) 117 | 118 | #we initialize our model 119 | svm = SVC(kernel='rbf',C=1000,gamma=1e-07) 120 | 121 | 122 | #train our model with the data that we previously precessed 123 | svm.fit(features_train,labels_train) 124 | 125 | #now test our model with the test data 126 | predicted_labels = svm.predict(features_validation) 127 | accuracy = accuracy_score(labels_validation, predicted_labels) 128 | print 'Classification accuracy: ',accuracy*100,'\n' 129 | 130 | #confusion matrix to illustrate the faulty classification of each class 131 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 132 | print 'Confusion matrix: \n',conf_matrix,'\n' 133 | print 'Support class 0 class 1:' 134 | #calculate the support of each class 135 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 136 | 137 | #calculate the accuracy of each class 138 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 139 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 140 | 141 | #see the inside details of the classification 142 | print 'For class 0 healthy cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 143 | print 'For class 1 parkinson cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 144 | 145 | #try 5-fold cross validation 146 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 147 | print 'cross validation scores for 5-fold',scores,'\n' 148 | print 'parameters of the model: 
\n',svm.get_params(),'\n' 149 | 150 | print 'number of samples used as support vectors',len(svm.support_vectors_) 151 | 152 | 153 | 154 | classifyPHC() 155 | 156 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_default.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import pandas as pd 3 | from sklearn import model_selection 4 | from sklearn.svm import SVC # support vectors for classification 5 | from sklearn.metrics import accuracy_score, confusion_matrix 6 | 7 | 8 | 'this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in' 9 | 'the form feature1.........feature13,Label' 10 | def readFile(): 11 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 12 | #sample hc/pc : helathy case, parkinson case 13 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 14 | 'Feature10','Feature11','Feature12','Feature13','Label'] 15 | 16 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 17 | path = 'PATH_TO_SAMPLES.txt' 18 | #read file in csv format 19 | data = pd.read_csv(path,names=names ) 20 | 21 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 22 | return data 23 | 24 | 'takes the csv file and split the label from the features' 25 | def splitData(data): 26 | # Split-out the set in two different arrayste 27 | array = data.values 28 | #features array contains only the features of the samples 29 | features = array[:,0:13] 30 | #labels array contains only the lables of the samples 31 | labels = array[:,13] 32 | 33 | return features,labels 34 | 35 | 36 | 'Building a model which is going to be trained with of given cases and test according to new ones' 37 | def classifyPHC(): 38 | data = readFile() 39 | features,labels = splitData(data) 40 | 41 | #determine the training and testing size in the range of 1, 1 = 100% 42 | validation_size = 0.2 43 | 44 | #here we are splitting our data based on the validation_size into training and testing data 45 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 46 | test_size=validation_size) 47 | 48 | #we can see the shapes of the array just to check 49 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 50 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 51 | 52 | #we initialize our model 53 | svm = SVC(kernel='sigmoid',C=0.8) 54 | 55 | #train our model with the data that we previously precessed 56 | svm.fit(features_train,labels_train) 57 | 58 | #now test our model with the test data 59 | predicted_labels = svm.predict(features_validation) 60 | accuracy = accuracy_score(labels_validation, predicted_labels) 61 | print 'Classification accuracy: ',accuracy,'\n' 62 | 63 | #confusion matrix to illustrate the faulty classification of each class 64 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 65 | print 'Confusion matrix: \n',conf_matrix,'\n' 66 | print 'Support class 0 class 1:' 67 | #calculate the support of each class 68 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 69 | 
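#reminder: in scikit-learn's confusion_matrix the rows are the true classes and the columns the
#predicted ones, so conf_matrix[0][0] counts healthy samples classified as healthy,
#conf_matrix[0][1] healthy samples misclassified as parkinson, and conf_matrix[1][0]/conf_matrix[1][1]
#the analogous counts for the parkinson class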
70 | #see the inside details of the classification 71 | print 'For class 0 healthy cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified \n' 72 | print 'For class 1 parkinson cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified \n' 73 | 74 | 75 | 76 | classifyPHC() 77 | 78 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_keeping_supportVectors.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | import numpy as np 10 | import itertools 11 | import sys 12 | from sklearn.preprocessing import MinMaxScaler 13 | 14 | 15 | 16 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 17 | the form feature1.........feature13,Label''' 18 | def readFile(): 19 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 20 | #sample hc/pc : helathy case, parkinson case 21 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 22 | 'Feature10','Feature11','Feature12','Feature13','Label'] 23 | 24 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 25 | path = 'PATH_TO_WAV_SAMPLES.txt' 26 | #read file in csv format 27 | data = pd.read_csv(path,names=names ) 28 | 29 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 30 | return data 31 | 32 | 'takes the csv file and split the label from the features' 33 | def splitData(data): 34 | # Split-out the set in two different arrayste 35 | array = data.values 36 | #features array contains only the features of the samples 37 | features = array[:,0:13] 38 | #labels array contains only the lables of the samples 39 | labels = array[:,13] 40 | 41 | return features,labels 42 | 43 | ''' 44 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 45 | than class1, particularly it is 9 to 1. 
We made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 46 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 47 | vectors, the samples only the class which we are taking a piece of it's samples''' 48 | '''''' 49 | def equalizeClasses(data): 50 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 51 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 52 | class1 = data.loc[data['Label'] == 1] 53 | 54 | 55 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 56 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 57 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 58 | 59 | #check division with zero 60 | if(weight == 0): 61 | print 'Now the amount of samples in class0 is smaller than half the amount of samples in class1 because we reduce the class0 samples by taking only the support vectors' 62 | if(len(class0)<(len(class1)/2)): 63 | #if the amount of samples in class0 is below the amount of half of the samples in class1 terminate the script 64 | sys.exit() 65 | else: 66 | #else, take all the samples from class0 67 | weight = 1 68 | else: 69 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 70 | 71 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 72 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 73 | #class0 = class0.sample(frac=1) 74 | 75 | #samples array for training taking the balance number of samples for the shuffled dataFrame 76 | #split the dataFrame based on the weight, so here we are making units of samples in the amount of balance in order 77 | #to train our model with an iteration procedure 78 | newData = np.array_split(class0, weight) 79 | 80 | 81 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 82 | #newData = pd.concat([newClass0, class1]) 83 | 84 | #return the new balanced(number of samples from each class) dataFrame 85 | #return both classes in order to compine them later 86 | return newData, class1, class0 87 | 88 | 89 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 90 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 91 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 92 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 93 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 94 | def paramTuning(features_train, labels_train, nfolds): 95 | #using the training data and define the number of folds 96 | #determine the range of the Cs range you want to search 97 | Cs = [1, 10, 100, 1000, 10000] 98 | 99 | #determine the range of the gammas range you want to search 100 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 101 | 102 | #make the dictioanry 103 | param_grid = {'C': Cs, 'gamma': gammas} 104 | 105 | #start the greedy search using all the matching sets from above 106 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 107 | 108 | #fit your training data 109 | grid_search.fit(features_train, labels_train) 110 | 111 | #visualize the best couple of parameters 112 | return grid_search.best_params_ 113 | 114 | 115 | 116 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 117 | def classifyPHC(data): 118 | #take the array with the units of samples of class0 divided properly to train the model in a balanced dataset 119 | data1,class1,class0 = equalizeClasses(data) 120 | #run this procedure by using all the units 121 | 122 | 123 | support_vectors=[] 124 | for newdata in data1: 125 | data = pd.concat([newdata, class1]) 126 | features,labels = splitData(data) 127 | 128 | #determine the training and testing size in the range of 1, 1 = 100% 129 | validation_size = 0.2 130 | 131 | #here we are splitting our data based on the validation_size into training and testing data 132 | features_train_unscaled, features_validation_unscaled, labels_train, labels_validation = model_selection.train_test_split(features, labels, 133 | test_size=validation_size) 134 | 135 | #normalize data in the range [-1,1] 136 | scaler = MinMaxScaler(feature_range=(-1, 1)) 137 | #fit only th training data in order to find the margin and then test to data without normalize them 138 | scaler.fit(features_train_unscaled) 139 | 140 | features_train = scaler.transform(features_train_unscaled) 141 | 142 | #trnasform the validation features without fitting them 143 | features_validation = scaler.transform(features_validation_unscaled) 144 | 145 | 146 | 147 | 148 | 149 | 150 | #we can see the shapes of the array just to check 151 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 152 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 153 | 154 | 155 | #take the best couple of parameters from the procedure of greedy search 156 | #paramTuning(features_train, labels_train, 5) 157 | 158 | #we initialize our model 159 | svm = SVC(kernel='rbf',C=10,gamma=1,decision_function_shape='ovr') 160 | 161 | #train our model with the data that we previously precessed 162 | svm.fit(features_train,labels_train) 163 | 164 | #now test our model with the test data 165 | predicted_labels = svm.predict(features_validation) 166 | accuracy = accuracy_score(labels_validation, predicted_labels) 167 | print 'Classification accuracy: ',accuracy*100,'\n' 168 | 169 | #see the accuracy in training procedure 170 | predicted_labels_train = svm.predict(features_train) 171 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 172 | print 'Training accuracy: ',accuracy_train*100,'\n' 173 | 174 | #confusion matrix to illustrate the faulty classification of each class 175 | conf_matrix = 
confusion_matrix(labels_validation, predicted_labels) 176 | print 'Confusion matrix: \n',conf_matrix,'\n' 177 | print 'Support class 0 class 1:' 178 | #calculate the support of each class 179 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 180 | 181 | #calculate the accuracy of each class 182 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 183 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 184 | 185 | #see the inside details of the classification 186 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 187 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 188 | 189 | 190 | #try 5-fold cross validation 191 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 192 | print 'cross validation scores for 5-fold',scores,'\n' 193 | #print 'parameters of the model: \n',svm.get_params(),'\n' 194 | 195 | print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 196 | 197 | #keep the support vectors of every iteration, until the units of samples of the class0 finishes 198 | #but we undo the scaling because we want to scale again our data based on the new training sample 199 | unscaledSupportVectors = findUnscaledSupportVectors(features_train_unscaled,features_train,svm.support_vectors_) 200 | support_vectors.append(unscaledSupportVectors) 201 | 202 | 203 | 204 | return support_vectors, class1, class0,features_train_unscaled,features_train 205 | 206 | 207 | '''make this function because we need to keep only the support vectors from the class with bigger amount of samples in order to train 208 | the model with the support vectors only the class0 and all the samples from the class1, also we need to remove the duplicates because 209 | it is possible that we took duplicates as support vectors, and to delete the support vectors from class1. In this function we are doing the same procedure as previous in order to classify with SVM, but we are using only the samples 210 | from class0 that in our previous iterations they appear themselves as support vectors and all the samples from the class1. We are 211 | doing this because we have discrepancies in the amount of samples of the two classes. Trying to get better training results.''' 212 | def initSupportVectors(support_vectors, class1,features_train_unscaled,features_train): 213 | flattened_list = [] 214 | 215 | #run the list of lists, every list contains on single samples which is support vector 216 | for x in support_vectors: 217 | #for every samples in the support vector list 218 | for y in x: 219 | flattened_list.append(list(y)) 220 | 221 | 222 | print 'Amount of support vectors with duplicates: ',len(flattened_list) 223 | 224 | #use this command to remove all the duplicates of the list 225 | uniqueSupportVectors = [list(l) for l in set(tuple(l) for l in flattened_list)] 226 | 227 | #now we need to remove all the samples that are support vectors but they come from the class1, so we are going to check which 228 | #samples of our list are in the class1 list as well. 
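    #A compact sketch of the same idea with set operations (an illustrative assumption, not the code used
    #below, and it relies on the rows matching exactly as floats):
    #    unique_rows = set(tuple(row) for row in flattened_list)             #drop duplicate support vectors
    #    class1_rows = set(tuple(row[:13]) for row in class1.values)         #class1 feature rows without the label
    #    class0_support = [list(row) for row in unique_rows - class1_rows]   #keep only class0 support vectors
    #The loop-based implementation of this filtering follows below.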
229 | 230 | #convert the dataFrame into a list which has sublists and every lists contains the features, for exampes the sublist[0] contains 231 | #all the Feature1 of every samples ans so on, so we have to divide it in order to make the real samples 232 | 233 | 234 | 235 | #1825-2102, take every row of the data frame add put it in a list of lists 236 | class1SamplesInaList = [] 237 | #here we iterate the dataFrame, particularly the rows we define with the range 238 | for x in range(1825,2103): 239 | #here we take the specific row 240 | class1Sample = class1.loc[[x]] 241 | #we need to take all the features from this row but the label, because it returns a list of lists we need to join 242 | #this list of lists into one list which contains one single sample of the class1 243 | class1SamplesInaList.append(list(itertools.chain.from_iterable(class1Sample.values.T.tolist()[:13]))) 244 | 245 | #continue this procedure we are going to check if a vector is in both array, if this is true it means tha this vector 246 | #is a support vector because it belongs in the support_vector list and it belongs to the class1 because it is in the 247 | #class1SamplesInaList array so with erase this element from the support vector array, when this procedure is over it means 248 | #that the elements that remain in the support_vector_array is from class one and support_vectors, this is the goal we define. 249 | 250 | #class1SamplesInaList the array which contains as a list every sample of class1 one in a list of lists 251 | #uniqueSupportVectors contains all the samples that used as support_vectors in every iteration erasing the duplicates 252 | #becuase the samples from class1 took place in every iteration 253 | 254 | 255 | #iterate every sample of the class1 in order to check if it exists in the list 256 | for x in class1SamplesInaList: 257 | #if it exists, we need to delete it 258 | if(x in uniqueSupportVectors): 259 | #remove the specific list from the support_vectors that we are going to use 260 | uniqueSupportVectors.remove(x) 261 | 262 | print 'Amount of support vectors without duplicates', len(uniqueSupportVectors),'\n' 263 | 264 | 265 | #so we know that this array contains the support_vectors which are samples only from class0 and there is no duplicates, 266 | #so we have to add a last element in every array to declare the label of the samples and it is going to be 0 because 267 | #we know the class that the samples come from 268 | for x in uniqueSupportVectors: 269 | x.append(0) 270 | 271 | #initialize the dataframe that we want to return 272 | support_vectors_dataframe = pd.DataFrame(columns=['Feature1','Feature2','Feature3','Feature4','Feature5','Feature6','Feature7','Feature8','Feature9','Feature10','Feature11','Feature12','Feature13','Label']) 273 | for x in range(len(uniqueSupportVectors)): 274 | #we need to add the columns and the rows of the dataframe so we are going to do it manually 275 | support_vectors_dataframe.loc[x] = [uniqueSupportVectors[x][y] for y in range(len(uniqueSupportVectors[x]))] 276 | 277 | 278 | #return the dataframe which contains all the support vectors from all the iteration of training with all the units of samples 279 | #of only the class0, and now we are ready to train with them and all the samples of the class1 280 | return pd.concat([support_vectors_dataframe,class1]) 281 | #returns the new data ready to train the model 282 | #the samples from class0 which were support_vectros and all the samples from class1 283 | 284 | 285 | '''because we scale and we scale again the 
same data, we have to find the pre-scaled data and feed our classifier otherwise 286 | it is going to be perfect because of the consecutivr normalizations. In this function we normalize all our data, then we take 287 | the support vectors from class0 from the function initSupportVectors and we are mapping the scaled support vectors to the 288 | pre-scaled samples that they were support vectors after scaling. In order not to scale and scale again the already scaled data. The gene- 289 | ral problem is that we scale again the support vectors of class0, and the samples from class1 they just be scaled once''' 290 | def findUnscaledSupportVectors(features_train_unscaled,features_train,support_vectors): 291 | #we are applying the same normalization because is the same data so we are going to end up with the same results 292 | #normalize data in the range [-1,1] 293 | #scaler = MinMaxScaler(feature_range=(-1, 1)) 294 | #fit only th training data in order to find the margin and then test to data without normalize them 295 | #fit exactly the same data as before 296 | #scaler.fit(features_train_unscaled) 297 | 298 | unScaled_support_vectors = [] 299 | 300 | 301 | #now we mapped the unscaled data with the scaled data as concerns the position in the array 302 | #check all the train features 303 | for searchSample in features_train: 304 | #because we add support_vectors for every iteration, check for all the iterations 305 | #for every array in the array support_vectors 306 | #if what I search is a support vector then 307 | if searchSample in support_vectors: 308 | position = np.where(features_train == searchSample) 309 | #print searchSample 310 | #print features_train[position[0][0]] 311 | #which means that all the lements are equal 312 | if(position[1][len(position[1])-1] == 12): 313 | #then add it to the unscaled support vectors 314 | unScaled_support_vectors.append(features_train_unscaled[position[0][0]]) 315 | 316 | 317 | 318 | print len(support_vectors) 319 | print len(unScaled_support_vectors) 320 | print 'position',position 321 | #sys.exit() 322 | return unScaled_support_vectors 323 | 324 | 325 | 326 | def main(): 327 | #calculate the time 328 | import time 329 | start_time = time.time() 330 | 331 | data = readFile() 332 | 333 | 334 | 335 | while True: 336 | 337 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 338 | support_vectors, class1, class0,features_train_unscaled,features_train = classifyPHC(data) 339 | 340 | 341 | #make class0 list 342 | 343 | #flat the list of lists into one list 344 | data = initSupportVectors(support_vectors, class1,features_train_unscaled,features_train) 345 | 346 | print 'END OF ITERATION NOW WE ARE TRAINING WITH A NEW REDUCED SET OF SUPPORT VECTORS FROM CLASS 0 and data set length',len(data),'\n\n\n\n\n\n\n\n' 347 | 348 | 349 | time = time.time()-start_time 350 | print 'time: ',time 351 | 352 | main() 353 | 354 | 355 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_multiclass.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | 9 | 10 | '''this function takes as an input the path of a 
file with features and labels and returns the content of this file as a csv format in 11 | the form feature1.........feature13,Label''' 12 | def readFile(): 13 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 14 | #sample hc/pc : helathy case, parkinson case 15 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 16 | 'Feature10','Feature11','Feature12','Feature13','Label'] 17 | 18 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 19 | path = 'PATH_TO_SAMPLES.txt' 20 | #read file in csv format 21 | data = pd.read_csv(path,names=names ) 22 | 23 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 24 | return data 25 | 26 | 'takes the csv file and split the label from the features' 27 | def splitData(data): 28 | # Split-out the set in two different arrayste 29 | array = data.values 30 | #features array contains only the features of the samples 31 | features = array[:,0:13] 32 | #labels array contains only the lables of the samples 33 | labels = array[:,13] 34 | 35 | return features,labels 36 | 37 | 38 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 39 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 40 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 41 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 42 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 43 | def paramTuning(features_train, labels_train, nfolds): 44 | #using the training data and define the number of folds 45 | #determine the range of the Cs range you want to search 46 | Cs = [1, 10, 100, 1000, 10000] 47 | 48 | #determine the range of the gammas range you want to search 49 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 50 | 51 | #make the dictioanry 52 | param_grid = {'C': Cs, 'gamma': gammas} 53 | 54 | #start the greedy search using all the matching sets from above 55 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 56 | 57 | #fit your training data 58 | grid_search.fit(features_train, labels_train) 59 | 60 | #visualize the best couple of parameters 61 | print grid_search.best_params_ 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 72 | def classifyPHC(): 73 | data = readFile() 74 | #data = equalizeClasses(data) 75 | features,labels = splitData(data) 76 | 77 | #determine the training and testing size in the range of 1, 1 = 100% 78 | validation_size = 0.2 79 | 80 | #here we are splitting our data based on the validation_size into training and testing data 81 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 82 | test_size=validation_size) 83 | 84 | #we can see the shapes of the array just to check 85 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 86 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 87 | 88 | #take the best couple of parameters from the procedure of greedy search 89 | #paramTuning(features_train, labels_train, 5) 90 | 91 | #we initialize our model 92 | svm = SVC(kernel='rbf',C=100,gamma=1e-05,decision_function_shape='ovr') 93 | 94 | #train our model with the data that we previously precessed 95 | svm.fit(features_train,labels_train) 96 | 97 | 98 | #now test our model with the test data 99 | predicted_labels = svm.predict(features_validation) 100 | accuracy = accuracy_score(labels_validation, predicted_labels) 101 | print 'Classification accuracy: ',accuracy*100,'\n' 102 | 103 | #confusion matrix to illustrate the faulty classification of each class 104 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 105 | print 'Confusion matrix: \n',conf_matrix,'\n' 106 | print 'Support class 0 class 1 class2:' 107 | #calculate the support of each class 108 | print ' ',conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2],' ',conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2],' ',conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2],'\n' 109 | 110 | #calculate the accuracy of each class 111 | edema = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2]))*100 112 | paralysis = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2]))*100 113 | normal = (conf_matrix[2][2]/(conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2]))*100 114 | 115 | #see the inside details of the classification 116 | print 'For class 0 edema cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1]+conf_matrix[0][2],'missclassified,',edema,'accuracy \n' 117 | print 'For class 1 paralysis cases:',conf_matrix[1][1],'classified correctly 
and',conf_matrix[1][0]+conf_matrix[1][2],'missclassified,',paralysis,'accuracy\n' 118 | print 'For class 2 normal cases:',conf_matrix[2][2],'classified correctly and',conf_matrix[2][0]+conf_matrix[2][1],'missclassified,',normal,'accuracy \n' 119 | 120 | #try 5-fold cross validation 121 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 122 | print 'cross validation scores for 5-fold',scores,'\n' 123 | print 'parameters of the model: \n',svm.get_params(),'\n' 124 | 125 | print 'number of samples used as support vectors',len(svm.support_vectors_) 126 | 127 | 128 | 129 | classifyPHC() 130 | 131 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/README.md: -------------------------------------------------------------------------------- 1 | # kpca_lda_knn_equalizeClasses.py 2 | This script uses KernelPCA as a first step to reduce the dimensionality of the data, then LDA to bring the data 3 | down to (number of classes - 1) dimensions, and finally a kNN classifier. 4 | # kpca_lda_knn_multiclass.py 5 | The same pipeline as the script above, but for 3 classes. 6 | # pca_kpca_from-skratch.py 7 | Implementation of Principal Component Analysis and KernelPCA from scratch 8 | # graph_spectral_analysis&spectral_clustering_default.py 9 | Applies dimensionality reduction using graph spectral analysis (LLE, Isomap etc.) and then spectral clustering 10 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/graph_spectral_analysis&spectral_clustering_default.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | from sklearn.preprocessing import MinMaxScaler 10 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 11 | import numpy as np 12 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 13 | import matplotlib.pyplot as plt 14 | from matplotlib.pyplot import figure 15 | import seaborn as sns 16 | from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding, Isomap 17 | from sklearn.cluster import SpectralClustering 18 | from sklearn.metrics.cluster import homogeneity_score 19 | 20 | '''this function takes as an input the path of a file with features and labels and returns the content of this file in csv format in 21 | the form feature1.........feature13,Label''' 22 | def readFile(): 23 | #make the format of the csv file. Our format is a vector with 13 features and a label which shows the condition of the 24 | #sample hc/pc : healthy case, parkinson case 25 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 26 | 'Feature10','Feature11','Feature12','Feature13','Label'] 27 | 28 | #path to read the samples, the samples consist of healthy subjects and subjects suffering from Parkinson's disease. 
29 | #path = 'mfcc_man_woman.txt' 30 | path = 'PATH_TO_SAMPLES.txt' 31 | #path = '/home/gionanide/Theses_2017-2018_2519/features/parkinson_healthy/mfcc_parkinson_healthy.txt' 32 | 33 | #read file in csv format 34 | data = pd.read_csv(path,names=names ) 35 | 36 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 37 | return data 38 | 39 | 'takes the csv file and split the label from the features' 40 | def splitData(data): 41 | # Split-out the set in two different arrayste 42 | array = data.values 43 | #features array contains only the features of the samples 44 | features = array[:,0:13] 45 | #labels array contains only the lables of the samples 46 | labels = array[:,13] 47 | 48 | return features,labels 49 | 50 | ''' 51 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 52 | than class1, particularly it is 9 to 1.''' 53 | def equalizeClasses(data): 54 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 55 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 56 | class1 = data.loc[data['Label'] == 1] 57 | 58 | 59 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 60 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 61 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 62 | 63 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 64 | 65 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 66 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 67 | #class0 = class0.sample(frac=1) 68 | 69 | #samples array for training taking the balance number of samples for the shuffled dataFrame 70 | newClass0 = class0.sample(n=balance) 71 | 72 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 73 | newData = pd.concat([newClass0, class1]) 74 | 75 | #return the new balanced(number of samples from each class) dataFrame 76 | return newData 77 | 78 | '''we made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 79 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 80 | vectors, the samples only the class which we are taking a piece of it's samples''' 81 | def keepSV(): 82 | print 'yolo' 83 | 84 | 85 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 86 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 87 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 88 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 89 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 90 | def paramTuning(features_train, labels_train, nfolds): 91 | #using the training data and define the number of folds 92 | #determine the range of the Cs range you want to search 93 | Cs = [0.001, 0.01, 0.1 ,1, 10, 100, 1000, 10000] 94 | 95 | #determine the range of the gammas range you want to search 96 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 , 1, 10, 100, 1000] 97 | 98 | #make the dictioanry 99 | param_grid = {'C': Cs, 'gamma': gammas} 100 | 101 | #start the greedy search using all the matching sets from above 102 | grid_search = GridSearchCV(SVC(kernel='poly'),param_grid,cv=nfolds) 103 | 104 | #fit your training data 105 | grid_search.fit(features_train, labels_train) 106 | 107 | #visualize the best couple of parameters 108 | print grid_search.best_params_ 109 | 110 | 111 | 112 | '''Classify Parkinson and Helathy. Building a model which is going to be trained with of given cases and test according to new ones''' 113 | def classifyPHC(): 114 | data = readFile() 115 | #data = equalizeClasses(data) 116 | features,labels = splitData(data) 117 | 118 | #determine the training and testing size in the range of 1, 1 = 100% 119 | validation_size = 0.2 120 | 121 | #here we are splitting our data based on the validation_size into training and testing data 122 | #features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 123 | #test_size=validation_size) 124 | 125 | 126 | #we are using all the features because it is clustering so we do not want to split to testing and training 127 | #bacause we apply unsupervised techniques 128 | 129 | #normalize data in the range [-1,1] 130 | scaler = MinMaxScaler(feature_range=(-1, 1)) 131 | #fit only th training data in order to find the margin and then test to data without normalize them 132 | scaler.fit(features) 133 | 134 | features_scalar = scaler.transform(features) 135 | 136 | #trnasform the validation features without fitting them 137 | #features_validation_scalar = scaler.transform(features_validation) 138 | 139 | 140 | #apply the dimensionality reduction using graph spectral analysis 141 | 142 | '''#LocallyLinearEmbedding 143 | 144 | lle = LocallyLinearEmbedding(n_components=2) 145 | 146 | 147 | #transform data 148 | features_embedded = lle.fit_transform(features_scalar)''' 149 | 150 | '''#Isometric Mapping 151 | 152 | isomap = Isomap(n_components=2) 153 | 154 | 155 | #transform data 156 | features_embedded = isomap.fit_transform(features_scalar)''' 157 | 158 | #Graph embedding 159 | 160 | spectralEmbedding = SpectralEmbedding(n_components=2) 161 | 162 | #transform training and validation data 163 | features_embedded = spectralEmbedding.fit_transform(features_scalar) 164 | 165 | 166 | 167 | #we can see the shapes of the array just to check 168 | print 'feature training array: ',features_embedded.shape #,'and label training array: ',labels_train.shape 169 | #print 'feature testing array: ',features_validation_embedded.shape,'and label testing array: ',labels_validation.shape,'\n' 170 | 171 | 172 | #take the best couple of parameters from the procedure of greedy search 173 | #paramTuning(features_train, labels_train, 5) 174 | 175 | #we initialize our model 176 | #svm = SVC(kernel='poly',C=0.001,gamma=10,degree=3,decision_function_shape='ovr') 177 | #svm = KNeighborsClassifier(n_neighbors=3) 178 | 179 | #Apply spectral clustering 
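    #A brief usage sketch of this step (illustrative assumption; the actual call used by the script follows):
    #SpectralClustering builds an affinity graph over the embedded samples, takes the eigenvectors of its graph
    #Laplacian and runs k-means on that spectral embedding, e.g.
    #    clustering = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=10)
    #    cluster_labels = clustering.fit_predict(features_embedded)
    #Because cluster indices are arbitrary, label-invariant measures such as homogeneity_score (used below) or
    #sklearn.metrics.adjusted_rand_score are preferable to plain accuracy when comparing against the true labels.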
180 | 181 | spectralClustering = SpectralClustering(n_clusters=2) 182 | 183 | 184 | 185 | 186 | #train our model with the data that we previously precessed 187 | #spectralClustering.fit(features_embedded ) 188 | 189 | #now test our model with the test data 190 | spectralClustering.fit(features_embedded) 191 | 192 | predicted_labels = spectralClustering.labels_ 193 | 194 | #first implementation of score computing 195 | #accuracy = accuracy_score(labels, predicted_labels) 196 | 197 | 198 | #More accurate implementation, considering opposite labels 199 | accuracy = homogeneity_score(labels, predicted_labels) 200 | print 'Clustering accuracy: ',accuracy*100,'\n' 201 | 202 | 203 | #confusion matrix to illustrate the faulty classification of each class 204 | conf_matrix = confusion_matrix(labels, predicted_labels) 205 | print 'Confusion matrix: \n',conf_matrix,'\n' 206 | print 'Support class 0 class 1:' 207 | #calculate the support of each class 208 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 209 | 210 | #calculate the accuracy of each class 211 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 212 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 213 | 214 | #see the inside details of the classification 215 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 216 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 217 | 218 | 219 | #plot the training features after the kpca and the lda procedure 220 | embedded_labels = pd.DataFrame({'Feature1': features_embedded[: ,0], 'Feature2': features_embedded[: ,1],'Label': labels}) 221 | sns.pairplot(embedded_labels, hue='Label') 222 | #plt.savefig('kpca_trainset_parkinson_healthy.png') 223 | #plt.show() 224 | 225 | #plot the training features after the kpca and the lda procedure 226 | embedded_predicted_labels = pd.DataFrame({'Feature1': features_embedded[: ,0], 'Feature2': features_embedded[: ,1],'Label': predicted_labels}) 227 | sns.pairplot(embedded_predicted_labels, hue='Label') 228 | #plt.savefig('kpca_trainset_parkinson_healthy.png') 229 | plt.show() 230 | 231 | 232 | 233 | def main(): 234 | #calculate the time 235 | import time 236 | start_time = time.time() 237 | 238 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 239 | #support_vectors = 240 | classifyPHC() 241 | 242 | time = time.time()-start_time 243 | print 'time: ',time 244 | 245 | main() 246 | 247 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/kpca_lda_knn_equalizeClasses.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | from sklearn.preprocessing import MinMaxScaler 10 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 11 | from sklearn.decomposition import IncrementalPCA, PCA, KernelPCA 12 | import numpy as np 13 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 14 | 
import matplotlib.pyplot as plt 15 | from matplotlib.pyplot import figure 16 | import seaborn as sns 17 | 18 | 19 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 20 | the form feature1.........feature13,Label''' 21 | def readFile(): 22 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 23 | #sample hc/pc : helathy case, parkinson case 24 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 25 | 'Feature10','Feature11','Feature12','Feature13','Label'] 26 | 27 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 28 | path = 'PATH_TO_SAMPLES.txt' 29 | #path = '/home/gionanide/Theses_2017-2018_2519/features/parkinson_healthy/mfcc_parkinson_healthy.txt' 30 | 31 | #read file in csv format 32 | data = pd.read_csv(path,names=names ) 33 | 34 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 35 | return data 36 | 37 | 'takes the csv file and split the label from the features' 38 | def splitData(data): 39 | # Split-out the set in two different arrayste 40 | array = data.values 41 | #features array contains only the features of the samples 42 | features = array[:,0:13] 43 | #labels array contains only the lables of the samples 44 | labels = array[:,13] 45 | 46 | return features,labels 47 | 48 | ''' 49 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 50 | than class1, particularly it is 9 to 1.''' 51 | def equalizeClasses(data): 52 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 53 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 54 | class1 = data.loc[data['Label'] == 1] 55 | 56 | 57 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 58 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 59 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 60 | 61 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 62 | 63 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 64 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 65 | #class0 = class0.sample(frac=1) 66 | 67 | #samples array for training taking the balance number of samples for the shuffled dataFrame 68 | newClass0 = class0.sample(n=balance) 69 | 70 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 71 | newData = pd.concat([newClass0, class1]) 72 | 73 | #return the new balanced(number of samples from each class) dataFrame 74 | return newData 75 | 76 | '''we made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 77 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 78 | vectors, the samples only the class which we are taking a piece of it's samples''' 79 | #def keepSV(): 80 | 81 | 82 | 83 | '''we use this function 
in order to apply grid search for finding the parameters that best fit our model. We have to mention 84 | that we started this procedure from a very large range and then tried to focus on the direction where the results 85 | appeared better. For example, for the C parameter the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000]; the result was that 86 | the best value was 1000, so we then tried [100, 1000, 10000, 100000] and so on in order to focus on the area which gives us 87 | the best results. This function call is commented out because we found the best parameters and we don't need to run it in every trial.''' 88 | def paramTuning(features_train, labels_train, nfolds): 89 | #use the training data and define the number of folds 90 | #determine the range of C values you want to search 91 | Cs = [0.001, 0.01, 0.1 ,1, 10, 100, 1000, 10000] 92 | 93 | #determine the range of gamma values you want to search 94 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 , 1, 10, 100, 1000] 95 | 96 | #make the dictionary 97 | param_grid = {'C': Cs, 'gamma': gammas} 98 | 99 | #start the grid search over all the parameter combinations from above 100 | grid_search = GridSearchCV(SVC(kernel='poly'),param_grid,cv=nfolds) 101 | 102 | #fit your training data 103 | grid_search.fit(features_train, labels_train) 104 | 105 | #print the best pair of parameters 106 | print grid_search.best_params_ 107 | 108 | 109 | 110 | '''Classify Parkinson vs. Healthy. Build a model which is trained on the given cases and tested on new ones''' 111 | def classifyPHC(): 112 | data = readFile() 113 | #data = equalizeClasses(data) 114 | features,labels = splitData(data) 115 | 116 | #determine the training and testing size in the range of 1, 1 = 100% 117 | validation_size = 0.2 118 | 119 | #here we are splitting our data based on the validation_size into training and testing data 120 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 121 | test_size=validation_size) 122 | 123 | 124 | #normalize data in the range [-1,1] 125 | scaler = MinMaxScaler(feature_range=(-1, 1)) 126 | #fit only the training data in order to find the scaling range, then transform the test data without refitting 127 | scaler.fit(features_train) 128 | 129 | features_train_scalar = scaler.transform(features_train) 130 | 131 | #transform the validation features without fitting them 132 | features_validation_scalar = scaler.transform(features_validation) 133 | 134 | 135 | #define the KernelPCA, and determine the dimension you want to end up with 136 | pca = KernelPCA(n_components=6,kernel='rbf',fit_inverse_transform=True) 137 | 138 | #fit only the training features 139 | pca.fit(features_train_scalar) 140 | 141 | #dimensionality reduction of the training features 142 | features_train_pca = pca.transform(features_train_scalar) 143 | 144 | #dimensionality reduction of the validation features 145 | features_validation_pca = pca.transform(features_validation_scalar) 146 | 147 | #reconstruct the training data in order to estimate the reconstruction error 148 | reconstruct_data = pca.inverse_transform(features_train_pca) 149 | error_matrix = np.absolute(features_train_scalar - reconstruct_data) #element-wise absolute reconstruction error 150 | error_percentage = (sum(sum(error_matrix))/(len(features_train_scalar)*len(features_train_scalar[0])))*100 151 | 152 | #len(features_train_scalar) = len(reconstruct_data) = 89 153 | #len(features_train_scalar[0]) = len(reconstruct_data[0]) = 13 154 | 155 | #len(error_matrix) = 89, one row for every sample 156 | #len(error_matrix[0]) = 13, for every feature of every 
sample 157 | #we take the sum and we conlcude in an array which has the sum for every feature (error) 158 | #so we take the sum again and we divide it with the 89 samples * 13 features 159 | print 'Information loss of KernelPCA:',error_percentage,'% \n' 160 | 161 | 162 | lda = LinearDiscriminantAnalysis() 163 | 164 | lda.fit(features_train_pca,labels_train) 165 | 166 | features_train_pca = lda.transform(features_train_pca) 167 | 168 | features_validation_pca = lda.transform(features_validation_pca) 169 | 170 | #we can see the shapes of the array just to check 171 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 172 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 173 | 174 | 175 | #take the best couple of parameters from the procedure of greedy search 176 | #paramTuning(features_train, labels_train, 5) 177 | 178 | #we initialize our model 179 | #svm = SVC(kernel='poly',C=0.001,gamma=10,degree=3,decision_function_shape='ovr') 180 | svm = KNeighborsClassifier(n_neighbors=3) 181 | 182 | 183 | 184 | 185 | #train our model with the data that we previously precessed 186 | svm.fit(features_train_pca,labels_train) 187 | 188 | #now test our model with the test data 189 | predicted_labels = svm.predict(features_validation_pca) 190 | accuracy = accuracy_score(labels_validation, predicted_labels) 191 | print 'Classification accuracy: ',accuracy*100,'\n' 192 | 193 | #see the accuracy in training procedure 194 | predicted_labels_train = svm.predict(features_train_pca) 195 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 196 | print 'Training accuracy: ',accuracy_train*100,'\n' 197 | 198 | #confusion matrix to illustrate the faulty classification of each class 199 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 200 | print 'Confusion matrix: \n',conf_matrix,'\n' 201 | print 'Support class 0 class 1:' 202 | #calculate the support of each class 203 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 204 | 205 | #calculate the accuracy of each class 206 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 207 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 208 | 209 | #see the inside details of the classification 210 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 211 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 212 | 213 | 214 | #try 5-fold cross validation 215 | scores = cross_val_score(svm, features_train_pca, labels_train, cv=5) 216 | print 'cross validation scores for 5-fold',scores,'\n' 217 | print 'parameters of the model: \n',svm.get_params(),'\n' 218 | 219 | #print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 220 | 221 | #return svm.support_vectors_ 222 | 223 | '''#plot the training features before the kpca and the lda procedure 224 | kpca_lda = pd.DataFrame({'Feature1': features_train[: ,0], 'Feature2': features_train[: ,1],'Feature3': features_train[: ,2], 'Feature4': features_train[: ,3],'Feature5': features_train[: ,4],'Feature6': features_train[: ,5],'Feature7': features_train[: ,6],'Feature8': features_train[: ,7],'Feature9': features_train[: ,8],'Feature10': features_train[: ,9],'Feature11': features_train[: ,10],'Feature12': features_train[: 
,11],'Feature13': features_train[: ,12],'Label': labels_train}) 225 | #'Feature10','Feature11','Feature12','Feature13','Label']) 226 | sns.pairplot(kpca_lda, hue='Label') 227 | plt.savefig('training_features_female_male.png') 228 | #plt.show() 229 | 230 | #plot the validation features before the kpca and the lda procedure 231 | kpca_lda = pd.DataFrame({'Feature1': features_validation[: ,0], 'Feature2': features_validation[: ,1],'Feature3': features_validation[: ,2], 'Feature4': features_validation[: ,3],'Feature5': features_validation[: ,4],'Feature6': features_validation[: ,5],'Feature7': features_validation[: ,6],'Feature8': features_validation[: ,7],'Feature9': features_validation[: ,8],'Feature10': features_validation[: ,9],'Feature11': features_validation[: ,10],'Feature12': features_validation[: ,11],'Feature13': features_validation[: ,12],'Label': labels_validation}) 232 | #'Feature10','Feature11','Feature12','Feature13','Label']) 233 | sns.pairplot(kpca_lda, hue='Label') 234 | plt.savefig('validation_features_female_male.png') 235 | #plt.show() 236 | 237 | #plot the training features after the kpca and the lda procedure 238 | kpca_lda = pd.DataFrame({'Feature1': features_train_pca[:, 0],'Label': labels_train}) 239 | sns.pairplot(kpca_lda, hue='Label') 240 | plt.savefig('kpca_lda_knn_trainingset_female_male.png') 241 | #plt.show() 242 | 243 | #plot the validation features after the kpca and the lda procedure 244 | kpca_lda = pd.DataFrame({'Feature1': features_validation_pca[:, 0],'Label': labels_validation}) 245 | sns.pairplot(kpca_lda, hue='Label') 246 | plt.savefig('kpca_lda_knn_validationset_female_male.png') 247 | plt.show()''' 248 | 249 | def main(): 250 | #calculate the time 251 | import time 252 | start_time = time.time() 253 | 254 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 255 | #support_vectors = 256 | classifyPHC() 257 | 258 | time = time.time()-start_time 259 | print 'time: ',time 260 | 261 | main() 262 | 263 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/kpca_lda_knn_multiclass.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | from sklearn.preprocessing import MinMaxScaler 9 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 10 | from sklearn.decomposition import IncrementalPCA, PCA, KernelPCA 11 | import numpy as np 12 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 13 | import matplotlib.pyplot as plt 14 | from matplotlib.pyplot import figure 15 | import seaborn as sns 16 | 17 | 18 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 19 | the form feature1.........feature13,Label''' 20 | def readFile(): 21 | #make the format of the csv file. 
Our format is a vector with 13 features and a label which show the condition of the 22 | #sample hc/pc : helathy case, parkinson case 23 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 24 | 'Feature10','Feature11','Feature12','Feature13','Label'] 25 | 26 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 27 | path = 'PATH_TO_SAMPLES.txt' 28 | #read file in csv format 29 | data = pd.read_csv(path,names=names ) 30 | 31 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 32 | return data 33 | 34 | 'takes the csv file and split the label from the features' 35 | def splitData(data): 36 | # Split-out the set in two different arrayste 37 | array = data.values 38 | #features array contains only the features of the samples 39 | features = array[:,0:13] 40 | #labels array contains only the lables of the samples 41 | labels = array[:,13] 42 | 43 | return features,labels 44 | 45 | 46 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 47 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 48 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 49 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 50 | the best results. This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 51 | def paramTuning(features_train, labels_train, nfolds): 52 | #using the training data and define the number of folds 53 | #determine the range of the Cs range you want to search 54 | Cs = [0.001 ,0.01 ,0.1 ,1 , 10, 100, 1000, 10000] 55 | 56 | #determine the range of the gammas range you want to search 57 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001 , 0.0001, 0.001, 0.01, 0.1, 1, 10, 100] 58 | 59 | #make the dictioanry 60 | param_grid = {'C': Cs, 'gamma': gammas} 61 | 62 | #start the greedy search using all the matching sets from above 63 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 64 | 65 | #fit your training data 66 | grid_search.fit(features_train, labels_train) 67 | 68 | #visualize the best couple of parameters 69 | print grid_search.best_params_ 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 80 | def classifyPHC(): 81 | data = readFile() 82 | #data = equalizeClasses(data) 83 | features,labels = splitData(data) 84 | 85 | #determine the training and testing size in the range of 1, 1 = 100% 86 | validation_size = 0.2 87 | 88 | #here we are splitting our data based on the validation_size into training and testing data 89 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 90 | test_size=validation_size) 91 | 92 | #normalize data in the range [-1,1] 93 | scaler = MinMaxScaler(feature_range=(-1, 1)) 94 | #fit only th training data in order to find the margin and then test to data without normalize them 95 | scaler.fit(features_train) 96 | 97 | features_train_scalar = scaler.transform(features_train) 98 | 99 | #trnasform the validation features without fitting them 100 | 
features_validation_scalar = scaler.transform(features_validation) 101 | 102 | 103 | #define the KernelPCA, and determine the dimension you want to end up with 104 | pca = KernelPCA(n_components=5,kernel='rbf',fit_inverse_transform=True) 105 | 106 | #fit only the training features 107 | pca.fit(features_train_scalar) 108 | 109 | #dimensionality reduction of the training features 110 | features_train_pca = pca.transform(features_train_scalar) 111 | 112 | #dimensionality reduction of the validation features 113 | features_validation_pca = pca.transform(features_validation_scalar) 114 | 115 | #reconstruct the training data in order to estimate the reconstruction error 116 | reconstruct_data = pca.inverse_transform(features_train_pca) 117 | 118 | error_matrix = np.absolute(features_train_scalar - reconstruct_data) #element-wise absolute reconstruction error 119 | error_percentage = (sum(sum(error_matrix))/(len(features_train_scalar)*len(features_train_scalar[0])))*100 120 | 121 | #len(features_train_scalar) = len(reconstruct_data) = 89 122 | #len(features_train_scalar[0]) = len(reconstruct_data[0]) = 13 123 | 124 | #len(error_matrix) = 89, one row for every sample 125 | #len(error_matrix[0]) = 13, for every feature of every sample 126 | #we take the sum and end up with an array which has the error sum for every feature 127 | #so we take the sum again and divide it by the 89 samples * 13 features 128 | print 'Information loss of KernelPCA:',error_percentage,'% \n' 129 | 130 | 131 | lda = LinearDiscriminantAnalysis() 132 | 133 | lda.fit(features_train_pca,labels_train) 134 | 135 | features_train_pca = lda.transform(features_train_pca) 136 | 137 | features_validation_pca = lda.transform(features_validation_pca) 138 | 139 | 140 | 141 | #we can see the shapes of the arrays just to check 142 | print 'feature training array: ',features_train_pca.shape,'and label training array: ',labels_train.shape 143 | print 'feature testing array: ',features_validation_pca.shape,'and label testing array: ',labels_validation.shape,'\n' 144 | 145 | #take the best pair of parameters from the grid search procedure 146 | #paramTuning(features_train, labels_train, 5) 147 | 148 | #we initialize our model 149 | #svm = SVC(kernel='rbf',C=10,gamma=0.0001,decision_function_shape='ovo') 150 | svm = KNeighborsClassifier(n_neighbors=3) 151 | 152 | #train our model with the data that we previously processed 153 | svm.fit(features_train_pca,labels_train) 154 | 155 | 156 | #now test our model with the test data 157 | predicted_labels = svm.predict(features_validation_pca) 158 | accuracy = accuracy_score(labels_validation, predicted_labels) 159 | print 'Classification accuracy: ',accuracy*100,'\n' 160 | 161 | #see the accuracy of the training procedure 162 | predicted_labels_train = svm.predict(features_train_pca) 163 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 164 | print 'Training accuracy: ',accuracy_train*100,'\n' 165 | 166 | #confusion matrix to illustrate the faulty classification of each class 167 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 168 | print 'Confusion matrix: \n',conf_matrix,'\n' 169 | print 'Support class 0 class 1 class 2:' 170 | #calculate the support of each class 171 | print '   ',conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2],'   ',conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2],'   ',conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2],'\n' 172 | 173 | #calculate the accuracy of each class 174 | edema = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2]))*100 175 | paralysis = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2]))*100 176 | normal = 
(conf_matrix[2][2]/(conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2]))*100 177 | 178 | #see the inside details of the classification 179 | print 'For class 0 edema cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1]+conf_matrix[0][2],'missclassified,',edema,'accuracy \n' 180 | print 'For class 1 paralysis cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0]+conf_matrix[1][2],'missclassified,',paralysis,'accuracy\n' 181 | print 'For class 0 normal cases:',conf_matrix[2][2],'classified correctly and',conf_matrix[2][0]+conf_matrix[2][1],'missclassified,',normal,'accuracy \n' 182 | 183 | #try 5-fold cross validation 184 | scores = cross_val_score(svm, features_train_pca, labels_train, cv=5) 185 | print 'cross validation scores for 5-fold',scores,'\n' 186 | print 'parameters of the model: \n',svm.get_params(),'\n' 187 | 188 | 189 | #PLOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOTS 190 | 191 | #sns.pairplot(data, hue='Label') 192 | #plt.savefig('data_visualization.png') 193 | #plt.title('Data visualization') 194 | #plt.show() 195 | 196 | #print features_train.shape 197 | #print len(features_train[:,0]) 198 | #print len(features_train[:,1]) 199 | #print len(labels_train) 200 | 201 | #plot the training features before the kpca and the lda procedure 202 | #kpca_lda = pd.DataFrame({'Feature1': features_train[: ,0], 'Feature2': features_train[: ,1],'Feature3': features_train[: ,2], 'Feature4': features_train[: ,3],'Feature5': features_train[: ,4],'Feature6': features_train[: ,5],'Feature7': features_train[: ,6],'Feature8': features_train[: ,7],'Feature9': features_train[: ,8],'Feature10': features_train[: ,9],'Feature11': features_train[: ,10],'Feature12': features_train[: ,11],'Feature13': features_train[: ,12],'Label': labels_train}) 203 | #'Feature10','Feature11','Feature12','Feature13','Label']) 204 | #sns.pairplot(kpca_lda, hue='Label') 205 | #plt.savefig('training_features.png') 206 | #plt.show() 207 | 208 | #plot the validation features before the kpca and the lda procedure 209 | #kpca_lda = pd.DataFrame({'Feature1': features_validation[: ,0], 'Feature2': features_validation[: ,1],'Feature3': features_validation[: ,2], 'Feature4': features_validation[: ,3],'Feature5': features_validation[: ,4],'Feature6': features_validation[: ,5],'Feature7': features_validation[: ,6],'Feature8': features_validation[: ,7],'Feature9': features_validation[: ,8],'Feature10': features_validation[: ,9],'Feature11': features_validation[: ,10],'Feature12': features_validation[: ,11],'Feature13': features_validation[: ,12],'Label': labels_validation}) 210 | #'Feature10','Feature11','Feature12','Feature13','Label']) 211 | #sns.pairplot(kpca_lda, hue='Label') 212 | #plt.savefig('validation_features.png') 213 | #plt.show() 214 | 215 | #plot the training features after the kpca and the lda procedure 216 | #kpca_lda = pd.DataFrame({'Feature1': features_train_pca[: ,0], 'Feature2': features_validation_pca[: ,1],'Label': labels_validation}) 217 | #sns.pairplot(kpca_lda, hue='Label') 218 | #plt.savefig('kpca_lda_knn_validationset.png') 219 | #plt.show() 220 | 221 | #plot the validation features after the kpca and the lda procedure 222 | #kpca_lda = pd.DataFrame({'Feature1': features_validation_pca[: ,0], 'Feature2': features_validation_pca[: ,1],'Label': labels_validation}) 223 | #sns.pairplot(kpca_lda, hue='Label') 224 | #plt.savefig('kpca_lda_knn_validationset.png') 225 | #plt.show() 226 | 227 | #print 'number of samples used as support 
vectors',len(svm.support_vectors_) 228 | 229 | 230 | def main(): 231 | import time 232 | start_time = time.time() 233 | classifyPHC() 234 | time = time.time()-start_time 235 | print 'time: ',time 236 | 237 | main() 238 | 239 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/pca_kpca_from-skratch.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import random 4 | import numpy as np 5 | from numpy import cov 6 | from numpy.linalg import eig, inv 7 | 8 | def PrincipalComponentAnalysis(dimensions_output,kernel_option,c): 9 | #make a random array with samples lets say 100 samples with dimension 20 10 | samples = np.random.rand(40,3) 11 | 12 | #print samples 13 | print 'samples shape: ',samples.shape,'\n' 14 | 15 | #for every sample I took the square distance of the mean, this is the variable that I want to maximize 16 | #calculate the mean values of each columns, so we have to transpose the matrix because the argument axis refers to row 17 | mean = np.mean(samples.transpose(),axis=1) 18 | 19 | print 'mean shape: ',mean.shape,'\n' 20 | 21 | #print mean 22 | #print mean.shape 23 | 24 | #we are going to center our matrix(the points) to the origin (0,0) by substracting the column means 25 | #samples = samples - mean 26 | 27 | 28 | #calculate the covariance matrix between two features 29 | #the arrays are inserted as transposed that why they are transposed again 30 | if (kernel_option): 31 | print 'Using KernelPCA with rbf kernel \n' 32 | #here we are taking the dimensions of the samples array, so the dimensions of the data 33 | #for every sample 34 | #initialize a numpy array in the shape of the sample array 35 | covSamples = np.zeros((samples.shape[0],samples.shape[1])) 36 | for x in range(samples.shape[0]): 37 | #for all the dimensions of the samples 38 | for y in range(samples.shape[1]): 39 | #insert in numpy array for the first row(first sample) the first column is the first feature 40 | #minus the mean of the first feature 41 | np.put(covSamples[x],y,np.exp(-(np.linalg.norm(samples[x][y] - mean[y])**2/c))) 42 | #break 43 | #break 44 | #samples = np.absolute(samples - mean)**2/c 45 | #print samples.shape 46 | covSamples = np.matmul(covSamples.transpose(),covSamples) 47 | else: 48 | print 'Using linear PCA \n' 49 | covSamples = np.matmul((samples - mean).transpose(),(samples - mean)) 50 | 51 | print 'covariance matrix shape: ',covSamples.shape,'\n' 52 | print covSamples,'\n' 53 | 54 | #print covSamples.shape 55 | 56 | #print covSamples.shape 57 | #print covSamples 58 | 59 | #find eigenvalues and eigenvectors 60 | eigenvalues, eigenvectors = eig(covSamples) 61 | 62 | print 'eigenvectors shape: ',eigenvectors.shape,'\n' 63 | print eigenvectors,'\n' 64 | 65 | #short eigenvectors 66 | sorted_eigenvalues = eigenvalues.argsort()[::-1] 67 | 68 | print 'sorted eigenvalues: ',sorted_eigenvalues,'\n' 69 | 70 | print 'eigenvalues',eigenvalues,'\n' 71 | 72 | 73 | #deterine the dimensions you want to keep based on the eigenvectors you want to multiple the smples 74 | dimensions = eigenvectors[:, sorted_eigenvalues] 75 | 76 | print 'w array shape: ',dimensions.shape,'\n' 77 | 78 | print 'w array sorted based on eigenvalues, every column represent one eigenvector \n',dimensions,'\n' 79 | 80 | #print dimensions.shape 81 | 82 | w = dimensions[:, :dimensions_output] 83 | 84 | print 'w final shape: ',w.shape,'\n' 85 | print 'w final array: \n',w,'\n' 
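    #--- editor's addition (hedged sketch): two quick sanity checks on the projection matrix w built above.
    #--- They rely only on numpy, which this script already imports as np; nothing else is assumed.
    #the kept eigenvectors should be approximately orthonormal, so w^T w should be close to the identity matrix
    orthonormality_error = np.linalg.norm(np.matmul(w.transpose(), w) - np.eye(w.shape[1]))
    print 'orthonormality error of w (should be close to 0): ', orthonormality_error
    #for the linear-PCA branch, the fraction of the total variance kept by the selected components
    #is the sum of the kept eigenvalues divided by the sum of all eigenvalues
    kept_variance = eigenvalues[sorted_eigenvalues][:dimensions_output].real.sum() / eigenvalues.real.sum()
    print 'fraction of variance kept (meaningful for the linear branch): ', kept_variance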
86 | 87 | 88 | print 'final vector multiplication samples:',samples.shape,'w array:',w.shape,'\n' 89 | 90 | samples = np.dot(samples,w) 91 | #dimensions = np.dot(samples) 92 | 93 | print 'dimensionality reduction samples shape: ',samples.shape,'\n' 94 | #print samples[0] 95 | 96 | #for x in eigenvectors: 97 | # print np.linalg.norm(x) 98 | 99 | 100 | 101 | 102 | 103 | 104 | #PrincipalComponentAnalysis(dimensions_output=1,kernel_option=True,c=1) 105 | -------------------------------------------------------------------------------- /classifiers/gmm.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pickle 4 | import pandas as pd 5 | import numpy as np 6 | from sklearn.mixture import GaussianMixture 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | 10 | 11 | #lpc durbin-levinson 12 | 13 | 14 | #gmm training python 15 | def GaussianMixtureModel_only_for_testing(data): 16 | #A GMM attempts to find a mixture of multidimensional Gaussian probability distributions that best model any input dataset. 17 | #In the simplest case , GMM's can be used for finding clusters in the same manner as k-means. 18 | X,Y = preparingData(data) 19 | #taking only the first two features 20 | #Y = target variable 21 | gmm = GaussianMixture(n_components=2) 22 | #Estimate model parameters with the EM algorithm. 23 | gmm.fit(X) 24 | labels = gmm.predict(X) 25 | print labels 26 | 27 | plt.figure(1) 28 | #because of the probabilistic approach of GMM's it is possible to find a probabilistic cluster assignments. 29 | #porbs : is a matrix [samples, nClusters] which contains the probability of any point belongs to the given cluster 30 | probs = gmm.predict_proba(X).round(3) 31 | #which measures the probability that any point belongs to the given cluster: 32 | print probs 33 | 34 | #we can visualize this uncertainty . For instance let's make the size of each point proortional to the certainty 35 | #of its prediction. We are going to point the points at the boundaries between clusters. 36 | size = 50 * probs.max(1) ** 2 # square emphasizes differences 37 | 38 | #the weights of each mixture components 39 | weights = gmm.weights_ 40 | #the mean of each mixture component 41 | means = gmm.means_ 42 | #the covariance of each mixture component 43 | covars = gmm.covariances_ 44 | 45 | print 'weights: ',weights 46 | print 'means: ', means 47 | 48 | print gmm.score(X) 49 | #Predict the labels for the data samples in X using trained model. 50 | print labels[0] 51 | print Y 52 | print Y[0] 53 | 54 | 55 | #plots 56 | plt.scatter(X[:,5],X[:,6],c=labels,s=40,cmap='viridis') 57 | plt.show() 58 | 59 | def readFeaturesFile(gender): 60 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 61 | 'Feature10','Feature11','Feature12','Feature13','Gender'] 62 | 63 | #check the gender 64 | if(int(gender)==1): 65 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 66 | elif(int(gender)==0): 67 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 68 | else: 69 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 70 | #the outcome is a list of lists containing the samples with the following format 71 | #[charachteristic,feature1,feature2.......,feature13] 72 | #characheristic based on what we want for classification , can be (male , female) , also can be (normal-female,edema-female) 73 | #in general characheristic is the target value . 
74 | return data 75 | 76 | 77 | def preparingData(data): 78 | # Split-out validation dataset 79 | array = data.values 80 | #input 81 | X = array[:,0:13] 82 | #target 83 | Y = array[:,13] 84 | return X,Y 85 | 86 | def GaussianMixtureModel(data,gender): 87 | #A GMM attempts to find a mixture of multidimensional Gaussian probability distributions that best models any input dataset. 88 | #In the simplest case, GMMs can be used for finding clusters in the same manner as k-means. 89 | X,Y = preparingData(data) 90 | #print data.head(n=5) 91 | 92 | #we do not split into training and testing because we already did that on a per-file basis, so the X,Y in this 93 | #function are used to train the model, and another file with another set of X,Y is used in the testModels function to assess the model 94 | 95 | #takes only the first feature to redefine the problem as a 1-D problem 96 | #dataFeature1 = data.as_matrix(columns=data.columns[0:1]) 97 | #plot histogram 98 | #sns.distplot(dataFeature1,bins=20,kde=False) 99 | #plt.show() 100 | 101 | 102 | 103 | #Y = target variable 104 | gmm = GaussianMixture(n_components=8,max_iter=200,covariance_type='diag',n_init=3) 105 | gmm.fit(X) 106 | 107 | 108 | 109 | #save the model to disk 110 | filename = 'finalizedModel_'+gender+'.gmm' 111 | pickle.dump(gmm,open(filename,'wb')) 112 | print 'Model saved in path: PATH_TO'+filename 113 | 114 | 115 | return X 116 | #load the model from disk 117 | '''loadedModel = pickle.load(open(filename,'rb')) 118 | result = loadedModel.score(X) 119 | print result''' 120 | 121 | def testModels(data,threshold_input,x_test,y_test): 122 | gmmFiles = ['PATH_TO/finalizedModel_0.gmm','PATH_TO/finalizedModel_1.gmm'] 123 | models = [pickle.load(open(filename,'rb')) for filename in gmmFiles] #binary mode, matching the 'wb' used when the models were saved 124 | log_likelihood = np.zeros(len(models)) 125 | genders = ['male','female'] 126 | assessModel = [] 127 | prediction = [] 128 | features = x_test #the held-out test samples passed into this function (the original line referenced an undefined X here) 129 | for i in range(len(models)): 130 | gmm = models[i] 131 | scores = np.array(gmm.score(features)) 132 | #first take all the per-sample log likelihoods for the male model and then do the same for the female model 133 | assessModel.append(gmm.score_samples(features)) 134 | log_likelihood[i] = scores.sum() 135 | #the higher the value, the better the model fits the data 136 | for x in range(len(assessModel[0])): 137 | #the division is gmm(male model) / gmm(female model); if the result is > 1 then the example is classified as male 138 | 139 | 140 | #if the log likelihood under the male model is negative and under the female model is positive we do not have to check further 141 | #because the difference is clear and we are fairly sure that it is female 142 | if(assessModel[0][x] < 0 and assessModel[1][x] > 0): 143 | # x / y and x is < 0 , so we have to classify this as female 144 | # we have to be sure that the prediction will be above the threshold 145 | prediction.append(float(threshold_input) + 1) 146 | 147 | #same as above , we need to be sure that the prediction is below the threshold (male) because we are fairly 148 | #sure from the models' outcome that this sample is male 149 | elif(assessModel[0][x] > 0 and assessModel[1][x] < 0): 150 | prediction.append(float(threshold_input) - 1) 151 | else: 152 | prediction.append( abs(( assessModel[0][x] / assessModel[1][x] )) ) 153 | 154 | 155 | #take an array with the predictions and check if they are true (correct classification) or false (wrong classification) 156 | assessment=[] 157 | true_negative=0 158 | true_positive=0 159 | false_positive=0 160 | false_negative=0 161 | for x in range(len(prediction)):#reject option 162 
| if(prediction[x]<1.019 and prediction[x]>1.012): 163 | print prediction[x] , ' can not decide' 164 | elif(prediction[x] 1 then the example is male 59 | 60 | 61 | #if the prediction for male in negative and the prediction for female positive we dont have to check 62 | #because the difference is obvious and we are pretty sure that it is female 63 | if(assessModel[0][x] < 0 and assessModel[1][x] > 0): 64 | # x / y and x is < 0 , so we have to classify this as female 65 | # we have to be sure that the prediction will be above the threshold 66 | prediction.append(float(threshold_input) + 1) 67 | #same as above , we need to be sure that the prediction is below the threshold (male) because we are pretty 68 | #sure from the model's outcome that this sample is female 69 | elif(assessModel[0][x] > 0 and assessModel[1][x] < 0): 70 | prediction.append(float(threshold_input) - 1) 71 | else: 72 | prediction.append( abs(( assessModel[0][x] / assessModel[1][x] )) ) 73 | 74 | 75 | #take an array with the predictions and check if they are true(correct classification) or false(wrong classification) 76 | assessment=[] 77 | condition=[] 78 | true_negative=0 79 | true_positive=0 80 | false_positive=0 81 | false_negative=0 82 | for x in range(len(prediction)): 83 | if(prediction[x]<1.04790 and prediction[x]>0.97890): 84 | print prediction[x] , ' can not decide' 85 | print '\n' 86 | continue 87 | if(prediction[x] 0): 55 | return True 56 | else: 57 | return False 58 | 59 | 60 | 61 | def LR_ROC(data): 62 | #we initialize the random number generator to a const value 63 | #this is important if we want to ensure that the results 64 | #we can achieve from this model can be achieved again precisely 65 | #Axis or axes along which the means are computed. The default is to compute the mean of the flattened array. 66 | mean = np.mean(data,axis=0) 67 | std = np.std(data,axis=0) 68 | #print 'Mean: \n',mean 69 | #print 'Standar deviation: \n',std 70 | X,Y = preparingData(data) 71 | x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.20) 72 | # convert integers to dummy variables (i.e. one hot encoded) 73 | lr = LogisticRegression(class_weight='balanced') 74 | lr.fit(x_train,y_train) 75 | #The score function of sklearn can quickly assess the model performance 76 | #due to class imbalance , we nned to evaluate the model performance 77 | #on every class. Which means to find when we classify people from the first team wrong 78 | 79 | 80 | #feature selection RFE is based on the idea to repeatedly construct a model and choose either the best 81 | #or worst performing feature, setting the feature aside and then repeating the process with the rest of the 82 | #features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select 83 | # features by recursively considering smaller and smaller sets of features 84 | rfe = RFE(lr,13) 85 | rfe = rfe.fit(x_train,y_train) 86 | #print rfe.support_ 87 | 88 | #An index that selects the retained features from a feature vector. 
If indices is False, this is a boolean array of shape 89 | #[# input features], in which an element is True iff its corresponding feature is selected for retention 90 | 91 | #print rfe.ranking_ 92 | 93 | #so we have to take all the features 94 | 95 | #model fitting 96 | 97 | #predicting the test set results and calculating the accuracy 98 | y_pred = lr.predict(x_test) 99 | print 'Accuracy of logistic regression classifier on the test set: ', lr.score(x_test,y_test) 100 | 101 | #cross validation 102 | kfold = model_selection.KFold(n_splits=10,shuffle=True,random_state=7) 103 | modelCV = LogisticRegression() 104 | scoring = 'accuracy' 105 | results = model_selection.cross_val_score(modelCV, x_train,y_train,cv=kfold,scoring=scoring) 106 | print '10-fold cross validation average accuracy: ', results.mean() 107 | 108 | #confusion matrix 109 | confusionMatrix = confusion_matrix(y_test,y_pred) 110 | print 'Confusion matrix: ' 111 | print confusionMatrix 112 | print 'We had ',confusionMatrix[0][0] + confusionMatrix[1][1], 'correct predictions' 113 | print 'And ',confusionMatrix[1][0] + confusionMatrix[0][1],'incorrect prediction' 114 | print '' 115 | 116 | #The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative. 117 | #The recall is intuitively the ability of the classifier to find all the positive samples. 118 | #The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0. 119 | #The support is the number of occurrences of each class in y_test. 120 | 121 | #classification report 122 | print(classification_report(y_test,y_pred)) 123 | 124 | #roc curve 125 | logit_roc_auc = roc_auc_score(y_test, lr.predict(x_test)) 126 | fpr , tpr , thresholds = roc_curve(y_test,lr.predict_proba(x_test)[:,1]) 127 | 128 | #AUC is a measure of the overall performance of a diagnostic test and is 129 | #interpreted as the average value of sensitivity for all possible values of specificity 130 | 131 | fprtpr = np.hstack((fpr[:,np.newaxis],tpr[:,np.newaxis])) 132 | 133 | hull = ConvexHull(fprtpr) 134 | 135 | hull_indices = np.unique(hull.simplices.flat) 136 | hull_points = fprtpr[hull_indices,:] 137 | hull_points_y=[] 138 | hull_points_x=[] 139 | for x in range(len(hull_points)): 140 | coordinates = np.split(hull_points[x],2) 141 | hull_points_x.append(coordinates[0]) 142 | hull_points_y.append(coordinates[1]) 143 | 144 | 145 | 146 | 147 | #this implementation os only for the smooth rock curve 148 | 149 | hull_points_x_curve = [] 150 | hull_points_y_curve = [] 151 | 152 | #determine the starting and ending point 153 | startingPoint = np.split(hull_points[0],2) 154 | print 'starting point: ',startingPoint 155 | print startingPoint[1][0] 156 | endingPoint = np.split(hull_points[len(hull_points)-1],2) 157 | print 'ending point: ',endingPoint 158 | 159 | #append the strting point into the hull 160 | hull_points_x_curve.append(startingPoint[0]) 161 | hull_points_y_curve.append(startingPoint[1]) 162 | 163 | #check if there is a points under the starting and the ending point, only to make the ROC curve 164 | print len(hull_points) 165 | for x in range(1,len(hull_points)-1): 166 | print x 167 | coordinates = np.split(hull_points[x],2) 168 | ifnotUnder = not(isUnder(startingPoint , endingPoint , coordinates)) 169 | print ifnotUnder 170 | if (ifnotUnder): 171 | hull_points_y_curve.append(coordinates[1]) 172 | hull_points_x_curve.append(coordinates[0]) 173 | 174 | #append 
the ending point into the hull 175 | hull_points_x_curve.append(endingPoint[0]) 176 | hull_points_y_curve.append(endingPoint[1]) 177 | 178 | 179 | 180 | 181 | plt.figure(1) 182 | plt.title('ROC curve smooth') 183 | plt.scatter(hull_points_y,hull_points_x) 184 | area_under = metrics.auc(hull_points_y,hull_points_x) 185 | plt.plot(hull_points_x_curve,hull_points_y_curve,label='Area under the curve = %0.2f' %area_under) 186 | plt.legend(loc='lower right') 187 | 188 | 189 | 190 | plt.figure(2) 191 | plt.scatter(fpr,tpr) 192 | plt.title('Convex Hull') 193 | #plt.plot(fpr[hull.vertices],tpr[hull.vertices]) 194 | plt.plot(fprtpr[:,0], fprtpr[:,1], 'o') 195 | for simplex in hull.simplices: 196 | plt.plot(fprtpr[simplex, 0], fprtpr[simplex, 1],'r--',lw=2) 197 | 198 | plt.figure(3) 199 | plt.plot(fpr,tpr,label='Logistic Regression (area = %0.2f)' %logit_roc_auc) 200 | plt.plot([0,1],[0,1],'r--') 201 | plt.xlabel('False positive rate') 202 | plt.ylabel('True positive rate') 203 | plt.title('Receiver operating characteristic') 204 | plt.legend(loc='lower right') 205 | plt.show() 206 | 207 | #It generally means that your model can only provide discrete predictions, rather than a continous score. This can often be 208 | # remedied by adding more samples to your dataset, having more continous features in the model, more features in general or using 209 | # a model specification that provides a continous prediction output. The reason why it occurs in a decision tree is that you 210 | #often do binary splits; this is efficient computationally, but only gives 2^n groupings. Unless your n number of splits are very 211 | #large, you'll only have 16/32/64/128 groups, whereas if you used an algorithm such as logistic regression and used continous 212 | #variables, your prediction would fall in the continous range between 0 and 1. I'm not familiar with the type of data you listed, 213 | # but I suspect you have a lot of categorical data.It's not necessarily a problem to have a ROC that is discrete rather than 214 | #smooth, it really depends on your goals for the model (descriptive vs prescriptive), as well as how well your model fits on 215 | #out-of-sample datasets. Many of the problems I've solved in my career just needed a Yes/No line drawn (such as email this 216 | #person/don't email), so having a continous and smooth prediction along the range of inputs wasn't necessary. 
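A note on the ROC block above (an editorial addition, not part of the original script): logit_roc_auc is computed from the hard lr.predict(x_test) labels, which collapses the classifier to a single operating point, while AUC is normally computed from the positive-class scores, exactly as fpr and tpr already are. A minimal sketch of the conventional, probability-based ROC/AUC with scikit-learn, assuming any fitted binary classifier that exposes predict_proba (function and variable names here are illustrative):

from sklearn.metrics import roc_curve, auc

def smooth_roc_auc(model, x_test, y_test):
    #score with the positive-class probability so the curve has one point per distinct score,
    #instead of the single corner produced by hard 0/1 predictions
    scores = model.predict_proba(x_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    #auc() integrates the curve with the trapezoidal rule
    return fpr, tpr, auc(fpr, tpr)

#usage inside LR_ROC would look like: fpr, tpr, area = smooth_roc_auc(lr, x_test, y_test)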
217 | 218 | 219 | 220 | 221 | def main(): 222 | data = readFeaturesFile() 223 | LR_ROC(data) 224 | 225 | main() 226 | -------------------------------------------------------------------------------- /classifiers/simpleNeuralNetwork.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | import keras 4 | import numpy as np 5 | import matplotlib.pyplot as plt 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Activation , MaxPool2D , Conv2D , Flatten 8 | from keras.optimizers import Adam 9 | import pandas as pd 10 | from sklearn.model_selection import train_test_split 11 | from keras.optimizers import SGD 12 | from sklearn.model_selection import StratifiedKFold 13 | 14 | def readFeaturesFile(): 15 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 16 | 'Feature10','Feature11','Feature12','Feature13','Gender'] 17 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 18 | #the outcome is a list of lists containing the samples with the following format 19 | #[charachteristic,feature1,feature2.......,feature13] 20 | #characheristic based on what we want for classification , can be (male , female) , also can be (normal-female,edema-female) 21 | #in general characheristic is the target value . 22 | return data 23 | 24 | def preparingData(data): 25 | # Split-out validation dataset 26 | array = data.values 27 | #input 28 | X = array[:,0:13] 29 | #target 30 | Y = array[:,13] 31 | 32 | #determine the test and the training size 33 | x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.10) 34 | 35 | 36 | #x_train = 37 | #Encode the labels 38 | #reconstruct the data as a vector with a sequence of 1s and 0s 39 | y_train = keras.utils.to_categorical(y_train, num_classes = 2) 40 | y_test = keras.utils.to_categorical(y_test, num_classes = 2) 41 | '''print y_train.shape 42 | print x_train.shape 43 | print(y_train[0], np.argmax(y_train[0])) 44 | print(y_train[1], np.argmax(y_train[1])) 45 | print(y_train[2], np.argmax(y_train[2])) 46 | print(y_train[3], np.argmax(y_train[3]))''' 47 | return x_train , x_test , y_train , y_test 48 | 49 | 50 | def returnData(data): 51 | # Split-out validation dataset 52 | array = data.values 53 | #input 54 | X = array[:,0:13] 55 | #target 56 | Y = array[:,13] 57 | 58 | #determine the test and the training size 59 | 60 | return X,Y 61 | 62 | 63 | 64 | #Multilayer Perceptron 65 | def testing_NN(data): 66 | X,Y = returnData(data) 67 | 68 | #determine the validation 69 | kfold = StratifiedKFold(n_splits=10,shuffle=True) 70 | #keep the results 71 | cvscores = [] 72 | for train,test in kfold.split(X,Y): 73 | #Define a siple Multilayer Perceptron 74 | model = Sequential() 75 | 76 | #our classification is binary 77 | 78 | #as a first step we have to define the input dimensionality 79 | 80 | 81 | model.add(Dense(14,activation='relu',input_dim=13)) 82 | 83 | 84 | #model.add(Dense(14,activation='relu',input_dim=13)) 85 | model.add(Dense(8, activation='relu')) 86 | 87 | #add another hidden layer 88 | #model.add(Dense(16,activation='relu')) 89 | #the last step , add an output layer (number of neurons = number of classes) 90 | model.add(Dense(1,activation='sigmoid')) 91 | 92 | #select the optimizer 93 | #adam = Adam(lr=0.0001) 94 | adam = Adam(lr=0.001) 95 | #learning rate is between 0.0001 and 0.001 , but it is objective to define it 96 | #because we need out model not to learn to fast and maybe we have overfitting but also 97 | 
#not to slow and take to much time . We can check this with the learning rate curve 98 | 99 | #we select the loss function and metrics that should be monitored 100 | #and then we compile our model 101 | model.compile(loss='binary_crossentropy',optimizer=adam,metrics=['accuracy']) 102 | 103 | #now we train our model 104 | model.fit(X[train],Y[train],epochs=50,batch_size=75,verbose=0) 105 | 106 | # evaluate the model 107 | scores = model.evaluate(X[test], Y[test], verbose=0) 108 | print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100)) 109 | cvscores.append(scores[1] * 100) 110 | 111 | 112 | print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores))) 113 | 114 | 115 | '''x_train , x_test , y_train , y_test = preparingData(data) 116 | 117 | #validation data = test data 118 | #only for the plot 119 | results = model.fit(x_train,y_train,epochs=50,batch_size=75,verbose=2,validation_data=(x_test,y_test)) 120 | 121 | plt.figure(1) 122 | plt.plot(results.history['loss']) 123 | plt.plot(results.history['val_loss']) 124 | plt.legend(['train loss', 'test loss']) 125 | plt.show() 126 | 127 | 128 | 129 | 130 | #now we can evaluate our model 131 | print '\n' 132 | print 'Train accuracy: ' , model.evaluate(x_train,y_train,batch_size=25) 133 | print 'Test accuracy: ',model.evaluate(x_test,y_test,batch_size=25) 134 | 135 | #visualize the actual output of the network 136 | output = model.predict(x_train) 137 | print '\n' 138 | print 'Actual output: ',output[0],np.argmax(output[0]) 139 | 140 | #we can also check our model behaviour in depth 141 | print'\n' 142 | #print the first ten predictions 143 | for x in range(10): 144 | print 'Prediction: ',np.argsort(output[x])[::-1],'True target: ',np.argmax(y_train[x])''' 145 | 146 | #Multilayer Perceptron 147 | def simpleNN(data): 148 | x_train , x_test , y_train , y_test = preparingData(data) 149 | 150 | #because as we can see from the previous function simpleNN the 151 | #test loss is going bigger which means that we have overfitting problem 152 | #here we are going to try to overcome this obstacle 153 | 154 | model = Sequential() 155 | 156 | #The input layer: 157 | '''With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined 158 | once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number 159 | of features (columns) in your data. Some NN configurations add one additional node for a bias term.''' 160 | 161 | model.add(Dense(14,activation='relu',input_dim=13,kernel_initializer='random_uniform')) 162 | 163 | #The output layer 164 | '''If the NN is a classifier, then it also has a single node unless 165 | softmax is used in which case the output layer has one node per 166 | class label in your model.''' 167 | 168 | model.add(Dense(2,activation='softmax')) 169 | 170 | #binary_crossentropy because we have a binary classification model 171 | #Because it is not guaranteed that we are going to find the global optimum 172 | #because we can be trapped in a local minima and the algorithm may think that 173 | #you reach global minima. To avoid this situation, we use a momentum term in the 174 | #objective function, which is a value 0 < momentum < 1 , that increases the size of the steps 175 | #taken towards the minimum by trying to jump from a local minima. 176 | 177 | 178 | #If the momentum term is large then the learning rate should be kept smaller. 
179 | #A large value of momentum means that the convergence will happen fast,but if 180 | #both are kept at large values , then we might skip the minimum with a huge step. 181 | #A small value of momentum cannot reliably avoid local minima, and also slow down 182 | #the training system. We are trying to find the right value of momentum through cross-validation. 183 | model.compile(loss='binary_crossentropy',optimizer=SGD(lr=0.001,momentum=0.6),metrics=['accuracy']) 184 | 185 | #In simple terms , learning rate is how quickly a network abandons old beliefs for new ones. 186 | #Which means that with a higher LR the network changes its mind more quickly , in pur case this means 187 | #how quickly our model update the parameters (weights,bias). 188 | 189 | #verbose: Integer. 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. 190 | 191 | results = model.fit(x_train,y_train,epochs=50,batch_size=50,verbose=2,validation_data=(x_test,y_test)) 192 | 193 | print 'Train accuracy: ' , model.evaluate(x_train,y_train,batch_size=50,verbose=2) 194 | print 'Test accuracy: ',model.evaluate(x_test,y_test,batch_size=50,verbose=2) 195 | 196 | 197 | 198 | #visualize 199 | plt.figure(1) 200 | plt.plot(results.history['loss']) 201 | plt.plot(results.history['val_loss']) 202 | plt.legend(['train loss', 'test loss']) 203 | plt.show() 204 | 205 | print model.summary() 206 | 207 | 208 | 209 | def main(): 210 | data = readFeaturesFile() 211 | simpleNN(data) 212 | 213 | 214 | main() 215 | -------------------------------------------------------------------------------- /feature_extraction_techniques/README.md: -------------------------------------------------------------------------------- 1 | Feature extraction techniques implemented in Python 2 | 3 | MFCC 4 | LPC 5 | PLP 6 | MGCA 7 | -------------------------------------------------------------------------------- /feature_extraction_techniques/lpc.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import wave 5 | import scipy.io.wavfile as wav 6 | from scipy import signal 7 | import scipy as sk 8 | from audiolazy import * 9 | from audiolazy import lpc 10 | from sklearn import preprocessing 11 | import scipy.signal as sig 12 | import scipy.linalg as linalg 13 | 14 | 15 | def readWavFile(wav): 16 | #given a path from the keyboard to read a .wav file 17 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 18 | inputWav = 'PATH_TO_WAV'+wav 19 | return inputWav 20 | 21 | #reading the .wav file (signal file) and extract the information we need 22 | def initialize(inputWav): 23 | rate , signal = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency 24 | sig = wave.open(readWavFile(inputWav)) 25 | # signal is the numpy 2D array with the date of the .wav file 26 | # len(signal) number of samples 27 | sampwidth = sig.getsampwidth() 28 | print 'The sample rate of the audio is: ',rate 29 | print 'Sampwidth: ',sampwidth 30 | return signal , rate 31 | 32 | 33 | #implementation of the low-pass filter 34 | def lowPassFilter(signal, coeff=0.97): 35 | return np.append(signal[0], signal[1:] - coeff * signal[:-1]) #y[n] = x[n] - a*x[n-1] , a = 0.97 , a>0 for low-pass filters 36 | 37 | def preEmphasis(wav): 38 | #taking the signal 39 | signal , rate = initialize(wav) 40 | #Pre-emphasis Stage 41 | preEmphasis = 0.97 42 | emphasizedSignal = lowPassFilter(signal) 43 | Time=np.linspace(0, 
len(signal)/rate, num=len(signal)) 44 | EmphasizedTime=np.linspace(0, len(emphasizedSignal)/rate, num=len(emphasizedSignal)) 45 | #plots using matplotlib 46 | '''plt.figure(figsize=(9, 7)) 47 | plt.subplot(211, facecolor='darkslategray') 48 | plt.title('Signal wave') 49 | plt.ylim(-50000, 50000) 50 | plt.ylabel('Amplitude', fontsize=16) 51 | plt.plot(Time,signal,'C1') 52 | plt.subplot(212, facecolor='darkslategray') 53 | plt.title('Pre-emphasis') 54 | plt.ylim(-50000, 50000) 55 | plt.xlabel('time(s)', fontsize=10) 56 | plt.ylabel('Amplitude', fontsize=16) 57 | plt.plot(EmphasizedTime,emphasizedSignal,'C1') 58 | plt.show()''' 59 | return emphasizedSignal, signal , rate 60 | 61 | 62 | def visualize(rate,signal): 63 | #taking the signal's time 64 | Time=np.linspace(0, len(signal)/rate, num=len(signal)) 65 | #plots using matplotlib 66 | plt.figure(figsize=(10, 6)) 67 | plt.subplot(facecolor='darkslategray') 68 | plt.title('Signal wave') 69 | plt.ylim(-40000, 40000) 70 | plt.ylabel('Amplitude', fontsize=16) 71 | plt.xlabel('Time(s)', fontsize=8) 72 | plt.plot(Time,signal,'C1') 73 | plt.draw() 74 | #plt.show() 75 | 76 | def framing(fs,signal): 77 | #split the signal into frames 78 | windowSize = 0.025 # 25ms 79 | windowStep = 0.01 # 10ms 80 | overlap = int(fs*windowStep) 81 | frameSize = int(fs*windowSize)# int() because the numpy array can take integer as an argument in the initiation 82 | numberOfframes = int(np.ceil(float(np.abs(len(signal) - frameSize)) / overlap )) 83 | print 'Overlap is: ',overlap 84 | print 'Frame size is: ',frameSize 85 | print 'Number of frames: ',numberOfframes 86 | frames = np.ndarray((numberOfframes,frameSize))# initiate a 2D array with numberOfframes rows and frame size columns 87 | #assing samples into the frames (framing) 88 | for k in range(0,numberOfframes): 89 | for i in range(0,frameSize): 90 | if((k*overlap+i)0 for low-pass filters 33 | 34 | 35 | def preEmphasis(wav): 36 | #taking the signal 37 | signal , rate = initialize(wav) 38 | #Pre-emphasis Stage 39 | preEmphasis = 0.97 40 | emphasizedSignal = lowPassFilter(signal) 41 | Time=np.linspace(0, len(signal)/rate, num=len(signal)) 42 | EmphasizedTime=np.linspace(0, len(emphasizedSignal)/rate, num=len(emphasizedSignal)) 43 | return emphasizedSignal, signal , rate 44 | 45 | def writeFeatures(mgca_features,wav): 46 | #write in a txt file the output vectors of every sample 47 | f = open('mel_generalized_features.txt','a')#sample ID 48 | #f = open('mfcc_featuresLR.txt','a')#only to initiate the input for the ROC curve 49 | wav = makeFormat(wav) 50 | np.savetxt(f,mgca_features,newline=",") 51 | f.write(wav) 52 | f.write('\n') 53 | 54 | 55 | def makeFormat(wav): 56 | #if i want to keep only the gender (male,female) 57 | wav = wav.split('/')[1].split('-')[1] 58 | #only to make the format for Logistic Regression 59 | if (wav=='Female'): 60 | wav='1' 61 | else: 62 | wav='0' 63 | return wav 64 | 65 | 66 | def mgca_feature_extraction(wav): 67 | #I pre-emphasized the signal with a low pass filter 68 | emphasizedSignal,signal,rate = preEmphasis(wav) 69 | 70 | 71 | #and now I have the signal windowed 72 | emphasizedSignal*=np.hamming(len(emphasizedSignal)) 73 | 74 | mgca_features = mgcep(emphasizedSignal,order=12) 75 | 76 | writeFeatures(mgca_features,wav) 77 | 78 | 79 | 80 | 81 | def mel_Generalized(): 82 | folder = raw_input('Give the name of the folder that you want to read data: ') 83 | amount = raw_input('Give the number of samples in the specific folder: ') 84 | print 'Mel-Generalized Cepstrum analysis github 
implementation ' 85 | for x in range(1,int(amount)+1): 86 | wav = '/'+folder+'/'+str(x)+'.wav' 87 | print wav 88 | mgca_feature_extraction(wav) 89 | 90 | 91 | 92 | def main(): 93 | mel_Generalized() 94 | 95 | main() 96 | -------------------------------------------------------------------------------- /feature_extraction_techniques/plp.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | import numpy 4 | import numpy.matlib 5 | import scipy 6 | from scipy.fftpack.realtransforms import dct 7 | from sidekit.frontend.vad import pre_emphasis 8 | from sidekit.frontend.io import * 9 | from sidekit.frontend.normfeat import * 10 | from sidekit.frontend.features import * 11 | import scipy.io.wavfile as wav 12 | import numpy as np 13 | 14 | 15 | def readWavFile(wav): 16 | #given a path from the keyboard to read a .wav file 17 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 18 | inputWav = 'PATH_TO_WAV'+wav 19 | return inputWav 20 | 21 | #reading the .wav file (signal file) and extract the information we need 22 | def initialize(inputWav): 23 | rate , signal = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency 24 | sig = wave.open(readWavFile(inputWav)) 25 | # signal is the numpy 2D array with the date of the .wav file 26 | # len(signal) number of samples 27 | sampwidth = sig.getsampwidth() 28 | print 'The sample rate of the audio is: ',rate 29 | print 'Sampwidth: ',sampwidth 30 | return signal , rate 31 | 32 | def PLP(): 33 | folder = raw_input('Give the name of the folder that you want to read data: ') 34 | amount = raw_input('Give the number of samples in the specific folder: ') 35 | for x in range(1,int(amount)+1): 36 | wav = '/'+folder+'/'+str(x)+'.wav' 37 | print wav 38 | #inputWav = readWavFile(wav) 39 | signal,rate = initialize(wav) 40 | #returns PLP coefficients for every frame 41 | plp_features = plp(signal,rasta=True) 42 | meanFeatures(plp_features[0]) 43 | 44 | 45 | #compute the mean features for one .wav file (take the features for every frame and make a mean for the sample) 46 | def meanFeatures(plp_features): 47 | #make a numpy array with length the number of plp features 48 | mean_features=np.zeros(len(plp_features[0])) 49 | #for one input take the sum of all frames in a specific feature and divide them with the number of frames 50 | for x in range(len(plp_features)): 51 | for y in range(len(plp_features[x])): 52 | mean_features[y]+=plp_features[x][y] 53 | mean_features = (mean_features / len(plp_features)) 54 | print mean_features 55 | 56 | 57 | 58 | def main(): 59 | PLP() 60 | 61 | main() 62 | -------------------------------------------------------------------------------- /feature_extraction_techniques/readFiles.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import os 3 | 4 | def readCases(): 5 | healthyCases = os.listdir('path') 6 | capturedCases = os.listdir('path') 7 | #using the os libary that Python provides to read all the files from a scertain directory 8 | #the function return two arrays with all the file names that are in the specific directory 9 | -------------------------------------------------------------------------------- /speech_features/README.md: -------------------------------------------------------------------------------- 1 | EXAMPLE speech features extracted with various techniques. 
2 | 3 | MFCC (Mel Frequency Cepstral Coefficient) 4 | 5 | PLP (Perceptual Linear Prediction) 6 | 7 | - with RASTA filtering 8 | - and without 9 | 10 | LPC (Liner Predictive Coding) 11 | 12 | MGCA(Mel Generalized Cepstrum Analysis) 13 | -------------------------------------------------------------------------------- /speech_features/gmm_mfcc_0.txt: -------------------------------------------------------------------------------- 1 | 1.571385614170679723e+01,1.404875356097497274e+01,-4.860084266595126934e+00,-8.052633391335941582e+00,-3.585894965594753359e+01,-1.664005410215195724e+01,2.096955330354112235e+01,6.827590955059440248e+00,-9.411830493556637478e+00,-1.708960159495563147e+01,-1.474398190776833095e+00,-5.020199916868210543e+00,-2.692742851984636232e+01,0 2 | -------------------------------------------------------------------------------- /speech_features/gmm_mfcc_1.txt: -------------------------------------------------------------------------------- 1 | 1.760115643322911794e+01,8.026205839965216526e+00,-2.372870381468542789e+00,-3.842529428287276261e+01,-2.477253797237680999e+01,-1.050848974736010000e+01,1.884725359397479139e+01,-1.503582987857505238e+01,-1.585323702282968261e+00,1.268366760659247383e+01,-2.354793372118761496e+01,-9.443139105278184786e+00,-1.776925001121093572e+01,1 2 | -------------------------------------------------------------------------------- /speech_features/gmm_test_mfcc.txt: -------------------------------------------------------------------------------- 1 | 1.781939282818649417e+01,6.159290119863752189e+00,-1.652226688796058696e+01,-1.856868294488861793e+01,-1.951579460253341125e+01,-3.897547750343369533e+00,1.205675536859451746e+01,-2.243241947716080986e+01,1.401626082539313778e+01,-2.213080744116675902e+01,-2.415440396864470429e+00,4.772852345310729660e+00,-2.062950925082413178e+01,0 2 | -------------------------------------------------------------------------------- /speech_features/lpc_featuresLR.txt: -------------------------------------------------------------------------------- 1 | -5.713715899460503067e-01,6.012342530606594460e-02,-2.024042466157382758e-01,2.926073544026031037e-01,-2.459888669142980266e-01,3.511328323794973144e-02,-1.930418301955590665e-01,2.854422988330282962e-01,-1.310427749596927705e-01,1.676472867063789340e-01,-6.043539229498144649e-02,-6.383703620963765424e-02,1.358666383377096776e-02,0 2 | -------------------------------------------------------------------------------- /speech_features/mel_generalized_features.txt: -------------------------------------------------------------------------------- 1 | 4.011244746326546595e-01,9.854180285960870145e-03,9.698894709162839134e-03,1.068934676826956663e-02,1.239489571158165077e-02,1.024365701392198659e-02,1.003316534286036871e-02,1.159946972333673713e-02,1.145714884192802069e-02,1.136573025658852570e-02,1.031905747748021983e-02,1.177596370257075371e-02,8.511094453882924599e-03,0 2 | -------------------------------------------------------------------------------- /speech_features/mfcc_featuresLR.txt: -------------------------------------------------------------------------------- 1 | 1.869958534730199062e+01,-1.106390788440557937e+00,-1.463190887778748339e-01,5.917073258262402824e+00,-1.099502916134400898e+01,-6.256470537374103635e+00,5.928290413728441344e+00,1.086700024980764567e+01,-5.702149955059144792e-01,-1.617507983735385180e+00,-5.507156315888738440e+00,2.872456836350028464e+00,-3.080658678467673273e+00,1 2 | 
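Each feature file in this folder follows the same layout, as the examples above and below show: thirteen comma-separated feature values followed by a 0/1 label. A minimal loader sketch (an editorial addition; the column names mirror the readFeaturesFile helpers used by the classifiers, and the path is a placeholder):

import pandas as pd

def load_speech_features(path):
    #13 feature columns plus one binary label column, matching the files in speech_features/
    names = ['Feature%d' % i for i in range(1, 14)] + ['Label']
    data = pd.read_csv(path, names=names)
    X = data[names[:-1]].values  #feature matrix of shape (n_samples, 13)
    y = data['Label'].values     #0/1 targets
    return X, y

#example (placeholder path): X, y = load_speech_features('speech_features/mfcc_featuresLR.txt')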
-------------------------------------------------------------------------------- /speech_features/plp_features.txt: -------------------------------------------------------------------------------- 1 | 5.895941153135551893e+00,-6.748447900593427251e-01,-3.319222787104435940e-02,-1.578910175596278109e-01,1.221139505707007356e-01,5.985885025331066922e-02,-9.752085645350065668e-02,-7.502659344937070984e-02,-1.439876159846410764e-02,2.497247509572233549e-03,-1.891271314688349955e-02,8.776126537227386948e-02,-5.181689905905148552e-02,0 2 | -------------------------------------------------------------------------------- /speech_features/plp_featuresRASTA.txt: -------------------------------------------------------------------------------- 1 | -6.622877863274987398e-01,-4.013368720022871261e-01,-2.644490505501208566e-01,-2.436498156014590688e-01,-2.119007564962608059e-01,-1.768779937559497861e-01,-1.387866152690148402e-01,-1.090785786905756338e-01,-7.822805864564488787e-02,-5.214452830678916601e-02,-3.087330660971222482e-02,-6.892111334665819607e-03,1.710506533453253972e-02,0 2 | --------------------------------------------------------------------------------
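The gmm.py and gmm_healthy_captured.py scripts above share one idea: fit one GaussianMixture per class and label a sample by comparing the per-sample log-likelihoods of the two models, with a narrow band in which the classifier refuses to decide. A condensed sketch of that decision rule (an editorial addition: it uses the standard log-likelihood difference rather than the ratio of log-likelihoods with a threshold near 1 that the scripts implement, and the component count and reject margin are illustrative, not the tuned values):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X_class0, X_class1, n_components=8):
    #one diagonal-covariance mixture per class, as in GaussianMixtureModel() above
    gmm0 = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200, n_init=3).fit(X_class0)
    gmm1 = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200, n_init=3).fit(X_class1)
    return gmm0, gmm1

def predict_with_reject(gmm0, gmm1, X, margin=0.5):
    #score_samples returns the per-sample log-likelihood under each model;
    #their difference is the log-likelihood ratio that drives the decision
    llr = gmm0.score_samples(X) - gmm1.score_samples(X)
    labels = np.where(llr >= 0, 0, 1)
    #samples whose ratio falls inside the reject band stay undecided (-1),
    #mirroring the 'can not decide' branch in testModels()
    labels[np.abs(llr) < margin] = -1
    return labels

#usage (illustrative names): gmm0, gmm1 = fit_class_gmms(X_male, X_female); labels = predict_with_reject(gmm0, gmm1, X_new)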