├── CONTRIBUTING.md ├── LICENSE ├── NNs ├── README.md ├── helping_functions.py └── lstm.py ├── README.md ├── Short_Programming_Project ├── README.md └── university-groningen-short.pdf ├── classifiers ├── README.md ├── SVMs │ ├── README.md │ ├── mfcc_pca_feature.py │ ├── svm_balancedSampleNumber_greedySearch.py │ ├── svm_default.py │ ├── svm_keeping_supportVectors.py │ └── svm_multiclass.py ├── dimensionality_reduction │ ├── README.md │ ├── graph_spectral_analysis&spectral_clustering_default.py │ ├── kpca_lda_knn_equalizeClasses.py │ ├── kpca_lda_knn_multiclass.py │ └── pca_kpca_from-skratch.py ├── gmm.py ├── gmm_healthy_captured.py ├── knn.py ├── leave_one_out.py ├── logisticRegression.py └── simpleNeuralNetwork.py ├── feature_extraction_techniques ├── README.md ├── lpc.py ├── mfcc.py ├── mfcc_pca.py ├── mgca.py ├── plp.py └── readFiles.py └── speech_features ├── README.md ├── gmm_mfcc_0.txt ├── gmm_mfcc_1.txt ├── gmm_test_mfcc.txt ├── lpc_featuresLR.txt ├── mel_generalized_features.txt ├── mfcc_featuresLR.txt ├── plp_features.txt └── plp_featuresRASTA.txt /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Issues for contribution 2 | 3 | - Feature extraction techniques implemented from scratch (LPC, MFCC, PLP), for comparing the results with the already existing implementations. 4 | 5 | - Variational Autoencoders (VAEs) for finding a lower-dimensional representation of the extracted features. Furthermore, we can produce artificial samples using VAEs; with this procedure we can overcome the obstacle caused by the imbalanced dataset. 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Emmanouil Gionanidis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /NNs/README.md: -------------------------------------------------------------------------------- 1 | **helping_functions.py** 2 | Implements helper functions to split the data and to build the time-series format (t, t+1). The number of previous values that we take into consideration in order to predict the next value in the time series is defined by us. It also contains preprocessing functions, such as filling in missing values and, mainly, shaping the data into the appropriate format for feeding different types of neural networks.
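For illustration, here is a minimal sketch of the (t, t+1) windowing idea described above; `make_windows` is a hypothetical name, not the function defined in helping_functions.py:

```python
import numpy as np

def make_windows(series, time_step=1):
    #each input row holds the previous `time_step` values of the series,
    #and the matching target is the value that immediately follows them
    X, y = [], []
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])
        y.append(series[i + time_step])
    return np.array(X), np.array(y)

#example: with time_step=2, [10, 12, 11, 13, 15] becomes
#X = [[10, 12], [12, 11], [11, 13]] and y = [11, 13, 15]
X, y = make_windows([10, 12, 11, 13, 15], time_step=2)
```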
3 | 4 | **lstm.py** 5 | A Long Short-Term Memory (LSTM) neural network for predicting values in time series. This implementation follows on from our first approach for dealing with sequential data. That first approach was a Recurrent Neural Network (RNN), which is known to have problems with short-term and long-term memory. This happens because the gradient vanishes as we move from the last layers back to the first ones, with the outcome that the first layers of the network barely learn at all. We can overcome this obstacle with architectures that provide long- and short-term memory, because previous events are important and we need longer dependencies in our data. For this reason we implement the LSTM and the GRU (Gated Recurrent Unit), which realise exactly this mechanism of giving memory to the network. 6 | -------------------------------------------------------------------------------- /NNs/helping_functions.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | #libraries 4 | import pandas as pd 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import math 8 | import keras.models 9 | import keras.layers 10 | import sklearn.preprocessing 11 | from sklearn.metrics import mean_squared_error 12 | import keras.optimizers 13 | 14 | 15 | '''First we have to define our problem. We use an LSTM because we want to take advantage of its strong point, which is long-short term memory, contrary to RNNs, which we know have drawbacks concerning the gradient in the first layers, so we may have a leakage of previous information. For long time series, or long sequences in general, we use LSTMs and GRUs, but for short sequences RNNs are efficient as well.''' 16 | 17 | 18 | #------------------- Initialization 19 | 20 | 21 | #make this function for reading the input data, shaping it into the format you want, and visualizing it 22 | def initialization(): 23 | 24 | print('Remember: for small datasets a large training percentage can leave the testing set with zero elements. \n ') 25 | 26 | 27 | #introduce some randomness into our procedure (fixed seed, for reproducibility) 28 | np.random.seed(7) 29 | 30 | #determine the path of the file that you want to read from 31 | #-----------------------------------------------------------------> Define the path 32 | path = '' 33 | 34 | #print the two destinations that constitute the trip: split the string based on '/' and print the last word, stripping the last four characters '.txt' 35 | print('Running for: ',path.split('/')[6][:-4]) 36 | 37 | #define the names of the DataFrame columns 38 | names=["x","y"] 39 | 40 | #read the file defined by the path as a pandas DataFrame with the aforementioned columns; the target is the ticket_price, and we can control what we feed our model with via usecols 41 | timeserie = pd.read_csv(path, names = names, engine='python', index_col=None, usecols = ["x","y"]) 42 | #visualize the DataFrame, and check the dimensions 43 | print('Dataframe: \n') 44 | print(timeserie) 45 | print(timeserie.shape) 46 | print('\n') 47 | 48 | #plot how the price evolves through time (definition of time: day, month, year etc) 49 | #plt.plot(timeserie) 50 | #plt.show() 51 | 52 | 53 | #return the file as a DataFrame 54 | return timeserie 55 | 56 | 57 | 58 | 59 | 60 | #------------------ Split Data training/testing 61 | 62 | #split data into training and testing subsets 63 | def split_data(dataset, training_size): 64 | 65 | #translate the training size into our number of elements 66 | train_size = int(len(dataset) * training_size) 67 | test_size = len(dataset) - train_size 68 | 69 | #take the corresponding parts of the dataset 70 | train_samples, test_samples = dataset[0:train_size,:], dataset[train_size: len(dataset),:] 71 | 72 | return train_samples, test_samples 73 | 74 | 75 | 76 | #------------------------------------- Format Dataset 77 | 78 | 79 | '''We are making this function because we want to change the format of our data; we are going to implement regression, so as to predict the next value of a time series.
This means that we are goint to have a specific time, let's say t, and we are predicting what is happening in the time t+1, so we need to model our dataset in order to implement this ideology''' 80 | def format_dataset(dataset, time_step): 81 | 82 | #time_step defines how many times you want to look back 83 | 84 | #define as dataT the time t, and as dataT_1 the time t+1 85 | dataT, dataT_1 = [], [] 86 | 87 | #iterate all the dataset and make the format based on the time step 88 | for i in range(len(dataset)-time_step-1): 89 | 90 | #in time t put the current element 91 | dataT.append(dataset[i:(i+time_step), 0]) 92 | 93 | #in the time t+1 put the next element of the element that we append in the list dataT 94 | dataT_1.append(dataset[i + time_step, 0]) 95 | 96 | #repeat this procedure, following the element that we append in the array dataT_1 97 | 98 | 99 | return np.array(dataT), np.array(dataT_1) 100 | 101 | 102 | 103 | #---------------- Preprocessing 104 | 105 | 106 | #we use this function in order to do the preprocessing staff, normalize, and maybe another procedures #that we want to implement for making a proper format to our data 107 | def preprocessing(dataset, time_step, training_size): 108 | 109 | #take only tha information of the dataframe and not the indexes or the columns names 110 | dataset = dataset.values 111 | 112 | #convert them to floats which is more suitable for feeding a neural netowork 113 | dataset = dataset.astype('float32') 114 | 115 | #first we are going to scale our data because LSTMs are sensitive to the unscaled input data and we are goint to see this in action 116 | #scaling in range [0,1] 117 | scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(0,1)) 118 | 119 | 120 | #----------------------------------------------------------------> fit all the dataset 121 | 122 | dataset = scaler.fit_transform(dataset) 123 | 124 | #and then split 125 | train_samples_scaled, test_samples_scaled = split_data(dataset) 126 | 127 | #----------------------------------------------------------------> fit only training 128 | 129 | #first we have to split our data into test and training 130 | train_samples, test_samples = split_data(dataset, training_size) 131 | 132 | print('Samples chosen for the training procedure: \n') 133 | print(train_samples) 134 | print(train_samples.shape) 135 | print('\n') 136 | print('Samples chose for testing: \n') 137 | print(test_samples) 138 | print(test_samples.shape) 139 | print('\n') 140 | 141 | ''' 142 | scaler.fit(train_samples) 143 | 144 | #transform both training and testing data based on the information of the training only because we want our model to work only for one sample for testing as input 145 | train_samples_scaled = scaler.transform(train_samples) 146 | 147 | test_samples_scaled = scaler.transform(test_samples)''' 148 | 149 | #visualize the scaled data 150 | print('Scaled samples for the training procedure: \n') 151 | print(train_samples_scaled) 152 | print(train_samples_scaled.shape) 153 | print('\n') 154 | print('Scaled samples for testing: \n') 155 | print(test_samples_scaled) 156 | print(test_samples_scaled.shape) 157 | print('\n') 158 | 159 | #------------------------------------------------------------------> end with scaling 160 | 161 | #timeserie format for both testing and training sets 162 | 163 | #define the time step, e.g t+5 we want time_step=5 164 | time_step=1 165 | 166 | trainT, trainT_1 = format_dataset(train_samples_scaled, time_step) 167 | 168 | testT, testT_1 = format_dataset(test_samples_scaled, time_step) 
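#note on the shapes produced above: with time_step=1, a scaled series [s0, s1, s2, ...]
#is turned by format_dataset into
#    trainT   = [[s0], [s1], [s2], ...]   (inputs at time t)
#    trainT_1 = [ s1 ,  s2 ,  s3 , ...]   (targets at time t+1)
#so trainT currently has shape (n_samples, time_step); the reshape further below inserts
#the middle 'time steps' axis that the Keras LSTM expects, giving (n_samples, 1, time_step)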
169 | 170 | print('Previous train data shape: ') 171 | print(trainT) 172 | print('\n') 173 | 174 | #bacause the LSTM waits our input to be in the format below 175 | #[samples, time steps, features] 176 | #we need to transform it in order to fit this prerequisite 177 | 178 | #------------------------------------------------------------> [samples, time steps, features] format 179 | #we are formating only the training set not the testing, because the testing in going to be just apredicted single valuew 180 | trainT = np.reshape(trainT, (trainT.shape[0], 1, trainT.shape[1])) 181 | 182 | testT = np.reshape(testT, (testT.shape[0], 1, testT.shape[1])) 183 | 184 | print('Current train data shape ready to feed LSTM model: ') 185 | print(trainT) 186 | print('\n') 187 | 188 | 189 | #return the sets of training and testing ready for the LSTM model 190 | return trainT, trainT_1, testT, testT_1, time_step, scaler, dataset 191 | 192 | -------------------------------------------------------------------------------- /NNs/lstm.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | #libraries 4 | import pandas as pd 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import math 8 | import keras.models 9 | import keras.layers 10 | import sklearn.preprocessing 11 | from sklearn.metrics import mean_squared_error 12 | import keras.optimizers 13 | import helping_functions 14 | 15 | 16 | #--------------------------------------- RNN with LSTM layer 17 | 18 | 19 | def LSTM(trainT, trainT_1, testT, testT_1, time_step, scaler, dataset): 20 | 21 | #create the model 22 | 23 | #we choose the Sequential because we want to stack the layers, put them in a row 24 | model = keras.models.Sequential() 25 | 26 | #we add a LSTM layer 27 | #---> with 4 neurons or units 28 | #---> determine the input dimension based on the time_step because the input is going to be our previous values and the output will be only the predicted values 29 | #---> dropout: choose the percent to drop of the linear transformation of the reccurent state 30 | #---> implementation: choose if you want to stack the operation into larger number of smaller dot productes or the inverse 31 | #---> recurrent_dropout: the dropout of the recurrent state 32 | 33 | model.add(keras.layers.LSTM(128, input_shape=(1, time_step), use_bias=True, unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1,return_sequences=True)) 34 | 35 | #another one LSTM layer 36 | model.add(keras.layers.LSTM(64, input_shape=(1, time_step), return_sequences=False)) 37 | 38 | model.add(keras.layers.Dense(16,init='uniform',activation='relu')) 39 | 40 | #just a densenly connected layer with 1 neuron/unit, as an output, that makes the single value prediction 41 | model.add(keras.layers.Dense(1, activation='sigmoid')) 42 | 43 | #we use the RSME to validate the performance of our model, and the Adam optimizer for updating the network weights 44 | 45 | #Optimizers to use 46 | 47 | #----> Stochastic Gradient Descent - SGD 48 | #----> RMSProp 49 | #----> Adagrad 50 | #----> Adam 51 | model.compile(loss='mean_squared_error', optimizer=keras.optimizers.Adam(lr=0.001)) 52 | 53 | #feed our model 54 | results = model.fit(trainT, trainT_1, epochs=300, batch_size=1, verbose=1,validation_data=(testT, testT_1)) 55 | 56 | 57 | #------------------ 
Make the predictions 58 | train_predict = model.predict(trainT) 59 | test_predict = model.predict(testT) 60 | 61 | #inverse the prediction in order to suit the format euros per time moment, for calculating the RMSE 62 | train_predict = scaler.inverse_transform(train_predict) 63 | trainT_1 = scaler.inverse_transform([trainT_1]) 64 | test_predict = scaler.inverse_transform(test_predict) 65 | testT_1 = scaler.inverse_transform([testT_1]) 66 | 67 | #now we can calculate the RMSE 68 | 69 | train_score = math.sqrt(mean_squared_error(trainT_1[0], train_predict[:,0])) 70 | print('RMSE training: %.2f' % (train_score)) 71 | 72 | test_score = math.sqrt(mean_squared_error(testT_1[0], test_predict[:,0])) 73 | print('RMSE testing: %.2f'% (test_score)) 74 | 75 | Visualize(train_predict, test_predict, dataset, time_step, scaler, results) 76 | 77 | 78 | 79 | #-------------- Visualize the pridictions 80 | def Visualize(train_predict, test_predict, dataset, time_step, scaler, results): 81 | 82 | #initialize the array for testing and training 83 | train_predict_plot = np.empty_like(dataset) 84 | train_predict_plot[:, :] = np.nan 85 | 86 | test_predict_plot = np.empty_like(dataset) 87 | test_predict_plot[:, :] = np.nan 88 | 89 | 90 | #we have to shift the train predictions in order to plot them correctly 91 | train_predict_plot[time_step:len(train_predict)+time_step, :] = train_predict 92 | 93 | #we have to shift the test predictions in order to plot them correctly 94 | test_predict_plot[len(train_predict)+(time_step*2)+1:len(dataset)-1,:] = test_predict 95 | 96 | 97 | #plot baseline and the predictions in the same plot 98 | plt.figure(1) 99 | plt.title('Predictions from training and testins sets') 100 | plt.plot(scaler.inverse_transform(dataset)) 101 | print(train_predict_plot) 102 | print(test_predict_plot) 103 | plt.plot(train_predict_plot) 104 | plt.plot(test_predict_plot) 105 | plt.legend(['Dataset','Train phase prediction','Test phase prediction']) 106 | 107 | plt.figure(2) 108 | plt.title('Train loss curve') 109 | plt.plot(results.history['loss']) 110 | plt.plot(results.history['val_loss']) 111 | plt.legend(['train loss','test loss']) 112 | 113 | #show the plots 114 | plt.show() 115 | 116 | 117 | 118 | #------------------- Main procedure 119 | 120 | def main(): 121 | dataset = helping_functions.initialization() 122 | 123 | training_size = 0.60 124 | time_step = 1 125 | 126 | trainT, trainT_1, testT, testT_1, time_step, scaler, dataset = helping_functions.preprocessing(dataset, time_step, training_size) 127 | 128 | LSTM(trainT, trainT_1, testT, testT_1, time_step, scaler, dataset) 129 | 130 | 131 | main() 132 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Speech-Signal-Processing-and-Classification 2 | 3 | #### Aristotle University of Thessaloniki - University of Groningen 4 | 5 | Abstract of my thesis conducted during the 7-8th semester. " Two-class classification problems by analyzing the speech signal. " 6 | 7 | 8 | 9 | Front-end speech processing aims at extracting proper features from short- term segments of a speech utterance, known as frames. It is a pre-requisite step toward any pattern recognition problem employing speech or audio (e.g., music). Here, we are interesting in voice disorder classification. 
That is, to develop two-class classifiers which can discriminate between utterances of a subject suffering from, say, vocal fold paralysis and utterances of a healthy subject. The mathematical modeling of the speech production system in humans suggests that an all-pole system function is justified [1-3]. As a consequence, linear prediction coefficients (LPCs) constitute a first choice for modeling the magnitude of the short-term spectrum of speech. LPC-derived cepstral coefficients are guaranteed to discriminate between the system (e.g., vocal tract) contribution and that of the excitation. Taking into account the characteristics of the human ear, the mel-frequency cepstral coefficients (MFCCs) emerged as descriptive features of the speech spectral envelope. Similarly to MFCCs, the perceptual linear prediction coefficients (PLPs) could also be derived. The aforementioned, so to speak, traditional features will be tested against agnostic features extracted by convolutional neural networks (CNNs) (e.g., auto-encoders) [4]. Additionally, as concerns the SVM algorithm, the dimensionality reduction step took place using algorithms such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the kernel form of PCA (KernelPCA). In the multiclass implementation the use of KPCA followed by LDA was essential, while in the binary classification we compare the use of PCA and KernelPCA. For experimental purposes, Graph Spectral Analysis (Isomap, LLE) was used for dimensionality reduction, followed by Spectral Clustering in order to investigate subsets. 11 | The pattern recognition step will be based on Gaussian Mixture Model based classifiers, K-nearest neighbor classifiers, Bayes classifiers, as well as Deep Neural Networks. At the application level, a library for feature extraction and classification in Python will be developed. Credible publicly available resources will be used toward achieving our goal, such as KALDI. Comparisons will be made against [5-7]. 12 | 13 | [1] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Up- 14 | per Saddle River, N.J.: Pearson Education-Prentice Hall, 2001. 15 | 16 | [2] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Pro- 17 | cessing of Speech Signals. New York, N.Y.: Wiley-IEEE, 1999. 18 | 19 | [3] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital 20 | Speech Processing. Upper Saddle River, N.J.: Pearson Education-Prentice 21 | Hall, 2011. 22 | 23 | [4] Wei-Ning Hsu, Yu Zhang, and James R. Glass, "Unsupervised Domain 24 | Adaptation for Robust Speech Recognition via Variational Autoencoder- 25 | Based Data Augmentation", 2017, http://arxiv.org/abs/1707.06265. 26 | 27 | [5] C. Kotropoulos and G. R. Arce, "Linear discriminant classifier with re- 28 | ject option for the detection of vocal fold paralysis and vocal fold edema", 29 | EURASIP Advances in Signal Processing, vol. 2009, article ID 203790, 13 30 | pages, 2009 (DOI:10.1155/2009/203790). 31 | 32 | [6] E. Ziogas and C. Kotropoulos, "Detection of vocal fold paralysis and 33 | edema using linear discriminant classifiers", in Proc. 4th Panhellenic Ar- 34 | tificial Intelligence Conf. (SETN-06), Heraklion, Greece, vol. LNAI 3966, 35 | pp. 454-464, May 19-20, 2006. 36 | 37 | [7] M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic de- 38 | tection of vocal fold paralysis and edema", in Proc. 8th Int. Conf. Spoken 39 | Language Processing (INTERSPEECH 2004), Jeju, Korea, pp. 537-540, Oc- 40 | tober, 2004.
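As a rough illustration of the pipeline described above (frame-level MFCC extraction followed by a two-class classifier), here is a minimal sketch using `python_speech_features` and scikit-learn, the libraries used throughout this repository; the file names are placeholders and the SVM hyperparameters simply mirror those used in `classifiers/SVMs/svm_balancedSampleNumber_greedySearch.py`:

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def utterance_features(path):
    #13 MFCCs per frame, averaged over all frames -> one fixed-length vector per utterance
    rate, signal = wav.read(path)
    return np.mean(mfcc(signal, rate), axis=0)

#placeholder corpus: replace with the real .wav paths and labels (0 = healthy, 1 = pathological)
files = [('healthy_001.wav', 0), ('healthy_002.wav', 0),
         ('paralysis_001.wav', 1), ('paralysis_002.wav', 1)]

X = np.array([utterance_features(path) for path, _ in files])
y = np.array([label for _, label in files])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel='rbf', C=1000, gamma=1e-7).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```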
41 | -------------------------------------------------------------------------------- /Short_Programming_Project/README.md: -------------------------------------------------------------------------------- 1 | University of Groningen 2 | 3 | 4 | Short Programming Project on gender classification based on voice signals (.wav files) 5 | 6 | -------------------------------------------------------------------------------- /Short_Programming_Project/university-groningen-short.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gionanide/Speech_Signal_Processing_and_Classification/4e250f1f7de9c17263be1f309441f4c9bd34f4be/Short_Programming_Project/university-groningen-short.pdf -------------------------------------------------------------------------------- /classifiers/README.md: -------------------------------------------------------------------------------- 1 | Classifier implementations, first for the gender task and then for the healthy-or-not cases. 2 | 3 | - Gaussian Mixture Models 4 | - K-nearest Neighbours 5 | - Logistic Regression 6 | - Support Vector Machine 7 | - Linear Discriminant Analysis 8 | - Decision Tree Classifier 9 | - GaussianNB 10 | - Neural Networks 11 | 12 | Further elaboration took place for: 13 | - GMMs 14 | - KNN 15 | - LR 16 | - SVM 17 | -------------------------------------------------------------------------------- /classifiers/SVMs/README.md: -------------------------------------------------------------------------------- 1 | Implementation of the SVM algorithm for classification. **svm_default.py** uses only the default parameters to initialize the procedure. 2 | 3 | This folder contains variations concerning the training and evaluation methods of the SVM algorithm, with experimental results for different kernel functions and different parameter values. **svm_balancedSampleNumber_greedySearch.py** uses a training method with a balanced training set, where the balance refers to the number of samples of each class. 4 | 5 | One example of such training uses the divided parts and keeps only the samples that are support vectors in every iteration, continuing this procedure until the class with more samples has been iterated over completely. The last variation uses greedy (grid) search to calculate the kernel parameters. 6 | 7 | In the script **svm_keeping_supportVectors.py** the above experiment takes place. As a first approach we train our model taking all the samples from class0 and dividing them accordingly just to balance our data, and we continue this procedure until there are no more units of class0 samples. From each iteration we keep all the support vectors, which contain samples from both classes. We remove the duplicates and delete all the samples that came from class1, so we obtain a dataframe containing all the support vectors from class0. We then train our model with all the samples from class1 and only those class0 samples that were support vectors, and we repeat this procedure. In the end the number of samples from class0 becomes smaller than the number of samples from class1, and when it drops below half the number of class1 samples we stop. Because we scale the data while keeping the support vectors, we have to unscale them: if we fed the classifier the already-scaled support vectors they would be scaled again, so we unscale the class0 support vectors that we kept.
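A condensed sketch of this loop, for orientation only; it is simplified with respect to the actual script (the scaling/unscaling and the train/test split are omitted), and `reduce_class0_by_support_vectors` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

def reduce_class0_by_support_vectors(class0, class1):
    #keep shrinking the majority class (class0) until it drops below half of class1
    while len(class0) >= len(class1) / 2:
        kept = []
        #split class0 into units of roughly len(class1) samples each
        for unit in np.array_split(class0, max(len(class0) // len(class1), 1)):
            data = pd.concat([unit, class1])
            X, y = data.drop('Label', axis=1).values, data['Label'].values
            svm = SVC(kernel='rbf', C=10, gamma=1).fit(X, y)
            sv_rows = data.iloc[svm.support_]            #samples that became support vectors
            kept.append(sv_rows[sv_rows['Label'] == 0])  #keep only the class0 ones
        class0 = pd.concat(kept).drop_duplicates()
    return class0
```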
8 | 9 | In general because class0 has 6 times more samples than class1 in order to reduce the amount of samples of class0 we try this procedure taking the support vectors and then the support vectors of support vectors and goes on. 10 | 11 | Furthermore in the script **svm_multiclass.py** we try to classify a dataset of three classes. 12 | -------------------------------------------------------------------------------- /classifiers/SVMs/mfcc_pca_feature.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | from python_speech_features import mfcc 4 | import scipy.io.wavfile as wavv 5 | import os 6 | from sklearn.decomposition import IncrementalPCA, PCA 7 | import sys 8 | import pandas as pd 9 | from sklearn import model_selection 10 | from sklearn.svm import SVC # support vectors for classification 11 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 12 | from sklearn.model_selection import cross_val_score, GridSearchCV 13 | import timeit 14 | import numpy as np 15 | import itertools 16 | from sklearn.preprocessing import MinMaxScaler 17 | 18 | ''' 19 | We read the input file, we take the rate of the signal and the signal and then the mfcc feature extraction is on. 20 | N numpy array with size of the number of frames, each row has one feature vector. 21 | ''' 22 | def mfcc_features_extraction(wav): 23 | inputWav,wav = readWavFile(wav) 24 | print inputWav 25 | rate,signal = wavv.read(inputWav) 26 | mfcc_features = mfcc(signal,rate) 27 | return mfcc_features,wav 28 | 29 | ''' 30 | Make a numpy array with length the number of mfcc features, 31 | for one input take the sum of all frames in a specific feature and divide them with the number of frames. Because we extract 13 features 32 | from every frame now we have to add them and take the mean of them in order to describe the sample. 33 | ''' 34 | def mean_features(mfcc_features,wav,folder,general_feature_list,general_label_list): 35 | #here we are taking all the mfccs from every frame and we are not taking the average of them, instead we 36 | #are taking PCA in order to reduce the dimension of our data 37 | 38 | if (folder=='HC'): 39 | #map the lists, in the first position of the general_label_list it will be the label 40 | #of the sample which is in the first position in the list general_feature_list 41 | #and we are making this in order to write the sample to the file with the right labels 42 | general_label_list.append(0) 43 | elif(folder == 'PD'): 44 | general_label_list.append(1) 45 | 46 | #initialize the flattend list 47 | flattend_list = [] 48 | 49 | #flat the list, for every frame take the 13 features and put them in one array 50 | for sublist in mfcc_features: 51 | for feature in sublist: 52 | flattend_list.append(feature) 53 | 54 | #check if a sample has les length than the length we determine 55 | if(len(flattend_list)<12800): 56 | print len(flattend_list) 57 | print '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' 58 | 59 | #make the list of lists as a numpy array in order just one sample 1x(number_of_frames x features) 60 | #so for every sample we have all the features from all the frames in a single row. 
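#with the default mfcc() settings (13 coefficients per frame, 10 ms frame step) a few seconds of
#speech gives a row of several thousand values; classifyPHC() later truncates every row to its
#first 12800 values so that all samples end up with the same fixed length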
61 | #here we append in a list of lists the samples, we want to fill this list with all the samples 62 | general_feature_list.append(flattend_list) 63 | 64 | #in this function we just filling the two lists one with the features and one with the labels 65 | 66 | #sys.exit() 67 | 68 | 69 | 70 | def readWavFile(wav): 71 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 72 | inputWav = 'PATH_TO_WAV'+wav 73 | return inputWav,wav 74 | 75 | 76 | 77 | 78 | ''' 79 | write in a txt file the output vectors of every sample 80 | ''' 81 | def writeFeatures(general_feature_list,general_label_list,wav,folder): 82 | 83 | f = open('PATH_TO_SAMPLES','a') 84 | 85 | 86 | #we have to iterato all the general_feature_list 87 | for x in range(len(general_feature_list)): 88 | #append the last element before you write to the file because it is the label 89 | 90 | print len(general_feature_list[x]) 91 | 92 | #write it to the file after you append it 93 | np.savetxt(f,general_feature_list[x],newline=",") 94 | #write the label 95 | f.write(str(general_label_list[x])) 96 | #and change line 97 | f.write('\n') 98 | 99 | 100 | ''' 101 | if i want to keep only the gender (male,female) 102 | wav = wav.split('/')[1].split('-')[1], this is only for male,female classification 103 | wav = wav.split('/')[1].split('-')[0], this is for edema,paralysis classification 104 | wav.split('/')[1], for healthy,parkinson classification 105 | ''' 106 | 107 | def makeFormat(folder): 108 | if (folder=='HC'): 109 | wav='0' 110 | elif(folder == 'PD'): 111 | wav='1' 112 | return wav 113 | 114 | 115 | ''' 116 | def readCases(): 117 | - now we want to take all the file names of a directory and them read them accordingly 118 | 119 | healthyCases = os.listdir('PATH_TO_WAV') 120 | parkinsonCases = os.listdir('PATH_TO_WAV') 121 | 122 | return healthyCases , parkinsonCases 123 | ''' 124 | 125 | 'takes the csv file and split the label from the features' 126 | def splitData(data): 127 | # Split-out the set in two different arrayste 128 | array = data.values 129 | #features array contains only the features of the samples 130 | features = array[:,0:12800] 131 | #labels array contains only the lables of the samples 132 | labels = array[:,12800] 133 | 134 | return features,labels 135 | 136 | ''' 137 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 138 | than class1, particularly it is 9 to 1.''' 139 | def equalizeClasses(data): 140 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 141 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 142 | class1 = data.loc[data['Label'] == 1] 143 | 144 | 145 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 146 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 147 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 148 | 149 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 150 | 151 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 152 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 153 | #class0 = class0.sample(frac=1) 154 | 155 | #samples array for training taking the balance 
number of samples for the shuffled dataFrame 156 | newClass0 = class0.sample(n=balance) 157 | 158 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 159 | newData = pd.concat([newClass0, class1]) 160 | 161 | #return the new balanced(number of samples from each class) dataFrame 162 | return newData 163 | 164 | 165 | 166 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 167 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 168 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 169 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 170 | the best results. This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 171 | def paramTuning(features_train, labels_train, nfolds): 172 | #using the training data and define the number of folds 173 | #determine the range of the Cs range you want to search 174 | Cs = [1000, 10010,10000, 10060, 100000, 1000000] 175 | 176 | #determine the range of the gammas range you want to search 177 | gammas = [0.00001, 0.0001, 0.005, 0.003 ,0.001, 0.01, 0.1] 178 | 179 | #make the dictioanry 180 | param_grid = {'C': Cs, 'gamma': gammas} 181 | 182 | #start the greedy search using all the matching sets from above 183 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 184 | 185 | #fit your training data 186 | grid_search.fit(features_train, labels_train) 187 | 188 | #visualize the best couple of parameters 189 | print grid_search.best_params_ 190 | 191 | 192 | 193 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 194 | def classifyPHC(general_feature_list,general_label_list): 195 | #because we took features and labels seperatly we have to put them in the same list 196 | #and because for every signal we have different frames we took the first 12800 features 197 | for x in range(len(general_feature_list)): 198 | general_feature_list[x] = general_feature_list[x][:12800] 199 | general_feature_list[x].append(general_label_list[x]) 200 | 201 | #here because we have to make the dataframe again because the inputs are two lists 202 | headers = [] 203 | #we initialize the headers/features 204 | for x in range(1,12801): 205 | headers.append('Feature'+str(x)) 206 | headers.append('Label') 207 | 208 | print len(general_feature_list) 209 | print len(general_feature_list[0]) 210 | 211 | #build the dataframe 212 | data = pd.DataFrame(general_feature_list,columns=headers) 213 | 214 | #equalize classes 215 | data = equalizeClasses(data) 216 | 217 | #data = equalizeClasses(data) 218 | features,labels = splitData(data) 219 | 220 | #determine the training and testing size in the range of 1, 1 = 100% 221 | validation_size = 0.2 222 | 223 | #here we are splitting our data based on the validation_size into training and testing data 224 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 225 | test_size=validation_size) 226 | 227 | 228 | #determine the pca, and determine the dimension you want to end up 229 | pca = PCA(n_components=500) 230 | 231 | #fit only the features train 232 | pca.fit(features_train) 233 | 234 | #dimensionality reduction of features 
train 235 | features_train = pca.transform(features_train) 236 | 237 | #dimensionality reduction of fatures validation 238 | features_validation = pca.transform(features_validation) 239 | 240 | 241 | #normalize data in the range [-1,1] 242 | scaler = MinMaxScaler(feature_range=(-1, 1)) 243 | #fit only th training data in order to find the margin and then test to data without normalize them 244 | scaler.fit(features_train) 245 | 246 | features_train = scaler.transform(features_train) 247 | 248 | #trnasform the validation features without fitting them 249 | features_validation = scaler.transform(features_validation) 250 | 251 | #we can see the shapes of the array just to check 252 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 253 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 254 | 255 | 256 | #take the best couple of parameters from the procedure of greedy search 257 | #paramTuning(features_train, labels_train, 5) 258 | 259 | #we initialize our model 260 | svm = SVC(kernel='rbf',C=1000,gamma=1e-05,decision_function_shape='ovr') 261 | #svm = NearestNeighbors(n_neighbors=5) 262 | 263 | 264 | 265 | #train our model with the data that we previously precessed 266 | svm.fit(features_train,labels_train) 267 | 268 | #now test our model with the test data 269 | predicted_labels = svm.predict(features_validation) 270 | accuracy = accuracy_score(labels_validation, predicted_labels) 271 | print 'Classification accuracy: ',accuracy*100,'\n' 272 | 273 | #see the accuracy in training procedure 274 | predicted_labels_train = svm.predict(features_train) 275 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 276 | print 'Training accuracy: ',accuracy_train*100,'\n' 277 | 278 | #confusion matrix to illustrate the faulty classification of each class 279 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 280 | print 'Confusion matrix: \n',conf_matrix,'\n' 281 | print 'Support class 0 class 1:' 282 | #calculate the support of each class 283 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 284 | 285 | #calculate the accuracy of each class 286 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 287 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 288 | 289 | #see the inside details of the classification 290 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 291 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 292 | 293 | 294 | #try 5-fold cross validation 295 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 296 | print 'cross validation scores for 5-fold',scores,'\n' 297 | print 'parameters of the model: \n',svm.get_params(),'\n' 298 | 299 | print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 300 | 301 | return svm.support_vectors_ 302 | 303 | ''' 304 | read all the files from both directories based on the keyboard input HC for healthy cases, PD fro parkinson disease 305 | ''' 306 | def mainParkinson(): 307 | general_feature_list = [] 308 | general_label_list = [] 309 | folder = raw_input('Give the name of the folder that you want to read data: ') 310 | if(folder == 'PD'): 311 | healthyCases = os.listdir(PATH) 312 | for x in healthyCases: 313 | wav = 
'/'+folder+'/'+str(x) 314 | mfcc_features,inputWav = mfcc_features_extraction(wav) 315 | mean_features(mfcc_features,inputWav,folder,general_feature_list,general_label_list) 316 | folder = raw_input('Give the name of the folder that you want to read data: ') 317 | if(folder == 'HC'): 318 | parkinsonCases = os.listdir(PATH) 319 | for x in parkinsonCases: 320 | wav = '/'+folder+'/'+str(x) 321 | mfcc_features,inputWav = mfcc_features_extraction(wav) 322 | mean_features(mfcc_features,inputWav,folder,general_feature_list,general_label_list) 323 | #print general_feature_list, general_label_list 324 | #writeFeatures(general_feature_list,general_label_list,wav,folder) 325 | classifyPHC(general_feature_list,general_label_list) 326 | 327 | ''' 328 | main function, this example is for male,female classification 329 | given an input from the keyboard that determines the name of the File from which we want to read the samples, and 330 | the number of the samples that we want to read 331 | ''' 332 | def mainMaleFemale(): 333 | folder = raw_input('Give the name of the folder that you want to read data: ') 334 | amount = raw_input('Give the number of samples in the specific folder: ') 335 | for x in range(1,int(amount)+1): 336 | wav = '/'+folder+'/'+str(x)+'.wav' 337 | print wav 338 | mfcc_features,inputWav = mfcc_features_extraction(wav) 339 | mean_features(mfcc_features,inputWav,folder) 340 | 341 | 342 | 343 | def main(): 344 | #calculate the time 345 | import time 346 | start_time = time.time() 347 | 348 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 349 | mainParkinson() 350 | 351 | time = time.time()-start_time 352 | print 'time: ',time 353 | 354 | 355 | main() 356 | 357 | 358 | 359 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_balancedSampleNumber_greedySearch.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | 9 | 10 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 11 | the form feature1.........feature13,Label''' 12 | def readFile(): 13 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 14 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 15 | 'Feature10','Feature11','Feature12','Feature13','Label'] 16 | 17 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 
18 | path = 'PATH_TO_SAMPLES.txt' 19 | #read file in csv format 20 | data = pd.read_csv(path,names=names ) 21 | 22 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 23 | return data 24 | 25 | 'takes the csv file and split the label from the features' 26 | def splitData(data): 27 | # Split-out the set in two different arrayste 28 | array = data.values 29 | #features array contains only the features of the samples 30 | features = array[:,0:13] 31 | #labels array contains only the lables of the samples 32 | labels = array[:,13] 33 | 34 | return features,labels 35 | 36 | ''' 37 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 38 | than class1, particularly it is 9 to 1.''' 39 | def equalizeClasses(data): 40 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 41 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 42 | class1 = data.loc[data['Label'] == 1] 43 | 44 | 45 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 46 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 47 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 48 | 49 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 50 | 51 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 52 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 53 | class0 = class0.sample(frac=1) 54 | 55 | #samples array for training taking the balance number of samples for the shuffled dataFrame 56 | newClass0 = class0.sample(n=balance) 57 | 58 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 59 | newData = pd.concat([newClass0, class1]) 60 | 61 | #return the new balanced(number of samples from each class) dataFrame 62 | return newData 63 | 64 | 65 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 66 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 67 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 68 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 69 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 70 | def paramTuning(features_train, labels_train, nfolds): 71 | #using the training data and define the number of folds 72 | #determine the range of the Cs range you want to search 73 | Cs = [1000, 10000, 10000, 1000000] 74 | 75 | #determine the range of the gammas range you want to search 76 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 77 | 78 | #make the dictioanry 79 | param_grid = {'C': Cs, 'gamma': gammas} 80 | 81 | #start the greedy search using all the matching sets from above 82 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 83 | 84 | #fit your training data 85 | grid_search.fit(features_train, labels_train) 86 | 87 | #visualize the best couple of parameters 88 | return grid_search.best_params_ 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 99 | def classifyPHC(): 100 | data = readFile() 101 | data = equalizeClasses(data) 102 | features,labels = splitData(data) 103 | 104 | #determine the training and testing size in the range of 1, 1 = 100% 105 | validation_size = 0.2 106 | 107 | #here we are splitting our data based on the validation_size into training and testing data 108 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 109 | test_size=validation_size) 110 | 111 | #we can see the shapes of the array just to check 112 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 113 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 114 | 115 | #take the best couple of parameters from the procedure of greedy search 116 | #C_best, gamma_best = paramTuning(features_train, labels_train, 5) 117 | 118 | #we initialize our model 119 | svm = SVC(kernel='rbf',C=1000,gamma=1e-07) 120 | 121 | 122 | #train our model with the data that we previously precessed 123 | svm.fit(features_train,labels_train) 124 | 125 | #now test our model with the test data 126 | predicted_labels = svm.predict(features_validation) 127 | accuracy = accuracy_score(labels_validation, predicted_labels) 128 | print 'Classification accuracy: ',accuracy*100,'\n' 129 | 130 | #confusion matrix to illustrate the faulty classification of each class 131 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 132 | print 'Confusion matrix: \n',conf_matrix,'\n' 133 | print 'Support class 0 class 1:' 134 | #calculate the support of each class 135 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 136 | 137 | #calculate the accuracy of each class 138 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 139 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 140 | 141 | #see the inside details of the classification 142 | print 'For class 0 healthy cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 143 | print 'For class 1 parkinson cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 144 | 145 | #try 5-fold cross validation 146 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 147 | print 'cross validation scores for 5-fold',scores,'\n' 148 | print 'parameters of the model: 
\n',svm.get_params(),'\n' 149 | 150 | print 'number of samples used as support vectors',len(svm.support_vectors_) 151 | 152 | 153 | 154 | classifyPHC() 155 | 156 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_default.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import pandas as pd 3 | from sklearn import model_selection 4 | from sklearn.svm import SVC # support vectors for classification 5 | from sklearn.metrics import accuracy_score, confusion_matrix 6 | 7 | 8 | 'this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in' 9 | 'the form feature1.........feature13,Label' 10 | def readFile(): 11 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 12 | #sample hc/pc : helathy case, parkinson case 13 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 14 | 'Feature10','Feature11','Feature12','Feature13','Label'] 15 | 16 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 17 | path = 'PATH_TO_SAMPLES.txt' 18 | #read file in csv format 19 | data = pd.read_csv(path,names=names ) 20 | 21 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 22 | return data 23 | 24 | 'takes the csv file and split the label from the features' 25 | def splitData(data): 26 | # Split-out the set in two different arrayste 27 | array = data.values 28 | #features array contains only the features of the samples 29 | features = array[:,0:13] 30 | #labels array contains only the lables of the samples 31 | labels = array[:,13] 32 | 33 | return features,labels 34 | 35 | 36 | 'Building a model which is going to be trained with of given cases and test according to new ones' 37 | def classifyPHC(): 38 | data = readFile() 39 | features,labels = splitData(data) 40 | 41 | #determine the training and testing size in the range of 1, 1 = 100% 42 | validation_size = 0.2 43 | 44 | #here we are splitting our data based on the validation_size into training and testing data 45 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 46 | test_size=validation_size) 47 | 48 | #we can see the shapes of the array just to check 49 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 50 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 51 | 52 | #we initialize our model 53 | svm = SVC(kernel='sigmoid',C=0.8) 54 | 55 | #train our model with the data that we previously precessed 56 | svm.fit(features_train,labels_train) 57 | 58 | #now test our model with the test data 59 | predicted_labels = svm.predict(features_validation) 60 | accuracy = accuracy_score(labels_validation, predicted_labels) 61 | print 'Classification accuracy: ',accuracy,'\n' 62 | 63 | #confusion matrix to illustrate the faulty classification of each class 64 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 65 | print 'Confusion matrix: \n',conf_matrix,'\n' 66 | print 'Support class 0 class 1:' 67 | #calculate the support of each class 68 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 69 | 
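#reminder: in scikit-learn's confusion_matrix the rows are the true classes and the columns the
#predicted ones, so conf_matrix[0][0] counts healthy samples classified as healthy,
#conf_matrix[0][1] healthy samples misclassified as parkinson, and conf_matrix[1][0]/conf_matrix[1][1]
#the analogous counts for the parkinson class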
70 | #see the inside details of the classification 71 | print 'For class 0 healthy cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified \n' 72 | print 'For class 1 parkinson cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified \n' 73 | 74 | 75 | 76 | classifyPHC() 77 | 78 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_keeping_supportVectors.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | import numpy as np 10 | import itertools 11 | import sys 12 | from sklearn.preprocessing import MinMaxScaler 13 | 14 | 15 | 16 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 17 | the form feature1.........feature13,Label''' 18 | def readFile(): 19 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 20 | #sample hc/pc : helathy case, parkinson case 21 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 22 | 'Feature10','Feature11','Feature12','Feature13','Label'] 23 | 24 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 25 | path = 'PATH_TO_WAV_SAMPLES.txt' 26 | #read file in csv format 27 | data = pd.read_csv(path,names=names ) 28 | 29 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 30 | return data 31 | 32 | 'takes the csv file and split the label from the features' 33 | def splitData(data): 34 | # Split-out the set in two different arrayste 35 | array = data.values 36 | #features array contains only the features of the samples 37 | features = array[:,0:13] 38 | #labels array contains only the lables of the samples 39 | labels = array[:,13] 40 | 41 | return features,labels 42 | 43 | ''' 44 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 45 | than class1, particularly it is 9 to 1. 
We made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 46 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 47 | vectors, the samples only the class which we are taking a piece of it's samples''' 48 | '''''' 49 | def equalizeClasses(data): 50 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 51 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 52 | class1 = data.loc[data['Label'] == 1] 53 | 54 | 55 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 56 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 57 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 58 | 59 | #check division with zero 60 | if(weight == 0): 61 | print 'Now the amount of samples in class0 is smaller than half the amount of samples in class1 because we reduce the class0 samples by taking only the support vectors' 62 | if(len(class0)<(len(class1)/2)): 63 | #if the amount of samples in class0 is below the amount of half of the samples in class1 terminate the script 64 | sys.exit() 65 | else: 66 | #else, take all the samples from class0 67 | weight = 1 68 | else: 69 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 70 | 71 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 72 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 73 | #class0 = class0.sample(frac=1) 74 | 75 | #samples array for training taking the balance number of samples for the shuffled dataFrame 76 | #split the dataFrame based on the weight, so here we are making units of samples in the amount of balance in order 77 | #to train our model with an iteration procedure 78 | newData = np.array_split(class0, weight) 79 | 80 | 81 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 82 | #newData = pd.concat([newClass0, class1]) 83 | 84 | #return the new balanced(number of samples from each class) dataFrame 85 | #return both classes in order to compine them later 86 | return newData, class1, class0 87 | 88 | 89 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 90 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 91 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 92 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 93 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 94 | def paramTuning(features_train, labels_train, nfolds): 95 | #using the training data and define the number of folds 96 | #determine the range of the Cs range you want to search 97 | Cs = [1, 10, 100, 1000, 10000] 98 | 99 | #determine the range of the gammas range you want to search 100 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 101 | 102 | #make the dictioanry 103 | param_grid = {'C': Cs, 'gamma': gammas} 104 | 105 | #start the greedy search using all the matching sets from above 106 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 107 | 108 | #fit your training data 109 | grid_search.fit(features_train, labels_train) 110 | 111 | #visualize the best couple of parameters 112 | return grid_search.best_params_ 113 | 114 | 115 | 116 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 117 | def classifyPHC(data): 118 | #take the array with the units of samples of class0 divided properly to train the model in a balanced dataset 119 | data1,class1,class0 = equalizeClasses(data) 120 | #run this procedure by using all the units 121 | 122 | 123 | support_vectors=[] 124 | for newdata in data1: 125 | data = pd.concat([newdata, class1]) 126 | features,labels = splitData(data) 127 | 128 | #determine the training and testing size in the range of 1, 1 = 100% 129 | validation_size = 0.2 130 | 131 | #here we are splitting our data based on the validation_size into training and testing data 132 | features_train_unscaled, features_validation_unscaled, labels_train, labels_validation = model_selection.train_test_split(features, labels, 133 | test_size=validation_size) 134 | 135 | #normalize data in the range [-1,1] 136 | scaler = MinMaxScaler(feature_range=(-1, 1)) 137 | #fit only th training data in order to find the margin and then test to data without normalize them 138 | scaler.fit(features_train_unscaled) 139 | 140 | features_train = scaler.transform(features_train_unscaled) 141 | 142 | #trnasform the validation features without fitting them 143 | features_validation = scaler.transform(features_validation_unscaled) 144 | 145 | 146 | 147 | 148 | 149 | 150 | #we can see the shapes of the array just to check 151 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 152 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 153 | 154 | 155 | #take the best couple of parameters from the procedure of greedy search 156 | #paramTuning(features_train, labels_train, 5) 157 | 158 | #we initialize our model 159 | svm = SVC(kernel='rbf',C=10,gamma=1,decision_function_shape='ovr') 160 | 161 | #train our model with the data that we previously precessed 162 | svm.fit(features_train,labels_train) 163 | 164 | #now test our model with the test data 165 | predicted_labels = svm.predict(features_validation) 166 | accuracy = accuracy_score(labels_validation, predicted_labels) 167 | print 'Classification accuracy: ',accuracy*100,'\n' 168 | 169 | #see the accuracy in training procedure 170 | predicted_labels_train = svm.predict(features_train) 171 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 172 | print 'Training accuracy: ',accuracy_train*100,'\n' 173 | 174 | #confusion matrix to illustrate the faulty classification of each class 175 | conf_matrix = 
confusion_matrix(labels_validation, predicted_labels) 176 | print 'Confusion matrix: \n',conf_matrix,'\n' 177 | print 'Support class 0 class 1:' 178 | #calculate the support of each class 179 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 180 | 181 | #calculate the accuracy of each class 182 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 183 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 184 | 185 | #see the inside details of the classification 186 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 187 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 188 | 189 | 190 | #try 5-fold cross validation 191 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 192 | print 'cross validation scores for 5-fold',scores,'\n' 193 | #print 'parameters of the model: \n',svm.get_params(),'\n' 194 | 195 | print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 196 | 197 | #keep the support vectors of every iteration, until the units of samples of the class0 finishes 198 | #but we undo the scaling because we want to scale again our data based on the new training sample 199 | unscaledSupportVectors = findUnscaledSupportVectors(features_train_unscaled,features_train,svm.support_vectors_) 200 | support_vectors.append(unscaledSupportVectors) 201 | 202 | 203 | 204 | return support_vectors, class1, class0,features_train_unscaled,features_train 205 | 206 | 207 | '''make this function because we need to keep only the support vectors from the class with bigger amount of samples in order to train 208 | the model with the support vectors only the class0 and all the samples from the class1, also we need to remove the duplicates because 209 | it is possible that we took duplicates as support vectors, and to delete the support vectors from class1. In this function we are doing the same procedure as previous in order to classify with SVM, but we are using only the samples 210 | from class0 that in our previous iterations they appear themselves as support vectors and all the samples from the class1. We are 211 | doing this because we have discrepancies in the amount of samples of the two classes. Trying to get better training results.''' 212 | def initSupportVectors(support_vectors, class1,features_train_unscaled,features_train): 213 | flattened_list = [] 214 | 215 | #run the list of lists, every list contains on single samples which is support vector 216 | for x in support_vectors: 217 | #for every samples in the support vector list 218 | for y in x: 219 | flattened_list.append(list(y)) 220 | 221 | 222 | print 'Amount of support vectors with duplicates: ',len(flattened_list) 223 | 224 | #use this command to remove all the duplicates of the list 225 | uniqueSupportVectors = [list(l) for l in set(tuple(l) for l in flattened_list)] 226 | 227 | #now we need to remove all the samples that are support vectors but they come from the class1, so we are going to check which 228 | #samples of our list are in the class1 list as well. 
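    #A compact sketch of the same idea with set operations (an illustrative assumption, not the code used
    #below, and it relies on the rows matching exactly as floats):
    #    unique_rows = set(tuple(row) for row in flattened_list)             #drop duplicate support vectors
    #    class1_rows = set(tuple(row[:13]) for row in class1.values)         #class1 feature rows without the label
    #    class0_support = [list(row) for row in unique_rows - class1_rows]   #keep only class0 support vectors
    #The loop-based implementation of this filtering follows below.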
229 | 230 | #convert the dataFrame into a list which has sublists and every lists contains the features, for exampes the sublist[0] contains 231 | #all the Feature1 of every samples ans so on, so we have to divide it in order to make the real samples 232 | 233 | 234 | 235 | #1825-2102, take every row of the data frame add put it in a list of lists 236 | class1SamplesInaList = [] 237 | #here we iterate the dataFrame, particularly the rows we define with the range 238 | for x in range(1825,2103): 239 | #here we take the specific row 240 | class1Sample = class1.loc[[x]] 241 | #we need to take all the features from this row but the label, because it returns a list of lists we need to join 242 | #this list of lists into one list which contains one single sample of the class1 243 | class1SamplesInaList.append(list(itertools.chain.from_iterable(class1Sample.values.T.tolist()[:13]))) 244 | 245 | #continue this procedure we are going to check if a vector is in both array, if this is true it means tha this vector 246 | #is a support vector because it belongs in the support_vector list and it belongs to the class1 because it is in the 247 | #class1SamplesInaList array so with erase this element from the support vector array, when this procedure is over it means 248 | #that the elements that remain in the support_vector_array is from class one and support_vectors, this is the goal we define. 249 | 250 | #class1SamplesInaList the array which contains as a list every sample of class1 one in a list of lists 251 | #uniqueSupportVectors contains all the samples that used as support_vectors in every iteration erasing the duplicates 252 | #becuase the samples from class1 took place in every iteration 253 | 254 | 255 | #iterate every sample of the class1 in order to check if it exists in the list 256 | for x in class1SamplesInaList: 257 | #if it exists, we need to delete it 258 | if(x in uniqueSupportVectors): 259 | #remove the specific list from the support_vectors that we are going to use 260 | uniqueSupportVectors.remove(x) 261 | 262 | print 'Amount of support vectors without duplicates', len(uniqueSupportVectors),'\n' 263 | 264 | 265 | #so we know that this array contains the support_vectors which are samples only from class0 and there is no duplicates, 266 | #so we have to add a last element in every array to declare the label of the samples and it is going to be 0 because 267 | #we know the class that the samples come from 268 | for x in uniqueSupportVectors: 269 | x.append(0) 270 | 271 | #initialize the dataframe that we want to return 272 | support_vectors_dataframe = pd.DataFrame(columns=['Feature1','Feature2','Feature3','Feature4','Feature5','Feature6','Feature7','Feature8','Feature9','Feature10','Feature11','Feature12','Feature13','Label']) 273 | for x in range(len(uniqueSupportVectors)): 274 | #we need to add the columns and the rows of the dataframe so we are going to do it manually 275 | support_vectors_dataframe.loc[x] = [uniqueSupportVectors[x][y] for y in range(len(uniqueSupportVectors[x]))] 276 | 277 | 278 | #return the dataframe which contains all the support vectors from all the iteration of training with all the units of samples 279 | #of only the class0, and now we are ready to train with them and all the samples of the class1 280 | return pd.concat([support_vectors_dataframe,class1]) 281 | #returns the new data ready to train the model 282 | #the samples from class0 which were support_vectros and all the samples from class1 283 | 284 | 285 | '''because we scale and we scale again the 
same data, we have to find the pre-scaled data and feed our classifier otherwise 286 | it is going to be perfect because of the consecutivr normalizations. In this function we normalize all our data, then we take 287 | the support vectors from class0 from the function initSupportVectors and we are mapping the scaled support vectors to the 288 | pre-scaled samples that they were support vectors after scaling. In order not to scale and scale again the already scaled data. The gene- 289 | ral problem is that we scale again the support vectors of class0, and the samples from class1 they just be scaled once''' 290 | def findUnscaledSupportVectors(features_train_unscaled,features_train,support_vectors): 291 | #we are applying the same normalization because is the same data so we are going to end up with the same results 292 | #normalize data in the range [-1,1] 293 | #scaler = MinMaxScaler(feature_range=(-1, 1)) 294 | #fit only th training data in order to find the margin and then test to data without normalize them 295 | #fit exactly the same data as before 296 | #scaler.fit(features_train_unscaled) 297 | 298 | unScaled_support_vectors = [] 299 | 300 | 301 | #now we mapped the unscaled data with the scaled data as concerns the position in the array 302 | #check all the train features 303 | for searchSample in features_train: 304 | #because we add support_vectors for every iteration, check for all the iterations 305 | #for every array in the array support_vectors 306 | #if what I search is a support vector then 307 | if searchSample in support_vectors: 308 | position = np.where(features_train == searchSample) 309 | #print searchSample 310 | #print features_train[position[0][0]] 311 | #which means that all the lements are equal 312 | if(position[1][len(position[1])-1] == 12): 313 | #then add it to the unscaled support vectors 314 | unScaled_support_vectors.append(features_train_unscaled[position[0][0]]) 315 | 316 | 317 | 318 | print len(support_vectors) 319 | print len(unScaled_support_vectors) 320 | print 'position',position 321 | #sys.exit() 322 | return unScaled_support_vectors 323 | 324 | 325 | 326 | def main(): 327 | #calculate the time 328 | import time 329 | start_time = time.time() 330 | 331 | data = readFile() 332 | 333 | 334 | 335 | while True: 336 | 337 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 338 | support_vectors, class1, class0,features_train_unscaled,features_train = classifyPHC(data) 339 | 340 | 341 | #make class0 list 342 | 343 | #flat the list of lists into one list 344 | data = initSupportVectors(support_vectors, class1,features_train_unscaled,features_train) 345 | 346 | print 'END OF ITERATION NOW WE ARE TRAINING WITH A NEW REDUCED SET OF SUPPORT VECTORS FROM CLASS 0 and data set length',len(data),'\n\n\n\n\n\n\n\n' 347 | 348 | 349 | time = time.time()-start_time 350 | print 'time: ',time 351 | 352 | main() 353 | 354 | 355 | -------------------------------------------------------------------------------- /classifiers/SVMs/svm_multiclass.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | 9 | 10 | '''this function takes as an input the path of a 
file with features and labels and returns the content of this file as a csv format in 11 | the form feature1.........feature13,Label''' 12 | def readFile(): 13 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 14 | #sample hc/pc : helathy case, parkinson case 15 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 16 | 'Feature10','Feature11','Feature12','Feature13','Label'] 17 | 18 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 19 | path = 'PATH_TO_SAMPLES.txt' 20 | #read file in csv format 21 | data = pd.read_csv(path,names=names ) 22 | 23 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 24 | return data 25 | 26 | 'takes the csv file and split the label from the features' 27 | def splitData(data): 28 | # Split-out the set in two different arrayste 29 | array = data.values 30 | #features array contains only the features of the samples 31 | features = array[:,0:13] 32 | #labels array contains only the lables of the samples 33 | labels = array[:,13] 34 | 35 | return features,labels 36 | 37 | 38 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 39 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 40 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 41 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 42 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 43 | def paramTuning(features_train, labels_train, nfolds): 44 | #using the training data and define the number of folds 45 | #determine the range of the Cs range you want to search 46 | Cs = [1, 10, 100, 1000, 10000] 47 | 48 | #determine the range of the gammas range you want to search 49 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001] 50 | 51 | #make the dictioanry 52 | param_grid = {'C': Cs, 'gamma': gammas} 53 | 54 | #start the greedy search using all the matching sets from above 55 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 56 | 57 | #fit your training data 58 | grid_search.fit(features_train, labels_train) 59 | 60 | #visualize the best couple of parameters 61 | print grid_search.best_params_ 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 72 | def classifyPHC(): 73 | data = readFile() 74 | #data = equalizeClasses(data) 75 | features,labels = splitData(data) 76 | 77 | #determine the training and testing size in the range of 1, 1 = 100% 78 | validation_size = 0.2 79 | 80 | #here we are splitting our data based on the validation_size into training and testing data 81 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 82 | test_size=validation_size) 83 | 84 | #we can see the shapes of the array just to check 85 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 86 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 87 | 88 | #take the best couple of parameters from the procedure of greedy search 89 | #paramTuning(features_train, labels_train, 5) 90 | 91 | #we initialize our model 92 | svm = SVC(kernel='rbf',C=100,gamma=1e-05,decision_function_shape='ovr') 93 | 94 | #train our model with the data that we previously precessed 95 | svm.fit(features_train,labels_train) 96 | 97 | 98 | #now test our model with the test data 99 | predicted_labels = svm.predict(features_validation) 100 | accuracy = accuracy_score(labels_validation, predicted_labels) 101 | print 'Classification accuracy: ',accuracy*100,'\n' 102 | 103 | #confusion matrix to illustrate the faulty classification of each class 104 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 105 | print 'Confusion matrix: \n',conf_matrix,'\n' 106 | print 'Support class 0 class 1 class2:' 107 | #calculate the support of each class 108 | print ' ',conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2],' ',conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2],' ',conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2],'\n' 109 | 110 | #calculate the accuracy of each class 111 | edema = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2]))*100 112 | paralysis = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2]))*100 113 | normal = (conf_matrix[2][2]/(conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2]))*100 114 | 115 | #see the inside details of the classification 116 | print 'For class 0 edema cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1]+conf_matrix[0][2],'missclassified,',edema,'accuracy \n' 117 | print 'For class 1 paralysis cases:',conf_matrix[1][1],'classified correctly 
and',conf_matrix[1][0]+conf_matrix[1][2],'missclassified,',paralysis,'accuracy\n' 118 | print 'For class 2 normal cases:',conf_matrix[2][2],'classified correctly and',conf_matrix[2][0]+conf_matrix[2][1],'missclassified,',normal,'accuracy \n' 119 | 120 | #try 5-fold cross validation 121 | scores = cross_val_score(svm, features_train, labels_train, cv=5) 122 | print 'cross validation scores for 5-fold',scores,'\n' 123 | print 'parameters of the model: \n',svm.get_params(),'\n' 124 | 125 | print 'number of samples used as support vectors',len(svm.support_vectors_) 126 | 127 | 128 | 129 | classifyPHC() 130 | 131 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/README.md: -------------------------------------------------------------------------------- 1 | # kpca_lda_knn_equalizeClasses.py 2 | This script uses KernelPCA as a first step to reduce the dimensionality of the data, then LDA to bring the data 3 | down to (number of classes - 1) dimensions, and finally a kNN classifier. 4 | # kpca_lda_knn_multiclass.py 5 | The same pipeline as the script above, but for 3 classes. 6 | # pca_kpca_from-skratch.py 7 | Implementation of Principal Component Analysis and KernelPCA from scratch 8 | # graph_spectral_analysis&spectral_clustering_default.py 9 | Applies dimensionality reduction using graph spectral analysis (LLE, Isomap etc.) and then spectral clustering 10 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/graph_spectral_analysis&spectral_clustering_default.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | from sklearn.preprocessing import MinMaxScaler 10 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 11 | import numpy as np 12 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 13 | import matplotlib.pyplot as plt 14 | from matplotlib.pyplot import figure 15 | import seaborn as sns 16 | from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding, Isomap 17 | from sklearn.cluster import SpectralClustering 18 | from sklearn.metrics.cluster import homogeneity_score 19 | 20 | '''this function takes as an input the path of a file with features and labels and returns the content of this file in csv format in 21 | the form feature1.........feature13,Label''' 22 | def readFile(): 23 | #make the format of the csv file. Our format is a vector with 13 features and a label which shows the condition of the 24 | #sample hc/pc : healthy case, parkinson case 25 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 26 | 'Feature10','Feature11','Feature12','Feature13','Label'] 27 | 28 | #path to read the samples, the samples consist of healthy subjects and subjects suffering from Parkinson's disease. 
29 | #path = 'mfcc_man_woman.txt' 30 | path = 'PATH_TO_SAMPLES.txt' 31 | #path = '/home/gionanide/Theses_2017-2018_2519/features/parkinson_healthy/mfcc_parkinson_healthy.txt' 32 | 33 | #read file in csv format 34 | data = pd.read_csv(path,names=names ) 35 | 36 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 37 | return data 38 | 39 | 'takes the csv file and split the label from the features' 40 | def splitData(data): 41 | # Split-out the set in two different arrayste 42 | array = data.values 43 | #features array contains only the features of the samples 44 | features = array[:,0:13] 45 | #labels array contains only the lables of the samples 46 | labels = array[:,13] 47 | 48 | return features,labels 49 | 50 | ''' 51 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 52 | than class1, particularly it is 9 to 1.''' 53 | def equalizeClasses(data): 54 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 55 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 56 | class1 = data.loc[data['Label'] == 1] 57 | 58 | 59 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 60 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 61 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 62 | 63 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 64 | 65 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 66 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 67 | #class0 = class0.sample(frac=1) 68 | 69 | #samples array for training taking the balance number of samples for the shuffled dataFrame 70 | newClass0 = class0.sample(n=balance) 71 | 72 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 73 | newData = pd.concat([newClass0, class1]) 74 | 75 | #return the new balanced(number of samples from each class) dataFrame 76 | return newData 77 | 78 | '''we made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 79 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 80 | vectors, the samples only the class which we are taking a piece of it's samples''' 81 | def keepSV(): 82 | print 'yolo' 83 | 84 | 85 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 86 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 87 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 88 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 89 | the best results. 
This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 90 | def paramTuning(features_train, labels_train, nfolds): 91 | #using the training data and define the number of folds 92 | #determine the range of the Cs range you want to search 93 | Cs = [0.001, 0.01, 0.1 ,1, 10, 100, 1000, 10000] 94 | 95 | #determine the range of the gammas range you want to search 96 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 , 1, 10, 100, 1000] 97 | 98 | #make the dictioanry 99 | param_grid = {'C': Cs, 'gamma': gammas} 100 | 101 | #start the greedy search using all the matching sets from above 102 | grid_search = GridSearchCV(SVC(kernel='poly'),param_grid,cv=nfolds) 103 | 104 | #fit your training data 105 | grid_search.fit(features_train, labels_train) 106 | 107 | #visualize the best couple of parameters 108 | print grid_search.best_params_ 109 | 110 | 111 | 112 | '''Classify Parkinson and Helathy. Building a model which is going to be trained with of given cases and test according to new ones''' 113 | def classifyPHC(): 114 | data = readFile() 115 | #data = equalizeClasses(data) 116 | features,labels = splitData(data) 117 | 118 | #determine the training and testing size in the range of 1, 1 = 100% 119 | validation_size = 0.2 120 | 121 | #here we are splitting our data based on the validation_size into training and testing data 122 | #features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 123 | #test_size=validation_size) 124 | 125 | 126 | #we are using all the features because it is clustering so we do not want to split to testing and training 127 | #bacause we apply unsupervised techniques 128 | 129 | #normalize data in the range [-1,1] 130 | scaler = MinMaxScaler(feature_range=(-1, 1)) 131 | #fit only th training data in order to find the margin and then test to data without normalize them 132 | scaler.fit(features) 133 | 134 | features_scalar = scaler.transform(features) 135 | 136 | #trnasform the validation features without fitting them 137 | #features_validation_scalar = scaler.transform(features_validation) 138 | 139 | 140 | #apply the dimensionality reduction using graph spectral analysis 141 | 142 | '''#LocallyLinearEmbedding 143 | 144 | lle = LocallyLinearEmbedding(n_components=2) 145 | 146 | 147 | #transform data 148 | features_embedded = lle.fit_transform(features_scalar)''' 149 | 150 | '''#Isometric Mapping 151 | 152 | isomap = Isomap(n_components=2) 153 | 154 | 155 | #transform data 156 | features_embedded = isomap.fit_transform(features_scalar)''' 157 | 158 | #Graph embedding 159 | 160 | spectralEmbedding = SpectralEmbedding(n_components=2) 161 | 162 | #transform training and validation data 163 | features_embedded = spectralEmbedding.fit_transform(features_scalar) 164 | 165 | 166 | 167 | #we can see the shapes of the array just to check 168 | print 'feature training array: ',features_embedded.shape #,'and label training array: ',labels_train.shape 169 | #print 'feature testing array: ',features_validation_embedded.shape,'and label testing array: ',labels_validation.shape,'\n' 170 | 171 | 172 | #take the best couple of parameters from the procedure of greedy search 173 | #paramTuning(features_train, labels_train, 5) 174 | 175 | #we initialize our model 176 | #svm = SVC(kernel='poly',C=0.001,gamma=10,degree=3,decision_function_shape='ovr') 177 | #svm = KNeighborsClassifier(n_neighbors=3) 178 | 179 | #Apply spectral clustering 
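    #A brief usage sketch of this step (illustrative assumption; the actual call used by the script follows):
    #SpectralClustering builds an affinity graph over the embedded samples, takes the eigenvectors of its graph
    #Laplacian and runs k-means on that spectral embedding, e.g.
    #    clustering = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=10)
    #    cluster_labels = clustering.fit_predict(features_embedded)
    #Because cluster indices are arbitrary, label-invariant measures such as homogeneity_score (used below) or
    #sklearn.metrics.adjusted_rand_score are preferable to plain accuracy when comparing against the true labels.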
180 | 181 | spectralClustering = SpectralClustering(n_clusters=2) 182 | 183 | 184 | 185 | 186 | #train our model with the data that we previously precessed 187 | #spectralClustering.fit(features_embedded ) 188 | 189 | #now test our model with the test data 190 | spectralClustering.fit(features_embedded) 191 | 192 | predicted_labels = spectralClustering.labels_ 193 | 194 | #first implementation of score computing 195 | #accuracy = accuracy_score(labels, predicted_labels) 196 | 197 | 198 | #More accurate implementation, considering opposite labels 199 | accuracy = homogeneity_score(labels, predicted_labels) 200 | print 'Clustering accuracy: ',accuracy*100,'\n' 201 | 202 | 203 | #confusion matrix to illustrate the faulty classification of each class 204 | conf_matrix = confusion_matrix(labels, predicted_labels) 205 | print 'Confusion matrix: \n',conf_matrix,'\n' 206 | print 'Support class 0 class 1:' 207 | #calculate the support of each class 208 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 209 | 210 | #calculate the accuracy of each class 211 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 212 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 213 | 214 | #see the inside details of the classification 215 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 216 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 217 | 218 | 219 | #plot the training features after the kpca and the lda procedure 220 | embedded_labels = pd.DataFrame({'Feature1': features_embedded[: ,0], 'Feature2': features_embedded[: ,1],'Label': labels}) 221 | sns.pairplot(embedded_labels, hue='Label') 222 | #plt.savefig('kpca_trainset_parkinson_healthy.png') 223 | #plt.show() 224 | 225 | #plot the training features after the kpca and the lda procedure 226 | embedded_predicted_labels = pd.DataFrame({'Feature1': features_embedded[: ,0], 'Feature2': features_embedded[: ,1],'Label': predicted_labels}) 227 | sns.pairplot(embedded_predicted_labels, hue='Label') 228 | #plt.savefig('kpca_trainset_parkinson_healthy.png') 229 | plt.show() 230 | 231 | 232 | 233 | def main(): 234 | #calculate the time 235 | import time 236 | start_time = time.time() 237 | 238 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 239 | #support_vectors = 240 | classifyPHC() 241 | 242 | time = time.time()-start_time 243 | print 'time: ',time 244 | 245 | main() 246 | 247 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/kpca_lda_knn_equalizeClasses.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | import timeit 9 | from sklearn.preprocessing import MinMaxScaler 10 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 11 | from sklearn.decomposition import IncrementalPCA, PCA, KernelPCA 12 | import numpy as np 13 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 14 | 
import matplotlib.pyplot as plt 15 | from matplotlib.pyplot import figure 16 | import seaborn as sns 17 | 18 | 19 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 20 | the form feature1.........feature13,Label''' 21 | def readFile(): 22 | #make the format of the csv file. Our format is a vector with 13 features and a label which show the condition of the 23 | #sample hc/pc : helathy case, parkinson case 24 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 25 | 'Feature10','Feature11','Feature12','Feature13','Label'] 26 | 27 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 28 | path = 'PATH_TO_SAMPLES.txt' 29 | #path = '/home/gionanide/Theses_2017-2018_2519/features/parkinson_healthy/mfcc_parkinson_healthy.txt' 30 | 31 | #read file in csv format 32 | data = pd.read_csv(path,names=names ) 33 | 34 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 35 | return data 36 | 37 | 'takes the csv file and split the label from the features' 38 | def splitData(data): 39 | # Split-out the set in two different arrayste 40 | array = data.values 41 | #features array contains only the features of the samples 42 | features = array[:,0:13] 43 | #labels array contains only the lables of the samples 44 | labels = array[:,13] 45 | 46 | return features,labels 47 | 48 | ''' 49 | make this class in order to train the model with the same amount of samples of each class, because we have bigger support from class 0 50 | than class1, particularly it is 9 to 1.''' 51 | def equalizeClasses(data): 52 | #take all the samples from the data frame that they have Label value equal to 0 and in the next line equal to 1 53 | class0 = data.loc[data['Label'] == 0]#class0 and class1 are dataFrames 54 | class1 = data.loc[data['Label'] == 1] 55 | 56 | 57 | #check which class has more samples, by divide them and check if the number is bigger or smaller than 1 58 | weight = len(class0) // len(class1) #take the results as an integer in order to split the class, using prior knowledge that 59 | #class0 has more samples, if it is bigger class0 has more samples and to be exact weight to 1 60 | 61 | balance = (len(class0) // weight) #this is the number of samples in order to balance our classes 62 | 63 | #the keyword argument frac specifies the fraction of rows to return in the random sample, so fra=1 means, return random all rows 64 | #we kind of a way shuffle our data in order not to take the same samples in every iteration 65 | #class0 = class0.sample(frac=1) 66 | 67 | #samples array for training taking the balance number of samples for the shuffled dataFrame 68 | newClass0 = class0.sample(n=balance) 69 | 70 | #and now combine the new dataFrame from class0 with the class1 to return the balanced dataFrame 71 | newData = pd.concat([newClass0, class1]) 72 | 73 | #return the new balanced(number of samples from each class) dataFrame 74 | return newData 75 | 76 | '''we made this function in order to make a loop, the equalized data take only a small piece of the existing data, so with this 77 | loop we are going to take iteratably all the data, but from every iteration we are keeping only the samples who were support 78 | vectors, the samples only the class which we are taking a piece of it's samples''' 79 | #def keepSV(): 80 | 81 | 82 | 83 | '''we use this function 
in order to apply grid search for finding the parameters that best fit our model. We have to mention 84 | that we started this procedure from a very large range and then tried to focus on the direction where the results 85 | appeared better. For example, for the C parameter the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000]; the result was that 86 | the best value was 1000, so we then tried [100, 1000, 10000, 100000] and so on in order to focus on the area which gives us 87 | the best results. This function call is commented out because we found the best parameters and we don't need to run it in every trial.''' 88 | def paramTuning(features_train, labels_train, nfolds): 89 | #use the training data and define the number of folds 90 | #determine the range of C values you want to search 91 | Cs = [0.001, 0.01, 0.1 ,1, 10, 100, 1000, 10000] 92 | 93 | #determine the range of gamma values you want to search 94 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 , 1, 10, 100, 1000] 95 | 96 | #make the dictionary 97 | param_grid = {'C': Cs, 'gamma': gammas} 98 | 99 | #start the grid search over all the parameter combinations from above 100 | grid_search = GridSearchCV(SVC(kernel='poly'),param_grid,cv=nfolds) 101 | 102 | #fit your training data 103 | grid_search.fit(features_train, labels_train) 104 | 105 | #print the best pair of parameters 106 | print grid_search.best_params_ 107 | 108 | 109 | 110 | '''Classify Parkinson vs. Healthy. Build a model which is trained on the given cases and tested on new ones''' 111 | def classifyPHC(): 112 | data = readFile() 113 | #data = equalizeClasses(data) 114 | features,labels = splitData(data) 115 | 116 | #determine the training and testing size in the range of 1, 1 = 100% 117 | validation_size = 0.2 118 | 119 | #here we are splitting our data based on the validation_size into training and testing data 120 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 121 | test_size=validation_size) 122 | 123 | 124 | #normalize data in the range [-1,1] 125 | scaler = MinMaxScaler(feature_range=(-1, 1)) 126 | #fit only the training data in order to find the scaling range, then transform the test data without refitting 127 | scaler.fit(features_train) 128 | 129 | features_train_scalar = scaler.transform(features_train) 130 | 131 | #transform the validation features without fitting them 132 | features_validation_scalar = scaler.transform(features_validation) 133 | 134 | 135 | #define the KernelPCA, and determine the dimension you want to end up with 136 | pca = KernelPCA(n_components=6,kernel='rbf',fit_inverse_transform=True) 137 | 138 | #fit only the training features 139 | pca.fit(features_train_scalar) 140 | 141 | #dimensionality reduction of the training features 142 | features_train_pca = pca.transform(features_train_scalar) 143 | 144 | #dimensionality reduction of the validation features 145 | features_validation_pca = pca.transform(features_validation_scalar) 146 | 147 | #reconstruct the training data in order to estimate the reconstruction error 148 | reconstruct_data = pca.inverse_transform(features_train_pca) 149 | error_matrix = np.absolute(features_train_scalar - reconstruct_data) #element-wise absolute reconstruction error 150 | error_percentage = (sum(sum(error_matrix))/(len(features_train_scalar)*len(features_train_scalar[0])))*100 151 | 152 | #len(features_train_scalar) = len(reconstruct_data) = 89 153 | #len(features_train_scalar[0]) = len(reconstruct_data[0]) = 13 154 | 155 | #len(error_matrix) = 89, one row for every sample 156 | #len(error_matrix[0]) = 13, for every feature of every 
sample 157 | #we take the sum and we conlcude in an array which has the sum for every feature (error) 158 | #so we take the sum again and we divide it with the 89 samples * 13 features 159 | print 'Information loss of KernelPCA:',error_percentage,'% \n' 160 | 161 | 162 | lda = LinearDiscriminantAnalysis() 163 | 164 | lda.fit(features_train_pca,labels_train) 165 | 166 | features_train_pca = lda.transform(features_train_pca) 167 | 168 | features_validation_pca = lda.transform(features_validation_pca) 169 | 170 | #we can see the shapes of the array just to check 171 | print 'feature training array: ',features_train.shape,'and label training array: ',labels_train.shape 172 | print 'feature testing array: ',features_validation.shape,'and label testing array: ',labels_validation.shape,'\n' 173 | 174 | 175 | #take the best couple of parameters from the procedure of greedy search 176 | #paramTuning(features_train, labels_train, 5) 177 | 178 | #we initialize our model 179 | #svm = SVC(kernel='poly',C=0.001,gamma=10,degree=3,decision_function_shape='ovr') 180 | svm = KNeighborsClassifier(n_neighbors=3) 181 | 182 | 183 | 184 | 185 | #train our model with the data that we previously precessed 186 | svm.fit(features_train_pca,labels_train) 187 | 188 | #now test our model with the test data 189 | predicted_labels = svm.predict(features_validation_pca) 190 | accuracy = accuracy_score(labels_validation, predicted_labels) 191 | print 'Classification accuracy: ',accuracy*100,'\n' 192 | 193 | #see the accuracy in training procedure 194 | predicted_labels_train = svm.predict(features_train_pca) 195 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 196 | print 'Training accuracy: ',accuracy_train*100,'\n' 197 | 198 | #confusion matrix to illustrate the faulty classification of each class 199 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 200 | print 'Confusion matrix: \n',conf_matrix,'\n' 201 | print 'Support class 0 class 1:' 202 | #calculate the support of each class 203 | print ' ',conf_matrix[0][0]+conf_matrix[0][1],' ',conf_matrix[1][0]+conf_matrix[1][1],'\n' 204 | 205 | #calculate the accuracy of each class 206 | hC = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]))*100 207 | pC = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]))*100 208 | 209 | #see the inside details of the classification 210 | print 'For class 0 man cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1],'missclassified,',hC,'accuracy \n' 211 | print 'For class 1 woman cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0],'missclassified,',pC,'accuracy\n' 212 | 213 | 214 | #try 5-fold cross validation 215 | scores = cross_val_score(svm, features_train_pca, labels_train, cv=5) 216 | print 'cross validation scores for 5-fold',scores,'\n' 217 | print 'parameters of the model: \n',svm.get_params(),'\n' 218 | 219 | #print 'number of samples used as support vectors',len(svm.support_vectors_),'\n' 220 | 221 | #return svm.support_vectors_ 222 | 223 | '''#plot the training features before the kpca and the lda procedure 224 | kpca_lda = pd.DataFrame({'Feature1': features_train[: ,0], 'Feature2': features_train[: ,1],'Feature3': features_train[: ,2], 'Feature4': features_train[: ,3],'Feature5': features_train[: ,4],'Feature6': features_train[: ,5],'Feature7': features_train[: ,6],'Feature8': features_train[: ,7],'Feature9': features_train[: ,8],'Feature10': features_train[: ,9],'Feature11': features_train[: ,10],'Feature12': features_train[: 
,11],'Feature13': features_train[: ,12],'Label': labels_train}) 225 | #'Feature10','Feature11','Feature12','Feature13','Label']) 226 | sns.pairplot(kpca_lda, hue='Label') 227 | plt.savefig('training_features_female_male.png') 228 | #plt.show() 229 | 230 | #plot the validation features before the kpca and the lda procedure 231 | kpca_lda = pd.DataFrame({'Feature1': features_validation[: ,0], 'Feature2': features_validation[: ,1],'Feature3': features_validation[: ,2], 'Feature4': features_validation[: ,3],'Feature5': features_validation[: ,4],'Feature6': features_validation[: ,5],'Feature7': features_validation[: ,6],'Feature8': features_validation[: ,7],'Feature9': features_validation[: ,8],'Feature10': features_validation[: ,9],'Feature11': features_validation[: ,10],'Feature12': features_validation[: ,11],'Feature13': features_validation[: ,12],'Label': labels_validation}) 232 | #'Feature10','Feature11','Feature12','Feature13','Label']) 233 | sns.pairplot(kpca_lda, hue='Label') 234 | plt.savefig('validation_features_female_male.png') 235 | #plt.show() 236 | 237 | #plot the training features after the kpca and the lda procedure 238 | kpca_lda = pd.DataFrame({'Feature1': features_train_pca[:, 0],'Label': labels_train}) 239 | sns.pairplot(kpca_lda, hue='Label') 240 | plt.savefig('kpca_lda_knn_trainingset_female_male.png') 241 | #plt.show() 242 | 243 | #plot the validation features after the kpca and the lda procedure 244 | kpca_lda = pd.DataFrame({'Feature1': features_validation_pca[:, 0],'Label': labels_validation}) 245 | sns.pairplot(kpca_lda, hue='Label') 246 | plt.savefig('kpca_lda_knn_validationset_female_male.png') 247 | plt.show()''' 248 | 249 | def main(): 250 | #calculate the time 251 | import time 252 | start_time = time.time() 253 | 254 | #we are making an array in order to keep the support vectors and feed the function with them for the next iteration 255 | #support_vectors = 256 | classifyPHC() 257 | 258 | time = time.time()-start_time 259 | print 'time: ',time 260 | 261 | main() 262 | 263 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/kpca_lda_knn_multiclass.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pandas as pd 4 | from sklearn import model_selection 5 | from sklearn.svm import SVC # support vectors for classification 6 | from sklearn.metrics import accuracy_score, confusion_matrix 7 | from sklearn.model_selection import cross_val_score, GridSearchCV 8 | from sklearn.preprocessing import MinMaxScaler 9 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 10 | from sklearn.decomposition import IncrementalPCA, PCA, KernelPCA 11 | import numpy as np 12 | from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 13 | import matplotlib.pyplot as plt 14 | from matplotlib.pyplot import figure 15 | import seaborn as sns 16 | 17 | 18 | '''this function takes as an input the path of a file with features and labels and returns the content of this file as a csv format in 19 | the form feature1.........feature13,Label''' 20 | def readFile(): 21 | #make the format of the csv file. 
Our format is a vector with 13 features and a label which show the condition of the 22 | #sample hc/pc : helathy case, parkinson case 23 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 24 | 'Feature10','Feature11','Feature12','Feature13','Label'] 25 | 26 | #path to read the samples, samples consist from healthy subjects and subject suffering from Parkinson's desease. 27 | path = 'PATH_TO_SAMPLES.txt' 28 | #read file in csv format 29 | data = pd.read_csv(path,names=names ) 30 | 31 | #return an array of the shape (2103, 14), lines are the samples and columns are the features as we mentioned before 32 | return data 33 | 34 | 'takes the csv file and split the label from the features' 35 | def splitData(data): 36 | # Split-out the set in two different arrayste 37 | array = data.values 38 | #features array contains only the features of the samples 39 | features = array[:,0:13] 40 | #labels array contains only the lables of the samples 41 | labels = array[:,13] 42 | 43 | return features,labels 44 | 45 | 46 | '''we use this function in order to apply greedy search for finding the parameters that best fit our model. We have to mention 47 | that we start this procedure from a very large field and then we tried to focues to the direction where the results 48 | appear better. For example for the C parameter, the first range was [0.0001, 0.001, 0.01, 0.1, 1, 10 ,100 ,1000], the result was that 49 | the best value was 1000 so then we tried [100, 1000, 10000, 100000] and so on in order to focues to the area which give us 50 | the best results. This function is in comments because we found the best parameters and we dont need to run it in every trial.''' 51 | def paramTuning(features_train, labels_train, nfolds): 52 | #using the training data and define the number of folds 53 | #determine the range of the Cs range you want to search 54 | Cs = [0.001 ,0.01 ,0.1 ,1 , 10, 100, 1000, 10000] 55 | 56 | #determine the range of the gammas range you want to search 57 | gammas = [0.00000001 ,0.00000001 ,0.0000001, 0.000001, 0.00001 , 0.0001, 0.001, 0.01, 0.1, 1, 10, 100] 58 | 59 | #make the dictioanry 60 | param_grid = {'C': Cs, 'gamma': gammas} 61 | 62 | #start the greedy search using all the matching sets from above 63 | grid_search = GridSearchCV(SVC(kernel='rbf'),param_grid,cv=nfolds) 64 | 65 | #fit your training data 66 | grid_search.fit(features_train, labels_train) 67 | 68 | #visualize the best couple of parameters 69 | print grid_search.best_params_ 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | '''Building a model which is going to be trained with of given cases and test according to new ones''' 80 | def classifyPHC(): 81 | data = readFile() 82 | #data = equalizeClasses(data) 83 | features,labels = splitData(data) 84 | 85 | #determine the training and testing size in the range of 1, 1 = 100% 86 | validation_size = 0.2 87 | 88 | #here we are splitting our data based on the validation_size into training and testing data 89 | features_train, features_validation, labels_train, labels_validation = model_selection.train_test_split(features, labels, 90 | test_size=validation_size) 91 | 92 | #normalize data in the range [-1,1] 93 | scaler = MinMaxScaler(feature_range=(-1, 1)) 94 | #fit only th training data in order to find the margin and then test to data without normalize them 95 | scaler.fit(features_train) 96 | 97 | features_train_scalar = scaler.transform(features_train) 98 | 99 | #trnasform the validation features without fitting them 100 | 
features_validation_scalar = scaler.transform(features_validation) 101 | 102 | 103 | #define the KernelPCA, and determine the dimension you want to end up with 104 | pca = KernelPCA(n_components=5,kernel='rbf',fit_inverse_transform=True) 105 | 106 | #fit only the training features 107 | pca.fit(features_train_scalar) 108 | 109 | #dimensionality reduction of the training features 110 | features_train_pca = pca.transform(features_train_scalar) 111 | 112 | #dimensionality reduction of the validation features 113 | features_validation_pca = pca.transform(features_validation_scalar) 114 | 115 | #reconstruct the training data in order to estimate the reconstruction error 116 | reconstruct_data = pca.inverse_transform(features_train_pca) 117 | 118 | error_matrix = np.absolute(features_train_scalar - reconstruct_data) #element-wise absolute reconstruction error 119 | error_percentage = (sum(sum(error_matrix))/(len(features_train_scalar)*len(features_train_scalar[0])))*100 120 | 121 | #len(features_train_scalar) = len(reconstruct_data) = 89 122 | #len(features_train_scalar[0]) = len(reconstruct_data[0]) = 13 123 | 124 | #len(error_matrix) = 89, one row for every sample 125 | #len(error_matrix[0]) = 13, for every feature of every sample 126 | #we take the sum and end up with an array which has the error sum for every feature 127 | #so we take the sum again and divide it by the 89 samples * 13 features 128 | print 'Information loss of KernelPCA:',error_percentage,'% \n' 129 | 130 | 131 | lda = LinearDiscriminantAnalysis() 132 | 133 | lda.fit(features_train_pca,labels_train) 134 | 135 | features_train_pca = lda.transform(features_train_pca) 136 | 137 | features_validation_pca = lda.transform(features_validation_pca) 138 | 139 | 140 | 141 | #we can see the shapes of the arrays just to check 142 | print 'feature training array: ',features_train_pca.shape,'and label training array: ',labels_train.shape 143 | print 'feature testing array: ',features_validation_pca.shape,'and label testing array: ',labels_validation.shape,'\n' 144 | 145 | #take the best pair of parameters from the grid search procedure 146 | #paramTuning(features_train, labels_train, 5) 147 | 148 | #we initialize our model 149 | #svm = SVC(kernel='rbf',C=10,gamma=0.0001,decision_function_shape='ovo') 150 | svm = KNeighborsClassifier(n_neighbors=3) 151 | 152 | #train our model with the data that we previously processed 153 | svm.fit(features_train_pca,labels_train) 154 | 155 | 156 | #now test our model with the test data 157 | predicted_labels = svm.predict(features_validation_pca) 158 | accuracy = accuracy_score(labels_validation, predicted_labels) 159 | print 'Classification accuracy: ',accuracy*100,'\n' 160 | 161 | #see the accuracy of the training procedure 162 | predicted_labels_train = svm.predict(features_train_pca) 163 | accuracy_train = accuracy_score(labels_train, predicted_labels_train) 164 | print 'Training accuracy: ',accuracy_train*100,'\n' 165 | 166 | #confusion matrix to illustrate the faulty classification of each class 167 | conf_matrix = confusion_matrix(labels_validation, predicted_labels) 168 | print 'Confusion matrix: \n',conf_matrix,'\n' 169 | print 'Support class 0 class 1 class 2:' 170 | #calculate the support of each class 171 | print '   ',conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2],'   ',conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2],'   ',conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2],'\n' 172 | 173 | #calculate the accuracy of each class 174 | edema = (conf_matrix[0][0]/(conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[0][2]))*100 175 | paralysis = (conf_matrix[1][1]/(conf_matrix[1][0]+conf_matrix[1][1]+conf_matrix[1][2]))*100 176 | normal = 
(conf_matrix[2][2]/(conf_matrix[2][0]+conf_matrix[2][1]+conf_matrix[2][2]))*100 177 | 178 | #see the inside details of the classification 179 | print 'For class 0 edema cases:',conf_matrix[0][0],'classified correctly and',conf_matrix[0][1]+conf_matrix[0][2],'missclassified,',edema,'accuracy \n' 180 | print 'For class 1 paralysis cases:',conf_matrix[1][1],'classified correctly and',conf_matrix[1][0]+conf_matrix[1][2],'missclassified,',paralysis,'accuracy\n' 181 | print 'For class 0 normal cases:',conf_matrix[2][2],'classified correctly and',conf_matrix[2][0]+conf_matrix[2][1],'missclassified,',normal,'accuracy \n' 182 | 183 | #try 5-fold cross validation 184 | scores = cross_val_score(svm, features_train_pca, labels_train, cv=5) 185 | print 'cross validation scores for 5-fold',scores,'\n' 186 | print 'parameters of the model: \n',svm.get_params(),'\n' 187 | 188 | 189 | #PLOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOTS 190 | 191 | #sns.pairplot(data, hue='Label') 192 | #plt.savefig('data_visualization.png') 193 | #plt.title('Data visualization') 194 | #plt.show() 195 | 196 | #print features_train.shape 197 | #print len(features_train[:,0]) 198 | #print len(features_train[:,1]) 199 | #print len(labels_train) 200 | 201 | #plot the training features before the kpca and the lda procedure 202 | #kpca_lda = pd.DataFrame({'Feature1': features_train[: ,0], 'Feature2': features_train[: ,1],'Feature3': features_train[: ,2], 'Feature4': features_train[: ,3],'Feature5': features_train[: ,4],'Feature6': features_train[: ,5],'Feature7': features_train[: ,6],'Feature8': features_train[: ,7],'Feature9': features_train[: ,8],'Feature10': features_train[: ,9],'Feature11': features_train[: ,10],'Feature12': features_train[: ,11],'Feature13': features_train[: ,12],'Label': labels_train}) 203 | #'Feature10','Feature11','Feature12','Feature13','Label']) 204 | #sns.pairplot(kpca_lda, hue='Label') 205 | #plt.savefig('training_features.png') 206 | #plt.show() 207 | 208 | #plot the validation features before the kpca and the lda procedure 209 | #kpca_lda = pd.DataFrame({'Feature1': features_validation[: ,0], 'Feature2': features_validation[: ,1],'Feature3': features_validation[: ,2], 'Feature4': features_validation[: ,3],'Feature5': features_validation[: ,4],'Feature6': features_validation[: ,5],'Feature7': features_validation[: ,6],'Feature8': features_validation[: ,7],'Feature9': features_validation[: ,8],'Feature10': features_validation[: ,9],'Feature11': features_validation[: ,10],'Feature12': features_validation[: ,11],'Feature13': features_validation[: ,12],'Label': labels_validation}) 210 | #'Feature10','Feature11','Feature12','Feature13','Label']) 211 | #sns.pairplot(kpca_lda, hue='Label') 212 | #plt.savefig('validation_features.png') 213 | #plt.show() 214 | 215 | #plot the training features after the kpca and the lda procedure 216 | #kpca_lda = pd.DataFrame({'Feature1': features_train_pca[: ,0], 'Feature2': features_validation_pca[: ,1],'Label': labels_validation}) 217 | #sns.pairplot(kpca_lda, hue='Label') 218 | #plt.savefig('kpca_lda_knn_validationset.png') 219 | #plt.show() 220 | 221 | #plot the validation features after the kpca and the lda procedure 222 | #kpca_lda = pd.DataFrame({'Feature1': features_validation_pca[: ,0], 'Feature2': features_validation_pca[: ,1],'Label': labels_validation}) 223 | #sns.pairplot(kpca_lda, hue='Label') 224 | #plt.savefig('kpca_lda_knn_validationset.png') 225 | #plt.show() 226 | 227 | #print 'number of samples used as support 
vectors',len(svm.support_vectors_) 228 | 229 | 230 | def main(): 231 | import time 232 | start_time = time.time() 233 | classifyPHC() 234 | time = time.time()-start_time 235 | print 'time: ',time 236 | 237 | main() 238 | 239 | -------------------------------------------------------------------------------- /classifiers/dimensionality_reduction/pca_kpca_from-skratch.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import random 4 | import numpy as np 5 | from numpy import cov 6 | from numpy.linalg import eig, inv 7 | 8 | def PrincipalComponentAnalysis(dimensions_output,kernel_option,c): 9 | #make a random array with samples lets say 100 samples with dimension 20 10 | samples = np.random.rand(40,3) 11 | 12 | #print samples 13 | print 'samples shape: ',samples.shape,'\n' 14 | 15 | #for every sample I took the square distance of the mean, this is the variable that I want to maximize 16 | #calculate the mean values of each columns, so we have to transpose the matrix because the argument axis refers to row 17 | mean = np.mean(samples.transpose(),axis=1) 18 | 19 | print 'mean shape: ',mean.shape,'\n' 20 | 21 | #print mean 22 | #print mean.shape 23 | 24 | #we are going to center our matrix(the points) to the origin (0,0) by substracting the column means 25 | #samples = samples - mean 26 | 27 | 28 | #calculate the covariance matrix between two features 29 | #the arrays are inserted as transposed that why they are transposed again 30 | if (kernel_option): 31 | print 'Using KernelPCA with rbf kernel \n' 32 | #here we are taking the dimensions of the samples array, so the dimensions of the data 33 | #for every sample 34 | #initialize a numpy array in the shape of the sample array 35 | covSamples = np.zeros((samples.shape[0],samples.shape[1])) 36 | for x in range(samples.shape[0]): 37 | #for all the dimensions of the samples 38 | for y in range(samples.shape[1]): 39 | #insert in numpy array for the first row(first sample) the first column is the first feature 40 | #minus the mean of the first feature 41 | np.put(covSamples[x],y,np.exp(-(np.linalg.norm(samples[x][y] - mean[y])**2/c))) 42 | #break 43 | #break 44 | #samples = np.absolute(samples - mean)**2/c 45 | #print samples.shape 46 | covSamples = np.matmul(covSamples.transpose(),covSamples) 47 | else: 48 | print 'Using linear PCA \n' 49 | covSamples = np.matmul((samples - mean).transpose(),(samples - mean)) 50 | 51 | print 'covariance matrix shape: ',covSamples.shape,'\n' 52 | print covSamples,'\n' 53 | 54 | #print covSamples.shape 55 | 56 | #print covSamples.shape 57 | #print covSamples 58 | 59 | #find eigenvalues and eigenvectors 60 | eigenvalues, eigenvectors = eig(covSamples) 61 | 62 | print 'eigenvectors shape: ',eigenvectors.shape,'\n' 63 | print eigenvectors,'\n' 64 | 65 | #short eigenvectors 66 | sorted_eigenvalues = eigenvalues.argsort()[::-1] 67 | 68 | print 'sorted eigenvalues: ',sorted_eigenvalues,'\n' 69 | 70 | print 'eigenvalues',eigenvalues,'\n' 71 | 72 | 73 | #deterine the dimensions you want to keep based on the eigenvectors you want to multiple the smples 74 | dimensions = eigenvectors[:, sorted_eigenvalues] 75 | 76 | print 'w array shape: ',dimensions.shape,'\n' 77 | 78 | print 'w array sorted based on eigenvalues, every column represent one eigenvector \n',dimensions,'\n' 79 | 80 | #print dimensions.shape 81 | 82 | w = dimensions[:, :dimensions_output] 83 | 84 | print 'w final shape: ',w.shape,'\n' 85 | print 'w final array: \n',w,'\n' 
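    #--- editor's addition (hedged sketch): two quick sanity checks on the projection matrix w built above.
    #--- They rely only on numpy, which this script already imports as np; nothing else is assumed.
    #the kept eigenvectors should be approximately orthonormal, so w^T w should be close to the identity matrix
    orthonormality_error = np.linalg.norm(np.matmul(w.transpose(), w) - np.eye(w.shape[1]))
    print 'orthonormality error of w (should be close to 0): ', orthonormality_error
    #for the linear-PCA branch, the fraction of the total variance kept by the selected components
    #is the sum of the kept eigenvalues divided by the sum of all eigenvalues
    kept_variance = eigenvalues[sorted_eigenvalues][:dimensions_output].real.sum() / eigenvalues.real.sum()
    print 'fraction of variance kept (meaningful for the linear branch): ', kept_variance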
86 | 87 | 88 | print 'final vector multiplication samples:',samples.shape,'w array:',w.shape,'\n' 89 | 90 | samples = np.dot(samples,w) 91 | #dimensions = np.dot(samples) 92 | 93 | print 'dimensionality reduction samples shape: ',samples.shape,'\n' 94 | #print samples[0] 95 | 96 | #for x in eigenvectors: 97 | # print np.linalg.norm(x) 98 | 99 | 100 | 101 | 102 | 103 | 104 | #PrincipalComponentAnalysis(dimensions_output=1,kernel_option=True,c=1) 105 | -------------------------------------------------------------------------------- /classifiers/gmm.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | from __future__ import division 3 | import pickle 4 | import pandas as pd 5 | import numpy as np 6 | from sklearn.mixture import GaussianMixture 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | 10 | 11 | #lpc durbin-levinson 12 | 13 | 14 | #gmm training python 15 | def GaussianMixtureModel_only_for_testing(data): 16 | #A GMM attempts to find a mixture of multidimensional Gaussian probability distributions that best model any input dataset. 17 | #In the simplest case , GMM's can be used for finding clusters in the same manner as k-means. 18 | X,Y = preparingData(data) 19 | #taking only the first two features 20 | #Y = target variable 21 | gmm = GaussianMixture(n_components=2) 22 | #Estimate model parameters with the EM algorithm. 23 | gmm.fit(X) 24 | labels = gmm.predict(X) 25 | print labels 26 | 27 | plt.figure(1) 28 | #because of the probabilistic approach of GMM's it is possible to find a probabilistic cluster assignments. 29 | #porbs : is a matrix [samples, nClusters] which contains the probability of any point belongs to the given cluster 30 | probs = gmm.predict_proba(X).round(3) 31 | #which measures the probability that any point belongs to the given cluster: 32 | print probs 33 | 34 | #we can visualize this uncertainty . For instance let's make the size of each point proortional to the certainty 35 | #of its prediction. We are going to point the points at the boundaries between clusters. 36 | size = 50 * probs.max(1) ** 2 # square emphasizes differences 37 | 38 | #the weights of each mixture components 39 | weights = gmm.weights_ 40 | #the mean of each mixture component 41 | means = gmm.means_ 42 | #the covariance of each mixture component 43 | covars = gmm.covariances_ 44 | 45 | print 'weights: ',weights 46 | print 'means: ', means 47 | 48 | print gmm.score(X) 49 | #Predict the labels for the data samples in X using trained model. 50 | print labels[0] 51 | print Y 52 | print Y[0] 53 | 54 | 55 | #plots 56 | plt.scatter(X[:,5],X[:,6],c=labels,s=40,cmap='viridis') 57 | plt.show() 58 | 59 | def readFeaturesFile(gender): 60 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 61 | 'Feature10','Feature11','Feature12','Feature13','Gender'] 62 | 63 | #check the gender 64 | if(int(gender)==1): 65 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 66 | elif(int(gender)==0): 67 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 68 | else: 69 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 70 | #the outcome is a list of lists containing the samples with the following format 71 | #[charachteristic,feature1,feature2.......,feature13] 72 | #characheristic based on what we want for classification , can be (male , female) , also can be (normal-female,edema-female) 73 | #in general characheristic is the target value . 
74 | return data 75 | 76 | 77 | def preparingData(data): 78 | # Split-out validation dataset 79 | array = data.values 80 | #input 81 | X = array[:,0:13] 82 | #target 83 | Y = array[:,13] 84 | return X,Y 85 | 86 | def GaussianMixtureModel(data,gender): 87 | #A GMM attempts to find a mixture of multidimensional Gaussian probability distributions that best models any input dataset. 88 | #In the simplest case, GMMs can be used for finding clusters in the same manner as k-means. 89 | X,Y = preparingData(data) 90 | #print data.head(n=5) 91 | 92 | #we do not split into training and testing because we already did that on a per-file basis, so the X,Y in this 93 | #function are used to train the model, and another file with another set of X,Y is used in the testModels function to assess the model 94 | 95 | #takes only the first feature to redefine the problem as a 1-D problem 96 | #dataFeature1 = data.as_matrix(columns=data.columns[0:1]) 97 | #plot histogram 98 | #sns.distplot(dataFeature1,bins=20,kde=False) 99 | #plt.show() 100 | 101 | 102 | 103 | #Y = target variable 104 | gmm = GaussianMixture(n_components=8,max_iter=200,covariance_type='diag',n_init=3) 105 | gmm.fit(X) 106 | 107 | 108 | 109 | #save the model to disk 110 | filename = 'finalizedModel_'+gender+'.gmm' 111 | pickle.dump(gmm,open(filename,'wb')) 112 | print 'Model saved in path: PATH_TO'+filename 113 | 114 | 115 | return X 116 | #load the model from disk 117 | '''loadedModel = pickle.load(open(filename,'rb')) 118 | result = loadedModel.score(X) 119 | print result''' 120 | 121 | def testModels(data,threshold_input,x_test,y_test): 122 | gmmFiles = ['PATH_TO/finalizedModel_0.gmm','PATH_TO/finalizedModel_1.gmm'] 123 | models = [pickle.load(open(filename,'rb')) for filename in gmmFiles] #binary mode, matching the 'wb' used when the models were saved 124 | log_likelihood = np.zeros(len(models)) 125 | genders = ['male','female'] 126 | assessModel = [] 127 | prediction = [] 128 | features = x_test #the held-out test samples passed into this function (the original line referenced an undefined X here) 129 | for i in range(len(models)): 130 | gmm = models[i] 131 | scores = np.array(gmm.score(features)) 132 | #first take all the per-sample log likelihoods for the male model and then do the same for the female model 133 | assessModel.append(gmm.score_samples(features)) 134 | log_likelihood[i] = scores.sum() 135 | #the higher the value, the better the model fits the data 136 | for x in range(len(assessModel[0])): 137 | #the division is gmm(male model) / gmm(female model); if the result is > 1 then the example is classified as male 138 | 139 | 140 | #if the log likelihood under the male model is negative and under the female model is positive we do not have to check further 141 | #because the difference is clear and we are fairly sure that it is female 142 | if(assessModel[0][x] < 0 and assessModel[1][x] > 0): 143 | # x / y and x is < 0 , so we have to classify this as female 144 | # we have to be sure that the prediction will be above the threshold 145 | prediction.append(float(threshold_input) + 1) 146 | 147 | #same as above , we need to be sure that the prediction is below the threshold (male) because we are fairly 148 | #sure from the models' outcome that this sample is male 149 | elif(assessModel[0][x] > 0 and assessModel[1][x] < 0): 150 | prediction.append(float(threshold_input) - 1) 151 | else: 152 | prediction.append( abs(( assessModel[0][x] / assessModel[1][x] )) ) 153 | 154 | 155 | #take an array with the predictions and check if they are true (correct classification) or false (wrong classification) 156 | assessment=[] 157 | true_negative=0 158 | true_positive=0 159 | false_positive=0 160 | false_negative=0 161 | for x in range(len(prediction)):#reject option 162 
| if(prediction[x]<1.019 and prediction[x]>1.012): 163 | print prediction[x] , ' can not decide' 164 | elif(prediction[x] 1 then the example is male 59 | 60 | 61 | #if the prediction for male in negative and the prediction for female positive we dont have to check 62 | #because the difference is obvious and we are pretty sure that it is female 63 | if(assessModel[0][x] < 0 and assessModel[1][x] > 0): 64 | # x / y and x is < 0 , so we have to classify this as female 65 | # we have to be sure that the prediction will be above the threshold 66 | prediction.append(float(threshold_input) + 1) 67 | #same as above , we need to be sure that the prediction is below the threshold (male) because we are pretty 68 | #sure from the model's outcome that this sample is female 69 | elif(assessModel[0][x] > 0 and assessModel[1][x] < 0): 70 | prediction.append(float(threshold_input) - 1) 71 | else: 72 | prediction.append( abs(( assessModel[0][x] / assessModel[1][x] )) ) 73 | 74 | 75 | #take an array with the predictions and check if they are true(correct classification) or false(wrong classification) 76 | assessment=[] 77 | condition=[] 78 | true_negative=0 79 | true_positive=0 80 | false_positive=0 81 | false_negative=0 82 | for x in range(len(prediction)): 83 | if(prediction[x]<1.04790 and prediction[x]>0.97890): 84 | print prediction[x] , ' can not decide' 85 | print '\n' 86 | continue 87 | if(prediction[x] 0): 55 | return True 56 | else: 57 | return False 58 | 59 | 60 | 61 | def LR_ROC(data): 62 | #we initialize the random number generator to a const value 63 | #this is important if we want to ensure that the results 64 | #we can achieve from this model can be achieved again precisely 65 | #Axis or axes along which the means are computed. The default is to compute the mean of the flattened array. 66 | mean = np.mean(data,axis=0) 67 | std = np.std(data,axis=0) 68 | #print 'Mean: \n',mean 69 | #print 'Standar deviation: \n',std 70 | X,Y = preparingData(data) 71 | x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.20) 72 | # convert integers to dummy variables (i.e. one hot encoded) 73 | lr = LogisticRegression(class_weight='balanced') 74 | lr.fit(x_train,y_train) 75 | #The score function of sklearn can quickly assess the model performance 76 | #due to class imbalance , we nned to evaluate the model performance 77 | #on every class. Which means to find when we classify people from the first team wrong 78 | 79 | 80 | #feature selection RFE is based on the idea to repeatedly construct a model and choose either the best 81 | #or worst performing feature, setting the feature aside and then repeating the process with the rest of the 82 | #features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select 83 | # features by recursively considering smaller and smaller sets of features 84 | rfe = RFE(lr,13) 85 | rfe = rfe.fit(x_train,y_train) 86 | #print rfe.support_ 87 | 88 | #An index that selects the retained features from a feature vector. 
If indices is False, this is a boolean array of shape 89 | #[# input features], in which an element is True iff its corresponding feature is selected for retention 90 | 91 | #print rfe.ranking_ 92 | 93 | #so we have to take all the features 94 | 95 | #model fitting 96 | 97 | #predicting the test set results and calculating the accuracy 98 | y_pred = lr.predict(x_test) 99 | print 'Accuracy of logistic regression classifier on the test set: ', lr.score(x_test,y_test) 100 | 101 | #cross validation 102 | kfold = model_selection.KFold(n_splits=10,shuffle=True,random_state=7) 103 | modelCV = LogisticRegression() 104 | scoring = 'accuracy' 105 | results = model_selection.cross_val_score(modelCV, x_train,y_train,cv=kfold,scoring=scoring) 106 | print '10-fold cross validation average accuracy: ', results.mean() 107 | 108 | #confusion matrix 109 | confusionMatrix = confusion_matrix(y_test,y_pred) 110 | print 'Confusion matrix: ' 111 | print confusionMatrix 112 | print 'We had ',confusionMatrix[0][0] + confusionMatrix[1][1], 'correct predictions' 113 | print 'And ',confusionMatrix[1][0] + confusionMatrix[0][1],'incorrect prediction' 114 | print '' 115 | 116 | #The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative. 117 | #The recall is intuitively the ability of the classifier to find all the positive samples. 118 | #The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0. 119 | #The support is the number of occurrences of each class in y_test. 120 | 121 | #classification report 122 | print(classification_report(y_test,y_pred)) 123 | 124 | #roc curve 125 | logit_roc_auc = roc_auc_score(y_test, lr.predict(x_test)) 126 | fpr , tpr , thresholds = roc_curve(y_test,lr.predict_proba(x_test)[:,1]) 127 | 128 | #AUC is a measure of the overall performance of a diagnostic test and is 129 | #interpreted as the average value of sensitivity for all possible values of specificity 130 | 131 | fprtpr = np.hstack((fpr[:,np.newaxis],tpr[:,np.newaxis])) 132 | 133 | hull = ConvexHull(fprtpr) 134 | 135 | hull_indices = np.unique(hull.simplices.flat) 136 | hull_points = fprtpr[hull_indices,:] 137 | hull_points_y=[] 138 | hull_points_x=[] 139 | for x in range(len(hull_points)): 140 | coordinates = np.split(hull_points[x],2) 141 | hull_points_x.append(coordinates[0]) 142 | hull_points_y.append(coordinates[1]) 143 | 144 | 145 | 146 | 147 | #this implementation os only for the smooth rock curve 148 | 149 | hull_points_x_curve = [] 150 | hull_points_y_curve = [] 151 | 152 | #determine the starting and ending point 153 | startingPoint = np.split(hull_points[0],2) 154 | print 'starting point: ',startingPoint 155 | print startingPoint[1][0] 156 | endingPoint = np.split(hull_points[len(hull_points)-1],2) 157 | print 'ending point: ',endingPoint 158 | 159 | #append the strting point into the hull 160 | hull_points_x_curve.append(startingPoint[0]) 161 | hull_points_y_curve.append(startingPoint[1]) 162 | 163 | #check if there is a points under the starting and the ending point, only to make the ROC curve 164 | print len(hull_points) 165 | for x in range(1,len(hull_points)-1): 166 | print x 167 | coordinates = np.split(hull_points[x],2) 168 | ifnotUnder = not(isUnder(startingPoint , endingPoint , coordinates)) 169 | print ifnotUnder 170 | if (ifnotUnder): 171 | hull_points_y_curve.append(coordinates[1]) 172 | hull_points_x_curve.append(coordinates[0]) 173 | 174 | #append 
the ending point into the hull 175 | hull_points_x_curve.append(endingPoint[0]) 176 | hull_points_y_curve.append(endingPoint[1]) 177 | 178 | 179 | 180 | 181 | plt.figure(1) 182 | plt.title('ROC curve smooth') 183 | plt.scatter(hull_points_y,hull_points_x) 184 | area_under = metrics.auc(hull_points_y,hull_points_x) 185 | plt.plot(hull_points_x_curve,hull_points_y_curve,label='Area under the curve = %0.2f' %area_under) 186 | plt.legend(loc='lower right') 187 | 188 | 189 | 190 | plt.figure(2) 191 | plt.scatter(fpr,tpr) 192 | plt.title('Convex Hull') 193 | #plt.plot(fpr[hull.vertices],tpr[hull.vertices]) 194 | plt.plot(fprtpr[:,0], fprtpr[:,1], 'o') 195 | for simplex in hull.simplices: 196 | plt.plot(fprtpr[simplex, 0], fprtpr[simplex, 1],'r--',lw=2) 197 | 198 | plt.figure(3) 199 | plt.plot(fpr,tpr,label='Logistic Regression (area = %0.2f)' %logit_roc_auc) 200 | plt.plot([0,1],[0,1],'r--') 201 | plt.xlabel('False positive rate') 202 | plt.ylabel('True positive rate') 203 | plt.title('Receiver operating characteristic') 204 | plt.legend(loc='lower right') 205 | plt.show() 206 | 207 | #It generally means that your model can only provide discrete predictions, rather than a continous score. This can often be 208 | # remedied by adding more samples to your dataset, having more continous features in the model, more features in general or using 209 | # a model specification that provides a continous prediction output. The reason why it occurs in a decision tree is that you 210 | #often do binary splits; this is efficient computationally, but only gives 2^n groupings. Unless your n number of splits are very 211 | #large, you'll only have 16/32/64/128 groups, whereas if you used an algorithm such as logistic regression and used continous 212 | #variables, your prediction would fall in the continous range between 0 and 1. I'm not familiar with the type of data you listed, 213 | # but I suspect you have a lot of categorical data.It's not necessarily a problem to have a ROC that is discrete rather than 214 | #smooth, it really depends on your goals for the model (descriptive vs prescriptive), as well as how well your model fits on 215 | #out-of-sample datasets. Many of the problems I've solved in my career just needed a Yes/No line drawn (such as email this 216 | #person/don't email), so having a continous and smooth prediction along the range of inputs wasn't necessary. 
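A note on the ROC block above (an editorial addition, not part of the original script): logit_roc_auc is computed from the hard lr.predict(x_test) labels, which collapses the classifier to a single operating point, while AUC is normally computed from the positive-class scores, exactly as fpr and tpr already are. A minimal sketch of the conventional, probability-based ROC/AUC with scikit-learn, assuming any fitted binary classifier that exposes predict_proba (function and variable names here are illustrative):

from sklearn.metrics import roc_curve, auc

def smooth_roc_auc(model, x_test, y_test):
    #score with the positive-class probability so the curve has one point per distinct score,
    #instead of the single corner produced by hard 0/1 predictions
    scores = model.predict_proba(x_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    #auc() integrates the curve with the trapezoidal rule
    return fpr, tpr, auc(fpr, tpr)

#usage inside LR_ROC would look like: fpr, tpr, area = smooth_roc_auc(lr, x_test, y_test)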
217 | 218 | 219 | 220 | 221 | def main(): 222 | data = readFeaturesFile() 223 | LR_ROC(data) 224 | 225 | main() 226 | -------------------------------------------------------------------------------- /classifiers/simpleNeuralNetwork.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | import keras 4 | import numpy as np 5 | import matplotlib.pyplot as plt 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Activation , MaxPool2D , Conv2D , Flatten 8 | from keras.optimizers import Adam 9 | import pandas as pd 10 | from sklearn.model_selection import train_test_split 11 | from keras.optimizers import SGD 12 | from sklearn.model_selection import StratifiedKFold 13 | 14 | def readFeaturesFile(): 15 | names = ['Feature1', 'Feature2', 'Feature3', 'Feature4','Feature5','Feature6','Feature7','Feature8','Feature9', 16 | 'Feature10','Feature11','Feature12','Feature13','Gender'] 17 | data = pd.read_csv("PATH_TO_SAMPLES.txt",names=names ) 18 | #the outcome is a list of lists containing the samples with the following format 19 | #[charachteristic,feature1,feature2.......,feature13] 20 | #characheristic based on what we want for classification , can be (male , female) , also can be (normal-female,edema-female) 21 | #in general characheristic is the target value . 22 | return data 23 | 24 | def preparingData(data): 25 | # Split-out validation dataset 26 | array = data.values 27 | #input 28 | X = array[:,0:13] 29 | #target 30 | Y = array[:,13] 31 | 32 | #determine the test and the training size 33 | x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.10) 34 | 35 | 36 | #x_train = 37 | #Encode the labels 38 | #reconstruct the data as a vector with a sequence of 1s and 0s 39 | y_train = keras.utils.to_categorical(y_train, num_classes = 2) 40 | y_test = keras.utils.to_categorical(y_test, num_classes = 2) 41 | '''print y_train.shape 42 | print x_train.shape 43 | print(y_train[0], np.argmax(y_train[0])) 44 | print(y_train[1], np.argmax(y_train[1])) 45 | print(y_train[2], np.argmax(y_train[2])) 46 | print(y_train[3], np.argmax(y_train[3]))''' 47 | return x_train , x_test , y_train , y_test 48 | 49 | 50 | def returnData(data): 51 | # Split-out validation dataset 52 | array = data.values 53 | #input 54 | X = array[:,0:13] 55 | #target 56 | Y = array[:,13] 57 | 58 | #determine the test and the training size 59 | 60 | return X,Y 61 | 62 | 63 | 64 | #Multilayer Perceptron 65 | def testing_NN(data): 66 | X,Y = returnData(data) 67 | 68 | #determine the validation 69 | kfold = StratifiedKFold(n_splits=10,shuffle=True) 70 | #keep the results 71 | cvscores = [] 72 | for train,test in kfold.split(X,Y): 73 | #Define a siple Multilayer Perceptron 74 | model = Sequential() 75 | 76 | #our classification is binary 77 | 78 | #as a first step we have to define the input dimensionality 79 | 80 | 81 | model.add(Dense(14,activation='relu',input_dim=13)) 82 | 83 | 84 | #model.add(Dense(14,activation='relu',input_dim=13)) 85 | model.add(Dense(8, activation='relu')) 86 | 87 | #add another hidden layer 88 | #model.add(Dense(16,activation='relu')) 89 | #the last step , add an output layer (number of neurons = number of classes) 90 | model.add(Dense(1,activation='sigmoid')) 91 | 92 | #select the optimizer 93 | #adam = Adam(lr=0.0001) 94 | adam = Adam(lr=0.001) 95 | #learning rate is between 0.0001 and 0.001 , but it is objective to define it 96 | #because we need out model not to learn to fast and maybe we have overfitting but also 97 | 
#not to slow and take to much time . We can check this with the learning rate curve 98 | 99 | #we select the loss function and metrics that should be monitored 100 | #and then we compile our model 101 | model.compile(loss='binary_crossentropy',optimizer=adam,metrics=['accuracy']) 102 | 103 | #now we train our model 104 | model.fit(X[train],Y[train],epochs=50,batch_size=75,verbose=0) 105 | 106 | # evaluate the model 107 | scores = model.evaluate(X[test], Y[test], verbose=0) 108 | print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100)) 109 | cvscores.append(scores[1] * 100) 110 | 111 | 112 | print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores))) 113 | 114 | 115 | '''x_train , x_test , y_train , y_test = preparingData(data) 116 | 117 | #validation data = test data 118 | #only for the plot 119 | results = model.fit(x_train,y_train,epochs=50,batch_size=75,verbose=2,validation_data=(x_test,y_test)) 120 | 121 | plt.figure(1) 122 | plt.plot(results.history['loss']) 123 | plt.plot(results.history['val_loss']) 124 | plt.legend(['train loss', 'test loss']) 125 | plt.show() 126 | 127 | 128 | 129 | 130 | #now we can evaluate our model 131 | print '\n' 132 | print 'Train accuracy: ' , model.evaluate(x_train,y_train,batch_size=25) 133 | print 'Test accuracy: ',model.evaluate(x_test,y_test,batch_size=25) 134 | 135 | #visualize the actual output of the network 136 | output = model.predict(x_train) 137 | print '\n' 138 | print 'Actual output: ',output[0],np.argmax(output[0]) 139 | 140 | #we can also check our model behaviour in depth 141 | print'\n' 142 | #print the first ten predictions 143 | for x in range(10): 144 | print 'Prediction: ',np.argsort(output[x])[::-1],'True target: ',np.argmax(y_train[x])''' 145 | 146 | #Multilayer Perceptron 147 | def simpleNN(data): 148 | x_train , x_test , y_train , y_test = preparingData(data) 149 | 150 | #because as we can see from the previous function simpleNN the 151 | #test loss is going bigger which means that we have overfitting problem 152 | #here we are going to try to overcome this obstacle 153 | 154 | model = Sequential() 155 | 156 | #The input layer: 157 | '''With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined 158 | once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number 159 | of features (columns) in your data. Some NN configurations add one additional node for a bias term.''' 160 | 161 | model.add(Dense(14,activation='relu',input_dim=13,kernel_initializer='random_uniform')) 162 | 163 | #The output layer 164 | '''If the NN is a classifier, then it also has a single node unless 165 | softmax is used in which case the output layer has one node per 166 | class label in your model.''' 167 | 168 | model.add(Dense(2,activation='softmax')) 169 | 170 | #binary_crossentropy because we have a binary classification model 171 | #Because it is not guaranteed that we are going to find the global optimum 172 | #because we can be trapped in a local minima and the algorithm may think that 173 | #you reach global minima. To avoid this situation, we use a momentum term in the 174 | #objective function, which is a value 0 < momentum < 1 , that increases the size of the steps 175 | #taken towards the minimum by trying to jump from a local minima. 176 | 177 | 178 | #If the momentum term is large then the learning rate should be kept smaller. 
179 | #A large value of momentum means that the convergence will happen fast,but if 180 | #both are kept at large values , then we might skip the minimum with a huge step. 181 | #A small value of momentum cannot reliably avoid local minima, and also slow down 182 | #the training system. We are trying to find the right value of momentum through cross-validation. 183 | model.compile(loss='binary_crossentropy',optimizer=SGD(lr=0.001,momentum=0.6),metrics=['accuracy']) 184 | 185 | #In simple terms , learning rate is how quickly a network abandons old beliefs for new ones. 186 | #Which means that with a higher LR the network changes its mind more quickly , in pur case this means 187 | #how quickly our model update the parameters (weights,bias). 188 | 189 | #verbose: Integer. 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. 190 | 191 | results = model.fit(x_train,y_train,epochs=50,batch_size=50,verbose=2,validation_data=(x_test,y_test)) 192 | 193 | print 'Train accuracy: ' , model.evaluate(x_train,y_train,batch_size=50,verbose=2) 194 | print 'Test accuracy: ',model.evaluate(x_test,y_test,batch_size=50,verbose=2) 195 | 196 | 197 | 198 | #visualize 199 | plt.figure(1) 200 | plt.plot(results.history['loss']) 201 | plt.plot(results.history['val_loss']) 202 | plt.legend(['train loss', 'test loss']) 203 | plt.show() 204 | 205 | print model.summary() 206 | 207 | 208 | 209 | def main(): 210 | data = readFeaturesFile() 211 | simpleNN(data) 212 | 213 | 214 | main() 215 | -------------------------------------------------------------------------------- /feature_extraction_techniques/README.md: -------------------------------------------------------------------------------- 1 | Feature extraction techniques implemented in Python 2 | 3 | MFCC 4 | LPC 5 | PLP 6 | MGCA 7 | -------------------------------------------------------------------------------- /feature_extraction_techniques/lpc.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import wave 5 | import scipy.io.wavfile as wav 6 | from scipy import signal 7 | import scipy as sk 8 | from audiolazy import * 9 | from audiolazy import lpc 10 | from sklearn import preprocessing 11 | import scipy.signal as sig 12 | import scipy.linalg as linalg 13 | 14 | 15 | def readWavFile(wav): 16 | #given a path from the keyboard to read a .wav file 17 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 18 | inputWav = 'PATH_TO_WAV'+wav 19 | return inputWav 20 | 21 | #reading the .wav file (signal file) and extract the information we need 22 | def initialize(inputWav): 23 | rate , signal = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency 24 | sig = wave.open(readWavFile(inputWav)) 25 | # signal is the numpy 2D array with the date of the .wav file 26 | # len(signal) number of samples 27 | sampwidth = sig.getsampwidth() 28 | print 'The sample rate of the audio is: ',rate 29 | print 'Sampwidth: ',sampwidth 30 | return signal , rate 31 | 32 | 33 | #implementation of the low-pass filter 34 | def lowPassFilter(signal, coeff=0.97): 35 | return np.append(signal[0], signal[1:] - coeff * signal[:-1]) #y[n] = x[n] - a*x[n-1] , a = 0.97 , a>0 for low-pass filters 36 | 37 | def preEmphasis(wav): 38 | #taking the signal 39 | signal , rate = initialize(wav) 40 | #Pre-emphasis Stage 41 | preEmphasis = 0.97 42 | emphasizedSignal = lowPassFilter(signal) 43 | Time=np.linspace(0, 
len(signal)/rate, num=len(signal)) 44 | EmphasizedTime=np.linspace(0, len(emphasizedSignal)/rate, num=len(emphasizedSignal)) 45 | #plots using matplotlib 46 | '''plt.figure(figsize=(9, 7)) 47 | plt.subplot(211, facecolor='darkslategray') 48 | plt.title('Signal wave') 49 | plt.ylim(-50000, 50000) 50 | plt.ylabel('Amplitude', fontsize=16) 51 | plt.plot(Time,signal,'C1') 52 | plt.subplot(212, facecolor='darkslategray') 53 | plt.title('Pre-emphasis') 54 | plt.ylim(-50000, 50000) 55 | plt.xlabel('time(s)', fontsize=10) 56 | plt.ylabel('Amplitude', fontsize=16) 57 | plt.plot(EmphasizedTime,emphasizedSignal,'C1') 58 | plt.show()''' 59 | return emphasizedSignal, signal , rate 60 | 61 | 62 | def visualize(rate,signal): 63 | #taking the signal's time 64 | Time=np.linspace(0, len(signal)/rate, num=len(signal)) 65 | #plots using matplotlib 66 | plt.figure(figsize=(10, 6)) 67 | plt.subplot(facecolor='darkslategray') 68 | plt.title('Signal wave') 69 | plt.ylim(-40000, 40000) 70 | plt.ylabel('Amplitude', fontsize=16) 71 | plt.xlabel('Time(s)', fontsize=8) 72 | plt.plot(Time,signal,'C1') 73 | plt.draw() 74 | #plt.show() 75 | 76 | def framing(fs,signal): 77 | #split the signal into frames 78 | windowSize = 0.025 # 25ms 79 | windowStep = 0.01 # 10ms 80 | overlap = int(fs*windowStep) 81 | frameSize = int(fs*windowSize)# int() because the numpy array can take integer as an argument in the initiation 82 | numberOfframes = int(np.ceil(float(np.abs(len(signal) - frameSize)) / overlap )) 83 | print 'Overlap is: ',overlap 84 | print 'Frame size is: ',frameSize 85 | print 'Number of frames: ',numberOfframes 86 | frames = np.ndarray((numberOfframes,frameSize))# initiate a 2D array with numberOfframes rows and frame size columns 87 | #assing samples into the frames (framing) 88 | for k in range(0,numberOfframes): 89 | for i in range(0,frameSize): 90 | if((k*overlap+i)0 for low-pass filters 33 | 34 | 35 | def preEmphasis(wav): 36 | #taking the signal 37 | signal , rate = initialize(wav) 38 | #Pre-emphasis Stage 39 | preEmphasis = 0.97 40 | emphasizedSignal = lowPassFilter(signal) 41 | Time=np.linspace(0, len(signal)/rate, num=len(signal)) 42 | EmphasizedTime=np.linspace(0, len(emphasizedSignal)/rate, num=len(emphasizedSignal)) 43 | return emphasizedSignal, signal , rate 44 | 45 | def writeFeatures(mgca_features,wav): 46 | #write in a txt file the output vectors of every sample 47 | f = open('mel_generalized_features.txt','a')#sample ID 48 | #f = open('mfcc_featuresLR.txt','a')#only to initiate the input for the ROC curve 49 | wav = makeFormat(wav) 50 | np.savetxt(f,mgca_features,newline=",") 51 | f.write(wav) 52 | f.write('\n') 53 | 54 | 55 | def makeFormat(wav): 56 | #if i want to keep only the gender (male,female) 57 | wav = wav.split('/')[1].split('-')[1] 58 | #only to make the format for Logistic Regression 59 | if (wav=='Female'): 60 | wav='1' 61 | else: 62 | wav='0' 63 | return wav 64 | 65 | 66 | def mgca_feature_extraction(wav): 67 | #I pre-emphasized the signal with a low pass filter 68 | emphasizedSignal,signal,rate = preEmphasis(wav) 69 | 70 | 71 | #and now I have the signal windowed 72 | emphasizedSignal*=np.hamming(len(emphasizedSignal)) 73 | 74 | mgca_features = mgcep(emphasizedSignal,order=12) 75 | 76 | writeFeatures(mgca_features,wav) 77 | 78 | 79 | 80 | 81 | def mel_Generalized(): 82 | folder = raw_input('Give the name of the folder that you want to read data: ') 83 | amount = raw_input('Give the number of samples in the specific folder: ') 84 | print 'Mel-Generalized Cepstrum analysis github 
implementation ' 85 | for x in range(1,int(amount)+1): 86 | wav = '/'+folder+'/'+str(x)+'.wav' 87 | print wav 88 | mgca_feature_extraction(wav) 89 | 90 | 91 | 92 | def main(): 93 | mel_Generalized() 94 | 95 | main() 96 | -------------------------------------------------------------------------------- /feature_extraction_techniques/plp.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | 3 | import numpy 4 | import numpy.matlib 5 | import scipy 6 | from scipy.fftpack.realtransforms import dct 7 | from sidekit.frontend.vad import pre_emphasis 8 | from sidekit.frontend.io import * 9 | from sidekit.frontend.normfeat import * 10 | from sidekit.frontend.features import * 11 | import scipy.io.wavfile as wav 12 | import numpy as np 13 | 14 | 15 | def readWavFile(wav): 16 | #given a path from the keyboard to read a .wav file 17 | #wav = raw_input('Give me the path of the .wav file you want to read: ') 18 | inputWav = 'PATH_TO_WAV'+wav 19 | return inputWav 20 | 21 | #reading the .wav file (signal file) and extract the information we need 22 | def initialize(inputWav): 23 | rate , signal = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency 24 | sig = wave.open(readWavFile(inputWav)) 25 | # signal is the numpy 2D array with the date of the .wav file 26 | # len(signal) number of samples 27 | sampwidth = sig.getsampwidth() 28 | print 'The sample rate of the audio is: ',rate 29 | print 'Sampwidth: ',sampwidth 30 | return signal , rate 31 | 32 | def PLP(): 33 | folder = raw_input('Give the name of the folder that you want to read data: ') 34 | amount = raw_input('Give the number of samples in the specific folder: ') 35 | for x in range(1,int(amount)+1): 36 | wav = '/'+folder+'/'+str(x)+'.wav' 37 | print wav 38 | #inputWav = readWavFile(wav) 39 | signal,rate = initialize(wav) 40 | #returns PLP coefficients for every frame 41 | plp_features = plp(signal,rasta=True) 42 | meanFeatures(plp_features[0]) 43 | 44 | 45 | #compute the mean features for one .wav file (take the features for every frame and make a mean for the sample) 46 | def meanFeatures(plp_features): 47 | #make a numpy array with length the number of plp features 48 | mean_features=np.zeros(len(plp_features[0])) 49 | #for one input take the sum of all frames in a specific feature and divide them with the number of frames 50 | for x in range(len(plp_features)): 51 | for y in range(len(plp_features[x])): 52 | mean_features[y]+=plp_features[x][y] 53 | mean_features = (mean_features / len(plp_features)) 54 | print mean_features 55 | 56 | 57 | 58 | def main(): 59 | PLP() 60 | 61 | main() 62 | -------------------------------------------------------------------------------- /feature_extraction_techniques/readFiles.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/python 2 | import os 3 | 4 | def readCases(): 5 | healthyCases = os.listdir('path') 6 | capturedCases = os.listdir('path') 7 | #using the os libary that Python provides to read all the files from a scertain directory 8 | #the function return two arrays with all the file names that are in the specific directory 9 | -------------------------------------------------------------------------------- /speech_features/README.md: -------------------------------------------------------------------------------- 1 | EXAMPLE speech features extracted with various techniques. 
2 | 3 | MFCC (Mel Frequency Cepstral Coefficient) 4 | 5 | PLP (Perceptual Linear Prediction) 6 | 7 | - with RASTA filtering 8 | - and without 9 | 10 | LPC (Liner Predictive Coding) 11 | 12 | MGCA(Mel Generalized Cepstrum Analysis) 13 | -------------------------------------------------------------------------------- /speech_features/gmm_mfcc_0.txt: -------------------------------------------------------------------------------- 1 | 1.571385614170679723e+01,1.404875356097497274e+01,-4.860084266595126934e+00,-8.052633391335941582e+00,-3.585894965594753359e+01,-1.664005410215195724e+01,2.096955330354112235e+01,6.827590955059440248e+00,-9.411830493556637478e+00,-1.708960159495563147e+01,-1.474398190776833095e+00,-5.020199916868210543e+00,-2.692742851984636232e+01,0 2 | -------------------------------------------------------------------------------- /speech_features/gmm_mfcc_1.txt: -------------------------------------------------------------------------------- 1 | 1.760115643322911794e+01,8.026205839965216526e+00,-2.372870381468542789e+00,-3.842529428287276261e+01,-2.477253797237680999e+01,-1.050848974736010000e+01,1.884725359397479139e+01,-1.503582987857505238e+01,-1.585323702282968261e+00,1.268366760659247383e+01,-2.354793372118761496e+01,-9.443139105278184786e+00,-1.776925001121093572e+01,1 2 | -------------------------------------------------------------------------------- /speech_features/gmm_test_mfcc.txt: -------------------------------------------------------------------------------- 1 | 1.781939282818649417e+01,6.159290119863752189e+00,-1.652226688796058696e+01,-1.856868294488861793e+01,-1.951579460253341125e+01,-3.897547750343369533e+00,1.205675536859451746e+01,-2.243241947716080986e+01,1.401626082539313778e+01,-2.213080744116675902e+01,-2.415440396864470429e+00,4.772852345310729660e+00,-2.062950925082413178e+01,0 2 | -------------------------------------------------------------------------------- /speech_features/lpc_featuresLR.txt: -------------------------------------------------------------------------------- 1 | -5.713715899460503067e-01,6.012342530606594460e-02,-2.024042466157382758e-01,2.926073544026031037e-01,-2.459888669142980266e-01,3.511328323794973144e-02,-1.930418301955590665e-01,2.854422988330282962e-01,-1.310427749596927705e-01,1.676472867063789340e-01,-6.043539229498144649e-02,-6.383703620963765424e-02,1.358666383377096776e-02,0 2 | -------------------------------------------------------------------------------- /speech_features/mel_generalized_features.txt: -------------------------------------------------------------------------------- 1 | 4.011244746326546595e-01,9.854180285960870145e-03,9.698894709162839134e-03,1.068934676826956663e-02,1.239489571158165077e-02,1.024365701392198659e-02,1.003316534286036871e-02,1.159946972333673713e-02,1.145714884192802069e-02,1.136573025658852570e-02,1.031905747748021983e-02,1.177596370257075371e-02,8.511094453882924599e-03,0 2 | -------------------------------------------------------------------------------- /speech_features/mfcc_featuresLR.txt: -------------------------------------------------------------------------------- 1 | 1.869958534730199062e+01,-1.106390788440557937e+00,-1.463190887778748339e-01,5.917073258262402824e+00,-1.099502916134400898e+01,-6.256470537374103635e+00,5.928290413728441344e+00,1.086700024980764567e+01,-5.702149955059144792e-01,-1.617507983735385180e+00,-5.507156315888738440e+00,2.872456836350028464e+00,-3.080658678467673273e+00,1 2 | 
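Each feature file in this folder follows the same layout, as the examples above and below show: thirteen comma-separated feature values followed by a 0/1 label. A minimal loader sketch (an editorial addition; the column names mirror the readFeaturesFile helpers used by the classifiers, and the path is a placeholder):

import pandas as pd

def load_speech_features(path):
    #13 feature columns plus one binary label column, matching the files in speech_features/
    names = ['Feature%d' % i for i in range(1, 14)] + ['Label']
    data = pd.read_csv(path, names=names)
    X = data[names[:-1]].values  #feature matrix of shape (n_samples, 13)
    y = data['Label'].values     #0/1 targets
    return X, y

#example (placeholder path): X, y = load_speech_features('speech_features/mfcc_featuresLR.txt')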
-------------------------------------------------------------------------------- /speech_features/plp_features.txt: -------------------------------------------------------------------------------- 1 | 5.895941153135551893e+00,-6.748447900593427251e-01,-3.319222787104435940e-02,-1.578910175596278109e-01,1.221139505707007356e-01,5.985885025331066922e-02,-9.752085645350065668e-02,-7.502659344937070984e-02,-1.439876159846410764e-02,2.497247509572233549e-03,-1.891271314688349955e-02,8.776126537227386948e-02,-5.181689905905148552e-02,0 2 | -------------------------------------------------------------------------------- /speech_features/plp_featuresRASTA.txt: -------------------------------------------------------------------------------- 1 | -6.622877863274987398e-01,-4.013368720022871261e-01,-2.644490505501208566e-01,-2.436498156014590688e-01,-2.119007564962608059e-01,-1.768779937559497861e-01,-1.387866152690148402e-01,-1.090785786905756338e-01,-7.822805864564488787e-02,-5.214452830678916601e-02,-3.087330660971222482e-02,-6.892111334665819607e-03,1.710506533453253972e-02,0 2 | --------------------------------------------------------------------------------
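The gmm.py and gmm_healthy_captured.py scripts above share one idea: fit one GaussianMixture per class and label a sample by comparing the per-sample log-likelihoods of the two models, with a narrow band in which the classifier refuses to decide. A condensed sketch of that decision rule (an editorial addition: it uses the standard log-likelihood difference rather than the ratio of log-likelihoods with a threshold near 1 that the scripts implement, and the component count and reject margin are illustrative, not the tuned values):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X_class0, X_class1, n_components=8):
    #one diagonal-covariance mixture per class, as in GaussianMixtureModel() above
    gmm0 = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200, n_init=3).fit(X_class0)
    gmm1 = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200, n_init=3).fit(X_class1)
    return gmm0, gmm1

def predict_with_reject(gmm0, gmm1, X, margin=0.5):
    #score_samples returns the per-sample log-likelihood under each model;
    #their difference is the log-likelihood ratio that drives the decision
    llr = gmm0.score_samples(X) - gmm1.score_samples(X)
    labels = np.where(llr >= 0, 0, 1)
    #samples whose ratio falls inside the reject band stay undecided (-1),
    #mirroring the 'can not decide' branch in testModels()
    labels[np.abs(llr) < margin] = -1
    return labels

#usage (illustrative names): gmm0, gmm1 = fit_class_gmms(X_male, X_female); labels = predict_with_reject(gmm0, gmm1, X_new)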