├── Fig1.png ├── Fig2.png ├── LICENSE ├── README.md ├── Team 1 └── main.py ├── Team 10 └── main.py ├── Team 11 └── main.py ├── Team 12 └── main.py ├── Team 13 └── main.py ├── Team 14 └── main.py ├── Team 15 └── main.py ├── Team 16 └── main.py ├── Team 17 └── main.py ├── Team 18 ├── classifiers.py └── main.py ├── Team 19 └── main.py ├── Team 2 ├── main.py ├── read_write.py └── reduce_dim.py ├── Team 20 └── main.py ├── Team 3 ├── classifiers.py └── main.py ├── Team 4 └── main.py ├── Team 5 └── main.py ├── Team 6 └── main.py ├── Team 7 └── main.py ├── Team 8 └── main.py └── Team 9 └── main.py /Fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/basiralab/BrainNet-ML-ToolBox/5a90b050f76e07caf77c3057186acf5d88db46be/Fig1.png -------------------------------------------------------------------------------- /Fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/basiralab/BrainNet-ML-ToolBox/5a90b050f76e07caf77c3057186acf5d88db46be/Fig2.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 BASIRA LAB 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BrainNet-ML-ToolBox: A Python Machine Learning Toolbox for Brain Network Classification 2 | 3 | 4 | The **BrainNet-ML-ToolBox** library supports combining models and scores from 5 | key ML libraries such as `scikit-learn` and `xgboost` for data preprocessing, dimensionality reduction, and classification. This toolbox has been put together and polished by Goktug Guvercin (goktug150140@gmail.com). 6 | 7 | # Introduction 8 | 9 | This repo is a machine learning (ML) toolbox including 20 different pipelines for brain network classification. 10 | 11 | Autism spectrum disorder (ASD) affects brain connectivity at different levels. Nonetheless, non-invasively distinguishing such effects using magnetic resonance imaging (MRI) remains very challenging for machine learning diagnostic frameworks due to ASD heterogeneity.
So far, existing network neuroscience works have mainly focused on functional (derived from functional MRI) and structural (derived from diffusion MRI) brain connectivity, which might not capture relational morphological changes between brain regions. Indeed, machine learning (ML) studies for ASD diagnosis using morphological brain networks derived from conventional T1-weighted MRI are very scarce. 12 | 13 | To fill this gap, we leverage crowdsourcing by organizing a **Kaggle competition** to build a pool of machine learning pipelines for neurological disorder diagnosis, with application to ASD diagnosis using cortical morphological networks derived from T1-weighted MRI. The general aim of this challenge was to encourage the competitors to come up with machine learning pipelines that can differentiate normal controls from autistic subjects using cortical morphological networks. The competitors were allowed to use built-in machine learning methods to design their brain network classification frameworks. **In this repository, we include the source codes of the top 20 teams in the competition.** 14 | 15 | During the competition, participants were provided with a training dataset and were only allowed to check their performance on a public test set. The final evaluation was performed on both public and hidden test sets based on accuracy, sensitivity, and specificity metrics. Teams were ranked using each performance metric separately, and the final ranking was determined by the mean of all rankings. **The first-ranked team (Team-1) achieved 70% accuracy, 72.5% sensitivity, and 67.5% specificity, while the second-ranked team (Team-2) achieved 63.8%, 62.5%, and 65%, respectively.** 16 | 17 | ![BrainNet pipeline](https://github.com/basiralab/BrainNet-ML-ToolBox/blob/master/Fig1.png) 18 | ![BrainNet pipeline](https://github.com/basiralab/BrainNet-ML-ToolBox/blob/master/Fig2.png) 19 | 20 | # Installation 21 | 22 | The source code has been tested with Python 3.6.2 through the PyCharm IDE on macOS. No GPU is needed to run the code. 23 | 24 | Required Python modules: 25 | 26 | * csv 27 | * numpy 28 | * pandas 29 | * xgboost 30 | * mlxtend 31 | * statistics 32 | * warnings 33 | * matplotlib 34 | * scikit-learn 35 | 36 | # Dataset format 37 | 38 | The brain network training dataset comprised 120 samples, each represented by 595 morphological connectivity features, with the ground-truth label stored in the last column. The testing set comprised 80 samples described by features only. If you intend to run the source code on your own dataset, operations inside the code such as constant feature elimination and loading data from CSV files can be modified accordingly.
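For illustration, the snippet below sketches how most of the included pipelines read this format; it is a simplified example rather than the code of any particular team. It assumes the competition file names `train.csv` and `test.csv`, the ground-truth label in the last column of the training file, and the constant feature columns (`X3`, `X31`, `X32`, `X127`, `X128`, `X590`) that several teams drop.

```python
import pandas as pd

# Read the competition data (file names as used by the team scripts in this repo).
train_data = pd.read_csv("train.csv")   # 120 samples x (595 features + label)
test_data = pd.read_csv("test.csv")     # 80 samples x 595 features

# Split the training set into features and ground-truth labels (last column).
x_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]

# Drop the constant (zero-valued) connectivity features, as several teams do.
constant_columns = ["X3", "X31", "X32", "X127", "X128", "X590"]
x_train = x_train.drop(columns=constant_columns)
x_test = test_data.drop(columns=constant_columns)

# Each team then applies its own dimensionality reduction (e.g., SelectKBest, PCA)
# and classifier to x_train / y_train before predicting labels for x_test.
```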
39 | 40 | # Please cite the following paper when using BrainNet-ML-ToolBox: 41 | 42 | ``` 43 | @article{brainNetML2020, 44 | title={Machine Learning Methods for Brain Network Classification: Application to Autism Diagnosis using Cortical Morphological Networks}, 45 | author={Bilgen, Ismail and Guvercin, Goktug and Rekik, Islem}, 46 | journal={arXiv preprint arXiv:2004.13321}, 47 | year={2020} 48 | } 49 | 50 | ``` 51 | Paper link on arXiv: 52 | https://arxiv.org/pdf/2004.13321v1.pdf 53 | 54 | 55 | -------------------------------------------------------------------------------- /Team 1/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> SelectKBest Algorithm -> Gradient Boosting 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 1. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | 35 | import csv 36 | import pandas as pd 37 | from sklearn.ensemble import GradientBoostingClassifier 38 | from sklearn.feature_selection import chi2, SelectKBest 39 | 40 | 41 | def load_data(filename): 42 | train_data = pd.read_csv(filename[0]) 43 | test_data = pd.read_csv(filename[1]) 44 | return train_data, test_data 45 | 46 | 47 | def filter_feature_selection(X, y, nof_features): 48 | 49 | """ 50 | This function performs selectKBest filtering algorithm to compute importance scores of features. 51 | Then, these scores are concatenated with features themselves. 52 | Finally, top k most important features are selected. 53 | 54 | Parameters 55 | ---------- 56 | X: features of train data set 57 | y: labels of train data set 58 | nof_features: number of features to be selected """ 59 | 60 | best_features = SelectKBest(score_func=chi2, k="all") 61 | best_features.fit(X, y) 62 | df_scores = pd.DataFrame(best_features.scores_) 63 | df_columns = pd.DataFrame(X.columns) 64 | 65 | # Concatenate features and related score for visualization 66 | feature_scores = pd.concat([df_columns, df_scores], axis=1) 67 | feature_scores.columns = ['Specs', 'Score'] 68 | 69 | # Select top K features 70 | ft = feature_scores.nlargest(nof_features, 'Score') 71 | features = ft.index.values 72 | return features 73 | 74 | 75 | def preprocessing(trainData): 76 | 77 | """ 78 | The method splits the train dataset into features and labels, which are X and y. 
79 | Then, it drops constant features, and chooses top 341 features. 80 | 81 | Parameters 82 | ---------- 83 | trainData: It is numpy array containing features and labels together 84 | :return: It returns train dataset with selected features, labels, and index values of chosen features 85 | """ 86 | 87 | X = trainData.iloc[:, 0:595] 88 | y = trainData.iloc[:, -1] 89 | 90 | # Drop zero columns and Select top K features with Filter Method 91 | X = X.drop(["X3", "X31", "X32", "X127", "X128", "X590"], axis=1) 92 | feature_indexes = filter_feature_selection(X, y, 341) 93 | 94 | return X.values[:, feature_indexes], y, feature_indexes 95 | 96 | 97 | def train_model(X_train, y_train): 98 | 99 | """ 100 | A learning model is trained by using train data set. 101 | Gradient Boosting classifier is preferred. 102 | 103 | Parameters 104 | ---------- 105 | X_train: train dataset with top k most important features 106 | y_train: labels of those samples 107 | """ 108 | 109 | gradBoost = GradientBoostingClassifier( 110 | n_estimators=6, 111 | learning_rate=1, 112 | max_features=2, 113 | max_depth=2, 114 | random_state=289) 115 | 116 | gradBoost.fit(X_train, y_train) 117 | return gradBoost 118 | 119 | 120 | def transform_test_samples(test_data, selected_features): 121 | 122 | """ 123 | This method applies preprocessing operations to test data. 124 | It at first eliminates constant features, and then choose top k most relevant features from the data set. 125 | 126 | Parameters 127 | ---------- 128 | test_data: test data set 129 | selected_features: index values of selected features 130 | """ 131 | X = test_data.iloc[:, 0:595] 132 | X = X.drop(["X3", "X31", "X32", "X127", "X128", "X590"], axis=1) 133 | return X.values[:, selected_features] 134 | 135 | 136 | def predict(model, X_test): 137 | 138 | """ 139 | This method predicts outputs for test samples by using learning model. 
140 | 141 | Parameters 142 | ---------- 143 | model: learning model for prediction 144 | X_test: testing dataset 145 | :return: predictions for testing set 146 | """ 147 | return model.predict(X_test) 148 | 149 | 150 | def write_output(predictions): 151 | 152 | submissionFile = [["ID", "Predicted"]] 153 | for i, prediction in enumerate(predictions): 154 | submissionFile.append([i + 1, int(prediction)]) 155 | 156 | # Write to file 157 | with open('Submission.csv', 'w') as csvFile: 158 | writer = csv.writer(csvFile) 159 | writer.writerows(submissionFile) 160 | 161 | 162 | # ******* Main Program ******* # 163 | 164 | Xpaths = ["train.csv", "test.csv"] 165 | trainData, testData = load_data(Xpaths) 166 | 167 | X_train, y_train, selected_features = preprocessing(trainData) 168 | gradientBoost = train_model(X_train, y_train) 169 | 170 | XtestNew = transform_test_samples(testData, selected_features) 171 | predictions = predict(gradientBoost, XtestNew) 172 | write_output(predictions) 173 | -------------------------------------------------------------------------------- /Team 10/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> PCA -> Decision Tree 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 10. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | from sklearn.decomposition import PCA 37 | from sklearn.tree import DecisionTreeClassifier 38 | 39 | np.random.seed(3) # anchoring randomization during training step 40 | 41 | 42 | def load_data(traname, testname): 43 | 44 | """ 45 | The method reads train and test data from dataset files. 46 | 47 | Parameters 48 | ---------- 49 | traname: directory of training dataset file 50 | testname: directory of testing dataset file 51 | 52 | """ 53 | train_data = pd.read_csv(traname) 54 | test_data = pd.read_csv(testname) 55 | return train_data, test_data 56 | 57 | 58 | def preprocessing(train_data, test_data): 59 | 60 | """ 61 | * The method at first eliminates constant features from both train and test data. 62 | * Then, it splits training data into features and labels. 63 | * Finally, the method performs pca on training and testing data sets to reduce the dimension and 64 | overcome curse of dimensionality problem. 
65 | 66 | Parameters 67 | ---------- 68 | train_data: training data set in data frame format 69 | test_data: testing data set in data frame format 70 | 71 | """ 72 | 73 | # constant feature elimination 74 | train_data = train_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 75 | train_data = np.asarray(train_data) 76 | 77 | test_data = test_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 78 | test_data = np.asarray(test_data) 79 | 80 | # training data is split into features and labels 81 | train_x = train_data[:, :train_data.shape[1] - 1] 82 | train_y = train_data[:, train_data.shape[1] - 1] 83 | train_y.shape = (np.size(train_y), 1) 84 | 85 | # principal component analysis 86 | pca = PCA(n_components=60) 87 | train_x_pca = pca.fit_transform(train_x) 88 | test_pca = pca.transform(test_data) 89 | 90 | return train_x_pca, train_y, test_pca 91 | 92 | 93 | def train_model(train_x, train_y): 94 | 95 | """ 96 | The method creates a learning model, and trains it by using training data. 97 | 98 | Parameters 99 | ---------- 100 | train_x: features of training data 101 | train_y: labels of training data 102 | 103 | """ 104 | 105 | clf = DecisionTreeClassifier(max_depth=3, max_features=11, random_state=9) 106 | clf.fit(train_x, train_y) 107 | return clf 108 | 109 | 110 | def predict(model, test_x): 111 | 112 | """ 113 | The method predicts labels for testing data samples by using trained learning model. 114 | 115 | Parameters 116 | ---------- 117 | model: trained learning model 118 | test_x: features of testing data 119 | 120 | """ 121 | 122 | predictions = model.predict(test_x) 123 | return predictions 124 | 125 | 126 | def write_output(ytest): 127 | yt = pd.DataFrame(ytest, dtype='int32') 128 | yt.columns = ["Predicted"] 129 | yt.index += 1 130 | yt.to_csv("./submission.csv", index_label="ID") 131 | return 132 | 133 | 134 | # ********** MAIN PROGRAM ********** # 135 | 136 | trainfile = "train.csv" 137 | testfile = "test.csv" 138 | 139 | train_data, test_data = load_data(trainfile, testfile) 140 | train_x, train_y, test_x = preprocessing(train_data, test_data) 141 | 142 | model = train_model(train_x, train_y) 143 | predictions = predict(model, test_x) 144 | write_output(predictions) 145 | -------------------------------------------------------------------------------- /Team 11/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> Adaptive Boosting (Base: Decision Tree) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 
23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 11. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.feature_selection import SelectKBest, chi2 38 | from sklearn.ensemble import AdaBoostClassifier 39 | 40 | 41 | def load_data(train_file, test_file): 42 | 43 | """ 44 | The method reads train and test data from their dataset files. 45 | Then, it splits train data into features and labels. 46 | 47 | Parameters 48 | ---------- 49 | train_file: directory of the file in which train data set is located 50 | test_file: directory of the file in which test data set is located 51 | 52 | """ 53 | 54 | data = np.asarray(pd.read_csv(train_file, header=0)) 55 | data_ts = np.asarray(pd.read_csv(test_file, header=0)) 56 | 57 | x_tra = data[:, :-1] 58 | y_tra = data[:, -1] 59 | 60 | return x_tra, y_tra, data_ts 61 | 62 | 63 | def preprocessing(x_tra, y_tra, x_tst): 64 | 65 | """ 66 | * The method computes chi square value for each feature, and 67 | chooses top 190 features with highest chi square value. 68 | * This is performed by using SelectKBest() feature selection algorithm. 69 | 70 | Parameters 71 | ---------- 72 | x_tra: features of training data 73 | y_tra: labels of training data 74 | x_tst: features of test data 75 | 76 | """ 77 | 78 | selector = SelectKBest(chi2, 190) 79 | selector.fit(x_tra, y_tra) 80 | x_tra_new = selector.transform(x_tra) 81 | x_tst_new = selector.transform(x_tst) 82 | 83 | return x_tra_new, x_tst_new 84 | 85 | 86 | def train_model(x_tra, y_tra): 87 | 88 | """ 89 | The method creates a learning model and trains it by using training data. 90 | 91 | Parameters 92 | ---------- 93 | x_tra: features of training data 94 | y_tra: labels of training data 95 | """ 96 | 97 | clf1 = AdaBoostClassifier(n_estimators=300, random_state=1) 98 | clf1.fit(x_tra, y_tra) 99 | return clf1 100 | 101 | 102 | def predict(x_tst, model): 103 | 104 | """ 105 | The method predicts labels for testing data samples by using trained learning model. 
106 | 107 | Parameters 108 | ---------- 109 | x_tst: features of testing data 110 | model: trained learning model 111 | """ 112 | 113 | predictions = model.predict(x_tst) 114 | return predictions 115 | 116 | 117 | def write_output(predictions): 118 | 119 | order = np.arange(1, 81) 120 | order.shape = (80, 1) 121 | predictions.shape = (80, 1) 122 | 123 | pred = np.concatenate((order, predictions), axis=1).astype(dtype=np.int) 124 | with open('submission.csv', 'w') as csvFile: 125 | writer = csv.writer(csvFile) 126 | writer.writerow(("ID", "Predicted")) 127 | writer.writerows(pred) 128 | csvFile.close() 129 | 130 | 131 | # ********** MAIN PROGRAM ********** # 132 | 133 | x_tra, y_tra, x_tst = load_data("train.csv", "test.csv") 134 | x_tra_new, x_tst_new = preprocessing(x_tra, y_tra, x_tst) 135 | 136 | model = train_model(x_tra_new, y_tra) 137 | predictions = predict(x_tst_new, model) 138 | write_output(predictions) 139 | -------------------------------------------------------------------------------- /Team 12/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> SVC 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 12. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | from matplotlib import pyplot as plt 37 | 38 | from sklearn import svm 39 | from sklearn.decomposition import PCA 40 | 41 | plt.style.use('ggplot') 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test data from data set files. 48 | Then, it splits train data set into features and labels. 49 | 50 | """ 51 | 52 | train_set = np.array(pd.read_csv("train.csv")) 53 | test_set = np.array(pd.read_csv("test.csv")) 54 | 55 | train_x = train_set[:, :train_set.shape[1] - 1] 56 | train_y = train_set[:, train_set.shape[1] - 1] 57 | 58 | return train_x, train_y, test_set 59 | 60 | 61 | def preprocessing(x_tra, x_tst): 62 | 63 | """ 64 | * The method reduces dimension of training and testing data set by using PCA. 
65 | 66 | Parameters 67 | ---------- 68 | x_tra: features of training data 69 | x_tst: features of testing data 70 | """ 71 | 72 | pca = PCA(n_components=5) 73 | x_tra_new = pca.fit_transform(x_tra) 74 | x_tst_new = pca.transform(x_tst) 75 | 76 | return x_tra_new, x_tst_new 77 | 78 | 79 | def train_model(x_train, y_train): 80 | 81 | """ 82 | The method creates a learning model and trains it by using training dataset 83 | 84 | Parameters 85 | ---------- 86 | x_train: features of training data 87 | y_train: labels of training data 88 | """ 89 | 90 | model = svm.SVC(kernel='linear') 91 | model.fit(x_train, y_train) 92 | return model 93 | 94 | 95 | def predict(model, x_tst): 96 | 97 | """ 98 | The method predicts the labels for testing data samples by using trained learning model. 99 | 100 | Parameters 101 | ---------- 102 | model: trained learning model 103 | x_tst: features of testing data 104 | 105 | """ 106 | 107 | predictions = model.predict(x_tst) 108 | return predictions 109 | 110 | 111 | def write_output(predictions): 112 | 113 | ind = [x for x in range(1, len(predictions) + 1)] 114 | 115 | temp = pd.DataFrame(data=ind, columns=['ID']) 116 | temp2 = pd.DataFrame(data=predictions, columns=['Predicted']) 117 | 118 | y_pred = pd.concat([temp, temp2], axis=1) 119 | y_pred = y_pred.astype({"ID": int, "Predicted": int}) 120 | y_pred.to_csv("submission.csv", index=False, float_format="%.0f") 121 | 122 | 123 | # ********** MAIN PROGRAM ********** # 124 | 125 | train_x, train_y, test_set = load_data() 126 | x_tra_new, x_tst_new = preprocessing(train_x, test_set) 127 | 128 | model = train_model(x_tra_new, train_y) 129 | predictions = predict(model, x_tst_new) 130 | write_output(predictions) 131 | -------------------------------------------------------------------------------- /Team 13/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Coorelation Based Elimination -> SelectKBest Algorithm -> Adaptive Boosting (Base: Decision Tree) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 13. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. 
""" 33 | 34 | import csv 35 | import pandas as pd 36 | import matplotlib.pyplot as plt 37 | 38 | from sklearn.ensemble import AdaBoostClassifier 39 | from sklearn.feature_selection import SelectKBest, chi2 40 | 41 | 42 | def load_data(): 43 | """ 44 | The method reads train and test data from data set files. 45 | Then, it splits train data into features and labels. 46 | """ 47 | 48 | train_data = pd.read_csv('train.csv') 49 | test_data = pd.read_csv('test.csv') 50 | 51 | x_train = train_data.iloc[:, 0: -1] 52 | y_train = train_data.iloc[:, -1] 53 | x_test = test_data.iloc[:, 0:] 54 | 55 | return x_train, y_train, x_test 56 | 57 | 58 | def preprocessing(x_train, y_train, x_test): 59 | 60 | """ 61 | The method at first chooses top 50 features with highest chi square value by using SelectKBest algorithm. 62 | Then, those features are sorted, and the features least correlated with labels are eliminated. 63 | 64 | Parameters 65 | ---------- 66 | x_train: features of training data 67 | y_train: labels of training data 68 | x_test: features of testing data 69 | """ 70 | 71 | selector = SelectKBest(score_func=chi2, k=50) 72 | 73 | fit = selector.fit(x_train, y_train) 74 | 75 | df_scores = pd.DataFrame(fit.scores_) 76 | df_columns = pd.DataFrame(x_train.columns) 77 | 78 | feature_scores = pd.concat([df_columns, df_scores], axis=1) 79 | feature_scores.columns = ['Specs', 'Score'] 80 | 81 | selected_features = feature_scores.sort_values(['Score'], ascending=0).iloc[0:50, :] 82 | 83 | new_x_train = x_train.loc[:, selected_features['Specs']] 84 | new_x_test = x_test.loc[:, selected_features['Specs']] 85 | 86 | plt.matshow(new_x_train.corr().abs()) 87 | plt.show() 88 | 89 | new_x_train = new_x_train.drop(['X584', 'X579', 'X404', 'X528', 'X318'], axis=1) 90 | new_x_test = new_x_test.drop(['X584', 'X579', 'X404', 'X528', 'X318'], axis=1) 91 | 92 | return new_x_train, new_x_test 93 | 94 | 95 | def train_model(x_train, y_train): 96 | 97 | """ 98 | The method creates a learning model and trains it by using training data. 99 | 100 | Parameters 101 | ---------- 102 | x_train: features of training data 103 | y_train: labels of training data 104 | """ 105 | 106 | model = AdaBoostClassifier(n_estimators=10) 107 | model.fit(x_train, y_train) 108 | return model 109 | 110 | 111 | def predict(model, x_test): 112 | 113 | """ 114 | The method predicts labels for testing data samples. 
115 | 116 | Parameters 117 | ---------- 118 | model: trained model 119 | x_test: features of testing data set 120 | """ 121 | 122 | y_pred = model.predict(x_test) 123 | return y_pred 124 | 125 | 126 | def write_output(y_pred): 127 | 128 | for i in range(0, len(y_pred) + 1): 129 | if i == 0: 130 | with open('submission.csv', 'w') as writeFile: 131 | writer = csv.writer(writeFile) 132 | writer.writerow(["ID", "Predicted"]) 133 | continue 134 | row = [i, int(y_pred[i - 1])] 135 | with open('submission.csv', 'a') as writeFile: 136 | writer = csv.writer(writeFile) 137 | writer.writerow(row) 138 | 139 | 140 | if __name__ == '__main__': 141 | 142 | x_train, y_train, x_test = load_data() 143 | x_train, x_test = preprocessing(x_train, y_train, x_test) 144 | 145 | model = train_model(x_train, y_train) 146 | predictions = predict(model, x_test) 147 | write_output(predictions) 148 | -------------------------------------------------------------------------------- /Team 14/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> Random Forest 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 14. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import pandas as pd 35 | from sklearn.decomposition import PCA 36 | from sklearn.ensemble import RandomForestClassifier 37 | 38 | 39 | def load_data(): 40 | 41 | """ 42 | The method reads train and test data from data set files. 43 | Then, it splits train data into features and labels. 44 | """ 45 | 46 | train = pd.read_csv("train.csv") 47 | test = pd.read_csv("test.csv") 48 | 49 | train_x = train.iloc[:, :-1] 50 | train_y = train.iloc[:, -1] 51 | return train_x, train_y, test 52 | 53 | 54 | def preprocessing(X, y, test): 55 | 56 | """ 57 | * The method at first eliminates nan-valued columns. 58 | (There is no nan-valued feature column, it is redundant operation.) 59 | 60 | * Then, first 100 samples from train data are chosen as training set. 61 | The remains 20 samples from train data are regarded as testing data. 62 | In other words, the learning model will be trained with first 100 training samples, not all of them. 63 | 64 | * Finally, it performs pca on training and testing data to reduce the dimension. 
65 | 66 | Parameters 67 | ---------- 68 | X: features of training data 69 | y: labels of training data 70 | test: features of testing data 71 | """ 72 | 73 | X = X.dropna(axis=1, how='all') 74 | test = test.dropna(axis=1, how='all') 75 | 76 | x_train = X[0:100][:] 77 | x_test = X[100:120][:] 78 | y_train = y.iloc[0:100] 79 | y_test = y.iloc[100:120] 80 | 81 | pca = PCA(n_components=80) 82 | x_train = pca.fit_transform(x_train) 83 | x_test = pca.transform(x_test) 84 | test = pca.transform(test) 85 | 86 | return x_train, x_test, y_train, y_test, test 87 | 88 | 89 | def train_model(x_train, y_train): 90 | """ 91 | The method creates a learning model and trains it by using first 100 training samples. 92 | 93 | Parameters 94 | ---------- 95 | x_train: features of first 100 training samples 96 | y_train: labels of first 100 training samples 97 | """ 98 | 99 | classifier = RandomForestClassifier(n_estimators=1000, random_state=50) 100 | classifier.fit(x_train, y_train) 101 | return classifier 102 | 103 | 104 | def predict(classifier, x_train_test, real_test): 105 | 106 | """ 107 | The method makes two predictions: 108 | - First prediction is for last 20 samples of training data 109 | - Second prediction is for testing data 110 | 111 | Parameters 112 | ---------- 113 | classifier: trained model 114 | x_train_test: features of last 20 samples of training data 115 | real_test: features of testing data 116 | """ 117 | 118 | predictions1 = classifier.predict(x_train_test) 119 | predictions2 = classifier.predict(real_test) 120 | return predictions1, predictions2 121 | 122 | 123 | def write_output(y_predTest): 124 | 125 | f = open('submission.csv', 'w', encoding="utf-8") 126 | tempstr = "ID,Predicted\n" 127 | f.write(tempstr) 128 | 129 | # print(y_predTest) 130 | 131 | i = 1 132 | for y in y_predTest: 133 | tempstr = str(i) 134 | tempstr += "," 135 | tempstr += str(y) 136 | tempstr += "\n" 137 | 138 | i += 1 139 | f.write(tempstr) 140 | f.close 141 | 142 | 143 | # ********** MAIN PROGRAM ********** # 144 | 145 | X, y, test = load_data() 146 | x_train, x_train_test, y_train, y_train_test, real_test = preprocessing(X, y, test) 147 | 148 | classifier = train_model(x_train, y_train) 149 | train_test_pred, real_test_pred = predict(classifier, x_train_test, real_test) 150 | write_output(real_test_pred) 151 | 152 | """ 153 | ********************************************************* 154 | The codes below are used to investigate the performance * 155 | of trained model on last 20 samples of training data. * 156 | ********************************************************* 157 | 158 | print(confusion_matrix(y_train_test, train_test_pred)) 159 | print(classification_report(y_train_test, train_test_pred)) 160 | print(accuracy_score(y_train_test, train_test_pred)) 161 | 162 | """ 163 | -------------------------------------------------------------------------------- /Team 15/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Isolation Forest -> RFECV Algorithm -> Bagging Classifier (Base: SVM) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 15. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import numpy as np 36 | from statistics import mean 37 | from sklearn.ensemble import BaggingClassifier, IsolationForest 38 | 39 | from sklearn.svm import SVC 40 | from sklearn.model_selection import ShuffleSplit, GridSearchCV 41 | from sklearn.feature_selection import RFECV 42 | 43 | np.random.seed(7) 44 | 45 | 46 | def load_data(): 47 | 48 | """ 49 | * Train data is read from "train.csv" file. 50 | * The whole content which is read from the file is decomposed into sub parts called "row". 51 | * Each sub part is collected in the list "contents". 52 | * Then, this list is converted into numpy array and split into features and labels of training samples. 53 | 54 | * Test data is read from "test.csv" file. 55 | * The whole content which is read from the file is decomposed into sub parts called "row". 56 | * Each sub part is collected in the list "contents". 57 | * Then, this list is converted into numpy array called "x_test". 58 | 59 | """ 60 | 61 | contents = [] 62 | with open('train.csv') as csv_file: 63 | csv_reader = csv.reader(csv_file, delimiter=',',) 64 | next(csv_reader) 65 | for row in csv_reader: 66 | contents += [row] 67 | 68 | cont_np = np.asarray(contents, dtype=np.float64) 69 | train_x = cont_np[:, :-1] 70 | train_y = cont_np[:, -1] 71 | 72 | contents = [] 73 | with open('test.csv') as csv_file: 74 | csv_reader = csv.reader(csv_file, delimiter=',',) 75 | next(csv_reader) 76 | for row in csv_reader: 77 | contents += [row] 78 | 79 | test_x = np.asarray(contents, dtype=np.float64) 80 | 81 | return train_x, train_y, test_x 82 | 83 | 84 | def outlier_detection(train_x, train_y): 85 | 86 | """ 87 | * The outliers are the samples which do not fit general data distribution trend. 88 | Those samples mislead the learning model during training phase. 89 | Hence, eliminating outliers is beneficial process. 90 | 91 | * This method detects outliers by using Isolation Forest model. 92 | * Then, it discards those samples from training data. 93 | 94 | Parameters 95 | ---------- 96 | train_x: features of training data 97 | train_y: labels of training data 98 | """ 99 | 100 | clf = IsolationForest(behaviour='new', random_state=1, contamination='auto') 101 | preds = clf.fit_predict(train_x) 102 | for i in range(0, len(preds)): 103 | if preds[i] == -1: 104 | train_x = np.delete(train_x, i, 0) 105 | train_y = np.delete(train_y, i, 0) 106 | 107 | return train_x, train_y 108 | 109 | 110 | def feature_selection(train_x, train_y, test_x): 111 | 112 | """ 113 | The method uses Recursive Feature Elimination Feature method to choose subset of features. 114 | It is a wrapper method of feature selection techniques. 
115 | The main purpose is to reduce the dimension of the samples to avoid curse of dimensionality. 116 | 117 | Parameters 118 | ---------- 119 | train_x: features of training data 120 | test_x: features of testing data 121 | """ 122 | 123 | svc = SVC(kernel="linear") 124 | rfecv = RFECV(estimator=svc, step=1, cv=ShuffleSplit(n_splits=10, test_size=0.25, random_state=0), 125 | n_jobs=-1, scoring='accuracy') 126 | 127 | reduced_train_x = rfecv.fit_transform(train_x, train_y) 128 | reduced_test_x = rfecv.transform(test_x) 129 | return reduced_train_x, reduced_test_x 130 | 131 | 132 | def svc_param_selection(X, y, cv): 133 | 134 | """ 135 | The method aims to find best parameters for svc learning model. 136 | To accomplish this, It uses Grid Search Cross Validation method. 137 | 138 | A set of values are determined for each parameter of svc learning model. 139 | Grid Search chooses one of those values which maximizes the classification accuracy. 140 | 141 | Parameters 142 | ---------- 143 | X: features of training data 144 | y: labels of training data 145 | cv: cross validation object which will be applied during parameter search operation 146 | """ 147 | 148 | # a set of values for svc parameters 149 | param_c = [10**i for i in range(-11, 3)] 150 | param_gamma = [10**i for i in range(-11, 4)] 151 | param_coef = [10**i for i in range(-4, 4)] 152 | max_iter = [1000000] 153 | tol = [1e-3] 154 | 155 | param_grid = {'C': param_c, 'gamma': param_gamma, 'coef0': param_coef, 'max_iter': max_iter, 'tol': tol} 156 | grid_search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=cv) 157 | grid_search.fit(X, y) 158 | return grid_search.best_params_, grid_search.best_score_, grid_search.cv_results_ 159 | 160 | 161 | def bagging_param_selection(X, y, cv, classifier): 162 | 163 | """ 164 | The method aims to find best parameters for bagging learning model. 165 | To accomplish this, It uses Grid Search Cross Validation method. 166 | For base model of bagging classifier, svc learning model with best parameters is chosen. 167 | 168 | A set of values are determined for each parameter of bagging learning model. 169 | Grid Search chooses one of those values which maximizes the classification accuracy. 
170 | 171 | Parameters 172 | ---------- 173 | X: features of training data 174 | y: labels of training data 175 | cv: cross validation object which will be applied during parameter search operation 176 | classifier: base classifier for bagging learning model 177 | 178 | """ 179 | 180 | max_samples = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8] 181 | max_features = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] 182 | 183 | param_grid = {'max_samples': max_samples, 'max_features': max_features} 184 | grid_search = GridSearchCV(BaggingClassifier(classifier), param_grid, cv=cv) 185 | grid_search.fit(X, y) 186 | return grid_search.best_params_, grid_search.best_score_, grid_search.cv_results_ 187 | 188 | 189 | def write_output(predictions): 190 | 191 | with open('submission.csv', mode='w') as output_file: 192 | 193 | output_writer = csv.writer(output_file, delimiter=',') 194 | output_writer.writerow(["ID", "Predicted"]) 195 | 196 | for i in range(1, len(predictions) + 1): 197 | output_writer.writerow([i, int(predictions[i - 1])]) 198 | 199 | 200 | # ********** MAIN PROGRAM ********** # 201 | 202 | 203 | train_x, train_y, test_x = load_data() 204 | new_train_x, new_train_y = outlier_detection(train_x, train_y) 205 | reduced_train_x, reduced_test_x = feature_selection(new_train_x, new_train_y, test_x) 206 | 207 | 208 | # Best parameters are determined for support vector classifier (svc). 209 | # Then, svc learning model is created with those best parameters. 210 | 211 | print("SVC") 212 | best_params, best_score, cv_results = svc_param_selection(reduced_train_x, new_train_y, 213 | ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)) 214 | print("Best Params: ", best_params) 215 | print("Best Score: ", best_score) 216 | print("Mean Test Score: ", mean(cv_results["mean_test_score"])) 217 | best_svc = SVC(kernel="linear", C=best_params["C"], gamma=best_params["gamma"], random_state=5) 218 | 219 | 220 | # Best parameters are determined for bagging classifier. 221 | # When determining those parameters, best svc is chosen as base model of bagging classifier. 222 | 223 | print("\nBagging + SVC") 224 | best_params, best_score, cv_results = bagging_param_selection(reduced_train_x, new_train_y, 225 | ShuffleSplit(n_splits=10, test_size=0.25, random_state=0), 226 | best_svc) 227 | print("Best Params: ", best_params) 228 | print("Best Score: ", best_score) 229 | print("Mean Test Score: ", mean(cv_results["mean_test_score"])) 230 | 231 | # Bagging classifier is created by using best parameters of svc and bagging 232 | 233 | clf = BaggingClassifier(best_svc, max_samples=best_params["max_samples"], max_features=best_params["max_features"], 234 | random_state=5) 235 | clf.fit(reduced_train_x, new_train_y) 236 | predictions = clf.predict(reduced_test_x) 237 | write_output(predictions) 238 | -------------------------------------------------------------------------------- /Team 16/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> PCA -> KNN 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 16. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.decomposition import PCA 38 | from sklearn.feature_selection import chi2 39 | from sklearn.feature_selection import SelectKBest 40 | from sklearn.neighbors import KNeighborsClassifier 41 | 42 | 43 | def load_data(paths): 44 | 45 | """ 46 | The method reads train and test data set from their data set files. 47 | The directory of the files are passed to the function via the parameter "paths". 48 | 49 | Parameters 50 | ---------- 51 | paths: it is an array collecting directory paths of train and test data. 52 | """ 53 | 54 | train_data = pd.read_csv(paths[0]) 55 | test_data = pd.read_csv(paths[1]) 56 | 57 | y_train = train_data["class"] 58 | x_train = train_data 59 | x_train.drop("class", axis=1, inplace=True) 60 | 61 | return x_train, y_train, test_data 62 | 63 | 64 | def preprocessing(x_train, y_train, x_test): 65 | 66 | """ 67 | The method performs two dimensionality reduction methods: SelectKBest and PCA 68 | By using SelectKBest algorithm, it chooses top 80 features with highest chi square value. 69 | Then, this method synthesizes 5 new features for training and testing data by using PCA. 70 | Totally, the data sets are reduced to 5-dimensional space. 71 | 72 | Parameters 73 | ---------- 74 | x_train: features of train data 75 | y_train: labels of train data 76 | x_test: features of test data 77 | """ 78 | 79 | selector = SelectKBest(chi2, k=80) 80 | selector.fit(x_train, y_train) 81 | 82 | x_train_reduced = selector.transform(x_train) 83 | x_test_reduced = selector.transform(x_test) 84 | 85 | pca = PCA(n_components=5) 86 | pca.fit(x_train_reduced) 87 | 88 | x_train_reduced = pca.transform(x_train_reduced) 89 | x_test_reduced = pca.transform(x_test_reduced) 90 | 91 | return x_train_reduced, x_test_reduced 92 | 93 | 94 | def train_model(x_train, y_train): 95 | 96 | """ 97 | The method trains KNN classification model by using training data set. 98 | Then, It returns trained learning model. 99 | 100 | Parameters 101 | ---------- 102 | x_train: features of train data 103 | y_train: labels of train data 104 | """ 105 | 106 | clf = KNeighborsClassifier(n_neighbors=7) 107 | clf.fit(x_train, y_train) 108 | return clf 109 | 110 | 111 | def predict(model, x_test): 112 | 113 | """ 114 | The method predicts labels for testing data samples. 
115 | 116 | Parameters 117 | ---------- 118 | model: trained learning model (KNN) 119 | x_test: features of testing data 120 | """ 121 | return model.predict(x_test) 122 | 123 | 124 | def write_output(myPredict): 125 | ID = np.arange(1, len(myPredict) + 1) 126 | predictID = list(zip(ID, myPredict)) 127 | predictID = pd.DataFrame(predictID, columns=['ID', 'Predicted']) 128 | predictID.to_csv('submission.csv', index=False) 129 | 130 | 131 | # ********** MAIN PROGRAM ********** # 132 | 133 | x_tra, y_tra, x_tst = load_data(['train.csv', 'test.csv']) 134 | x_tra_reduced, x_test_reduced = preprocessing(x_tra, y_tra, x_tst) 135 | 136 | my_model = train_model(x_tra_reduced, y_tra) 137 | my_predict = predict(my_model, x_test_reduced) 138 | write_output(my_predict) 139 | -------------------------------------------------------------------------------- /Team 17/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * LDA -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 17. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.svm import SVC 38 | from sklearn.naive_bayes import GaussianNB 39 | from sklearn.linear_model import SGDClassifier 40 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA 41 | from sklearn.ensemble import RandomForestClassifier as RFC, VotingClassifier, GradientBoostingClassifier as GBC 42 | 43 | 44 | def load_data(tra_file_path, test_file_path): 45 | 46 | """ 47 | The method reads train and test data from data set files in dataframe format. 48 | When reading them, it benefits from directory paths of the files passed by the parameters. 49 | Then, it converts them into numpy arrays and returns them. 50 | 51 | Parameters 52 | ---------- 53 | tra_file_path: directory of training data 54 | test_file_path: directory of testing data 55 | """ 56 | 57 | x_tra = pd.read_csv(tra_file_path, sep=',') 58 | x_tst = pd.read_csv(test_file_path, sep=',') 59 | return np.array(x_tra), np.array(x_tst) 60 | 61 | 62 | def dimension_reduction(train_x, train_y, test_x): 63 | 64 | """ 65 | Linear Discriminant Analysis (LDA) can be used to both reduce the dimension and train a learning model. 66 | In this point, it is used as dimension reducer. 
It is like supervised version of PCA. 67 | This method benefits from LDA to reduce the dimension train and test data. 68 | 69 | Parameters 70 | ---------- 71 | train_x: features of training data 72 | train_y: labels of training data 73 | test_x: features of testing data 74 | """ 75 | 76 | lda = LDA(n_components=1) 77 | lda.fit(train_x, train_y) 78 | 79 | train_x_reduced = lda.transform(train_x) 80 | test_x_reduced = lda.transform(test_x) 81 | return train_x_reduced, test_x_reduced 82 | 83 | 84 | def train_model(train, target): 85 | 86 | """ 87 | The method creates 5 different learning models. 88 | Then, these 5 models are combined in a voting classifier. 89 | That voting classifier is trained with training data. 90 | 91 | Parameters 92 | ---------- 93 | train: features of training data 94 | target: labels of training data 95 | """ 96 | 97 | clf1 = SVC(kernel='rbf', gamma=5, C=80, random_state=1) 98 | clf2 = GaussianNB() 99 | clf3 = GBC(n_estimators=20, learning_rate=1.0, random_state=1) 100 | clf4 = RFC(n_estimators=20, random_state=1) 101 | clf5 = SGDClassifier(tol=1e-3, random_state=1) 102 | 103 | ensemble = VotingClassifier(estimators=[('svm', clf1), ('nb', clf2), ('gbc', clf3), ('rfc', clf4), ('sgd', clf5)], 104 | voting='hard') 105 | ensemble.fit(train, target) 106 | return ensemble 107 | 108 | 109 | def predict(model, x_test): 110 | 111 | """ 112 | The method predicts labels for testing data samples. 113 | 114 | Parameters 115 | ---------- 116 | model: trained voting classifier 117 | x_test: features of testing data 118 | """ 119 | return model.predict(x_test) 120 | 121 | 122 | def write_output(predictions): 123 | 124 | results = np.zeros((len(predictions), 2)) 125 | for i in range(len(predictions)): 126 | results[i][0] = i + 1 127 | results[i][1] = predictions[i] 128 | 129 | results = results.astype(int) 130 | predictions = predictions.astype(int) 131 | 132 | results = pd.DataFrame(data=results, columns=['ID', 'Predicted']) 133 | results.to_csv('submission.csv', index=False, sep=',') 134 | 135 | 136 | def main(tra_file_path, test_file_path): 137 | 138 | train_data, test_features = load_data(tra_file_path, test_file_path) 139 | train_features = train_data[:, 0:len(train_data[0]) - 1] 140 | train_labels = train_data[:, len(train_data[0]) - 1] 141 | 142 | x_train, x_test = dimension_reduction(train_features, train_labels, test_features) 143 | clf = train_model(x_train, train_labels) 144 | preds = predict(clf, x_test) 145 | write_output(preds) 146 | 147 | 148 | main('train.csv', 'test.csv') 149 | -------------------------------------------------------------------------------- /Team 18/classifiers.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Bulut Karabıyık - Cankurt Kostur 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | from sklearn.svm import SVC 6 | from sklearn.tree import DecisionTreeClassifier 7 | from sklearn.neighbors import KNeighborsClassifier 8 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 9 | from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis 10 | 11 | param_grids = [ 12 | {'n_neighbors': np.arange(1, 30, 2), 13 | }, 14 | { 15 | 'kernel': ['rbf', 'linear'], 16 | 'C': np.arange(0.025, 5, 0.025)}, 17 | { 18 | 'max_depth': np.arange(3, 10)}, 19 | 20 | { 21 | 'tol': [1e-4] 22 | }, 23 | { 24 | 'tol': [1.0e-4] 25 | } 26 | ] 27 | 28 | classifiers = [ 29 | KNeighborsClassifier(), 30 | SVC(probability=True), 31 | DecisionTreeClassifier(), 32 | 
LinearDiscriminantAnalysis(), 33 | QuadraticDiscriminantAnalysis()] 34 | 35 | """ 36 | * The list "param_grids" contains dictionary objects. 37 | * Each dictionary can have one or more than one parameter name and corresponding value range. 38 | * The values in that range are tried in cross validation by GridSearch to determine which one is 39 | the best value for that parameter. 40 | 41 | * The list "classifiers" contains learning model objects. 42 | * For each learning model in that list, best parameter set is determined by GridSearch. 43 | * Then, those models and their best parameter sets are used to construct powerful voting classifier 44 | """ 45 | -------------------------------------------------------------------------------- /Team 18/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> PCA -> Variance Thresholding -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 18. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import warnings 35 | import pandas as pd 36 | 37 | from sklearn.decomposition import PCA 38 | from sklearn.ensemble import VotingClassifier 39 | from sklearn.model_selection import GridSearchCV 40 | from sklearn.feature_selection import chi2, VarianceThreshold, SelectKBest 41 | 42 | from classifiers import * 43 | 44 | warnings.filterwarnings("ignore") 45 | warnings.filterwarnings(action='ignore', category=DeprecationWarning) 46 | warnings.filterwarnings(action='ignore', category=FutureWarning) 47 | 48 | 49 | def load_data(x_train_path, x_test_path): 50 | 51 | """ 52 | The method reads train and test data from dataset csv files. 53 | Then, train data is decomposed into features and labels. 54 | Finally, the method returns features and labels of train data and test data itself. 
55 | 56 | Parameters 57 | ---------- 58 | x_train_path: directory of training data set file 59 | x_test_path: directory of testing data set file 60 | """ 61 | 62 | all_data = pd.read_csv(x_train_path) 63 | x_test = pd.read_csv(x_test_path) 64 | 65 | y_train = all_data["class"] 66 | x_train = pd.read_csv(x_train_path) 67 | x_train.drop("class", axis=1, inplace=True) 68 | return x_train, y_train, x_test 69 | 70 | 71 | def preprocessing(x_train, y_train, x_test): 72 | 73 | """ 74 | * The method performs 3 dimensionality reduction methods: Variance Threshold - SelectKBest Algorithm - PCA. 75 | * It at first performs variance threshold, and eliminates all features whose variance values are lower than 0.001. 76 | * Then, it computes chi square value for each feature, and chooses top 10 features with highest chi square value. 77 | When doing this, it benefits from SelectKBest algorithm. 78 | * In final step, the method synthesizes 2 new features by using pca. 79 | 80 | Parameters 81 | ---------- 82 | x_train: features of training data 83 | y_train: labels of training data 84 | x_test: features of testing data 85 | """ 86 | 87 | selector_threshold = VarianceThreshold(0.001) 88 | selector_threshold.fit(x_train) 89 | 90 | x_train_new = selector_threshold.transform(x_train) 91 | x_test_new = selector_threshold.transform(x_test) 92 | 93 | selector = SelectKBest(chi2, k=10) 94 | selector.fit(x_train_new, y_train) 95 | 96 | x_train_new = selector.transform(x_train_new) 97 | x_test_new = selector.transform(x_test_new) 98 | 99 | pca = PCA(n_components=2, whiten=True) 100 | pca.fit(x_train_new) 101 | 102 | x_train_pca = pca.transform(x_train_new) 103 | x_test_pca = pca.transform(x_test_new) 104 | return x_train_pca, x_test_pca 105 | 106 | 107 | def train_model(x_train, y_train): 108 | 109 | """ 110 | * The method performs GridSearch operation to choose best parameter set for each classification model. 111 | * Then, the classification models with best parameter set and their names are stored in two different lists. 112 | * These two lists are used to combine these best-parametrized classification models in voting classifier. 113 | * That voting classifier is trained with training data and returned. 114 | 115 | Parameters 116 | ---------- 117 | x_train: features of training data 118 | y_train: labels of training data 119 | 120 | """ 121 | 122 | best_models = [] 123 | model_names = [] 124 | 125 | for i in range(len(classifiers)): 126 | 127 | model = classifiers[i] 128 | grid_search = GridSearchCV(model, param_grids[i], cv=5) 129 | 130 | grid_search.fit(x_train, y_train) 131 | best_models.append(grid_search.best_estimator_) 132 | 133 | name = model.__class__.__name__ 134 | model_names.append(name) 135 | 136 | estimators = [('knn', best_models[0]), ('SVC', best_models[1]), ('DT', best_models[2]), ('LA', best_models[3]), 137 | ('QA', best_models[4])] 138 | 139 | ensemble = VotingClassifier(estimators, voting='hard') 140 | ensemble.fit(x_train, y_train) 141 | return ensemble 142 | 143 | 144 | def predict(model, x_test): 145 | """ 146 | The method predicts labels for testing data samples by using trained learning model, that is voting classifier. 
147 | 148 | Parameters 149 | ---------- 150 | model: trained learning model 151 | x_test: features of testing data 152 | """ 153 | return model.predict(x_test) 154 | 155 | 156 | def write_output(prediction, file_name): 157 | ID = np.arange(1, len(prediction) + 1) 158 | Id_Predict = list(zip(ID, prediction)) 159 | Id_Predict = pd.DataFrame(Id_Predict, columns=['ID', 'Predicted']) 160 | Id_Predict.to_csv(file_name, index=False) 161 | 162 | 163 | # ********** MAIN PROGRAM ********** # 164 | 165 | x_train, y_train, x_test = load_data("train.csv", "test.csv") 166 | x_train_pca, x_test_pca = preprocessing(x_train, y_train, x_test) 167 | 168 | model = train_model(x_train_pca, y_train) 169 | predictions = predict(model, x_test_pca) 170 | write_output(predictions, "submission.csv") 171 | -------------------------------------------------------------------------------- /Team 19/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> KNN 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 19. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | import matplotlib.pyplot as plt 37 | 38 | from sklearn.neighbors import KNeighborsClassifier 39 | from sklearn.decomposition import PCA 40 | from sklearn.preprocessing import MinMaxScaler 41 | 42 | 43 | def load_data(train_path, test_path): 44 | 45 | """ 46 | The train and test data are read from their source files in data-frame format. 47 | Then, they are converted into numpy array and returned. 48 | 49 | Parameters 50 | ---------- 51 | train_path: directory of train data set file 52 | test_path: directory of test data set file 53 | """ 54 | 55 | train_data = np.asarray(pd.read_csv(train_path, skiprows=0)) 56 | test_data = np.asarray(pd.read_csv(test_path, skiprows=0)) 57 | return train_data, test_data 58 | 59 | 60 | def preprocessing(train_x, test_x): 61 | 62 | """ 63 | The method synthesizes new 70 features by using pca. 64 | In this way, the dimension of train and test data is reduced. 
65 | 66 | Parameters 67 | ---------- 68 | train_x: features of train data 69 | test_x: features of test data 70 | 71 | """ 72 | 73 | pca = PCA(n_components=70) 74 | x_train_clean = pca.fit_transform(train_x) 75 | x_test_clean = pca.transform(test_x) 76 | return x_train_clean, x_test_clean 77 | 78 | 79 | def find_component(train_x): 80 | 81 | """ 82 | This method is used to determine the number of features that data samples are reduced to. 83 | 84 | Parameters 85 | ---------- 86 | train_x: features of train data 87 | """ 88 | 89 | scaler = MinMaxScaler(feature_range=[0, 1]) 90 | data_rescaled = scaler.fit_transform(train_x) 91 | pca = PCA().fit(data_rescaled) 92 | 93 | plt.Figure() 94 | plt.plot(np.cumsum(pca.explained_variance_ratio_)) 95 | plt.xlabel('Number of components') 96 | plt.ylabel('Variance') 97 | plt.show() 98 | 99 | 100 | def train_model(x_train, y_train): 101 | 102 | """ 103 | The method creates KNN learning model and trains it by using training data. 104 | It returns trained learning model. 105 | 106 | Parameters 107 | ---------- 108 | x_train: features of training data 109 | y_train: labels of training data 110 | 111 | """ 112 | 113 | classifier = KNeighborsClassifier(n_neighbors=5) 114 | classifier.fit(x_train, y_train) 115 | return classifier 116 | 117 | 118 | def predict(model, x_test): 119 | 120 | """ 121 | The method predicts labels for testing data samples by using trained learning model. 122 | 123 | Parameters 124 | ---------- 125 | model: trained learning model (KNN) 126 | x_test: features of testing data 127 | 128 | """ 129 | 130 | predictions = model.predict(x_test) 131 | predictions.shape = (np.size(predictions), 1) 132 | return predictions 133 | 134 | 135 | def write_output(predictions): 136 | 137 | temp = np.ones((80, 1), dtype=float) 138 | for i in range(0, 80): 139 | temp[i] = i + 1 140 | 141 | y_csv = np.concatenate((temp, predictions), 1) 142 | np.savetxt('submission.csv', y_csv, delimiter=",", comments='', fmt='%.0f', header="ID,Predicted") 143 | 144 | 145 | # ********** MAIN PROGRAM ********** # 146 | 147 | train_data, test_data = load_data("train.csv", "test.csv") 148 | x_train = train_data[:, :595] 149 | y_train = train_data[:, 595] 150 | 151 | # find_component(x_train) -> it shows why 70 components are needed for pca 152 | 153 | x_train_clean, x_test_clean = preprocessing(x_train, test_data) 154 | classifier = train_model(x_train_clean, y_train) 155 | predictions = predict(classifier, x_test_clean) 156 | write_output(predictions) 157 | -------------------------------------------------------------------------------- /Team 2/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * MRMR algorithm -> Gradient Boosting 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 
21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 2. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | from reduce_dim import * 35 | from sklearn.ensemble import GradientBoostingClassifier 36 | 37 | # ********** MAIN PROGRAM ********** # 38 | 39 | # test and training data are read from csv file by helper functions 40 | tra_data, tst_features = load_data("train.csv", "test.csv") 41 | tra_features, tra_labels = split_data(tra_data) 42 | 43 | # Dimension of the data is reduced in a preprocessing step. 44 | pcc_tra_features, pcc_tst_features = apply_MRMR(3, tra_data, tst_features) 45 | 46 | grd_boost_clf = GradientBoostingClassifier(n_estimators=34, learning_rate=1.27, 47 | max_features=3, max_depth=3, random_state=7) 48 | 49 | grd_boost_clf.fit(pcc_tra_features, tra_labels) # training step 50 | predictions = grd_boost_clf.predict(pcc_tst_features) # testing step 51 | 52 | # the results are written to output file. 53 | write_output(predictions, "Submission.txt") 54 | -------------------------------------------------------------------------------- /Team 2/read_write.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Göktuğ Güvercin - Uğur Tepecik - Ege Apak 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | import pandas as pd 6 | 7 | 8 | def load_data(directory1, directory2): 9 | 10 | """ 11 | It reads the content of training and testing data files. 12 | Then, it returns them as numpy arrays 13 | 14 | Parameters 15 | ---------- 16 | directory1: directory of training file 17 | directory2: directory of testing file 18 | """ 19 | 20 | tra_data = np.array(pd.read_csv(directory1)) 21 | tst_data = np.array(pd.read_csv(directory2)) 22 | return tra_data, tst_data 23 | 24 | 25 | def split_data(dataset): 26 | 27 | """ 28 | The "dataset" array is split into features and labels. 29 | CAUTION: "dataset" numpy array must contain label values at the last column. 30 | """ 31 | 32 | labels = dataset[:, len(dataset[0]) - 1] 33 | features = dataset[:, :len(dataset[0]) - 1] 34 | return features, labels 35 | 36 | 37 | def write_output(predictions, directory): 38 | 39 | size = len(predictions) 40 | indices = np.array([i for i in range(1, size + 1)]) 41 | 42 | indices.shape = (size, 1) 43 | predictions.shape = (size, 1) 44 | submission_array = np.concatenate((indices, predictions), 1) 45 | 46 | np.savetxt(directory, submission_array, delimiter=",", fmt="%d", header="ID,Predicted", comments="") 47 | -------------------------------------------------------------------------------- /Team 2/reduce_dim.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Göktuğ Güvercin - Uğur Tepecik - Ege Apak 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | 5 | from read_write import * 6 | 7 | 8 | def create_dataframe(dataset): 9 | 10 | """ 11 | * This function takes "dataset" array as an argument, and creates a data frame object "df". 12 | This dataframe object is needed to compute correlation coefficient matrix. 
13 | * Finally, the method changes names of columns in data frame for easy use. 14 | 15 | Parameters 16 | ---------- 17 | dataset: It is 2D numpy array. Its last column should contain target scores (labels). 18 | :return: data frame object 19 | """ 20 | 21 | df = pd.DataFrame(dataset) 22 | df.columns = [i for i in range(len(df.columns) - 1)] + ["Labels"] 23 | return df 24 | 25 | 26 | def find_pcc_features(df, nof_features): 27 | 28 | """ 29 | * This method at first computes correlation matrix by using built-in corr() method. 30 | * Then, one of mutually-correlated features is eliminated. For example, feature A and feature B 31 | are well-correlated to each other. In this case, these features have similar behavior and 32 | similar effect on classification task, so we can discard one of them. We do not need to keep both of them. 33 | 34 | * After that, the row which stores the correlation between features and label is extracted. 35 | Absolute of that row is computed, because negative value only refers to inverse relation. 36 | We do not care forward or inverse relation; we care most related (max absolute values) correlations. 37 | * Then, correlation values are sorted in descending order. 38 | * Finally, indices of the features correlated to labels are stored in the list "pcc_features". 39 | 40 | * To summarize, MRMR (minimum redundancy maximum relevance) feature selection algorithm is performed. 41 | The features chosen by this algorithm is called pearson-correlation-coefficient (pcc) features. 42 | 43 | Parameters 44 | ---------- 45 | df: data frame object 46 | nof_features: the number of features that you reduce the dimension to 47 | :return: a list of indices of the features which are highly-correlated to labels 48 | """ 49 | 50 | pcc_features = [] 51 | corr_features = [] 52 | corr_matrix = df.corr() 53 | 54 | # determining similar (mutually-correlated) features 55 | for i in range(len(corr_matrix.columns) - 1): 56 | for j in range(i): 57 | if np.abs(corr_matrix[i][j]) > 0.75: 58 | corr_features.append(i) 59 | 60 | corr_matrix = corr_matrix.drop(corr_features) # eliminating similar features 61 | corr_label = corr_matrix["Labels"].abs() # taking absolute of correlations 62 | 63 | sorted_corr_label = corr_label.sort_values(na_position="last", ascending=False) 64 | feature_names = sorted_corr_label.index 65 | 66 | for i in range(1, nof_features + 1): # taking most informative n features 67 | pcc_features.append(feature_names[i]) 68 | 69 | return pcc_features 70 | 71 | 72 | def pcc_transform(features, indices): 73 | 74 | """ 75 | This method takes transpose of "features" array to access the features easily. 76 | Our dataset is in the dimension 120 x 595, which means that each row refers to one sample. 77 | I want to keep most correlated features (indices), and remove the other features. 78 | To accomplish this, each feature must be represented a list. 79 | In that list, all values which that feature took across all samples must be stored. 80 | This is only possible by taking transpose of "features" array 81 | 82 | Parameters 83 | ---------- 84 | features: two dimensional numpy array representing our samples without labels 85 | indices: index values of most correlated features 86 | :return: 87 | """ 88 | 89 | features_T = features.T 90 | features_T = features_T[indices] 91 | features = features_T.T 92 | return features 93 | 94 | 95 | def apply_MRMR(nof_features, tra_dataset, tst_features): 96 | 97 | """ 98 | This method creates a data frame to be able to compute correlation matrix. 
99 | Then, index values of most correlated n features are determined. 100 | By using these index values, training and testing features are reduced to lower dimension with pcc_transform(). 101 | 102 | Parameters 103 | ---------- 104 | nof_features: the number of features that you reduce the dimension to 105 | tra_dataset: Two dimensional numpy array (training set). Its last column refers to target scores (labels) 106 | tst_features: Two dimensional numpy array (testing set). It does not contain target scores (labels) 107 | :return: reduced training and testing set 108 | """ 109 | 110 | df = create_dataframe(tra_dataset) 111 | pcc_indices = find_pcc_features(df, nof_features) 112 | 113 | tra_features = tra_dataset[:, :len(tra_dataset[0]) - 1] 114 | pcc_tra_features = pcc_transform(tra_features, pcc_indices) 115 | pcc_tst_features = pcc_transform(tst_features, pcc_indices) 116 | 117 | return pcc_tra_features, pcc_tst_features 118 | 119 | -------------------------------------------------------------------------------- /Team 20/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> PCA -> Decision Tree Regressor 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 20. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | import pandas as pd 37 | 38 | from sklearn import tree 39 | from sklearn.decomposition import PCA 40 | 41 | warnings.simplefilter(action='ignore', category=FutureWarning) 42 | 43 | 44 | def load_data(filename): 45 | return pd.read_csv(filename) 46 | 47 | 48 | def preprocessing(train_set, test_set, nof_features): 49 | 50 | """ 51 | The method at first discards constant features, since those features have no effect on classification. 52 | Then, train data set is decomposed into features and labels. 53 | Finally, the method synthesizes 10 new features for train and test data by using pca.
54 | 55 | Parameters 56 | ---------- 57 | train_set: train data in data-frame format 58 | test_set: test data in data-frame format 59 | nof_features: number of features to be synthesized during pca 60 | """ 61 | 62 | train_set = train_set.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 63 | test_set = test_set.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 64 | 65 | train_features = train_set.iloc[:, 0:589].values 66 | train_labels = train_set.iloc[:, 589].values 67 | test_features = test_set.iloc[:, 0:589].values 68 | 69 | pca = PCA(n_components=nof_features) 70 | extracted_train_features = pca.fit_transform(train_features) 71 | extracted_test_features = pca.transform(test_features) 72 | 73 | return extracted_train_features, train_labels, extracted_test_features 74 | 75 | 76 | def train_model(train_x, train_y, test_x): 77 | 78 | """ 79 | The method creates a Decision Tree Regressor model, and trains it by using train data. 80 | Then, the method predicts labels for testing samples by using regressor model. 81 | Those labels are returned. 82 | 83 | Parameters 84 | ---------- 85 | train_x: features of train data 86 | train_y: labels of train data 87 | test_x: features of test data 88 | """ 89 | 90 | model = tree.DecisionTreeRegressor(random_state=7) 91 | model.fit(train_x, train_y) 92 | return model.predict(test_x) 93 | 94 | 95 | def write_output(predictions): 96 | 97 | with open('submission.csv', mode='w') as predicted_file: 98 | submission = csv.writer(predicted_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) 99 | submission.writerow(['ID', 'Predicted']) 100 | 101 | a = 1 102 | for i in predictions: 103 | submission.writerow([str(a), int(i)]) 104 | a = a + 1 105 | 106 | 107 | # ********** MAIN PROGRAM ********** # 108 | 109 | train_df = load_data('train.csv') 110 | test_df = load_data('test.csv') 111 | 112 | train_features, train_labels, test_features = preprocessing(train_df, test_df, 10) 113 | predictions = train_model(train_features, train_labels, test_features) 114 | write_output(predictions) 115 | -------------------------------------------------------------------------------- /Team 3/classifiers.py: -------------------------------------------------------------------------------- 1 | # Code Owner: Mümtaz Cem Eriş - İsmet Ata Yardımcı 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | 6 | from sklearn.svm import SVC 7 | from sklearn.naive_bayes import GaussianNB 8 | from sklearn.tree import DecisionTreeClassifier 9 | from sklearn.neighbors import KNeighborsClassifier 10 | from sklearn.linear_model import LogisticRegression, RidgeClassifier 11 | 12 | import xgboost as xgb 13 | from sklearn.ensemble import * 14 | from mlxtend.classifier import EnsembleVoteClassifier 15 | from sklearn.model_selection import GridSearchCV 16 | 17 | # ************************************************************************************************ 18 | # This source file only provides classifiers for other source files. 
The main program is main.py * 19 | # ************************************************************************************************ 20 | 21 | seed = 1075 22 | np.random.seed(seed) 23 | 24 | # Classifiers 25 | rf = RandomForestClassifier() 26 | et = ExtraTreesClassifier() 27 | knn = KNeighborsClassifier() 28 | svc = SVC() 29 | rg = RidgeClassifier() 30 | lr = LogisticRegression(solver='lbfgs') 31 | gnb = GaussianNB() 32 | dt = DecisionTreeClassifier(max_depth=1) 33 | 34 | # Bagging Classifiers 35 | bagging_clf = BaggingClassifier(rf, max_samples=0.4, max_features=10, random_state=seed) 36 | 37 | # Boosting Classifiers 38 | ada_boost = AdaBoostClassifier() 39 | ada_boost_svc = AdaBoostClassifier(base_estimator=svc, algorithm='SAMME') 40 | grad_boost = GradientBoostingClassifier() 41 | xgb_boost = xgb.XGBClassifier() 42 | 43 | # Voting Classifiers 44 | vclf = VotingClassifier(estimators=[('ada_boost', ada_boost), ('grad_boost', grad_boost), 45 | ('xgb_boost', xgb_boost), ('BaggingWithRF', bagging_clf)], voting='hard') 46 | 47 | ev_clf = EnsembleVoteClassifier(clfs=[ada_boost_svc, grad_boost, xgb_boost], voting='hard') 48 | 49 | # Grid Search 50 | params = {'gradientboostingclassifier__n_estimators': [10, 200], 51 | 'xgbclassifier__n_estimators': [10, 200]} 52 | 53 | grid = GridSearchCV(estimator=ev_clf, param_grid=params, cv=5) 54 | -------------------------------------------------------------------------------- /Team 3/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 3. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | 37 | from classifiers import * 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | from sklearn.model_selection import cross_val_score 41 | 42 | 43 | def load_data(tra_file, tst_file): 44 | 45 | """ 46 | This method reads the files in which training and testing data samples are located. 47 | Then, it returns training and testing data set in numpy array format. 
48 | 49 | Parameters 50 | ---------- 51 | tra_file: directory name of training dataset file 52 | tst_file: directory name of testing dataset file 53 | """ 54 | 55 | x_tra = np.genfromtxt(tra_file, delimiter=',') 56 | x_tst = np.genfromtxt(tst_file, delimiter=',') 57 | 58 | # delete first rows 59 | x_tra = np.delete(x_tra, 0, 0) 60 | x_tst = np.delete(x_tst, 0, 0) 61 | y_tra = x_tra[:, -1] 62 | 63 | # delete class row 64 | x_tra = np.delete(x_tra, -1, 1) 65 | return x_tra, x_tst, y_tra 66 | 67 | 68 | def preprocessing(x_tra, x_tst): 69 | 70 | """ 71 | Training and testing data set are at first scaled by using standard scaler method. 72 | Then, these data sets are reduced to lower dimension by using feature extraction pca technique. 73 | When performing these two operations, testing data set is not included in fit operation of scaler and pca. 74 | 75 | Parameters 76 | ---------- 77 | x_tra: training data set (they should not contain label values) 78 | x_tst: testing data set 79 | :return: reduced and scaled training and testing sets 80 | """ 81 | 82 | scaler = StandardScaler() 83 | scaler.fit(x_tra) 84 | 85 | x_tra_scaled = scaler.transform(x_tra) 86 | x_tst_scaled = scaler.transform(x_tst) 87 | 88 | pca = PCA(.95) 89 | pca.fit(x_tra_scaled) 90 | x_tra_reduced = pca.transform(x_tra_scaled) 91 | x_tst_reduced = pca.transform(x_tst_scaled) 92 | 93 | return x_tra_reduced, x_tst_reduced 94 | 95 | 96 | def train_model(x_tra_reduced, y_tra, model): 97 | 98 | model.fit(x_tra_reduced, y_tra) 99 | return model 100 | 101 | 102 | def predict(model, x_tst_reduced): 103 | 104 | predictions = model.predict(x_tst_reduced) 105 | return predictions 106 | 107 | 108 | def write_output(prediction, filename): 109 | 110 | with open(filename, 'w', newline='') as csvfile: 111 | filewriter = csv.writer(csvfile, delimiter=',') 112 | filewriter.writerow(["ID", "Predicted"]) 113 | id = 1 114 | 115 | for row in prediction: 116 | filewriter.writerow([id, row.astype(int)]) 117 | id += 1 118 | 119 | 120 | # RandomForestClassifier gives lots of warnings 121 | # therefore this line is added below 122 | warnings.filterwarnings("ignore") 123 | 124 | # Load data 125 | Xtra, Xtst, Ytra = load_data('train.csv', 'test.csv') 126 | Xtra_reduced, Xtst_reduced = preprocessing(Xtra, Xtst) 127 | 128 | 129 | # *************** SECTION 1 *************** # 130 | 131 | # Each of models would be trained and their cross validation score would be printed in this section. 
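# Note: predictions are also generated with each base model in the loop below, but they are not saved;
# only the "Ensemble" model's predictions are written to a csv file ("Ensemble.csv") in Section 2.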
132 | 133 | print("Classifiers cross-validation") 134 | 135 | labels_clf = ['RandomForest', 'ExtraTrees', 'KNeighbors', 'SVC', 'Ridge', 'LogisticRegression', 'GaussianNB', 136 | 'DecisionTree'] 137 | 138 | for model, label in zip([rf, et, knn, svc, rg, lr, gnb, dt], labels_clf): 139 | 140 | scores = cross_val_score(model, Xtra_reduced, Ytra, cv=5, scoring='accuracy') 141 | trained_model = train_model(Xtra_reduced, Ytra, model) 142 | prediction = predict(trained_model, Xtst_reduced) 143 | 144 | print("Mean: {0:.3f}, std: {1:.3f} [{2} is used.]".format(scores.mean(), scores.std(), label)) 145 | 146 | print("-----------------------------------\n") 147 | 148 | # *************** SECTION 2 *************** # 149 | 150 | # Ensemble models would be trained and their cross validation scores would be printed in this section 151 | 152 | print("Bagging, Boosting and GridSearchCV cross-validation") 153 | 154 | labels = ['Ada Boost', 'Ada BoostSVC', 'Grad Boost', 'XG Boost', 'Ensemble', 'Voting', 155 | 'BaggingWithRF', 'Grid'] 156 | 157 | for model, label in zip([ada_boost, ada_boost_svc, grad_boost, xgb_boost, ev_clf, vclf, bagging_clf, grid], labels): 158 | 159 | scores = cross_val_score(model, Xtra_reduced, Ytra, cv=5, scoring='accuracy') 160 | trained_model = train_model(Xtra_reduced, Ytra, model) 161 | prediction = predict(trained_model, Xtst_reduced) 162 | 163 | if label == "Ensemble": 164 | write_output(prediction, label + ".csv") 165 | 166 | if label == "Ensemble": 167 | print("Mean: {0:.3f}, std: {1:.3f} [*{2} is used. (Chosen model)]".format(scores.mean(), scores.std(), label)) 168 | 169 | else: 170 | print("Mean: {0:.3f}, std: {1:.3f} [{2} is used.]".format(scores.mean(), scores.std(), label)) 171 | -------------------------------------------------------------------------------- /Team 4/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Decision Tree Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 4. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. 
""" 33 | 34 | import csv 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.decomposition import PCA 38 | from sklearn.preprocessing import StandardScaler 39 | from sklearn.tree import DecisionTreeClassifier 40 | 41 | np.random.seed(1) # Anchoring randomization during training step 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test file to obtain train and test data. 48 | Then, it splits train data into two parts which are features and labels. 49 | :return: features and labels of train data and test data are returned. 50 | 51 | """ 52 | 53 | train_data = pd.read_csv('train.csv') 54 | test_data = pd.read_csv('test.csv') 55 | 56 | all_train = train_data.iloc[:, :].values 57 | y_train = train_data.iloc[:, 595].values # labels 58 | x_train = train_data.iloc[:, 0:595].values # features 59 | 60 | x_test = test_data.iloc[:, 0:595].values 61 | 62 | return x_train, y_train, x_test 63 | 64 | 65 | def standardization(x_train, x_test): 66 | 67 | """ 68 | The method performs standard scaling on training and testing data. 69 | When doing this, only train data is included in training phase (fitting operation). 70 | 71 | Parameters 72 | ---------- 73 | x_train: features of train data 74 | x_test: features of test data 75 | 76 | """ 77 | 78 | sc = StandardScaler() 79 | x_train_std = sc.fit_transform(x_train) 80 | x_test_std = sc.transform(x_test) 81 | 82 | return x_train_std, x_test_std 83 | 84 | 85 | def dim_red(x_train, x_test): 86 | 87 | """ 88 | The method reduces the dimension of training and testing data by using PCA. 89 | When doing this, only training data is included in training phase (fitting operation). 90 | 91 | Parameters 92 | ---------- 93 | x_train: features of scaled training data 94 | x_test: features of scaled testing data 95 | 96 | """ 97 | 98 | pca = PCA(n_components=15) 99 | x_train_red = pca.fit_transform(x_train) 100 | x_test_red = pca.transform(x_test) 101 | 102 | return x_train_red, x_test_red 103 | 104 | 105 | def decision_tree(criterion_name, x_train, y_train, x_test): 106 | 107 | """ 108 | The method creates a decision tree learning model, trains it by using train data and generate predictions for 109 | testing data. 110 | Then, predicted labels are returned. 
111 | 112 | Parameters 113 | ---------- 114 | criterion_name: it specifies which function is used to measure quality of a split 115 | x_train: features of scaled and reduced training set 116 | y_train: labels of scaled and reduced training set 117 | x_test: features of scaled and reduced testing set 118 | 119 | """ 120 | 121 | dtc = DecisionTreeClassifier(criterion=criterion_name, max_depth=4, 122 | max_features=3, random_state=3, max_leaf_nodes=2) 123 | dtc.fit(x_train, y_train) 124 | y_pred = dtc.predict(x_test) 125 | 126 | return y_pred 127 | 128 | 129 | def write_output(y_pred): 130 | 131 | fields = ['ID', 'Predicted'] 132 | filename = "submission.csv" 133 | rows = list() 134 | with open(filename, 'w', newline="") as csvfile: 135 | csvwriter = csv.writer(csvfile) 136 | csvwriter.writerow(fields) 137 | for i in range(len(y_pred)): 138 | rows.append([i+1, y_pred[i]]) 139 | csvwriter.writerows(rows) 140 | 141 | 142 | # ********** MAIN PROGRAM ********** # 143 | 144 | x_train, y_train, x_test = load_data() 145 | x_train_std, x_test_std = standardization(x_train, x_test) 146 | x_train_red, x_test_red = dim_red(x_train_std, x_test_std) 147 | 148 | y_pred = decision_tree('entropy', x_train_red, y_train, x_test_red) 149 | write_output(y_pred) 150 | -------------------------------------------------------------------------------- /Team 5/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Min-max Scaling -> Elimination of Highly Correlated Features -> SelectKBest algorithm -> XGBoost 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 5. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from numpy import genfromtxt 38 | from xgboost import XGBClassifier 39 | from sklearn.preprocessing import MinMaxScaler 40 | from sklearn.feature_selection import SelectKBest, chi2 41 | 42 | 43 | def load_data(train_path, test_path): 44 | 45 | """ 46 | The method reads train and test data from their files. 47 | Then, it deletes feature names by using numpy delete() method. 48 | Finally, it splits train data into two parts: train features (train_x) and train labels (train_y).
49 | 50 | Parameters 51 | ---------- 52 | train_path: directory name of the file in which training data exists 53 | test_path: directory name of the file in which testing data exists 54 | 55 | """ 56 | 57 | train = genfromtxt(train_path, delimiter=',') 58 | train = np.delete(train, 0, 0) 59 | test = genfromtxt(test_path, delimiter=',') 60 | test = np.delete(test, 0, 0) 61 | 62 | train_x = train[:, 0:595] 63 | train_y = train[:, 595] 64 | 65 | return train_x, train_y, test 66 | 67 | 68 | def preprocessing(train_x, train_y, test): 69 | 70 | """ 71 | * The method at first performs min-max scaling on train and test data. 72 | * Then, it computes correlation matrix to find out which features are mostly-correlated to each other. 73 | Two features which are very correlated to each other have almost same impact on classification and labels. 74 | Hence, one of these two features is discarded. We do not need to use both of them. 75 | * Finally, selectKBest() algorithm is used with chi square value to choose top 100 features. 76 | * Totally, two feature selection methods are used to reduce the dimension of training and testing data. 77 | 78 | Parameters 79 | ---------- 80 | train_x: features of train data samples 81 | train_y: labels of train data samples 82 | test: features of test data samples 83 | 84 | """ 85 | 86 | scaler = MinMaxScaler() 87 | scaler.fit(train_x) 88 | scaled_test = scaler.transform(test) 89 | scaled_train_x = scaler.transform(train_x) 90 | 91 | df_test = pd.DataFrame(data=scaled_test) 92 | df_train_x = pd.DataFrame(data=scaled_train_x) 93 | 94 | corr_matrix = df_train_x.corr().abs() 95 | upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)) 96 | drop = [column for column in upper.columns if any(upper[column] > 0.9)] # finding highly-correlated features 97 | 98 | df_train_x = df_train_x.drop(df_train_x.columns[drop], axis=1) # discarding those features 99 | df_test = df_test.drop(df_test.columns[drop], axis=1) 100 | 101 | # Select features 102 | selector = SelectKBest(chi2, k=100) 103 | selector.fit(df_train_x, train_y) 104 | reduced_train_x = selector.transform(df_train_x) 105 | reduced_test = selector.transform(df_test) 106 | 107 | return reduced_train_x, reduced_test 108 | 109 | 110 | def train_model(train_x, train_y): 111 | 112 | """ 113 | The method trains XGB Classifier model by using training samples and their ground truth labels. 114 | 115 | Parameters 116 | ---------- 117 | train_x: features of training samples 118 | train_y: labels of training samples 119 | 120 | """ 121 | 122 | s_clf = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=3, gamma=0.2, 123 | subsample=0.6, colsample_bytree=1.0, objective='binary:logistic', nthread=4, 124 | scale_pos_weight=1, seed=27, silent=1) 125 | 126 | s_clf.fit(train_x, train_y) 127 | return s_clf 128 | 129 | 130 | def predict(model, test): 131 | 132 | """ 133 | The method make predictions by using passed learning model "model" and testing data "scaled_test". 
134 | 135 | Parameters 136 | ---------- 137 | model: Learning model object trained by training data 138 | scaled_test: features of testing samples 139 | 140 | """ 141 | 142 | predictions = model.predict(test) 143 | return predictions 144 | 145 | 146 | def write_output(prediction): 147 | # Write output to csv file 148 | f = open("submission.csv", "w+") 149 | f.write("ID,Predicted\n") 150 | for i, item in enumerate(prediction): 151 | f.write(str(1+i) + "," + str(int(item)) + "\n") 152 | f.close() 153 | 154 | 155 | # ********** MAIN PROGRAM ********** # 156 | 157 | train_x, train_y, test = load_data('train.csv', 'test.csv') 158 | scaled_train_x, scaled_test = preprocessing(train_x, train_y, test) 159 | 160 | model = train_model(scaled_train_x, train_y) 161 | predictions = predict(model, scaled_test) 162 | write_output(predictions) 163 | -------------------------------------------------------------------------------- /Team 6/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> SVM 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 6. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.svm import SVC 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | 41 | np.random.seed(5) # anchoring randomization during training step 42 | 43 | 44 | def load_data(): 45 | 46 | train_data = pd.read_csv('train.csv', header=0) 47 | test_data = pd.read_csv('test.csv', header=0) 48 | return train_data, test_data 49 | 50 | 51 | def preprocessing(train_data, test_data): 52 | 53 | """ 54 | The method splits features and labels of training data into two separate sections. 55 | Then, it applies standard scaling operation on training and testing sets.
56 | 57 | Parameters: 58 | ----------- 59 | train_data: It is numpy array containing features and labels together 60 | test_data: It is numpy array containing only features """ 61 | 62 | x_train = train_data.iloc[:, 0:595].values 63 | y_train = train_data.iloc[:, 595:].values 64 | x_test = test_data.iloc[:].values 65 | 66 | sc = StandardScaler() 67 | scaled_x_train = sc.fit_transform(x_train) 68 | scaled_x_test = sc.transform(x_test) 69 | 70 | return scaled_x_train, scaled_x_test, y_train 71 | 72 | 73 | def dimension_reduction(scaled_x_train, scaled_x_test): 74 | 75 | """ 76 | The method performs pca technique to reduce the dimension of feature space in which observations are situated. 77 | 78 | Parameters: 79 | ----------- 80 | scaled_x_train: scaled features of training data 81 | scaled_x_test: scaled features of testing data """ 82 | 83 | # reducing training and testing samples into lower dimension 84 | pca = PCA(n_components=2) 85 | reduced_x_train = pca.fit_transform(scaled_x_train) 86 | reduced_x_test = pca.transform(scaled_x_test) 87 | 88 | return reduced_x_train, reduced_x_test 89 | 90 | 91 | def train_model(reduced_x_train, y_train): 92 | 93 | """ 94 | The method trains svm learning model by using training features cross its ground truth labels. 95 | 96 | Parameters: 97 | ----------- 98 | reduced_x_train: features of training data lying on principal component axes 99 | y_train: ground truth labels of those features """ 100 | 101 | svc = SVC(kernel='poly', gamma=0.5, C=1, random_state=3) 102 | svc.fit(reduced_x_train, y_train.ravel()) 103 | return svc 104 | 105 | 106 | def predict(svc, reduced_x_test): 107 | 108 | """ 109 | The method predicts outputs for test samples by using learning model. 110 | 111 | Parameters: 112 | ----------- 113 | svc: trained learning model 114 | reduced_x_test: features of testing set lying on principal component axes """ 115 | 116 | predictions = svc.predict(reduced_x_test) 117 | return predictions 118 | 119 | 120 | def write_output(predictions): 121 | 122 | # writing predictions to the file 123 | f = open("submission.csv", "w") 124 | f.write("ID,Predicted\n") 125 | for i in range(0, len(predictions)): 126 | f.write(str(i+1) + "," + str(predictions[i]) + "\n") 127 | 128 | 129 | # ******* Main Program ******* # 130 | 131 | train_data, test_data = load_data() 132 | 133 | scaled_x_train, scaled_x_test, y_train, = preprocessing(train_data, test_data) 134 | reduced_x_train, reduced_x_test = dimension_reduction(scaled_x_train, scaled_x_test) 135 | 136 | svc_model = train_model(reduced_x_train, y_train) 137 | predictions = predict(svc_model, reduced_x_test) 138 | 139 | write_output(predictions) 140 | -------------------------------------------------------------------------------- /Team 7/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> SelectKBest Algorithm -> PCA -> Decision Tree Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 7. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.tree import DecisionTreeClassifier 38 | from sklearn.decomposition import PCA 39 | from sklearn.feature_selection import SelectKBest 40 | from sklearn.feature_selection import chi2 41 | 42 | 43 | def load_data(tra_name, test_name): 44 | 45 | """ 46 | The method reads the training and testing data from their csv files. 47 | 48 | Parameters 49 | ---------- 50 | tra_name: directory of the training dataset file 51 | test_name: directory of testing dataset file 52 | 53 | """ 54 | 55 | train_data = pd.read_csv(tra_name) 56 | test_data = pd.read_csv(test_name) 57 | return train_data, test_data 58 | 59 | 60 | def preprocessing(train_data, test_data): 61 | 62 | """ 63 | The method at first eliminates constant features. 64 | Then, it chooses top 100 features by evaluating chi square values of each feature. 65 | Finally, these 100 features are reduced to 80 features by using principal component analysis. 
66 | 67 | Parameters 68 | ---------- 69 | train_data: training dataset containing features and labels 70 | test_data: testing dataset containing only features 71 | 72 | """ 73 | 74 | train_data = train_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 75 | train_data = np.asarray(train_data) 76 | 77 | train_x = train_data[:, :train_data.shape[1] - 1] 78 | train_y = train_data[:, train_data.shape[1] - 1] 79 | train_y.shape = (np.size(train_y), 1) 80 | 81 | test_data = test_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 82 | test_data = np.asarray(test_data) 83 | 84 | selector = SelectKBest(score_func=chi2, k=100) 85 | selector.fit(train_x, train_y) 86 | 87 | train_features = selector.transform(train_x) 88 | test_features = selector.transform(test_data) 89 | 90 | pca = PCA(n_components=80) 91 | x_tra_pca = pca.fit_transform(train_features) 92 | x_test_pca = pca.transform(test_features) 93 | 94 | return x_tra_pca, train_y, x_test_pca 95 | 96 | # ********** MAIN PROGRAM ********** # 97 | 98 | 99 | train_data, test_data = load_data("train.csv", "test.csv") 100 | x_tra_pca, train_y, x_test_pca = preprocessing(train_data, test_data) 101 | 102 | 103 | clf = DecisionTreeClassifier(random_state=25) 104 | clf.fit(x_tra_pca, train_y) 105 | predictions = clf.predict(x_test_pca) 106 | 107 | 108 | yt = pd.DataFrame(predictions, dtype='int32') 109 | yt.columns = ["Predicted"] 110 | yt.index += 1 111 | yt.to_csv("./submission.csv", index_label="ID") 112 | -------------------------------------------------------------------------------- /Team 8/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Logistic Regression 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 8. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | import numpy as np 37 | 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | from sklearn.linear_model import LogisticRegression 41 | 42 | warnings.filterwarnings("ignore", category=FutureWarning) 43 | 44 | 45 | def load_data(x_paths): 46 | 47 | """ 48 | The method reads train and test data from data set files. 49 | Then, it splits train data into two pieces: features and labels. 
50 | 51 | Parameters 52 | ---------- 53 | x_paths: directory of train and test data files 54 | 55 | """ 56 | 57 | data = np.matrix(np.genfromtxt(x_paths+'train.csv', delimiter=',')) 58 | x_train = np.asarray(data[1:, 0:595]) 59 | y_train = np.asarray(data[1:, 595]) 60 | 61 | data2 = np.matrix(np.genfromtxt(x_paths+'test.csv', delimiter=',')) 62 | x_test = np.asarray(data2[1:, 0:595]) 63 | 64 | return x_train, y_train, x_test 65 | 66 | 67 | def preprocessing(x_train, x_test): 68 | 69 | """ 70 | The method performs standard scaling on training and testing data. 71 | Then, it reduces the dimension of training and testing data by using pca. 72 | 73 | Parameters 74 | ---------- 75 | x_train: features of training data 76 | x_test: features of testing data 77 | 78 | """ 79 | 80 | sc = StandardScaler() 81 | x_train = sc.fit_transform(x_train) 82 | x_test = sc.transform(x_test) 83 | 84 | pca = PCA(n_components=2) 85 | x_train = pca.fit_transform(x_train) 86 | x_test = pca.transform(x_test) 87 | 88 | return x_train, x_test 89 | 90 | 91 | def train_model(x_train, y_train): 92 | 93 | """ 94 | The method creates a logistic regression classifier, and trains it with training data. 95 | 96 | Parameters 97 | ---------- 98 | x_train: features of training data 99 | y_train: labels of training data 100 | 101 | """ 102 | 103 | classifier = LogisticRegression(random_state=0) 104 | classifier.fit(x_train, np.ravel(y_train, order='C')) 105 | return classifier 106 | 107 | 108 | def predict(x_test, model): 109 | 110 | """ 111 | The method predicts labels for testing data by using model object. 112 | 113 | Parameters 114 | ---------- 115 | x_test: features of testing data 116 | model: trained learning model 117 | 118 | """ 119 | 120 | y_pred = model.predict(x_test) 121 | return y_pred 122 | 123 | 124 | def write_output(y_pred): 125 | 126 | ID = 1 127 | lines = [["ID", "Predicted"]] 128 | 129 | for i in y_pred: 130 | # Reobtaining the ID is simple since the samples remain in order 131 | temp = [ID, int(i)] 132 | ID += 1 133 | lines.append(temp) 134 | 135 | # Write the output in a file 136 | with open('submission.csv', 'w') as writeFile: 137 | writer = csv.writer(writeFile) 138 | writer.writerows(lines) 139 | writeFile.close() 140 | 141 | # ********** MAIN PROGRAM ********** # 142 | 143 | 144 | Data = load_data("") 145 | x_train, y_train, x_test = load_data("") 146 | x_train_pca, x_test_pca = preprocessing(x_train, x_test) 147 | 148 | model = train_model(x_train_pca, y_train) 149 | predictions = predict(x_test_pca, model) 150 | write_output(predictions) 151 | -------------------------------------------------------------------------------- /Team 9/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Min-Max Scaling -> PCA -> Bagging Classifier (Base: KNN) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 9. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import pandas as pd 36 | 37 | from sklearn.preprocessing import MinMaxScaler 38 | from sklearn.decomposition import PCA 39 | 40 | from sklearn.ensemble import BaggingClassifier 41 | from sklearn.neighbors import KNeighborsClassifier 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test data from data files. 48 | Then, it splits train data into features and labels. 49 | """ 50 | 51 | train_data = pd.read_csv("train.csv") 52 | test_data = pd.read_csv("test.csv") 53 | train_x = train_data.iloc[:, 0:595] 54 | train_y = train_data.iloc[:, -1] 55 | 56 | return train_x, train_y, test_data 57 | 58 | 59 | def preprocessing(train_x, test_x): 60 | 61 | """ 62 | The method at first performs min-max scaling on train and testing data. 63 | Then, it reduces dimension of train and test data by using pca. 64 | 65 | Parameters 66 | ---------- 67 | train_x: features of train data 68 | test_x: features of test data 69 | 70 | """ 71 | 72 | scaler = MinMaxScaler() 73 | scl_train_x = scaler.fit_transform(train_x) 74 | scl_test_x = scaler.transform(test_x) 75 | 76 | pca = PCA(n_components=5) 77 | pca.fit(scl_train_x) 78 | pca_train_x = pca.transform(scl_train_x) 79 | pca_test_x = pca.transform(scl_test_x) 80 | 81 | return pca_train_x, pca_test_x 82 | 83 | 84 | def train_model(train_x, train_y): 85 | 86 | """ 87 | The method trains a learning model by using training data. 88 | 89 | Parameters 90 | ---------- 91 | train_x: features of training data 92 | train_y: labels of training data 93 | 94 | """ 95 | model = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, random_state=4) 96 | model.fit(train_x, train_y) 97 | return model 98 | 99 | 100 | def predict(model, test_x): 101 | 102 | """ 103 | The method predicts labels for testing samples by using trained model. 
104 | 105 | Parameters 106 | ---------- 107 | model: trained model 108 | test_x: features of testing data 109 | """ 110 | return model.predict(test_x) 111 | 112 | 113 | def write_output(predictions): 114 | 115 | rows = [[count, pred] for count, pred in enumerate(predictions, start=1)] 116 | 117 | with open("submission.csv", mode='w', newline='') as file: 118 | writer = csv.writer(file) 119 | writer.writerow(["ID", "Predicted"]) 120 | writer.writerows(rows) 121 | 122 | 123 | # ********** MAIN PROGRAM ********** # 124 | 125 | train_x, train_y, test_x = load_data() 126 | pca_train_x, pca_test_x = preprocessing(train_x, test_x) 127 | model = train_model(pca_train_x, train_y) 128 | predictions = predict(model, pca_test_x) 129 | write_output(predictions) 130 | --------------------------------------------------------------------------------
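A note on scoring: every pipeline above ends by writing a two-column ID,Predicted csv file. The snippet below is a minimal sketch, not part of the original toolbox, for scoring such a file against held-out labels with accuracy, sensitivity, and specificity. The file name ground_truth.csv and its ID,Label columns are hypothetical placeholders, since the hidden test labels are not distributed with this repository.

import pandas as pd
from sklearn.metrics import confusion_matrix

def score_submission(submission_path="submission.csv", truth_path="ground_truth.csv"):
    # "ground_truth.csv" is a hypothetical ID,Label file; align both files on the ID column
    pred = pd.read_csv(submission_path).sort_values("ID")["Predicted"].to_numpy()
    true = pd.read_csv(truth_path).sort_values("ID")["Label"].to_numpy()

    # with labels=[0, 1], confusion_matrix returns [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(true, pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = score_submission()
print("accuracy: %.3f  sensitivity: %.3f  specificity: %.3f" % (acc, sens, spec))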