├── Fig1.png ├── Fig2.png ├── LICENSE ├── README.md ├── Team 1 └── main.py ├── Team 10 └── main.py ├── Team 11 └── main.py ├── Team 12 └── main.py ├── Team 13 └── main.py ├── Team 14 └── main.py ├── Team 15 └── main.py ├── Team 16 └── main.py ├── Team 17 └── main.py ├── Team 18 ├── classifiers.py └── main.py ├── Team 19 └── main.py ├── Team 2 ├── main.py ├── read_write.py └── reduce_dim.py ├── Team 20 └── main.py ├── Team 3 ├── classifiers.py └── main.py ├── Team 4 └── main.py ├── Team 5 └── main.py ├── Team 6 └── main.py ├── Team 7 └── main.py ├── Team 8 └── main.py └── Team 9 └── main.py /Fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/basiralab/BrainNet-ML-ToolBox/5a90b050f76e07caf77c3057186acf5d88db46be/Fig1.png -------------------------------------------------------------------------------- /Fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/basiralab/BrainNet-ML-ToolBox/5a90b050f76e07caf77c3057186acf5d88db46be/Fig2.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 BASIRA LAB 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BrainNet-ML-ToolBox: A Python Machine Learning Toolbox for Brain Network Classification 2 | 3 | 4 | The **BrainNet-ML-ToolBox** library supports combining models and scores from 5 | key ML libraries such as `scikit-learn` and `xgboost` for data preprocessing, dimensionality reduction, and classification. This toolbox has been put together and polished by Goktug Guvercin (goktug150140@gmail.com). 6 | 7 | # Introduction 8 | 9 | This repo is a machine learning (ML) toolbox including 20 different pipelines for brain network classification. 10 | 11 | Autism spectrum disorder (ASD) affects brain connectivity at different levels. Nonetheless, non-invasively distinguishing such effects using magnetic resonance imaging (MRI) remains very challenging for machine learning diagnostic frameworks due to ASD heterogeneity.
So far, existing network neuroscience works have mainly focused on functional (derived from functional MRI) and structural (derived from diffusion MRI) brain connectivity, which might not capture relational morphological changes between brain regions. Indeed, machine learning (ML) studies for ASD diagnosis using morphological brain networks derived from conventional T1-weighted MRI are very scarce. 12 | 13 | To fill this gap, we leverage crowdsourcing by organizing a **Kaggle competition** to build a pool of machine learning pipelines for neurological disorder diagnosis, with application to ASD diagnosis using cortical morphological networks derived from T1-weighted MRI. The general aim of this challenge was to encourage the competitors to come up with machine learning pipelines that can differentiate normal controls from autistic subjects using cortical morphological networks. The competitors were allowed to use built-in machine learning methods to design their brain network classification frameworks. **In this repository, we include the source codes of the top 20 teams in the competition.** 14 | 15 | During the competition, participants were provided with a training dataset and were only allowed to check their performance on a public test set. The final evaluation was performed on both public and hidden test sets based on accuracy, sensitivity, and specificity metrics. Teams were ranked using each performance metric separately, and the final ranking was determined by the mean of all rankings. **The first-ranked team (Team-1) achieved 70% accuracy, 72.5% sensitivity, and 67.5% specificity, while the second-ranked team (Team-2) achieved 63.8%, 62.5%, and 65%, respectively.** 16 | 17 | ![BrainNet pipeline](https://github.com/basiralab/BrainNet-ML-ToolBox/blob/master/Fig1.png) 18 | ![BrainNet pipeline](https://github.com/basiralab/BrainNet-ML-ToolBox/blob/master/Fig2.png) 19 | 20 | # Installation 21 | 22 | The source code has been tested with Python 3.6.2 through the PyCharm IDE on macOS. No GPU is needed to run the code. 23 | 24 | Required Python modules: 25 | 26 | * csv 27 | * numpy 28 | * pandas 29 | * xgboost 30 | * mlxtend 31 | * statistics 32 | * warnings 33 | * matplotlib 34 | * scikit-learn 35 | 36 | # Dataset format 37 | 38 | The brain network training dataset comprised 120 samples, each represented by 595 morphological connectivity features, with the ground-truth label stored in the last column. The testing set comprised 80 samples described by features only. If you intend to run the source code on your own dataset, operations inside the code such as constant feature elimination and loading data from CSV files can be modified accordingly.
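For illustration, the snippet below sketches how most of the included pipelines read this format; it is a simplified example rather than the code of any particular team. It assumes the competition file names `train.csv` and `test.csv`, the ground-truth label in the last column of the training file, and the constant feature columns (`X3`, `X31`, `X32`, `X127`, `X128`, `X590`) that several teams drop.

```python
import pandas as pd

# Read the competition data (file names as used by the team scripts in this repo).
train_data = pd.read_csv("train.csv")   # 120 samples x (595 features + label)
test_data = pd.read_csv("test.csv")     # 80 samples x 595 features

# Split the training set into features and ground-truth labels (last column).
x_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]

# Drop the constant (zero-valued) connectivity features, as several teams do.
constant_columns = ["X3", "X31", "X32", "X127", "X128", "X590"]
x_train = x_train.drop(columns=constant_columns)
x_test = test_data.drop(columns=constant_columns)

# Each team then applies its own dimensionality reduction (e.g., SelectKBest, PCA)
# and classifier to x_train / y_train before predicting labels for x_test.
```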
39 | 40 | # Please cite the following paper when using BrainNet-ML-ToolBox: 41 | 42 | ``` 43 | @article{brainNetML2020, 44 | title={Machine Learning Methods for Brain Network Classification: Application to Autism Diagnosis using Cortical Morphological Networks}, 45 | author={Bilgen, Ismail and Guvercin, Goktug and Rekik, Islem}, 46 | journal={arXiv preprint arXiv:2004.13321}, 47 | year={2020} 48 | } 49 | 50 | ``` 51 | Paper link on arXiv: 52 | https://arxiv.org/pdf/2004.13321v1.pdf 53 | 54 | 55 | -------------------------------------------------------------------------------- /Team 1/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> SelectKBest Algorithm -> Gradient Boosting 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 1. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | 35 | import csv 36 | import pandas as pd 37 | from sklearn.ensemble import GradientBoostingClassifier 38 | from sklearn.feature_selection import chi2, SelectKBest 39 | 40 | 41 | def load_data(filename): 42 | train_data = pd.read_csv(filename[0]) 43 | test_data = pd.read_csv(filename[1]) 44 | return train_data, test_data 45 | 46 | 47 | def filter_feature_selection(X, y, nof_features): 48 | 49 | """ 50 | This function performs selectKBest filtering algorithm to compute importance scores of features. 51 | Then, these scores are concatenated with features themselves. 52 | Finally, top k most important features are selected. 53 | 54 | Parameters 55 | ---------- 56 | X: features of train data set 57 | y: labels of train data set 58 | nof_features: number of features to be selected """ 59 | 60 | best_features = SelectKBest(score_func=chi2, k="all") 61 | best_features.fit(X, y) 62 | df_scores = pd.DataFrame(best_features.scores_) 63 | df_columns = pd.DataFrame(X.columns) 64 | 65 | # Concatenate features and related score for visualization 66 | feature_scores = pd.concat([df_columns, df_scores], axis=1) 67 | feature_scores.columns = ['Specs', 'Score'] 68 | 69 | # Select top K features 70 | ft = feature_scores.nlargest(nof_features, 'Score') 71 | features = ft.index.values 72 | return features 73 | 74 | 75 | def preprocessing(trainData): 76 | 77 | """ 78 | The method splits the train dataset into features and labels, which are X and y. 
79 | Then, it drops constant features, and chooses top 341 features. 80 | 81 | Parameters 82 | ---------- 83 | trainData: It is numpy array containing features and labels together 84 | :return: It returns train dataset with selected features, labels, and index values of chosen features 85 | """ 86 | 87 | X = trainData.iloc[:, 0:595] 88 | y = trainData.iloc[:, -1] 89 | 90 | # Drop zero columns and Select top K features with Filter Method 91 | X = X.drop(["X3", "X31", "X32", "X127", "X128", "X590"], axis=1) 92 | feature_indexes = filter_feature_selection(X, y, 341) 93 | 94 | return X.values[:, feature_indexes], y, feature_indexes 95 | 96 | 97 | def train_model(X_train, y_train): 98 | 99 | """ 100 | A learning model is trained by using train data set. 101 | Gradient Boosting classifier is preferred. 102 | 103 | Parameters 104 | ---------- 105 | X_train: train dataset with top k most important features 106 | y_train: labels of those samples 107 | """ 108 | 109 | gradBoost = GradientBoostingClassifier( 110 | n_estimators=6, 111 | learning_rate=1, 112 | max_features=2, 113 | max_depth=2, 114 | random_state=289) 115 | 116 | gradBoost.fit(X_train, y_train) 117 | return gradBoost 118 | 119 | 120 | def transform_test_samples(test_data, selected_features): 121 | 122 | """ 123 | This method applies preprocessing operations to test data. 124 | It at first eliminates constant features, and then choose top k most relevant features from the data set. 125 | 126 | Parameters 127 | ---------- 128 | test_data: test data set 129 | selected_features: index values of selected features 130 | """ 131 | X = test_data.iloc[:, 0:595] 132 | X = X.drop(["X3", "X31", "X32", "X127", "X128", "X590"], axis=1) 133 | return X.values[:, selected_features] 134 | 135 | 136 | def predict(model, X_test): 137 | 138 | """ 139 | This method predicts outputs for test samples by using learning model. 
140 | 141 | Parameters 142 | ---------- 143 | model: learning model for prediction 144 | X_test: testing dataset 145 | :return: predictions for testing set 146 | """ 147 | return model.predict(X_test) 148 | 149 | 150 | def write_output(predictions): 151 | 152 | submissionFile = [["ID", "Predicted"]] 153 | for i, prediction in enumerate(predictions): 154 | submissionFile.append([i + 1, int(prediction)]) 155 | 156 | # Write to file 157 | with open('Submission.csv', 'w') as csvFile: 158 | writer = csv.writer(csvFile) 159 | writer.writerows(submissionFile) 160 | 161 | 162 | # ******* Main Program ******* # 163 | 164 | Xpaths = ["train.csv", "test.csv"] 165 | trainData, testData = load_data(Xpaths) 166 | 167 | X_train, y_train, selected_features = preprocessing(trainData) 168 | gradientBoost = train_model(X_train, y_train) 169 | 170 | XtestNew = transform_test_samples(testData, selected_features) 171 | predictions = predict(gradientBoost, XtestNew) 172 | write_output(predictions) 173 | -------------------------------------------------------------------------------- /Team 10/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> PCA -> Decision Tree 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 10. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | from sklearn.decomposition import PCA 37 | from sklearn.tree import DecisionTreeClassifier 38 | 39 | np.random.seed(3) # anchoring randomization during training step 40 | 41 | 42 | def load_data(traname, testname): 43 | 44 | """ 45 | The method reads train and test data from dataset files. 46 | 47 | Parameters 48 | ---------- 49 | traname: directory of training dataset file 50 | testname: directory of testing dataset file 51 | 52 | """ 53 | train_data = pd.read_csv(traname) 54 | test_data = pd.read_csv(testname) 55 | return train_data, test_data 56 | 57 | 58 | def preprocessing(train_data, test_data): 59 | 60 | """ 61 | * The method at first eliminates constant features from both train and test data. 62 | * Then, it splits training data into features and labels. 63 | * Finally, the method performs pca on training and testing data sets to reduce the dimension and 64 | overcome curse of dimensionality problem. 
65 | 66 | Parameters 67 | ---------- 68 | train_data: training data set in data frame format 69 | test_data: testing data set in data frame format 70 | 71 | """ 72 | 73 | # constant feature elimination 74 | train_data = train_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 75 | train_data = np.asarray(train_data) 76 | 77 | test_data = test_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 78 | test_data = np.asarray(test_data) 79 | 80 | # training data is split into features and labels 81 | train_x = train_data[:, :train_data.shape[1] - 1] 82 | train_y = train_data[:, train_data.shape[1] - 1] 83 | train_y.shape = (np.size(train_y), 1) 84 | 85 | # principal component analysis 86 | pca = PCA(n_components=60) 87 | train_x_pca = pca.fit_transform(train_x) 88 | test_pca = pca.transform(test_data) 89 | 90 | return train_x_pca, train_y, test_pca 91 | 92 | 93 | def train_model(train_x, train_y): 94 | 95 | """ 96 | The method creates a learning model, and trains it by using training data. 97 | 98 | Parameters 99 | ---------- 100 | train_x: features of training data 101 | train_y: labels of training data 102 | 103 | """ 104 | 105 | clf = DecisionTreeClassifier(max_depth=3, max_features=11, random_state=9) 106 | clf.fit(train_x, train_y) 107 | return clf 108 | 109 | 110 | def predict(model, test_x): 111 | 112 | """ 113 | The method predicts labels for testing data samples by using trained learning model. 114 | 115 | Parameters 116 | ---------- 117 | model: trained learning model 118 | test_x: features of testing data 119 | 120 | """ 121 | 122 | predictions = model.predict(test_x) 123 | return predictions 124 | 125 | 126 | def write_output(ytest): 127 | yt = pd.DataFrame(ytest, dtype='int32') 128 | yt.columns = ["Predicted"] 129 | yt.index += 1 130 | yt.to_csv("./submission.csv", index_label="ID") 131 | return 132 | 133 | 134 | # ********** MAIN PROGRAM ********** # 135 | 136 | trainfile = "train.csv" 137 | testfile = "test.csv" 138 | 139 | train_data, test_data = load_data(trainfile, testfile) 140 | train_x, train_y, test_x = preprocessing(train_data, test_data) 141 | 142 | model = train_model(train_x, train_y) 143 | predictions = predict(model, test_x) 144 | write_output(predictions) 145 | -------------------------------------------------------------------------------- /Team 11/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> Adaptive Boosting (Base: Decision Tree) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 
23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 11. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.feature_selection import SelectKBest, chi2 38 | from sklearn.ensemble import AdaBoostClassifier 39 | 40 | 41 | def load_data(train_file, test_file): 42 | 43 | """ 44 | The method reads train and test data from their dataset files. 45 | Then, it splits train data into features and labels. 46 | 47 | Parameters 48 | ---------- 49 | train_file: directory of the file in which train data set is located 50 | test_file: directory of the file in which test data set is located 51 | 52 | """ 53 | 54 | data = np.asarray(pd.read_csv(train_file, header=0)) 55 | data_ts = np.asarray(pd.read_csv(test_file, header=0)) 56 | 57 | x_tra = data[:, :-1] 58 | y_tra = data[:, -1] 59 | 60 | return x_tra, y_tra, data_ts 61 | 62 | 63 | def preprocessing(x_tra, y_tra, x_tst): 64 | 65 | """ 66 | * The method computes chi square value for each feature, and 67 | chooses top 190 features with highest chi square value. 68 | * This is performed by using SelectKBest() feature selection algorithm. 69 | 70 | Parameters 71 | ---------- 72 | x_tra: features of training data 73 | y_tra: labels of training data 74 | x_tst: features of test data 75 | 76 | """ 77 | 78 | selector = SelectKBest(chi2, 190) 79 | selector.fit(x_tra, y_tra) 80 | x_tra_new = selector.transform(x_tra) 81 | x_tst_new = selector.transform(x_tst) 82 | 83 | return x_tra_new, x_tst_new 84 | 85 | 86 | def train_model(x_tra, y_tra): 87 | 88 | """ 89 | The method creates a learning model and trains it by using training data. 90 | 91 | Parameters 92 | ---------- 93 | x_tra: features of training data 94 | y_tra: labels of training data 95 | """ 96 | 97 | clf1 = AdaBoostClassifier(n_estimators=300, random_state=1) 98 | clf1.fit(x_tra, y_tra) 99 | return clf1 100 | 101 | 102 | def predict(x_tst, model): 103 | 104 | """ 105 | The method predicts labels for testing data samples by using trained learning model. 
106 | 107 | Parameters 108 | ---------- 109 | x_tst: features of testing data 110 | model: trained learning model 111 | """ 112 | 113 | predictions = model.predict(x_tst) 114 | return predictions 115 | 116 | 117 | def write_output(predictions): 118 | 119 | order = np.arange(1, 81) 120 | order.shape = (80, 1) 121 | predictions.shape = (80, 1) 122 | 123 | pred = np.concatenate((order, predictions), axis=1).astype(dtype=np.int) 124 | with open('submission.csv', 'w') as csvFile: 125 | writer = csv.writer(csvFile) 126 | writer.writerow(("ID", "Predicted")) 127 | writer.writerows(pred) 128 | csvFile.close() 129 | 130 | 131 | # ********** MAIN PROGRAM ********** # 132 | 133 | x_tra, y_tra, x_tst = load_data("train.csv", "test.csv") 134 | x_tra_new, x_tst_new = preprocessing(x_tra, y_tra, x_tst) 135 | 136 | model = train_model(x_tra_new, y_tra) 137 | predictions = predict(x_tst_new, model) 138 | write_output(predictions) 139 | -------------------------------------------------------------------------------- /Team 12/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> SVC 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 12. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | from matplotlib import pyplot as plt 37 | 38 | from sklearn import svm 39 | from sklearn.decomposition import PCA 40 | 41 | plt.style.use('ggplot') 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test data from data set files. 48 | Then, it splits train data set into features and labels. 49 | 50 | """ 51 | 52 | train_set = np.array(pd.read_csv("train.csv")) 53 | test_set = np.array(pd.read_csv("test.csv")) 54 | 55 | train_x = train_set[:, :train_set.shape[1] - 1] 56 | train_y = train_set[:, train_set.shape[1] - 1] 57 | 58 | return train_x, train_y, test_set 59 | 60 | 61 | def preprocessing(x_tra, x_tst): 62 | 63 | """ 64 | * The method reduces dimension of training and testing data set by using PCA. 
65 | 66 | Parameters 67 | ---------- 68 | x_tra: features of training data 69 | x_tst: features of testing data 70 | """ 71 | 72 | pca = PCA(n_components=5) 73 | x_tra_new = pca.fit_transform(x_tra) 74 | x_tst_new = pca.transform(x_tst) 75 | 76 | return x_tra_new, x_tst_new 77 | 78 | 79 | def train_model(x_train, y_train): 80 | 81 | """ 82 | The method creates a learning model and trains it by using training dataset 83 | 84 | Parameters 85 | ---------- 86 | x_train: features of training data 87 | y_train: labels of training data 88 | """ 89 | 90 | model = svm.SVC(kernel='linear') 91 | model.fit(x_train, y_train) 92 | return model 93 | 94 | 95 | def predict(model, x_tst): 96 | 97 | """ 98 | The method predicts the labels for testing data samples by using trained learning model. 99 | 100 | Parameters 101 | ---------- 102 | model: trained learning model 103 | x_tst: features of testing data 104 | 105 | """ 106 | 107 | predictions = model.predict(x_tst) 108 | return predictions 109 | 110 | 111 | def write_output(predictions): 112 | 113 | ind = [x for x in range(1, len(predictions) + 1)] 114 | 115 | temp = pd.DataFrame(data=ind, columns=['ID']) 116 | temp2 = pd.DataFrame(data=predictions, columns=['Predicted']) 117 | 118 | y_pred = pd.concat([temp, temp2], axis=1) 119 | y_pred = y_pred.astype({"ID": int, "Predicted": int}) 120 | y_pred.to_csv("submission.csv", index=False, float_format="%.0f") 121 | 122 | 123 | # ********** MAIN PROGRAM ********** # 124 | 125 | train_x, train_y, test_set = load_data() 126 | x_tra_new, x_tst_new = preprocessing(train_x, test_set) 127 | 128 | model = train_model(x_tra_new, train_y) 129 | predictions = predict(model, x_tst_new) 130 | write_output(predictions) 131 | -------------------------------------------------------------------------------- /Team 13/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Coorelation Based Elimination -> SelectKBest Algorithm -> Adaptive Boosting (Base: Decision Tree) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 13. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. 
""" 33 | 34 | import csv 35 | import pandas as pd 36 | import matplotlib.pyplot as plt 37 | 38 | from sklearn.ensemble import AdaBoostClassifier 39 | from sklearn.feature_selection import SelectKBest, chi2 40 | 41 | 42 | def load_data(): 43 | """ 44 | The method reads train and test data from data set files. 45 | Then, it splits train data into features and labels. 46 | """ 47 | 48 | train_data = pd.read_csv('train.csv') 49 | test_data = pd.read_csv('test.csv') 50 | 51 | x_train = train_data.iloc[:, 0: -1] 52 | y_train = train_data.iloc[:, -1] 53 | x_test = test_data.iloc[:, 0:] 54 | 55 | return x_train, y_train, x_test 56 | 57 | 58 | def preprocessing(x_train, y_train, x_test): 59 | 60 | """ 61 | The method at first chooses top 50 features with highest chi square value by using SelectKBest algorithm. 62 | Then, those features are sorted, and the features least correlated with labels are eliminated. 63 | 64 | Parameters 65 | ---------- 66 | x_train: features of training data 67 | y_train: labels of training data 68 | x_test: features of testing data 69 | """ 70 | 71 | selector = SelectKBest(score_func=chi2, k=50) 72 | 73 | fit = selector.fit(x_train, y_train) 74 | 75 | df_scores = pd.DataFrame(fit.scores_) 76 | df_columns = pd.DataFrame(x_train.columns) 77 | 78 | feature_scores = pd.concat([df_columns, df_scores], axis=1) 79 | feature_scores.columns = ['Specs', 'Score'] 80 | 81 | selected_features = feature_scores.sort_values(['Score'], ascending=0).iloc[0:50, :] 82 | 83 | new_x_train = x_train.loc[:, selected_features['Specs']] 84 | new_x_test = x_test.loc[:, selected_features['Specs']] 85 | 86 | plt.matshow(new_x_train.corr().abs()) 87 | plt.show() 88 | 89 | new_x_train = new_x_train.drop(['X584', 'X579', 'X404', 'X528', 'X318'], axis=1) 90 | new_x_test = new_x_test.drop(['X584', 'X579', 'X404', 'X528', 'X318'], axis=1) 91 | 92 | return new_x_train, new_x_test 93 | 94 | 95 | def train_model(x_train, y_train): 96 | 97 | """ 98 | The method creates a learning model and trains it by using training data. 99 | 100 | Parameters 101 | ---------- 102 | x_train: features of training data 103 | y_train: labels of training data 104 | """ 105 | 106 | model = AdaBoostClassifier(n_estimators=10) 107 | model.fit(x_train, y_train) 108 | return model 109 | 110 | 111 | def predict(model, x_test): 112 | 113 | """ 114 | The method predicts labels for testing data samples. 
115 | 116 | Parameters 117 | ---------- 118 | model: trained model 119 | x_test: features of testing data set 120 | """ 121 | 122 | y_pred = model.predict(x_test) 123 | return y_pred 124 | 125 | 126 | def write_output(y_pred): 127 | 128 | for i in range(0, len(y_pred) + 1): 129 | if i == 0: 130 | with open('submission.csv', 'w') as writeFile: 131 | writer = csv.writer(writeFile) 132 | writer.writerow(["ID", "Predicted"]) 133 | continue 134 | row = [i, int(y_pred[i - 1])] 135 | with open('submission.csv', 'a') as writeFile: 136 | writer = csv.writer(writeFile) 137 | writer.writerow(row) 138 | 139 | 140 | if __name__ == '__main__': 141 | 142 | x_train, y_train, x_test = load_data() 143 | x_train, x_test = preprocessing(x_train, y_train, x_test) 144 | 145 | model = train_model(x_train, y_train) 146 | predictions = predict(model, x_test) 147 | write_output(predictions) 148 | -------------------------------------------------------------------------------- /Team 14/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> Random Forest 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 14. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import pandas as pd 35 | from sklearn.decomposition import PCA 36 | from sklearn.ensemble import RandomForestClassifier 37 | 38 | 39 | def load_data(): 40 | 41 | """ 42 | The method reads train and test data from data set files. 43 | Then, it splits train data into features and labels. 44 | """ 45 | 46 | train = pd.read_csv("train.csv") 47 | test = pd.read_csv("test.csv") 48 | 49 | train_x = train.iloc[:, :-1] 50 | train_y = train.iloc[:, -1] 51 | return train_x, train_y, test 52 | 53 | 54 | def preprocessing(X, y, test): 55 | 56 | """ 57 | * The method at first eliminates nan-valued columns. 58 | (There is no nan-valued feature column, it is redundant operation.) 59 | 60 | * Then, first 100 samples from train data are chosen as training set. 61 | The remains 20 samples from train data are regarded as testing data. 62 | In other words, the learning model will be trained with first 100 training samples, not all of them. 63 | 64 | * Finally, it performs pca on training and testing data to reduce the dimension. 
65 | 66 | Parameters 67 | ---------- 68 | X: features of training data 69 | y: labels of training data 70 | test: features of testing data 71 | """ 72 | 73 | X = X.dropna(axis=1, how='all') 74 | test = test.dropna(axis=1, how='all') 75 | 76 | x_train = X[0:100][:] 77 | x_test = X[100:120][:] 78 | y_train = y.iloc[0:100] 79 | y_test = y.iloc[100:120] 80 | 81 | pca = PCA(n_components=80) 82 | x_train = pca.fit_transform(x_train) 83 | x_test = pca.transform(x_test) 84 | test = pca.transform(test) 85 | 86 | return x_train, x_test, y_train, y_test, test 87 | 88 | 89 | def train_model(x_train, y_train): 90 | """ 91 | The method creates a learning model and trains it by using first 100 training samples. 92 | 93 | Parameters 94 | ---------- 95 | x_train: features of first 100 training samples 96 | y_train: labels of first 100 training samples 97 | """ 98 | 99 | classifier = RandomForestClassifier(n_estimators=1000, random_state=50) 100 | classifier.fit(x_train, y_train) 101 | return classifier 102 | 103 | 104 | def predict(classifier, x_train_test, real_test): 105 | 106 | """ 107 | The method makes two predictions: 108 | - First prediction is for last 20 samples of training data 109 | - Second prediction is for testing data 110 | 111 | Parameters 112 | ---------- 113 | classifier: trained model 114 | x_train_test: features of last 20 samples of training data 115 | real_test: features of testing data 116 | """ 117 | 118 | predictions1 = classifier.predict(x_train_test) 119 | predictions2 = classifier.predict(real_test) 120 | return predictions1, predictions2 121 | 122 | 123 | def write_output(y_predTest): 124 | 125 | f = open('submission.csv', 'w', encoding="utf-8") 126 | tempstr = "ID,Predicted\n" 127 | f.write(tempstr) 128 | 129 | # print(y_predTest) 130 | 131 | i = 1 132 | for y in y_predTest: 133 | tempstr = str(i) 134 | tempstr += "," 135 | tempstr += str(y) 136 | tempstr += "\n" 137 | 138 | i += 1 139 | f.write(tempstr) 140 | f.close 141 | 142 | 143 | # ********** MAIN PROGRAM ********** # 144 | 145 | X, y, test = load_data() 146 | x_train, x_train_test, y_train, y_train_test, real_test = preprocessing(X, y, test) 147 | 148 | classifier = train_model(x_train, y_train) 149 | train_test_pred, real_test_pred = predict(classifier, x_train_test, real_test) 150 | write_output(real_test_pred) 151 | 152 | """ 153 | ********************************************************* 154 | The codes below are used to investigate the performance * 155 | of trained model on last 20 samples of training data. * 156 | ********************************************************* 157 | 158 | print(confusion_matrix(y_train_test, train_test_pred)) 159 | print(classification_report(y_train_test, train_test_pred)) 160 | print(accuracy_score(y_train_test, train_test_pred)) 161 | 162 | """ 163 | -------------------------------------------------------------------------------- /Team 15/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Isolation Forest -> RFECV Algorithm -> Bagging Classifier (Base: SVM) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 15. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import numpy as np 36 | from statistics import mean 37 | from sklearn.ensemble import BaggingClassifier, IsolationForest 38 | 39 | from sklearn.svm import SVC 40 | from sklearn.model_selection import ShuffleSplit, GridSearchCV 41 | from sklearn.feature_selection import RFECV 42 | 43 | np.random.seed(7) 44 | 45 | 46 | def load_data(): 47 | 48 | """ 49 | * Train data is read from "train.csv" file. 50 | * The whole content which is read from the file is decomposed into sub parts called "row". 51 | * Each sub part is collected in the list "contents". 52 | * Then, this list is converted into numpy array and split into features and labels of training samples. 53 | 54 | * Test data is read from "test.csv" file. 55 | * The whole content which is read from the file is decomposed into sub parts called "row". 56 | * Each sub part is collected in the list "contents". 57 | * Then, this list is converted into numpy array called "x_test". 58 | 59 | """ 60 | 61 | contents = [] 62 | with open('train.csv') as csv_file: 63 | csv_reader = csv.reader(csv_file, delimiter=',',) 64 | next(csv_reader) 65 | for row in csv_reader: 66 | contents += [row] 67 | 68 | cont_np = np.asarray(contents, dtype=np.float64) 69 | train_x = cont_np[:, :-1] 70 | train_y = cont_np[:, -1] 71 | 72 | contents = [] 73 | with open('test.csv') as csv_file: 74 | csv_reader = csv.reader(csv_file, delimiter=',',) 75 | next(csv_reader) 76 | for row in csv_reader: 77 | contents += [row] 78 | 79 | test_x = np.asarray(contents, dtype=np.float64) 80 | 81 | return train_x, train_y, test_x 82 | 83 | 84 | def outlier_detection(train_x, train_y): 85 | 86 | """ 87 | * The outliers are the samples which do not fit general data distribution trend. 88 | Those samples mislead the learning model during training phase. 89 | Hence, eliminating outliers is beneficial process. 90 | 91 | * This method detects outliers by using Isolation Forest model. 92 | * Then, it discards those samples from training data. 93 | 94 | Parameters 95 | ---------- 96 | train_x: features of training data 97 | train_y: labels of training data 98 | """ 99 | 100 | clf = IsolationForest(behaviour='new', random_state=1, contamination='auto') 101 | preds = clf.fit_predict(train_x) 102 | for i in range(0, len(preds)): 103 | if preds[i] == -1: 104 | train_x = np.delete(train_x, i, 0) 105 | train_y = np.delete(train_y, i, 0) 106 | 107 | return train_x, train_y 108 | 109 | 110 | def feature_selection(train_x, train_y, test_x): 111 | 112 | """ 113 | The method uses Recursive Feature Elimination Feature method to choose subset of features. 114 | It is a wrapper method of feature selection techniques. 
115 | The main purpose is to reduce the dimension of the samples to avoid curse of dimensionality. 116 | 117 | Parameters 118 | ---------- 119 | train_x: features of training data 120 | test_x: features of testing data 121 | """ 122 | 123 | svc = SVC(kernel="linear") 124 | rfecv = RFECV(estimator=svc, step=1, cv=ShuffleSplit(n_splits=10, test_size=0.25, random_state=0), 125 | n_jobs=-1, scoring='accuracy') 126 | 127 | reduced_train_x = rfecv.fit_transform(train_x, train_y) 128 | reduced_test_x = rfecv.transform(test_x) 129 | return reduced_train_x, reduced_test_x 130 | 131 | 132 | def svc_param_selection(X, y, cv): 133 | 134 | """ 135 | The method aims to find best parameters for svc learning model. 136 | To accomplish this, It uses Grid Search Cross Validation method. 137 | 138 | A set of values are determined for each parameter of svc learning model. 139 | Grid Search chooses one of those values which maximizes the classification accuracy. 140 | 141 | Parameters 142 | ---------- 143 | X: features of training data 144 | y: labels of training data 145 | cv: cross validation object which will be applied during parameter search operation 146 | """ 147 | 148 | # a set of values for svc parameters 149 | param_c = [10**i for i in range(-11, 3)] 150 | param_gamma = [10**i for i in range(-11, 4)] 151 | param_coef = [10**i for i in range(-4, 4)] 152 | max_iter = [1000000] 153 | tol = [1e-3] 154 | 155 | param_grid = {'C': param_c, 'gamma': param_gamma, 'coef0': param_coef, 'max_iter': max_iter, 'tol': tol} 156 | grid_search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=cv) 157 | grid_search.fit(X, y) 158 | return grid_search.best_params_, grid_search.best_score_, grid_search.cv_results_ 159 | 160 | 161 | def bagging_param_selection(X, y, cv, classifier): 162 | 163 | """ 164 | The method aims to find best parameters for bagging learning model. 165 | To accomplish this, It uses Grid Search Cross Validation method. 166 | For base model of bagging classifier, svc learning model with best parameters is chosen. 167 | 168 | A set of values are determined for each parameter of bagging learning model. 169 | Grid Search chooses one of those values which maximizes the classification accuracy. 
170 | 171 | Parameters 172 | ---------- 173 | X: features of training data 174 | y: labels of training data 175 | cv: cross validation object which will be applied during parameter search operation 176 | classifier: base classifier for bagging learning model 177 | 178 | """ 179 | 180 | max_samples = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8] 181 | max_features = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] 182 | 183 | param_grid = {'max_samples': max_samples, 'max_features': max_features} 184 | grid_search = GridSearchCV(BaggingClassifier(classifier), param_grid, cv=cv) 185 | grid_search.fit(X, y) 186 | return grid_search.best_params_, grid_search.best_score_, grid_search.cv_results_ 187 | 188 | 189 | def write_output(predictions): 190 | 191 | with open('submission.csv', mode='w') as output_file: 192 | 193 | output_writer = csv.writer(output_file, delimiter=',') 194 | output_writer.writerow(["ID", "Predicted"]) 195 | 196 | for i in range(1, len(predictions) + 1): 197 | output_writer.writerow([i, int(predictions[i - 1])]) 198 | 199 | 200 | # ********** MAIN PROGRAM ********** # 201 | 202 | 203 | train_x, train_y, test_x = load_data() 204 | new_train_x, new_train_y = outlier_detection(train_x, train_y) 205 | reduced_train_x, reduced_test_x = feature_selection(new_train_x, new_train_y, test_x) 206 | 207 | 208 | # Best parameters are determined for support vector classifier (svc). 209 | # Then, svc learning model is created with those best parameters. 210 | 211 | print("SVC") 212 | best_params, best_score, cv_results = svc_param_selection(reduced_train_x, new_train_y, 213 | ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)) 214 | print("Best Params: ", best_params) 215 | print("Best Score: ", best_score) 216 | print("Mean Test Score: ", mean(cv_results["mean_test_score"])) 217 | best_svc = SVC(kernel="linear", C=best_params["C"], gamma=best_params["gamma"], random_state=5) 218 | 219 | 220 | # Best parameters are determined for bagging classifier. 221 | # When determining those parameters, best svc is chosen as base model of bagging classifier. 222 | 223 | print("\nBagging + SVC") 224 | best_params, best_score, cv_results = bagging_param_selection(reduced_train_x, new_train_y, 225 | ShuffleSplit(n_splits=10, test_size=0.25, random_state=0), 226 | best_svc) 227 | print("Best Params: ", best_params) 228 | print("Best Score: ", best_score) 229 | print("Mean Test Score: ", mean(cv_results["mean_test_score"])) 230 | 231 | # Bagging classifier is created by using best parameters of svc and bagging 232 | 233 | clf = BaggingClassifier(best_svc, max_samples=best_params["max_samples"], max_features=best_params["max_features"], 234 | random_state=5) 235 | clf.fit(reduced_train_x, new_train_y) 236 | predictions = clf.predict(reduced_test_x) 237 | write_output(predictions) 238 | -------------------------------------------------------------------------------- /Team 16/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> PCA -> KNN 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 16. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.decomposition import PCA 38 | from sklearn.feature_selection import chi2 39 | from sklearn.feature_selection import SelectKBest 40 | from sklearn.neighbors import KNeighborsClassifier 41 | 42 | 43 | def load_data(paths): 44 | 45 | """ 46 | The method reads train and test data set from their data set files. 47 | The directory of the files are passed to the function via the parameter "paths". 48 | 49 | Parameters 50 | ---------- 51 | paths: it is an array collecting directory paths of train and test data. 52 | """ 53 | 54 | train_data = pd.read_csv(paths[0]) 55 | test_data = pd.read_csv(paths[1]) 56 | 57 | y_train = train_data["class"] 58 | x_train = train_data 59 | x_train.drop("class", axis=1, inplace=True) 60 | 61 | return x_train, y_train, test_data 62 | 63 | 64 | def preprocessing(x_train, y_train, x_test): 65 | 66 | """ 67 | The method performs two dimensionality reduction methods: SelectKBest and PCA 68 | By using SelectKBest algorithm, it chooses top 80 features with highest chi square value. 69 | Then, this method synthesizes 5 new features for training and testing data by using PCA. 70 | Totally, the data sets are reduced to 5-dimensional space. 71 | 72 | Parameters 73 | ---------- 74 | x_train: features of train data 75 | y_train: labels of train data 76 | x_test: features of test data 77 | """ 78 | 79 | selector = SelectKBest(chi2, k=80) 80 | selector.fit(x_train, y_train) 81 | 82 | x_train_reduced = selector.transform(x_train) 83 | x_test_reduced = selector.transform(x_test) 84 | 85 | pca = PCA(n_components=5) 86 | pca.fit(x_train_reduced) 87 | 88 | x_train_reduced = pca.transform(x_train_reduced) 89 | x_test_reduced = pca.transform(x_test_reduced) 90 | 91 | return x_train_reduced, x_test_reduced 92 | 93 | 94 | def train_model(x_train, y_train): 95 | 96 | """ 97 | The method trains KNN classification model by using training data set. 98 | Then, It returns trained learning model. 99 | 100 | Parameters 101 | ---------- 102 | x_train: features of train data 103 | y_train: labels of train data 104 | """ 105 | 106 | clf = KNeighborsClassifier(n_neighbors=7) 107 | clf.fit(x_train, y_train) 108 | return clf 109 | 110 | 111 | def predict(model, x_test): 112 | 113 | """ 114 | The method predicts labels for testing data samples. 
115 | 116 | Parameters 117 | ---------- 118 | model: trained learning model (KNN) 119 | x_test: features of testing data 120 | """ 121 | return model.predict(x_test) 122 | 123 | 124 | def write_output(myPredict): 125 | ID = np.arange(1, len(myPredict) + 1) 126 | predictID = list(zip(ID, myPredict)) 127 | predictID = pd.DataFrame(predictID, columns=['ID', 'Predicted']) 128 | predictID.to_csv('submission.csv', index=False) 129 | 130 | 131 | # ********** MAIN PROGRAM ********** # 132 | 133 | x_tra, y_tra, x_tst = load_data(['train.csv', 'test.csv']) 134 | x_tra_reduced, x_test_reduced = preprocessing(x_tra, y_tra, x_tst) 135 | 136 | my_model = train_model(x_tra_reduced, y_tra) 137 | my_predict = predict(my_model, x_test_reduced) 138 | write_output(my_predict) 139 | -------------------------------------------------------------------------------- /Team 17/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * LDA -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 17. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.svm import SVC 38 | from sklearn.naive_bayes import GaussianNB 39 | from sklearn.linear_model import SGDClassifier 40 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA 41 | from sklearn.ensemble import RandomForestClassifier as RFC, VotingClassifier, GradientBoostingClassifier as GBC 42 | 43 | 44 | def load_data(tra_file_path, test_file_path): 45 | 46 | """ 47 | The method reads train and test data from data set files in dataframe format. 48 | When reading them, it benefits from directory paths of the files passed by the parameters. 49 | Then, it converts them into numpy arrays and returns them. 50 | 51 | Parameters 52 | ---------- 53 | tra_file_path: directory of training data 54 | test_file_path: directory of testing data 55 | """ 56 | 57 | x_tra = pd.read_csv(tra_file_path, sep=',') 58 | x_tst = pd.read_csv(test_file_path, sep=',') 59 | return np.array(x_tra), np.array(x_tst) 60 | 61 | 62 | def dimension_reduction(train_x, train_y, test_x): 63 | 64 | """ 65 | Linear Discriminant Analysis (LDA) can be used to both reduce the dimension and train a learning model. 66 | In this point, it is used as dimension reducer. 
It is like supervised version of PCA. 67 | This method benefits from LDA to reduce the dimension train and test data. 68 | 69 | Parameters 70 | ---------- 71 | train_x: features of training data 72 | train_y: labels of training data 73 | test_x: features of testing data 74 | """ 75 | 76 | lda = LDA(n_components=1) 77 | lda.fit(train_x, train_y) 78 | 79 | train_x_reduced = lda.transform(train_x) 80 | test_x_reduced = lda.transform(test_x) 81 | return train_x_reduced, test_x_reduced 82 | 83 | 84 | def train_model(train, target): 85 | 86 | """ 87 | The method creates 5 different learning models. 88 | Then, these 5 models are combined in a voting classifier. 89 | That voting classifier is trained with training data. 90 | 91 | Parameters 92 | ---------- 93 | train: features of training data 94 | target: labels of training data 95 | """ 96 | 97 | clf1 = SVC(kernel='rbf', gamma=5, C=80, random_state=1) 98 | clf2 = GaussianNB() 99 | clf3 = GBC(n_estimators=20, learning_rate=1.0, random_state=1) 100 | clf4 = RFC(n_estimators=20, random_state=1) 101 | clf5 = SGDClassifier(tol=1e-3, random_state=1) 102 | 103 | ensemble = VotingClassifier(estimators=[('svm', clf1), ('nb', clf2), ('gbc', clf3), ('rfc', clf4), ('sgd', clf5)], 104 | voting='hard') 105 | ensemble.fit(train, target) 106 | return ensemble 107 | 108 | 109 | def predict(model, x_test): 110 | 111 | """ 112 | The method predicts labels for testing data samples. 113 | 114 | Parameters 115 | ---------- 116 | model: trained voting classifier 117 | x_test: features of testing data 118 | """ 119 | return model.predict(x_test) 120 | 121 | 122 | def write_output(predictions): 123 | 124 | results = np.zeros((len(predictions), 2)) 125 | for i in range(len(predictions)): 126 | results[i][0] = i + 1 127 | results[i][1] = predictions[i] 128 | 129 | results = results.astype(int) 130 | predictions = predictions.astype(int) 131 | 132 | results = pd.DataFrame(data=results, columns=['ID', 'Predicted']) 133 | results.to_csv('submission.csv', index=False, sep=',') 134 | 135 | 136 | def main(tra_file_path, test_file_path): 137 | 138 | train_data, test_features = load_data(tra_file_path, test_file_path) 139 | train_features = train_data[:, 0:len(train_data[0]) - 1] 140 | train_labels = train_data[:, len(train_data[0]) - 1] 141 | 142 | x_train, x_test = dimension_reduction(train_features, train_labels, test_features) 143 | clf = train_model(x_train, train_labels) 144 | preds = predict(clf, x_test) 145 | write_output(preds) 146 | 147 | 148 | main('train.csv', 'test.csv') 149 | -------------------------------------------------------------------------------- /Team 18/classifiers.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Bulut Karabıyık - Cankurt Kostur 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | from sklearn.svm import SVC 6 | from sklearn.tree import DecisionTreeClassifier 7 | from sklearn.neighbors import KNeighborsClassifier 8 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 9 | from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis 10 | 11 | param_grids = [ 12 | {'n_neighbors': np.arange(1, 30, 2), 13 | }, 14 | { 15 | 'kernel': ['rbf', 'linear'], 16 | 'C': np.arange(0.025, 5, 0.025)}, 17 | { 18 | 'max_depth': np.arange(3, 10)}, 19 | 20 | { 21 | 'tol': [1e-4] 22 | }, 23 | { 24 | 'tol': [1.0e-4] 25 | } 26 | ] 27 | 28 | classifiers = [ 29 | KNeighborsClassifier(), 30 | SVC(probability=True), 31 | DecisionTreeClassifier(), 32 | 
LinearDiscriminantAnalysis(), 33 | QuadraticDiscriminantAnalysis()] 34 | 35 | """ 36 | * The list "param_grids" contains dictionary objects. 37 | * Each dictionary can have one or more than one parameter name and corresponding value range. 38 | * The values in that range are tried in cross validation by GridSearch to determine which one is 39 | the best value for that parameter. 40 | 41 | * The list "classifiers" contains learning model objects. 42 | * For each learning model in that list, best parameter set is determined by GridSearch. 43 | * Then, those models and their best parameter sets are used to construct powerful voting classifier 44 | """ 45 | -------------------------------------------------------------------------------- /Team 18/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * SelectKBest Algorithm -> PCA -> Variance Thresholding -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 18. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import warnings 35 | import pandas as pd 36 | 37 | from sklearn.decomposition import PCA 38 | from sklearn.ensemble import VotingClassifier 39 | from sklearn.model_selection import GridSearchCV 40 | from sklearn.feature_selection import chi2, VarianceThreshold, SelectKBest 41 | 42 | from classifiers import * 43 | 44 | warnings.filterwarnings("ignore") 45 | warnings.filterwarnings(action='ignore', category=DeprecationWarning) 46 | warnings.filterwarnings(action='ignore', category=FutureWarning) 47 | 48 | 49 | def load_data(x_train_path, x_test_path): 50 | 51 | """ 52 | The method reads train and test data from dataset csv files. 53 | Then, train data is decomposed into features and labels. 54 | Finally, the method returns features and labels of train data and test data itself. 
55 | 56 | Parameters 57 | ---------- 58 | x_train_path: directory of training data set file 59 | x_test_path: directory of testing data set file 60 | """ 61 | 62 | all_data = pd.read_csv(x_train_path) 63 | x_test = pd.read_csv(x_test_path) 64 | 65 | y_train = all_data["class"] 66 | x_train = pd.read_csv(x_train_path) 67 | x_train.drop("class", axis=1, inplace=True) 68 | return x_train, y_train, x_test 69 | 70 | 71 | def preprocessing(x_train, y_train, x_test): 72 | 73 | """ 74 | * The method performs 3 dimensionality reduction methods: Variance Threshold - SelectKBest Algorithm - PCA. 75 | * It at first performs variance threshold, and eliminates all features whose variance values are lower than 0.001. 76 | * Then, it computes chi square value for each feature, and chooses top 10 features with highest chi square value. 77 | When doing this, it benefits from SelectKBest algorithm. 78 | * In final step, the method synthesizes 2 new features by using pca. 79 | 80 | Parameters 81 | ---------- 82 | x_train: features of training data 83 | y_train: labels of training data 84 | x_test: features of testing data 85 | """ 86 | 87 | selector_threshold = VarianceThreshold(0.001) 88 | selector_threshold.fit(x_train) 89 | 90 | x_train_new = selector_threshold.transform(x_train) 91 | x_test_new = selector_threshold.transform(x_test) 92 | 93 | selector = SelectKBest(chi2, k=10) 94 | selector.fit(x_train_new, y_train) 95 | 96 | x_train_new = selector.transform(x_train_new) 97 | x_test_new = selector.transform(x_test_new) 98 | 99 | pca = PCA(n_components=2, whiten=True) 100 | pca.fit(x_train_new) 101 | 102 | x_train_pca = pca.transform(x_train_new) 103 | x_test_pca = pca.transform(x_test_new) 104 | return x_train_pca, x_test_pca 105 | 106 | 107 | def train_model(x_train, y_train): 108 | 109 | """ 110 | * The method performs GridSearch operation to choose best parameter set for each classification model. 111 | * Then, the classification models with best parameter set and their names are stored in two different lists. 112 | * These two lists are used to combine these best-parametrized classification models in voting classifier. 113 | * That voting classifier is trained with training data and returned. 114 | 115 | Parameters 116 | ---------- 117 | x_train: features of training data 118 | y_train: labels of training data 119 | 120 | """ 121 | 122 | best_models = [] 123 | model_names = [] 124 | 125 | for i in range(len(classifiers)): 126 | 127 | model = classifiers[i] 128 | grid_search = GridSearchCV(model, param_grids[i], cv=5) 129 | 130 | grid_search.fit(x_train, y_train) 131 | best_models.append(grid_search.best_estimator_) 132 | 133 | name = model.__class__.__name__ 134 | model_names.append(name) 135 | 136 | estimators = [('knn', best_models[0]), ('SVC', best_models[1]), ('DT', best_models[2]), ('LA', best_models[3]), 137 | ('QA', best_models[4])] 138 | 139 | ensemble = VotingClassifier(estimators, voting='hard') 140 | ensemble.fit(x_train, y_train) 141 | return ensemble 142 | 143 | 144 | def predict(model, x_test): 145 | """ 146 | The method predicts labels for testing data samples by using trained learning model, that is voting classifier. 
147 | 148 | Parameters 149 | ---------- 150 | model: trained learning model 151 | x_test: features of testing data 152 | """ 153 | return model.predict(x_test) 154 | 155 | 156 | def write_output(prediction, file_name): 157 | ID = np.arange(1, len(prediction) + 1) 158 | Id_Predict = list(zip(ID, prediction)) 159 | Id_Predict = pd.DataFrame(Id_Predict, columns=['ID', 'Predicted']) 160 | Id_Predict.to_csv(file_name, index=False) 161 | 162 | 163 | # ********** MAIN PROGRAM ********** # 164 | 165 | x_train, y_train, x_test = load_data("train.csv", "test.csv") 166 | x_train_pca, x_test_pca = preprocessing(x_train, y_train, x_test) 167 | 168 | model = train_model(x_train_pca, y_train) 169 | predictions = predict(model, x_test_pca) 170 | write_output(predictions, "submission.csv") 171 | -------------------------------------------------------------------------------- /Team 19/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * PCA -> KNN 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 19. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | import matplotlib.pyplot as plt 37 | 38 | from sklearn.neighbors import KNeighborsClassifier 39 | from sklearn.decomposition import PCA 40 | from sklearn.preprocessing import MinMaxScaler 41 | 42 | 43 | def load_data(train_path, test_path): 44 | 45 | """ 46 | The train and test data are read from their source files in data-frame format. 47 | Then, they are converted into numpy array and returned. 48 | 49 | Parameters 50 | ---------- 51 | train_path: directory of train data set file 52 | test_path: directory of test data set file 53 | """ 54 | 55 | train_data = np.asarray(pd.read_csv(train_path, skiprows=0)) 56 | test_data = np.asarray(pd.read_csv(test_path, skiprows=0)) 57 | return train_data, test_data 58 | 59 | 60 | def preprocessing(train_x, test_x): 61 | 62 | """ 63 | The method synthesizes new 70 features by using pca. 64 | In this way, the dimension of train and test data is reduced. 
65 | 66 | Parameters 67 | ---------- 68 | train_x: features of train data 69 | test_x: features of test data 70 | 71 | """ 72 | 73 | pca = PCA(n_components=70) 74 | x_train_clean = pca.fit_transform(train_x) 75 | x_test_clean = pca.transform(test_x) 76 | return x_train_clean, x_test_clean 77 | 78 | 79 | def find_component(train_x): 80 | 81 | """ 82 | This method is used to determine the number of features that data samples are reduced to. 83 | 84 | Parameters 85 | ---------- 86 | train_x: features of train data 87 | """ 88 | 89 | scaler = MinMaxScaler(feature_range=[0, 1]) 90 | data_rescaled = scaler.fit_transform(train_x) 91 | pca = PCA().fit(data_rescaled) 92 | 93 | plt.Figure() 94 | plt.plot(np.cumsum(pca.explained_variance_ratio_)) 95 | plt.xlabel('Number of components') 96 | plt.ylabel('Variance') 97 | plt.show() 98 | 99 | 100 | def train_model(x_train, y_train): 101 | 102 | """ 103 | The method creates KNN learning model and trains it by using training data. 104 | It returns trained learning model. 105 | 106 | Parameters 107 | ---------- 108 | x_train: features of training data 109 | y_train: labels of training data 110 | 111 | """ 112 | 113 | classifier = KNeighborsClassifier(n_neighbors=5) 114 | classifier.fit(x_train, y_train) 115 | return classifier 116 | 117 | 118 | def predict(model, x_test): 119 | 120 | """ 121 | The method predicts labels for testing data samples by using trained learning model. 122 | 123 | Parameters 124 | ---------- 125 | model: trained learning model (KNN) 126 | x_test: features of testing data 127 | 128 | """ 129 | 130 | predictions = model.predict(x_test) 131 | predictions.shape = (np.size(predictions), 1) 132 | return predictions 133 | 134 | 135 | def write_output(predictions): 136 | 137 | temp = np.ones((80, 1), dtype=float) 138 | for i in range(0, 80): 139 | temp[i] = i + 1 140 | 141 | y_csv = np.concatenate((temp, predictions), 1) 142 | np.savetxt('submission.csv', y_csv, delimiter=",", comments='', fmt='%.0f', header="ID,Predicted") 143 | 144 | 145 | # ********** MAIN PROGRAM ********** # 146 | 147 | train_data, test_data = load_data("train.csv", "test.csv") 148 | x_train = train_data[:, :595] 149 | y_train = train_data[:, 595] 150 | 151 | # find_component(x_train) -> it shows why 70 components are needed for pca 152 | 153 | x_train_clean, x_test_clean = preprocessing(x_train, test_data) 154 | classifier = train_model(x_train_clean, y_train) 155 | predictions = predict(classifier, x_test_clean) 156 | write_output(predictions) 157 | -------------------------------------------------------------------------------- /Team 2/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * MRMR algorithm -> Gradient Boosting 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 
21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 2. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | from reduce_dim import * 35 | from sklearn.ensemble import GradientBoostingClassifier 36 | 37 | # ********** MAIN PROGRAM ********** # 38 | 39 | # test and training data are read from csv file by helper functions 40 | tra_data, tst_features = load_data("train.csv", "test.csv") 41 | tra_features, tra_labels = split_data(tra_data) 42 | 43 | # Dimension of the data is reduced in a preprocessing step. 44 | pcc_tra_features, pcc_tst_features = apply_MRMR(3, tra_data, tst_features) 45 | 46 | grd_boost_clf = GradientBoostingClassifier(n_estimators=34, learning_rate=1.27, 47 | max_features=3, max_depth=3, random_state=7) 48 | 49 | grd_boost_clf.fit(pcc_tra_features, tra_labels) # training step 50 | predictions = grd_boost_clf.predict(pcc_tst_features) # testing step 51 | 52 | # the results are written to output file. 53 | write_output(predictions, "Submission.txt") 54 | -------------------------------------------------------------------------------- /Team 2/read_write.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Göktuğ Güvercin - Uğur Tepecik - Ege Apak 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | import pandas as pd 6 | 7 | 8 | def load_data(directory1, directory2): 9 | 10 | """ 11 | It reads the content of training and testing data files. 12 | Then, it returns them as numpy arrays 13 | 14 | Parameters 15 | ---------- 16 | directory1: directory of training file 17 | directory2: directory of testing file 18 | """ 19 | 20 | tra_data = np.array(pd.read_csv(directory1)) 21 | tst_data = np.array(pd.read_csv(directory2)) 22 | return tra_data, tst_data 23 | 24 | 25 | def split_data(dataset): 26 | 27 | """ 28 | The "dataset" array is split into features and labels. 29 | CAUTION: "dataset" numpy array must contain label values at the last column. 30 | """ 31 | 32 | labels = dataset[:, len(dataset[0]) - 1] 33 | features = dataset[:, :len(dataset[0]) - 1] 34 | return features, labels 35 | 36 | 37 | def write_output(predictions, directory): 38 | 39 | size = len(predictions) 40 | indices = np.array([i for i in range(1, size + 1)]) 41 | 42 | indices.shape = (size, 1) 43 | predictions.shape = (size, 1) 44 | submission_array = np.concatenate((indices, predictions), 1) 45 | 46 | np.savetxt(directory, submission_array, delimiter=",", fmt="%d", header="ID,Predicted", comments="") 47 | -------------------------------------------------------------------------------- /Team 2/reduce_dim.py: -------------------------------------------------------------------------------- 1 | # Code Owners: Göktuğ Güvercin - Uğur Tepecik - Ege Apak 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | 5 | from read_write import * 6 | 7 | 8 | def create_dataframe(dataset): 9 | 10 | """ 11 | * This function takes "dataset" array as an argument, and creates a data frame object "df". 12 | This dataframe object is needed to compute correlation coefficient matrix. 
13 | * Finally, the method changes names of columns in data frame for easy use. 14 | 15 | Parameters 16 | ---------- 17 | dataset: It is 2D numpy array. Its last column should contain target scores (labels). 18 | :return: data frame object 19 | """ 20 | 21 | df = pd.DataFrame(dataset) 22 | df.columns = [i for i in range(len(df.columns) - 1)] + ["Labels"] 23 | return df 24 | 25 | 26 | def find_pcc_features(df, nof_features): 27 | 28 | """ 29 | * This method at first computes correlation matrix by using built-in corr() method. 30 | * Then, one of mutually-correlated features is eliminated. For example, feature A and feature B 31 | are well-correlated to each other. In this case, these features have similar behavior and 32 | similar effect on classification task, so we can discard one of them. We do not need to keep both of them. 33 | 34 | * After that, the row which stores the correlation between features and label is extracted. 35 | Absolute of that row is computed, because negative value only refers to inverse relation. 36 | We do not care forward or inverse relation; we care most related (max absolute values) correlations. 37 | * Then, correlation values are sorted in descending order. 38 | * Finally, indices of the features correlated to labels are stored in the list "pcc_features". 39 | 40 | * To summarize, MRMR (minimum redundancy maximum relevance) feature selection algorithm is performed. 41 | The features chosen by this algorithm is called pearson-correlation-coefficient (pcc) features. 42 | 43 | Parameters 44 | ---------- 45 | df: data frame object 46 | nof_features: the number of features that you reduce the dimension to 47 | :return: a list of indices of the features which are highly-correlated to labels 48 | """ 49 | 50 | pcc_features = [] 51 | corr_features = [] 52 | corr_matrix = df.corr() 53 | 54 | # determining similar (mutually-correlated) features 55 | for i in range(len(corr_matrix.columns) - 1): 56 | for j in range(i): 57 | if np.abs(corr_matrix[i][j]) > 0.75: 58 | corr_features.append(i) 59 | 60 | corr_matrix = corr_matrix.drop(corr_features) # eliminating similar features 61 | corr_label = corr_matrix["Labels"].abs() # taking absolute of correlations 62 | 63 | sorted_corr_label = corr_label.sort_values(na_position="last", ascending=False) 64 | feature_names = sorted_corr_label.index 65 | 66 | for i in range(1, nof_features + 1): # taking most informative n features 67 | pcc_features.append(feature_names[i]) 68 | 69 | return pcc_features 70 | 71 | 72 | def pcc_transform(features, indices): 73 | 74 | """ 75 | This method takes transpose of "features" array to access the features easily. 76 | Our dataset is in the dimension 120 x 595, which means that each row refers to one sample. 77 | I want to keep most correlated features (indices), and remove the other features. 78 | To accomplish this, each feature must be represented a list. 79 | In that list, all values which that feature took across all samples must be stored. 80 | This is only possible by taking transpose of "features" array 81 | 82 | Parameters 83 | ---------- 84 | features: two dimensional numpy array representing our samples without labels 85 | indices: index values of most correlated features 86 | :return: 87 | """ 88 | 89 | features_T = features.T 90 | features_T = features_T[indices] 91 | features = features_T.T 92 | return features 93 | 94 | 95 | def apply_MRMR(nof_features, tra_dataset, tst_features): 96 | 97 | """ 98 | This method creates a data frame to be able to compute correlation matrix. 
99 | Then, index values of most correlated n features are determined. 100 | By using these index values, training and testing features are reduced to lower dimension with pcc_transform(). 101 | 102 | Parameters 103 | ---------- 104 | nof_features: the number of features that you reduce the dimension to 105 | tra_dataset: Two dimensional numpy array (training set). Its last column refers to target scores (labels) 106 | tst_features: Two dimensional numpy array (testing set). It does not contain target scores (labels) 107 | :return: reduced training and testing set 108 | """ 109 | 110 | df = create_dataframe(tra_dataset) 111 | pcc_indices = find_pcc_features(df, nof_features) 112 | 113 | tra_features = tra_dataset[:, :len(tra_dataset[0]) - 1] 114 | pcc_tra_features = pcc_transform(tra_features, pcc_indices) 115 | pcc_tst_features = pcc_transform(tst_features, pcc_indices) 116 | 117 | return pcc_tra_features, pcc_tst_features 118 | 119 | -------------------------------------------------------------------------------- /Team 20/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> PCA -> Decision Tree Regressor 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 20. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | import pandas as pd 37 | 38 | from sklearn import tree 39 | from sklearn.decomposition import PCA 40 | 41 | warnings.simplefilter(action='ignore', category=FutureWarning) 42 | 43 | 44 | def load_data(filename): 45 | return pd.read_csv(filename) 46 | 47 | 48 | def preprocessing(train_set, test_set, nof_features): 49 | 50 | """ 51 | The method at first discards constant features, since those features have no effect on classification. 52 | Then, train data set is decomposed into features and labels. 53 | Finally, the method synthesizes 10 new features for train and test data by using pca.
54 | 55 | Parameters 56 | ---------- 57 | train_set: train data in data-frame format 58 | test_set: test data in data-frame format 59 | nof_features: number of features to be synthesized during pca 60 | """ 61 | 62 | train_set = train_set.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 63 | test_set = test_set.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 64 | 65 | train_features = train_set.iloc[:, 0:589].values 66 | train_labels = train_set.iloc[:, 589].values 67 | test_features = test_set.iloc[:, 0:589].values 68 | 69 | pca = PCA(n_components=nof_features) 70 | extracted_train_features = pca.fit_transform(train_features) 71 | extracted_test_features = pca.transform(test_features) 72 | 73 | return extracted_train_features, train_labels, extracted_test_features 74 | 75 | 76 | def train_model(train_x, train_y, test_x): 77 | 78 | """ 79 | The method creates a Decision Tree Regressor model, and trains it by using train data. 80 | Then, the method predicts labels for testing samples by using regressor model. 81 | Those labels are returned. 82 | 83 | Parameters 84 | ---------- 85 | train_x: features of train data 86 | train_y: labels of train data 87 | test_x: features of test data 88 | """ 89 | 90 | model = tree.DecisionTreeRegressor(random_state=7) 91 | model.fit(train_x, train_y) 92 | return model.predict(test_x) 93 | 94 | 95 | def write_output(predictions): 96 | 97 | with open('submission.csv', mode='w') as predicted_file: 98 | submission = csv.writer(predicted_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) 99 | submission.writerow(['ID', 'Predicted']) 100 | 101 | a = 1 102 | for i in predictions: 103 | submission.writerow([str(a), int(i)]) 104 | a = a + 1 105 | 106 | 107 | # ********** MAIN PROGRAM ********** # 108 | 109 | train_df = load_data('train.csv') 110 | test_df = load_data('test.csv') 111 | 112 | train_features, train_labels, test_features = preprocessing(train_df, test_df, 10) 113 | predictions = train_model(train_features, train_labels, test_features) 114 | write_output(predictions) 115 | -------------------------------------------------------------------------------- /Team 3/classifiers.py: -------------------------------------------------------------------------------- 1 | # Code Owner: Mümtaz Cem Eriş - İsmet Ata Yardımcı 2 | # Code Editor: Göktuğ Güvercin 3 | 4 | import numpy as np 5 | 6 | from sklearn.svm import SVC 7 | from sklearn.naive_bayes import GaussianNB 8 | from sklearn.tree import DecisionTreeClassifier 9 | from sklearn.neighbors import KNeighborsClassifier 10 | from sklearn.linear_model import LogisticRegression, RidgeClassifier 11 | 12 | import xgboost as xgb 13 | from sklearn.ensemble import * 14 | from mlxtend.classifier import EnsembleVoteClassifier 15 | from sklearn.model_selection import GridSearchCV 16 | 17 | # ************************************************************************************************ 18 | # This source file only provides classifiers for other source files. 
The main program is main.py * 19 | # ************************************************************************************************ 20 | 21 | seed = 1075 22 | np.random.seed(seed) 23 | 24 | # Classifiers 25 | rf = RandomForestClassifier() 26 | et = ExtraTreesClassifier() 27 | knn = KNeighborsClassifier() 28 | svc = SVC() 29 | rg = RidgeClassifier() 30 | lr = LogisticRegression(solver='lbfgs') 31 | gnb = GaussianNB() 32 | dt = DecisionTreeClassifier(max_depth=1) 33 | 34 | # Bagging Classifiers 35 | bagging_clf = BaggingClassifier(rf, max_samples=0.4, max_features=10, random_state=seed) 36 | 37 | # Boosting Classifiers 38 | ada_boost = AdaBoostClassifier() 39 | ada_boost_svc = AdaBoostClassifier(base_estimator=svc, algorithm='SAMME') 40 | grad_boost = GradientBoostingClassifier() 41 | xgb_boost = xgb.XGBClassifier() 42 | 43 | # Voting Classifiers 44 | vclf = VotingClassifier(estimators=[('ada_boost', ada_boost), ('grad_boost', grad_boost), 45 | ('xgb_boost', xgb_boost), ('BaggingWithRF', bagging_clf)], voting='hard') 46 | 47 | ev_clf = EnsembleVoteClassifier(clfs=[ada_boost_svc, grad_boost, xgb_boost], voting='hard') 48 | 49 | # Grid Search 50 | params = {'gradientboostingclassifier__n_estimators': [10, 200], 51 | 'xgbclassifier__n_estimators': [10, 200]} 52 | 53 | grid = GridSearchCV(estimator=ev_clf, param_grid=params, cv=5) 54 | -------------------------------------------------------------------------------- /Team 3/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Voting Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 3. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | 37 | from classifiers import * 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | from sklearn.model_selection import cross_val_score 41 | 42 | 43 | def load_data(tra_file, tst_file): 44 | 45 | """ 46 | This method reads the files in which training and testing data samples are located. 47 | Then, it returns training and testing data set in numpy array format. 
48 | 49 | Parameters 50 | ---------- 51 | tra_file: directory name of training dataset file 52 | tst_file: directory name of testing dataset file 53 | """ 54 | 55 | x_tra = np.genfromtxt(tra_file, delimiter=',') 56 | x_tst = np.genfromtxt(tst_file, delimiter=',') 57 | 58 | # delete first rows 59 | x_tra = np.delete(x_tra, 0, 0) 60 | x_tst = np.delete(x_tst, 0, 0) 61 | y_tra = x_tra[:, -1] 62 | 63 | # delete class row 64 | x_tra = np.delete(x_tra, -1, 1) 65 | return x_tra, x_tst, y_tra 66 | 67 | 68 | def preprocessing(x_tra, x_tst): 69 | 70 | """ 71 | Training and testing data set are at first scaled by using standard scaler method. 72 | Then, these data sets are reduced to lower dimension by using feature extraction pca technique. 73 | When performing these two operations, testing data set is not included in fit operation of scaler and pca. 74 | 75 | Parameters 76 | ---------- 77 | x_tra: training data set (they should not contain label values) 78 | x_tst: testing data set 79 | :return: reduced and scaled training and testing sets 80 | """ 81 | 82 | scaler = StandardScaler() 83 | scaler.fit(x_tra) 84 | 85 | x_tra_scaled = scaler.transform(x_tra) 86 | x_tst_scaled = scaler.transform(x_tst) 87 | 88 | pca = PCA(.95) 89 | pca.fit(x_tra_scaled) 90 | x_tra_reduced = pca.transform(x_tra_scaled) 91 | x_tst_reduced = pca.transform(x_tst_scaled) 92 | 93 | return x_tra_reduced, x_tst_reduced 94 | 95 | 96 | def train_model(x_tra_reduced, y_tra, model): 97 | 98 | model.fit(x_tra_reduced, y_tra) 99 | return model 100 | 101 | 102 | def predict(model, x_tst_reduced): 103 | 104 | predictions = model.predict(x_tst_reduced) 105 | return predictions 106 | 107 | 108 | def write_output(prediction, filename): 109 | 110 | with open(filename, 'w', newline='') as csvfile: 111 | filewriter = csv.writer(csvfile, delimiter=',') 112 | filewriter.writerow(["ID", "Predicted"]) 113 | id = 1 114 | 115 | for row in prediction: 116 | filewriter.writerow([id, row.astype(int)]) 117 | id += 1 118 | 119 | 120 | # RandomForestClassifier gives lots of warnings 121 | # therefore this line is added below 122 | warnings.filterwarnings("ignore") 123 | 124 | # Load data 125 | Xtra, Xtst, Ytra = load_data('train.csv', 'test.csv') 126 | Xtra_reduced, Xtst_reduced = preprocessing(Xtra, Xtst) 127 | 128 | 129 | # *************** SECTION 1 *************** # 130 | 131 | # Each of models would be trained and their cross validation score would be printed in this section. 
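# Note: predictions are also generated with each base model in the loop below, but they are not saved;
# only the "Ensemble" model's predictions are written to a csv file ("Ensemble.csv") in Section 2.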
132 | 133 | print("Classifiers cross-validation") 134 | 135 | labels_clf = ['RandomForest', 'ExtraTrees', 'KNeighbors', 'SVC', 'Ridge', 'LogisticRegression', 'GaussianNB', 136 | 'DecisionTree'] 137 | 138 | for model, label in zip([rf, et, knn, svc, rg, lr, gnb, dt], labels_clf): 139 | 140 | scores = cross_val_score(model, Xtra_reduced, Ytra, cv=5, scoring='accuracy') 141 | trained_model = train_model(Xtra_reduced, Ytra, model) 142 | prediction = predict(trained_model, Xtst_reduced) 143 | 144 | print("Mean: {0:.3f}, std: {1:.3f} [{2} is used.]".format(scores.mean(), scores.std(), label)) 145 | 146 | print("-----------------------------------\n") 147 | 148 | # *************** SECTION 2 *************** # 149 | 150 | # Ensemble models would be trained and their cross validation scores would be printed in this section 151 | 152 | print("Bagging, Boosting and GridSearchCV cross-validation") 153 | 154 | labels = ['Ada Boost', 'Ada BoostSVC', 'Grad Boost', 'XG Boost', 'Ensemble', 'Voting', 155 | 'BaggingWithRF', 'Grid'] 156 | 157 | for model, label in zip([ada_boost, ada_boost_svc, grad_boost, xgb_boost, ev_clf, vclf, bagging_clf, grid], labels): 158 | 159 | scores = cross_val_score(model, Xtra_reduced, Ytra, cv=5, scoring='accuracy') 160 | trained_model = train_model(Xtra_reduced, Ytra, model) 161 | prediction = predict(trained_model, Xtst_reduced) 162 | 163 | if label == "Ensemble": 164 | write_output(prediction, label + ".csv") 165 | 166 | if label == "Ensemble": 167 | print("Mean: {0:.3f}, std: {1:.3f} [*{2} is used. (Chosen model)]".format(scores.mean(), scores.std(), label)) 168 | 169 | else: 170 | print("Mean: {0:.3f}, std: {1:.3f} [{2} is used.]".format(scores.mean(), scores.std(), label)) 171 | -------------------------------------------------------------------------------- /Team 4/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Decision Tree Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 4. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. 
""" 33 | 34 | import csv 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.decomposition import PCA 38 | from sklearn.preprocessing import StandardScaler 39 | from sklearn.tree import DecisionTreeClassifier 40 | 41 | np.random.seed(1) # Anchoring randomization during training step 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test file to obtain train and test data. 48 | Then, it splits train data into two parts which are features and labels. 49 | :return: features and labels of train data and test data are returned. 50 | 51 | """ 52 | 53 | train_data = pd.read_csv('train.csv') 54 | test_data = pd.read_csv('test.csv') 55 | 56 | all_train = train_data.iloc[:, :].values 57 | y_train = train_data.iloc[:, 595].values # labels 58 | x_train = train_data.iloc[:, 0:595].values # features 59 | 60 | x_test = test_data.iloc[:, 0:595].values 61 | 62 | return x_train, y_train, x_test 63 | 64 | 65 | def standardization(x_train, x_test): 66 | 67 | """ 68 | The method performs standard scaling on training and testing data. 69 | When doing this, only train data is included in training phase (fitting operation). 70 | 71 | Parameters 72 | ---------- 73 | x_train: features of train data 74 | x_test: features of test data 75 | 76 | """ 77 | 78 | sc = StandardScaler() 79 | x_train_std = sc.fit_transform(x_train) 80 | x_test_std = sc.transform(x_test) 81 | 82 | return x_train_std, x_test_std 83 | 84 | 85 | def dim_red(x_train, x_test): 86 | 87 | """ 88 | The method reduces the dimension of training and testing data by using PCA. 89 | When doing this, only training data is included in training phase (fitting operation). 90 | 91 | Parameters 92 | ---------- 93 | x_train: features of scaled training data 94 | x_test: features of scaled testing data 95 | 96 | """ 97 | 98 | pca = PCA(n_components=15) 99 | x_train_red = pca.fit_transform(x_train) 100 | x_test_red = pca.transform(x_test) 101 | 102 | return x_train_red, x_test_red 103 | 104 | 105 | def decision_tree(criterion_name, x_train, y_train, x_test): 106 | 107 | """ 108 | The method creates a decision tree learning model, trains it by using train data and generate predictions for 109 | testing data. 110 | Then, predicted labels are returned. 
111 | 112 | Parameters 113 | ---------- 114 | criterion_name: it specifies which function is used to measure quality of a split 115 | x_train: features of scaled and reduced training set 116 | y_train: labels of scaled and reduced training set 117 | x_test: features of scaled and reduced testing set 118 | 119 | """ 120 | 121 | dtc = DecisionTreeClassifier(criterion=criterion_name, max_depth=4, 122 | max_features=3, random_state=3, max_leaf_nodes=2) 123 | dtc.fit(x_train, y_train) 124 | y_pred = dtc.predict(x_test) 125 | 126 | return y_pred 127 | 128 | 129 | def write_output(y_pred): 130 | 131 | fields = ['ID', 'Predicted'] 132 | filename = "submission.csv" 133 | rows = list() 134 | with open(filename, 'w', newline="") as csvfile: 135 | csvwriter = csv.writer(csvfile) 136 | csvwriter.writerow(fields) 137 | for i in range(len(y_pred)): 138 | rows.append([i+1, y_pred[i]]) 139 | csvwriter.writerows(rows) 140 | 141 | 142 | # ********** MAIN PROGRAM ********** # 143 | 144 | x_train, y_train, x_test = load_data() 145 | x_train_std, x_test_std = standardization(x_train, x_test) 146 | x_train_red, x_test_red = dim_red(x_train_std, x_test_std) 147 | 148 | y_pred = decision_tree('entropy', x_train_red, y_train, x_test_red) 149 | write_output(y_pred) 150 | -------------------------------------------------------------------------------- /Team 5/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Min-max Scaling -> Elimination of Highly Correlated Features -> SelectKBest algorithm -> XGBoost 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 5. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from numpy import genfromtxt 38 | from xgboost import XGBClassifier 39 | from sklearn.preprocessing import MinMaxScaler 40 | from sklearn.feature_selection import SelectKBest, chi2 41 | 42 | 43 | def load_data(train_path, test_path): 44 | 45 | """ 46 | The method reads train and test data from their files. 47 | Then, it deletes feature names by using numpy delete() method. 48 | Finally, it splits train data into two parts: train features (train_x) and train labels (train_y).
49 | 50 | Parameters 51 | ---------- 52 | train_path: directory name of the file in which training data exists 53 | test_path: directory name of the file in which testing data exists 54 | 55 | """ 56 | 57 | train = genfromtxt(train_path, delimiter=',') 58 | train = np.delete(train, 0, 0) 59 | test = genfromtxt(test_path, delimiter=',') 60 | test = np.delete(test, 0, 0) 61 | 62 | train_x = train[:, 0:595] 63 | train_y = train[:, 595] 64 | 65 | return train_x, train_y, test 66 | 67 | 68 | def preprocessing(train_x, train_y, test): 69 | 70 | """ 71 | * The method at first performs min-max scaling on train and test data. 72 | * Then, it computes correlation matrix to find out which features are mostly-correlated to each other. 73 | Two features which are very correlated to each other have almost same impact on classification and labels. 74 | Hence, one of these two features is discarded. We do not need to use both of them. 75 | * Finally, selectKBest() algorithm is used with chi square value to choose top 100 features. 76 | * Totally, two feature selection methods are used to reduce the dimension of training and testing data. 77 | 78 | Parameters 79 | ---------- 80 | train_x: features of train data samples 81 | train_y: labels of train data samples 82 | test: features of test data samples 83 | 84 | """ 85 | 86 | scaler = MinMaxScaler() 87 | scaler.fit(train_x) 88 | scaled_test = scaler.transform(test) 89 | scaled_train_x = scaler.transform(train_x) 90 | 91 | df_test = pd.DataFrame(data=scaled_test) 92 | df_train_x = pd.DataFrame(data=scaled_train_x) 93 | 94 | corr_matrix = df_train_x.corr().abs() 95 | upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)) 96 | drop = [column for column in upper.columns if any(upper[column] > 0.9)] # finding highly-correlated features 97 | 98 | df_train_x = df_train_x.drop(df_train_x.columns[drop], axis=1) # discarding those features 99 | df_test = df_test.drop(df_test.columns[drop], axis=1) 100 | 101 | # Select features 102 | selector = SelectKBest(chi2, k=100) 103 | selector.fit(df_train_x, train_y) 104 | reduced_train_x = selector.transform(df_train_x) 105 | reduced_test = selector.transform(df_test) 106 | 107 | return reduced_train_x, reduced_test 108 | 109 | 110 | def train_model(train_x, train_y): 111 | 112 | """ 113 | The method trains XGB Classifier model by using training samples and their ground truth labels. 114 | 115 | Parameters 116 | ---------- 117 | train_x: features of training samples 118 | train_y: labels of training samples 119 | 120 | """ 121 | 122 | s_clf = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=3, gamma=0.2, 123 | subsample=0.6, colsample_bytree=1.0, objective='binary:logistic', nthread=4, 124 | scale_pos_weight=1, seed=27, silent=1) 125 | 126 | s_clf.fit(train_x, train_y) 127 | return s_clf 128 | 129 | 130 | def predict(model, test): 131 | 132 | """ 133 | The method make predictions by using passed learning model "model" and testing data "scaled_test". 
134 | 135 | Parameters 136 | ---------- 137 | model: Learning model object trained by training data 138 | scaled_test: features of testing samples 139 | 140 | """ 141 | 142 | predictions = model.predict(test) 143 | return predictions 144 | 145 | 146 | def write_output(prediction): 147 | # Write output to csv file 148 | f = open("submission.csv", "w+") 149 | f.write("ID,Predicted\n") 150 | for i, item in enumerate(prediction): 151 | f.write(str(1+i) + "," + str(int(item)) + "\n") 152 | f.close() 153 | 154 | 155 | # ********** MAIN PROGRAM ********** # 156 | 157 | train_x, train_y, test = load_data('train.csv', 'test.csv') 158 | scaled_train_x, scaled_test = preprocessing(train_x, train_y, test) 159 | 160 | model = train_model(scaled_train_x, train_y) 161 | predictions = predict(model, scaled_test) 162 | write_output(predictions) 163 | -------------------------------------------------------------------------------- /Team 6/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> SVM 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are read by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 6. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | from sklearn.svm import SVC 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | 41 | np.random.seed(5) # anchoring randomization during training step 42 | 43 | 44 | def load_data(): 45 | 46 | train_data = pd.read_csv('train.csv', header=0) 47 | test_data = pd.read_csv('test.csv', header=0) 48 | return train_data, test_data 49 | 50 | 51 | def preprocessing(train_data, test_data): 52 | 53 | """ 54 | The method splits features and labels of training data into two separate sections. 55 | Then, it applies standard scaling operation on training and testing sets.
56 | 57 | Parameters: 58 | ----------- 59 | train_data: It is numpy array containing features and labels together 60 | test_data: It is numpy array containing only features """ 61 | 62 | x_train = train_data.iloc[:, 0:595].values 63 | y_train = train_data.iloc[:, 595:].values 64 | x_test = test_data.iloc[:].values 65 | 66 | sc = StandardScaler() 67 | scaled_x_train = sc.fit_transform(x_train) 68 | scaled_x_test = sc.transform(x_test) 69 | 70 | return scaled_x_train, scaled_x_test, y_train 71 | 72 | 73 | def dimension_reduction(scaled_x_train, scaled_x_test): 74 | 75 | """ 76 | The method performs pca technique to reduce the dimension of feature space in which observations are situated. 77 | 78 | Parameters: 79 | ----------- 80 | scaled_x_train: scaled features of training data 81 | scaled_x_test: scaled features of testing data """ 82 | 83 | # reducing training and testing samples into lower dimension 84 | pca = PCA(n_components=2) 85 | reduced_x_train = pca.fit_transform(scaled_x_train) 86 | reduced_x_test = pca.transform(scaled_x_test) 87 | 88 | return reduced_x_train, reduced_x_test 89 | 90 | 91 | def train_model(reduced_x_train, y_train): 92 | 93 | """ 94 | The method trains svm learning model by using training features cross its ground truth labels. 95 | 96 | Parameters: 97 | ----------- 98 | reduced_x_train: features of training data lying on principal component axes 99 | y_train: ground truth labels of those features """ 100 | 101 | svc = SVC(kernel='poly', gamma=0.5, C=1, random_state=3) 102 | svc.fit(reduced_x_train, y_train.ravel()) 103 | return svc 104 | 105 | 106 | def predict(svc, reduced_x_test): 107 | 108 | """ 109 | The method predicts outputs for test samples by using learning model. 110 | 111 | Parameters: 112 | ----------- 113 | svc: trained learning model 114 | reduced_x_test: features of testing set lying on principal component axes """ 115 | 116 | predictions = svc.predict(reduced_x_test) 117 | return predictions 118 | 119 | 120 | def write_output(predictions): 121 | 122 | # writing predictions to the file 123 | f = open("submission.csv", "w") 124 | f.write("ID,Predicted\n") 125 | for i in range(0, len(predictions)): 126 | f.write(str(i+1) + "," + str(predictions[i]) + "\n") 127 | 128 | 129 | # ******* Main Program ******* # 130 | 131 | train_data, test_data = load_data() 132 | 133 | scaled_x_train, scaled_x_test, y_train, = preprocessing(train_data, test_data) 134 | reduced_x_train, reduced_x_test = dimension_reduction(scaled_x_train, scaled_x_test) 135 | 136 | svc_model = train_model(reduced_x_train, y_train) 137 | predictions = predict(svc_model, reduced_x_test) 138 | 139 | write_output(predictions) 140 | -------------------------------------------------------------------------------- /Team 7/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Constant Feature Elimination -> SelectKBest Algorithm -> PCA -> Decision Tree Classifier 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 7. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | 35 | import numpy as np 36 | import pandas as pd 37 | from sklearn.tree import DecisionTreeClassifier 38 | from sklearn.decomposition import PCA 39 | from sklearn.feature_selection import SelectKBest 40 | from sklearn.feature_selection import chi2 41 | 42 | 43 | def load_data(tra_name, test_name): 44 | 45 | """ 46 | The method reads the training and testing data from their csv files. 47 | 48 | Parameters 49 | ---------- 50 | tra_name: directory of the training dataset file 51 | test_name: directory of testing dataset file 52 | 53 | """ 54 | 55 | train_data = pd.read_csv(tra_name) 56 | test_data = pd.read_csv(test_name) 57 | return train_data, test_data 58 | 59 | 60 | def preprocessing(train_data, test_data): 61 | 62 | """ 63 | The method at first eliminates constant features. 64 | Then, it chooses top 100 features by evaluating chi square values of each feature. 65 | Finally, these 100 features are reduced to 80 features by using principal component analysis. 
66 | 67 | Parameters 68 | ---------- 69 | train_data: training dataset containing features and labels 70 | test_data: testing dataset containing only features 71 | 72 | """ 73 | 74 | train_data = train_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 75 | train_data = np.asarray(train_data) 76 | 77 | train_x = train_data[:, :train_data.shape[1] - 1] 78 | train_y = train_data[:, train_data.shape[1] - 1] 79 | train_y.shape = (np.size(train_y), 1) 80 | 81 | test_data = test_data.drop(['X3', 'X31', 'X32', 'X127', 'X128', 'X590'], axis=1) 82 | test_data = np.asarray(test_data) 83 | 84 | selector = SelectKBest(score_func=chi2, k=100) 85 | selector.fit(train_x, train_y) 86 | 87 | train_features = selector.transform(train_x) 88 | test_features = selector.transform(test_data) 89 | 90 | pca = PCA(n_components=80) 91 | x_tra_pca = pca.fit_transform(train_features) 92 | x_test_pca = pca.transform(test_features) 93 | 94 | return x_tra_pca, train_y, x_test_pca 95 | 96 | # ********** MAIN PROGRAM ********** # 97 | 98 | 99 | train_data, test_data = load_data("train.csv", "test.csv") 100 | x_tra_pca, train_y, x_test_pca = preprocessing(train_data, test_data) 101 | 102 | 103 | clf = DecisionTreeClassifier(random_state=25) 104 | clf.fit(x_tra_pca, train_y) 105 | predictions = clf.predict(x_test_pca) 106 | 107 | 108 | yt = pd.DataFrame(predictions, dtype='int32') 109 | yt.columns = ["Predicted"] 110 | yt.index += 1 111 | yt.to_csv("./submission.csv", index_label="ID") 112 | -------------------------------------------------------------------------------- /Team 8/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Standard Scaling -> PCA -> Logistic Regression 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 8. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import warnings 36 | import numpy as np 37 | 38 | from sklearn.decomposition import PCA 39 | from sklearn.preprocessing import StandardScaler 40 | from sklearn.linear_model import LogisticRegression 41 | 42 | warnings.filterwarnings("ignore", category=FutureWarning) 43 | 44 | 45 | def load_data(x_paths): 46 | 47 | """ 48 | The method reads train and test data from data set files. 49 | Then, it splits train data into two pieces: features and labels. 
50 | 51 | Parameters 52 | ---------- 53 | x_paths: directory of train and test data files 54 | 55 | """ 56 | 57 | data = np.matrix(np.genfromtxt(x_paths+'train.csv', delimiter=',')) 58 | x_train = np.asarray(data[1:, 0:595]) 59 | y_train = np.asarray(data[1:, 595]) 60 | 61 | data2 = np.matrix(np.genfromtxt(x_paths+'test.csv', delimiter=',')) 62 | x_test = np.asarray(data2[1:, 0:595]) 63 | 64 | return x_train, y_train, x_test 65 | 66 | 67 | def preprocessing(x_train, x_test): 68 | 69 | """ 70 | The method performs standard scaling on training and testing data. 71 | Then, it reduces the dimension of training and testing data by using pca. 72 | 73 | Parameters 74 | ---------- 75 | x_train: features of training data 76 | x_test: features of testing data 77 | 78 | """ 79 | 80 | sc = StandardScaler() 81 | x_train = sc.fit_transform(x_train) 82 | x_test = sc.transform(x_test) 83 | 84 | pca = PCA(n_components=2) 85 | x_train = pca.fit_transform(x_train) 86 | x_test = pca.transform(x_test) 87 | 88 | return x_train, x_test 89 | 90 | 91 | def train_model(x_train, y_train): 92 | 93 | """ 94 | The method creates a logistic regression classifier, and trains it with training data. 95 | 96 | Parameters 97 | ---------- 98 | x_train: features of training data 99 | y_train: labels of training data 100 | 101 | """ 102 | 103 | classifier = LogisticRegression(random_state=0) 104 | classifier.fit(x_train, np.ravel(y_train, order='C')) 105 | return classifier 106 | 107 | 108 | def predict(x_test, model): 109 | 110 | """ 111 | The method predicts labels for testing data by using model object. 112 | 113 | Parameters 114 | ---------- 115 | x_test: features of testing data 116 | model: trained learning model 117 | 118 | """ 119 | 120 | y_pred = model.predict(x_test) 121 | return y_pred 122 | 123 | 124 | def write_output(y_pred): 125 | 126 | ID = 1 127 | lines = [["ID", "Predicted"]] 128 | 129 | for i in y_pred: 130 | # Reobtaining the ID is simple since the samples remain in order 131 | temp = [ID, int(i)] 132 | ID += 1 133 | lines.append(temp) 134 | 135 | # Write the output in a file 136 | with open('submission.csv', 'w') as writeFile: 137 | writer = csv.writer(writeFile) 138 | writer.writerows(lines) 139 | writeFile.close() 140 | 141 | # ********** MAIN PROGRAM ********** # 142 | 143 | 144 | Data = load_data("") 145 | x_train, y_train, x_test = load_data("") 146 | x_train_pca, x_test_pca = preprocessing(x_train, x_test) 147 | 148 | model = train_model(x_train_pca, y_train) 149 | predictions = predict(x_test_pca, model) 150 | write_output(predictions) 151 | -------------------------------------------------------------------------------- /Team 9/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Target Problem: 3 | --------------- 4 | * A classifier for the diagnosis of Autism Spectrum Disorder (ASD) 5 | 6 | Proposed Solution (Machine Learning Pipeline): 7 | ---------------------------------------------- 8 | * Min-Max Scaling -> PCA -> Bagging Classifier (Base: KNN) 9 | 10 | Input to Proposed Solution: 11 | --------------------------- 12 | * Directories of training and testing data in csv file format 13 | * These two types of data should be stored in n x m pattern in csv file format. 
14 | 15 | Typical Example: 16 | ---------------- 17 | n x m samples in training csv file (n number of samples, m - 1 number of features, ground truth labels at last column) 18 | k x s samples in testing csv file (k number of samples, s number of features) 19 | 20 | * These data set files are ready by load_data() function. 21 | * For comprehensive information about input format, please check the section 22 | "Data Sets and Usage Format of Source Codes" in README.md file on github. 23 | 24 | Output of Proposed Solution: 25 | ---------------------------- 26 | * Predictions generated by learning model for testing set 27 | * They are stored in "submission.csv" file. 28 | 29 | Code Owner: 30 | ----------- 31 | * Copyright © Team 9. All rights reserved. 32 | * Copyright © Istanbul Technical University, Learning From Data Spring 2019. All rights reserved. """ 33 | 34 | import csv 35 | import pandas as pd 36 | 37 | from sklearn.preprocessing import MinMaxScaler 38 | from sklearn.decomposition import PCA 39 | 40 | from sklearn.ensemble import BaggingClassifier 41 | from sklearn.neighbors import KNeighborsClassifier 42 | 43 | 44 | def load_data(): 45 | 46 | """ 47 | The method reads train and test data from data files. 48 | Then, it splits train data into features and labels. 49 | """ 50 | 51 | train_data = pd.read_csv("train.csv") 52 | test_data = pd.read_csv("test.csv") 53 | train_x = train_data.iloc[:, 0:595] 54 | train_y = train_data.iloc[:, -1] 55 | 56 | return train_x, train_y, test_data 57 | 58 | 59 | def preprocessing(train_x, test_x): 60 | 61 | """ 62 | The method at first performs min-max scaling on train and testing data. 63 | Then, it reduces dimension of train and test data by using pca. 64 | 65 | Parameters 66 | ---------- 67 | train_x: features of train data 68 | test_x: features of test data 69 | 70 | """ 71 | 72 | scaler = MinMaxScaler() 73 | scl_train_x = scaler.fit_transform(train_x) 74 | scl_test_x = scaler.transform(test_x) 75 | 76 | pca = PCA(n_components=5) 77 | pca.fit(scl_train_x) 78 | pca_train_x = pca.transform(scl_train_x) 79 | pca_test_x = pca.transform(scl_test_x) 80 | 81 | return pca_train_x, pca_test_x 82 | 83 | 84 | def train_model(train_x, train_y): 85 | 86 | """ 87 | The method trains a learning model by using training data. 88 | 89 | Parameters 90 | ---------- 91 | train_x: features of training data 92 | train_y: labels of training data 93 | 94 | """ 95 | model = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, random_state=4) 96 | model.fit(train_x, train_y) 97 | return model 98 | 99 | 100 | def predict(model, test_x): 101 | 102 | """ 103 | The method predicts labels for testing samples by using trained model. 
104 | 105 | Parameters 106 | ---------- 107 | model: trained model 108 | test_x: features of testing data 109 | """ 110 | return model.predict(test_x) 111 | 112 | 113 | def write_output(predictions): 114 | 115 | rows = [[count, pred] for count, pred in enumerate(predictions, start=1)] 116 | 117 | with open("submission.csv", mode='w', newline='') as file: 118 | writer = csv.writer(file) 119 | writer.writerow(["ID", "Predicted"]) 120 | writer.writerows(rows) 121 | 122 | 123 | # ********** MAIN PROGRAM ********** # 124 | 125 | train_x, train_y, test_x = load_data() 126 | pca_train_x, pca_test_x = preprocessing(train_x, test_x) 127 | model = train_model(pca_train_x, train_y) 128 | predictions = predict(model, pca_test_x) 129 | write_output(predictions) 130 | --------------------------------------------------------------------------------
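A note on scoring: every pipeline above ends by writing a two-column ID,Predicted csv file. The snippet below is a minimal sketch, not part of the original toolbox, for scoring such a file against held-out labels with accuracy, sensitivity, and specificity. The file name ground_truth.csv and its ID,Label columns are hypothetical placeholders, since the hidden test labels are not distributed with this repository.

import pandas as pd
from sklearn.metrics import confusion_matrix

def score_submission(submission_path="submission.csv", truth_path="ground_truth.csv"):
    # "ground_truth.csv" is a hypothetical ID,Label file; align both files on the ID column
    pred = pd.read_csv(submission_path).sort_values("ID")["Predicted"].to_numpy()
    true = pd.read_csv(truth_path).sort_values("ID")["Label"].to_numpy()

    # with labels=[0, 1], confusion_matrix returns [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(true, pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = score_submission()
print("accuracy: %.3f  sensitivity: %.3f  specificity: %.3f" % (acc, sens, spec))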