├── .gitignore ├── 01 Introduction of Classification ├── 01 Introduction of Classification.ipynb ├── README.md └── images │ ├── ANN.jpeg │ ├── CV.png │ ├── DT.png │ ├── KNN.png │ ├── KNN1.png │ ├── NB.png │ ├── README.md │ └── ROC.png ├── 02 Types of Classification Algorithm ├── 02 Types of Classification Algorithm.ipynb └── README.md ├── 03 Classification Algorithms ├── Intro & Types of Classifier.ipynb ├── README.md ├── data │ ├── README.md │ └── df.csv └── images │ ├── AS.png │ ├── DT.png │ ├── EDA.jpg │ ├── KNN.png │ ├── LR.png │ ├── NB.png │ ├── README.md │ ├── RF.png │ ├── SCD.png │ └── SVM.png ├── 04 Exploratory Data Analysis ├── 01 Data_Prep.sas ├── 02 Exploratory Data Analysis.ipynb ├── 02 Exploratory Data Analysis.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 05 Decision Tree ├── 01 Data_Prep.sas ├── 02 Decision Tree.ipynb ├── 02 Decision Tree.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 06 Exploratory Data Analysis ├── 01 Data_Prep.sas ├── 02 Exploratory Data Analysis.ipynb ├── 02 Exploratory Data Analysis.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 07 K-Nearest Neighbors ├── 01 Data_Prep.sas ├── 02 K-Neighbors Classifier.ipynb ├── 02 K-Neighbors Classifier.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 08 Logistic Regression Classifier ├── 01 Data_Prep.sas ├── 02 Logistic Regression.ipynb ├── 02 Logistic Regression.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 09 Naive_Bayes classifier ├── 01 Data_Prep.sas ├── 02 Naive_Bayes classifier.ipynb ├── 02 Naive_Bayes classifier.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 10 Random Forest ├── 01 Data_Prep.sas ├── 02 Random Forest.ipynb ├── 02 Random Forest.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 11 Stochastic Gradient Descent ├── 01 Data_Prep.sas ├── 02 Stochastic Gradient Descent.ipynb ├── 02 Stochastic Gradient Descent.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── 12 
Support Vector Machine ├── 01 Data_Prep.sas ├── 02 Support Vector Machine.ipynb ├── 02 Support Vector Machine.py ├── README.md └── data │ ├── 00 df.csv │ └── README.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /01 Introduction of Classification/01 Introduction of Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "

Machine Learning Classifier

\n", 9 | "\n", 10 | "---" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "---\n", 18 | "# **What is Classification?**\n", 19 | "\n", 20 | "---" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).\n", 28 | "### For example, spam detection in email service providers can be framed as a classification problem. This is a binary classification problem since there are only two classes: spam and not spam. A classifier uses training data to learn how the input variables relate to the class. In this case, known spam and non-spam emails are used as the training data. Once the classifier is trained accurately, it can be used to classify a previously unseen email.\n", 29 | "### Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing.\n", 30 | "### There are two types of learners in classification: lazy learners and eager learners." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## **1. Lazy learners**" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Lazy learners simply store the training data and wait until test data appears. When it does, classification is conducted based on the most closely related data in the stored training data. Compared to eager learners, lazy learners spend less time training but more time predicting.\n", 45 | "### *Ex. 
k-nearest neighbor, Case-based reasoning*" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## **2. Eager learners**" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "### Eager learners construct a classification model based on the given training data before receiving data for classification. An eager learner must be able to commit to a single hypothesis that covers the entire instance space. Because of the model construction, eager learners take a long time to train and less time to predict.\n", 60 | "### *Ex. Decision Tree, Naive Bayes, Artificial Neural Networks*" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "---\n", 68 | "# **Classification algorithms**\n", 69 | "---" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### There are many classification algorithms available, but it is not possible to conclude that one is superior to the others. The best choice depends on the application and the nature of the available data set. For example, if the classes are linearly separable, linear classifiers such as logistic regression or Fisher’s linear discriminant can outperform more sophisticated models, and vice versa." 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "---\n", 84 | "## **1. Decision Tree**" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "### A decision tree builds classification or regression models in the form of a tree structure. It utilizes an if-then rule set which is mutually exclusive and exhaustive for classification. The rules are learned sequentially using the training data one at a time. Each time a rule is learned, the tuples covered by the rules are removed. 
This process continues on the training set until a termination condition is met." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "### The tree is constructed in a top-down recursive divide-and-conquer manner. In the basic algorithm, all attributes are assumed to be categorical; continuous attributes are discretized in advance. Attributes at the top of the tree have more impact on the classification, and they are identified using the information gain concept." 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "### A decision tree can easily be over-fitted, generating too many branches that may reflect anomalies due to noise or outliers. An over-fitted model performs very poorly on unseen data even though it gives an impressive performance on the training data. This can be avoided by pre-pruning, which halts tree construction early, or post-pruning, which removes branches from the fully grown tree." 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "---\n", 120 | "## **2. Naive Bayes**" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Naive Bayes is a probabilistic classifier based on Bayes' theorem under the simple assumption that the attributes are conditionally independent given the class." 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "### Classification is conducted by deriving the maximum posterior, that is, the maximal P(Ci|X), with the above assumption applied to Bayes' theorem. This assumption greatly reduces the computational cost by only counting the class distribution. 
Even though the assumption is not valid in most cases, since the attributes are usually dependent, Naive Bayes often performs surprisingly well." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### Naive Bayes is a very simple algorithm to implement, and it often gives good results. It scales easily to larger datasets since training takes linear time, rather than the expensive iterative approximation used by many other types of classifiers." 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### Naive Bayes can suffer from a problem called the zero probability problem. When the conditional probability is zero for a particular attribute value, it fails to give a valid prediction. This needs to be fixed explicitly using a Laplacian estimator." 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "---\n", 163 | "## **3. Artificial Neural Networks**" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "### An artificial neural network is a set of connected input/output units in which each connection has an associated weight. The field was started by psychologists and neurobiologists seeking to develop and test computational analogs of neurons. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples." 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "### There are many network architectures available, such as feed-forward, convolutional, and recurrent networks. The appropriate architecture depends on the application of the model. 
In most cases feed-forward models give reasonably accurate results, while for image processing applications in particular, convolutional networks perform better." 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### There can be multiple hidden layers in the model, depending on the complexity of the function the model has to represent. Having more hidden layers, as in deep neural networks, makes it possible to model more complex relationships.\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### However, when there are many hidden layers, it takes a lot of time to train the network and adjust the weights. The other disadvantage is the poor interpretability of the model compared to models like decision trees, because the symbolic meaning behind the learned weights is unknown.\n" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "### Nevertheless, artificial neural networks have performed impressively in many real-world applications. They have a high tolerance to noisy data and are able to classify patterns they were not trained on. Usually, artificial neural networks perform better with continuous-valued inputs and outputs." 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### *All of the above algorithms are eager learners since they train a model in advance to generalize the training data and use it for prediction later.*" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "---\n", 220 | "## **4. 
k-Nearest Neighbor (KNN)**" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### k-Nearest Neighbor is a lazy learning algorithm which stores all instances corresponding to training data points in n-dimensional space. When a new data point is received, it examines the k closest stored instances (nearest neighbors). For a discrete target it returns the most common class among them as the prediction; for a real-valued target it returns the mean of the k nearest neighbors.\n" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### The distance-weighted nearest neighbor algorithm weights the contribution of each of the k neighbors according to its distance, using the following query, giving greater weight to the closest neighbors.\n" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "### KNN is usually robust to noisy data because it averages over the k nearest neighbors.\n" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "---\n", 263 | "# **Evaluating a classifier**\n", 264 | "---" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "### After training the model, the most important part is to evaluate the classifier to verify its applicability.\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "---\n", 279 | "## **Holdout method**" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "### Several evaluation methods exist, and the most common is the holdout method. 
In this method, the given data set is divided into two partitions, train and test, commonly 80% and 20% respectively. The train set is used to train the model and the unseen test data is used to test its predictive power.\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "---\n", 294 | "## **Cross-validation**" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "### Over-fitting is a common problem in machine learning which can occur in most models. k-fold cross-validation can be conducted to verify that the model is not over-fitted. In this method, the data set is randomly partitioned into k mutually exclusive subsets of approximately equal size; one subset is kept for testing while the others are used for training. This process is repeated k times, so that each fold is used for testing exactly once.\n" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "---\n", 316 | "## **Precision and Recall**" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "### Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances. Precision and recall are used as measures of relevance.\n" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "---\n", 331 | "## **ROC curve (Receiver Operating Characteristic)**" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "### The ROC curve is used for visual comparison of classification models; it shows the trade-off between the true positive rate and the false positive rate. 
The area under the ROC curve is a measure of the accuracy of the model. When a model is closer to the diagonal, it is less accurate and the model with perfect accuracy will have an area of 1.0\n" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [] 354 | } 355 | ], 356 | "metadata": { 357 | "kernelspec": { 358 | "display_name": "Python 3", 359 | "language": "python", 360 | "name": "python3" 361 | }, 362 | "language_info": { 363 | "codemirror_mode": { 364 | "name": "ipython", 365 | "version": 3 366 | }, 367 | "file_extension": ".py", 368 | "mimetype": "text/x-python", 369 | "name": "python", 370 | "nbconvert_exporter": "python", 371 | "pygments_lexer": "ipython3", 372 | "version": "3.6.8" 373 | } 374 | }, 375 | "nbformat": 4, 376 | "nbformat_minor": 4 377 | } 378 | -------------------------------------------------------------------------------- /01 Introduction of Classification/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /01 Introduction of Classification/images/ANN.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/ANN.jpeg -------------------------------------------------------------------------------- /01 Introduction of Classification/images/CV.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/CV.png 
-------------------------------------------------------------------------------- /01 Introduction of Classification/images/DT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/DT.png -------------------------------------------------------------------------------- /01 Introduction of Classification/images/KNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/KNN.png -------------------------------------------------------------------------------- /01 Introduction of Classification/images/KNN1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/KNN1.png -------------------------------------------------------------------------------- /01 Introduction of Classification/images/NB.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/NB.png -------------------------------------------------------------------------------- /01 Introduction of Classification/images/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /01 Introduction of Classification/images/ROC.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/01 Introduction of Classification/images/ROC.png -------------------------------------------------------------------------------- /02 Types of Classification Algorithm/02 Types of Classification Algorithm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "

Types of Machine Learning Classifier

\n", 9 | "\n", 10 | "---" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "### In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. The data set may simply be bi-class (like identifying whether a person is male or female, or whether a mail is spam or non-spam) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, and document classification.\n", 18 | "### Here are the types of classification algorithms in machine learning:\n", 19 | "### 1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier\n", 20 | "### 2. Nearest Neighbor\n", 21 | "### 3. Support Vector Machines\n", 22 | "### 4. Decision Trees\n", 23 | "### 5. Boosted Trees\n", 24 | "### 6. Random Forest\n", 25 | "### 7. Neural Networks" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "---\n", 33 | "## **Naive Bayes Classifier (Generative Learning Model):**" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties are treated as contributing independently to the probability. A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can sometimes outperform even highly sophisticated classification methods." 
41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "---\n", 48 | "## **Nearest Neighbor:**" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### The k-nearest-neighbors algorithm is a supervised classification algorithm: it takes a set of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (its nearest neighbors) and has those neighbors vote, so whichever label most of the neighbors have becomes the label for the new point (the “k” is the number of neighbors it checks)." 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "---\n", 63 | "## **Logistic Regression (Predictive Learning Model):**" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### It is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). The goal of logistic regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. An advantage over other binary classifiers such as nearest neighbor is that it also explains quantitatively the factors that lead to the classification." 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "---\n", 78 | "## **Decision Trees:**" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### A decision tree builds classification or regression models in the form of a tree structure. 
It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "---\n", 93 | "## **Random Forest:**" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set." 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "---\n", 108 | "## **Neural Network:**" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### A neural network consists of units (neurons), arranged in layers, which convert an input vector into some output. Each unit takes an input, applies an (often nonlinear) function to it and then passes the output on to the next layer. Generally the networks are defined to be feed-forward: a unit feeds its output to all the units on the next layer, but there is no feedback to the previous layer. 
Weightings are applied to the signals passing from one unit to another, and it is these weightings which are tuned in the training phase to adapt a neural network to the particular problem at hand.\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [] 124 | } 125 | ], 126 | "metadata": { 127 | "kernelspec": { 128 | "display_name": "Python 3", 129 | "language": "python", 130 | "name": "python3" 131 | }, 132 | "language_info": { 133 | "codemirror_mode": { 134 | "name": "ipython", 135 | "version": 3 136 | }, 137 | "file_extension": ".py", 138 | "mimetype": "text/x-python", 139 | "name": "python", 140 | "nbconvert_exporter": "python", 141 | "pygments_lexer": "ipython3", 142 | "version": "3.6.8" 143 | } 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 4 147 | } 148 | -------------------------------------------------------------------------------- /02 Types of Classification Algorithm/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /03 Classification Algorithms/Intro & Types of Classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### The 7 most commonly used classification algorithms, along with Python code: Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbours, Decision Tree, Random Forest, and Support Vector Machine" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "---\n", 15 | "# **1. 
Introduction**" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "---\n", 23 | "## **1.1 Structured Data Classification**" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Classification can be performed on structured or unstructured data. Classification is a technique where we categorize data into a given number of classes. The main goal of a classification problem is to identify the category/class to which new data belongs." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Some of the terms encountered in machine learning classification:" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### **1. Classifier:** An algorithm that maps the input data to a specific category." 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### **2. Classification model:** A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.\n", 52 | "### **3. Feature:** A feature is an individual measurable property of a phenomenon being observed.\n", 53 | "### **4. Binary Classification:** Classification task with two possible outcomes. E.g., gender classification (Male / Female).\n", 54 | "### **5. Multi class classification:** Classification with more than two classes. In multi class classification each sample is assigned to one and only one target label. E.g., an animal can be a cat or a dog but not both at the same time.\n", 55 | "### **6. Multi label classification:** Classification task where each sample is mapped to a set of target labels (more than one class). E.g., a news article can be about sports, a person, and a location at the same time." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "---\n", 63 | "### The following are the steps involved in building a classification model:" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### **1. Initialize** the classifier to be used.\n", 71 | "### **2. Train the classifier:** All classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the training data X and training labels y.\n", 72 | "### **3. Predict the target:** Given unlabeled observations X, predict(X) returns the predicted labels y.\n", 73 | "### **4. Evaluate** the classifier model." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "---\n", 81 | "## **1.2 Dataset Source and Contents**" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### The dataset contains salary data. The following is a description of our dataset:" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### **Number of classes:** 2 (‘>50K’ and ‘<=50K’)\n", 96 | "### **Number of attributes (columns):** 7\n", 97 | "### **Number of instances (rows):** 48,842" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "---\n", 105 | "## **1.3 Exploratory Data Analysis**" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "---\n", 120 | "# **2 Classification Algorithms (Python)**\n", 121 | "---" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## **2.1 Logistic Regression**" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | 
"metadata": {}, 141 | 
"source": [ 142 | "### **Definition:** Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.\n", 143 | "\n", 144 | "### **Advantages:** Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.\n", 145 | "\n", 146 | "### **Disadvantages:** In its basic form it handles only a binary outcome (multinomial extensions exist), assumes the predictors are not strongly correlated with each other, and assumes the data is free of missing values." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "---\n", 154 | "## **2.2 Naïve Bayes**" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### **Definition:** The Naive Bayes algorithm is based on Bayes’ theorem with the assumption of conditional independence between every pair of features given the class. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.\n", 169 | "\n", 170 | "### **Advantages:** This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.\n", 171 | "\n", 172 | "### **Disadvantages:** Naive Bayes is known to be a poor probability estimator, so its predicted probabilities should not be taken too literally." 
173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "---\n", 180 | "## **2.3 Stochastic Gradient Descent**" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "### **Definition:** Stochastic gradient descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.\n", 195 | "\n", 196 | "### **Advantages:** Efficiency and ease of implementation.\n", 197 | "\n", 198 | "### **Disadvantages:** Requires tuning a number of hyper-parameters and is sensitive to feature scaling." 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "---\n", 206 | "## **2.4 K-Nearest Neighbours**" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "### **Definition:** Neighbours-based classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.\n", 221 | "\n", 222 | "### **Advantages:** This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.\n", 223 | "\n", 224 | "### **Disadvantages:** The value of K must be chosen, and the computation cost is high because the distance from each new instance to all the training samples must be computed." 
225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "---\n", 232 | "## **2.5 Decision Tree**" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "### **Definition:** Given data consisting of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.\n", 247 | "\n", 248 | "### **Advantages:** A decision tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.\n", 249 | "\n", 250 | "### **Disadvantages:** Decision trees can grow into complex trees that do not generalise well, and they can be unstable because small variations in the data might result in a completely different tree being generated." 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "---\n", 258 | "## **2.6 Random Forest**" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "### **Definition:** A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement.\n", 273 | "\n", 274 | "### **Advantages:** Reduced over-fitting; a random forest classifier is more accurate than a single decision tree in most cases.\n", 275 | "\n", 276 | "### **Disadvantages:** Slow real-time prediction, difficult to implement, and a complex algorithm." 
277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "---\n", 284 | "## **2.7 Support Vector Machine**" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "### **Definition:** Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.\n", 299 | "\n", 300 | "### **Advantages:** Effective in high-dimensional spaces, and memory efficient because it uses only a subset of training points in the decision function.\n", 301 | "\n", 302 | "### **Disadvantages:** The algorithm does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation." 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "---\n", 310 | "# **3 Conclusion**\n", 311 | "---" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "### **3.1 Comparison Matrix**" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | ">- **Accuracy: (True Positive + True Negative) / Total Population**\n", 326 | ">>- Accuracy is the ratio of correctly predicted observations to total observations. 
Accuracy is the most intuitive performance measure.\n", 327 | ">>- True Positive: The number of observations correctly predicted as positive\n", 328 | ">>- True Negative: The number of observations correctly predicted as negative" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | ">- **F1-Score: (2 x Precision x Recall) / (Precision + Recall)**\n", 336 | "\n", 337 | ">>- F1-Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1-Score is usually more useful than accuracy, especially if you have an uneven class distribution.\n", 338 | ">>- Precision: When a positive value is predicted, how often is the prediction correct?\n", 339 | ">>- Recall: When the actual value is positive, how often is the prediction correct?" 340 | ] 341 | }, 342 | { 343 | "cell_type": "raw", 344 | "metadata": {}, 345 | "source": [] 346 | }, 347 | { 348 | "cell_type": "raw", 349 | "metadata": {}, 350 | "source": [ 351 | "Classification Algorithm | Accuracy | F1-Score\n", 352 | "\n", 353 | "Logistic Regression | 84.60% | 0.6337\n", 354 | "Naïve Bayes | 80.11% | 0.6005\n", 355 | "Stochastic Gradient Descent | 82.20% | 0.5780\n", 356 | "K-Nearest Neighbours | 83.56% | 0.5924\n", 357 | "Decision Tree | 84.23% | 0.6308\n", 358 | "Random Forest | 84.33% | 0.6275\n", 359 | "Support Vector Machine | 84.09% | 0.6145" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "---\n", 367 | "## **3.2 Algorithm Selection**" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [] 383 | } 384 | ], 385 | "metadata": { 386 | "kernelspec": { 387 | "display_name": "Python 3", 388 | "language": "python", 389 | "name": "python3" 
390 | }, 391 | "language_info": { 392 | "codemirror_mode": { 393 | "name": "ipython", 394 | "version": 3 395 | }, 396 | "file_extension": ".py", 397 | "mimetype": "text/x-python", 398 | "name": "python", 399 | "nbconvert_exporter": "python", 400 | "pygments_lexer": "ipython3", 401 | "version": "3.6.8" 402 | } 403 | }, 404 | "nbformat": 4, 405 | "nbformat_minor": 4 406 | } 407 | -------------------------------------------------------------------------------- /03 Classification Algorithms/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /03 Classification Algorithms/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /03 Classification Algorithms/images/AS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/AS.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/DT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/DT.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/EDA.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/EDA.jpg 
-------------------------------------------------------------------------------- /03 Classification Algorithms/images/KNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/KNN.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/LR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/LR.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/NB.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/NB.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /03 Classification Algorithms/images/RF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/RF.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/SCD.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/SCD.png -------------------------------------------------------------------------------- /03 Classification Algorithms/images/SVM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Awesome-Machine-Learning/Machine-Learning-Classifications/9e22e12dfc3da18d18478f13a1b96c47680ab497/03 Classification Algorithms/images/SVM.png -------------------------------------------------------------------------------- /04 Exploratory Data Analysis/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 
36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . 
then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 
51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /04 Exploratory Data Analysis/02 Exploratory Data Analysis.py: -------------------------------------------------------------------------------- 1 | # import libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | import seaborn as sns 7 | import matplotlib.pyplot as plt 8 | %matplotlib inline 9 | plt.style.use('seaborn-whitegrid') 10 | 11 | 12 | # Import the data 13 | df = pd.read_csv('data/00 df.csv') 14 | df = df[df['flag']=='train'] 15 | # print(df.info()) 16 | 17 | 18 | # Exploratory Data Analysis & plot the data 19 | #age_bin 20 | x_chart = df.pivot_table(values=['flag'], index=['age_bin'], columns=['y'], aggfunc='count') 21 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 22 | x_chart.plot(kind="bar",stacked=True) 23 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 24 | 25 | #capital_gl_bin 26 | x_chart = df.pivot_table(values=['flag'], index=['capital_gl_bin'], columns=['y'], aggfunc='count') 27 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 28 | 
x_chart.plot(kind="bar",stacked=True) 29 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 30 | 31 | #education_bin 32 | x_chart = df.pivot_table(values=['flag'], index=['education_bin'], columns=['y'], aggfunc='count') 33 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 34 | x_chart.plot(kind="bar",stacked=True) 35 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 36 | 37 | 38 | #hours_per_week_bin 39 | x_chart = df.pivot_table(values=['flag'], index=['hours_per_week_bin'], columns=['y'], aggfunc='count') 40 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 41 | x_chart.plot(kind="bar",stacked=True) 42 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 43 | 44 | 45 | #msr_bin 46 | x_chart = df.pivot_table(values=['flag'], index=['msr_bin'], columns=['y'], aggfunc='count') 47 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 48 | x_chart.plot(kind="bar",stacked=True) 49 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 50 | 51 | 52 | #occupation_bin 53 | x_chart = df.pivot_table(values=['flag'], index=['occupation_bin'], columns=['y'], aggfunc='count') 54 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 55 | x_chart.plot(kind="bar",stacked=True) 56 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 57 | 58 | 59 | #race_sex_bin 60 | x_chart = df.pivot_table(values=['flag'], index=['race_sex_bin'], columns=['y'], aggfunc='count') 61 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 62 | x_chart.plot(kind="bar",stacked=True) 63 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 64 | 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /04 Exploratory Data Analysis/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /04 Exploratory Data Analysis/data/README.md: 
-------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /05 Decision Tree/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. 
income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. 
= 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 
71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /05 Decision Tree/02 Decision Tree.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.tree import DecisionTreeClassifier\n", 20 | "from sklearn.metrics import classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | 
"outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# Decision Tree" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "" 75 | ] 76 | }, 77 | "execution_count": 4, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | }, 81 | { 82 | "data": { 83 | "image/png": 
"\n", 84 | "text/plain": [ 85 | "
" 86 | ] 87 | }, 88 | "metadata": {}, 89 | "output_type": "display_data" 90 | } 91 | ], 92 | "source": [ 93 | "results = []\n", 94 | "max_depth_options = [2,4,6,8,10,12,14,16,18,20]\n", 95 | "for trees in max_depth_options:\n", 96 | " model = DecisionTreeClassifier(max_depth=trees, random_state=101)\n", 97 | " model.fit(x_train, y_train)\n", 98 | " y_pred = model.predict(x_test)\n", 99 | " accuracy = np.mean(y_test==y_pred)\n", 100 | " results.append(accuracy)\n", 101 | "\n", 102 | "plt.figure(figsize=(8,4))\n", 103 | "pd.Series(results, max_depth_options).plot(color=\"darkred\",marker=\"o\")" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 5, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "" 115 | ] 116 | }, 117 | "execution_count": 5, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | }, 121 | { 122 | "data": { 123 | "image/png": "\n", 124 | "text/plain": [ 125 | "
" 126 | ] 127 | }, 128 | "metadata": {}, 129 | "output_type": "display_data" 130 | } 131 | ], 132 | "source": [ 133 | "results = []\n", 134 | "max_features_options = [None,'sqrt',0.95,0.75,0.5,0.25,0.10]  # 'auto' dropped: it was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3\n", 135 | "for trees in max_features_options:\n", 136 | " model = DecisionTreeClassifier(max_depth=10, random_state=101, max_features = trees)\n", 137 | " model.fit(x_train, y_train)\n", 138 | " y_pred = model.predict(x_test)\n", 139 | " accuracy = np.mean(y_test==y_pred)\n", 140 | " results.append(accuracy)\n", 141 | "\n", 142 | "plt.figure(figsize=(8,4))\n", 143 | "pd.Series(results, max_features_options).plot(kind=\"bar\",color=\"darkred\",ylim=(0.7,0.9))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "text/plain": [ 154 | "" 155 | ] 156 | }, 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | }, 161 | { 162 | "data": { 163 | "image/png": "\n", 164 | "text/plain": [ 165 | "
" 166 | ] 167 | }, 168 | "metadata": {}, 169 | "output_type": "display_data" 170 | } 171 | ], 172 | "source": [ 173 | "results = []\n", 174 | "min_samples_leaf_options = [5,10,15,20,25,30,35,40,45,50]\n", 175 | "for trees in min_samples_leaf_options:\n", 176 | " model = DecisionTreeClassifier(max_depth=10, random_state=101, max_features = None, min_samples_leaf = trees)\n", 177 | " model.fit(x_train, y_train)\n", 178 | " y_pred = model.predict(x_test)\n", 179 | " accuracy = np.mean(y_test==y_pred)\n", 180 | " results.append(accuracy)\n", 181 | "\n", 182 | "plt.figure(figsize=(8,4))\n", 183 | "pd.Series(results, min_samples_leaf_options).plot(color=\"darkred\",marker=\"o\")" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 7, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "dtree = DecisionTreeClassifier(max_depth=10, random_state=101, max_features = None, min_samples_leaf = 15)\n", 193 | "dtree.fit(x_train, y_train)\n", 194 | "y_pred=dtree.predict(x_test)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "[[11521 914]\n", 207 | " [ 1653 2193]]\n", 208 | "accuracy: 0.8423315521159634\n", 209 | "precision: 0.7058255551979401\n", 210 | "recall: 0.5702028081123245\n", 211 | "f1 score: 0.6308068459657701\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 217 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 218 | "\n", 219 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 220 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 221 | "print (df_table)\n", 222 | "\n", 223 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + 
df_table[1,0] + df_table[1,1]))\n", 224 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 225 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 226 | "\n", 227 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 228 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 229 | "print('f1 score: ', (2*p*r)/(p+r))" 230 | ] 231 | } 232 | ], 233 | "metadata": { 234 | "kernelspec": { 235 | "display_name": "Python 3", 236 | "language": "python", 237 | "name": "python3" 238 | }, 239 | "language_info": { 240 | "codemirror_mode": { 241 | "name": "ipython", 242 | "version": 3 243 | }, 244 | "file_extension": ".py", 245 | "mimetype": "text/x-python", 246 | "name": "python", 247 | "nbconvert_exporter": "python", 248 | "pygments_lexer": "ipython3", 249 | "version": "3.6.8" 250 | } 251 | }, 252 | "nbformat": 4, 253 | "nbformat_minor": 4 254 | } 255 | -------------------------------------------------------------------------------- /05 Decision Tree/02 Decision Tree.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | # %matplotlib inline  (IPython magic: valid only in a notebook or IPython session, a syntax error in a plain .py script) 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.tree import DecisionTreeClassifier 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | # Import the data 17 | df = pd.read_csv('data/00 df.csv') 18 | 19 | 20 | # Split the data into train and test sets using the 'flag' column 21 | train = df[df['flag']=='train'] 22 | test = df[df['flag']=='test'] 23 | 24 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 25 | 26 | y_train = train['y'] 27 | x_train = 
train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 28 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 29 | 30 | y_test = test['y'] 31 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 32 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 33 | 34 | 35 | 36 | # Decision Tree: sweep max_depth and record the test-set accuracy for each value 37 | results = [] 38 | max_depth_options = [2,4,6,8,10,12,14,16,18,20] 39 | for depth in max_depth_options: 40 | model = DecisionTreeClassifier(max_depth=depth, random_state=101) 41 | model.fit(x_train, y_train) 42 | y_pred = model.predict(x_test) 43 | accuracy = np.mean(y_test==y_pred) 44 | results.append(accuracy) 45 | 46 | 47 | # Plot accuracy against max_depth 48 | plt.figure(figsize=(8,4)) 49 | pd.Series(results, max_depth_options).plot(color="darkred",marker="o") 50 | 51 | 52 | # Sweep max_features with max_depth fixed at 10 53 | results = [] 54 | max_features_options = [None,'sqrt',0.95,0.75,0.5,0.25,0.10]  # 'auto' dropped: it was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3 55 | for max_feats in max_features_options: 56 | model = DecisionTreeClassifier(max_depth=10, random_state=101, max_features = max_feats) 57 | model.fit(x_train, y_train) 58 | y_pred = model.predict(x_test) 59 | accuracy = np.mean(y_test==y_pred) 60 | results.append(accuracy) 61 | 62 | plt.figure(figsize=(8,4)) 63 | pd.Series(results, max_features_options).plot(kind="bar",color="darkred",ylim=(0.7,0.9)) 64 | 65 | 66 | # Sweep min_samples_leaf with max_depth=10 and max_features=None 67 | results = [] 68 | min_samples_leaf_options = [5,10,15,20,25,30,35,40,45,50] 69 | for min_leaf in min_samples_leaf_options: 70 | model = DecisionTreeClassifier(max_depth=10, random_state=101, max_features = None, min_samples_leaf = min_leaf) 71 | model.fit(x_train, y_train) 72 | y_pred = model.predict(x_test) 73 | accuracy = np.mean(y_test==y_pred) 74 | results.append(accuracy) 75 | 76 | plt.figure(figsize=(8,4)) 77 | pd.Series(results, min_samples_leaf_options).plot(color="darkred",marker="o") 78 | 79 | 80 | 81 | # Final decision tree classifier with the chosen settings 82 | dtree = 
DecisionTreeClassifier(max_depth=10, random_state=101, max_features = None, min_samples_leaf = 15) 83 | dtree.fit(x_train, y_train) 84 | y_pred=dtree.predict(x_test) 85 | 86 | 87 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 88 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 89 | 90 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 91 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 92 | print (df_table) 93 | 94 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 95 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 96 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 97 | 98 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 99 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 100 | print('f1 score: ', (2*p*r)/(p+r)) -------------------------------------------------------------------------------- /05 Decision Tree/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /05 Decision Tree/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /06 Exploratory Data Analysis/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = 
marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 
13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. 
Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /06 Exploratory Data Analysis/02 Exploratory Data Analysis.py: -------------------------------------------------------------------------------- 1 | # import libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 
import seaborn as sns 7 | import matplotlib.pyplot as plt 8 | # %matplotlib inline (IPython magic; valid only in a notebook, so disabled in this script) 9 | plt.style.use('seaborn-whitegrid') 10 | 11 | 12 | # Import the data 13 | df = pd.read_csv('data/00 df.csv') 14 | df = df[df['flag']=='train'] 15 | # print(df.info()) 16 | 17 | 18 | # Exploratory Data Analysis & plot the data 19 | #age_bin 20 | x_chart = df.pivot_table(values=['flag'], index=['age_bin'], columns=['y'], aggfunc='count') 21 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 22 | x_chart.plot(kind="bar",stacked=True) 23 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 24 | 25 | #capital_gl_bin 26 | x_chart = df.pivot_table(values=['flag'], index=['capital_gl_bin'], columns=['y'], aggfunc='count') 27 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 28 | x_chart.plot(kind="bar",stacked=True) 29 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 30 | 31 | #education_bin 32 | x_chart = df.pivot_table(values=['flag'], index=['education_bin'], columns=['y'], aggfunc='count') 33 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 34 | x_chart.plot(kind="bar",stacked=True) 35 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 36 | 37 | 38 | #hours_per_week_bin 39 | x_chart = df.pivot_table(values=['flag'], index=['hours_per_week_bin'], columns=['y'], aggfunc='count') 40 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 41 | x_chart.plot(kind="bar",stacked=True) 42 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 43 | 44 | 45 | #msr_bin 46 | x_chart = df.pivot_table(values=['flag'], index=['msr_bin'], columns=['y'], aggfunc='count') 47 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 48 | x_chart.plot(kind="bar",stacked=True) 49 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 50 | 51 | 52 | #occupation_bin 53 | x_chart = df.pivot_table(values=['flag'], index=['occupation_bin'], columns=['y'], aggfunc='count') 54 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, 
axis=1) 55 | x_chart.plot(kind="bar",stacked=True) 56 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 57 | 58 | 59 | #race_sex_bin 60 | x_chart = df.pivot_table(values=['flag'], index=['race_sex_bin'], columns=['y'], aggfunc='count') 61 | x_chart = x_chart.apply(lambda c: c / c.sum() * 100, axis=1) 62 | x_chart.plot(kind="bar",stacked=True) 63 | plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5)) 64 | 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /06 Exploratory Data Analysis/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /06 Exploratory Data Analysis/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /07 K-Nearest Neighbors/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 
0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. 
Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. 
High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /07 K-Nearest Neighbors/02 K-Neighbors Classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.neighbors import KNeighborsClassifier\n", 20 | "from sklearn.metrics import 
classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# KNN" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "error_rate = []\n", 73 | "for i in range(1,51): \n", 74 | " knn = KNeighborsClassifier(n_neighbors=i)\n", 75 | " knn.fit(x_train,y_train)\n", 76 | " pred_i = knn.predict(x_test)\n", 77 | " error_rate.append(np.mean(pred_i != y_test))" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/plain": [ 88 | "Text(0, 0.5, 'Error Rate')" 89 | ] 90 | }, 91 | "execution_count": 5, 92 | "metadata": {}, 93 | "output_type": "execute_result" 94 | }, 95 | { 96 | 
"data": { 97 | "image/png": "\n", 98 | "text/plain": [ 99 | "
" 100 | ] 101 | }, 102 | "metadata": {}, 103 | "output_type": "display_data" 104 | } 105 | ], 106 | "source": [ 107 | "plt.figure(figsize=(8,4))\n", 108 | "plt.plot(range(1,51),error_rate,color='darkred', marker='o',markersize=10)\n", 109 | "plt.title('Error Rate vs. K Value')\n", 110 | "plt.xlabel('K')\n", 111 | "plt.ylabel('Error Rate')" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 6, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "knn = KNeighborsClassifier(n_neighbors=15)\n", 121 | "knn.fit(x_train,y_train)\n", 122 | "y_pred=knn.predict(x_test)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "[[11660 775]\n", 135 | " [ 1901 1945]]\n", 136 | "accuracy: 0.8356366316565321\n", 137 | "precision: 0.7150735294117647\n", 138 | "recall: 0.5057202288091524\n", 139 | "f1 score: 0.5924459335973195\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 145 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 146 | "\n", 147 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 148 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 149 | "print (df_table)\n", 150 | "\n", 151 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1]))\n", 152 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 153 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 154 | "\n", 155 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 156 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 157 | "print('f1 score: ', (2*p*r)/(p+r))" 158 | ] 159 | } 160 | ], 161 | "metadata": { 
162 | "kernelspec": { 163 | "display_name": "Python 3", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.6.8" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 4 182 | } 183 | -------------------------------------------------------------------------------- /07 K-Nearest Neighbors/02 K-Neighbors Classifier.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | # %matplotlib inline (IPython magic; valid only in a notebook, so disabled in this script) 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.neighbors import KNeighborsClassifier 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | # Import the data 17 | df = pd.read_csv('data/00 df.csv') 18 | 19 | 20 | # Split data into train & test 21 | train = df[df['flag']=='train'] 22 | test = df[df['flag']=='test'] 23 | 24 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 25 | 26 | y_train = train['y'] 27 | x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 28 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 29 | 30 | y_test = test['y'] 31 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 32 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 33 | 34 | 35 | # KNN: record the test-set error rate for K = 1..50 36 | error_rate = [] 37 | for i in range(1,51): 38 | knn = 
KNeighborsClassifier(n_neighbors=i) 39 | knn.fit(x_train,y_train) 40 | pred_i = knn.predict(x_test) 41 | error_rate.append(np.mean(pred_i != y_test)) 42 | 43 | 44 | # plot the data 45 | plt.figure(figsize=(8,4)) 46 | plt.plot(range(1,51),error_rate,color='darkred', marker='o',markersize=10) 47 | plt.title('Error Rate vs. K Value') 48 | plt.xlabel('K') 49 | plt.ylabel('Error Rate') 50 | plt.show() 51 | 52 | # KNN 53 | knn = KNeighborsClassifier(n_neighbors=15) 54 | knn.fit(x_train,y_train) 55 | y_pred=knn.predict(x_test) 56 | 57 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 58 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 59 | 60 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 61 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 62 | print (df_table) 63 | 64 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 65 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 66 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 67 | 68 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 69 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 70 | print('f1 score: ', (2*p*r)/(p+r)) -------------------------------------------------------------------------------- /07 K-Nearest Neighbors/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /07 K-Nearest Neighbors/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /08 Logistic Regression Classifier/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 
3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. 
Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. 
High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 
71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /08 Logistic Regression Classifier/02 Logistic Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.linear_model import LogisticRegression\n", 20 | "from sklearn.metrics import classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 
| "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# Logistic Regression" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "lr = LogisticRegression()\n", 73 | "lr.fit(x_train, y_train)\n", 74 | "y_pred=lr.predict(x_test)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "[[11602 833]\n", 87 | " [ 1676 2170]]\n", 88 | "accuracy: 0.8458939868558443\n", 89 | "precision: 0.7226107226107226\n", 90 | "recall: 0.5642225689027561\n", 91 | "f1 score: 0.6336691487808439\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 97 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 98 | "\n", 99 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 100 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 101 | "print (df_table)\n", 102 | "\n", 103 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] 
+ df_table[1,1]))\n", 104 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 105 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 106 | "\n", 107 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 108 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 109 | "print('f1 score: ', (2*p*r)/(p+r))" 110 | ] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.6.8" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 4 134 | } 135 | -------------------------------------------------------------------------------- /08 Logistic Regression Classifier/02 Logistic Regression.py: -------------------------------------------------------------------------------- 1 | # import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | # %matplotlib inline (IPython magic; valid only in a notebook, so disabled in this script) 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.linear_model import LogisticRegression 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | # import the data 18 | df = pd.read_csv('data/00 df.csv') 19 | 20 | # split data into train and test 21 | train = df[df['flag']=='train'] 22 | test = df[df['flag']=='test'] 23 | 24 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 25 | 26 | y_train = train['y'] 27 | x_train = 
train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 28 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 29 | 30 | y_test = test['y'] 31 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 32 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 33 | 34 | 35 | # Logistic Regression Model 36 | lr=LogisticRegression() 37 | lr.fit(x_train, y_train) 38 | y_pred=lr.predict(x_test) 39 | 40 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 41 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 42 | 43 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 44 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 45 | print (df_table) 46 | 47 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 48 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 49 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 50 | 51 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 52 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 53 | print('f1 score: ', (2*p*r)/(p+r)) 54 | 55 | 56 | -------------------------------------------------------------------------------- /08 Logistic Regression Classifier/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /08 Logistic Regression Classifier/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /09 Naive_Bayes classifier/01 Data_Prep.sas: 
-------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. 
Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. 
High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 
71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /09 Naive_Bayes classifier/02 Naive_Bayes classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.naive_bayes import GaussianNB\n", 20 | "from sklearn.metrics import classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": 
{}, 46 | "outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# Naive Bayes" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "nb = GaussianNB()\n", 73 | "nb.fit(x_train, y_train)\n", 74 | "y_pred=nb.predict(x_test)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "[[10608 1827]\n", 87 | " [ 1412 2434]]\n", 88 | "accuracy: 0.8010564461642405\n", 89 | "precision: 0.5712274114057733\n", 90 | "recall: 0.6328653146125846\n", 91 | "f1 score: 0.6004687307265326\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 97 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 98 | "\n", 99 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 100 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 101 | "print (df_table)\n", 102 | "\n", 103 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1]))\n", 104 | 
"print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 105 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 106 | "\n", 107 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 108 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 109 | "print('f1 score: ', (2*p*r)/(p+r))" 110 | ] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.6.8" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 4 134 | } 135 | -------------------------------------------------------------------------------- /09 Naive_Bayes classifier/02 Naive_Bayes classifier.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | %matplotlib inline 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.naive_bayes import GaussianNB 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | # Import the data 17 | df = pd.read_csv('data/00 df.csv') 18 | 19 | # Split data into train & test 20 | train = df[df['flag']=='train'] 21 | test = df[df['flag']=='test'] 22 | 23 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 24 | 25 | y_train = train['y'] 26 | x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 27 | x_train = 
pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 28 | 29 | y_test = test['y'] 30 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 31 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 32 | 33 | 34 | # Naive Bayes 35 | nb = GaussianNB() 36 | nb.fit(x_train, y_train) 37 | y_pred=nb.predict(x_test) 38 | 39 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 40 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 41 | 42 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 43 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 44 | print (df_table) 45 | 46 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 47 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 48 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 49 | 50 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 51 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 52 | print('f1 score: ', (2*p*r)/(p+r)) -------------------------------------------------------------------------------- /09 Naive_Bayes classifier/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /09 Naive_Bayes classifier/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /10 Random Forest/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross 
validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. 
High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. 
High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 
71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /10 Random Forest/02 Random Forest.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | %matplotlib inline 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.ensemble import RandomForestClassifier 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | # Import the data 17 | df = pd.read_csv('data/00 df.csv') 18 | 19 | 20 | # split the data into train & test 21 | train = df[df['flag']=='train'] 22 | test = df[df['flag']=='test'] 23 | 24 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 25 | 26 | y_train = train['y'] 27 | x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 28 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 29 | 30 | y_test = test['y'] 31 | x_test = 
test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 32 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 33 | 34 | 35 | # Random Forest 36 | results = [] 37 | n_estimator_options = [20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100] 38 | for trees in n_estimator_options: 39 | model = RandomForestClassifier(trees, oob_score=True, n_jobs=-1, random_state=101) 40 | model.fit(x_train, y_train) 41 | y_pred = model.predict(x_test) 42 | accuracy = np.mean(y_test==y_pred) 43 | results.append(accuracy) 44 | 45 | plt.figure(figsize=(8,4)) 46 | pd.Series(results, n_estimator_options).plot(color="darkred",marker="o") 47 | 48 | 49 | 50 | results = [] 51 | max_features_options = ['auto',None,'sqrt',0.95,0.75,0.5,0.25,0.10] 52 | for trees in max_features_options: 53 | model = RandomForestClassifier(n_estimators=70, oob_score=True, n_jobs=-1, random_state=101, max_features = trees) 54 | model.fit(x_train, y_train) 55 | y_pred = model.predict(x_test) 56 | accuracy = np.mean(y_test==y_pred) 57 | results.append(accuracy) 58 | 59 | plt.figure(figsize=(8,4)) 60 | pd.Series(results, max_features_options).plot(kind="bar",color="darkred",ylim=(0.7,0.9)) 61 | 62 | 63 | 64 | results = [] 65 | min_samples_leaf_options = [5,10,15,20,25,30,35,40,45,50] 66 | for trees in min_samples_leaf_options: 67 | model = RandomForestClassifier(n_estimators=70, oob_score=True, n_jobs=-1, random_state=101, max_features = None, min_samples_leaf = trees) 68 | model.fit(x_train, y_train) 69 | y_pred = model.predict(x_test) 70 | accuracy = np.mean(y_test==y_pred) 71 | results.append(accuracy) 72 | 73 | plt.figure(figsize=(8,4)) 74 | pd.Series(results, min_samples_leaf_options).plot(color="darkred",marker="o") 75 | 76 | 77 | 78 | rfm = RandomForestClassifier(n_estimators=70, oob_score=True, n_jobs=-1, random_state=101, max_features = None, min_samples_leaf = 30) 79 | rfm.fit(x_train, y_train) 80 | y_pred=rfm.predict(x_test) 81 | 82 | 
83 | 84 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 85 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 86 | 87 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 88 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 89 | print (df_table) 90 | 91 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 92 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 93 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 94 | 95 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 96 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 97 | print('f1 score: ', (2*p*r)/(p+r)) -------------------------------------------------------------------------------- /10 Random Forest/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /10 Random Forest/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /11 Stochastic Gradient Descent/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = 
native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 
15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. 
Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /11 Stochastic Gradient Descent/02 Stochastic Gradient Descent.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 
8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.linear_model import SGDClassifier\n", 20 | "from sklearn.metrics import classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# Stochastic Gradient Descent" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "sgd = SGDClassifier(loss='modified_huber', shuffle=True,random_state=101)\n", 73 | "sgd.fit(x_train, y_train)\n", 74 | 
"y_pred=sgd.predict(x_test)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "[[11244 1191]\n", 87 | " [ 1456 2390]]\n", 88 | "accuracy: 0.8374178490264725\n", 89 | "precision: 0.6674113376151913\n", 90 | "recall: 0.6214248569942797\n", 91 | "f1 score: 0.643597684125488\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 97 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 98 | "\n", 99 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 100 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 101 | "print (df_table)\n", 102 | "\n", 103 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1]))\n", 104 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 105 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 106 | "\n", 107 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 108 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 109 | "print('f1 score: ', (2*p*r)/(p+r))" 110 | ] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.6.8" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 4 134 | } 135 | -------------------------------------------------------------------------------- /11 Stochastic Gradient Descent/02 
Stochastic Gradient Descent.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | # %matplotlib inline  # IPython magic: valid only in a notebook, a syntax error in a plain .py script 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.linear_model import SGDClassifier 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | # Import the data 17 | df = pd.read_csv('data/00 df.csv') 18 | 19 | 20 | # Split data into train & test 21 | train = df[df['flag']=='train'] 22 | test = df[df['flag']=='test'] 23 | 24 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 25 | 26 | y_train = train['y'] 27 | x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 28 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 29 | 30 | y_test = test['y'] 31 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 32 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 33 | 34 | 35 | # Stochastic Gradient Descent 36 | sgd = SGDClassifier(loss='modified_huber', shuffle=True,random_state=101) 37 | sgd.fit(x_train, y_train) 38 | y_pred=sgd.predict(x_test) 39 | 40 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 41 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 42 | 43 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 44 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 45 | print (df_table) 46 | 47 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + 
df_table[1,1])) 48 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 49 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 50 | 51 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 52 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 53 | print('f1 score: ', (2*p*r)/(p+r)) -------------------------------------------------------------------------------- /11 Stochastic Gradient Descent/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /11 Stochastic Gradient Descent/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /12 Support Vector Machine/01 Data_Prep.sas: -------------------------------------------------------------------------------- 1 | options compress=yes; 2 | 3 | data train; 4 | format flag $20.; 5 | set 'ADULT.DATA'n; 6 | flag = 'train'; 7 | run; 8 | 9 | data test (rename=('|1x3 Cross validator'n = F1)); 10 | format flag $20.; 11 | set 'ADULT.TEST'n; 12 | flag = 'test'; 13 | run; 14 | 15 | data rg.df (rename=( 16 | F1 = age 17 | F2 = workclass 18 | F3 = fnlwgt 19 | F4 = education 20 | F5 = education_num 21 | F6 = marital_status 22 | F7 = occupation 23 | F8 = relationship 24 | F9 = race 25 | F10 = sex 26 | F11 = capital_gain 27 | F12 = capital_loss 28 | F13 = hours_per_week 29 | F14 = native_country 30 | F15 = income 31 | )); 32 | set train test; 33 | if compress(F15) ^= ""; 34 | run; 35 | 36 | data rg.df; 37 | set rg.df; 38 | if compress(income) in (">50K",">50K.") then y = 1; 39 | else y = 0; 40 | run; 41 | 42 | /*age*/ 43 | data rg.df; 44 | format age_bin $20.; 45 | set rg.df; 46 | if age <= 25 then age_bin = 'a. 0-25'; 47 | else if age <= 30 then age_bin = 'b. 26-30 & 71-100'; 48 | else if age <= 35 then age_bin = 'c. 
31-35 & 61-70'; 49 | else if age <= 40 then age_bin = 'd. 36-40 & 56-60'; 50 | else if age <= 55 then age_bin = 'e. 40-55'; 51 | else if age <= 60 then age_bin = 'd. 36-40 & 56-60'; 52 | else if age <= 70 then age_bin = 'c. 31-35 & 61-70'; 53 | else age_bin = 'b. 26-30 & 71-100'; 54 | run; 55 | 56 | /*workclass*/ 57 | data rg.df; 58 | format workclass_bin $20.; 59 | set rg.df; 60 | if workclass in ('?','Never-worked','Without-pay') then workclass_bin = 'a. no income'; 61 | else workclass_bin = 'b. income'; 62 | run; 63 | 64 | /*education*/ 65 | data rg.df; 66 | format education_bin $20.; 67 | set rg.df; 68 | if education in ('10th','11th','12th','1st-4th','5th-6th','7th-8th','9th','Preschool') then education_bin = 'a. Low'; 69 | else if education in ('HS-grad','Some-college','Assoc-acdm','Assoc-voc') then education_bin = 'b. Mid'; 70 | else if education in ('Bachelors') then education_bin = 'c. Bachelors'; 71 | else if education in ('Masters') then education_bin = 'd. Masters'; 72 | else education_bin = 'e. High'; 73 | run; 74 | 75 | /*education_num*/ 76 | data rg.df; 77 | format education_num_bin $20.; 78 | set rg.df; 79 | if education_num <= 8 then education_num_bin = 'a. 0-8'; 80 | else if education_num <= 12 then education_num_bin = 'b. 9-12'; 81 | else if education_num <= 13 then education_num_bin = 'c. 13'; 82 | else if education_num <= 14 then education_num_bin = 'd. 14'; 83 | else education_num_bin = 'e. 15+'; 84 | run; 85 | 86 | /*race & sex*/ 87 | data rg.df; 88 | format race_sex $50.; 89 | format race_sex_bin $20.; 90 | set rg.df; 91 | race_sex = compress(race)||' - '||compress(sex); 92 | if race_sex in ('Asian-Pac-Islander - Male','White - Male') then race_sex_bin = 'c. High'; 93 | else if race_sex in ('White - Female','Asian-Pac-Islander - Female','Amer-Indian-Eskimo - Male','Other - Male','Black - Male') then race_sex_bin = 'b. Mid'; 94 | else race_sex_bin = 'a. 
Low'; 95 | run; 96 | 97 | /*capital_gain & capital_loss*/ 98 | data rg.df; 99 | format capital_gl_bin $20.; 100 | set rg.df; 101 | if capital_gain = . then capital_gain = 0; 102 | if capital_loss = . then capital_loss = 0; 103 | capital_gl = capital_gain - capital_loss; 104 | if capital_gl > 0 then capital_gl_bin = "c. > 0"; 105 | else if capital_gl < 0 then capital_gl_bin = "b. < 0"; 106 | else capital_gl_bin = "a. = 0"; 107 | run; 108 | 109 | /*marital_status & relationship*/ 110 | data rg.df; 111 | format msr $50.; 112 | format msr_bin $20.; 113 | set rg.df; 114 | msr = compress(marital_status)||' - '||compress(relationship); 115 | if msr in ('Married-AF-spouse - Wife','Married-civ-spouse - Husband','Married-civ-spouse - Wife','Married-AF-spouse - Husband') then msr_bin = 'c. High'; 116 | else if msr in ('Widowed - Not-in-family','Divorced - Unmarried','Never-married - Not-in-family','Widowed - Unmarried','Separated - Not-in-family','Married-spouse-absent - Not-in-family','Divorced - Not-in-family','Married-civ-spouse - Other-relative','Married-civ-spouse - Own-child','Married-civ-spouse - Not-in-family') then msr_bin = 'b. Mid'; 117 | else msr_bin = 'a. Low'; 118 | run; 119 | 120 | /*occupation*/ 121 | data rg.df; 122 | format occupation_bin $20.; 123 | set rg.df; 124 | if occupation in ('Priv-house-serv','Other-service','Handlers-cleaners') then occupation_bin = 'a. Low'; 125 | else if occupation in ('Armed-Forces','?','Farming-fishing','Machine-op-inspct','Adm-clerical') then occupation_bin = 'b. Mid - Low'; 126 | else if occupation in ('Transport-moving','Craft-repair','Sales') then occupation_bin = 'c. Mid - Mid'; 127 | else if occupation in ('Tech-support','Protective-serv') then occupation_bin = 'd. Mid - High'; 128 | else occupation_bin = 'e. High'; 129 | run; 130 | 131 | /*hours_per_week*/ 132 | data rg.df; 133 | format hours_per_week_bin $20.; 134 | set rg.df; 135 | if hours_per_week <= 30 then hours_per_week_bin = 'a. 
0-30'; 136 | else if hours_per_week <= 40 then hours_per_week_bin = 'b. 31-40'; 137 | else if hours_per_week <= 50 then hours_per_week_bin = 'd. 41-50 & 61-70'; 138 | else if hours_per_week <= 60 then hours_per_week_bin = 'e. 51-60'; 139 | else if hours_per_week <= 70 then hours_per_week_bin = 'd. 41-50 & 61-70'; 140 | else hours_per_week_bin = 'c. 71-100'; 141 | run; 142 | 143 | %macro rg_bin (var); 144 | proc sql; 145 | select &var., count(y) as cnt_y, mean(y) as avg_y 146 | from rg.df 147 | group by &var.; 148 | quit; 149 | %mend; 150 | 151 | %rg_bin(age_bin); 152 | %rg_bin(workclass_bin); 153 | %rg_bin(education_bin); 154 | %rg_bin(education_num_bin); 155 | %rg_bin(race_sex_bin); 156 | %rg_bin(capital_gl_bin); 157 | %rg_bin(msr_bin); 158 | %rg_bin(occupation_bin); 159 | %rg_bin(hours_per_week_bin); 160 | 161 | data rg.df (keep= 162 | y 163 | flag 164 | age_bin 165 | workclass_bin 166 | education_bin 167 | education_num_bin 168 | race_sex_bin 169 | capital_gl_bin 170 | msr_bin 171 | occupation_bin 172 | hours_per_week_bin 173 | ); 174 | set rg.df; 175 | run; -------------------------------------------------------------------------------- /12 Support Vector Machine/02 Support Vector Machine.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import math\n", 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "from datetime import datetime\n", 13 | "\n", 14 | "import seaborn as sns\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "%matplotlib inline \n", 17 | "plt.style.use('seaborn-whitegrid')\n", 18 | "\n", 19 | "from sklearn.svm import SVC\n", 20 | "from sklearn.metrics import classification_report\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "# Get the Data" 29 | ] 30 | 
}, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 11, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('data/00 df.csv')\n", 38 | "train = df[df['flag']=='train']\n", 39 | "test = df[df['flag']=='test']" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 12, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']\n", 49 | "\n", 50 | "y_train = train['y']\n", 51 | "x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 52 | "x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True)\n", 53 | "\n", 54 | "y_test = test['y']\n", 55 | "x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']]\n", 56 | "x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "# Support Vector Machine" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 13, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "svm = SVC(kernel=\"rbf\", C=0.025,random_state=101)\n", 73 | "svm.fit(x_train, y_train)\n", 74 | "y_pred=svm.predict(x_test)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 14, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "[[11769 666]\n", 87 | " [ 2019 1827]]\n", 88 | "accuracy: 0.8350838400589644\n", 89 | "precision: 0.7328519855595668\n", 90 | "recall: 0.4750390015600624\n", 91 | "f1 score: 0.5764316138192144\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 97 | "test_calc.rename(columns={0: 
'predicted'}, inplace=True)\n", 98 | "\n", 99 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 100 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 101 | "print (df_table)\n", 102 | "\n", 103 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1]))\n", 104 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 105 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 106 | "\n", 107 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 108 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 109 | "print('f1 score: ', (2*p*r)/(p+r))" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 15, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "svm = SVC(kernel=\"linear\", C=0.025,random_state=101)\n", 119 | "svm.fit(x_train, y_train)\n", 120 | "y_pred=svm.predict(x_test)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 16, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "[[11625 810]\n", 133 | " [ 1781 2065]]\n", 134 | "accuracy: 0.8408574411891161\n", 135 | "precision: 0.7182608695652174\n", 136 | "recall: 0.5369214768590743\n", 137 | "f1 score: 0.6144918910876358\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 143 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 144 | "\n", 145 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 146 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 147 | "print (df_table)\n", 148 | "\n", 149 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + 
df_table[1,1]))\n", 150 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 151 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 152 | "\n", 153 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 154 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 155 | "print('f1 score: ', (2*p*r)/(p+r))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 17, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "svm = SVC(kernel=\"poly\", C=0.025,random_state=101)\n", 165 | "svm.fit(x_train, y_train)\n", 166 | "y_pred=svm.predict(x_test)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 18, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "[[11705 730]\n", 179 | " [ 1838 2008]]\n", 180 | "accuracy: 0.8422701308273448\n", 181 | "precision: 0.733382030679328\n", 182 | "recall: 0.5221008840353614\n", 183 | "f1 score: 0.6099635479951396\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1)\n", 189 | "test_calc.rename(columns={0: 'predicted'}, inplace=True)\n", 190 | "\n", 191 | "test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0)\n", 192 | "df_table = confusion_matrix(test_calc['y'],test_calc['predicted'])\n", 193 | "print (df_table)\n", 194 | "\n", 195 | "print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1]))\n", 196 | "print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1]))\n", 197 | "print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0]))\n", 198 | "\n", 199 | "p = df_table[1,1] / (df_table[1,1] + df_table[0,1])\n", 200 | "r = df_table[1,1] / (df_table[1,1] + df_table[1,0])\n", 201 | "print('f1 score: ', (2*p*r)/(p+r))" 202 | ] 203 | } 
204 | ], 205 | "metadata": { 206 | "kernelspec": { 207 | "display_name": "Python 3", 208 | "language": "python", 209 | "name": "python3" 210 | }, 211 | "language_info": { 212 | "codemirror_mode": { 213 | "name": "ipython", 214 | "version": 3 215 | }, 216 | "file_extension": ".py", 217 | "mimetype": "text/x-python", 218 | "name": "python", 219 | "nbconvert_exporter": "python", 220 | "pygments_lexer": "ipython3", 221 | "version": "3.6.8" 222 | } 223 | }, 224 | "nbformat": 4, 225 | "nbformat_minor": 4 226 | } 227 | -------------------------------------------------------------------------------- /12 Support Vector Machine/02 Support Vector Machine.py: -------------------------------------------------------------------------------- 1 | # Import the libraries 2 | import math 3 | import numpy as np 4 | import pandas as pd 5 | from datetime import datetime 6 | 7 | import seaborn as sns 8 | import matplotlib.pyplot as plt 9 | # %matplotlib inline  # IPython magic: valid only in a notebook, a syntax error in a plain .py script 10 | plt.style.use('seaborn-whitegrid') 11 | 12 | from sklearn.svm import SVC 13 | from sklearn.metrics import classification_report 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | # Import the dataset 18 | df = pd.read_csv('data/00 df.csv') 19 | 20 | 21 | # Split data into train & test 22 | train = df[df['flag']=='train'] 23 | test = df[df['flag']=='test'] 24 | 25 | cat_feats = ['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin'] 26 | 27 | y_train = train['y'] 28 | x_train = train[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 29 | x_train = pd.get_dummies(x_train,columns=cat_feats,drop_first=True) 30 | 31 | y_test = test['y'] 32 | x_test = test[['age_bin','capital_gl_bin','education_bin','hours_per_week_bin','msr_bin','occupation_bin','race_sex_bin']] 33 | x_test = pd.get_dummies(x_test,columns=cat_feats,drop_first=True) 34 | 35 | 36 | 37 | # Support Vector Machine 38 | svm = SVC(kernel="rbf", 
C=0.025,random_state=101) 39 | svm.fit(x_train, y_train) 40 | y_pred=svm.predict(x_test) 41 | 42 | 43 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 44 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 45 | 46 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 47 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 48 | print (df_table) 49 | 50 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 51 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 52 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 53 | 54 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 55 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 56 | print('f1 score: ', (2*p*r)/(p+r)) 57 | 58 | 59 | # SVC 60 | svm = SVC(kernel="linear", C=0.025,random_state=101) 61 | svm.fit(x_train, y_train) 62 | y_pred=svm.predict(x_test) 63 | 64 | 65 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 66 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 67 | 68 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 69 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 70 | print (df_table) 71 | 72 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 73 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 74 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 75 | 76 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 77 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 78 | print('f1 score: ', (2*p*r)/(p+r)) 79 | 80 | 81 | # SVC 82 | svm = SVC(kernel="poly", C=0.025,random_state=101) 83 | svm.fit(x_train, y_train) 84 | 
y_pred=svm.predict(x_test) 85 | 86 | 87 | test_calc = pd.concat([pd.DataFrame(y_test).reset_index(drop=True),pd.DataFrame(y_pred).reset_index(drop=True)],axis=1) 88 | test_calc.rename(columns={0: 'predicted'}, inplace=True) 89 | 90 | test_calc['predicted'] = test_calc['predicted'].apply(lambda x: 1 if x > 0.5 else 0) 91 | df_table = confusion_matrix(test_calc['y'],test_calc['predicted']) 92 | print (df_table) 93 | 94 | print('accuracy:', (df_table[0,0] + df_table[1,1]) / (df_table[0,0] + df_table[0,1] + df_table[1,0] + df_table[1,1])) 95 | print ('precision:', df_table[1,1] / (df_table[1,1] + df_table[0,1])) 96 | print('recall:', df_table[1,1] / (df_table[1,1] + df_table[1,0])) 97 | 98 | p = df_table[1,1] / (df_table[1,1] + df_table[0,1]) 99 | r = df_table[1,1] / (df_table[1,1] + df_table[1,0]) 100 | print('f1 score: ', (2*p*r)/(p+r)) 101 | 102 | -------------------------------------------------------------------------------- /12 Support Vector Machine/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /12 Support Vector Machine/data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Free Machine Learning 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above 
copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | --- 2 |

Machine-Learning-Classification

3 | 4 | --- 5 | --- 6 | ![1_9iNsnOm0I3YFF-A_ryp5Mg](https://user-images.githubusercontent.com/42931974/71458425-af86a200-27c8-11ea-9643-8e849b990cf9.jpeg) 7 | 8 | --- 9 | --- 10 |

Classification Process

11 | 12 | ![1_PAqzvCxPjpDN8RC9HQw45w](https://user-images.githubusercontent.com/42931974/71458473-de047d00-27c8-11ea-9bed-e3d4d3752f0b.jpeg) 13 | 14 | --- 15 | --- 16 | #### Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). 17 | #### For example, spam detection in email services can be framed as a classification problem. It is a binary classification problem because there are only two classes: spam and not spam. A classifier uses training data to learn how the input variables relate to the class. In this case, known spam and non-spam emails are used as the training data. Once the classifier has been trained well, it can be used to classify an unseen email. 18 | #### Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing. 19 | --------------------------------------------------------------------------------
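The spam example described in the README can be sketched in a few lines of plain Python. This is only a toy illustration of learning a mapping from inputs (email text) to a discrete class. The word-count scoring rule, the function names `train` and `predict`, and the four sample emails are all assumptions made up for this sketch. None of them come from the repository, whose scripts use scikit-learn classifiers instead.

```python
from collections import Counter


def train(emails):
    """Count how often each word appears in each class (hypothetical helper)."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in emails:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Return the class whose training words overlap most with the text.

    On a tie, max() keeps the first key, so the result falls back to "spam".
    """
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)


# Made-up labelled emails standing in for the "known spam and non-spam" data.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the project report", "not spam"),
]

model = train(training_data)
print(predict(model, "claim your free prize"))
print(predict(model, "monday project meeting"))
```

A real classifier would need far more data and a probabilistic or margin-based model. The sketch only shows the two phases the text describes: fitting on labelled examples, then assigning a class to an unseen input.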