├── Assignment 1.ipynb
├── Assignment 2.ipynb
├── Assignment 3.ipynb
├── Assignment 4.ipynb
├── Classifier Visualization.ipynb
├── Module 1.ipynb
├── Module 2.ipynb
├── Module 3.ipynb
├── Module 4.ipynb
├── README.md
├── Slides
│   ├── Module 1.pdf
│   ├── Module 2.pdf
│   ├── Module 3.pdf
│   ├── Module 4.pdf
│   └── Module 5.pdf
├── Unsupervised Learning.ipynb
├── __pycache__
│   ├── adspy_shared_utilities.cpython-35.pyc
│   └── adspy_shared_utilities.cpython-36.pyc
├── adspy_shared_utilities.py
├── adspy_temp.dot
├── fraud_data.csv
├── fruit_data_with_colors.txt
├── mushrooms.csv
└── readonly
    ├── Assignment 1.ipynb
    ├── Assignment 2.ipynb
    ├── Assignment 3.ipynb
    ├── Assignment 4.ipynb
    ├── Classifier Visualization.ipynb
    ├── CommViolPredUnnormalizedData.txt
    ├── Module 1.ipynb
    ├── Module 2.ipynb
    ├── Module 3.ipynb
    ├── Module 4.ipynb
    ├── Unsupervised Learning.ipynb
    ├── addresses.csv
    ├── adspy_shared_utilities.py
    ├── fraud_data.csv
    ├── fruit_data_with_colors.txt
    ├── latlons.csv
    ├── mushrooms.csv
    ├── polynomialreg1.png
    └── test.csv
/Assignment 3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 3 - Evaluation\n",
19 | "\n",
20 | "In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).\n",
21 | " \n",
22 | "Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. \n",
23 | " \n",
24 | "The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {
31 | "collapsed": true
32 | },
33 | "outputs": [],
34 | "source": [
35 | "import numpy as np\n",
36 | "import pandas as pd"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### Question 1\n",
44 | "Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?\n",
45 | "\n",
46 | "*This function should return a float between 0 and 1.* "
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "metadata": {
53 | "collapsed": false
54 | },
55 | "outputs": [
56 | {
57 | "data": {
58 | "text/plain": [
59 | "0.016410823768035772"
60 | ]
61 | },
62 | "execution_count": 2,
63 | "metadata": {},
64 | "output_type": "execute_result"
65 | }
66 | ],
67 | "source": [
68 | "def answer_one():\n",
69 | " \n",
70 | " df = pd.read_csv('fraud_data.csv')\n",
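"    # the target is the last column: 1 = fraud, 0 = not fraud\n",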
71 | " seg = df.iloc[:,-1]\n",
72 | " return float(seg[seg.values==1].count()/seg.count())\n",
73 | "\n",
74 | "answer_one()"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {
81 | "collapsed": false
82 | },
83 | "outputs": [],
84 | "source": [
85 | "# Use X_train, X_test, y_train, y_test for all of the following questions\n",
86 | "from sklearn.model_selection import train_test_split\n",
87 | "\n",
88 | "df = pd.read_csv('fraud_data.csv')\n",
89 | "\n",
90 | "X = df.iloc[:,:-1]\n",
91 | "y = df.iloc[:,-1]\n",
92 | "\n",
93 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
94 | "\n",
95 | "#df"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "### Question 2\n",
103 | "\n",
104 | "Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?\n",
105 | "\n",
106 | "*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 4,
112 | "metadata": {
113 | "collapsed": false
114 | },
115 | "outputs": [
116 | {
117 | "data": {
118 | "text/plain": [
119 | "(0.98303522035773561, 0.0)"
120 | ]
121 | },
122 | "execution_count": 4,
123 | "metadata": {},
124 | "output_type": "execute_result"
125 | }
126 | ],
127 | "source": [
128 | "def answer_two():\n",
129 | " from sklearn.dummy import DummyClassifier\n",
130 | " from sklearn.metrics import recall_score\n",
131 | " \n",
132 | " dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train,y_train)\n",
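"    # 'most_frequent' always predicts the majority class (0), so recall on the fraud class is 0\n",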
133 | " y_dummy_pred = dummy_majority.predict(X_test)\n",
134 | "    return (dummy_majority.score(X_test,y_test),recall_score(y_test,y_dummy_pred))\n",
135 | "\n",
136 | "answer_two()"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "### Question 3\n",
144 | "\n",
145 | "Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?\n",
146 | "\n",
147 | "*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 5,
153 | "metadata": {
154 | "collapsed": false
155 | },
156 | "outputs": [
157 | {
158 | "data": {
159 | "text/plain": [
160 | "(0.99784866924826354, 0.375, 1.0)"
161 | ]
162 | },
163 | "execution_count": 5,
164 | "metadata": {},
165 | "output_type": "execute_result"
166 | }
167 | ],
168 | "source": [
169 | "def answer_three():\n",
170 | " from sklearn.metrics import recall_score, precision_score\n",
171 | " from sklearn.svm import SVC\n",
172 | "\n",
173 | " clf = SVC(kernel='rbf').fit(X_train,y_train)\n",
174 | " pred = clf.predict(X_test)\n",
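"    # report test-set accuracy, recall, and precision\n",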
175 | "    return (clf.score(X_test,y_test),recall_score(y_test,pred),precision_score(y_test,pred))\n",
176 | "\n",
177 | "answer_three()"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "### Question 4\n",
185 | "\n",
186 | "Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.\n",
187 | "\n",
188 | "*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 6,
194 | "metadata": {
195 | "collapsed": false
196 | },
197 | "outputs": [
198 | {
199 | "data": {
200 | "text/plain": [
201 | "array([[5320, 24],\n",
202 | " [ 14, 66]])"
203 | ]
204 | },
205 | "execution_count": 6,
206 | "metadata": {},
207 | "output_type": "execute_result"
208 | }
209 | ],
210 | "source": [
211 | "def answer_four():\n",
212 | " from sklearn.metrics import confusion_matrix\n",
213 | " from sklearn.svm import SVC\n",
214 | " \n",
215 | " svm = SVC(C=1e9,gamma=1e-07).fit(X_train,y_train)\n",
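"    # label a transaction as fraud when its decision score exceeds -220 (the default threshold is 0)\n",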
216 | " y_pred = svm.decision_function(X_test) > -220\n",
217 | " return confusion_matrix(y_test,y_pred)\n",
218 | "\n",
219 | "answer_four()"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "### Question 5\n",
227 | "\n",
228 | "Train a logistic regression classifier with default parameters using X_train and y_train.\n",
229 | "\n",
230 | "For the logistic regression classifier, create a precision-recall curve and a ROC curve using y_test and the probability estimates for X_test (probability it is fraud).\n",
231 | "\n",
232 | "Looking at the precision-recall curve, what is the recall when the precision is `0.75`?\n",
233 | "\n",
234 | "Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?\n",
235 | "\n",
236 | "*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 7,
242 | "metadata": {
243 | "collapsed": false
244 | },
245 | "outputs": [
246 | {
247 | "data": {
248 | "text/plain": [
249 | "(0.83, 0.94)"
250 | ]
251 | },
252 | "execution_count": 7,
253 | "metadata": {},
254 | "output_type": "execute_result"
255 | }
256 | ],
257 | "source": [
258 | "def answer_five():\n",
259 | " \n",
260 | " #%matplotlib notebook\n",
261 | " #import matplotlib.pyplot as plt\n",
262 | " from sklearn.linear_model import LogisticRegression\n",
263 | " from sklearn.metrics import precision_recall_curve, roc_curve\n",
264 | " \n",
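"    # the classifier's decision_function scores feed both the PR and ROC curves\n",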
265 | " y_scores_lr = LogisticRegression().fit(X_train,y_train).decision_function(X_test)\n",
266 | " precision,recall,thresholds = precision_recall_curve(y_test,y_scores_lr)\n",
267 | " fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)\n",
268 | " \n",
269 | " #plt.figure()\n",
270 | " #plt.plot(precision,recall)\n",
271 | " #plt.plot(fpr_lr,tpr_lr)\n",
272 | " \n",
273 | "    return (0.83, 0.94)  # values read off the plotted precision-recall and ROC curves above\n",
274 | "\n",
275 | "answer_five()"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "### Question 6\n",
283 | "\n",
284 | "Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation.\n",
285 | "\n",
286 | "`'penalty': ['l1', 'l2']`\n",
287 | "\n",
288 | "`'C':[0.01, 0.1, 1, 10, 100]`\n",
289 | "\n",
290 | "From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.\n",
291 | "\n",
292 | "| \t| `l1` \t| `l2` \t|\n",
293 | "|:----:\t|----\t|----\t|\n",
294 | "| **`0.01`** \t| ?\t| ? \t|\n",
295 | "| **`0.1`** \t| ?\t| ? \t|\n",
296 | "| **`1`** \t| ?\t| ? \t|\n",
297 | "| **`10`** \t| ?\t| ? \t|\n",
298 | "| **`100`** \t| ?\t| ? \t|\n",
299 | "\n",
300 | "<br>\n",
301 | "\n",
302 | "*This function should return a 5 by 2 numpy array with 10 floats.* \n",
303 | "\n",
304 | "*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 8,
310 | "metadata": {
311 | "collapsed": false
312 | },
313 | "outputs": [
314 | {
315 | "data": {
316 | "text/plain": [
317 | "array([[ 0.66666667, 0.76086957],\n",
318 | " [ 0.80072464, 0.80434783],\n",
319 | " [ 0.8115942 , 0.8115942 ],\n",
320 | " [ 0.80797101, 0.8115942 ],\n",
321 | " [ 0.80797101, 0.80797101]])"
322 | ]
323 | },
324 | "execution_count": 8,
325 | "metadata": {},
326 | "output_type": "execute_result"
327 | }
328 | ],
329 | "source": [
330 | "def answer_six():\n",
331 | " \n",
332 | " from sklearn.model_selection import GridSearchCV\n",
333 | " from sklearn.linear_model import LogisticRegression\n",
334 | " from sklearn.model_selection import cross_val_score\n",
335 | " \n",
336 | " grid_values={'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10, 100]}\n",
337 | "    lr = LogisticRegression()\n",
338 | " \n",
339 | " # Number of folds included as a parameter in GridSearchCV. Default cv=3.\n",
340 | " lr_custom = GridSearchCV(lr,param_grid=grid_values,scoring='recall',cv=3)\n",
341 | " lr_custom.fit(X_train,y_train)\n",
342 | "\n",
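"    # cv_results_ orders combinations with penalty varying fastest, so reshape to 5 C-values x 2 penalties\n",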
343 | " return lr_custom.cv_results_['mean_test_score'].reshape(5,2)\n",
344 | "\n",
345 | "answer_six()"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 9,
351 | "metadata": {
352 | "collapsed": false
353 | },
354 | "outputs": [],
355 | "source": [
356 | "# Use the following function to help visualize results from the grid search\n",
357 | "def GridSearch_Heatmap(scores):\n",
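"    # heatmap rows correspond to C values, columns to the penalty type\n",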
358 | " %matplotlib notebook\n",
359 | " import seaborn as sns\n",
360 | " import matplotlib.pyplot as plt\n",
361 | " plt.figure()\n",
362 | " sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])\n",
363 | " plt.yticks(rotation=0);\n",
364 | "\n",
365 | "#GridSearch_Heatmap(answer_six())"
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {
372 | "collapsed": true
373 | },
374 | "outputs": [],
375 | "source": []
376 | }
377 | ],
378 | "metadata": {
379 | "coursera": {
380 | "course_slug": "python-machine-learning",
381 | "graded_item_id": "5yX9Z",
382 | "launcher_item_id": "eqnV3",
383 | "part_id": "Msnj0"
384 | },
385 | "kernelspec": {
386 | "display_name": "Python 3",
387 | "language": "python",
388 | "name": "python3"
389 | },
390 | "language_info": {
391 | "codemirror_mode": {
392 | "name": "ipython",
393 | "version": 3
394 | },
395 | "file_extension": ".py",
396 | "mimetype": "text/x-python",
397 | "name": "python",
398 | "nbconvert_exporter": "python",
399 | "pygments_lexer": "ipython3",
400 | "version": "3.6.2"
401 | }
402 | },
403 | "nbformat": 4,
404 | "nbformat_minor": 2
405 | }
406 |
--------------------------------------------------------------------------------
/Assignment 4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Assignment 4 - Understanding and Predicting Property Maintenance Fines\n",
19 | "\n",
20 | "This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). \n",
21 | "\n",
22 | "The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?\n",
23 | "\n",
24 | "The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.\n",
25 | "\n",
26 | "All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:\n",
27 | "\n",
28 | "* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)\n",
29 | "* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)\n",
30 | "* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)\n",
31 | "* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)\n",
32 | "* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)\n",
33 | "\n",
34 | "___\n",
35 | "\n",
36 | "We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.\n",
37 | "\n",
38 | "Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.\n",
39 | "\n",
40 | "<br>\n",
41 | "\n",
42 | "**File descriptions** (Use only this data for training your model!)\n",
43 | "\n",
44 | " readonly/train.csv - the training set (all tickets issued 2004-2011)\n",
45 | " readonly/test.csv - the test set (all tickets issued 2012-2016)\n",
46 | " readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. \n",
47 | " Note: misspelled addresses may be incorrectly geolocated.\n",
48 | "\n",
49 | "<br>\n",
50 | "\n",
51 | "**Data fields**\n",
52 | "\n",
53 | "train.csv & test.csv\n",
54 | "\n",
55 | " ticket_id - unique identifier for tickets\n",
56 | " agency_name - Agency that issued the ticket\n",
57 | " inspector_name - Name of inspector that issued the ticket\n",
58 | " violator_name - Name of the person/organization that the ticket was issued to\n",
59 | " violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred\n",
60 | " mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator\n",
61 | " ticket_issued_date - Date and time the ticket was issued\n",
62 | " hearing_date - Date and time the violator's hearing was scheduled\n",
63 | " violation_code, violation_description - Type of violation\n",
64 | "    disposition - Judgment and judgment type\n",
65 | " fine_amount - Violation fine amount, excluding fees\n",
66 | " admin_fee - $20 fee assigned to responsible judgments\n",
67 | "    state_fee - $10 fee assigned to responsible judgments\n",
68 | " late_fee - 10% fee assigned to responsible judgments\n",
69 | " discount_amount - discount applied, if any\n",
70 | " clean_up_cost - DPW clean-up or graffiti removal cost\n",
71 | " judgment_amount - Sum of all fines and fees\n",
72 | " grafitti_status - Flag for graffiti violations\n",
73 | " \n",
74 | "train.csv only\n",
75 | "\n",
76 | " payment_amount - Amount paid, if any\n",
77 | " payment_date - Date payment was made, if it was received\n",
78 | " payment_status - Current payment status as of Feb 1 2017\n",
79 | " balance_due - Fines and fees still owed\n",
80 | " collection_status - Flag for payments in collections\n",
81 | " compliance [target variable for prediction] \n",
82 | " Null = Not responsible\n",
83 | " 0 = Responsible, non-compliant\n",
84 | " 1 = Responsible, compliant\n",
85 | " compliance_detail - More information on why each ticket was marked compliant or non-compliant\n",
86 | "\n",
87 | "\n",
88 | "___\n",
89 | "\n",
90 | "## Evaluation\n",
91 | "\n",
92 | "Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.\n",
93 | "\n",
94 | "The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). \n",
95 | "\n",
96 | "Your grade will be based on the AUC score computed for your classifier. A model with an AUROC of 0.7 passes this assignment; one with over 0.75 will receive full points.\n",
97 | "___\n",
98 | "\n",
99 | "For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `readonly/train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `readonly/test.csv` will be paid, and the index being the ticket_id.\n",
100 | "\n",
101 | "Example:\n",
102 | "\n",
103 | " ticket_id\n",
104 | " 284932 0.531842\n",
105 | " 285362 0.401958\n",
106 | " 285361 0.105928\n",
107 | " 285338 0.018572\n",
108 | " ...\n",
109 | " 376499 0.208567\n",
110 | " 376500 0.818759\n",
111 | " 369851 0.018528\n",
112 | " Name: compliance, dtype: float32\n",
113 | " \n",
114 | "### Hints\n",
115 | "\n",
116 | "* Make sure your code is working before submitting it to the autograder.\n",
117 | "\n",
118 | "* Print out your result to see whether there is anything weird (e.g., all probabilities are the same).\n",
119 | "\n",
120 | "* Generally the total runtime should be less than 10 mins. You should NOT use Neural Network related classifiers (e.g., MLPClassifier) in this question. \n",
121 | "\n",
122 | "* Try to avoid global variables. If you have other functions besides blight_model, you should move those functions inside the scope of blight_model.\n",
123 | "\n",
124 | "* Refer to the pinned threads in Week 4's discussion forum when there is something you cannot figure out."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 1,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [],
134 | "source": [
135 | "#%matplotlib notebook\n",
136 | "import pandas as pd\n",
137 | "import numpy as np\n",
138 | "#import matplotlib.pyplot as plt\n",
139 | "pd.set_option('display.max_columns',50)\n",
140 | "\n",
141 | "from sklearn.ensemble import RandomForestRegressor\n",
142 | "from sklearn.ensemble import GradientBoostingClassifier\n",
143 | "from sklearn.model_selection import train_test_split\n",
144 | "from sklearn.model_selection import GridSearchCV\n",
145 | "from sklearn.model_selection import cross_val_score\n",
146 | "from sklearn.preprocessing import LabelEncoder\n",
147 | "from sklearn.metrics import roc_auc_score\n",
148 | "\n",
149 | "def blight_model():\n",
150 | " # Reading files:\n",
151 | " train = pd.read_csv('train.csv',encoding='ISO-8859-1')\n",
152 | " test = pd.read_csv('test.csv',encoding='ISO-8859-1')\n",
153 | " test.set_index(test['ticket_id'],inplace=True)\n",
154 | " \n",
155 | " # Cleaning:\n",
156 | " train.dropna(subset=['compliance'],inplace=True)\n",
157 | " train = train[train['country']=='USA']\n",
158 | " #test = test[test['country']=='USA']\n",
159 | "\n",
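"    # fit each LabelEncoder on train and test combined so categories unseen in training are still encoded\n",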
160 | " label_encoder = LabelEncoder()\n",
161 | "    label_encoder.fit(pd.concat([train['disposition'], test['disposition']], ignore_index=True))\n",
162 | " train['disposition'] = label_encoder.transform(train['disposition'])\n",
163 | " test['disposition'] = label_encoder.transform(test['disposition'])\n",
164 | "\n",
165 | " label_encoder = LabelEncoder()\n",
166 | "    label_encoder.fit(pd.concat([train['violation_code'], test['violation_code']], ignore_index=True))\n",
167 | " train['violation_code'] = label_encoder.transform(train['violation_code'])\n",
168 | " test['violation_code'] = label_encoder.transform(test['violation_code'])\n",
169 | "\n",
170 | " feature_names=['disposition','violation_code']\n",
171 | " X = train[feature_names]\n",
172 | " y = train['compliance']\n",
173 | " test = test[feature_names]\n",
174 | " X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)\n",
175 | "\n",
176 | " # grid search\n",
177 | "    model = RandomForestRegressor()  # regression on the 0/1 target yields probability-like scores\n",
178 | " param_grid = {'n_estimators':[5,7], 'max_depth':[5,10]}\n",
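"    # a small grid keeps the runtime well under the assignment's 10-minute guideline\n",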
179 | " grid_search = GridSearchCV(model, param_grid, scoring=\"roc_auc\")\n",
180 | " grid_result = grid_search.fit(X_train, y_train)\n",
181 | " \n",
182 | " # summarize results\n",
183 | " #print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n",
184 | " return pd.DataFrame(grid_result.predict(test),index=test.index,columns=['compliance'])"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 2,
190 | "metadata": {
191 | "collapsed": false
192 | },
193 | "outputs": [
194 | {
195 | "name": "stderr",
196 | "output_type": "stream",
197 | "text": [
198 | "/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2827: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.\n",
199 | " if self.run_code(code, result):\n"
200 | ]
201 | },
202 | {
203 | "name": "stdout",
204 | "output_type": "stream",
205 | "text": [
206 | "Best: 0.771883 using {'max_depth': 10, 'n_estimators': 7}\n"
207 | ]
208 | },
209 | {
210 | "data": {
474 | "text/plain": [
475 | " compliance\n",
476 | "ticket_id \n",
477 | "284932 0.180478\n",
478 | "285362 0.028986\n",
479 | "285361 0.062650\n",
480 | "285338 0.028986\n",
481 | "285346 0.084231\n",
482 | "285345 0.028986\n",
483 | "285347 0.064481\n",
484 | "285342 0.455129\n",
485 | "285530 0.028986\n",
486 | "284989 0.028986\n",
487 | "285344 0.051570\n",
488 | "285343 0.028986\n",
489 | "285340 0.028986\n",
490 | "285341 0.051570\n",
491 | "285349 0.084231\n",
492 | "285348 0.028986\n",
493 | "284991 0.028986\n",
494 | "285532 0.028986\n",
495 | "285406 0.028986\n",
496 | "285001 0.039157\n",
497 | "285006 0.066189\n",
498 | "285405 0.028986\n",
499 | "285337 0.028986\n",
500 | "285496 0.051570\n",
501 | "285497 0.028986\n",
502 | "285378 0.028986\n",
503 | "285589 0.028986\n",
504 | "285585 0.028986\n",
505 | "285501 0.062650\n",
506 | "285581 0.028986\n",
507 | "... ...\n",
508 | "376367 0.066189\n",
509 | "376366 0.039157\n",
510 | "376362 0.265933\n",
511 | "376363 0.312806\n",
512 | "376365 0.066189\n",
513 | "376364 0.039157\n",
514 | "376228 0.039157\n",
515 | "376265 0.039157\n",
516 | "376286 0.329721\n",
517 | "376320 0.039157\n",
518 | "376314 0.039157\n",
519 | "376327 0.329721\n",
520 | "376385 0.329721\n",
521 | "376435 0.112640\n",
522 | "376370 0.265933\n",
523 | "376434 0.064481\n",
524 | "376459 0.158937\n",
525 | "376478 0.030200\n",
526 | "376473 0.039157\n",
527 | "376484 0.012635\n",
528 | "376482 0.023206\n",
529 | "376480 0.012635\n",
530 | "376479 0.012635\n",
531 | "376481 0.012635\n",
532 | "376483 0.018715\n",
533 | "376496 0.006120\n",
534 | "376497 0.006120\n",
535 | "376499 0.084231\n",
536 | "376500 0.084231\n",
537 | "369851 0.051570\n",
538 | "\n",
539 | "[61001 rows x 1 columns]"
540 | ]
541 | },
542 | "execution_count": 2,
543 | "metadata": {},
544 | "output_type": "execute_result"
545 | }
546 | ],
547 | "source": [
548 | "blight_model()"
549 | ]
550 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "metadata": {
573 | "collapsed": true
574 | },
575 | "outputs": [],
576 | "source": []
577 | }
578 | ],
579 | "metadata": {
580 | "coursera": {
581 | "course_slug": "python-machine-learning",
582 | "graded_item_id": "nNS8l",
583 | "launcher_item_id": "yWWk7",
584 | "part_id": "w8BSS"
585 | },
586 | "kernelspec": {
587 | "display_name": "Python 3",
588 | "language": "python",
589 | "name": "python3"
590 | },
591 | "language_info": {
592 | "codemirror_mode": {
593 | "name": "ipython",
594 | "version": 3
595 | },
596 | "file_extension": ".py",
597 | "mimetype": "text/x-python",
598 | "name": "python",
599 | "nbconvert_exporter": "python",
600 | "pygments_lexer": "ipython3",
601 | "version": "3.6.2"
602 | }
603 | },
604 | "nbformat": 4,
605 | "nbformat_minor": 2
606 | }
607 |
--------------------------------------------------------------------------------
/Classifier Visualization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {
17 | "deletable": true,
18 | "editable": true
19 | },
20 | "source": [
21 | "# Classifier Visualization Playground\n",
22 | "\n",
23 | "The purpose of this notebook is to let you visualize various classifiers' decision boundaries.\n",
24 | "\n",
25 | "The data used in this notebook is based on the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `mushrooms.csv`. \n",
26 | "\n",
27 | "In order to better visualize the decision boundaries, we'll perform Principal Component Analysis (PCA) on the data to reduce the dimensionality to 2 dimensions. Dimensionality reduction will be covered in a later module of this course.\n",
28 | "\n",
29 | "Play around with different models and parameters to see how they affect the classifier's decision boundary and accuracy!"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {
36 | "collapsed": false,
37 | "deletable": true,
38 | "editable": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "%matplotlib notebook\n",
43 | "\n",
44 | "import pandas as pd\n",
45 | "import numpy as np\n",
46 | "import matplotlib.pyplot as plt\n",
47 | "from sklearn.decomposition import PCA\n",
48 | "from sklearn.model_selection import train_test_split\n",
49 | "\n",
50 | "df = pd.read_csv('readonly/mushrooms.csv')\n",
51 | "df2 = pd.get_dummies(df)\n",
52 | "\n",
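"# work with a small random sample (8%) to keep the interactive plots light\n",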
53 | "df3 = df2.sample(frac=0.08)\n",
54 | "\n",
55 | "X = df3.iloc[:,2:]\n",
56 | "y = df3.iloc[:,1]\n",
57 | "\n",
58 | "\n",
59 | "pca = PCA(n_components=2).fit_transform(X)\n",
60 | "\n",
61 | "X_train, X_test, y_train, y_test = train_test_split(pca, y, random_state=0)\n",
62 | "\n",
63 | "\n",
64 | "plt.figure(dpi=120)\n",
65 | "plt.scatter(pca[y.values==0,0], pca[y.values==0,1], alpha=0.5, label='Edible', s=2)\n",
66 | "plt.scatter(pca[y.values==1,0], pca[y.values==1,1], alpha=0.5, label='Poisonous', s=2)\n",
67 | "plt.legend()\n",
68 | "plt.title('Mushroom Data Set\\nFirst Two Principal Components')\n",
69 | "plt.xlabel('PC1')\n",
70 | "plt.ylabel('PC2')\n",
71 | "plt.gca().set_aspect('equal')"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {
78 | "collapsed": false
79 | },
80 | "outputs": [],
81 | "source": [
82 | "def plot_mushroom_boundary(X, y, fitted_model):\n",
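"    # draws the decision boundary and, when the model supports predict_proba, the predicted probabilities\n",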
83 | "\n",
84 | " plt.figure(figsize=(9.8,5), dpi=100)\n",
85 | " \n",
86 | " for i, plot_type in enumerate(['Decision Boundary', 'Decision Probabilities']):\n",
87 | " plt.subplot(1,2,i+1)\n",
88 | "\n",
89 | " mesh_step_size = 0.01 # step size in the mesh\n",
90 | " x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1\n",
91 | " y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1\n",
92 | " xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size), np.arange(y_min, y_max, mesh_step_size))\n",
93 | " if i == 0:\n",
94 | " Z = fitted_model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
95 | " else:\n",
96 | " try:\n",
97 | " Z = fitted_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]\n",
98 | "            except AttributeError:\n",
99 | " plt.text(0.4, 0.5, 'Probabilities Unavailable', horizontalalignment='center',\n",
100 | " verticalalignment='center', transform = plt.gca().transAxes, fontsize=12)\n",
101 | " plt.axis('off')\n",
102 | " break\n",
103 | " Z = Z.reshape(xx.shape)\n",
104 | " plt.scatter(X[y.values==0,0], X[y.values==0,1], alpha=0.4, label='Edible', s=5)\n",
105 | "        plt.scatter(X[y.values==1,0], X[y.values==1,1], alpha=0.4, label='Poisonous', s=5)\n",
106 | " plt.imshow(Z, interpolation='nearest', cmap='RdYlBu_r', alpha=0.15, \n",
107 | " extent=(x_min, x_max, y_min, y_max), origin='lower')\n",
108 | " plt.title(plot_type + '\\n' + \n",
109 | " str(fitted_model).split('(')[0]+ ' Test Accuracy: ' + str(np.round(fitted_model.score(X, y), 5)))\n",
110 | " plt.gca().set_aspect('equal');\n",
111 | " \n",
112 | " plt.tight_layout()\n",
113 | " plt.subplots_adjust(top=0.9, bottom=0.08, wspace=0.02)"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {
120 | "collapsed": false,
121 | "deletable": true,
122 | "editable": true,
123 | "scrolled": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "from sklearn.linear_model import LogisticRegression\n",
128 | "\n",
129 | "model = LogisticRegression()\n",
130 | "model.fit(X_train,y_train)\n",
131 | "\n",
132 | "plot_mushroom_boundary(X_test, y_test, model)"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {
139 | "collapsed": false,
140 | "deletable": true,
141 | "editable": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "from sklearn.neighbors import KNeighborsClassifier\n",
146 | "\n",
147 | "model = KNeighborsClassifier(n_neighbors=20)\n",
148 | "model.fit(X_train,y_train)\n",
149 | "\n",
150 | "plot_mushroom_boundary(X_test, y_test, model)"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "collapsed": false,
158 | "deletable": true,
159 | "editable": true
160 | },
161 | "outputs": [],
162 | "source": [
163 | "from sklearn.tree import DecisionTreeClassifier\n",
164 | "\n",
165 | "model = DecisionTreeClassifier(max_depth=3)\n",
166 | "model.fit(X_train,y_train)\n",
167 | "\n",
168 | "plot_mushroom_boundary(X_test, y_test, model)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": false,
176 | "deletable": true,
177 | "editable": true
178 | },
179 | "outputs": [],
180 | "source": [
181 | "from sklearn.tree import DecisionTreeClassifier\n",
182 | "\n",
183 | "model = DecisionTreeClassifier()\n",
184 | "model.fit(X_train,y_train)\n",
185 | "\n",
186 | "plot_mushroom_boundary(X_test, y_test, model)"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {
193 | "collapsed": false,
194 | "deletable": true,
195 | "editable": true
196 | },
197 | "outputs": [],
198 | "source": [
199 | "from sklearn.ensemble import RandomForestClassifier\n",
200 | "\n",
201 | "model = RandomForestClassifier()\n",
202 | "model.fit(X_train,y_train)\n",
203 | "\n",
204 | "plot_mushroom_boundary(X_test, y_test, model)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {
211 | "collapsed": false,
212 | "deletable": true,
213 | "editable": true
214 | },
215 | "outputs": [],
216 | "source": [
217 | "from sklearn.svm import SVC\n",
218 | "\n",
219 | "model = SVC(kernel='linear')\n",
220 | "model.fit(X_train,y_train)\n",
221 | "\n",
222 | "plot_mushroom_boundary(X_test, y_test, model)"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {
229 | "collapsed": false,
230 | "deletable": true,
231 | "editable": true
232 | },
233 | "outputs": [],
234 | "source": [
235 | "from sklearn.svm import SVC\n",
236 | "\n",
237 | "model = SVC(kernel='rbf', C=1)\n",
238 | "model.fit(X_train,y_train)\n",
239 | "\n",
240 | "plot_mushroom_boundary(X_test, y_test, model)"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {
247 | "collapsed": false,
248 | "deletable": true,
249 | "editable": true
250 | },
251 | "outputs": [],
252 | "source": [
253 | "from sklearn.svm import SVC\n",
254 | "\n",
255 | "model = SVC(kernel='rbf', C=10)\n",
256 | "model.fit(X_train,y_train)\n",
257 | "\n",
258 | "plot_mushroom_boundary(X_test, y_test, model)"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {
265 | "collapsed": false,
266 | "deletable": true,
267 | "editable": true
268 | },
269 | "outputs": [],
270 | "source": [
271 | "from sklearn.naive_bayes import GaussianNB\n",
272 | "\n",
273 | "model = GaussianNB()\n",
274 | "model.fit(X_train,y_train)\n",
275 | "\n",
276 | "plot_mushroom_boundary(X_test, y_test, model)"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": null,
282 | "metadata": {
283 | "collapsed": false,
284 | "deletable": true,
285 | "editable": true
286 | },
287 | "outputs": [],
288 | "source": [
289 | "from sklearn.neural_network import MLPClassifier\n",
290 | "\n",
291 | "model = MLPClassifier()\n",
292 | "model.fit(X_train,y_train)\n",
293 | "\n",
294 | "plot_mushroom_boundary(X_test, y_test, model)"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": []
305 | }
306 | ],
307 | "metadata": {
308 | "kernelspec": {
309 | "display_name": "Python 3",
310 | "language": "python",
311 | "name": "python3"
312 | },
313 | "language_info": {
314 | "codemirror_mode": {
315 | "name": "ipython",
316 | "version": 3
317 | },
318 | "file_extension": ".py",
319 | "mimetype": "text/x-python",
320 | "name": "python",
321 | "nbconvert_exporter": "python",
322 | "pygments_lexer": "ipython3",
323 | "version": "3.6.2"
324 | }
325 | },
326 | "nbformat": 4,
327 | "nbformat_minor": 2
328 | }
329 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Applied-Machine-Learning-in-Python
2 | University of Michigan on Coursera
3 |
4 | This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different from descriptive statistics, and introduce the scikit-learn toolkit through a tutorial. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit-learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross-validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write Python code to carry out an analysis.
5 |
--------------------------------------------------------------------------------
/Slides/Module 1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/Slides/Module 1.pdf
--------------------------------------------------------------------------------
/Slides/Module 2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/Slides/Module 2.pdf
--------------------------------------------------------------------------------
/Slides/Module 3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/Slides/Module 3.pdf
--------------------------------------------------------------------------------
/Slides/Module 4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/Slides/Module 4.pdf
--------------------------------------------------------------------------------
/Slides/Module 5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/Slides/Module 5.pdf
--------------------------------------------------------------------------------
/__pycache__/adspy_shared_utilities.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/__pycache__/adspy_shared_utilities.cpython-35.pyc
--------------------------------------------------------------------------------
/__pycache__/adspy_shared_utilities.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/__pycache__/adspy_shared_utilities.cpython-36.pyc
--------------------------------------------------------------------------------
/adspy_shared_utilities.py:
--------------------------------------------------------------------------------
1 | # version 1.1
2 |
3 | import numpy
4 | import pandas as pd
5 | import seaborn as sn
6 | import matplotlib.pyplot as plt
7 | import matplotlib.cm as cm
8 | from matplotlib.colors import ListedColormap, BoundaryNorm
9 | from sklearn import neighbors
10 | import matplotlib.patches as mpatches
11 | import graphviz
12 | from sklearn.tree import export_graphviz
14 |
15 | def load_crime_dataset():
16 | # Communities and Crime dataset for regression
17 | # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
18 |
19 | crime = pd.read_table('readonly/CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
20 | # remove features with poor coverage or lower relevance, and keep ViolentCrimesPerPop target column
21 | columns_to_keep = [5, 6] + list(range(11,26)) + list(range(32, 103)) + [145]
22 |     crime = crime.iloc[:, columns_to_keep].dropna()
23 |
24 |     X_crime = crime.iloc[:, 0:88]
25 | y_crime = crime['ViolentCrimesPerPop']
26 |
27 | return (X_crime, y_crime)
28 |
29 | def plot_decision_tree(clf, feature_names, class_names):
30 | # This function requires the pydotplus module and assumes it's been installed.
31 | # In some cases (typically under Windows) even after running conda install, there is a problem where the
32 | # pydotplus module is not found when running from within the notebook environment. The following code
33 | # may help to guarantee the module is installed in the current notebook environment directory.
34 | #
35 | # import sys; sys.executable
36 | # !{sys.executable} -m pip install pydotplus
37 |
38 | export_graphviz(clf, out_file="adspy_temp.dot", feature_names=feature_names, class_names=class_names, filled = True, impurity = False)
39 | with open("adspy_temp.dot") as f:
40 | dot_graph = f.read()
41 | # Alternate method using pydotplus, if installed.
42 | # graph = pydotplus.graphviz.graph_from_dot_data(dot_graph)
43 | # return graph.create_png()
44 | return graphviz.Source(dot_graph)
45 |
46 | def plot_feature_importances(clf, feature_names):
47 | c_features = len(feature_names)
48 | plt.barh(range(c_features), clf.feature_importances_)
49 | plt.xlabel("Feature importance")
50 | plt.ylabel("Feature name")
51 | plt.yticks(numpy.arange(c_features), feature_names)
52 |
53 | def plot_labelled_scatter(X, y, class_labels):
54 | num_labels = len(class_labels)
55 |
56 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
57 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
58 |
59 | marker_array = ['o', '^', '*']
60 | color_array = ['#FFFF00', '#00AAFF', '#000000', '#FF00AA']
61 | cmap_bold = ListedColormap(color_array)
62 | bnorm = BoundaryNorm(numpy.arange(0, num_labels + 1, 1), ncolors=num_labels)
63 | plt.figure()
64 |
65 | plt.scatter(X[:, 0], X[:, 1], s=65, c=y, cmap=cmap_bold, norm = bnorm, alpha = 0.40, edgecolor='black', lw = 1)
66 |
67 | plt.xlim(x_min, x_max)
68 | plt.ylim(y_min, y_max)
69 |
70 | h = []
71 | for c in range(0, num_labels):
72 | h.append(mpatches.Patch(color=color_array[c], label=class_labels[c]))
73 | plt.legend(handles=h)
74 |
75 | plt.show()
76 |
77 |
78 | def plot_class_regions_for_classifier_subplot(clf, X, y, X_test, y_test, title, subplot, target_names = None, plot_decision_regions = True):
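    # Plot decision regions for a fitted two-feature classifier on the given matplotlib subplot.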
79 |
80 | numClasses = numpy.amax(y) + 1
81 | color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
82 | color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
83 | cmap_light = ListedColormap(color_list_light[0:numClasses])
84 | cmap_bold = ListedColormap(color_list_bold[0:numClasses])
85 |
86 | h = 0.03
87 | k = 0.5
88 | x_plot_adjust = 0.1
89 | y_plot_adjust = 0.1
90 | plot_symbol_size = 50
91 |
92 | x_min = X[:, 0].min()
93 | x_max = X[:, 0].max()
94 | y_min = X[:, 1].min()
95 | y_max = X[:, 1].max()
96 | x2, y2 = numpy.meshgrid(numpy.arange(x_min-k, x_max+k, h), numpy.arange(y_min-k, y_max+k, h))
97 |
98 | P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
99 | P = P.reshape(x2.shape)
100 |
101 | if plot_decision_regions:
102 | subplot.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)
103 |
104 | subplot.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
105 | subplot.set_xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
106 | subplot.set_ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
107 |
108 | if (X_test is not None):
109 | subplot.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
110 | train_score = clf.score(X, y)
111 | test_score = clf.score(X_test, y_test)
112 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
113 |
114 | subplot.set_title(title)
115 |
116 | if (target_names is not None):
117 | legend_handles = []
118 | for i in range(0, len(target_names)):
119 | patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
120 | legend_handles.append(patch)
121 | subplot.legend(loc=0, handles=legend_handles)
122 |
123 |
124 | def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None, title=None, target_names = None, plot_decision_regions = True):
125 |
126 | numClasses = numpy.amax(y) + 1
127 | color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
128 | color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
129 | cmap_light = ListedColormap(color_list_light[0:numClasses])
130 | cmap_bold = ListedColormap(color_list_bold[0:numClasses])
131 |
132 | h = 0.03
133 | k = 0.5
134 | x_plot_adjust = 0.1
135 | y_plot_adjust = 0.1
136 | plot_symbol_size = 50
137 |
138 | x_min = X[:, 0].min()
139 | x_max = X[:, 0].max()
140 | y_min = X[:, 1].min()
141 | y_max = X[:, 1].max()
142 | x2, y2 = numpy.meshgrid(numpy.arange(x_min-k, x_max+k, h), numpy.arange(y_min-k, y_max+k, h))
143 |
144 | P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
145 | P = P.reshape(x2.shape)
146 | plt.figure()
147 | if plot_decision_regions:
148 | plt.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)
149 |
150 | plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
151 | plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
152 | plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
153 |
154 | if (X_test is not None):
155 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
156 | train_score = clf.score(X, y)
157 | test_score = clf.score(X_test, y_test)
158 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
159 |
160 | if (target_names is not None):
161 | legend_handles = []
162 | for i in range(0, len(target_names)):
163 | patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
164 | legend_handles.append(patch)
165 | plt.legend(loc=0, handles=legend_handles)
166 |
167 | if (title is not None):
168 | plt.title(title)
169 | plt.show()
170 |
171 | def plot_fruit_knn(X, y, n_neighbors, weights):
172 |     X_mat = X[['height', 'width']].values
173 |     y_mat = y.values
174 |
175 | # Create color maps
176 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF','#AFAFAF'])
177 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF','#AFAFAF'])
178 |
179 | clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
180 | clf.fit(X_mat, y_mat)
181 |
182 | # Plot the decision boundary by assigning a color in the color map
183 | # to each mesh point.
184 |
185 | mesh_step_size = .01 # step size in the mesh
186 | plot_symbol_size = 50
187 |
188 | x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
189 | y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
190 | xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size),
191 | numpy.arange(y_min, y_max, mesh_step_size))
192 | Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])
193 |
194 | # Put the result into a color plot
195 | Z = Z.reshape(xx.shape)
196 | plt.figure()
197 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
198 |
199 | # Plot training points
200 | plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black')
201 | plt.xlim(xx.min(), xx.max())
202 | plt.ylim(yy.min(), yy.max())
203 |
204 | patch0 = mpatches.Patch(color='#FF0000', label='apple')
205 | patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
206 | patch2 = mpatches.Patch(color='#0000FF', label='orange')
207 | patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
208 | plt.legend(handles=[patch0, patch1, patch2, patch3])
209 |
210 |
211 | plt.xlabel('height (cm)')
212 | plt.ylabel('width (cm)')
213 |
214 | plt.show()
215 |
216 | def plot_two_class_knn(X, y, n_neighbors, weights, X_test, y_test):
217 | X_mat = X
218 | y_mat = y
219 |
220 | # Create color maps
221 | cmap_light = ListedColormap(['#FFFFAA', '#AAFFAA', '#AAAAFF','#EFEFEF'])
222 | cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])
223 |
224 | clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
225 | clf.fit(X_mat, y_mat)
226 |
227 | # Plot the decision boundary by assigning a color in the color map
228 | # to each mesh point.
229 |
230 | mesh_step_size = .01 # step size in the mesh
231 | plot_symbol_size = 50
232 |
233 | x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
234 | y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
235 | xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size),
236 | numpy.arange(y_min, y_max, mesh_step_size))
237 | Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])
238 |
239 | # Put the result into a color plot
240 | Z = Z.reshape(xx.shape)
241 | plt.figure()
242 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
243 |
244 | # Plot training points
245 | plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black')
246 | plt.xlim(xx.min(), xx.max())
247 | plt.ylim(yy.min(), yy.max())
248 |
249 | title = "Neighbors = {}".format(n_neighbors)
250 | if (X_test is not None):
251 | train_score = clf.score(X_mat, y_mat)
252 | test_score = clf.score(X_test, y_test)
253 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
254 |
255 | patch0 = mpatches.Patch(color='#FFFF00', label='class 0')
256 | patch1 = mpatches.Patch(color='#000000', label='class 1')
257 | plt.legend(handles=[patch0, patch1])
258 |
259 | plt.xlabel('Feature 0')
260 | plt.ylabel('Feature 1')
261 | plt.title(title)
262 |
263 | plt.show()
264 |
--------------------------------------------------------------------------------
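Both plot_fruit_knn and plot_two_class_knn in adspy_shared_utilities.py above build their decision-boundary plots the same way: classify every point of a dense grid and color the plane by the predicted class. A minimal standalone sketch of that meshgrid technique, using synthetic data rather than the course datasets (all names here are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature, two-class data standing in for the course datasets.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Evaluate the classifier on a dense grid covering the feature space.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Color each grid cell by predicted class, then overlay the training points.
plt.pcolormesh(xx, yy, Z, cmap=ListedColormap(['#FFFFAA', '#AAAAFF']))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['#FFFF00', '#0000FF']),
            edgecolor='black')
plt.show()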
/adspy_temp.dot:
--------------------------------------------------------------------------------
1 | digraph Tree {
2 | node [shape=box, style="filled", color="black"] ;
3 | 0 [label="mean concave points <= 0.0489\nsamples = 426\nvalue = [159, 267]\nclass = benign", fillcolor="#399de567"] ;
4 | 1 [label="worst area <= 952.9\nsamples = 260\nvalue = [13, 247]\nclass = benign", fillcolor="#399de5f2"] ;
5 | 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
6 | 2 [label="area error <= 38.885\nsamples = 252\nvalue = [7, 245]\nclass = benign", fillcolor="#399de5f8"] ;
7 | 1 -> 2 ;
8 | 3 [label="worst texture <= 30.145\nsamples = 244\nvalue = [4, 240]\nclass = benign", fillcolor="#399de5fb"] ;
9 | 2 -> 3 ;
10 | 4 [label="samples = 212\nvalue = [0, 212]\nclass = benign", fillcolor="#399de5ff"] ;
11 | 3 -> 4 ;
12 | 5 [label="samples = 32\nvalue = [4, 28]\nclass = benign", fillcolor="#399de5db"] ;
13 | 3 -> 5 ;
14 | 6 [label="samples = 8\nvalue = [3, 5]\nclass = benign", fillcolor="#399de566"] ;
15 | 2 -> 6 ;
16 | 7 [label="samples = 8\nvalue = [6, 2]\nclass = malignant", fillcolor="#e58139aa"] ;
17 | 1 -> 7 ;
18 | 8 [label="worst area <= 785.8\nsamples = 166\nvalue = [146, 20]\nclass = malignant", fillcolor="#e58139dc"] ;
19 | 0 -> 8 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
20 | 9 [label="worst texture <= 23.74\nsamples = 30\nvalue = [13, 17]\nclass = benign", fillcolor="#399de53c"] ;
21 | 8 -> 9 ;
22 | 10 [label="samples = 14\nvalue = [0, 14]\nclass = benign", fillcolor="#399de5ff"] ;
23 | 9 -> 10 ;
24 | 11 [label="worst smoothness <= 0.1657\nsamples = 16\nvalue = [13, 3]\nclass = malignant", fillcolor="#e58139c4"] ;
25 | 9 -> 11 ;
26 | 12 [label="samples = 8\nvalue = [5, 3]\nclass = malignant", fillcolor="#e5813966"] ;
27 | 11 -> 12 ;
28 | 13 [label="samples = 8\nvalue = [8, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
29 | 11 -> 13 ;
30 | 14 [label="worst texture <= 19.385\nsamples = 136\nvalue = [133, 3]\nclass = malignant", fillcolor="#e58139f9"] ;
31 | 8 -> 14 ;
32 | 15 [label="samples = 9\nvalue = [6, 3]\nclass = malignant", fillcolor="#e581397f"] ;
33 | 14 -> 15 ;
34 | 16 [label="samples = 127\nvalue = [127, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
35 | 14 -> 16 ;
36 | }
--------------------------------------------------------------------------------
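adspy_temp.dot above is Graphviz output in the format produced by scikit-learn's export_graphviz; the root's 426 training samples and the malignant/benign class names match the breast cancer data split with random_state=0. A hedged sketch of how such a file could be regenerated (max_depth=4 is an assumption read off the deepest path in the tree above):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# max_depth=4 is an assumption; it matches the deepest path shown above.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# impurity=False matches the dot file above, which shows no gini values.
export_graphviz(tree, out_file='adspy_temp.dot',
                feature_names=cancer.feature_names,
                class_names=cancer.target_names,
                filled=True, impurity=False)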
/fruit_data_with_colors.txt:
--------------------------------------------------------------------------------
1 | fruit_label fruit_name fruit_subtype mass width height color_score
2 | 1 apple granny_smith 192 8.4 7.3 0.55
3 | 1 apple granny_smith 180 8.0 6.8 0.59
4 | 1 apple granny_smith 176 7.4 7.2 0.60
5 | 2 mandarin mandarin 86 6.2 4.7 0.80
6 | 2 mandarin mandarin 84 6.0 4.6 0.79
7 | 2 mandarin mandarin 80 5.8 4.3 0.77
8 | 2 mandarin mandarin 80 5.9 4.3 0.81
9 | 2 mandarin mandarin 76 5.8 4.0 0.81
10 | 1 apple braeburn 178 7.1 7.8 0.92
11 | 1 apple braeburn 172 7.4 7.0 0.89
12 | 1 apple braeburn 166 6.9 7.3 0.93
13 | 1 apple braeburn 172 7.1 7.6 0.92
14 | 1 apple braeburn 154 7.0 7.1 0.88
15 | 1 apple golden_delicious 164 7.3 7.7 0.70
16 | 1 apple golden_delicious 152 7.6 7.3 0.69
17 | 1 apple golden_delicious 156 7.7 7.1 0.69
18 | 1 apple golden_delicious 156 7.6 7.5 0.67
19 | 1 apple golden_delicious 168 7.5 7.6 0.73
20 | 1 apple cripps_pink 162 7.5 7.1 0.83
21 | 1 apple cripps_pink 162 7.4 7.2 0.85
22 | 1 apple cripps_pink 160 7.5 7.5 0.86
23 | 1 apple cripps_pink 156 7.4 7.4 0.84
24 | 1 apple cripps_pink 140 7.3 7.1 0.87
25 | 1 apple cripps_pink 170 7.6 7.9 0.88
26 | 3 orange spanish_jumbo 342 9.0 9.4 0.75
27 | 3 orange spanish_jumbo 356 9.2 9.2 0.75
28 | 3 orange spanish_jumbo 362 9.6 9.2 0.74
29 | 3 orange selected_seconds 204 7.5 9.2 0.77
30 | 3 orange selected_seconds 140 6.7 7.1 0.72
31 | 3 orange selected_seconds 160 7.0 7.4 0.81
32 | 3 orange selected_seconds 158 7.1 7.5 0.79
33 | 3 orange selected_seconds 210 7.8 8.0 0.82
34 | 3 orange selected_seconds 164 7.2 7.0 0.80
35 | 3 orange turkey_navel 190 7.5 8.1 0.74
36 | 3 orange turkey_navel 142 7.6 7.8 0.75
37 | 3 orange turkey_navel 150 7.1 7.9 0.75
38 | 3 orange turkey_navel 160 7.1 7.6 0.76
39 | 3 orange turkey_navel 154 7.3 7.3 0.79
40 | 3 orange turkey_navel 158 7.2 7.8 0.77
41 | 3 orange turkey_navel 144 6.8 7.4 0.75
42 | 3 orange turkey_navel 154 7.1 7.5 0.78
43 | 3 orange turkey_navel 180 7.6 8.2 0.79
44 | 3 orange turkey_navel 154 7.2 7.2 0.82
45 | 4 lemon spanish_belsan 194 7.2 10.3 0.70
46 | 4 lemon spanish_belsan 200 7.3 10.5 0.72
47 | 4 lemon spanish_belsan 186 7.2 9.2 0.72
48 | 4 lemon spanish_belsan 216 7.3 10.2 0.71
49 | 4 lemon spanish_belsan 196 7.3 9.7 0.72
50 | 4 lemon spanish_belsan 174 7.3 10.1 0.72
51 | 4 lemon unknown 132 5.8 8.7 0.73
52 | 4 lemon unknown 130 6.0 8.2 0.71
53 | 4 lemon unknown 116 6.0 7.5 0.72
54 | 4 lemon unknown 118 5.9 8.0 0.72
55 | 4 lemon unknown 120 6.0 8.4 0.74
56 | 4 lemon unknown 116 6.1 8.5 0.71
57 | 4 lemon unknown 116 6.3 7.7 0.72
58 | 4 lemon unknown 116 5.9 8.1 0.73
59 | 4 lemon unknown 152 6.5 8.5 0.72
60 | 4 lemon unknown 118 6.1 8.1 0.70
61 |
--------------------------------------------------------------------------------
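fruit_data_with_colors.txt above is a tab-separated table with a header row and 59 observations; height and width are the two features plot_fruit_knn expects. A minimal loading sketch:

import pandas as pd

# The file is tab-separated, so read_table picks up the header automatically.
fruits = pd.read_table('fruit_data_with_colors.txt')

X = fruits[['height', 'width']]   # features used by plot_fruit_knn
y = fruits['fruit_label']         # 1=apple, 2=mandarin, 3=orange, 4=lemon
print(fruits.shape)               # (59, 7)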
/readonly/Assignment 1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.3** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 1 - Introduction to Machine Learning"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "For this assignment, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below)."
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 1,
31 | "metadata": {
32 | "collapsed": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "import numpy as np\n",
37 | "import pandas as pd\n",
38 | "from sklearn.datasets import load_breast_cancer\n",
39 | "\n",
40 | "cancer = load_breast_cancer()\n",
41 | "\n",
42 | "#print(cancer.DESCR) # Print the data set description"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "The object returned by `load_breast_cancer()` is a scikit-learn Bunch object, which is similar to a dictionary."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 2,
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/plain": [
60 | "dict_keys(['feature_names', 'target_names', 'DESCR', 'data', 'target'])"
61 | ]
62 | },
63 | "execution_count": 2,
64 | "metadata": {},
65 | "output_type": "execute_result"
66 | }
67 | ],
68 | "source": [
69 | "cancer.keys()"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### Question 0 (Example)\n",
77 | "\n",
78 | "How many features does the breast cancer dataset have?\n",
79 | "\n",
80 | "*This function should return an integer.*"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 3,
86 | "metadata": {},
87 | "outputs": [
88 | {
89 | "data": {
90 | "text/plain": [
91 | "30"
92 | ]
93 | },
94 | "execution_count": 3,
95 | "metadata": {},
96 | "output_type": "execute_result"
97 | }
98 | ],
99 | "source": [
100 | "# You should write your whole answer within the function provided. The autograder will call\n",
101 | "# this function and compare the return value against the correct solution value\n",
102 | "def answer_zero():\n",
103 | " # This function returns the number of features of the breast cancer dataset, which is an integer. \n",
104 | " # The assignment question description will tell you the general format the autograder is expecting\n",
105 | " return len(cancer['feature_names'])\n",
106 | "\n",
107 | "# You can examine what your function returns by calling it in the cell. If you have questions\n",
108 | "# about the assignment formats, check out the discussion forums for any FAQs\n",
109 | "answer_zero() "
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "### Question 1\n",
117 | "\n",
118 | "Scikit-learn works with lists, numpy arrays, scipy-sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. Using a DataFrame does however help make many things easier such as munging data, so let's practice creating a classifier with a pandas DataFrame. \n",
119 | "\n",
120 | "\n",
121 | "\n",
122 | "Convert the sklearn.dataset `cancer` to a DataFrame. \n",
123 | "\n",
124 | "*This function should return a `(569, 31)` DataFrame with * \n",
125 | "\n",
126 | "*columns = *\n",
127 | "\n",
128 | " ['mean radius', 'mean texture', 'mean perimeter', 'mean area',\n",
129 | " 'mean smoothness', 'mean compactness', 'mean concavity',\n",
130 | " 'mean concave points', 'mean symmetry', 'mean fractal dimension',\n",
131 | " 'radius error', 'texture error', 'perimeter error', 'area error',\n",
132 | " 'smoothness error', 'compactness error', 'concavity error',\n",
133 | " 'concave points error', 'symmetry error', 'fractal dimension error',\n",
134 | " 'worst radius', 'worst texture', 'worst perimeter', 'worst area',\n",
135 | " 'worst smoothness', 'worst compactness', 'worst concavity',\n",
136 | " 'worst concave points', 'worst symmetry', 'worst fractal dimension',\n",
137 | " 'target']\n",
138 | "\n",
139 | "*and index = *\n",
140 | "\n",
141 | " RangeIndex(start=0, stop=569, step=1)"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 4,
147 | "metadata": {
148 | "collapsed": true
149 | },
150 | "outputs": [],
151 | "source": [
152 | "def answer_one():\n",
153 | " \n",
154 | " # Your code here\n",
155 | " \n",
156 | " return # Return your answer\n",
157 | "\n",
158 | "\n",
159 | "answer_one()"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "### Question 2\n",
167 | "What is the class distribution? (i.e. how many instances of `malignant` (encoded 0) and how many `benign` (encoded 1)?)\n",
168 | "\n",
169 | "*This function should return a Series named `target` of length 2 with integer values and index =* `['malignant', 'benign']`"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 5,
175 | "metadata": {
176 | "collapsed": true
177 | },
178 | "outputs": [],
179 | "source": [
180 | "def answer_two():\n",
181 | " cancerdf = answer_one()\n",
182 | " \n",
183 | " # Your code here\n",
184 | " \n",
185 | " return # Return your answer\n",
186 | "\n",
187 | "\n",
188 | "answer_two()"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "### Question 3\n",
196 | "Split the DataFrame into `X` (the data) and `y` (the labels).\n",
197 | "\n",
198 | "*This function should return a tuple of length 2:* `(X, y)`*, where* \n",
199 | "* `X`*, a pandas DataFrame, has shape* `(569, 30)`\n",
200 | "* `y`*, a pandas Series, has shape* `(569,)`."
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {
207 | "collapsed": true
208 | },
209 | "outputs": [],
210 | "source": [
211 | "def answer_three():\n",
212 | " cancerdf = answer_one()\n",
213 | " \n",
214 | " # Your code here\n",
215 | " \n",
216 | " return X, y"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "### Question 4\n",
224 | "Using `train_test_split`, split `X` and `y` into training and test sets `(X_train, X_test, y_train, and y_test)`.\n",
225 | "\n",
226 | "**Set the random number generator state to 0 using `random_state=0` to make sure your results match the autograder!**\n",
227 | "\n",
228 | "*This function should return a tuple of length 4:* `(X_train, X_test, y_train, y_test)`*, where* \n",
229 | "* `X_train` *has shape* `(426, 30)`\n",
230 | "* `X_test` *has shape* `(143, 30)`\n",
231 | "* `y_train` *has shape* `(426,)`\n",
232 | "* `y_test` *has shape* `(143,)`"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": null,
238 | "metadata": {
239 | "collapsed": true
240 | },
241 | "outputs": [],
242 | "source": [
243 | "from sklearn.model_selection import train_test_split\n",
244 | "\n",
245 | "def answer_four():\n",
246 | " X, y = answer_three()\n",
247 | " \n",
248 | " # Your code here\n",
249 | " \n",
250 | " return X_train, X_test, y_train, y_test"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "### Question 5\n",
258 | "Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with `X_train`, `y_train` and using one nearest neighbor (`n_neighbors = 1`).\n",
259 | "\n",
260 | "*This function should return a * `sklearn.neighbors.classification.KNeighborsClassifier`."
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": true
268 | },
269 | "outputs": [],
270 | "source": [
271 | "from sklearn.neighbors import KNeighborsClassifier\n",
272 | "\n",
273 | "def answer_five():\n",
274 | " X_train, X_test, y_train, y_test = answer_four()\n",
275 | " \n",
276 | " # Your code here\n",
277 | " \n",
278 | " return # Return your answer"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "### Question 6\n",
286 | "Using your knn classifier, predict the class label using the mean value for each feature.\n",
287 | "\n",
288 | "Hint: You can use `cancerdf.mean()[:-1].values.reshape(1, -1)` which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the precict method of KNeighborsClassifier).\n",
289 | "\n",
290 | "*This function should return a numpy array either `array([ 0.])` or `array([ 1.])`*"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": null,
296 | "metadata": {
297 | "collapsed": true
298 | },
299 | "outputs": [],
300 | "source": [
301 | "def answer_six():\n",
302 | " cancerdf = answer_one()\n",
303 | " means = cancerdf.mean()[:-1].values.reshape(1, -1)\n",
304 | " \n",
305 | " # Your code here\n",
306 | " \n",
307 | " return # Return your answer"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "### Question 7\n",
315 | "Using your knn classifier, predict the class labels for the test set `X_test`.\n",
316 | "\n",
317 | "*This function should return a numpy array with shape `(143,)` and values either `0.0` or `1.0`.*"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "def answer_seven():\n",
329 | " X_train, X_test, y_train, y_test = answer_four()\n",
330 | " knn = answer_five()\n",
331 | " \n",
332 | " # Your code here\n",
333 | " \n",
334 | " return # Return your answer"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "### Question 8\n",
342 | "Find the score (mean accuracy) of your knn classifier using `X_test` and `y_test`.\n",
343 | "\n",
344 | "*This function should return a float between 0 and 1*"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {
351 | "collapsed": true
352 | },
353 | "outputs": [],
354 | "source": [
355 | "def answer_eight():\n",
356 | " X_train, X_test, y_train, y_test = answer_four()\n",
357 | " knn = answer_five()\n",
358 | " \n",
359 | " # Your code here\n",
360 | " \n",
361 | " return # Return your answer"
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "metadata": {},
367 | "source": [
368 | "### Optional plot\n",
369 | "\n",
370 | "Try using the plotting function below to visualize the differet predicition scores between training and test sets, as well as malignant and benign cells."
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {
377 | "collapsed": true
378 | },
379 | "outputs": [],
380 | "source": [
381 | "def accuracy_plot():\n",
382 | " import matplotlib.pyplot as plt\n",
383 | "\n",
384 | " %matplotlib notebook\n",
385 | "\n",
386 | " X_train, X_test, y_train, y_test = answer_four()\n",
387 | "\n",
388 | " # Find the training and testing accuracies by target value (i.e. malignant, benign)\n",
389 | " mal_train_X = X_train[y_train==0]\n",
390 | " mal_train_y = y_train[y_train==0]\n",
391 | " ben_train_X = X_train[y_train==1]\n",
392 | " ben_train_y = y_train[y_train==1]\n",
393 | "\n",
394 | " mal_test_X = X_test[y_test==0]\n",
395 | " mal_test_y = y_test[y_test==0]\n",
396 | " ben_test_X = X_test[y_test==1]\n",
397 | " ben_test_y = y_test[y_test==1]\n",
398 | "\n",
399 | " knn = answer_five()\n",
400 | "\n",
401 | " scores = [knn.score(mal_train_X, mal_train_y), knn.score(ben_train_X, ben_train_y), \n",
402 | " knn.score(mal_test_X, mal_test_y), knn.score(ben_test_X, ben_test_y)]\n",
403 | "\n",
404 | "\n",
405 | " plt.figure()\n",
406 | "\n",
407 | " # Plot the scores as a bar chart\n",
408 | " bars = plt.bar(np.arange(4), scores, color=['#4c72b0','#4c72b0','#55a868','#55a868'])\n",
409 | "\n",
410 | " # directly label the score onto the bars\n",
411 | " for bar in bars:\n",
412 | " height = bar.get_height()\n",
413 | " plt.gca().text(bar.get_x() + bar.get_width()/2, height*.90, '{0:.{1}f}'.format(height, 2), \n",
414 | " ha='center', color='w', fontsize=11)\n",
415 | "\n",
416 | " # remove all the ticks (both axes), and tick labels on the Y axis\n",
417 | " plt.tick_params(top='off', bottom='off', left='off', right='off', labelleft='off', labelbottom='on')\n",
418 | "\n",
419 | " # remove the frame of the chart\n",
420 | " for spine in plt.gca().spines.values():\n",
421 | " spine.set_visible(False)\n",
422 | "\n",
423 | " plt.xticks([0,1,2,3], ['Malignant\\nTraining', 'Benign\\nTraining', 'Malignant\\nTest', 'Benign\\nTest'], alpha=0.8);\n",
424 | " plt.title('Training and Test Accuracies for Malignant and Benign Cells', alpha=0.8)"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "Uncomment the plotting function to see the visualization.\n",
432 | "\n",
433 | "**Comment out** the plotting function when submitting your notebook for grading. "
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": null,
439 | "metadata": {
440 | "collapsed": true
441 | },
442 | "outputs": [],
443 | "source": [
444 | "#accuracy_plot() "
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {
451 | "collapsed": true
452 | },
453 | "outputs": [],
454 | "source": []
455 | }
456 | ],
457 | "metadata": {
458 | "coursera": {
459 | "course_slug": "python-machine-learning",
460 | "graded_item_id": "f9SY5",
461 | "launcher_item_id": "oxndk",
462 | "part_id": "mh1Vo"
463 | },
464 | "kernelspec": {
465 | "display_name": "Python 3",
466 | "language": "python",
467 | "name": "python3"
468 | },
469 | "language_info": {
470 | "codemirror_mode": {
471 | "name": "ipython",
472 | "version": 3
473 | },
474 | "file_extension": ".py",
475 | "mimetype": "text/x-python",
476 | "name": "python",
477 | "nbconvert_exporter": "python",
478 | "pygments_lexer": "ipython3",
479 | "version": "3.5.2"
480 | }
481 | },
482 | "nbformat": 4,
483 | "nbformat_minor": 0
484 | }
485 |
--------------------------------------------------------------------------------
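The questions in readonly/Assignment 1.ipynb above walk through one continuous workflow: Bunch to DataFrame, feature/label split, train/test split, 1-NN fit, and test-set accuracy. A hedged end-to-end sketch of that workflow, one possible shape for the answers rather than the graded solution:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# Questions 1-3: build a (569, 31) DataFrame, then split off X and y.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
X, y = df[list(cancer.feature_names)], df['target']

# Question 4: random_state=0 so the split matches the autograder's.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Questions 5-8: fit a one-neighbor classifier and score it on the test set.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn.predict(df.mean()[:-1].values.reshape(1, -1)))  # Question 6's hint
print(knn.score(X_test, y_test))                          # mean test accuracy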
/readonly/Assignment 2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.5** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 2\n",
19 | "\n",
20 | "In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.\n",
21 | "\n",
22 | "## Part 1 - Regression"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "First, run the following block to set up the variables needed for later sections."
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {
36 | "collapsed": true,
37 | "scrolled": false
38 | },
39 | "outputs": [],
40 | "source": [
41 | "import numpy as np\n",
42 | "import pandas as pd\n",
43 | "from sklearn.model_selection import train_test_split\n",
44 | "\n",
45 | "\n",
46 | "np.random.seed(0)\n",
47 | "n = 15\n",
48 | "x = np.linspace(0,10,n) + np.random.randn(n)/5\n",
49 | "y = np.sin(x)+x/6 + np.random.randn(n)/10\n",
50 | "\n",
51 | "\n",
52 | "X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)\n",
53 | "\n",
54 | "# You can use this function to help you visualize the dataset by\n",
55 | "# plotting a scatterplot of the data points\n",
56 | "# in the training and test sets.\n",
57 | "def part1_scatter():\n",
58 | " import matplotlib.pyplot as plt\n",
59 | " %matplotlib notebook\n",
60 | " plt.figure()\n",
61 | " plt.scatter(X_train, y_train, label='training data')\n",
62 | " plt.scatter(X_test, y_test, label='test data')\n",
63 | " plt.legend(loc=4);\n",
64 | " \n",
65 | " \n",
66 | "# NOTE: Uncomment the function below to visualize the data, but be sure \n",
67 | "# to **re-comment it before submitting this assignment to the autograder**. \n",
68 | "#part1_scatter()"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "### Question 1\n",
76 | "\n",
77 | "Write a function that fits a polynomial LinearRegression model on the *training data* `X_train` for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. `np.linspace(0,10,100)`) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.\n",
78 | "\n",
79 | "
\n",
80 | "\n",
81 | "The figure above shows the fitted models plotted on top of the original data (using `plot_one()`).\n",
82 | "\n",
83 | "
\n",
84 | "*This function should return a numpy array with shape `(4, 100)`*"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {
91 | "collapsed": true
92 | },
93 | "outputs": [],
94 | "source": [
95 | "def answer_one():\n",
96 | " from sklearn.linear_model import LinearRegression\n",
97 | " from sklearn.preprocessing import PolynomialFeatures\n",
98 | "\n",
99 | " # Your code here\n",
100 | " \n",
101 | " return # Return your answer"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {
108 | "collapsed": true
109 | },
110 | "outputs": [],
111 | "source": [
112 | "# feel free to use the function plot_one() to replicate the figure \n",
113 | "# from the prompt once you have completed question one\n",
114 | "def plot_one(degree_predictions):\n",
115 | " import matplotlib.pyplot as plt\n",
116 | " %matplotlib notebook\n",
117 | " plt.figure(figsize=(10,5))\n",
118 | " plt.plot(X_train, y_train, 'o', label='training data', markersize=10)\n",
119 | " plt.plot(X_test, y_test, 'o', label='test data', markersize=10)\n",
120 | " for i,degree in enumerate([1,3,6,9]):\n",
121 | " plt.plot(np.linspace(0,10,100), degree_predictions[i], alpha=0.8, lw=2, label='degree={}'.format(degree))\n",
122 | " plt.ylim(-1,2.5)\n",
123 | " plt.legend(loc=4)\n",
124 | "\n",
125 | "#plot_one(answer_one())"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "### Question 2\n",
133 | "\n",
134 | "Write a function that fits a polynomial LinearRegression model on the training data `X_train` for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the the test data, and return both of these arrays in a tuple.\n",
135 | "\n",
136 | "*This function should return one tuple of numpy arrays `(r2_train, r2_test)`. Both arrays should have shape `(10,)`*"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {
143 | "collapsed": true
144 | },
145 | "outputs": [],
146 | "source": [
147 | "def answer_two():\n",
148 | " from sklearn.linear_model import LinearRegression\n",
149 | " from sklearn.preprocessing import PolynomialFeatures\n",
150 | " from sklearn.metrics.regression import r2_score\n",
151 | "\n",
152 | " # Your code here\n",
153 | "\n",
154 | " return # Your answer here"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "### Question 3\n",
162 | "\n",
163 | "Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset? \n",
164 | "\n",
165 | "Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship between degree level and $R^2$. Remember to comment out the import matplotlib line before submission.\n",
166 | "\n",
167 | "*This function should return one tuple with the degree values in this order: `(Underfitting, Overfitting, Good_Generalization)`. There might be multiple correct solutions, however, you only need to return one possible solution, for example, (1,2,3).* "
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {
174 | "collapsed": true
175 | },
176 | "outputs": [],
177 | "source": [
178 | "def answer_three():\n",
179 | " \n",
180 | " # Your code here\n",
181 | " \n",
182 | " return # Return your answer"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "### Question 4\n",
190 | "\n",
191 | "Training models on high degree polynomial features can result in overly complex models that overfit, so we often use regularized versions of the model to constrain model complexity, as we saw with Ridge and Lasso linear regression.\n",
192 | "\n",
193 | "For this question, train two models: a non-regularized LinearRegression model (default parameters) and a regularized Lasso Regression model (with parameters `alpha=0.01`, `max_iter=10000`) both on polynomial features of degree 12. Return the $R^2$ score for both the LinearRegression and Lasso model's test sets.\n",
194 | "\n",
195 | "*This function should return one tuple `(LinearRegression_R2_test_score, Lasso_R2_test_score)`*"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "def answer_four():\n",
207 | " from sklearn.preprocessing import PolynomialFeatures\n",
208 | " from sklearn.linear_model import Lasso, LinearRegression\n",
209 | " from sklearn.metrics.regression import r2_score\n",
210 | "\n",
211 | " # Your code here\n",
212 | "\n",
213 | " return # Your answer here"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "## Part 2 - Classification\n",
221 | "\n",
222 | "Here's an application of machine learning that could save your life! For this section of the assignment we will be working with the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `mushrooms.csv`. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:\n",
223 | "\n",
224 | "*Attribute Information:*\n",
225 | "\n",
226 | "1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s \n",
227 | "2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s \n",
228 | "3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y \n",
229 | "4. bruises?: bruises=t, no=f \n",
230 | "5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s \n",
231 | "6. gill-attachment: attached=a, descending=d, free=f, notched=n \n",
232 | "7. gill-spacing: close=c, crowded=w, distant=d \n",
233 | "8. gill-size: broad=b, narrow=n \n",
234 | "9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y \n",
235 | "10. stalk-shape: enlarging=e, tapering=t \n",
236 | "11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? \n",
237 | "12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s \n",
238 | "13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s \n",
239 | "14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y \n",
240 | "15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y \n",
241 | "16. veil-type: partial=p, universal=u \n",
242 | "17. veil-color: brown=n, orange=o, white=w, yellow=y \n",
243 | "18. ring-number: none=n, one=o, two=t \n",
244 | "19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z \n",
245 | "20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y \n",
246 | "21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y \n",
247 | "22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n",
248 | "\n",
249 | "
\n",
250 | "\n",
251 | "The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables. "
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {
258 | "collapsed": true
259 | },
260 | "outputs": [],
261 | "source": [
262 | "import pandas as pd\n",
263 | "import numpy as np\n",
264 | "from sklearn.model_selection import train_test_split\n",
265 | "\n",
266 | "\n",
267 | "mush_df = pd.read_csv('mushrooms.csv')\n",
268 | "mush_df2 = pd.get_dummies(mush_df)\n",
269 | "\n",
270 | "X_mush = mush_df2.iloc[:,2:]\n",
271 | "y_mush = mush_df2.iloc[:,1]\n",
272 | "\n",
273 | "# use the variables X_train2, y_train2 for Question 5\n",
274 | "X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)\n",
275 | "\n",
276 | "# For performance reasons in Questions 6 and 7, we will create a smaller version of the\n",
277 | "# entire mushroom dataset for use in those questions. For simplicity we'll just re-use\n",
278 | "# the 25% test split created above as the representative subset.\n",
279 | "#\n",
280 | "# Use the variables X_subset, y_subset for Questions 6 and 7.\n",
281 | "X_subset = X_test2\n",
282 | "y_subset = y_test2"
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "metadata": {},
288 | "source": [
289 | "### Question 5\n",
290 | "\n",
291 | "Using `X_train2` and `y_train2` from the preceeding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?\n",
292 | "\n",
293 | "As a reminder, the feature names are available in the `X_train2.columns` property, and the order of the features in `X_train2.columns` matches the order of the feature importance values in the classifier's `feature_importances_` property. \n",
294 | "\n",
295 | "*This function should return a list of length 5 containing the feature names in descending order of importance.*\n",
296 | "\n",
297 | "*Note: remember that you also need to set random_state in the DecisionTreeClassifier.*"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {
304 | "collapsed": true
305 | },
306 | "outputs": [],
307 | "source": [
308 | "def answer_five():\n",
309 | " from sklearn.tree import DecisionTreeClassifier\n",
310 | "\n",
311 | " # Your code here\n",
312 | "\n",
313 | " return # Your answer here"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "### Question 6\n",
321 | "\n",
322 | "For this question, we're going to use the `validation_curve` function in `sklearn.model_selection` to determine training and test scores for a Support Vector Classifier (`SVC`) with varying parameter values. Recall that the validation_curve function, in addition to taking an initialized unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.\n",
323 | "\n",
324 | "**Because creating a validation curve requires fitting multiple models, for performance reasons this question will use just a subset of the original mushroom dataset: please use the variables X_subset and y_subset as input to the validation curve function (instead of X_mush and y_mush) to reduce computation time.**\n",
325 | "\n",
326 | "The initialized unfitted classifier object we'll be using is a Support Vector Classifier with radial basis kernel. So your first step is to create an `SVC` object with default parameters (i.e. `kernel='rbf', C=1`) and `random_state=0`. Recall that the kernel width of the RBF kernel is controlled using the `gamma` parameter. \n",
327 | "\n",
328 | "With this classifier, and the dataset in X_subset, y_subset, explore the effect of `gamma` on classifier accuracy by using the `validation_curve` function to find the training and test scores for 6 values of `gamma` from `0.0001` to `10` (i.e. `np.logspace(-4,1,6)`). Recall that you can specify what scoring metric you want validation_curve to use by setting the \"scoring\" parameter. In this case, we want to use \"accuracy\" as the scoring metric.\n",
329 | "\n",
330 | "For each level of `gamma`, `validation_curve` will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.\n",
331 | "\n",
332 | "Find the mean score across the three models for each level of `gamma` for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.\n",
333 | "\n",
334 | "e.g.\n",
335 | "\n",
336 | "if one of your array of scores is\n",
337 | "\n",
338 | " array([[ 0.5, 0.4, 0.6],\n",
339 | " [ 0.7, 0.8, 0.7],\n",
340 | " [ 0.9, 0.8, 0.8],\n",
341 | " [ 0.8, 0.7, 0.8],\n",
342 | " [ 0.7, 0.6, 0.6],\n",
343 | " [ 0.4, 0.6, 0.5]])\n",
344 | " \n",
345 | "it should then become\n",
346 | "\n",
347 | " array([ 0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])\n",
348 | "\n",
349 | "*This function should return one tuple of numpy arrays `(training_scores, test_scores)` where each array in the tuple has shape `(6,)`.*"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {
356 | "collapsed": true
357 | },
358 | "outputs": [],
359 | "source": [
360 | "def answer_six():\n",
361 | " from sklearn.svm import SVC\n",
362 | " from sklearn.model_selection import validation_curve\n",
363 | "\n",
364 | " # Your code here\n",
365 | "\n",
366 | " return # Your answer here"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "### Question 7\n",
374 | "\n",
375 | "Based on the scores from question 6, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)? \n",
376 | "\n",
377 | "Hint: Try plotting the scores from question 6 to visualize the relationship between gamma and accuracy. Remember to comment out the import matplotlib line before submission.\n",
378 | "\n",
379 | "*This function should return one tuple with the degree values in this order: `(Underfitting, Overfitting, Good_Generalization)` Please note there is only one correct solution.*"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": null,
385 | "metadata": {
386 | "collapsed": true
387 | },
388 | "outputs": [],
389 | "source": [
390 | "def answer_seven():\n",
391 | " \n",
392 | " # Your code here\n",
393 | " \n",
394 | " return # Return your answer"
395 | ]
396 | }
397 | ],
398 | "metadata": {
399 | "coursera": {
400 | "course_slug": "python-machine-learning",
401 | "graded_item_id": "eWYHL",
402 | "launcher_item_id": "BAqef",
403 | "part_id": "fXXRp"
404 | },
405 | "kernelspec": {
406 | "display_name": "Python 3",
407 | "language": "python",
408 | "name": "python3"
409 | },
410 | "language_info": {
411 | "codemirror_mode": {
412 | "name": "ipython",
413 | "version": 3
414 | },
415 | "file_extension": ".py",
416 | "mimetype": "text/x-python",
417 | "name": "python",
418 | "nbconvert_exporter": "python",
419 | "pygments_lexer": "ipython3",
420 | "version": "3.5.2"
421 | }
422 | },
423 | "nbformat": 4,
424 | "nbformat_minor": 2
425 | }
426 |
--------------------------------------------------------------------------------
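Question 6 of readonly/Assignment 2.ipynb above describes the validation_curve call in detail. A hedged sketch of that call on a synthetic stand-in for X_subset and y_subset, so it runs on its own (cv=3 is set explicitly to match the "3 fits per level" the prompt describes; newer scikit-learn defaults to 5 folds):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the mushroom subset described in the prompt.
X_subset, y_subset = make_classification(n_samples=200, random_state=0)

param_range = np.logspace(-4, 1, 6)
train_scores, test_scores = validation_curve(
    SVC(kernel='rbf', C=1, random_state=0), X_subset, y_subset,
    param_name='gamma', param_range=param_range,
    cv=3, scoring='accuracy')               # two (6, 3) score arrays

# One mean score per gamma level, as the prompt asks for.
print(train_scores.mean(axis=1), test_scores.mean(axis=1))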
/readonly/Assignment 3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 3 - Evaluation\n",
19 | "\n",
20 | "In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).\n",
21 | " \n",
22 | "Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. \n",
23 | " \n",
24 | "The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "import numpy as np\n",
34 | "import pandas as pd"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "### Question 1\n",
42 | "Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?\n",
43 | "\n",
44 | "*This function should return a float between 0 and 1.* "
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": null,
50 | "metadata": {
51 | "collapsed": true
52 | },
53 | "outputs": [],
54 | "source": [
55 | "def answer_one():\n",
56 | " \n",
57 | " # Your code here\n",
58 | " \n",
59 | " return # Return your answer\n"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "# Use X_train, X_test, y_train, y_test for all of the following questions\n",
69 | "from sklearn.model_selection import train_test_split\n",
70 | "\n",
71 | "df = pd.read_csv('fraud_data.csv')\n",
72 | "\n",
73 | "X = df.iloc[:,:-1]\n",
74 | "y = df.iloc[:,-1]\n",
75 | "\n",
76 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "### Question 2\n",
84 | "\n",
85 | "Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?\n",
86 | "\n",
87 | "*This function should a return a tuple with two floats, i.e. `(accuracy score, recall score)`.*"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "def answer_two():\n",
97 | " from sklearn.dummy import DummyClassifier\n",
98 | " from sklearn.metrics import recall_score\n",
99 | " \n",
100 | " # Your code here\n",
101 | " \n",
102 | " return # Return your answer"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "### Question 3\n",
110 | "\n",
111 | "Using X_train, X_test, y_train, y_test (as defined above), train a SVC classifer using the default parameters. What is the accuracy, recall, and precision of this classifier?\n",
112 | "\n",
113 | "*This function should a return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "def answer_three():\n",
123 | " from sklearn.metrics import recall_score, precision_score\n",
124 | " from sklearn.svm import SVC\n",
125 | "\n",
126 | " # Your code here\n",
127 | " \n",
128 | " return # Return your answer"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### Question 4\n",
136 | "\n",
137 | "Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function. Use X_test and y_test.\n",
138 | "\n",
139 | "*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {
146 | "collapsed": true
147 | },
148 | "outputs": [],
149 | "source": [
150 | "def answer_four():\n",
151 | " from sklearn.metrics import confusion_matrix\n",
152 | " from sklearn.svm import SVC\n",
153 | "\n",
154 | " # Your code here\n",
155 | " \n",
156 | " return # Return your answer"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "### Question 5\n",
164 | "\n",
165 | "Train a logisitic regression classifier with default parameters using X_train and y_train.\n",
166 | "\n",
167 | "For the logisitic regression classifier, create a precision recall curve and a roc curve using y_test and the probability estimates for X_test (probability it is fraud).\n",
168 | "\n",
169 | "Looking at the precision recall curve, what is the recall when the precision is `0.75`?\n",
170 | "\n",
171 | "Looking at the roc curve, what is the true positive rate when the false positive rate is `0.16`?\n",
172 | "\n",
173 | "*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {
180 | "collapsed": true
181 | },
182 | "outputs": [],
183 | "source": [
184 | "def answer_five():\n",
185 | " \n",
186 | " # Your code here\n",
187 | " \n",
188 | " return # Return your answer"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "### Question 6\n",
196 | "\n",
197 | "Perform a grid search over the parameters listed below for a Logisitic Regression classifier, using recall for scoring and the default 3-fold cross validation.\n",
198 | "\n",
199 | "`'penalty': ['l1', 'l2']`\n",
200 | "\n",
201 | "`'C':[0.01, 0.1, 1, 10, 100]`\n",
202 | "\n",
203 | "From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.\n",
204 | "\n",
205 | "| \t| `l1` \t| `l2` \t|\n",
206 | "|:----:\t|----\t|----\t|\n",
207 | "| **`0.01`** \t| ?\t| ? \t|\n",
208 | "| **`0.1`** \t| ?\t| ? \t|\n",
209 | "| **`1`** \t| ?\t| ? \t|\n",
210 | "| **`10`** \t| ?\t| ? \t|\n",
211 | "| **`100`** \t| ?\t| ? \t|\n",
212 | "\n",
213 | "
\n",
214 | "\n",
215 | "*This function should return a 5 by 2 numpy array with 10 floats.* \n",
216 | "\n",
217 | "*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {
224 | "collapsed": true
225 | },
226 | "outputs": [],
227 | "source": [
228 | "def answer_six(): \n",
229 | " from sklearn.model_selection import GridSearchCV\n",
230 | " from sklearn.linear_model import LogisticRegression\n",
231 | "\n",
232 | " # Your code here\n",
233 | " \n",
234 | " return # Return your answer"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# Use the following function to help visualize results from the grid search\n",
244 | "def GridSearch_Heatmap(scores):\n",
245 | " %matplotlib notebook\n",
246 | " import seaborn as sns\n",
247 | " import matplotlib.pyplot as plt\n",
248 | " plt.figure()\n",
249 | " sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])\n",
250 | " plt.yticks(rotation=0);\n",
251 | "\n",
252 | "#GridSearch_Heatmap(answer_six())"
253 | ]
254 | }
255 | ],
256 | "metadata": {
257 | "coursera": {
258 | "course_slug": "python-machine-learning",
259 | "graded_item_id": "5yX9Z",
260 | "launcher_item_id": "eqnV3",
261 | "part_id": "Msnj0"
262 | },
263 | "kernelspec": {
264 | "display_name": "Python 3",
265 | "language": "python",
266 | "name": "python3"
267 | },
268 | "language_info": {
269 | "codemirror_mode": {
270 | "name": "ipython",
271 | "version": 3
272 | },
273 | "file_extension": ".py",
274 | "mimetype": "text/x-python",
275 | "name": "python",
276 | "nbconvert_exporter": "python",
277 | "pygments_lexer": "ipython3",
278 | "version": "3.5.2"
279 | }
280 | },
281 | "nbformat": 4,
282 | "nbformat_minor": 2
283 | }
284 |
--------------------------------------------------------------------------------
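Question 4 of readonly/Assignment 3.ipynb above hinges on thresholding the decision function rather than calling predict. A hedged sketch of the technique on synthetic imbalanced data (the sample size and class weights are illustrative stand-ins for fraud_data.csv; the SVC parameters and the -220 cutoff come from the prompt):

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced synthetic data standing in for fraud_data.csv.
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
scores = clf.decision_function(X_test)

# predict() is equivalent to thresholding at 0; lowering the cutoff flags more
# transactions as fraud, trading precision for recall. (-220 comes from the
# prompt; on this toy data the particular value is purely illustrative.)
y_pred = (scores > -220).astype(int)
print(confusion_matrix(y_test, y_pred))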
/readonly/Assignment 4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Assignment 4 - Understanding and Predicting Property Maintenance Fines\n",
19 | "\n",
20 | "This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). \n",
21 | "\n",
22 | "The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?\n",
23 | "\n",
24 | "The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.\n",
25 | "\n",
26 | "All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:\n",
27 | "\n",
28 | "* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)\n",
29 | "* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)\n",
30 | "* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)\n",
31 | "* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)\n",
32 | "* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)\n",
33 | "\n",
34 | "___\n",
35 | "\n",
36 | "We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.\n",
37 | "\n",
38 | "Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.\n",
39 | "\n",
40 | "
\n",
41 | "\n",
42 | "**File descriptions** (Use only this data for training your model!)\n",
43 | "\n",
44 | " train.csv - the training set (all tickets issued 2004-2011)\n",
45 | " test.csv - the test set (all tickets issued 2012-2016)\n",
46 | " addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. \n",
47 | " Note: misspelled addresses may be incorrectly geolocated.\n",
48 | "\n",
49 | "
\n",
50 | "\n",
51 | "**Data fields**\n",
52 | "\n",
53 | "train.csv & test.csv\n",
54 | "\n",
55 | " ticket_id - unique identifier for tickets\n",
56 | " agency_name - Agency that issued the ticket\n",
57 | " inspector_name - Name of inspector that issued the ticket\n",
58 | " violator_name - Name of the person/organization that the ticket was issued to\n",
59 | " violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred\n",
60 | " mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator\n",
61 | " ticket_issued_date - Date and time the ticket was issued\n",
62 | " hearing_date - Date and time the violator's hearing was scheduled\n",
63 | " violation_code, violation_description - Type of violation\n",
64 | " disposition - Judgment and judgement type\n",
65 | " fine_amount - Violation fine amount, excluding fees\n",
66 | " admin_fee - $20 fee assigned to responsible judgments\n",
67 | "state_fee - $10 fee assigned to responsible judgments\n",
68 | " late_fee - 10% fee assigned to responsible judgments\n",
69 | " discount_amount - discount applied, if any\n",
70 | " clean_up_cost - DPW clean-up or graffiti removal cost\n",
71 | " judgment_amount - Sum of all fines and fees\n",
72 | " grafitti_status - Flag for graffiti violations\n",
73 | " \n",
74 | "train.csv only\n",
75 | "\n",
76 | " payment_amount - Amount paid, if any\n",
77 | " payment_date - Date payment was made, if it was received\n",
78 | " payment_status - Current payment status as of Feb 1 2017\n",
79 | " balance_due - Fines and fees still owed\n",
80 | " collection_status - Flag for payments in collections\n",
81 | " compliance [target variable for prediction] \n",
82 | " Null = Not responsible\n",
83 | " 0 = Responsible, non-compliant\n",
84 | " 1 = Responsible, compliant\n",
85 | " compliance_detail - More information on why each ticket was marked compliant or non-compliant\n",
86 | "\n",
87 | "\n",
88 | "___\n",
89 | "\n",
90 | "## Evaluation\n",
91 | "\n",
92 | "Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.\n",
93 | "\n",
94 | "The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). \n",
95 | "\n",
96 | "Your grade will be based on the AUC score computed for your classifier. A model which with an AUROC of 0.7 passes this assignment, over 0.75 will recieve full points.\n",
97 | "___\n",
98 | "\n",
99 | "For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `test.csv` will be paid, and the index being the ticket_id.\n",
100 | "\n",
101 | "Example:\n",
102 | "\n",
103 | " ticket_id\n",
104 | " 284932 0.531842\n",
105 | " 285362 0.401958\n",
106 | " 285361 0.105928\n",
107 | " 285338 0.018572\n",
108 | " ...\n",
109 | " 376499 0.208567\n",
110 | " 376500 0.818759\n",
111 | " 369851 0.018528\n",
112 | " Name: compliance, dtype: float32\n",
113 | " \n",
114 | "### Hints\n",
115 | "\n",
116 | "* Make sure your code is working before submitting it to the autograder.\n",
117 | "\n",
118 | "* Print out your result to see whether there is anything weird (e.g., all probabilities are the same).\n",
119 | "\n",
120 | "* Generally the total runtime should be less than 10 mins. You should NOT use Neural Network related classifiers (e.g., MLPClassifier) in this question. \n",
121 | "\n",
122 | "* Try to avoid global variables. If you have other functions besides blight_model, you should move those functions inside the scope of blight_model.\n",
123 | "\n",
124 | "* Refer to the pinned threads in Week 4's discussion forum when there is something you could not figure it out."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "import pandas as pd\n",
134 | "import numpy as np\n",
135 | "\n",
136 | "def blight_model():\n",
137 | " \n",
138 | " # Your code here\n",
139 | " \n",
140 | " return # Your answer here"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": true
148 | },
149 | "outputs": [],
150 | "source": [
151 | "blight_model()"
152 | ]
153 | }
154 | ],
155 | "metadata": {
156 | "coursera": {
157 | "course_slug": "python-machine-learning",
158 | "graded_item_id": "nNS8l",
159 | "launcher_item_id": "yWWk7",
160 | "part_id": "w8BSS"
161 | },
162 | "kernelspec": {
163 | "display_name": "Python 3",
164 | "language": "python",
165 | "name": "python3"
166 | },
167 | "language_info": {
168 | "codemirror_mode": {
169 | "name": "ipython",
170 | "version": 3
171 | },
172 | "file_extension": ".py",
173 | "mimetype": "text/x-python",
174 | "name": "python",
175 | "nbconvert_exporter": "python",
176 | "pygments_lexer": "ipython3",
177 | "version": "3.5.2"
178 | }
179 | },
180 | "nbformat": 4,
181 | "nbformat_minor": 2
182 | }
183 |
--------------------------------------------------------------------------------
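The notebook above leaves `blight_model` as an empty stub. A minimal sketch of one way it could be filled in, assuming `train.csv` and `test.csv` sit alongside the notebook; the feature set (numeric fee/fine columns from the data dictionary above), the classifier, and the CSV read options are illustrative choices, not the graded solution:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    def blight_model():
        # Null compliance means "not responsible"; those rows are not scored,
        # so drop them from the training data.
        train = pd.read_csv('train.csv', encoding='ISO-8859-1', low_memory=False)
        train = train.dropna(subset=['compliance'])
        test = pd.read_csv('test.csv')

        # Illustrative features: numeric fee/fine columns present in both files.
        features = ['fine_amount', 'admin_fee', 'state_fee',
                    'late_fee', 'discount_amount', 'judgment_amount']
        X_train = train[features].fillna(0)
        y_train = train['compliance']
        X_test = test[features].fillna(0)

        clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

        # Probability of the positive class (paid on time), indexed by
        # ticket_id to match the expected output format shown above.
        return pd.Series(clf.predict_proba(X_test)[:, 1],
                         index=test['ticket_id'], name='compliance')
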
/readonly/Classifier Visualization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {
17 | "deletable": true,
18 | "editable": true
19 | },
20 | "source": [
21 | "# Classifier Visualization Playground\n",
22 | "\n",
23 | "The purpose of this notebook is to let you visualize various classsifiers' decision boundaries.\n",
24 | "\n",
25 | "The data used in this notebook is based on the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `mushrooms.csv`. \n",
26 | "\n",
27 | "In order to better vizualize the decision boundaries, we'll perform Principal Component Analysis (PCA) on the data to reduce the dimensionality to 2 dimensions. Dimensionality reduction will be covered in module 4 of this course.\n",
28 | "\n",
29 | "Play around with different models and parameters to see how they affect the classifier's decision boundary and accuracy!"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {
36 | "collapsed": false,
37 | "deletable": true,
38 | "editable": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "%matplotlib notebook\n",
43 | "\n",
44 | "import pandas as pd\n",
45 | "import numpy as np\n",
46 | "import matplotlib.pyplot as plt\n",
47 | "from sklearn.decomposition import PCA\n",
48 | "from sklearn.model_selection import train_test_split\n",
49 | "\n",
50 | "df = pd.read_csv('mushrooms.csv')\n",
51 | "df2 = pd.get_dummies(df)\n",
52 | "\n",
53 | "df3 = df2.sample(frac=0.08)\n",
54 | "\n",
55 | "X = df3.iloc[:,2:]\n",
56 | "y = df3.iloc[:,1]\n",
57 | "\n",
58 | "\n",
59 | "pca = PCA(n_components=2).fit_transform(X)\n",
60 | "\n",
61 | "X_train, X_test, y_train, y_test = train_test_split(pca, y, random_state=0)\n",
62 | "\n",
63 | "\n",
64 | "plt.figure(dpi=120)\n",
65 | "plt.scatter(pca[y.values==0,0], pca[y.values==0,1], alpha=0.5, label='Edible', s=2)\n",
66 | "plt.scatter(pca[y.values==1,0], pca[y.values==1,1], alpha=0.5, label='Poisonous', s=2)\n",
67 | "plt.legend()\n",
68 | "plt.title('Mushroom Data Set\\nFirst Two Principal Components')\n",
69 | "plt.xlabel('PC1')\n",
70 | "plt.ylabel('PC2')\n",
71 | "plt.gca().set_aspect('equal')"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {
78 | "collapsed": false
79 | },
80 | "outputs": [],
81 | "source": [
82 | "def plot_mushroom_boundary(X, y, fitted_model):\n",
83 | "\n",
84 | " plt.figure(figsize=(9.8,5), dpi=100)\n",
85 | " \n",
86 | " for i, plot_type in enumerate(['Decision Boundary', 'Decision Probabilities']):\n",
87 | " plt.subplot(1,2,i+1)\n",
88 | "\n",
89 | " mesh_step_size = 0.01 # step size in the mesh\n",
90 | " x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1\n",
91 | " y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1\n",
92 | " xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size), np.arange(y_min, y_max, mesh_step_size))\n",
93 | " if i == 0:\n",
94 | " Z = fitted_model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
95 | " else:\n",
96 | " try:\n",
97 | " Z = fitted_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]\n",
98 | " except:\n",
99 | " plt.text(0.4, 0.5, 'Probabilities Unavailable', horizontalalignment='center',\n",
100 | " verticalalignment='center', transform = plt.gca().transAxes, fontsize=12)\n",
101 | " plt.axis('off')\n",
102 | " break\n",
103 | " Z = Z.reshape(xx.shape)\n",
104 | " plt.scatter(X[y.values==0,0], X[y.values==0,1], alpha=0.4, label='Edible', s=5)\n",
105 | " plt.scatter(X[y.values==1,0], X[y.values==1,1], alpha=0.4, label='Posionous', s=5)\n",
106 | " plt.imshow(Z, interpolation='nearest', cmap='RdYlBu_r', alpha=0.15, \n",
107 | " extent=(x_min, x_max, y_min, y_max), origin='lower')\n",
108 | " plt.title(plot_type + '\\n' + \n",
109 | " str(fitted_model).split('(')[0]+ ' Test Accuracy: ' + str(np.round(fitted_model.score(X, y), 5)))\n",
110 | " plt.gca().set_aspect('equal');\n",
111 | " \n",
112 | " plt.tight_layout()\n",
113 | " plt.subplots_adjust(top=0.9, bottom=0.08, wspace=0.02)"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {
120 | "collapsed": false,
121 | "deletable": true,
122 | "editable": true,
123 | "scrolled": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "from sklearn.linear_model import LogisticRegression\n",
128 | "\n",
129 | "model = LogisticRegression()\n",
130 | "model.fit(X_train,y_train)\n",
131 | "\n",
132 | "plot_mushroom_boundary(X_test, y_test, model)"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {
139 | "collapsed": false,
140 | "deletable": true,
141 | "editable": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "from sklearn.neighbors import KNeighborsClassifier\n",
146 | "\n",
147 | "model = KNeighborsClassifier(n_neighbors=20)\n",
148 | "model.fit(X_train,y_train)\n",
149 | "\n",
150 | "plot_mushroom_boundary(X_test, y_test, model)"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "collapsed": false,
158 | "deletable": true,
159 | "editable": true
160 | },
161 | "outputs": [],
162 | "source": [
163 | "from sklearn.tree import DecisionTreeClassifier\n",
164 | "\n",
165 | "model = DecisionTreeClassifier(max_depth=3)\n",
166 | "model.fit(X_train,y_train)\n",
167 | "\n",
168 | "plot_mushroom_boundary(X_test, y_test, model)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": false,
176 | "deletable": true,
177 | "editable": true
178 | },
179 | "outputs": [],
180 | "source": [
181 | "from sklearn.tree import DecisionTreeClassifier\n",
182 | "\n",
183 | "model = DecisionTreeClassifier()\n",
184 | "model.fit(X_train,y_train)\n",
185 | "\n",
186 | "plot_mushroom_boundary(X_test, y_test, model)"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {
193 | "collapsed": false,
194 | "deletable": true,
195 | "editable": true
196 | },
197 | "outputs": [],
198 | "source": [
199 | "from sklearn.ensemble import RandomForestClassifier\n",
200 | "\n",
201 | "model = RandomForestClassifier()\n",
202 | "model.fit(X_train,y_train)\n",
203 | "\n",
204 | "plot_mushroom_boundary(X_test, y_test, model)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {
211 | "collapsed": false,
212 | "deletable": true,
213 | "editable": true
214 | },
215 | "outputs": [],
216 | "source": [
217 | "from sklearn.svm import SVC\n",
218 | "\n",
219 | "model = SVC(kernel='linear')\n",
220 | "model.fit(X_train,y_train)\n",
221 | "\n",
222 | "plot_mushroom_boundary(X_test, y_test, model)"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {
229 | "collapsed": false,
230 | "deletable": true,
231 | "editable": true
232 | },
233 | "outputs": [],
234 | "source": [
235 | "from sklearn.svm import SVC\n",
236 | "\n",
237 | "model = SVC(kernel='rbf', C=1)\n",
238 | "model.fit(X_train,y_train)\n",
239 | "\n",
240 | "plot_mushroom_boundary(X_test, y_test, model)"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {
247 | "collapsed": false,
248 | "deletable": true,
249 | "editable": true
250 | },
251 | "outputs": [],
252 | "source": [
253 | "from sklearn.svm import SVC\n",
254 | "\n",
255 | "model = SVC(kernel='rbf', C=10)\n",
256 | "model.fit(X_train,y_train)\n",
257 | "\n",
258 | "plot_mushroom_boundary(X_test, y_test, model)"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {
265 | "collapsed": false,
266 | "deletable": true,
267 | "editable": true
268 | },
269 | "outputs": [],
270 | "source": [
271 | "from sklearn.naive_bayes import GaussianNB\n",
272 | "\n",
273 | "model = GaussianNB()\n",
274 | "model.fit(X_train,y_train)\n",
275 | "\n",
276 | "plot_mushroom_boundary(X_test, y_test, model)"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": null,
282 | "metadata": {
283 | "collapsed": false,
284 | "deletable": true,
285 | "editable": true
286 | },
287 | "outputs": [],
288 | "source": [
289 | "from sklearn.neural_network import MLPClassifier\n",
290 | "\n",
291 | "model = MLPClassifier()\n",
292 | "model.fit(X_train,y_train)\n",
293 | "\n",
294 | "plot_mushroom_boundary(X_test, y_test, model)"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": []
305 | }
306 | ],
307 | "metadata": {
308 | "kernelspec": {
309 | "display_name": "Python 3",
310 | "language": "python",
311 | "name": "python3"
312 | },
313 | "language_info": {
314 | "codemirror_mode": {
315 | "name": "ipython",
316 | "version": 3
317 | },
318 | "file_extension": ".py",
319 | "mimetype": "text/x-python",
320 | "name": "python",
321 | "nbconvert_exporter": "python",
322 | "pygments_lexer": "ipython3",
323 | "version": "3.5.2"
324 | }
325 | },
326 | "nbformat": 4,
327 | "nbformat_minor": 2
328 | }
329 |
--------------------------------------------------------------------------------
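Every boundary plot in the notebook above relies on the same pattern: classify a dense grid of points spanning the two feature axes, then render the predicted labels as an image. A stripped-down sketch of just that pattern, assuming `model` is any fitted two-feature classifier and `X` is a 2-column NumPy array (the function name is ours):

    import numpy as np
    import matplotlib.pyplot as plt

    def shade_decision_regions(model, X, step=0.01):
        # Build a dense grid covering the range of both features.
        x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
        y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                             np.arange(y_min, y_max, step))
        # Classify every grid point, then reshape the labels back to the grid.
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        # Render the label grid as an image: each class becomes a shaded region.
        plt.imshow(Z, interpolation='nearest', cmap='RdYlBu_r', alpha=0.15,
                   extent=(x_min, x_max, y_min, y_max), origin='lower')

`plot_mushroom_boundary` adds a second panel that feeds the grid to `predict_proba` instead of `predict`, so the shading reflects prediction confidence rather than hard class labels.
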
/readonly/Module 1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Applied Machine Learning, Module 1: A simple classification task"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "### Import required modules and load data file"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "metadata": {
32 | "collapsed": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "%matplotlib notebook\n",
37 | "import numpy as np\n",
38 | "import matplotlib.pyplot as plt\n",
39 | "import pandas as pd\n",
40 | "from sklearn.model_selection import train_test_split\n",
41 | "\n",
42 | "fruits = pd.read_table('fruit_data_with_colors.txt')"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [],
52 | "source": [
53 | "fruits.head()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "collapsed": false
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# create a mapping from fruit label value to fruit name to make results easier to interpret\n",
65 | "lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique())) \n",
66 | "lookup_fruit_name"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "The file contains the mass, height, and width of a selection of oranges, lemons and apples. The heights were measured along the core of the fruit. The widths were the widest width perpendicular to the height."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "### Examining the data"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {
87 | "collapsed": false
88 | },
89 | "outputs": [],
90 | "source": [
91 | "# plotting a scatter matrix\n",
92 | "from matplotlib import cm\n",
93 | "\n",
94 | "X = fruits[['height', 'width', 'mass', 'color_score']]\n",
95 | "y = fruits['fruit_label']\n",
96 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
97 | "\n",
98 | "cmap = cm.get_cmap('gnuplot')\n",
99 | "scatter = pd.scatter_matrix(X_train, c= y_train, marker = 'o', s=40, hist_kwds={'bins':15}, figsize=(9,9), cmap=cmap)"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {
106 | "collapsed": false
107 | },
108 | "outputs": [],
109 | "source": [
110 | "# plotting a 3D scatter plot\n",
111 | "from mpl_toolkits.mplot3d import Axes3D\n",
112 | "\n",
113 | "fig = plt.figure()\n",
114 | "ax = fig.add_subplot(111, projection = '3d')\n",
115 | "ax.scatter(X_train['width'], X_train['height'], X_train['color_score'], c = y_train, marker = 'o', s=100)\n",
116 | "ax.set_xlabel('width')\n",
117 | "ax.set_ylabel('height')\n",
118 | "ax.set_zlabel('color_score')\n",
119 | "plt.show()"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "### Create train-test split"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {
133 | "collapsed": true
134 | },
135 | "outputs": [],
136 | "source": [
137 | "# For this example, we use the mass, width, and height features of each fruit instance\n",
138 | "X = fruits[['mass', 'width', 'height']]\n",
139 | "y = fruits['fruit_label']\n",
140 | "\n",
141 | "# default is 75% / 25% train-test split\n",
142 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### Create classifier object"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {
156 | "collapsed": true
157 | },
158 | "outputs": [],
159 | "source": [
160 | "from sklearn.neighbors import KNeighborsClassifier\n",
161 | "\n",
162 | "knn = KNeighborsClassifier(n_neighbors = 5)"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "### Train the classifier (fit the estimator) using the training data"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {
176 | "collapsed": false
177 | },
178 | "outputs": [],
179 | "source": [
180 | "knn.fit(X_train, y_train)"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "### Estimate the accuracy of the classifier on future data, using the test data"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": null,
193 | "metadata": {
194 | "collapsed": false
195 | },
196 | "outputs": [],
197 | "source": [
198 | "knn.score(X_test, y_test)"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "### Use the trained k-NN classifier model to classify new, previously unseen objects"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {
212 | "collapsed": false
213 | },
214 | "outputs": [],
215 | "source": [
216 | "# first example: a small fruit with mass 20g, width 4.3 cm, height 5.5 cm\n",
217 | "fruit_prediction = knn.predict([[20, 4.3, 5.5]])\n",
218 | "lookup_fruit_name[fruit_prediction[0]]"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {
225 | "collapsed": false
226 | },
227 | "outputs": [],
228 | "source": [
229 | "# second example: a larger, elongated fruit with mass 100g, width 6.3 cm, height 8.5 cm\n",
230 | "fruit_prediction = knn.predict([[100, 6.3, 8.5]])\n",
231 | "lookup_fruit_name[fruit_prediction[0]]"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "### Plot the decision boundaries of the k-NN classifier"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {
245 | "collapsed": false
246 | },
247 | "outputs": [],
248 | "source": [
249 | "from adspy_shared_utilities import plot_fruit_knn\n",
250 | "\n",
251 | "plot_fruit_knn(X_train, y_train, 5, 'uniform') # we choose 5 nearest neighbors"
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "### How sensitive is k-NN classification accuracy to the choice of the 'k' parameter?"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {
265 | "collapsed": false
266 | },
267 | "outputs": [],
268 | "source": [
269 | "k_range = range(1,20)\n",
270 | "scores = []\n",
271 | "\n",
272 | "for k in k_range:\n",
273 | " knn = KNeighborsClassifier(n_neighbors = k)\n",
274 | " knn.fit(X_train, y_train)\n",
275 | " scores.append(knn.score(X_test, y_test))\n",
276 | "\n",
277 | "plt.figure()\n",
278 | "plt.xlabel('k')\n",
279 | "plt.ylabel('accuracy')\n",
280 | "plt.scatter(k_range, scores)\n",
281 | "plt.xticks([0,5,10,15,20]);"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {},
287 | "source": [
288 | "### How sensitive is k-NN classification accuracy to the train/test split proportion?"
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": null,
294 | "metadata": {
295 | "collapsed": false
296 | },
297 | "outputs": [],
298 | "source": [
299 | "t = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]\n",
300 | "\n",
301 | "knn = KNeighborsClassifier(n_neighbors = 5)\n",
302 | "\n",
303 | "plt.figure()\n",
304 | "\n",
305 | "for s in t:\n",
306 | "\n",
307 | " scores = []\n",
308 | " for i in range(1,1000):\n",
309 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1-s)\n",
310 | " knn.fit(X_train, y_train)\n",
311 | " scores.append(knn.score(X_test, y_test))\n",
312 | " plt.plot(s, np.mean(scores), 'bo')\n",
313 | "\n",
314 | "plt.xlabel('Training set proportion (%)')\n",
315 | "plt.ylabel('accuracy');"
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "execution_count": null,
321 | "metadata": {
322 | "collapsed": true
323 | },
324 | "outputs": [],
325 | "source": []
326 | }
327 | ],
328 | "metadata": {
329 | "anaconda-cloud": {},
330 | "kernelspec": {
331 | "display_name": "Python 3",
332 | "language": "python",
333 | "name": "python3"
334 | },
335 | "language_info": {
336 | "codemirror_mode": {
337 | "name": "ipython",
338 | "version": 3
339 | },
340 | "file_extension": ".py",
341 | "mimetype": "text/x-python",
342 | "name": "python",
343 | "nbconvert_exporter": "python",
344 | "pygments_lexer": "ipython3",
345 | "version": "3.5.2"
346 | }
347 | },
348 | "nbformat": 4,
349 | "nbformat_minor": 1
350 | }
351 |
--------------------------------------------------------------------------------
/readonly/Module 3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "source": [
20 | "# Applied Machine Learning: Module 3 (Evaluation)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Evaluation for Classification"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "### Preamble"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {
41 | "collapsed": false
42 | },
43 | "outputs": [],
44 | "source": [
45 | "%matplotlib notebook\n",
46 | "import numpy as np\n",
47 | "import pandas as pd\n",
48 | "import seaborn as sns\n",
49 | "import matplotlib.pyplot as plt\n",
50 | "from sklearn.model_selection import train_test_split\n",
51 | "from sklearn.datasets import load_digits\n",
52 | "\n",
53 | "dataset = load_digits()\n",
54 | "X, y = dataset.data, dataset.target\n",
55 | "\n",
56 | "for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):\n",
57 | " print(class_name,class_count)"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "collapsed": false
65 | },
66 | "outputs": [],
67 | "source": [
68 | "# Creating a dataset with imbalanced binary classes: \n",
69 | "# Negative class (0) is 'not digit 1' \n",
70 | "# Positive class (1) is 'digit 1'\n",
71 | "y_binary_imbalanced = y.copy()\n",
72 | "y_binary_imbalanced[y_binary_imbalanced != 1] = 0\n",
73 | "\n",
74 | "print('Original labels:\\t', y[1:30])\n",
75 | "print('New binary labels:\\t', y_binary_imbalanced[1:30])"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {
82 | "collapsed": false,
83 | "scrolled": true
84 | },
85 | "outputs": [],
86 | "source": [
87 | "np.bincount(y_binary_imbalanced) # Negative class (0) is the most frequent class"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": false
95 | },
96 | "outputs": [],
97 | "source": [
98 | "X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)\n",
99 | "\n",
100 | "# Accuracy of Support Vector Machine classifier\n",
101 | "from sklearn.svm import SVC\n",
102 | "\n",
103 | "svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)\n",
104 | "svm.score(X_test, y_test)"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "### Dummy Classifiers"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {
117 | "collapsed": true
118 | },
119 | "source": [
120 | "DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against actual classifiers, especially with imbalanced classes."
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {
127 | "collapsed": false
128 | },
129 | "outputs": [],
130 | "source": [
131 | "from sklearn.dummy import DummyClassifier\n",
132 | "\n",
133 | "# Negative class (0) is most frequent\n",
134 | "dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)\n",
135 | "# Therefore the dummy 'most_frequent' classifier always predicts class 0\n",
136 | "y_dummy_predictions = dummy_majority.predict(X_test)\n",
137 | "\n",
138 | "y_dummy_predictions"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {
145 | "collapsed": false
146 | },
147 | "outputs": [],
148 | "source": [
149 | "dummy_majority.score(X_test, y_test)"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {
156 | "collapsed": false
157 | },
158 | "outputs": [],
159 | "source": [
160 | "svm = SVC(kernel='linear', C=1).fit(X_train, y_train)\n",
161 | "svm.score(X_test, y_test)"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "### Confusion matrices"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "#### Binary (two-class) confusion matrix"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {
182 | "collapsed": false
183 | },
184 | "outputs": [],
185 | "source": [
186 | "from sklearn.metrics import confusion_matrix\n",
187 | "\n",
188 | "# Negative class (0) is most frequent\n",
189 | "dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)\n",
190 | "y_majority_predicted = dummy_majority.predict(X_test)\n",
191 | "confusion = confusion_matrix(y_test, y_majority_predicted)\n",
192 | "\n",
193 | "print('Most frequent class (dummy classifier)\\n', confusion)"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "collapsed": false
201 | },
202 | "outputs": [],
203 | "source": [
204 | "# produces random predictions w/ same class proportion as training set\n",
205 | "dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)\n",
206 | "y_classprop_predicted = dummy_classprop.predict(X_test)\n",
207 | "confusion = confusion_matrix(y_test, y_classprop_predicted)\n",
208 | "\n",
209 | "print('Random class-proportional prediction (dummy classifier)\\n', confusion)"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {
216 | "collapsed": false,
217 | "scrolled": true
218 | },
219 | "outputs": [],
220 | "source": [
221 | "svm = SVC(kernel='linear', C=1).fit(X_train, y_train)\n",
222 | "svm_predicted = svm.predict(X_test)\n",
223 | "confusion = confusion_matrix(y_test, svm_predicted)\n",
224 | "\n",
225 | "print('Support vector machine classifier (linear kernel, C=1)\\n', confusion)"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "collapsed": false
233 | },
234 | "outputs": [],
235 | "source": [
236 | "from sklearn.linear_model import LogisticRegression\n",
237 | "\n",
238 | "lr = LogisticRegression().fit(X_train, y_train)\n",
239 | "lr_predicted = lr.predict(X_test)\n",
240 | "confusion = confusion_matrix(y_test, lr_predicted)\n",
241 | "\n",
242 | "print('Logistic regression classifier (default settings)\\n', confusion)"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "metadata": {
249 | "collapsed": false
250 | },
251 | "outputs": [],
252 | "source": [
253 | "from sklearn.tree import DecisionTreeClassifier\n",
254 | "\n",
255 | "dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)\n",
256 | "tree_predicted = dt.predict(X_test)\n",
257 | "confusion = confusion_matrix(y_test, tree_predicted)\n",
258 | "\n",
259 | "print('Decision tree classifier (max_depth = 2)\\n', confusion)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "### Evaluation metrics for binary classification"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {
273 | "collapsed": false
274 | },
275 | "outputs": [],
276 | "source": [
277 | "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
278 | "# Accuracy = TP + TN / (TP + TN + FP + FN)\n",
279 | "# Precision = TP / (TP + FP)\n",
280 | "# Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate\n",
281 | "# F1 = 2 * Precision * Recall / (Precision + Recall) \n",
282 | "print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))\n",
283 | "print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))\n",
284 | "print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))\n",
285 | "print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [],
295 | "source": [
296 | "# Combined report with all above metrics\n",
297 | "from sklearn.metrics import classification_report\n",
298 | "\n",
299 | "print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {
306 | "collapsed": false,
307 | "scrolled": false
308 | },
309 | "outputs": [],
310 | "source": [
311 | "print('Random class-proportional (dummy)\\n', \n",
312 | " classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))\n",
313 | "print('SVM\\n', \n",
314 | " classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))\n",
315 | "print('Logistic regression\\n', \n",
316 | " classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))\n",
317 | "print('Decision tree\\n', \n",
318 | " classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "### Decision functions"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [],
335 | "source": [
336 | "X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)\n",
337 | "y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)\n",
338 | "y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))\n",
339 | "\n",
340 | "# show the decision_function scores for first 20 instances\n",
341 | "y_score_list"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "collapsed": false
349 | },
350 | "outputs": [],
351 | "source": [
352 | "X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)\n",
353 | "y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)\n",
354 | "y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))\n",
355 | "\n",
356 | "# show the probability of positive class for first 20 instances\n",
357 | "y_proba_list"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "### Precision-recall curves"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "collapsed": false
372 | },
373 | "outputs": [],
374 | "source": [
375 | "from sklearn.metrics import precision_recall_curve\n",
376 | "\n",
377 | "precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)\n",
378 | "closest_zero = np.argmin(np.abs(thresholds))\n",
379 | "closest_zero_p = precision[closest_zero]\n",
380 | "closest_zero_r = recall[closest_zero]\n",
381 | "\n",
382 | "plt.figure()\n",
383 | "plt.xlim([0.0, 1.01])\n",
384 | "plt.ylim([0.0, 1.01])\n",
385 | "plt.plot(precision, recall, label='Precision-Recall Curve')\n",
386 | "plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)\n",
387 | "plt.xlabel('Precision', fontsize=16)\n",
388 | "plt.ylabel('Recall', fontsize=16)\n",
389 | "plt.axes().set_aspect('equal')\n",
390 | "plt.show()"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "### ROC curves, Area-Under-Curve (AUC)"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {
404 | "collapsed": false
405 | },
406 | "outputs": [],
407 | "source": [
408 | "from sklearn.metrics import roc_curve, auc\n",
409 | "\n",
410 | "X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)\n",
411 | "\n",
412 | "y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)\n",
413 | "fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)\n",
414 | "roc_auc_lr = auc(fpr_lr, tpr_lr)\n",
415 | "\n",
416 | "plt.figure()\n",
417 | "plt.xlim([-0.01, 1.00])\n",
418 | "plt.ylim([-0.01, 1.01])\n",
419 | "plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))\n",
420 | "plt.xlabel('False Positive Rate', fontsize=16)\n",
421 | "plt.ylabel('True Positive Rate', fontsize=16)\n",
422 | "plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)\n",
423 | "plt.legend(loc='lower right', fontsize=13)\n",
424 | "plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')\n",
425 | "plt.axes().set_aspect('equal')\n",
426 | "plt.show()"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": null,
432 | "metadata": {
433 | "collapsed": false,
434 | "scrolled": false
435 | },
436 | "outputs": [],
437 | "source": [
438 | "from matplotlib import cm\n",
439 | "\n",
440 | "X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)\n",
441 | "\n",
442 | "plt.figure()\n",
443 | "plt.xlim([-0.01, 1.00])\n",
444 | "plt.ylim([-0.01, 1.01])\n",
445 | "for g in [0.01, 0.1, 0.20, 1]:\n",
446 | " svm = SVC(gamma=g).fit(X_train, y_train)\n",
447 | " y_score_svm = svm.decision_function(X_test)\n",
448 | " fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)\n",
449 | " roc_auc_svm = auc(fpr_svm, tpr_svm)\n",
450 | " accuracy_svm = svm.score(X_test, y_test)\n",
451 | " print(\"gamma = {:.2f} accuracy = {:.2f} AUC = {:.2f}\".format(g, accuracy_svm, \n",
452 | " roc_auc_svm))\n",
453 | " plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7, \n",
454 | " label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))\n",
455 | "\n",
456 | "plt.xlabel('False Positive Rate', fontsize=16)\n",
457 | "plt.ylabel('True Positive Rate (Recall)', fontsize=16)\n",
458 | "plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')\n",
459 | "plt.legend(loc=\"lower right\", fontsize=11)\n",
460 | "plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)\n",
461 | "plt.axes().set_aspect('equal')\n",
462 | "\n",
463 | "plt.show()"
464 | ]
465 | },
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "### Evaluation measures for multi-class classification"
471 | ]
472 | },
473 | {
474 | "cell_type": "markdown",
475 | "metadata": {},
476 | "source": [
477 | "#### Multi-class confusion matrix"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": null,
483 | "metadata": {
484 | "collapsed": false,
485 | "scrolled": false
486 | },
487 | "outputs": [],
488 | "source": [
489 | "dataset = load_digits()\n",
490 | "X, y = dataset.data, dataset.target\n",
491 | "X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)\n",
492 | "\n",
493 | "\n",
494 | "svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)\n",
495 | "svm_predicted_mc = svm.predict(X_test_mc)\n",
496 | "confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)\n",
497 | "df_cm = pd.DataFrame(confusion_mc, \n",
498 | " index = [i for i in range(0,10)], columns = [i for i in range(0,10)])\n",
499 | "\n",
500 | "plt.figure(figsize=(5.5,4))\n",
501 | "sns.heatmap(df_cm, annot=True)\n",
502 | "plt.title('SVM Linear Kernel \\nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, \n",
503 | " svm_predicted_mc)))\n",
504 | "plt.ylabel('True label')\n",
505 | "plt.xlabel('Predicted label')\n",
506 | "\n",
507 | "\n",
508 | "svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)\n",
509 | "svm_predicted_mc = svm.predict(X_test_mc)\n",
510 | "confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)\n",
511 | "df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],\n",
512 | " columns = [i for i in range(0,10)])\n",
513 | "\n",
514 | "plt.figure(figsize = (5.5,4))\n",
515 | "sns.heatmap(df_cm, annot=True)\n",
516 | "plt.title('SVM RBF Kernel \\nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, \n",
517 | " svm_predicted_mc)))\n",
518 | "plt.ylabel('True label')\n",
519 | "plt.xlabel('Predicted label');"
520 | ]
521 | },
522 | {
523 | "cell_type": "markdown",
524 | "metadata": {},
525 | "source": [
526 | "#### Multi-class classification report"
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": null,
532 | "metadata": {
533 | "collapsed": false
534 | },
535 | "outputs": [],
536 | "source": [
537 | "print(classification_report(y_test_mc, svm_predicted_mc))"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "#### Micro- vs. macro-averaged metrics"
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": null,
550 | "metadata": {
551 | "collapsed": false
552 | },
553 | "outputs": [],
554 | "source": [
555 | "print('Micro-averaged precision = {:.2f} (treat instances equally)'\n",
556 | " .format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))\n",
557 | "print('Macro-averaged precision = {:.2f} (treat classes equally)'\n",
558 | " .format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))"
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {
565 | "collapsed": false
566 | },
567 | "outputs": [],
568 | "source": [
569 | "print('Micro-averaged f1 = {:.2f} (treat instances equally)'\n",
570 | " .format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))\n",
571 | "print('Macro-averaged f1 = {:.2f} (treat classes equally)'\n",
572 | " .format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 | "### Regression evaluation metrics"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": null,
585 | "metadata": {
586 | "collapsed": false
587 | },
588 | "outputs": [],
589 | "source": [
590 | "%matplotlib notebook\n",
591 | "import matplotlib.pyplot as plt\n",
592 | "import numpy as np\n",
593 | "from sklearn.model_selection import train_test_split\n",
594 | "from sklearn import datasets\n",
595 | "from sklearn.linear_model import LinearRegression\n",
596 | "from sklearn.metrics import mean_squared_error, r2_score\n",
597 | "from sklearn.dummy import DummyRegressor\n",
598 | "\n",
599 | "diabetes = datasets.load_diabetes()\n",
600 | "\n",
601 | "X = diabetes.data[:, None, 6]\n",
602 | "y = diabetes.target\n",
603 | "\n",
604 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
605 | "\n",
606 | "lm = LinearRegression().fit(X_train, y_train)\n",
607 | "lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)\n",
608 | "\n",
609 | "y_predict = lm.predict(X_test)\n",
610 | "y_predict_dummy_mean = lm_dummy_mean.predict(X_test)\n",
611 | "\n",
612 | "print('Linear model, coefficients: ', lm.coef_)\n",
613 | "print(\"Mean squared error (dummy): {:.2f}\".format(mean_squared_error(y_test, \n",
614 | " y_predict_dummy_mean)))\n",
615 | "print(\"Mean squared error (linear model): {:.2f}\".format(mean_squared_error(y_test, y_predict)))\n",
616 | "print(\"r2_score (dummy): {:.2f}\".format(r2_score(y_test, y_predict_dummy_mean)))\n",
617 | "print(\"r2_score (linear model): {:.2f}\".format(r2_score(y_test, y_predict)))\n",
618 | "\n",
619 | "# Plot outputs\n",
620 | "plt.scatter(X_test, y_test, color='black')\n",
621 | "plt.plot(X_test, y_predict, color='green', linewidth=2)\n",
622 | "plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed', \n",
623 | " linewidth=2, label = 'dummy')\n",
624 | "\n",
625 | "plt.show()"
626 | ]
627 | },
628 | {
629 | "cell_type": "markdown",
630 | "metadata": {},
631 | "source": [
632 | "### Model selection using evaluation metrics"
633 | ]
634 | },
635 | {
636 | "cell_type": "markdown",
637 | "metadata": {},
638 | "source": [
639 | "#### Cross-validation example"
640 | ]
641 | },
642 | {
643 | "cell_type": "code",
644 | "execution_count": null,
645 | "metadata": {
646 | "collapsed": false
647 | },
648 | "outputs": [],
649 | "source": [
650 | "from sklearn.model_selection import cross_val_score\n",
651 | "from sklearn.svm import SVC\n",
652 | "\n",
653 | "dataset = load_digits()\n",
654 | "# again, making this a binary problem with 'digit 1' as positive class \n",
655 | "# and 'not 1' as negative class\n",
656 | "X, y = dataset.data, dataset.target == 1\n",
657 | "clf = SVC(kernel='linear', C=1)\n",
658 | "\n",
659 | "# accuracy is the default scoring metric\n",
660 | "print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))\n",
661 | "# use AUC as scoring metric\n",
662 | "print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc'))\n",
663 | "# use recall as scoring metric\n",
664 | "print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring = 'recall'))"
665 | ]
666 | },
667 | {
668 | "cell_type": "markdown",
669 | "metadata": {},
670 | "source": [
671 | "#### Grid search example"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": null,
677 | "metadata": {
678 | "collapsed": false
679 | },
680 | "outputs": [],
681 | "source": [
682 | "from sklearn.svm import SVC\n",
683 | "from sklearn.model_selection import GridSearchCV\n",
684 | "from sklearn.metrics import roc_auc_score\n",
685 | "\n",
686 | "dataset = load_digits()\n",
687 | "X, y = dataset.data, dataset.target == 1\n",
688 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
689 | "\n",
690 | "clf = SVC(kernel='rbf')\n",
691 | "grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}\n",
692 | "\n",
693 | "# default metric to optimize over grid parameters: accuracy\n",
694 | "grid_clf_acc = GridSearchCV(clf, param_grid = grid_values)\n",
695 | "grid_clf_acc.fit(X_train, y_train)\n",
696 | "y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test) \n",
697 | "\n",
698 | "print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)\n",
699 | "print('Grid best score (accuracy): ', grid_clf_acc.best_score_)\n",
700 | "\n",
701 | "# alternative metric to optimize over grid parameters: AUC\n",
702 | "grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')\n",
703 | "grid_clf_auc.fit(X_train, y_train)\n",
704 | "y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test) \n",
705 | "\n",
706 | "print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))\n",
707 | "print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)\n",
708 | "print('Grid best score (AUC): ', grid_clf_auc.best_score_)\n"
709 | ]
710 | },
711 | {
712 | "cell_type": "markdown",
713 | "metadata": {},
714 | "source": [
715 | "#### Evaluation metrics supported for model selection"
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": null,
721 | "metadata": {
722 | "collapsed": false
723 | },
724 | "outputs": [],
725 | "source": [
726 | "from sklearn.metrics.scorer import SCORERS\n",
727 | "\n",
728 | "print(sorted(list(SCORERS.keys())))"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "### Two-feature classification example using the digits dataset"
736 | ]
737 | },
738 | {
739 | "cell_type": "markdown",
740 | "metadata": {},
741 | "source": [
742 | "#### Optimizing a classifier using different evaluation metrics"
743 | ]
744 | },
745 | {
746 | "cell_type": "code",
747 | "execution_count": null,
748 | "metadata": {
749 | "collapsed": false,
750 | "scrolled": false
751 | },
752 | "outputs": [],
753 | "source": [
754 | "from sklearn.datasets import load_digits\n",
755 | "from sklearn.model_selection import train_test_split\n",
756 | "from adspy_shared_utilities import plot_class_regions_for_classifier_subplot\n",
757 | "from sklearn.svm import SVC\n",
758 | "from sklearn.model_selection import GridSearchCV\n",
759 | "\n",
760 | "\n",
761 | "dataset = load_digits()\n",
762 | "X, y = dataset.data, dataset.target == 1\n",
763 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
764 | "\n",
765 | "# Create a two-feature input vector matching the example plot above\n",
766 | "# We jitter the points (add a small amount of random noise) in case there are areas\n",
767 | "# in feature space where many instances have the same features.\n",
768 | "jitter_delta = 0.25\n",
769 | "X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta\n",
770 | "X_twovar_test = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta\n",
771 | "\n",
772 | "clf = SVC(kernel = 'linear').fit(X_twovar_train, y_train)\n",
773 | "grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}\n",
774 | "plt.figure(figsize=(9,6))\n",
775 | "for i, eval_metric in enumerate(('precision','recall', 'f1','roc_auc')):\n",
776 | " grid_clf_custom = GridSearchCV(clf, param_grid=grid_values, scoring=eval_metric)\n",
777 | " grid_clf_custom.fit(X_twovar_train, y_train)\n",
778 | " print('Grid best parameter (max. {0}): {1}'\n",
779 | " .format(eval_metric, grid_clf_custom.best_params_))\n",
780 | " print('Grid best score ({0}): {1}'\n",
781 | " .format(eval_metric, grid_clf_custom.best_score_))\n",
782 | " plt.subplots_adjust(wspace=0.3, hspace=0.3)\n",
783 | " plot_class_regions_for_classifier_subplot(grid_clf_custom, X_twovar_test, y_test, None,\n",
784 | " None, None, plt.subplot(2, 2, i+1))\n",
785 | " \n",
786 | " plt.title(eval_metric+'-oriented SVC')\n",
787 | "plt.tight_layout()\n",
788 | "plt.show()"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "#### Precision-recall curve for the default SVC classifier (with balanced class weights)"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": null,
801 | "metadata": {
802 | "collapsed": false,
803 | "scrolled": false
804 | },
805 | "outputs": [],
806 | "source": [
807 | "from sklearn.model_selection import train_test_split\n",
808 | "from sklearn.metrics import precision_recall_curve\n",
809 | "from adspy_shared_utilities import plot_class_regions_for_classifier\n",
810 | "from sklearn.svm import SVC\n",
811 | "\n",
812 | "dataset = load_digits()\n",
813 | "X, y = dataset.data, dataset.target == 1\n",
814 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
815 | "\n",
816 | "# create a two-feature input vector matching the example plot above\n",
817 | "jitter_delta = 0.25\n",
818 | "X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta\n",
819 | "X_twovar_test = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta\n",
820 | "\n",
821 | "clf = SVC(kernel='linear', class_weight='balanced').fit(X_twovar_train, y_train)\n",
822 | "\n",
823 | "y_scores = clf.decision_function(X_twovar_test)\n",
824 | "\n",
825 | "precision, recall, thresholds = precision_recall_curve(y_test, y_scores)\n",
826 | "closest_zero = np.argmin(np.abs(thresholds))\n",
827 | "closest_zero_p = precision[closest_zero]\n",
828 | "closest_zero_r = recall[closest_zero]\n",
829 | "\n",
830 | "plot_class_regions_for_classifier(clf, X_twovar_test, y_test)\n",
831 | "plt.title(\"SVC, class_weight = 'balanced', optimized for accuracy\")\n",
832 | "plt.show()\n",
833 | "\n",
834 | "plt.figure()\n",
835 | "plt.xlim([0.0, 1.01])\n",
836 | "plt.ylim([0.0, 1.01])\n",
837 | "plt.title (\"Precision-recall curve: SVC, class_weight = 'balanced'\")\n",
838 | "plt.plot(precision, recall, label = 'Precision-Recall Curve')\n",
839 | "plt.plot(closest_zero_p, closest_zero_r, 'o', markersize=12, fillstyle='none', c='r', mew=3)\n",
840 | "plt.xlabel('Precision', fontsize=16)\n",
841 | "plt.ylabel('Recall', fontsize=16)\n",
842 | "plt.axes().set_aspect('equal')\n",
843 | "plt.show()\n",
844 | "print('At zero threshold, precision: {:.2f}, recall: {:.2f}'\n",
845 | " .format(closest_zero_p, closest_zero_r))"
846 | ]
847 | },
848 | {
849 | "cell_type": "code",
850 | "execution_count": null,
851 | "metadata": {
852 | "collapsed": true
853 | },
854 | "outputs": [],
855 | "source": []
856 | }
857 | ],
858 | "metadata": {
859 | "anaconda-cloud": {},
860 | "kernelspec": {
861 | "display_name": "Python 3",
862 | "language": "python",
863 | "name": "python3"
864 | },
865 | "language_info": {
866 | "codemirror_mode": {
867 | "name": "ipython",
868 | "version": 3
869 | },
870 | "file_extension": ".py",
871 | "mimetype": "text/x-python",
872 | "name": "python",
873 | "nbconvert_exporter": "python",
874 | "pygments_lexer": "ipython3",
875 | "version": "3.5.2"
876 | }
877 | },
878 | "nbformat": 4,
879 | "nbformat_minor": 1
880 | }
881 |
--------------------------------------------------------------------------------
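The micro- vs. macro-averaging cells in Module 3 above are easy to verify by hand. A tiny sketch with made-up labels, where class 0 is frequent and classes 1 and 2 are rare:

    from sklearn.metrics import precision_score

    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

    # Micro pools all instances: 7 of 9 predictions are correct, so ~0.78.
    print(precision_score(y_true, y_pred, average='micro'))
    # Macro averages the per-class precisions (1.0, 0.5, 0.5), so ~0.67:
    # the two rare classes drag the score down despite high overall accuracy.
    print(precision_score(y_true, y_pred, average='macro'))

This asymmetry is why macro-averaged metrics are usually the more informative choice when rare classes matter.
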
/readonly/Module 4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "source": [
20 | "# Applied Machine Learning: Module 4 (Supervised Learning, Part II)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Preamble and Datasets"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "collapsed": false,
35 | "scrolled": false
36 | },
37 | "outputs": [],
38 | "source": [
39 | "%matplotlib notebook\n",
40 | "import numpy as np\n",
41 | "import pandas as pd\n",
42 | "import seaborn as sn\n",
43 | "import matplotlib.pyplot as plt\n",
44 | "\n",
45 | "from sklearn.model_selection import train_test_split\n",
46 | "from sklearn.datasets import make_classification, make_blobs\n",
47 | "from matplotlib.colors import ListedColormap\n",
48 | "from sklearn.datasets import load_breast_cancer\n",
49 | "from adspy_shared_utilities import load_crime_dataset\n",
50 | "\n",
51 | "\n",
52 | "cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])\n",
53 | "\n",
54 | "# fruits dataset\n",
55 | "fruits = pd.read_table('fruit_data_with_colors.txt')\n",
56 | "\n",
57 | "feature_names_fruits = ['height', 'width', 'mass', 'color_score']\n",
58 | "X_fruits = fruits[feature_names_fruits]\n",
59 | "y_fruits = fruits['fruit_label']\n",
60 | "target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon']\n",
61 | "\n",
62 | "X_fruits_2d = fruits[['height', 'width']]\n",
63 | "y_fruits_2d = fruits['fruit_label']\n",
64 | "\n",
65 | "# synthetic dataset for simple regression\n",
66 | "from sklearn.datasets import make_regression\n",
67 | "plt.figure()\n",
68 | "plt.title('Sample regression problem with one input variable')\n",
69 | "X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,\n",
70 | " n_informative=1, bias = 150.0,\n",
71 | " noise = 30, random_state=0)\n",
72 | "plt.scatter(X_R1, y_R1, marker= 'o', s=50)\n",
73 | "plt.show()\n",
74 | "\n",
75 | "# synthetic dataset for more complex regression\n",
76 | "from sklearn.datasets import make_friedman1\n",
77 | "plt.figure()\n",
78 | "plt.title('Complex regression problem with one input variable')\n",
79 | "X_F1, y_F1 = make_friedman1(n_samples = 100, n_features = 7,\n",
80 | " random_state=0)\n",
81 | "\n",
82 | "plt.scatter(X_F1[:, 2], y_F1, marker= 'o', s=50)\n",
83 | "plt.show()\n",
84 | "\n",
85 | "# synthetic dataset for classification (binary)\n",
86 | "plt.figure()\n",
87 | "plt.title('Sample binary classification problem with two informative features')\n",
88 | "X_C2, y_C2 = make_classification(n_samples = 100, n_features=2,\n",
89 | " n_redundant=0, n_informative=2,\n",
90 | " n_clusters_per_class=1, flip_y = 0.1,\n",
91 | " class_sep = 0.5, random_state=0)\n",
92 | "plt.scatter(X_C2[:, 0], X_C2[:, 1], marker= 'o',\n",
93 | " c=y_C2, s=50, cmap=cmap_bold)\n",
94 | "plt.show()\n",
95 | "\n",
96 | "# more difficult synthetic dataset for classification (binary)\n",
97 | "# with classes that are not linearly separable\n",
98 | "X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2,\n",
99 | " centers = 8, cluster_std = 1.3,\n",
100 | " random_state = 4)\n",
101 | "y_D2 = y_D2 % 2\n",
102 | "plt.figure()\n",
103 | "plt.title('Sample binary classification problem with non-linearly separable classes')\n",
104 | "plt.scatter(X_D2[:,0], X_D2[:,1], c=y_D2,\n",
105 | " marker= 'o', s=50, cmap=cmap_bold)\n",
106 | "plt.show()\n",
107 | "\n",
108 | "# Breast cancer dataset for classification\n",
109 | "cancer = load_breast_cancer()\n",
110 | "(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)\n",
111 | "\n",
112 | "# Communities and Crime dataset\n",
113 | "(X_crime, y_crime) = load_crime_dataset()"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "## Naive Bayes classifiers"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {
127 | "collapsed": false
128 | },
129 | "outputs": [],
130 | "source": [
131 | "from sklearn.naive_bayes import GaussianNB\n",
132 | "from adspy_shared_utilities import plot_class_regions_for_classifier\n",
133 | "\n",
134 | "X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state=0)\n",
135 | "\n",
136 | "nbclf = GaussianNB().fit(X_train, y_train)\n",
137 | "plot_class_regions_for_classifier(nbclf, X_train, y_train, X_test, y_test,\n",
138 | " 'Gaussian Naive Bayes classifier: Dataset 1')"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {
145 | "collapsed": false
146 | },
147 | "outputs": [],
148 | "source": [
149 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2,\n",
150 | " random_state=0)\n",
151 | "\n",
152 | "nbclf = GaussianNB().fit(X_train, y_train)\n",
153 | "plot_class_regions_for_classifier(nbclf, X_train, y_train, X_test, y_test,\n",
154 | " 'Gaussian Naive Bayes classifier: Dataset 2')"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "### Application to a real-world dataset"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {
168 | "collapsed": false
169 | },
170 | "outputs": [],
171 | "source": [
172 | "X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)\n",
173 | "\n",
174 | "nbclf = GaussianNB().fit(X_train, y_train)\n",
175 | "print('Breast cancer dataset')\n",
176 | "print('Accuracy of GaussianNB classifier on training set: {:.2f}'\n",
177 | " .format(nbclf.score(X_train, y_train)))\n",
178 | "print('Accuracy of GaussianNB classifier on test set: {:.2f}'\n",
179 | " .format(nbclf.score(X_test, y_test)))"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "## Ensembles of Decision Trees"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "### Random forests"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "collapsed": false,
201 | "scrolled": false
202 | },
203 | "outputs": [],
204 | "source": [
205 | "from sklearn.ensemble import RandomForestClassifier\n",
206 | "from sklearn.model_selection import train_test_split\n",
207 | "from adspy_shared_utilities import plot_class_regions_for_classifier_subplot\n",
208 | "\n",
209 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2,\n",
210 | " random_state = 0)\n",
211 | "fig, subaxes = plt.subplots(1, 1, figsize=(6, 6))\n",
212 | "\n",
213 | "clf = RandomForestClassifier().fit(X_train, y_train)\n",
214 | "title = 'Random Forest Classifier, complex binary dataset, default settings'\n",
215 | "plot_class_regions_for_classifier_subplot(clf, X_train, y_train, X_test,\n",
216 | " y_test, title, subaxes)\n",
217 | "\n",
218 | "plt.show()"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "### Random forest: Fruit dataset"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "collapsed": false,
233 | "scrolled": false
234 | },
235 | "outputs": [],
236 | "source": [
237 | "from sklearn.ensemble import RandomForestClassifier\n",
238 | "from sklearn.model_selection import train_test_split\n",
239 | "from adspy_shared_utilities import plot_class_regions_for_classifier_subplot\n",
240 | "\n",
241 | "X_train, X_test, y_train, y_test = train_test_split(X_fruits.as_matrix(),\n",
242 | " y_fruits.as_matrix(),\n",
243 | " random_state = 0)\n",
244 | "fig, subaxes = plt.subplots(6, 1, figsize=(6, 32))\n",
245 | "\n",
246 | "title = 'Random Forest, fruits dataset, default settings'\n",
247 | "pair_list = [[0,1], [0,2], [0,3], [1,2], [1,3], [2,3]]\n",
248 | "\n",
249 | "for pair, axis in zip(pair_list, subaxes):\n",
250 | " X = X_train[:, pair]\n",
251 | " y = y_train\n",
252 | " \n",
253 | " clf = RandomForestClassifier().fit(X, y)\n",
254 | " plot_class_regions_for_classifier_subplot(clf, X, y, None,\n",
255 | " None, title, axis,\n",
256 | " target_names_fruits)\n",
257 | " \n",
258 | " axis.set_xlabel(feature_names_fruits[pair[0]])\n",
259 | " axis.set_ylabel(feature_names_fruits[pair[1]])\n",
260 | " \n",
261 | "plt.tight_layout()\n",
262 | "plt.show()\n",
263 | "\n",
264 | "clf = RandomForestClassifier(n_estimators = 10,\n",
265 | " random_state=0).fit(X_train, y_train)\n",
266 | "\n",
267 | "print('Random Forest, Fruit dataset, default settings')\n",
268 | "print('Accuracy of RF classifier on training set: {:.2f}'\n",
269 | " .format(clf.score(X_train, y_train)))\n",
270 | "print('Accuracy of RF classifier on test set: {:.2f}'\n",
271 | " .format(clf.score(X_test, y_test)))"
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "#### Random Forests on a real-world dataset"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "collapsed": false
286 | },
287 | "outputs": [],
288 | "source": [
289 | "from sklearn.ensemble import RandomForestClassifier\n",
290 | "\n",
291 | "X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)\n",
292 | "\n",
293 | "clf = RandomForestClassifier(max_features = 8, random_state = 0)\n",
294 | "clf.fit(X_train, y_train)\n",
295 | "\n",
296 | "print('Breast cancer dataset')\n",
297 | "print('Accuracy of RF classifier on training set: {:.2f}'\n",
298 | " .format(clf.score(X_train, y_train)))\n",
299 | "print('Accuracy of RF classifier on test set: {:.2f}'\n",
300 | " .format(clf.score(X_test, y_test)))"
301 | ]
302 | },
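One advantage of tree ensembles is the built-in per-feature importance estimate. A short follow-up sketch using the `plot_feature_importances` helper already defined in `adspy_shared_utilities` (this assumes a `cancer` object from `load_breast_cancer()` is in scope, as in the notebook preamble):

```python
from adspy_shared_utilities import plot_feature_importances
import matplotlib.pyplot as plt

# clf is the random forest fitted on the breast cancer data above;
# cancer.feature_names is assumed from load_breast_cancer() in the preamble.
plt.figure(figsize=(10, 8))
plot_feature_importances(clf, cancer.feature_names)
plt.show()
```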
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "### Gradient-boosted decision trees"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "metadata": {
314 | "collapsed": false
315 | },
316 | "outputs": [],
317 | "source": [
318 | "from sklearn.ensemble import GradientBoostingClassifier\n",
319 | "from sklearn.model_selection import train_test_split\n",
320 | "from adspy_shared_utilities import plot_class_regions_for_classifier_subplot\n",
321 | "\n",
322 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state = 0)\n",
323 | "fig, subaxes = plt.subplots(1, 1, figsize=(6, 6))\n",
324 | "\n",
325 | "clf = GradientBoostingClassifier().fit(X_train, y_train)\n",
326 | "title = 'GBDT, complex binary dataset, default settings'\n",
327 | "plot_class_regions_for_classifier_subplot(clf, X_train, y_train, X_test,\n",
328 | " y_test, title, subaxes)\n",
329 | "\n",
330 | "plt.show()"
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "#### Gradient boosted decision trees on the fruit dataset"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {
344 | "collapsed": false,
345 | "scrolled": false
346 | },
347 | "outputs": [],
348 | "source": [
349 | "X_train, X_test, y_train, y_test = train_test_split(X_fruits.as_matrix(),\n",
350 | " y_fruits.as_matrix(),\n",
351 | " random_state = 0)\n",
352 | "fig, subaxes = plt.subplots(6, 1, figsize=(6, 32))\n",
353 | "\n",
354 | "pair_list = [[0,1], [0,2], [0,3], [1,2], [1,3], [2,3]]\n",
355 | "\n",
356 | "for pair, axis in zip(pair_list, subaxes):\n",
357 | " X = X_train[:, pair]\n",
358 | " y = y_train\n",
359 | " \n",
360 | " clf = GradientBoostingClassifier().fit(X, y)\n",
361 | " plot_class_regions_for_classifier_subplot(clf, X, y, None,\n",
362 | " None, title, axis,\n",
363 | " target_names_fruits)\n",
364 | " \n",
365 | " axis.set_xlabel(feature_names_fruits[pair[0]])\n",
366 | " axis.set_ylabel(feature_names_fruits[pair[1]])\n",
367 | " \n",
368 | "plt.tight_layout()\n",
369 | "plt.show()\n",
370 | "clf = GradientBoostingClassifier().fit(X_train, y_train)\n",
371 | "\n",
372 | "print('GBDT, Fruit dataset, default settings')\n",
373 | "print('Accuracy of GBDT classifier on training set: {:.2f}'\n",
374 | " .format(clf.score(X_train, y_train)))\n",
375 | "print('Accuracy of GBDT classifier on test set: {:.2f}'\n",
376 | " .format(clf.score(X_test, y_test)))"
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
383 | "#### Gradient-boosted decision trees on a real-world dataset"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {
390 | "collapsed": false
391 | },
392 | "outputs": [],
393 | "source": [
394 | "from sklearn.ensemble import GradientBoostingClassifier\n",
395 | "\n",
396 | "X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)\n",
397 | "\n",
398 | "clf = GradientBoostingClassifier(random_state = 0)\n",
399 | "clf.fit(X_train, y_train)\n",
400 | "\n",
401 | "print('Breast cancer dataset (learning_rate=0.1, max_depth=3)')\n",
402 | "print('Accuracy of GBDT classifier on training set: {:.2f}'\n",
403 | " .format(clf.score(X_train, y_train)))\n",
404 | "print('Accuracy of GBDT classifier on test set: {:.2f}\\n'\n",
405 | " .format(clf.score(X_test, y_test)))\n",
406 | "\n",
407 | "clf = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, random_state = 0)\n",
408 | "clf.fit(X_train, y_train)\n",
409 | "\n",
410 | "print('Breast cancer dataset (learning_rate=0.01, max_depth=2)')\n",
411 | "print('Accuracy of GBDT classifier on training set: {:.2f}'\n",
412 | " .format(clf.score(X_train, y_train)))\n",
413 | "print('Accuracy of GBDT classifier on test set: {:.2f}'\n",
414 | " .format(clf.score(X_test, y_test)))"
415 | ]
416 | },
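The two fits above suggest that shrinking `learning_rate` (together with shallower trees) trades a little training accuracy for less overfitting. A quick sketch, reusing the split and imports from the cell above, that makes the trend explicit:

```python
# Scan the learning rate and watch the train/test gap change.
for lr in [1.0, 0.1, 0.01]:
    clf = GradientBoostingClassifier(learning_rate=lr, random_state=0)
    clf.fit(X_train, y_train)
    print('learning_rate={}: train {:.2f}, test {:.2f}'.format(
        lr, clf.score(X_train, y_train), clf.score(X_test, y_test)))
```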
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "## Neural networks"
422 | ]
423 | },
424 | {
425 | "cell_type": "markdown",
426 | "metadata": {},
427 | "source": [
428 | "#### Activation functions"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {
435 | "collapsed": false
436 | },
437 | "outputs": [],
438 | "source": [
439 | "xrange = np.linspace(-2, 2, 200)\n",
440 | "\n",
441 | "plt.figure(figsize=(7,6))\n",
442 | "\n",
443 | "plt.plot(xrange, np.maximum(xrange, 0), label = 'relu')\n",
444 | "plt.plot(xrange, np.tanh(xrange), label = 'tanh')\n",
445 | "plt.plot(xrange, 1 / (1 + np.exp(-xrange)), label = 'logistic')\n",
446 | "plt.legend()\n",
447 | "plt.title('Neural network activation functions')\n",
448 | "plt.xlabel('Input value (x)')\n",
449 | "plt.ylabel('Activation function output')\n",
450 | "\n",
451 | "plt.show()"
452 | ]
453 | },
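Each hidden unit computes a weighted sum of its inputs plus a bias and passes the result through one of these nonlinearities. A tiny worked example with made-up weights (all values here are hypothetical, for illustration only):

```python
import numpy as np

w, b = np.array([0.5, -1.2]), 0.1   # hypothetical weights and bias
x = np.array([2.0, 1.0])            # hypothetical input
z = np.dot(w, x) + b                # weighted sum: 0.5*2 - 1.2*1 + 0.1 = -0.1

print(np.maximum(z, 0))             # relu     -> 0.0
print(np.tanh(z))                   # tanh     -> about -0.0997
print(1 / (1 + np.exp(-z)))         # logistic -> about 0.475
```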
454 | {
455 | "cell_type": "markdown",
456 | "metadata": {},
457 | "source": [
458 | "### Neural networks: Classification"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "#### Synthetic dataset 1: single hidden layer"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {
472 | "collapsed": false,
473 | "scrolled": false
474 | },
475 | "outputs": [],
476 | "source": [
477 | "from sklearn.neural_network import MLPClassifier\n",
478 | "from adspy_shared_utilities import plot_class_regions_for_classifier_subplot\n",
479 | "\n",
480 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)\n",
481 | "\n",
482 | "fig, subaxes = plt.subplots(3, 1, figsize=(6,18))\n",
483 | "\n",
484 | "for units, axis in zip([1, 10, 100], subaxes):\n",
485 | " nnclf = MLPClassifier(hidden_layer_sizes = [units], solver='lbfgs',\n",
486 | " random_state = 0).fit(X_train, y_train)\n",
487 | " \n",
488 | " title = 'Dataset 1: Neural net classifier, 1 layer, {} units'.format(units)\n",
489 | " \n",
490 | " plot_class_regions_for_classifier_subplot(nnclf, X_train, y_train,\n",
491 | " X_test, y_test, title, axis)\n",
492 | " plt.tight_layout()"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "#### Synthetic dataset 1: two hidden layers"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": null,
505 | "metadata": {
506 | "collapsed": false
507 | },
508 | "outputs": [],
509 | "source": [
510 | "from adspy_shared_utilities import plot_class_regions_for_classifier\n",
511 | "\n",
512 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)\n",
513 | "\n",
514 | "nnclf = MLPClassifier(hidden_layer_sizes = [10, 10], solver='lbfgs',\n",
515 | " random_state = 0).fit(X_train, y_train)\n",
516 | "\n",
517 | "plot_class_regions_for_classifier(nnclf, X_train, y_train, X_test, y_test,\n",
518 | " 'Dataset 1: Neural net classifier, 2 layers, 10/10 units')"
519 | ]
520 | },
521 | {
522 | "cell_type": "markdown",
523 | "metadata": {},
524 | "source": [
525 | "#### Regularization parameter: alpha"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": null,
531 | "metadata": {
532 | "collapsed": false,
533 | "scrolled": false
534 | },
535 | "outputs": [],
536 | "source": [
537 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)\n",
538 | "\n",
539 | "fig, subaxes = plt.subplots(4, 1, figsize=(6, 23))\n",
540 | "\n",
541 | "for this_alpha, axis in zip([0.01, 0.1, 1.0, 5.0], subaxes):\n",
542 | " nnclf = MLPClassifier(solver='lbfgs', activation = 'tanh',\n",
543 | " alpha = this_alpha,\n",
544 | " hidden_layer_sizes = [100, 100],\n",
545 | " random_state = 0).fit(X_train, y_train)\n",
546 | " \n",
547 | " title = 'Dataset 2: NN classifier, alpha = {:.3f} '.format(this_alpha)\n",
548 | " \n",
549 | " plot_class_regions_for_classifier_subplot(nnclf, X_train, y_train,\n",
550 | " X_test, y_test, title, axis)\n",
551 | " plt.tight_layout()\n",
552 | " "
553 | ]
554 | },
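`alpha` is the weight of an L2 penalty on the connection weights, so larger values force smaller weights and hence smoother decision boundaries. A sketch confirming the shrinkage directly via the fitted `coefs_` attribute (a standard MLPClassifier attribute), reusing the split from the cell above:

```python
import numpy as np

for a in [0.01, 5.0]:
    nn = MLPClassifier(solver='lbfgs', activation='tanh', alpha=a,
                       hidden_layer_sizes=[100, 100],
                       random_state=0).fit(X_train, y_train)
    # Overall L2 norm across all layer weight matrices.
    norm = np.sqrt(sum((w ** 2).sum() for w in nn.coefs_))
    print('alpha = {}: total weight norm = {:.1f}'.format(a, norm))
```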
555 | {
556 | "cell_type": "markdown",
557 | "metadata": {},
558 | "source": [
559 | "#### The effect of different choices of activation function"
560 | ]
561 | },
562 | {
563 | "cell_type": "code",
564 | "execution_count": null,
565 | "metadata": {
566 | "collapsed": false,
567 | "scrolled": false
568 | },
569 | "outputs": [],
570 | "source": [
571 | "X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)\n",
572 | "\n",
573 | "fig, subaxes = plt.subplots(3, 1, figsize=(6,18))\n",
574 | "\n",
575 | "for this_activation, axis in zip(['logistic', 'tanh', 'relu'], subaxes):\n",
576 | " nnclf = MLPClassifier(solver='lbfgs', activation = this_activation,\n",
577 | " alpha = 0.1, hidden_layer_sizes = [10, 10],\n",
578 | " random_state = 0).fit(X_train, y_train)\n",
579 | " \n",
580 | " title = 'Dataset 2: NN classifier, 2 layers 10/10, {} \\\n",
581 | "activation function'.format(this_activation)\n",
582 | " \n",
583 | " plot_class_regions_for_classifier_subplot(nnclf, X_train, y_train,\n",
584 | " X_test, y_test, title, axis)\n",
585 | " plt.tight_layout()"
586 | ]
587 | },
588 | {
589 | "cell_type": "markdown",
590 | "metadata": {},
591 | "source": [
592 | "### Neural networks: Regression"
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": null,
598 | "metadata": {
599 | "collapsed": false
600 | },
601 | "outputs": [],
602 | "source": [
603 | "from sklearn.neural_network import MLPRegressor\n",
604 | "\n",
605 | "fig, subaxes = plt.subplots(2, 3, figsize=(11,8), dpi=70)\n",
606 | "\n",
607 | "X_predict_input = np.linspace(-3, 3, 50).reshape(-1,1)\n",
608 | "\n",
609 | "X_train, X_test, y_train, y_test = train_test_split(X_R1[0::5], y_R1[0::5], random_state = 0)\n",
610 | "\n",
611 | "for thisaxisrow, thisactivation in zip(subaxes, ['tanh', 'relu']):\n",
612 | " for thisalpha, thisaxis in zip([0.0001, 1.0, 100], thisaxisrow):\n",
613 | " mlpreg = MLPRegressor(hidden_layer_sizes = [100,100],\n",
614 | " activation = thisactivation,\n",
615 | " alpha = thisalpha,\n",
616 | " solver = 'lbfgs').fit(X_train, y_train)\n",
617 | " y_predict_output = mlpreg.predict(X_predict_input)\n",
618 | " thisaxis.set_xlim([-2.5, 0.75])\n",
619 | " thisaxis.plot(X_predict_input, y_predict_output,\n",
620 | " '^', markersize = 10)\n",
621 | " thisaxis.plot(X_train, y_train, 'o')\n",
622 | " thisaxis.set_xlabel('Input feature')\n",
623 | " thisaxis.set_ylabel('Target value')\n",
624 | " thisaxis.set_title('MLP regression\\nalpha={}, activation={})'\n",
625 | " .format(thisalpha, thisactivation))\n",
626 | " plt.tight_layout()"
627 | ]
628 | },
629 | {
630 | "cell_type": "markdown",
631 | "metadata": {},
632 | "source": [
633 | "#### Application to real-world dataset for classification"
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": null,
639 | "metadata": {
640 | "collapsed": false
641 | },
642 | "outputs": [],
643 | "source": [
644 | "from sklearn.neural_network import MLPClassifier\n",
645 | "from sklearn.preprocessing import MinMaxScaler\n",
646 | "\n",
647 | "\n",
648 | "scaler = MinMaxScaler()\n",
649 | "\n",
650 | "X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)\n",
651 | "X_train_scaled = scaler.fit_transform(X_train)\n",
652 | "X_test_scaled = scaler.transform(X_test)\n",
653 | "\n",
654 | "clf = MLPClassifier(hidden_layer_sizes = [100, 100], alpha = 5.0,\n",
655 | " random_state = 0, solver='lbfgs').fit(X_train_scaled, y_train)\n",
656 | "\n",
657 | "print('Breast cancer dataset')\n",
658 | "print('Accuracy of NN classifier on training set: {:.2f}'\n",
659 | " .format(clf.score(X_train_scaled, y_train)))\n",
660 | "print('Accuracy of NN classifier on test set: {:.2f}'\n",
661 | " .format(clf.score(X_test_scaled, y_test)))"
662 | ]
663 | },
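The same fit can be written more compactly with a `Pipeline`, which keeps the scaler and classifier together so the scaler is always fit on training data only. A sketch with the same estimator settings as above:

```python
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(MinMaxScaler(),
                     MLPClassifier(hidden_layer_sizes=[100, 100], alpha=5.0,
                                   random_state=0, solver='lbfgs'))
pipe.fit(X_train, y_train)
print('Accuracy of pipelined NN on test set: {:.2f}'.format(
    pipe.score(X_test, y_test)))
```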
664 | {
665 | "cell_type": "code",
666 | "execution_count": 1,
667 | "metadata": {
668 | "collapsed": false
669 | },
670 | "outputs": [
671 | {
672 | "name": "stdout",
673 | "output_type": "stream",
674 | "text": [
675 | "./addresses.csv\r\n",
676 | "./train.csv\r\n",
677 | "./Module 2.ipynb\r\n",
678 | "./Assignment 3.ipynb\r\n",
679 | "./Module 4.ipynb\r\n",
680 | "./Assignment 1.ipynb\r\n",
681 | "./test.csv\r\n",
682 | "./CommViolPredUnnormalizedData.txt\r\n",
683 | "./adspy_shared_utilities.py\r\n",
684 | "./Module 3.ipynb\r\n",
685 | "./fraud_data.csv\r\n",
686 | "./fruit_data_with_colors.txt\r\n",
687 | "./Assignment 4.ipynb\r\n",
688 | "./Assignment 2.ipynb\r\n",
689 | "./mushrooms.csv\r\n",
690 | "./Classifier Visualization.ipynb\r\n",
691 | "./latlons.csv\r\n",
692 | "./Module 1.ipynb\r\n"
693 | ]
694 | }
695 | ],
696 | "source": [
697 | "!find . -maxdepth 1 -not -type d"
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 7,
703 | "metadata": {
704 | "collapsed": false
705 | },
706 | "outputs": [
707 | {
708 | "name": "stdout",
709 | "output_type": "stream",
710 | "text": [
711 | "addresses.csv adspy_temp.dot polynomialreg1.png train.csv\r\n"
712 | ]
713 | }
714 | ],
715 | "source": [
716 | "!ls readonly"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": 8,
722 | "metadata": {
723 | "collapsed": false
724 | },
725 | "outputs": [
726 | {
727 | "name": "stdout",
728 | "output_type": "stream",
729 | "text": [
730 | "cp: target ‘2.ipynb’ is not a directory\r\n"
731 | ]
732 | }
733 | ],
734 | "source": [
735 | "!cp ./Module 2.ipynb readonly/Module 2.ipynb"
736 | ]
737 | },
738 | {
739 | "cell_type": "code",
740 | "execution_count": null,
741 | "metadata": {
742 | "collapsed": true
743 | },
744 | "outputs": [],
745 | "source": []
746 | }
747 | ],
748 | "metadata": {
749 | "anaconda-cloud": {},
750 | "kernelspec": {
751 | "display_name": "Python 3",
752 | "language": "python",
753 | "name": "python3"
754 | },
755 | "language_info": {
756 | "codemirror_mode": {
757 | "name": "ipython",
758 | "version": 3
759 | },
760 | "file_extension": ".py",
761 | "mimetype": "text/x-python",
762 | "name": "python",
763 | "nbconvert_exporter": "python",
764 | "pygments_lexer": "ipython3",
765 | "version": "3.5.2"
766 | }
767 | },
768 | "nbformat": 4,
769 | "nbformat_minor": 2
770 | }
771 |
--------------------------------------------------------------------------------
/readonly/Unsupervised Learning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Applied Machine Learning: Unsupervised Learning"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Preamble and Datasets"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "%matplotlib notebook\n",
24 | "import numpy as np\n",
25 | "import pandas as pd\n",
26 | "import seaborn as sn\n",
27 | "import matplotlib.pyplot as plt\n",
28 | "from sklearn.datasets import load_breast_cancer\n",
29 | "\n",
30 | "# Breast cancer dataset\n",
31 | "cancer = load_breast_cancer()\n",
32 | "(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)\n",
33 | "\n",
34 | "# Our sample fruits dataset\n",
35 | "fruits = pd.read_table('fruit_data_with_colors.txt')\n",
36 | "X_fruits = fruits[['mass','width','height', 'color_score']]\n",
37 | "y_fruits = fruits[['fruit_label']] - 1"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "## Dimensionality Reduction and Manifold Learning"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "### Principal Components Analysis (PCA)"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "#### Using PCA to find the first two principal components of the breast cancer dataset"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "from sklearn.preprocessing import StandardScaler\n",
68 | "from sklearn.decomposition import PCA\n",
69 | "from sklearn.datasets import load_breast_cancer\n",
70 | "\n",
71 | "cancer = load_breast_cancer()\n",
72 | "(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)\n",
73 | "\n",
74 | "# Before applying PCA, each feature should be centered (zero mean) and with unit variance\n",
75 | "X_normalized = StandardScaler().fit(X_cancer).transform(X_cancer) \n",
76 | "\n",
77 | "pca = PCA(n_components = 2).fit(X_normalized)\n",
78 | "\n",
79 | "X_pca = pca.transform(X_normalized)\n",
80 | "print(X_cancer.shape, X_pca.shape)"
81 | ]
82 | },
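A useful sanity check after the transform is how much of the total variance the two components retain; `explained_variance_ratio_` is a standard attribute of a fitted PCA:

```python
# Fraction of total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
print('Total variance explained: {:.2f}'.format(
    pca.explained_variance_ratio_.sum()))
```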
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "#### Plotting the PCA-transformed version of the breast cancer dataset"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "from adspy_shared_utilities import plot_labelled_scatter\n",
97 | "plot_labelled_scatter(X_pca, y_cancer, ['malignant', 'benign'])\n",
98 | "\n",
99 | "plt.xlabel('First principal component')\n",
100 | "plt.ylabel('Second principal component')\n",
101 | "plt.title('Breast Cancer Dataset PCA (n_components = 2)');"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "#### Plotting the magnitude of each feature value for the first two principal components"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": null,
114 | "metadata": {},
115 | "outputs": [],
116 | "source": [
117 | "fig = plt.figure(figsize=(8, 4))\n",
118 | "plt.imshow(pca.components_, interpolation = 'none', cmap = 'plasma')\n",
119 | "feature_names = list(cancer.feature_names)\n",
120 | "\n",
121 | "plt.gca().set_xticks(np.arange(-.5, len(feature_names)));\n",
122 | "plt.gca().set_yticks(np.arange(0.5, 2));\n",
123 | "plt.gca().set_xticklabels(feature_names, rotation=90, ha='left', fontsize=12);\n",
124 | "plt.gca().set_yticklabels(['First PC', 'Second PC'], va='bottom', fontsize=12);\n",
125 | "\n",
126 | "plt.colorbar(orientation='horizontal', ticks=[pca.components_.min(), 0, \n",
127 | " pca.components_.max()], pad=0.65);"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "#### PCA on the fruit dataset (for comparison)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "from sklearn.preprocessing import StandardScaler\n",
144 | "from sklearn.decomposition import PCA\n",
145 | "\n",
146 | "# each feature should be centered (zero mean) and with unit variance\n",
147 | "X_normalized = StandardScaler().fit(X_fruits).transform(X_fruits) \n",
148 | "\n",
149 | "pca = PCA(n_components = 2).fit(X_normalized)\n",
150 | "X_pca = pca.transform(X_normalized)\n",
151 | "\n",
152 | "from adspy_shared_utilities import plot_labelled_scatter\n",
153 | "plot_labelled_scatter(X_pca, y_fruits, ['apple','mandarin','orange','lemon'])\n",
154 | "\n",
155 | "plt.xlabel('First principal component')\n",
156 | "plt.ylabel('Second principal component')\n",
157 | "plt.title('Fruits Dataset PCA (n_components = 2)');"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "### Manifold learning methods"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "#### Multidimensional scaling (MDS) on the fruit dataset"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "from adspy_shared_utilities import plot_labelled_scatter\n",
181 | "from sklearn.preprocessing import StandardScaler\n",
182 | "from sklearn.manifold import MDS\n",
183 | "\n",
184 | "# each feature should be centered (zero mean) and with unit variance\n",
185 | "X_fruits_normalized = StandardScaler().fit(X_fruits).transform(X_fruits) \n",
186 | "\n",
187 | "mds = MDS(n_components = 2)\n",
188 | "\n",
189 | "X_fruits_mds = mds.fit_transform(X_fruits_normalized)\n",
190 | "\n",
191 | "plot_labelled_scatter(X_fruits_mds, y_fruits, ['apple', 'mandarin', 'orange', 'lemon'])\n",
192 | "plt.xlabel('First MDS feature')\n",
193 | "plt.ylabel('Second MDS feature')\n",
194 | "plt.title('Fruit sample dataset MDS');"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "#### Multidimensional scaling (MDS) on the breast cancer dataset"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "(This example is not covered in the lecture video, but is included here so you can compare it to the results from PCA.)"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "from sklearn.preprocessing import StandardScaler\n",
218 | "from sklearn.manifold import MDS\n",
219 | "from sklearn.datasets import load_breast_cancer\n",
220 | "\n",
221 | "cancer = load_breast_cancer()\n",
222 | "(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)\n",
223 | "\n",
224 | "# each feature should be centered (zero mean) and with unit variance\n",
225 | "X_normalized = StandardScaler().fit(X_cancer).transform(X_cancer) \n",
226 | "\n",
227 | "mds = MDS(n_components = 2)\n",
228 | "\n",
229 | "X_mds = mds.fit_transform(X_normalized)\n",
230 | "\n",
231 | "from adspy_shared_utilities import plot_labelled_scatter\n",
232 | "plot_labelled_scatter(X_mds, y_cancer, ['malignant', 'benign'])\n",
233 | "\n",
234 | "plt.xlabel('First MDS dimension')\n",
235 | "plt.ylabel('Second MDS dimension')\n",
236 | "plt.title('Breast Cancer Dataset MDS (n_components = 2)');"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {
242 | "collapsed": true
243 | },
244 | "source": [
245 | "#### t-SNE on the fruit dataset"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "(This example from the lecture video is included so that you can see how some dimensionality reduction methods may be less successful on some datasets. Here, it doesn't work as well at finding structure in the small fruits dataset, compared to other methods like MDS.)"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "from sklearn.manifold import TSNE\n",
262 | "\n",
263 | "tsne = TSNE(random_state = 0)\n",
264 | "\n",
265 | "X_tsne = tsne.fit_transform(X_fruits_normalized)\n",
266 | "\n",
267 | "plot_labelled_scatter(X_tsne, y_fruits, \n",
268 | " ['apple', 'mandarin', 'orange', 'lemon'])\n",
269 | "plt.xlabel('First t-SNE feature')\n",
270 | "plt.ylabel('Second t-SNE feature')\n",
271 | "plt.title('Fruits dataset t-SNE');"
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "#### t-SNE on the breast cancer dataset"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "Although not shown in the lecture video, this example is included for comparison, showing the results of running t-SNE on the breast cancer dataset. See the reading \"How to Use t-SNE effectively\" for further details on how the visualizations from t-SNE are affected by specific parameter settings."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "tsne = TSNE(random_state = 0)\n",
295 | "\n",
296 | "X_tsne = tsne.fit_transform(X_normalized)\n",
297 | "\n",
298 | "plot_labelled_scatter(X_tsne, y_cancer, \n",
299 | " ['malignant', 'benign'])\n",
300 | "plt.xlabel('First t-SNE feature')\n",
301 | "plt.ylabel('Second t-SNE feature')\n",
302 | "plt.title('Breast cancer dataset t-SNE');"
303 | ]
304 | },
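As the "How to Use t-SNE effectively" reading discusses, the embedding depends strongly on the perplexity setting. A quick sketch re-running the embedding at a few values (perplexity is a standard `TSNE` parameter; its default is 30):

```python
for perp in [5, 30, 50]:
    print('perplexity =', perp)
    X_tsne = TSNE(random_state=0, perplexity=perp).fit_transform(X_normalized)
    plot_labelled_scatter(X_tsne, y_cancer, ['malignant', 'benign'])
```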
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "## Clustering"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {
315 | "collapsed": true
316 | },
317 | "source": [
318 | "### K-means"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "This example from the lecture video creates an artificial dataset with make_blobs, then applies k-means to find 3 clusters, and plots the points in each cluster identified by a corresponding color."
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {},
332 | "outputs": [],
333 | "source": [
334 | "from sklearn.datasets import make_blobs\n",
335 | "from sklearn.cluster import KMeans\n",
336 | "from adspy_shared_utilities import plot_labelled_scatter\n",
337 | "\n",
338 | "X, y = make_blobs(random_state = 10)\n",
339 | "\n",
340 | "kmeans = KMeans(n_clusters = 3)\n",
341 | "kmeans.fit(X)\n",
342 | "\n",
343 | "plot_labelled_scatter(X, kmeans.labels_, ['Cluster 1', 'Cluster 2', 'Cluster 3'])\n"
344 | ]
345 | },
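The fitted model exposes the three cluster centers via `cluster_centers_`, and its `inertia_` (within-cluster sum of squared distances) is the usual basis for an "elbow" scan when the number of clusters is not known in advance. A short sketch on the same `X`:

```python
print(kmeans.cluster_centers_)

# Elbow scan: inertia always decreases with k; look for where it levels off.
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=10).fit(X)
    print('k = {}: inertia = {:.1f}'.format(k, km.inertia_))
```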
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "Example showing k-means used to find 4 clusters in the fruits dataset. Note that in general, it's important to scale the individual features before applying k-means clustering."
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "from sklearn.datasets import make_blobs\n",
360 | "from sklearn.cluster import KMeans\n",
361 | "from adspy_shared_utilities import plot_labelled_scatter\n",
362 | "from sklearn.preprocessing import MinMaxScaler\n",
363 | "\n",
364 | "fruits = pd.read_table('fruit_data_with_colors.txt')\n",
365 | "X_fruits = fruits[['mass','width','height', 'color_score']].as_matrix()\n",
366 | "y_fruits = fruits[['fruit_label']] - 1\n",
367 | "\n",
368 | "X_fruits_normalized = MinMaxScaler().fit(X_fruits).transform(X_fruits) \n",
369 | "\n",
370 | "kmeans = KMeans(n_clusters = 4, random_state = 0)\n",
371 | "kmeans.fit(X_fruits)\n",
372 | "\n",
373 | "plot_labelled_scatter(X_fruits_normalized, kmeans.labels_, \n",
374 | " ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4'])"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "### Agglomerative clustering"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {
388 | "scrolled": false
389 | },
390 | "outputs": [],
391 | "source": [
392 | "from sklearn.datasets import make_blobs\n",
393 | "from sklearn.cluster import AgglomerativeClustering\n",
394 | "from adspy_shared_utilities import plot_labelled_scatter\n",
395 | "\n",
396 | "X, y = make_blobs(random_state = 10)\n",
397 | "\n",
398 | "cls = AgglomerativeClustering(n_clusters = 3)\n",
399 | "cls_assignment = cls.fit_predict(X)\n",
400 | "\n",
401 | "plot_labelled_scatter(X, cls_assignment, \n",
402 | " ['Cluster 1', 'Cluster 2', 'Cluster 3'])"
403 | ]
404 | },
405 | {
406 | "cell_type": "markdown",
407 | "metadata": {},
408 | "source": [
409 | "#### Creating a dendrogram (using scipy)"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "This dendrogram plot is based on the dataset created in the previous step with make_blobs, but for clarity, only 10 samples have been selected for this example, as plotted here:"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {},
423 | "outputs": [],
424 | "source": [
425 | "X, y = make_blobs(random_state = 10, n_samples = 10)\n",
426 | "plot_labelled_scatter(X, y, \n",
427 | " ['Cluster 1', 'Cluster 2', 'Cluster 3'])\n",
428 | "print(X)"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "And here's the dendrogram corresponding to agglomerative clustering of the 10 points above using Ward's method. The index 0..9 of the points corresponds to the index of the points in the X array above. For example, point 0 (5.69, -9.47) and point 9 (5.43, -9.76) are the closest two points and are clustered first."
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {},
442 | "outputs": [],
443 | "source": [
444 | "from scipy.cluster.hierarchy import ward, dendrogram\n",
445 | "plt.figure()\n",
446 | "dendrogram(ward(X))\n",
447 | "plt.show()"
448 | ]
449 | },
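The same hierarchy can be cut into a flat clustering at any chosen number of clusters with scipy's `fcluster`; a sketch on the linkage matrix that `ward(X)` returned above:

```python
from scipy.cluster.hierarchy import ward, fcluster

linkage_matrix = ward(X)
# Cut the tree so that exactly 3 flat clusters remain.
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(labels)
```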
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "### DBSCAN clustering"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {},
461 | "outputs": [],
462 | "source": [
463 | "from sklearn.cluster import DBSCAN\n",
464 | "from sklearn.datasets import make_blobs\n",
465 | "\n",
466 | "X, y = make_blobs(random_state = 9, n_samples = 25)\n",
467 | "\n",
468 | "dbscan = DBSCAN(eps = 2, min_samples = 2)\n",
469 | "\n",
470 | "cls = dbscan.fit_predict(X)\n",
471 | "print(\"Cluster membership values:\\n{}\".format(cls))\n",
472 | "\n",
473 | "plot_labelled_scatter(X, cls + 1, \n",
474 | " ['Noise', 'Cluster 0', 'Cluster 1', 'Cluster 2'])"
475 | ]
476 | },
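DBSCAN labels noise points `-1`, which is why the plot above shifts the labels by one before assigning names. Both `eps` and `min_samples` change the result substantially; a quick sketch over a few illustrative `eps` values on the same `X`:

```python
for eps in [0.5, 1, 2, 4]:
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print('eps = {}: {} clusters, {} noise points'.format(
        eps, n_clusters, list(labels).count(-1)))
```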
477 | {
478 | "cell_type": "code",
479 | "execution_count": null,
480 | "metadata": {
481 | "collapsed": true
482 | },
483 | "outputs": [],
484 | "source": []
485 | }
486 | ],
487 | "metadata": {
488 | "anaconda-cloud": {},
489 | "kernelspec": {
490 | "display_name": "Python [conda env:NotebookRecording]",
491 | "language": "python",
492 | "name": "conda-env-NotebookRecording-py"
493 | },
494 | "language_info": {
495 | "codemirror_mode": {
496 | "name": "ipython",
497 | "version": 3
498 | },
499 | "file_extension": ".py",
500 | "mimetype": "text/x-python",
501 | "name": "python",
502 | "nbconvert_exporter": "python",
503 | "pygments_lexer": "ipython3",
504 | "version": "3.5.3"
505 | }
506 | },
507 | "nbformat": 4,
508 | "nbformat_minor": 1
509 | }
510 |
--------------------------------------------------------------------------------
/readonly/adspy_shared_utilities.py:
--------------------------------------------------------------------------------
1 | # version 1.0
2 |
3 | import numpy
4 | import pandas as pd
5 | import seaborn as sn
6 | import matplotlib.pyplot as plt
7 | import matplotlib.cm as cm
8 | from matplotlib.colors import ListedColormap, BoundaryNorm
9 | from sklearn import neighbors
10 | import matplotlib.patches as mpatches
11 | import graphviz
12 | from sklearn.tree import export_graphviz
14 |
15 | def load_crime_dataset():
16 | # Communities and Crime dataset for regression
17 | # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
18 |
19 | crime = pd.read_table('CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
20 | # remove features with poor coverage or lower relevance, and keep ViolentCrimesPerPop target column
21 | columns_to_keep = [5, 6] + list(range(11,26)) + list(range(32, 103)) + [145]
22 |     crime = crime.iloc[:, columns_to_keep].dropna()
23 |
24 |     X_crime = crime.iloc[:, 0:88]
25 | y_crime = crime['ViolentCrimesPerPop']
26 |
27 | return (X_crime, y_crime)
28 |
29 | def plot_decision_tree(clf, feature_names, class_names):
30 | # This function requires the pydotplus module and assumes it's been installed.
31 | # In some cases (typically under Windows) even after running conda install, there is a problem where the
32 | # pydotplus module is not found when running from within the notebook environment. The following code
33 | # may help to guarantee the module is installed in the current notebook environment directory.
34 | #
35 | # import sys; sys.executable
36 | # !{sys.executable} -m pip install pydotplus
37 |
38 | export_graphviz(clf, out_file="adspy_temp.dot", feature_names=feature_names, class_names=class_names, filled = True, impurity = False)
39 | with open("adspy_temp.dot") as f:
40 | dot_graph = f.read()
41 | # Alternate method using pydotplus, if installed.
42 | # graph = pydotplus.graphviz.graph_from_dot_data(dot_graph)
43 | # return graph.create_png()
44 | return graphviz.Source(dot_graph)
45 |
46 | def plot_feature_importances(clf, feature_names):
47 | c_features = len(feature_names)
48 | plt.barh(range(c_features), clf.feature_importances_)
49 | plt.xlabel("Feature importance")
50 | plt.ylabel("Feature name")
51 | plt.yticks(numpy.arange(c_features), feature_names)
52 |
53 | def plot_labelled_scatter(X, y, class_labels):
54 | num_labels = len(class_labels)
55 |
56 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
57 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
58 |
59 | marker_array = ['o', '^', '*']
60 | color_array = ['#FFFF00', '#00AAFF', '#000000', '#FF00AA']
61 | cmap_bold = ListedColormap(color_array)
62 | bnorm = BoundaryNorm(numpy.arange(0, num_labels + 1, 1), ncolors=num_labels)
63 | plt.figure()
64 |
65 | plt.scatter(X[:, 0], X[:, 1], s=65, c=y, cmap=cmap_bold, norm = bnorm, alpha = 0.40, edgecolor='black', lw = 1)
66 |
67 | plt.xlim(x_min, x_max)
68 | plt.ylim(y_min, y_max)
69 |
70 | h = []
71 | for c in range(0, num_labels):
72 | h.append(mpatches.Patch(color=color_array[c], label=class_labels[c]))
73 | plt.legend(handles=h)
74 |
75 | plt.show()
76 |
77 |
78 | def plot_class_regions_for_classifier_subplot(clf, X, y, X_test, y_test, title, subplot, target_names = None, plot_decision_regions = True):
79 |
80 | numClasses = numpy.amax(y) + 1
81 | color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
82 | color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
83 | cmap_light = ListedColormap(color_list_light[0:numClasses])
84 | cmap_bold = ListedColormap(color_list_bold[0:numClasses])
85 |
86 | h = 0.03
87 | k = 0.5
88 | x_plot_adjust = 0.1
89 | y_plot_adjust = 0.1
90 | plot_symbol_size = 50
91 |
92 | x_min = X[:, 0].min()
93 | x_max = X[:, 0].max()
94 | y_min = X[:, 1].min()
95 | y_max = X[:, 1].max()
96 | x2, y2 = numpy.meshgrid(numpy.arange(x_min-k, x_max+k, h), numpy.arange(y_min-k, y_max+k, h))
97 |
98 | P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
99 | P = P.reshape(x2.shape)
100 |
101 | if plot_decision_regions:
102 | subplot.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)
103 |
104 | subplot.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
105 | subplot.set_xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
106 | subplot.set_ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
107 |
108 | if (X_test is not None):
109 | subplot.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
110 | train_score = clf.score(X, y)
111 | test_score = clf.score(X_test, y_test)
112 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
113 |
114 | subplot.set_title(title)
115 |
116 | if (target_names is not None):
117 | legend_handles = []
118 | for i in range(0, len(target_names)):
119 | patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
120 | legend_handles.append(patch)
121 | subplot.legend(loc=0, handles=legend_handles)
122 |
123 |
124 | def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None, title=None, target_names = None, plot_decision_regions = True):
125 |
126 | numClasses = numpy.amax(y) + 1
127 | color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
128 | color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
129 | cmap_light = ListedColormap(color_list_light[0:numClasses])
130 | cmap_bold = ListedColormap(color_list_bold[0:numClasses])
131 |
132 | h = 0.03
133 | k = 0.5
134 | x_plot_adjust = 0.1
135 | y_plot_adjust = 0.1
136 | plot_symbol_size = 50
137 |
138 | x_min = X[:, 0].min()
139 | x_max = X[:, 0].max()
140 | y_min = X[:, 1].min()
141 | y_max = X[:, 1].max()
142 | x2, y2 = numpy.meshgrid(numpy.arange(x_min-k, x_max+k, h), numpy.arange(y_min-k, y_max+k, h))
143 |
144 | P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
145 | P = P.reshape(x2.shape)
146 | plt.figure()
147 | if plot_decision_regions:
148 | plt.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)
149 |
150 | plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
151 | plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
152 | plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
153 |
154 | if (X_test is not None):
155 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
156 | train_score = clf.score(X, y)
157 | test_score = clf.score(X_test, y_test)
158 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
159 |
160 | if (target_names is not None):
161 | legend_handles = []
162 | for i in range(0, len(target_names)):
163 | patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
164 | legend_handles.append(patch)
165 | plt.legend(loc=0, handles=legend_handles)
166 |
167 | if (title is not None):
168 | plt.title(title)
169 | plt.show()
170 |
171 | def plot_fruit_knn(X, y, n_neighbors, weights):
172 |     X_mat = X[['height', 'width']].values
173 |     y_mat = y.values
174 |
175 | # Create color maps
176 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF','#AFAFAF'])
177 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF','#AFAFAF'])
178 |
179 | clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
180 | clf.fit(X_mat, y_mat)
181 |
182 | # Plot the decision boundary by assigning a color in the color map
183 | # to each mesh point.
184 |
185 | mesh_step_size = .01 # step size in the mesh
186 | plot_symbol_size = 50
187 |
188 | x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
189 | y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
190 | xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size),
191 | numpy.arange(y_min, y_max, mesh_step_size))
192 | Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])
193 |
194 | # Put the result into a color plot
195 | Z = Z.reshape(xx.shape)
196 | plt.figure()
197 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
198 |
199 | # Plot training points
200 | plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black')
201 | plt.xlim(xx.min(), xx.max())
202 | plt.ylim(yy.min(), yy.max())
203 |
204 | patch0 = mpatches.Patch(color='#FF0000', label='apple')
205 | patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
206 | patch2 = mpatches.Patch(color='#0000FF', label='orange')
207 | patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
208 | plt.legend(handles=[patch0, patch1, patch2, patch3])
209 |
210 |
211 | plt.xlabel('height (cm)')
212 | plt.ylabel('width (cm)')
213 |
214 | plt.show()
215 |
216 | def plot_two_class_knn(X, y, n_neighbors, weights, X_test, y_test):
217 | X_mat = X
218 | y_mat = y
219 |
220 | # Create color maps
221 | cmap_light = ListedColormap(['#FFFFAA', '#AAFFAA', '#AAAAFF','#EFEFEF'])
222 | cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])
223 |
224 | clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
225 | clf.fit(X_mat, y_mat)
226 |
227 | # Plot the decision boundary by assigning a color in the color map
228 | # to each mesh point.
229 |
230 | mesh_step_size = .01 # step size in the mesh
231 | plot_symbol_size = 50
232 |
233 | x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
234 | y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
235 | xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size),
236 | numpy.arange(y_min, y_max, mesh_step_size))
237 | Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])
238 |
239 | # Put the result into a color plot
240 | Z = Z.reshape(xx.shape)
241 | plt.figure()
242 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
243 |
244 | # Plot training points
245 | plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black')
246 | plt.xlim(xx.min(), xx.max())
247 | plt.ylim(yy.min(), yy.max())
248 |
249 | title = "Neighbors = {}".format(n_neighbors)
250 | if (X_test is not None):
251 | train_score = clf.score(X_mat, y_mat)
252 | test_score = clf.score(X_test, y_test)
253 | title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
254 |
255 | patch0 = mpatches.Patch(color='#FFFF00', label='class 0')
256 | patch1 = mpatches.Patch(color='#000000', label='class 1')
257 | plt.legend(handles=[patch0, patch1])
258 |
259 | plt.xlabel('Feature 0')
260 | plt.ylabel('Feature 1')
261 | plt.title(title)
262 |
263 | plt.show()
264 |
--------------------------------------------------------------------------------
/readonly/fruit_data_with_colors.txt:
--------------------------------------------------------------------------------
1 | fruit_label fruit_name fruit_subtype mass width height color_score
2 | 1 apple granny_smith 192 8.4 7.3 0.55
3 | 1 apple granny_smith 180 8.0 6.8 0.59
4 | 1 apple granny_smith 176 7.4 7.2 0.60
5 | 2 mandarin mandarin 86 6.2 4.7 0.80
6 | 2 mandarin mandarin 84 6.0 4.6 0.79
7 | 2 mandarin mandarin 80 5.8 4.3 0.77
8 | 2 mandarin mandarin 80 5.9 4.3 0.81
9 | 2 mandarin mandarin 76 5.8 4.0 0.81
10 | 1 apple braeburn 178 7.1 7.8 0.92
11 | 1 apple braeburn 172 7.4 7.0 0.89
12 | 1 apple braeburn 166 6.9 7.3 0.93
13 | 1 apple braeburn 172 7.1 7.6 0.92
14 | 1 apple braeburn 154 7.0 7.1 0.88
15 | 1 apple golden_delicious 164 7.3 7.7 0.70
16 | 1 apple golden_delicious 152 7.6 7.3 0.69
17 | 1 apple golden_delicious 156 7.7 7.1 0.69
18 | 1 apple golden_delicious 156 7.6 7.5 0.67
19 | 1 apple golden_delicious 168 7.5 7.6 0.73
20 | 1 apple cripps_pink 162 7.5 7.1 0.83
21 | 1 apple cripps_pink 162 7.4 7.2 0.85
22 | 1 apple cripps_pink 160 7.5 7.5 0.86
23 | 1 apple cripps_pink 156 7.4 7.4 0.84
24 | 1 apple cripps_pink 140 7.3 7.1 0.87
25 | 1 apple cripps_pink 170 7.6 7.9 0.88
26 | 3 orange spanish_jumbo 342 9.0 9.4 0.75
27 | 3 orange spanish_jumbo 356 9.2 9.2 0.75
28 | 3 orange spanish_jumbo 362 9.6 9.2 0.74
29 | 3 orange selected_seconds 204 7.5 9.2 0.77
30 | 3 orange selected_seconds 140 6.7 7.1 0.72
31 | 3 orange selected_seconds 160 7.0 7.4 0.81
32 | 3 orange selected_seconds 158 7.1 7.5 0.79
33 | 3 orange selected_seconds 210 7.8 8.0 0.82
34 | 3 orange selected_seconds 164 7.2 7.0 0.80
35 | 3 orange turkey_navel 190 7.5 8.1 0.74
36 | 3 orange turkey_navel 142 7.6 7.8 0.75
37 | 3 orange turkey_navel 150 7.1 7.9 0.75
38 | 3 orange turkey_navel 160 7.1 7.6 0.76
39 | 3 orange turkey_navel 154 7.3 7.3 0.79
40 | 3 orange turkey_navel 158 7.2 7.8 0.77
41 | 3 orange turkey_navel 144 6.8 7.4 0.75
42 | 3 orange turkey_navel 154 7.1 7.5 0.78
43 | 3 orange turkey_navel 180 7.6 8.2 0.79
44 | 3 orange turkey_navel 154 7.2 7.2 0.82
45 | 4 lemon spanish_belsan 194 7.2 10.3 0.70
46 | 4 lemon spanish_belsan 200 7.3 10.5 0.72
47 | 4 lemon spanish_belsan 186 7.2 9.2 0.72
48 | 4 lemon spanish_belsan 216 7.3 10.2 0.71
49 | 4 lemon spanish_belsan 196 7.3 9.7 0.72
50 | 4 lemon spanish_belsan 174 7.3 10.1 0.72
51 | 4 lemon unknown 132 5.8 8.7 0.73
52 | 4 lemon unknown 130 6.0 8.2 0.71
53 | 4 lemon unknown 116 6.0 7.5 0.72
54 | 4 lemon unknown 118 5.9 8.0 0.72
55 | 4 lemon unknown 120 6.0 8.4 0.74
56 | 4 lemon unknown 116 6.1 8.5 0.71
57 | 4 lemon unknown 116 6.3 7.7 0.72
58 | 4 lemon unknown 116 5.9 8.1 0.73
59 | 4 lemon unknown 152 6.5 8.5 0.72
60 | 4 lemon unknown 118 6.1 8.1 0.70
61 |
--------------------------------------------------------------------------------
/readonly/polynomialreg1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/agniiyer/Applied-Machine-Learning-in-Python/5bcc46f251f7983f79998004a244f38cf43aa38b/readonly/polynomialreg1.png
--------------------------------------------------------------------------------