├── ML with Python Certificate.pdf
├── Module 1 Quiz .pdf
├── Module 2 Quiz.pdf
├── Module 3 Quiz .pdf
├── Module 4 Quiz .pdf
├── README.md
├── week_1_Assignment.ipynb
├── week_2_Assignment.ipynb
├── week_3_Assignment.ipynb
└── week_4_Assignment.ipynb
/ML with Python Certificate.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavabhaysharma/Coursera-Applied-Machine-Learning-with-Python-/3b68c33a1bc8b87a0fd51ec12a335881aa9d892c/ML with Python Certificate.pdf
--------------------------------------------------------------------------------
/Module 1 Quiz .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavabhaysharma/Coursera-Applied-Machine-Learning-with-Python-/3b68c33a1bc8b87a0fd51ec12a335881aa9d892c/Module 1 Quiz .pdf
--------------------------------------------------------------------------------
/Module 2 Quiz.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavabhaysharma/Coursera-Applied-Machine-Learning-with-Python-/3b68c33a1bc8b87a0fd51ec12a335881aa9d892c/Module 2 Quiz.pdf
--------------------------------------------------------------------------------
/Module 3 Quiz .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavabhaysharma/Coursera-Applied-Machine-Learning-with-Python-/3b68c33a1bc8b87a0fd51ec12a335881aa9d892c/Module 3 Quiz .pdf
--------------------------------------------------------------------------------
/Module 4 Quiz .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavabhaysharma/Coursera-Applied-Machine-Learning-with-Python-/3b68c33a1bc8b87a0fd51ec12a335881aa9d892c/Module 4 Quiz .pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Coursera-Applied-Machine-Learning-with-Python-
2 | This repository contains solutions to all the assignments from the University of Michigan's Applied Machine Learning with Python course on Coursera.
3 |
--------------------------------------------------------------------------------
/week_2_Assignment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.5** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 2\n",
19 | "\n",
20 | "In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.\n",
21 | "\n",
22 | "## Part 1 - Regression"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "First, run the following block to set up the variables needed for later sections."
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 152,
35 | "metadata": {
36 | "collapsed": false,
37 | "scrolled": true
38 | },
39 | "outputs": [],
40 | "source": [
41 | "import numpy as np\n",
42 | "import pandas as pd\n",
43 | "from sklearn.model_selection import train_test_split\n",
44 | "\n",
45 | "\n",
46 | "np.random.seed(0)\n",
47 | "n = 15\n",
48 | "x = np.linspace(0,10,n) + np.random.randn(n)/5\n",
49 | "y = np.sin(x)+x/6 + np.random.randn(n)/10\n",
50 | "\n",
51 | "\n",
52 | "X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)\n",
53 | "\n",
54 | "# You can use this function to help you visualize the dataset by\n",
55 | "# plotting a scatterplot of the data points\n",
56 | "# in the training and test sets.\n",
57 | "def part1_scatter():\n",
58 | " plt.figure()\n",
59 | " plt.scatter(X_train, y_train, label='training data')\n",
60 | " plt.scatter(X_test, y_test, label='test data')\n",
61 | " plt.legend(loc=4);\n",
62 | " \n",
63 | " \n",
64 | "# NOTE: Uncomment the function below to visualize the data, but be sure \n",
65 | "# to **re-comment it before submitting this assignment to the autograder**. \n",
66 | "#part1_scatter()"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "### Question 1\n",
74 | "\n",
75 | "Write a function that fits a polynomial LinearRegression model on the *training data* `X_train` for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. `np.linspace(0,10,100)`) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.\n",
76 | "\n",
77 | "\n",
78 | "\n",
79 | "The figure above shows the fitted models plotted on top of the original data (using `plot_one()`).\n",
80 | "\n",
81 | "
\n",
82 | "*This function should return a numpy array with shape `(4, 100)`*"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 153,
88 | "metadata": {
89 | "collapsed": false
90 | },
91 | "outputs": [
92 | {
93 | "data": {
94 | "text/plain": [
95 | "array([[ 2.53040195e-01, 2.69201547e-01, 2.85362899e-01,\n",
96 | " 3.01524251e-01, 3.17685603e-01, 3.33846955e-01,\n",
97 | " 3.50008306e-01, 3.66169658e-01, 3.82331010e-01,\n",
98 | " 3.98492362e-01, 4.14653714e-01, 4.30815066e-01,\n",
99 | " 4.46976417e-01, 4.63137769e-01, 4.79299121e-01,\n",
100 | " 4.95460473e-01, 5.11621825e-01, 5.27783177e-01,\n",
101 | " 5.43944529e-01, 5.60105880e-01, 5.76267232e-01,\n",
102 | " 5.92428584e-01, 6.08589936e-01, 6.24751288e-01,\n",
103 | " 6.40912640e-01, 6.57073992e-01, 6.73235343e-01,\n",
104 | " 6.89396695e-01, 7.05558047e-01, 7.21719399e-01,\n",
105 | " 7.37880751e-01, 7.54042103e-01, 7.70203454e-01,\n",
106 | " 7.86364806e-01, 8.02526158e-01, 8.18687510e-01,\n",
107 | " 8.34848862e-01, 8.51010214e-01, 8.67171566e-01,\n",
108 | " 8.83332917e-01, 8.99494269e-01, 9.15655621e-01,\n",
109 | " 9.31816973e-01, 9.47978325e-01, 9.64139677e-01,\n",
110 | " 9.80301028e-01, 9.96462380e-01, 1.01262373e+00,\n",
111 | " 1.02878508e+00, 1.04494644e+00, 1.06110779e+00,\n",
112 | " 1.07726914e+00, 1.09343049e+00, 1.10959184e+00,\n",
113 | " 1.12575320e+00, 1.14191455e+00, 1.15807590e+00,\n",
114 | " 1.17423725e+00, 1.19039860e+00, 1.20655995e+00,\n",
115 | " 1.22272131e+00, 1.23888266e+00, 1.25504401e+00,\n",
116 | " 1.27120536e+00, 1.28736671e+00, 1.30352807e+00,\n",
117 | " 1.31968942e+00, 1.33585077e+00, 1.35201212e+00,\n",
118 | " 1.36817347e+00, 1.38433482e+00, 1.40049618e+00,\n",
119 | " 1.41665753e+00, 1.43281888e+00, 1.44898023e+00,\n",
120 | " 1.46514158e+00, 1.48130294e+00, 1.49746429e+00,\n",
121 | " 1.51362564e+00, 1.52978699e+00, 1.54594834e+00,\n",
122 | " 1.56210969e+00, 1.57827105e+00, 1.59443240e+00,\n",
123 | " 1.61059375e+00, 1.62675510e+00, 1.64291645e+00,\n",
124 | " 1.65907781e+00, 1.67523916e+00, 1.69140051e+00,\n",
125 | " 1.70756186e+00, 1.72372321e+00, 1.73988457e+00,\n",
126 | " 1.75604592e+00, 1.77220727e+00, 1.78836862e+00,\n",
127 | " 1.80452997e+00, 1.82069132e+00, 1.83685268e+00,\n",
128 | " 1.85301403e+00],\n",
129 | " [ 1.22989539e+00, 1.15143628e+00, 1.07722393e+00,\n",
130 | " 1.00717881e+00, 9.41221419e-01, 8.79272234e-01,\n",
131 | " 8.21251741e-01, 7.67080426e-01, 7.16678772e-01,\n",
132 | " 6.69967266e-01, 6.26866391e-01, 5.87296632e-01,\n",
133 | " 5.51178474e-01, 5.18432402e-01, 4.88978901e-01,\n",
134 | " 4.62738455e-01, 4.39631549e-01, 4.19578668e-01,\n",
135 | " 4.02500297e-01, 3.88316920e-01, 3.76949022e-01,\n",
136 | " 3.68317088e-01, 3.62341603e-01, 3.58943051e-01,\n",
137 | " 3.58041918e-01, 3.59558687e-01, 3.63413845e-01,\n",
138 | " 3.69527874e-01, 3.77821261e-01, 3.88214491e-01,\n",
139 | " 4.00628046e-01, 4.14982414e-01, 4.31198078e-01,\n",
140 | " 4.49195522e-01, 4.68895233e-01, 4.90217694e-01,\n",
141 | " 5.13083391e-01, 5.37412808e-01, 5.63126429e-01,\n",
142 | " 5.90144741e-01, 6.18388226e-01, 6.47777371e-01,\n",
143 | " 6.78232660e-01, 7.09674578e-01, 7.42023609e-01,\n",
144 | " 7.75200238e-01, 8.09124950e-01, 8.43718230e-01,\n",
145 | " 8.78900563e-01, 9.14592432e-01, 9.50714324e-01,\n",
146 | " 9.87186723e-01, 1.02393011e+00, 1.06086498e+00,\n",
147 | " 1.09791181e+00, 1.13499108e+00, 1.17202328e+00,\n",
148 | " 1.20892890e+00, 1.24562842e+00, 1.28204233e+00,\n",
149 | " 1.31809110e+00, 1.35369523e+00, 1.38877520e+00,\n",
150 | " 1.42325149e+00, 1.45704459e+00, 1.49007498e+00,\n",
151 | " 1.52226316e+00, 1.55352959e+00, 1.58379478e+00,\n",
152 | " 1.61297919e+00, 1.64100332e+00, 1.66778766e+00,\n",
153 | " 1.69325268e+00, 1.71731887e+00, 1.73990672e+00,\n",
154 | " 1.76093671e+00, 1.78032933e+00, 1.79800506e+00,\n",
155 | " 1.81388438e+00, 1.82788778e+00, 1.83993575e+00,\n",
156 | " 1.84994877e+00, 1.85784732e+00, 1.86355189e+00,\n",
157 | " 1.86698296e+00, 1.86806103e+00, 1.86670656e+00,\n",
158 | " 1.86284006e+00, 1.85638200e+00, 1.84725286e+00,\n",
159 | " 1.83537314e+00, 1.82066332e+00, 1.80304388e+00,\n",
160 | " 1.78243530e+00, 1.75875808e+00, 1.73193269e+00,\n",
161 | " 1.70187963e+00, 1.66851936e+00, 1.63177240e+00,\n",
162 | " 1.59155920e+00],\n",
163 | " [ -1.99554310e-01, -3.95192724e-03, 1.79851752e-01,\n",
164 | " 3.51005136e-01, 5.08831706e-01, 6.52819233e-01,\n",
165 | " 7.82609240e-01, 8.97986721e-01, 9.98870117e-01,\n",
166 | " 1.08530155e+00, 1.15743729e+00, 1.21553852e+00,\n",
167 | " 1.25996233e+00, 1.29115292e+00, 1.30963316e+00,\n",
168 | " 1.31599632e+00, 1.31089811e+00, 1.29504889e+00,\n",
169 | " 1.26920626e+00, 1.23416782e+00, 1.19076415e+00,\n",
170 | " 1.13985218e+00, 1.08230867e+00, 1.01902405e+00,\n",
171 | " 9.50896441e-01, 8.78825970e-01, 8.03709344e-01,\n",
172 | " 7.26434655e-01, 6.47876457e-01, 5.68891088e-01,\n",
173 | " 4.90312256e-01, 4.12946874e-01, 3.37571147e-01,\n",
174 | " 2.64926923e-01, 1.95718291e-01, 1.30608438e-01,\n",
175 | " 7.02167560e-02, 1.51162118e-02, -3.41690366e-02,\n",
176 | " -7.71657636e-02, -1.13453547e-01, -1.42666382e-01,\n",
177 | " -1.64494044e-01, -1.78683194e-01, -1.85038228e-01,\n",
178 | " -1.83421873e-01, -1.73755533e-01, -1.56019368e-01,\n",
179 | " -1.30252132e-01, -9.65507462e-02, -5.50696232e-02,\n",
180 | " -6.01973198e-03, 5.03325883e-02, 1.13667071e-01,\n",
181 | " 1.83611221e-01, 2.59742264e-01, 3.41589357e-01,\n",
182 | " 4.28636046e-01, 5.20322987e-01, 6.16050916e-01,\n",
183 | " 7.15183874e-01, 8.17052690e-01, 9.20958717e-01,\n",
184 | " 1.02617782e+00, 1.13196463e+00, 1.23755703e+00,\n",
185 | " 1.34218093e+00, 1.44505526e+00, 1.54539723e+00,\n",
186 | " 1.64242789e+00, 1.73537785e+00, 1.82349336e+00,\n",
187 | " 1.90604254e+00, 1.98232198e+00, 2.05166348e+00,\n",
188 | " 2.11344114e+00, 2.16707864e+00, 2.21205680e+00,\n",
189 | " 2.24792141e+00, 2.27429129e+00, 2.29086658e+00,\n",
190 | " 2.29743739e+00, 2.29389257e+00, 2.28022881e+00,\n",
191 | " 2.25656001e+00, 2.22312684e+00, 2.18030664e+00,\n",
192 | " 2.12862347e+00, 2.06875850e+00, 2.00156065e+00,\n",
193 | " 1.92805743e+00, 1.84946605e+00, 1.76720485e+00,\n",
194 | " 1.68290491e+00, 1.59842194e+00, 1.51584842e+00,\n",
195 | " 1.43752602e+00, 1.36605824e+00, 1.30432333e+00,\n",
196 | " 1.25548743e+00],\n",
197 | " [ 6.79502285e+00, 4.14319957e+00, 2.23123322e+00,\n",
198 | " 9.10495532e-01, 5.49803315e-02, -4.41344457e-01,\n",
199 | " -6.66950444e-01, -6.94942887e-01, -5.85049614e-01,\n",
200 | " -3.85418417e-01, -1.34236065e-01, 1.38818559e-01,\n",
201 | " 4.11275202e-01, 6.66715442e-01, 8.93747460e-01,\n",
202 | " 1.08510202e+00, 1.23683979e+00, 1.34766069e+00,\n",
203 | " 1.41830632e+00, 1.45104724e+00, 1.44924694e+00,\n",
204 | " 1.41699534e+00, 1.35880444e+00, 1.27935985e+00,\n",
205 | " 1.18332182e+00, 1.07516995e+00, 9.59086410e-01,\n",
206 | " 8.38872457e-01, 7.17893658e-01, 5.99049596e-01,\n",
207 | " 4.84764051e-01, 3.76992063e-01, 2.77240599e-01,\n",
208 | " 1.86599822e-01, 1.05782272e-01, 3.51675757e-02,\n",
209 | " -2.51494865e-02, -7.53094019e-02, -1.15638484e-01,\n",
210 | " -1.46600958e-01, -1.68753745e-01, -1.82704910e-01,\n",
211 | " -1.89076542e-01, -1.88472636e-01, -1.81452388e-01,\n",
212 | " -1.68509141e-01, -1.50055083e-01, -1.26411638e-01,\n",
213 | " -9.78053923e-02, -6.43692604e-02, -2.61485139e-02,\n",
214 | " 1.68888091e-02, 6.48376626e-02, 1.17838541e-01,\n",
215 | " 1.76057485e-01, 2.39664260e-01, 3.08809443e-01,\n",
216 | " 3.83601186e-01, 4.64082407e-01, 5.50209170e-01,\n",
217 | " 6.41830991e-01, 7.38673768e-01, 8.40326006e-01,\n",
218 | " 9.46228923e-01, 1.05567100e+00, 1.16778742e+00,\n",
219 | " 1.28156471e+00, 1.39585100e+00, 1.50937183e+00,\n",
220 | " 1.62075165e+00, 1.72854097e+00, 1.83124862e+00,\n",
221 | " 1.92737898e+00, 2.01547331e+00, 2.09415458e+00,\n",
222 | " 2.16217465e+00, 2.21846257e+00, 2.26217273e+00,\n",
223 | " 2.29273094e+00, 2.30987668e+00, 2.31369926e+00,\n",
224 | " 2.30466539e+00, 2.28363551e+00, 2.25186569e+00,\n",
225 | " 2.21099186e+00, 2.16299265e+00, 2.11012671e+00,\n",
226 | " 2.05484041e+00, 1.99964089e+00, 1.94692956e+00,\n",
227 | " 1.89879060e+00, 1.85672836e+00, 1.82134774e+00,\n",
228 | " 1.79197049e+00, 1.76618058e+00, 1.73929091e+00,\n",
229 | " 1.70372341e+00, 1.64829405e+00, 1.55739372e+00,\n",
230 | " 1.41005558e+00]])"
231 | ]
232 | },
233 | "execution_count": 153,
234 | "metadata": {},
235 | "output_type": "execute_result"
236 | }
237 | ],
238 | "source": [
239 | "def answer_one():\n",
240 | " from sklearn.linear_model import LinearRegression \n",
241 | " from sklearn.preprocessing import PolynomialFeatures \n",
242 | " # To capture interactions between the original features by adding them as features to the Linear model.\n",
243 | "\n",
244 | " clf = LinearRegression() \n",
245 | " preds =np.zeros((4,100)) \n",
246 | " X_input = np.linspace(0,10,100) #Given requirement \n",
247 | " orders = [1,3,6,9] #Given requirement\n",
248 | "\n",
249 | " for i in range(len(orders)):\n",
250 | " poly = PolynomialFeatures (orders[i]) # Object to add polynomial features.\n",
251 | "\n",
252 | " #Add polynomial features to training data and input data:\n",
253 | "\n",
254 | " #Need to transpose X_train and X_input for poly fit to Mork.\n",
255 | "\n",
256 | " X_train_poly = poly.fit_transform(X_train[None].T) \n",
257 | " X_input_poly = poly.fit_transform(X_input[None].T)\n",
258 | "\n",
259 | " #Train Linear regression classifier with training data:\n",
260 | " clf.fit(X_train_poly, y_train)\n",
261 | "\n",
262 | " #Get predictions from Linear classifier using transformed input data:\n",
263 | " preds[i,:]=clf.predict(X_input_poly)\n",
264 | " return preds\n",
265 | "answer_one()"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 154,
271 | "metadata": {
272 | "collapsed": true
273 | },
274 | "outputs": [],
275 | "source": [
276 | "# feel free to use the function plot_one() to replicate the figure \n",
277 | "# from the prompt once you have completed question one\n",
278 | "def plot_one(degree_predictions):\n",
279 | " import matplotlib.pyplot as plt\n",
280 | " #%matplotlib notebook\n",
281 | " plt.figure(figsize=(10,5))\n",
282 | " plt.plot(X_train, y_train, 'o', label='training data', markersize=10)\n",
283 | " plt.plot(X_test, y_test, 'o', label='test data', markersize=10)\n",
284 | " for i,degree in enumerate([1,3,6,9]):\n",
285 | " plt.plot(np.linspace(0,10,100), degree_predictions[i], alpha=0.8, lw=2, label='degree={}'.format(degree))\n",
286 | " plt.ylim(-1,2.5)\n",
287 | " plt.legend(loc=4)\n",
288 | "\n",
289 | "#plot_one(answer_one())"
290 | ]
291 | },
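{
"cell_type": "markdown",
"metadata": {},
"source": [
"The degree-`d` fit in `answer_one` can also be expressed with `sklearn.pipeline.make_pipeline`, which chains `PolynomialFeatures` and `LinearRegression` into a single estimator. The cell below is an illustrative sketch of that alternative, not the graded solution; `polyfit_predict` is a made-up helper name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Illustrative sketch: one polynomial fit expressed as a pipeline.\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"def polyfit_predict(degree, X_tr, y_tr, X_in):\n",
"    # reshape(-1, 1) turns the 1-D arrays into (n_samples, 1) matrices.\n",
"    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())\n",
"    model.fit(X_tr.reshape(-1, 1), y_tr)\n",
"    return model.predict(X_in.reshape(-1, 1))\n",
"\n",
"# e.g.: polyfit_predict(3, X_train, y_train, np.linspace(0, 10, 100))"
]
},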
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "### Question 2\n",
297 | "\n",
298 | "Write a function that fits a polynomial LinearRegression model on the training data `X_train` for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the the test data, and return both of these arrays in a tuple.\n",
299 | "\n",
300 | "*This function should return one tuple of numpy arrays `(r2_train, r2_test)`. Both arrays should have shape `(10,)`*"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 155,
306 | "metadata": {
307 | "collapsed": false
308 | },
309 | "outputs": [
310 | {
311 | "data": {
312 | "text/plain": [
313 | "([0.0,\n",
314 | " 0.42924577812346632,\n",
315 | " 0.4510998044408247,\n",
316 | " 0.58719953687798465,\n",
317 | " 0.91941944717693436,\n",
318 | " 0.97578641430682345,\n",
319 | " 0.99018233247950815,\n",
320 | " 0.99352509278403633,\n",
321 | " 0.99637545387765036,\n",
322 | " 0.99803706256653058],\n",
323 | " [-0.47808641737141788,\n",
324 | " -0.45237104233936676,\n",
325 | " -0.068569841499158901,\n",
326 | " 0.0053310529457363254,\n",
327 | " 0.73004942818707141,\n",
328 | " 0.87708300916999382,\n",
329 | " 0.92140939814150025,\n",
330 | " 0.92021504127165776,\n",
331 | " 0.63247950833243671,\n",
332 | " -0.64525377099038894])"
333 | ]
334 | },
335 | "execution_count": 155,
336 | "metadata": {},
337 | "output_type": "execute_result"
338 | }
339 | ],
340 | "source": [
341 | "def answer_two():\n",
342 | " from sklearn.linear_model import LinearRegression\n",
343 | " from sklearn.preprocessing import PolynomialFeatures\n",
344 | " from sklearn.metrics.regression import r2_score\n",
345 | "\n",
346 | " r2_test, r2_train=[],[]\n",
347 | " \n",
348 | " for i in range(10):\n",
349 | " poly = PolynomialFeatures(i)\n",
350 | " X_train_poly = poly.fit_transform(X_train[None].T)\n",
351 | " X_test_poly = poly.fit_transform(X_test[None].T)\n",
352 | " linreg = LinearRegression().fit(X_train_poly, y_train)\n",
353 | " r2_train.append(linreg.score(X_train_poly, y_train))\n",
354 | " r2_test.append(linreg.score(X_test_poly, y_test))\n",
355 | " \n",
356 | " return (r2_train, r2_test)\n",
357 | "\n",
358 | "answer_two()\n"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "### Question 3\n",
366 | "\n",
367 | "Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset? \n",
368 | "\n",
369 | "Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship between degree level and $R^2$. Remember to comment out the import matplotlib line before submission.\n",
370 | "\n",
371 | "*This function should return one tuple with the degree values in this order: `(Underfitting, Overfitting, Good_Generalization)`. There might be multiple correct solutions, however, you only need to return one possible solution, for example, (1,2,3).* "
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 156,
377 | "metadata": {
378 | "collapsed": false
379 | },
380 | "outputs": [
381 | {
382 | "data": {
383 | "text/plain": [
384 | "(0, 9, 6)"
385 | ]
386 | },
387 | "execution_count": 156,
388 | "metadata": {},
389 | "output_type": "execute_result"
390 | }
391 | ],
392 | "source": [
393 | "def answer_three():\n",
394 | "\n",
395 | " #(result train, result test) - answer two) \n",
396 | " #plt.figure(figsize-(10,5))\n",
397 | " #plt.plot (range(@,10,1),result_train, 'r', Label-'training data') \n",
398 | " #plt.plot(range(e,10,1), result_test, 'b', Label-'test data') plt.xlabel ('Order') \n",
399 | " #plt.ylabel('Score) \n",
400 | " #plt.Legend()\n",
401 | "\n",
402 | " return (0,9,6)\n",
403 | "\n",
404 | "answer_three()"
405 | ]
406 | },
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "### Question 4\n",
412 | "\n",
413 | "Training models on high degree polynomial features can result in overly complex models that overfit, so we often use regularized versions of the model to constrain model complexity, as we saw with Ridge and Lasso linear regression.\n",
414 | "\n",
415 | "For this question, train two models: a non-regularized LinearRegression model (default parameters) and a regularized Lasso Regression model (with parameters `alpha=0.01`, `max_iter=10000`) both on polynomial features of degree 12. Return the $R^2$ score for both the LinearRegression and Lasso model's test sets.\n",
416 | "\n",
417 | "*This function should return one tuple `(LinearRegression_R2_test_score, Lasso_R2_test_score)`*"
418 | ]
419 | },
420 | {
421 | "cell_type": "code",
422 | "execution_count": 157,
423 | "metadata": {
424 | "collapsed": false
425 | },
426 | "outputs": [
427 | {
428 | "name": "stderr",
429 | "output_type": "stream",
430 | "text": [
431 | "/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.\n",
432 | " ConvergenceWarning)\n"
433 | ]
434 | },
435 | {
436 | "data": {
437 | "text/plain": [
438 | "(-4.3120017974975458, 0.84330858396117558)"
439 | ]
440 | },
441 | "execution_count": 157,
442 | "metadata": {},
443 | "output_type": "execute_result"
444 | }
445 | ],
446 | "source": [
447 | "def answer_four():\n",
448 | "\n",
449 | " from sklearn.preprocessing import PolynomialFeatures \n",
450 | " from sklearn.linear_model import Lasso, LinearRegression \n",
451 | " from sklearn.metrics.regression import r2_score\n",
452 | "\n",
453 | " poly=PolynomialFeatures (12) \n",
454 | " X_train_poly = poly.fit_transform(X_train[None].T) \n",
455 | " X_test_poly = poly.fit_transform(X_test[None].T)\n",
456 | "\n",
457 | " linreg = LinearRegression().fit(X_train_poly,y_train) \n",
458 | " linlasso = Lasso(alpha=0.01, max_iter=18808).fit(X_train_poly,y_train)\n",
459 | "\n",
460 | " #Asked to find score for TEST SET!\n",
461 | "\n",
462 | " return (linreg.score(X_test_poly,y_test),linlasso.score(X_test_poly,y_test))\n",
463 | "\n",
464 | "answer_four()"
465 | ]
466 | },
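{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the Lasso model generalizes so much better on degree-12 features: L1 regularization drives most coefficients to exactly zero, while the unregularized fit compensates with very large, mutually cancelling coefficients that behave wildly away from the training points. A quick illustrative check, reusing the setup from `answer_four`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch: compare the two degree-12 coefficient vectors.\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.linear_model import Lasso, LinearRegression\n",
"\n",
"poly = PolynomialFeatures(12)\n",
"X_train_poly = poly.fit_transform(X_train[None].T)\n",
"\n",
"ols = LinearRegression().fit(X_train_poly, y_train)\n",
"lasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)\n",
"\n",
"print('OLS   largest |coef|:', np.abs(ols.coef_).max())\n",
"print('Lasso non-zero coefs:', int(np.sum(lasso.coef_ != 0)))"
]
},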
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "## Part 2 - Classification\n",
472 | "\n",
473 | "Here's an application of machine learning that could save your life! For this section of the assignment we will be working with the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `readonly/mushrooms.csv`. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:\n",
474 | "\n",
475 | "*Attribute Information:*\n",
476 | "\n",
477 | "1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s \n",
478 | "2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s \n",
479 | "3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y \n",
480 | "4. bruises?: bruises=t, no=f \n",
481 | "5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s \n",
482 | "6. gill-attachment: attached=a, descending=d, free=f, notched=n \n",
483 | "7. gill-spacing: close=c, crowded=w, distant=d \n",
484 | "8. gill-size: broad=b, narrow=n \n",
485 | "9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y \n",
486 | "10. stalk-shape: enlarging=e, tapering=t \n",
487 | "11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? \n",
488 | "12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s \n",
489 | "13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s \n",
490 | "14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y \n",
491 | "15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y \n",
492 | "16. veil-type: partial=p, universal=u \n",
493 | "17. veil-color: brown=n, orange=o, white=w, yellow=y \n",
494 | "18. ring-number: none=n, one=o, two=t \n",
495 | "19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z \n",
496 | "20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y \n",
497 | "21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y \n",
498 | "22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n",
499 | "\n",
500 | "
\n",
501 | "\n",
502 | "The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables. "
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 162,
508 | "metadata": {
509 | "collapsed": false
510 | },
511 | "outputs": [],
512 | "source": [
513 | "import pandas as pd\n",
514 | "import numpy as np\n",
515 | "from sklearn.model_selection import train_test_split\n",
516 | "\n",
517 | "\n",
518 | "mush_df = pd.read_csv('mushrooms.csv')\n",
519 | "mush_df2 = pd.get_dummies(mush_df)\n",
520 | "\n",
521 | "X_mush = mush_df2.iloc[:,2:]\n",
522 | "y_mush = mush_df2.iloc[:,1]\n",
523 | "\n",
524 | "# use the variables X_train2, y_train2 for Question 5\n",
525 | "X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)\n",
526 | "\n",
527 | "# For performance reasons in Questions 6 and 7, we will create a smaller version of the\n",
528 | "# entire mushroom dataset for use in those questions. For simplicity we'll just re-use\n",
529 | "# the 25% test split created above as the representative subset.\n",
530 | "#\n",
531 | "# Use the variables X_subset, y_subset for Questions 6 and 7.\n",
532 | "X_subset = X_test2\n",
533 | "y_subset = y_test2"
534 | ]
535 | },
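{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small demonstration of what `pd.get_dummies` does to a string-valued column (toy values, not the real mushroom rows):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy example: one-hot encoding a categorical column with pd.get_dummies.\n",
"toy = pd.DataFrame({'odor': ['a', 'n', 'f', 'n']})\n",
"pd.get_dummies(toy)\n",
"# -> indicator columns odor_a, odor_f, odor_n containing 0/1 values"
]
},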
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "### Question 5\n",
541 | "\n",
542 | "Using `X_train2` and `y_train2` from the preceeding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?\n",
543 | "\n",
544 | "As a reminder, the feature names are available in the `X_train2.columns` property, and the order of the features in `X_train2.columns` matches the order of the feature importance values in the classifier's `feature_importances_` property. \n",
545 | "\n",
546 | "*This function should return a list of length 5 containing the feature names in descending order of importance.*\n",
547 | "\n",
548 | "*Note: remember that you also need to set random_state in the DecisionTreeClassifier.*"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": 163,
554 | "metadata": {
555 | "collapsed": false
556 | },
557 | "outputs": [
558 | {
559 | "name": "stderr",
560 | "output_type": "stream",
561 | "text": [
562 | "/opt/conda/lib/python3.6/site-packages/ipykernel/__main__.py:6: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)\n"
563 | ]
564 | },
565 | {
566 | "data": {
567 | "text/plain": [
568 | "['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']"
569 | ]
570 | },
571 | "execution_count": 163,
572 | "metadata": {},
573 | "output_type": "execute_result"
574 | }
575 | ],
576 | "source": [
577 | "def answer_five():\n",
578 | "\n",
579 | " from sklearn.tree import DecisionTreeClassifier \n",
580 | " clf = DecisionTreeClassifier(random_state=0).fit(X_train2,y_train2) \n",
581 | " df = pd.DataFrame({'feature':X_train2.columns.values, 'feature importance': clf.feature_importances_})\n",
582 | " return df.sort(['feature importance'],ascending=0)['feature'].head(5).tolist()\n",
583 | "\n",
584 | "answer_five()"
585 | ]
586 | },
587 | {
588 | "cell_type": "markdown",
589 | "metadata": {},
590 | "source": [
591 | "### Question 6\n",
592 | "\n",
593 | "For this question, we're going to use the `validation_curve` function in `sklearn.model_selection` to determine training and test scores for a Support Vector Classifier (`SVC`) with varying parameter values. Recall that the validation_curve function, in addition to taking an initialized unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.\n",
594 | "\n",
595 | "**Because creating a validation curve requires fitting multiple models, for performance reasons this question will use just a subset of the original mushroom dataset: please use the variables X_subset and y_subset as input to the validation curve function (instead of X_mush and y_mush) to reduce computation time.**\n",
596 | "\n",
597 | "The initialized unfitted classifier object we'll be using is a Support Vector Classifier with radial basis kernel. So your first step is to create an `SVC` object with default parameters (i.e. `kernel='rbf', C=1`) and `random_state=0`. Recall that the kernel width of the RBF kernel is controlled using the `gamma` parameter. \n",
598 | "\n",
599 | "With this classifier, and the dataset in X_subset, y_subset, explore the effect of `gamma` on classifier accuracy by using the `validation_curve` function to find the training and test scores for 6 values of `gamma` from `0.0001` to `10` (i.e. `np.logspace(-4,1,6)`). Recall that you can specify what scoring metric you want validation_curve to use by setting the \"scoring\" parameter. In this case, we want to use \"accuracy\" as the scoring metric.\n",
600 | "\n",
601 | "For each level of `gamma`, `validation_curve` will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.\n",
602 | "\n",
603 | "Find the mean score across the three models for each level of `gamma` for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.\n",
604 | "\n",
605 | "e.g.\n",
606 | "\n",
607 | "if one of your array of scores is\n",
608 | "\n",
609 | " array([[ 0.5, 0.4, 0.6],\n",
610 | " [ 0.7, 0.8, 0.7],\n",
611 | " [ 0.9, 0.8, 0.8],\n",
612 | " [ 0.8, 0.7, 0.8],\n",
613 | " [ 0.7, 0.6, 0.6],\n",
614 | " [ 0.4, 0.6, 0.5]])\n",
615 | " \n",
616 | "it should then become\n",
617 | "\n",
618 | " array([ 0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])\n",
619 | "\n",
620 | "*This function should return one tuple of numpy arrays `(training_scores, test_scores)` where each array in the tuple has shape `(6,)`.*"
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": 164,
626 | "metadata": {
627 | "collapsed": false
628 | },
629 | "outputs": [
630 | {
631 | "data": {
632 | "text/plain": [
633 | "(array([ 0.56647847, 0.93155951, 0.99039881, 1. , 1. , 1. ]),\n",
634 | " array([ 0.56768547, 0.92959558, 0.98965952, 1. , 0.99507994,\n",
635 | " 0.52240279]))"
636 | ]
637 | },
638 | "execution_count": 164,
639 | "metadata": {},
640 | "output_type": "execute_result"
641 | }
642 | ],
643 | "source": [
644 | "def answer_six():\n",
645 | "\n",
646 | " from sklearn.svm import SVC \n",
647 | " from sklearn.model_selection import validation_curve\n",
648 | "\n",
649 | " # RBF kernel and C-1 are default anyway.\n",
650 | " param_range = np.logspace(-4,1,6) \n",
651 | " train_scores, test_scores = validation_curve(SVC(random_state=0),X_subset,y_subset, param_name='gamma', param_range=param_range) \n",
652 | "\n",
653 | " # Finds row-wise mean (i.e mean across column values).\n",
654 | "\n",
655 | " return (np.array(list(map(np.mean,train_scores))),np.array(list(map(np.mean,test_scores))))\n",
656 | "\n",
657 | "answer_six()"
658 | ]
659 | },
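{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row-wise averaging described in the prompt is exactly `np.mean` over `axis=1`; checking it against the prompt's example array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The prompt's example 6x3 score array, averaged across the 3 folds.\n",
"scores = np.array([[0.5, 0.4, 0.6],\n",
"                   [0.7, 0.8, 0.7],\n",
"                   [0.9, 0.8, 0.8],\n",
"                   [0.8, 0.7, 0.8],\n",
"                   [0.7, 0.6, 0.6],\n",
"                   [0.4, 0.6, 0.5]])\n",
"scores.mean(axis=1)\n",
"# -> array([0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])"
]
},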
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "### Question 7\n",
665 | "\n",
666 | "Based on the scores from question 6, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)? \n",
667 | "\n",
668 | "Hint: Try plotting the scores from question 6 to visualize the relationship between gamma and accuracy. Remember to comment out the import matplotlib line before submission.\n",
669 | "\n",
670 | "*This function should return one tuple with the degree values in this order: `(Underfitting, Overfitting, Good_Generalization)` Please note there is only one correct solution.*"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": 165,
676 | "metadata": {
677 | "collapsed": false
678 | },
679 | "outputs": [
680 | {
681 | "data": {
682 | "text/plain": [
683 | "(0.0001, 10, 0.1)"
684 | ]
685 | },
686 | "execution_count": 165,
687 | "metadata": {},
688 | "output_type": "execute_result"
689 | }
690 | ],
691 | "source": [
692 | "def answer_seven():\n",
693 | "\n",
694 | " #train_scores, test scores answers() \n",
695 | " #plt. figure() \n",
696 | " #pit.plot(np.logspoce(-4,1,6), train scores, '', Label-Training scores\") \n",
697 | " #plt.plot(np.Logspace(-4,1,6), test scores, 'r', Label-\"Test scores') \n",
698 | " #plt.xscale(\"Log = Sets x axis to a Log scale.\n",
699 | " #plt. xLabel('Goma) \n",
700 | " #plt.ylabel('Scores') \n",
701 | " #plt. Legend()\n",
702 | "\n",
703 | " return (0.0001,10,0.1)\n",
704 | "\n",
705 | "answer_seven()"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": null,
711 | "metadata": {
712 | "collapsed": true
713 | },
714 | "outputs": [],
715 | "source": []
716 | },
717 | {
718 | "cell_type": "code",
719 | "execution_count": null,
720 | "metadata": {
721 | "collapsed": true
722 | },
723 | "outputs": [],
724 | "source": []
725 | },
726 | {
727 | "cell_type": "code",
728 | "execution_count": null,
729 | "metadata": {
730 | "collapsed": true
731 | },
732 | "outputs": [],
733 | "source": []
734 | }
735 | ],
736 | "metadata": {
737 | "coursera": {
738 | "course_slug": "python-machine-learning",
739 | "graded_item_id": "eWYHL",
740 | "launcher_item_id": "BAqef",
741 | "part_id": "fXXRp"
742 | },
743 | "kernelspec": {
744 | "display_name": "Python 3",
745 | "language": "python",
746 | "name": "python3"
747 | },
748 | "language_info": {
749 | "codemirror_mode": {
750 | "name": "ipython",
751 | "version": 3
752 | },
753 | "file_extension": ".py",
754 | "mimetype": "text/x-python",
755 | "name": "python",
756 | "nbconvert_exporter": "python",
757 | "pygments_lexer": "ipython3",
758 | "version": "3.6.2"
759 | }
760 | },
761 | "nbformat": 4,
762 | "nbformat_minor": 2
763 | }
764 |
765 |
--------------------------------------------------------------------------------
/week_3_Assignment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 3 - Evaluation\n",
19 | "\n",
20 | "In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).\n",
21 | " \n",
22 | "Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. \n",
23 | " \n",
24 | "The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {
31 | "collapsed": true
32 | },
33 | "outputs": [],
34 | "source": [
35 | "import numpy as np\n",
36 | "import pandas as pd"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### Question 1\n",
44 | "Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?\n",
45 | "\n",
46 | "*This function should return a float between 0 and 1.* "
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "metadata": {
53 | "collapsed": false
54 | },
55 | "outputs": [
56 | {
57 | "data": {
58 | "text/plain": [
59 | "0.016684632328818484"
60 | ]
61 | },
62 | "execution_count": 2,
63 | "metadata": {},
64 | "output_type": "execute_result"
65 | }
66 | ],
67 | "source": [
68 | "def answer_one():\n",
69 | " \n",
70 | " # Your code here\n",
71 | " df = pd.read_csv('fraud_data.csv')\n",
72 | " ans = (len(df[df['Class'] == 1]) / len(df[df['Class'] == 0]))\n",
73 | " return ans\n",
74 | "\n",
75 | "answer_one()\n"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 10,
81 | "metadata": {
82 | "collapsed": true
83 | },
84 | "outputs": [],
85 | "source": [
86 | "# Use X_train, X_test, y_train, y_test for all of the following questions\n",
87 | "from sklearn.model_selection import train_test_split\n",
88 | "\n",
89 | "df = pd.read_csv('fraud_data.csv')\n",
90 | "\n",
91 | "X = df.iloc[:,:-1]\n",
92 | "y = df.iloc[:,-1]\n",
93 | "\n",
94 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "### Question 2\n",
102 | "\n",
103 | "Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?\n",
104 | "\n",
105 | "*This function should a return a tuple with two floats, i.e. `(accuracy score, recall score)`.*"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 11,
111 | "metadata": {
112 | "collapsed": false
113 | },
114 | "outputs": [
115 | {
116 | "data": {
117 | "text/plain": [
118 | "(0.98525073746312686, 0.0)"
119 | ]
120 | },
121 | "execution_count": 11,
122 | "metadata": {},
123 | "output_type": "execute_result"
124 | }
125 | ],
126 | "source": [
127 | "def answer_two():\n",
128 | " from sklearn.dummy import DummyClassifier\n",
129 | " from sklearn.metrics import recall_score, accuracy_score\n",
130 | " \n",
131 | " # Your code here\n",
132 | " \n",
133 | " dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)\n",
134 | " # Therefore the dummy 'most_frequent' classifier always predicts class 0\n",
135 | " y_dummy_predictions = dummy_majority.predict(X_test)\n",
136 | "\n",
137 | " ans = (accuracy_score(y_test, y_dummy_predictions), recall_score(y_test, y_dummy_predictions))\n",
138 | " \n",
139 | " return ans\n",
140 | "answer_two()\n"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "### Question 3\n",
148 | "\n",
149 | "Using X_train, X_test, y_train, y_test (as defined above), train a SVC classifer using the default parameters. What is the accuracy, recall, and precision of this classifier?\n",
150 | "\n",
151 | "*This function should a return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 12,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [
161 | {
162 | "data": {
163 | "text/plain": [
164 | "(0.99078171091445433, 0.375, 1.0)"
165 | ]
166 | },
167 | "execution_count": 12,
168 | "metadata": {},
169 | "output_type": "execute_result"
170 | }
171 | ],
172 | "source": [
173 | "def answer_three():\n",
174 | " from sklearn.metrics import recall_score, precision_score, accuracy_score\n",
175 | " from sklearn.svm import SVC\n",
176 | "\n",
177 | " # Your code here\n",
178 | " svm = SVC().fit(X_train, y_train)\n",
179 | " y_predictions = svm.predict(X_test)\n",
180 | " \n",
181 | " ans = (accuracy_score(y_test, y_predictions), recall_score(y_test, y_predictions), precision_score(y_test, y_predictions))\n",
182 | " \n",
183 | " return ans\n",
184 | "answer_three()"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "### Question 4\n",
192 | "\n",
193 | "Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function. Use X_test and y_test.\n",
194 | "\n",
195 | "*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 13,
201 | "metadata": {
202 | "collapsed": false
203 | },
204 | "outputs": [
205 | {
206 | "data": {
207 | "text/plain": [
208 | "array([[5320, 24],\n",
209 | " [ 14, 66]])"
210 | ]
211 | },
212 | "execution_count": 13,
213 | "metadata": {},
214 | "output_type": "execute_result"
215 | }
216 | ],
217 | "source": [
218 | "def answer_four():\n",
219 | " from sklearn.metrics import confusion_matrix\n",
220 | " from sklearn.svm import SVC\n",
221 | "\n",
222 | " # Your code here\n",
223 | " svm = SVC(C = 1e9, gamma = 1e-07).fit(X_train, y_train)\n",
224 | " svm_predicted = svm.decision_function(X_test) > -220\n",
225 | " \n",
226 | "# print(svm_predicted)\n",
227 | " \n",
228 | " confusion = confusion_matrix(y_test, svm_predicted)\n",
229 | " \n",
230 | " ans = confusion\n",
231 | " \n",
232 | " return ans\n",
233 | "\n",
234 | "answer_four()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### Question 5\n",
242 | "\n",
243 | "Train a logisitic regression classifier with default parameters using X_train and y_train.\n",
244 | "\n",
245 | "For the logisitic regression classifier, create a precision recall curve and a roc curve using y_test and the probability estimates for X_test (probability it is fraud).\n",
246 | "\n",
247 | "Looking at the precision recall curve, what is the recall when the precision is `0.75`?\n",
248 | "\n",
249 | "Looking at the roc curve, what is the true positive rate when the false positive rate is `0.16`?\n",
250 | "\n",
251 | "*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 14,
257 | "metadata": {
258 | "collapsed": false
259 | },
260 | "outputs": [
261 | {
262 | "data": {
263 | "text/plain": [
264 | "(0.82499999999999996, 0.98750000000000004)"
265 | ]
266 | },
267 | "execution_count": 14,
268 | "metadata": {},
269 | "output_type": "execute_result"
270 | }
271 | ],
272 | "source": [
273 | "def answer_five():\n",
274 | " \n",
275 | " # Your code here\n",
276 | " from sklearn.linear_model import LogisticRegression\n",
277 | " from sklearn.metrics import precision_recall_curve\n",
278 | " from sklearn.metrics import roc_curve\n",
279 | "\n",
280 | " lr = LogisticRegression().fit(X_train, y_train)\n",
281 | " \n",
282 | " y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)\n",
283 | " \n",
284 | "# lr_predicted = lr.predict(X_test)\n",
285 | " \n",
286 | " precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)\n",
287 | " closest_zero_p = np.argmin(np.abs(precision-0.75))\n",
288 | "# closest_zero_p = precision[closest_zero]\n",
289 | " closest_zero_r = recall[closest_zero_p]\n",
290 | " \n",
291 | "# print(closest_zero_r)\n",
292 | " \n",
293 | " \n",
294 | " fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)\n",
295 | "# roc_auc_lr = auc(fpr_lr, tpr_lr)\n",
296 | " \n",
297 | " closest_zero_fpr_lr = np.argmin(np.abs(fpr_lr - 0.16))\n",
298 | "# closest_zero_p = precision[closest_zero]\n",
299 | " closest_zero_tpr_lr = recall[closest_zero_fpr_lr]\n",
300 | " \n",
301 | "# print(closest_zero_tpr_lr)\n",
302 | "\n",
303 | " \n",
304 | "# y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)\n",
305 | " \n",
306 | "# confusion = confusion_matrix(y_test, lr_predicted)\n",
307 | "\n",
308 | " ans = (closest_zero_r, closest_zero_tpr_lr)\n",
309 | " \n",
310 | " return ans\n",
311 | "answer_five()"
312 | ]
313 | },
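{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two curves themselves are easy to inspect with matplotlib. A minimal plotting sketch, kept commented out (as with the other plotting helpers in this course) so the autograder never imports matplotlib; it assumes the `X_train`/`X_test` split defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Minimal sketch: plot the precision-recall and ROC curves from answer_five.\n",
"#import matplotlib.pyplot as plt\n",
"#from sklearn.linear_model import LogisticRegression\n",
"#from sklearn.metrics import precision_recall_curve, roc_curve\n",
"#\n",
"#y_scores = LogisticRegression().fit(X_train, y_train).decision_function(X_test)\n",
"#precision, recall, _ = precision_recall_curve(y_test, y_scores)\n",
"#fpr, tpr, _ = roc_curve(y_test, y_scores)\n",
"#\n",
"#plt.figure(); plt.plot(recall, precision); plt.xlabel('Recall'); plt.ylabel('Precision')\n",
"#plt.figure(); plt.plot(fpr, tpr); plt.xlabel('False positive rate'); plt.ylabel('True positive rate')"
]
},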
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "### Question 6\n",
319 | "\n",
320 | "Perform a grid search over the parameters listed below for a Logisitic Regression classifier, using recall for scoring and the default 3-fold cross validation.\n",
321 | "\n",
322 | "`'penalty': ['l1', 'l2']`\n",
323 | "\n",
324 | "`'C':[0.01, 0.1, 1, 10, 100]`\n",
325 | "\n",
326 | "From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.\n",
327 | "\n",
328 | "| \t| `l1` \t| `l2` \t|\n",
329 | "|:----:\t|----\t|----\t|\n",
330 | "| **`0.01`** \t| ?\t| ? \t|\n",
331 | "| **`0.1`** \t| ?\t| ? \t|\n",
332 | "| **`1`** \t| ?\t| ? \t|\n",
333 | "| **`10`** \t| ?\t| ? \t|\n",
334 | "| **`100`** \t| ?\t| ? \t|\n",
335 | "\n",
336 | "
\n",
337 | "\n",
338 | "*This function should return a 5 by 2 numpy array with 10 floats.* \n",
339 | "\n",
340 | "*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 15,
346 | "metadata": {
347 | "collapsed": false
348 | },
349 | "outputs": [
350 | {
351 | "data": {
352 | "text/plain": [
353 | "array([[ 0.66666667, 0.76086957],\n",
354 | " [ 0.80072464, 0.80434783],\n",
355 | " [ 0.8115942 , 0.8115942 ],\n",
356 | " [ 0.80797101, 0.8115942 ],\n",
357 | " [ 0.80797101, 0.80797101]])"
358 | ]
359 | },
360 | "execution_count": 15,
361 | "metadata": {},
362 | "output_type": "execute_result"
363 | }
364 | ],
365 | "source": [
366 | "def answer_six(): \n",
367 | " from sklearn.model_selection import GridSearchCV\n",
368 | " from sklearn.linear_model import LogisticRegression\n",
369 | "\n",
370 | " # Your code here\n",
371 | " lr = LogisticRegression()\n",
372 | "\n",
373 | " grid_values = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}\n",
374 | "\n",
375 | " # default metric to optimize over grid parameters\n",
376 | " grid_lr = GridSearchCV(lr, param_grid = grid_values, scoring = 'recall')\n",
377 | " grid_lr.fit(X_train, y_train)\n",
378 | " \n",
379 | "# print(grid_lr.cv_results_['mean_test_score'].reshape(5,2))\n",
380 | " ans = np.array(grid_lr.cv_results_['mean_test_score'].reshape(5,2))\n",
381 | " \n",
382 | " return ans\n",
383 | "\n",
384 | "answer_six()\n"
385 | ]
386 | },
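{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `reshape(5,2)` above relies on how `GridSearchCV` orders parameter combinations: parameter names are iterated in sorted order, so `penalty` varies fastest within each value of `C`. A commented sketch to confirm the ordering, assuming the fitted `grid_lr` from inside `answer_six`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sketch: verify the (C, penalty) ordering behind the reshape(5,2).\n",
"#for params, score in zip(grid_lr.cv_results_['params'],\n",
"#                         grid_lr.cv_results_['mean_test_score']):\n",
"#    print(params, round(score, 4))"
]
},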
387 | {
388 | "cell_type": "code",
389 | "execution_count": 16,
390 | "metadata": {
391 | "collapsed": true
392 | },
393 | "outputs": [],
394 | "source": [
395 | "# Use the following function to help visualize results from the grid search\n",
396 | "def GridSearch_Heatmap(scores):\n",
397 | " #%matplotlib notebook\n",
398 | " import seaborn as sns\n",
399 | " #import matplotlib.pyplot as plt\n",
400 | " plt.figure()\n",
401 | " sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])\n",
402 | " plt.yticks(rotation=0);\n",
403 | "\n",
404 | "#GridSearch_Heatmap(answer_six())"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "metadata": {
411 | "collapsed": true
412 | },
413 | "outputs": [],
414 | "source": []
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": null,
419 | "metadata": {
420 | "collapsed": true
421 | },
422 | "outputs": [],
423 | "source": []
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {
429 | "collapsed": true
430 | },
431 | "outputs": [],
432 | "source": []
433 | }
434 | ],
435 | "metadata": {
436 | "coursera": {
437 | "course_slug": "python-machine-learning",
438 | "graded_item_id": "5yX9Z",
439 | "launcher_item_id": "eqnV3",
440 | "part_id": "Msnj0"
441 | },
442 | "kernelspec": {
443 | "display_name": "Python 3",
444 | "language": "python",
445 | "name": "python3"
446 | },
447 | "language_info": {
448 | "codemirror_mode": {
449 | "name": "ipython",
450 | "version": 3
451 | },
452 | "file_extension": ".py",
453 | "mimetype": "text/x-python",
454 | "name": "python",
455 | "nbconvert_exporter": "python",
456 | "pygments_lexer": "ipython3",
457 | "version": "3.6.2"
458 | }
459 | },
460 | "nbformat": 4,
461 | "nbformat_minor": 2
462 | }
463 |
--------------------------------------------------------------------------------
/week_4_Assignment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Assignment 4 - Understanding and Predicting Property Maintenance Fines\n",
19 | "\n",
20 | "This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). \n",
21 | "\n",
22 | "The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?\n",
23 | "\n",
24 | "The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.\n",
25 | "\n",
26 | "All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:\n",
27 | "\n",
28 | "* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)\n",
29 | "* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)\n",
30 | "* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)\n",
31 | "* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)\n",
32 | "* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)\n",
33 | "\n",
34 | "___\n",
35 | "\n",
36 | "We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.\n",
37 | "\n",
38 | "Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.\n",
39 | "\n",
40 | "
\n",
41 | "\n",
42 | "**File descriptions** (Use only this data for training your model!)\n",
43 | "\n",
44 | " readonly/train.csv - the training set (all tickets issued 2004-2011)\n",
45 | " readonly/test.csv - the test set (all tickets issued 2012-2016)\n",
46 | " readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. \n",
47 | " Note: misspelled addresses may be incorrectly geolocated.\n",
48 | "\n",
49 | "
\n",
50 | "\n",
51 | "**Data fields**\n",
52 | "\n",
53 | "train.csv & test.csv\n",
54 | "\n",
55 | " ticket_id - unique identifier for tickets\n",
56 | " agency_name - Agency that issued the ticket\n",
57 | " inspector_name - Name of inspector that issued the ticket\n",
58 | " violator_name - Name of the person/organization that the ticket was issued to\n",
59 | " violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred\n",
60 | " mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator\n",
61 | " ticket_issued_date - Date and time the ticket was issued\n",
62 | " hearing_date - Date and time the violator's hearing was scheduled\n",
63 | " violation_code, violation_description - Type of violation\n",
64 | " disposition - Judgment and judgement type\n",
65 | " fine_amount - Violation fine amount, excluding fees\n",
66 | " admin_fee - $20 fee assigned to responsible judgments\n",
67 | "state_fee - $10 fee assigned to responsible judgments\n",
68 | " late_fee - 10% fee assigned to responsible judgments\n",
69 | " discount_amount - discount applied, if any\n",
70 | " clean_up_cost - DPW clean-up or graffiti removal cost\n",
71 | " judgment_amount - Sum of all fines and fees\n",
72 | " grafitti_status - Flag for graffiti violations\n",
73 | " \n",
74 | "train.csv only\n",
75 | "\n",
76 | " payment_amount - Amount paid, if any\n",
77 | " payment_date - Date payment was made, if it was received\n",
78 | " payment_status - Current payment status as of Feb 1 2017\n",
79 | " balance_due - Fines and fees still owed\n",
80 | " collection_status - Flag for payments in collections\n",
81 | " compliance [target variable for prediction] \n",
82 | " Null = Not responsible\n",
83 | " 0 = Responsible, non-compliant\n",
84 | " 1 = Responsible, compliant\n",
85 | " compliance_detail - More information on why each ticket was marked compliant or non-compliant\n",
86 | "\n",
87 | "\n",
88 | "___\n",
89 | "\n",
90 | "## Evaluation\n",
91 | "\n",
92 | "Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.\n",
93 | "\n",
94 | "The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). \n",
95 | "\n",
96 | "Your grade will be based on the AUC score computed for your classifier. A model which with an AUROC of 0.7 passes this assignment, over 0.75 will recieve full points.\n",
97 | "___\n",
98 | "\n",
99 | "For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `readonly/train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `readonly/test.csv` will be paid, and the index being the ticket_id.\n",
100 | "\n",
101 | "Example:\n",
102 | "\n",
103 | " ticket_id\n",
104 | " 284932 0.531842\n",
105 | " 285362 0.401958\n",
106 | " 285361 0.105928\n",
107 | " 285338 0.018572\n",
108 | " ...\n",
109 | " 376499 0.208567\n",
110 | " 376500 0.818759\n",
111 | " 369851 0.018528\n",
112 | " Name: compliance, dtype: float32\n",
113 | " \n",
114 | "### Hints\n",
115 | "\n",
116 | "* Make sure your code is working before submitting it to the autograder.\n",
117 | "\n",
118 | "* Print out your result to see whether there is anything weird (e.g., all probabilities are the same).\n",
119 | "\n",
120 | "* Generally the total runtime should be less than 10 mins. You should NOT use Neural Network related classifiers (e.g., MLPClassifier) in this question. \n",
121 | "\n",
122 | "* Try to avoid global variables. If you have other functions besides blight_model, you should move those functions inside the scope of blight_model.\n",
123 | "\n",
124 | "* Refer to the pinned threads in Week 4's discussion forum when there is something you could not figure it out."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 7,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [],
134 | "source": [
135 | "#%matplotlib notebook\n",
136 | "import pandas as pd\n",
137 | "import numpy as np\n",
138 | "#import matplotlib.pyplot as plt\n",
139 | "pd.set_option('display.max_columns',50)\n",
140 | "\n",
141 | "from sklearn.ensemble import RandomForestRegressor\n",
142 | "from sklearn.ensemble import GradientBoostingClassifier\n",
143 | "from sklearn.model_selection import train_test_split\n",
144 | "from sklearn.model_selection import GridSearchCV\n",
145 | "from sklearn.model_selection import cross_val_score\n",
146 | "from sklearn.preprocessing import LabelEncoder\n",
147 | "from sklearn.metrics import roc_auc_score\n",
148 | "\n",
149 | "def blight_model():\n",
150 | " # Reading files:\n",
151 | " train = pd.read_csv('train.csv',encoding='ISO-8859-1')\n",
152 | " test = pd.read_csv('test.csv',encoding='ISO-8859-1')\n",
153 | " test.set_index(test['ticket_id'],inplace=True)\n",
154 | " \n",
155 | " # Cleaning:\n",
156 | " train.dropna(subset=['compliance'],inplace=True)\n",
157 | " train = train[train['country']=='USA']\n",
158 | " #test = test[test['country']=='USA']\n",
159 | "\n",
160 | " label_encoder = LabelEncoder()\n",
161 | " label_encoder.fit(train['disposition'].append(test['disposition'], ignore_index=True))\n",
162 | " train['disposition'] = label_encoder.transform(train['disposition'])\n",
163 | " test['disposition'] = label_encoder.transform(test['disposition'])\n",
164 | "\n",
165 | " label_encoder = LabelEncoder()\n",
166 | " label_encoder.fit(train['violation_code'].append(test['violation_code'], ignore_index=True))\n",
167 | " train['violation_code'] = label_encoder.transform(train['violation_code'])\n",
168 | " test['violation_code'] = label_encoder.transform(test['violation_code'])\n",
169 | "\n",
170 | " feature_names=['disposition','violation_code']\n",
171 | " X = train[feature_names]\n",
172 | " y = train['compliance']\n",
173 | " test = test[feature_names]\n",
174 | " X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)\n",
175 | "\n",
176 | " # grid search\n",
177 | " model = RandomForestRegressor()\n",
178 | " param_grid = {'n_estimators':[5,7], 'max_depth':[5,10]}\n",
179 | " grid_search = GridSearchCV(model, param_grid, scoring=\"roc_auc\")\n",
180 | " grid_result = grid_search.fit(X_train, y_train)\n",
181 | " \n",
182 | " # summarize results\n",
183 | " #print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n",
184 | " return pd.DataFrame(grid_result.predict(test),index=test.index,columns=['compliance'])\n"
185 | ]
186 | },
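{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `blight_model` scores tickets with a `RandomForestRegressor`, whose predictions in [0,1] are used as probability scores, and it returns a DataFrame rather than the Series described in the prompt. A classifier-based variant matching the prompt's format could look roughly like the commented sketch below; it assumes the same `X_train`, `y_train`, and `test` preparation as in `blight_model`, and uses the already-imported `GradientBoostingClassifier` as one possible choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sketch: probability scores from a classifier, returned as a pandas Series.\n",
"#clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)\n",
"#probs = clf.predict_proba(test)[:, 1]  # column 1 = P(compliance == 1)\n",
"#pd.Series(probs, index=test.index, name='compliance')"
]
},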
187 | {
188 | "cell_type": "code",
189 | "execution_count": 8,
190 | "metadata": {
191 | "collapsed": false
192 | },
193 | "outputs": [
194 | {
195 | "name": "stderr",
196 | "output_type": "stream",
197 | "text": [
198 | "/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2827: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.\n",
199 | " if self.run_code(code, result):\n"
200 | ]
201 | },
202 | {
203 | "data": {
204 | "text/html": [
205 | "
\n", 210 | " | compliance | \n", 211 | "
---|---|
ticket_id | \n", 214 | "\n", 215 | " |
284932 | \n", 220 | "0.177636 | \n", 221 | "
285362 | \n", 224 | "0.029367 | \n", 225 | "
285361 | \n", 228 | "0.065970 | \n", 229 | "
285338 | \n", 232 | "0.029367 | \n", 233 | "
285346 | \n", 236 | "0.083335 | \n", 237 | "
285345 | \n", 240 | "0.029367 | \n", 241 | "
285347 | \n", 244 | "0.062404 | \n", 245 | "
285342 | \n", 248 | "0.480734 | \n", 249 | "
285530 | \n", 252 | "0.029367 | \n", 253 | "
284989 | \n", 256 | "0.029367 | \n", 257 | "
285344 | \n", 260 | "0.051367 | \n", 261 | "
285343 | \n", 264 | "0.029367 | \n", 265 | "
285340 | \n", 268 | "0.029367 | \n", 269 | "
285341 | \n", 272 | "0.051367 | \n", 273 | "
285349 | \n", 276 | "0.083335 | \n", 277 | "
285348 | \n", 280 | "0.029367 | \n", 281 | "
284991 | \n", 284 | "0.029367 | \n", 285 | "
285532 | \n", 288 | "0.029367 | \n", 289 | "
285406 | \n", 292 | "0.029367 | \n", 293 | "
285001 | \n", 296 | "0.039585 | \n", 297 | "
285006 | \n", 300 | "0.034218 | \n", 301 | "
285405 | \n", 304 | "0.029367 | \n", 305 | "
285337 | \n", 308 | "0.029367 | \n", 309 | "
285496 | \n", 312 | "0.051367 | \n", 313 | "
285497 | \n", 316 | "0.029367 | \n", 317 | "
285378 | \n", 320 | "0.029367 | \n", 321 | "
285589 | \n", 324 | "0.029367 | \n", 325 | "
285585 | \n", 328 | "0.029367 | \n", 329 | "
285501 | \n", 332 | "0.065970 | \n", 333 | "
285581 | \n", 336 | "0.029367 | \n", 337 | "
... | \n", 340 | "... | \n", 341 | "
376367 | \n", 344 | "0.034218 | \n", 345 | "
376366 | \n", 348 | "0.039585 | \n", 349 | "
376362 | \n", 352 | "0.271635 | \n", 353 | "
376363 | \n", 356 | "0.282298 | \n", 357 | "
376365 | \n", 360 | "0.034218 | \n", 361 | "
376364 | \n", 364 | "0.039585 | \n", 365 | "
376228 | \n", 368 | "0.039585 | \n", 369 | "
376265 | \n", 372 | "0.039585 | \n", 373 | "
376286 | \n", 376 | "0.331396 | \n", 377 | "
376320 | \n", 380 | "0.039585 | \n", 381 | "
376314 | \n", 384 | "0.039585 | \n", 385 | "
376327 | \n", 388 | "0.331396 | \n", 389 | "
376385 | \n", 392 | "0.331396 | \n", 393 | "
376435 | \n", 396 | "0.139655 | \n", 397 | "
376370 | \n", 400 | "0.271635 | \n", 401 | "
376434 | \n", 404 | "0.062404 | \n", 405 | "
376459 | \n", 408 | "0.199092 | \n", 409 | "
376478 | \n", 412 | "0.030587 | \n", 413 | "
376473 | \n", 416 | "0.039585 | \n", 417 | "
376484 | \n", 420 | "0.012334 | \n", 421 | "
376482 | \n", 424 | "0.018209 | \n", 425 | "
376480 | \n", 428 | "0.012334 | \n", 429 | "
376479 | \n", 432 | "0.012334 | \n", 433 | "
376481 | \n", 436 | "0.012334 | \n", 437 | "
376483 | \n", 440 | "0.020022 | \n", 441 | "
376496 | \n", 444 | "0.007049 | \n", 445 | "
376497 | \n", 448 | "0.007049 | \n", 449 | "
376499 | \n", 452 | "0.083335 | \n", 453 | "
376500 | \n", 456 | "0.083335 | \n", 457 | "
369851 | \n", 460 | "0.051367 | \n", 461 | "
61001 rows × 1 columns
\n", 465 | "