├── Machine-Learning-Foundations-A-Case-Study-Approach
│   ├── Course-2
│   │   ├── Predicting House Price using SkitLearn.ipynb
│   │   └── Predicting House Price using SkitLearn.py
│   ├── Course-3
│   │   └── Analyzing product sentiment + Quiz Answers.ipynb
│   ├── Course-4
│   │   └── document-retrieval.ipynb
│   ├── Course-5
│   │   └── Song recommender.ipynb
│   ├── Course-6
│   │   └── Deep Features for Image Retrieval.ipynb
│   ├── Predicting house prices + Quiz Answers .ipynb
│   ├── Song recommender-Copy1.ipynb
│   └── home_data.csv
├── Machine_Learning_Classification
│   ├── Week 7 - Stochastic Gradient Ascent.ipynb
│   ├── test
│   ├── week1_Logistic Regression Assignment 1.ipynb
│   ├── week2_programming assignment 1.ipynb
│   ├── week2_programming assignment 2_regularization.ipynb
│   ├── week3_binary decision tree_programming assignment 2.ipynb
│   ├── week3_decision tree_programming assignment 1.ipynb
│   ├── week4_decision tree in practice_programming assignment1.ipynb
│   ├── week5_boosting_programming_assignment_1.ipynb
│   ├── week5_boosting_programming_assignment_2.ipynb
│   └── week6_programming assignment_precision recall.ipynb
├── Machine_Learning_Regression
│   ├── K-Nearest Neighborhood Regression.ipynb
│   ├── ML_UW_Regression.pdf
│   ├── Week 5 - Lasso Regression Assignment 1.ipynb
│   ├── week 2 - multiple regression - Assignment 2.ipynb
│   ├── week1_kc_house_linear regression_assignment1.py
│   ├── week3_Assess Performance - Programming Assignment.ipynb
│   ├── week4_ridge_regression_Assess Performance - Programming Assignment.ipynb
│   └── week5-lasso regression assignment 2.ipynb
├── Maching_Learning_Clustering-Retrieval
│   ├── Quiz-Answers
│   │   ├── screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-0acPr-kd-trees-1475969153420.png
│   │   ├── screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-7cVMj-locality-sensitive-hashing-1475969687954.png
│   │   ├── screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-itzL5-implementing-locality-sensitive-hashing-from-scratch-1475994803333.png
│   │   ├── screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-tHtXY-k-means-1476159235037.png
│   │   └── test
│   ├── test
│   ├── week2_choosing features and metrics for nearest neighbor search_programming assignment.ipynb
│   ├── week2_locality sensitive hashing_programming assignment.ipynb
│   ├── week3_k-means_programming_assignment_1.ipynb
│   ├── week4_implement_the_EM_algorithm-programing_assignment_1.ipynb
│   ├── week4_text_em_clustering_programming-assignment-2.ipynb
│   ├── week5_lda_programming_assignment_1.ipynb
│   └── week6_Hierarchical Clustering_programming assignment 1.ipynb
└── README.md
/Machine-Learning-Foundations-A-Case-Study-Approach/Course-2/Predicting House Price using SkitLearn.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 16,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import matplotlib.pyplot as plt\n",
12 | "import numpy as np\n",
13 | "import pandas as pd\n",
14 | "from sklearn import datasets, linear_model"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 18,
20 | "metadata": {
21 | "collapsed": false
22 | },
23 | "outputs": [],
24 | "source": [
25 | "sales = pd.read_csv('home_data.csv')"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 19,
31 | "metadata": {
32 | "collapsed": false
33 | },
34 | "outputs": [
35 | {
36 | "data": {
37 | "text/html": [
38 | "
\n",
39 | "
\n",
40 | " \n",
41 | " \n",
42 | " | \n",
43 | " id | \n",
44 | " date | \n",
45 | " price | \n",
46 | " bedrooms | \n",
47 | " bathrooms | \n",
48 | " sqft_living | \n",
49 | " sqft_lot | \n",
50 | " floors | \n",
51 | " waterfront | \n",
52 | " view | \n",
53 | " ... | \n",
54 | " grade | \n",
55 | " sqft_above | \n",
56 | " sqft_basement | \n",
57 | " yr_built | \n",
58 | " yr_renovated | \n",
59 | " zipcode | \n",
60 | " lat | \n",
61 | " long | \n",
62 | " sqft_living15 | \n",
63 | " sqft_lot15 | \n",
64 | "
\n",
65 | " \n",
66 | " \n",
67 | " \n",
68 | " 0 | \n",
69 | " 7129300520 | \n",
70 | " 20141013T000000 | \n",
71 | " 221900 | \n",
72 | " 3 | \n",
73 | " 1.00 | \n",
74 | " 1180 | \n",
75 | " 5650 | \n",
76 | " 1 | \n",
77 | " 0 | \n",
78 | " 0 | \n",
79 | " ... | \n",
80 | " 7 | \n",
81 | " 1180 | \n",
82 | " 0 | \n",
83 | " 1955 | \n",
84 | " 0 | \n",
85 | " 98178 | \n",
86 | " 47.5112 | \n",
87 | " -122.257 | \n",
88 | " 1340 | \n",
89 | " 5650 | \n",
90 | "
\n",
91 | " \n",
92 | " 1 | \n",
93 | " 6414100192 | \n",
94 | " 20141209T000000 | \n",
95 | " 538000 | \n",
96 | " 3 | \n",
97 | " 2.25 | \n",
98 | " 2570 | \n",
99 | " 7242 | \n",
100 | " 2 | \n",
101 | " 0 | \n",
102 | " 0 | \n",
103 | " ... | \n",
104 | " 7 | \n",
105 | " 2170 | \n",
106 | " 400 | \n",
107 | " 1951 | \n",
108 | " 1991 | \n",
109 | " 98125 | \n",
110 | " 47.7210 | \n",
111 | " -122.319 | \n",
112 | " 1690 | \n",
113 | " 7639 | \n",
114 | "
\n",
115 | " \n",
116 | " 2 | \n",
117 | " 5631500400 | \n",
118 | " 20150225T000000 | \n",
119 | " 180000 | \n",
120 | " 2 | \n",
121 | " 1.00 | \n",
122 | " 770 | \n",
123 | " 10000 | \n",
124 | " 1 | \n",
125 | " 0 | \n",
126 | " 0 | \n",
127 | " ... | \n",
128 | " 6 | \n",
129 | " 770 | \n",
130 | " 0 | \n",
131 | " 1933 | \n",
132 | " 0 | \n",
133 | " 98028 | \n",
134 | " 47.7379 | \n",
135 | " -122.233 | \n",
136 | " 2720 | \n",
137 | " 8062 | \n",
138 | "
\n",
139 | " \n",
140 | " 3 | \n",
141 | " 2487200875 | \n",
142 | " 20141209T000000 | \n",
143 | " 604000 | \n",
144 | " 4 | \n",
145 | " 3.00 | \n",
146 | " 1960 | \n",
147 | " 5000 | \n",
148 | " 1 | \n",
149 | " 0 | \n",
150 | " 0 | \n",
151 | " ... | \n",
152 | " 7 | \n",
153 | " 1050 | \n",
154 | " 910 | \n",
155 | " 1965 | \n",
156 | " 0 | \n",
157 | " 98136 | \n",
158 | " 47.5208 | \n",
159 | " -122.393 | \n",
160 | " 1360 | \n",
161 | " 5000 | \n",
162 | "
\n",
163 | " \n",
164 | " 4 | \n",
165 | " 1954400510 | \n",
166 | " 20150218T000000 | \n",
167 | " 510000 | \n",
168 | " 3 | \n",
169 | " 2.00 | \n",
170 | " 1680 | \n",
171 | " 8080 | \n",
172 | " 1 | \n",
173 | " 0 | \n",
174 | " 0 | \n",
175 | " ... | \n",
176 | " 8 | \n",
177 | " 1680 | \n",
178 | " 0 | \n",
179 | " 1987 | \n",
180 | " 0 | \n",
181 | " 98074 | \n",
182 | " 47.6168 | \n",
183 | " -122.045 | \n",
184 | " 1800 | \n",
185 | " 7503 | \n",
186 | "
\n",
187 | " \n",
188 | "
\n",
189 | "
5 rows × 21 columns
\n",
190 | "
"
191 | ],
192 | "text/plain": [
193 | " id date price bedrooms bathrooms sqft_living \\\n",
194 | "0 7129300520 20141013T000000 221900 3 1.00 1180 \n",
195 | "1 6414100192 20141209T000000 538000 3 2.25 2570 \n",
196 | "2 5631500400 20150225T000000 180000 2 1.00 770 \n",
197 | "3 2487200875 20141209T000000 604000 4 3.00 1960 \n",
198 | "4 1954400510 20150218T000000 510000 3 2.00 1680 \n",
199 | "\n",
200 | " sqft_lot floors waterfront view ... grade sqft_above \\\n",
201 | "0 5650 1 0 0 ... 7 1180 \n",
202 | "1 7242 2 0 0 ... 7 2170 \n",
203 | "2 10000 1 0 0 ... 6 770 \n",
204 | "3 5000 1 0 0 ... 7 1050 \n",
205 | "4 8080 1 0 0 ... 8 1680 \n",
206 | "\n",
207 | " sqft_basement yr_built yr_renovated zipcode lat long \\\n",
208 | "0 0 1955 0 98178 47.5112 -122.257 \n",
209 | "1 400 1951 1991 98125 47.7210 -122.319 \n",
210 | "2 0 1933 0 98028 47.7379 -122.233 \n",
211 | "3 910 1965 0 98136 47.5208 -122.393 \n",
212 | "4 0 1987 0 98074 47.6168 -122.045 \n",
213 | "\n",
214 | " sqft_living15 sqft_lot15 \n",
215 | "0 1340 5650 \n",
216 | "1 1690 7639 \n",
217 | "2 2720 8062 \n",
218 | "3 1360 5000 \n",
219 | "4 1800 7503 \n",
220 | "\n",
221 | "[5 rows x 21 columns]"
222 | ]
223 | },
224 | "execution_count": 19,
225 | "metadata": {},
226 | "output_type": "execute_result"
227 | }
228 | ],
229 | "source": [
230 | "sales.head()"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": 21,
236 | "metadata": {
237 | "collapsed": false
238 | },
239 | "outputs": [],
240 | "source": [
241 | "sales_X = sales['sqft_living']"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 48,
247 | "metadata": {
248 | "collapsed": false
249 | },
250 | "outputs": [
251 | {
252 | "data": {
253 | "text/plain": [
254 | "array([1180, 2570, 770, ..., 1020, 1600, 1020])"
255 | ]
256 | },
257 | "execution_count": 48,
258 | "metadata": {},
259 | "output_type": "execute_result"
260 | }
261 | ],
262 | "source": [
263 | "x = []\n",
264 | "for i in sales_X:\n",
265 | " x.append(i)\n",
266 | "xarray = np.array(x)\n",
267 | "xarray"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": 68,
273 | "metadata": {
274 | "collapsed": false
275 | },
276 | "outputs": [
277 | {
278 | "data": {
279 | "text/plain": [
280 | "array([[1180],\n",
281 | " [2570],\n",
282 | " [ 770],\n",
283 | " ..., \n",
284 | " [4910],\n",
285 | " [2770],\n",
286 | " [1190]])"
287 | ]
288 | },
289 | "execution_count": 68,
290 | "metadata": {},
291 | "output_type": "execute_result"
292 | }
293 | ],
294 | "source": [
295 | "sales_X_train = xarray[:-20]\n",
296 | "sales_X_train.reshape(len(sales_X_train),1)"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 53,
302 | "metadata": {
303 | "collapsed": false
304 | },
305 | "outputs": [],
306 | "source": [
307 | "sales_X_test = xarray[-20:]"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": 54,
313 | "metadata": {
314 | "collapsed": false
315 | },
316 | "outputs": [
317 | {
318 | "data": {
319 | "text/plain": [
320 | "array([4170, 2500, 1530, 3600, 3410, 3118, 3990, 4470, 1425, 1500, 2270,\n",
321 | " 1490, 2520, 3510, 1310, 1530, 2310, 1020, 1600, 1020])"
322 | ]
323 | },
324 | "execution_count": 54,
325 | "metadata": {},
326 | "output_type": "execute_result"
327 | }
328 | ],
329 | "source": [
330 | "sales_X_test"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 33,
336 | "metadata": {
337 | "collapsed": true
338 | },
339 | "outputs": [],
340 | "source": [
341 | "sales_Y = sales['price']"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": 55,
347 | "metadata": {
348 | "collapsed": false
349 | },
350 | "outputs": [
351 | {
352 | "data": {
353 | "text/plain": [
354 | "array([221900, 538000, 180000, ..., 402101, 400000, 325000])"
355 | ]
356 | },
357 | "execution_count": 55,
358 | "metadata": {},
359 | "output_type": "execute_result"
360 | }
361 | ],
362 | "source": [
363 | "y = []\n",
364 | "for m in sales_Y:\n",
365 | " y.append(m)\n",
366 | "yarray = np.array(y)\n",
367 | "yarray"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 58,
373 | "metadata": {
374 | "collapsed": false
375 | },
376 | "outputs": [],
377 | "source": [
378 | "sales_y_train = yarray[:-20]\n",
379 | "sales_y_test = yarray[-20:]"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": 66,
385 | "metadata": {
386 | "collapsed": false
387 | },
388 | "outputs": [
389 | {
390 | "data": {
391 | "text/plain": [
392 | "array([[ 221900],\n",
393 | " [ 538000],\n",
394 | " [ 180000],\n",
395 | " ..., \n",
396 | " [1222500],\n",
397 | " [ 572000],\n",
398 | " [ 475000]])"
399 | ]
400 | },
401 | "execution_count": 66,
402 | "metadata": {},
403 | "output_type": "execute_result"
404 | }
405 | ],
406 | "source": [
407 | "sales_y_train.reshape(len(sales_y_train),1)"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": 59,
413 | "metadata": {
414 | "collapsed": true
415 | },
416 | "outputs": [],
417 | "source": [
418 | "regr = linear_model.LinearRegression()"
419 | ]
420 | },
421 | {
422 | "cell_type": "code",
423 | "execution_count": 70,
424 | "metadata": {
425 | "collapsed": false
426 | },
427 | "outputs": [
428 | {
429 | "data": {
430 | "text/plain": [
431 | "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
432 | ]
433 | },
434 | "execution_count": 70,
435 | "metadata": {},
436 | "output_type": "execute_result"
437 | }
438 | ],
439 | "source": [
440 | "regr.fit(sales_X_train.reshape(len(sales_X_train),1), sales_y_train.reshape(len(sales_y_train),1))"
441 | ]
442 | },
443 | {
444 | "cell_type": "code",
445 | "execution_count": 71,
446 | "metadata": {
447 | "collapsed": false
448 | },
449 | "outputs": [
450 | {
451 | "name": "stdout",
452 | "output_type": "stream",
453 | "text": [
454 | "('Coefficients: \\n', array([[ 280.6363448]]))\n"
455 | ]
456 | }
457 | ],
458 | "source": [
459 | "print(\"Coefficients: \\n\", regr.coef_)"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "metadata": {
466 | "collapsed": false
467 | },
468 | "outputs": [],
469 | "source": [
470 | "print(\"Residual sum of squares: %.2f\"\n",
471 | " % np.mean((regr.predict(sales_X_test.reshape(len(sales_X_test),1)) - sales_y_test.reshape(len(sales_y_test),1)) ** 2))"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {
478 | "collapsed": false
479 | },
480 | "outputs": [],
481 | "source": [
482 | "print('Variance score: %.2f' % regr.score(sales_X_test.reshape(len(sales_X_test),1), sales_y_test.reshape(len(sales_y_test),1)))"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": null,
488 | "metadata": {
489 | "collapsed": true
490 | },
491 | "outputs": [],
492 | "source": [
493 | "plt.scatter(sales_X_test.reshape(len(sales_X_test),1), sales_y_test.reshape(len(sales_y_test),1), color='black')\n",
494 | "plt.plot(sales_X_test.reshape(len(sales_X_test),1), sales_y_test.reshape(len(sales_y_test),1), color='blue',\n",
495 | " linewidth=3)\n",
496 | "\n",
497 | "plt.xticks(())\n",
498 | "plt.yticks(())\n",
499 | "\n",
500 | "plt.show()"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": null,
506 | "metadata": {
507 | "collapsed": true
508 | },
509 | "outputs": [],
510 | "source": []
511 | }
512 | ],
513 | "metadata": {
514 | "kernelspec": {
515 | "display_name": "Python 2",
516 | "language": "python",
517 | "name": "python2"
518 | },
519 | "language_info": {
520 | "codemirror_mode": {
521 | "name": "ipython",
522 | "version": 2
523 | },
524 | "file_extension": ".py",
525 | "mimetype": "text/x-python",
526 | "name": "python",
527 | "nbconvert_exporter": "python",
528 | "pygments_lexer": "ipython2",
529 | "version": "2.7.11"
530 | }
531 | },
532 | "nbformat": 4,
533 | "nbformat_minor": 0
534 | }
535 |
--------------------------------------------------------------------------------
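
The notebook above builds its feature array by appending each sqft_living value to a Python list in a loop and then calling reshape(len(x), 1) at every use. A minimal sketch of a more direct way to produce the 2-D arrays scikit-learn expects, assuming the same home_data.csv layout:

import pandas as pd

sales = pd.read_csv('home_data.csv')

# A pandas column exposes its data as a 1-D numpy array via .values;
# reshape(-1, 1) turns it into the (n_samples, 1) matrix scikit-learn expects.
X = sales['sqft_living'].values.reshape(-1, 1)
y = sales['price'].values

# Same hold-out split as the notebook: the last 20 rows are kept for testing.
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]
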
/Machine-Learning-Foundations-A-Case-Study-Approach/Course-2/Predicting House Price using SkitLearn.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[16]:
5 |
6 | import matplotlib.pyplot as plt
7 | import numpy as np
8 | import pandas as pd
9 | from sklearn import datasets, linear_model
10 |
11 |
12 | # In[18]:
13 |
14 | sales = pd.read_csv('home_data.csv')
15 |
16 |
17 | # In[19]:
18 |
19 | sales.head()
20 |
21 |
22 | # In[21]:
23 |
24 | sales_X = sales['sqft_living']
25 |
26 |
27 | # In[48]:
28 |
29 | x = []
30 | for i in sales_X:
31 | x.append(i)
32 | xarray = np.array(x)
33 | xarray
34 |
35 |
36 | # In[68]:
37 |
38 | sales_X_train = xarray[:-20]
39 | sales_X_train.reshape(len(sales_X_train),1)
40 |
41 |
42 | # In[53]:
43 |
44 | sales_X_test = xarray[-20:]
45 |
46 |
47 | # In[54]:
48 |
49 | sales_X_test
50 |
51 |
52 | # In[33]:
53 |
54 | sales_Y = sales['price']
55 |
56 |
57 | # In[55]:
58 |
59 | y = []
60 | for m in sales_Y:
61 | y.append(m)
62 | yarray = np.array(y)
63 | yarray
64 |
65 |
66 | # In[58]:
67 |
68 | sales_y_train = yarray[:-20]
69 | sales_y_test = yarray[-20:]
70 |
71 |
72 | # In[66]:
73 |
74 | sales_y_train.reshape(len(sales_y_train),1)
75 |
76 |
77 | # In[59]:
78 |
79 | regr = linear_model.LinearRegression()
80 |
81 |
82 | # In[70]:
83 |
84 | regr.fit(sales_X_train.reshape(len(sales_X_train),1), sales_y_train.reshape(len(sales_y_train),1))
85 |
86 |
87 | # In[71]:
88 |
89 | print("Coefficients: \n", regr.coef_)
90 |
91 |
92 | # In[ ]:
93 |
94 | print("Residual sum of squares: %.2f"
95 | % np.mean((regr.predict(sales_X_test.reshape(len(sales_X_test),1)) - sales_y_test.reshape(len(sales_y_test),1)) ** 2))
96 |
97 |
98 | # In[ ]:
99 |
100 | print('Variance score: %.2f' % regr.score(sales_X_test.reshape(len(sales_X_test),1), sales_y_test.reshape(len(sales_y_test),1)))
101 |
102 |
103 | # In[ ]:
104 |
105 | plt.scatter(sales_X_test.reshape(len(sales_X_test),1), sales_y_test.reshape(len(sales_y_test),1), color='black')
106 | plt.plot(sales_X_test.reshape(len(sales_X_test),1), regr.predict(sales_X_test.reshape(len(sales_X_test),1)), color='blue',
107 | linewidth=3)
108 |
109 | plt.xticks(())
110 | plt.yticks(())
111 |
112 | plt.show()
113 |
114 |
115 | # In[ ]:
116 |
117 |
118 |
119 |
--------------------------------------------------------------------------------
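
The script above evaluates the fit by printing the mean of the squared residuals (labelled "Residual sum of squares") and the variance score from regr.score. A self-contained sketch of the same evaluation using the sklearn.metrics helpers mean_squared_error and r2_score, assuming home_data.csv has the sqft_living and price columns used above:

import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

sales = pd.read_csv('home_data.csv')
X = sales['sqft_living'].values.reshape(-1, 1)   # 2-D feature matrix
y = sales['price'].values

# Same split as the script: the last 20 rows are held out for testing.
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

# mean_squared_error reproduces np.mean((pred - y_test) ** 2) from the script,
# and r2_score reproduces regr.score(X_test, y_test).
print('Coefficient: %.4f' % regr.coef_[0])
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
print('Variance score (R^2): %.2f' % r2_score(y_test, y_pred))
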
/Machine_Learning_Classification/test:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Machine_Learning_Regression/K-Nearest Neighborhood Regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Predicting house prices using k-nearest neighbors regression\n",
8 | "\n"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "In this notebook, you will implement k-nearest neighbors regression. You will:\n",
16 | "\n",
17 | "Find the k-nearest neighbors of a given query input\n",
18 | "Predict the output for the query input using the k-nearest neighbors\n",
19 | "Choose the best value of k using a validation set"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "import numpy as np\n",
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {
38 | "collapsed": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [],
52 | "source": [
53 | "sales = pd.read_csv('kc_house_data_small.csv', dtype = dtype_dict)\n",
54 | "train = pd.read_csv('kc_house_data_small_train.csv', dtype = dtype_dict)\n",
55 | "test = pd.read_csv('kc_house_data_small_test.csv', dtype = dtype_dict)\n",
56 | "validate = pd.read_csv('kc_house_data_validation 2.csv', dtype = dtype_dict)"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "3. To efficiently compute pairwise distances among data points, we will convert the SFrame (or dataframe) into a 2D Numpy array. First import the numpy library and then copy and paste get_numpy_data() (or equivalent). The function takes a dataset, a list of features (e.g. [‘sqft_living’, ‘bedrooms’]) to be used as inputs, and a name of the output (e.g. ‘price’). It returns a ‘features_matrix’ (2D array) consisting of a column of ones followed by columns containing the values of the input features in the data set in the same order as the input list. It also returns an ‘output_array’, which is an array of the values of the output in the dataset (e.g. ‘price’).\n",
64 | "\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 4,
70 | "metadata": {
71 | "collapsed": true
72 | },
73 | "outputs": [],
74 | "source": [
75 | "def get_numpy_data(data, features, output):\n",
76 | " data['constant'] = 1 # add a constant column to a dataframe\n",
77 | " # prepend variable 'constant' to the features list\n",
78 | " features = ['constant'] + features\n",
79 | " # select the columns of dataframe given by the ‘features’ list into the SFrame ‘features_sframe’\n",
80 | "\n",
81 | " # this will convert the features_sframe into a numpy matrix with GraphLab Create >= 1.7!!\n",
82 | " features_matrix = data[features].as_matrix(columns=None)\n",
83 | " # assign the column of data_sframe associated with the target to the variable ‘output_sarray’\n",
84 | "\n",
85 | " # this will convert the SArray into a numpy array:\n",
86 | " output_array = data[output].as_matrix(columns=None) \n",
87 | " return(features_matrix, output_array)"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "Similarly, copy and paste the normalize_features function (or equivalent) from Module 5 (Ridge Regression). Given a feature matrix, each column is divided (element-wise) by its 2-norm. The function returns two items: (i) a feature matrix with normalized columns and (ii) the norms of the original columns."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 5,
100 | "metadata": {
101 | "collapsed": true
102 | },
103 | "outputs": [],
104 | "source": [
105 | "def normalize_features(features):\n",
106 | " norms = np.sqrt(np.sum(features**2,axis=0))\n",
107 | " normlized_features = features/norms\n",
108 | " return (normlized_features, norms)"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "Using get_numpy_data (or equivalent), extract numpy arrays of the training, test, and validation sets."
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 6,
121 | "metadata": {
122 | "collapsed": false
123 | },
124 | "outputs": [],
125 | "source": [
126 | "features = [m for m,n in dtype_dict.items() if train[m].dtypes != object]"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 7,
132 | "metadata": {
133 | "collapsed": false
134 | },
135 | "outputs": [
136 | {
137 | "data": {
138 | "text/plain": [
139 | "['bathrooms',\n",
140 | " 'sqft_living15',\n",
141 | " 'sqft_above',\n",
142 | " 'grade',\n",
143 | " 'yr_built',\n",
144 | " 'price',\n",
145 | " 'bedrooms',\n",
146 | " 'long',\n",
147 | " 'sqft_lot15',\n",
148 | " 'sqft_living',\n",
149 | " 'floors',\n",
150 | " 'sqft_lot',\n",
151 | " 'waterfront',\n",
152 | " 'sqft_basement',\n",
153 | " 'yr_renovated',\n",
154 | " 'lat',\n",
155 | " 'condition',\n",
156 | " 'view']"
157 | ]
158 | },
159 | "execution_count": 7,
160 | "metadata": {},
161 | "output_type": "execute_result"
162 | }
163 | ],
164 | "source": [
165 | "features"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 8,
171 | "metadata": {
172 | "collapsed": true
173 | },
174 | "outputs": [],
175 | "source": [
176 | "features.remove('price')"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 9,
182 | "metadata": {
183 | "collapsed": true
184 | },
185 | "outputs": [],
186 | "source": [
187 | "training_feature_matrix, training_output = get_numpy_data(train, features, 'price')"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 10,
193 | "metadata": {
194 | "collapsed": true
195 | },
196 | "outputs": [],
197 | "source": [
198 | "testing_feature_matrix, testing_output = get_numpy_data(test, features, 'price')"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 11,
204 | "metadata": {
205 | "collapsed": false
206 | },
207 | "outputs": [],
208 | "source": [
209 | "validating_feature_matrix, validating_output = get_numpy_data(validate, features, 'price')"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "In computing distances, it is crucial to normalize features. Otherwise, for example, the ‘sqft_living’ feature (typically on the order of thousands) would exert a much larger influence on distance than the ‘bedrooms’ feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.\n",
217 | "\n",
218 | "IMPORTANT: Make sure to store the norms of the features in the training set. The features in the test and validation sets must be divided by these same norms, so that the training, test, and validation sets are normalized consistently.\n",
219 | "\n",
220 | "e.g. in Python:\n",
221 | "\n"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 12,
227 | "metadata": {
228 | "collapsed": false
229 | },
230 | "outputs": [],
231 | "source": [
232 | "features_train, norms = normalize_features(training_feature_matrix)\n",
233 | "features_test = testing_feature_matrix / norms\n",
234 | "features_valid = validating_feature_matrix / norms"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {
240 | "collapsed": true
241 | },
242 | "source": [
243 | "#Compute a single distance"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "To start, let's just explore computing the “distance” between two given houses. We will take our query house to be the first house of the test set and look at the distance between this house and the 10th house of the training set.\n",
251 | "\n",
252 | "To see the features associated with the query house, print the first row (index 0) of the test feature matrix. You should get an 18-dimensional vector whose components are between 0 and 1. Similarly, print the 10th row (index 9) of the training feature matrix."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 13,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "name": "stdout",
264 | "output_type": "stream",
265 | "text": [
266 | "[ 0.01345102 0.01807473 0.01375926 0.01362084 0.01564352 0.01350306\n",
267 | " 0.01551285 -0.01346922 0.0016225 0.01759212 0.017059 0.00160518\n",
268 | " 0. 0.02481682 0. 0.01345387 0.0116321 0.05102365]\n",
269 | "[ 0.01345102 0.00602491 0.01195898 0.0096309 0.01390535 0.01302544\n",
270 | " 0.01163464 -0.01346251 0.00156612 0.0083488 0.01279425 0.00050756\n",
271 | " 0. 0. 0. 0.01346821 0.01938684 0. ]\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "print features_test[0]\n",
277 | "print features_train[9]"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "Quiz Question: What is the Euclidean distance between the query house and the 10th house of the training set?"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 14,
290 | "metadata": {
291 | "collapsed": false
292 | },
293 | "outputs": [
294 | {
295 | "data": {
296 | "text/plain": [
297 | "0.059723593713980776"
298 | ]
299 | },
300 | "execution_count": 14,
301 | "metadata": {},
302 | "output_type": "execute_result"
303 | }
304 | ],
305 | "source": [
306 | "np.sqrt(np.sum((features_train[9] - features_test[0])**2))"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "Of course, to do nearest neighbor regression, we need to compute the distance between our query house and all houses in the training set.\n",
314 | "\n",
315 | "To visualize this nearest-neighbor search, let's first compute the distance from our query house (features_test[0]) to the first 10 houses of the training set (features_train[0:10]) and then search for the nearest neighbor within this small set of houses. Through restricting ourselves to a small set of houses to begin with, we can visually scan the list of 10 distances to verify that our code for finding the nearest neighbor is working.\n",
316 | "\n",
317 | "Write a loop to compute the Euclidean distance from the query house to each of the first 10 houses in the training set."
318 | ]
319 | },
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {},
323 | "source": [
324 | "Quiz Question: Among the first 10 training houses, which house is the closest to the query house?"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 15,
330 | "metadata": {
331 | "collapsed": false
332 | },
333 | "outputs": [
334 | {
335 | "name": "stdout",
336 | "output_type": "stream",
337 | "text": [
338 | "{0: 0.060274709162955922, 1: 0.085468811476437465, 2: 0.061499464352793153, 3: 0.053402739792943632, 4: 0.058444840601704413, 5: 0.059879215098128345, 6: 0.0546314049677546, 7: 0.055431083236146074, 8: 0.052383627840220305, 9: 0.059723593713980776}\n"
339 | ]
340 | }
341 | ],
342 | "source": [
343 | "distance = {}\n",
344 | "for i in range(10):\n",
345 | " distance[i] = np.sqrt(np.sum((features_train[i] - features_test[0])**2))\n",
346 | "print distance"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": 16,
352 | "metadata": {
353 | "collapsed": true
354 | },
355 | "outputs": [],
356 | "source": [
357 | "distance_2 = []\n",
358 | "for x,y in distance.items():\n",
359 | " distance_2.append((y,x))"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": 17,
365 | "metadata": {
366 | "collapsed": false
367 | },
368 | "outputs": [],
369 | "source": [
370 | "distance_2.sort()"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": 18,
376 | "metadata": {
377 | "collapsed": false
378 | },
379 | "outputs": [
380 | {
381 | "data": {
382 | "text/plain": [
383 | "[(0.052383627840220305, 8),\n",
384 | " (0.053402739792943632, 3),\n",
385 | " (0.0546314049677546, 6),\n",
386 | " (0.055431083236146074, 7),\n",
387 | " (0.058444840601704413, 4),\n",
388 | " (0.059723593713980776, 9),\n",
389 | " (0.059879215098128345, 5),\n",
390 | " (0.060274709162955922, 0),\n",
391 | " (0.061499464352793153, 2),\n",
392 | " (0.085468811476437465, 1)]"
393 | ]
394 | },
395 | "execution_count": 18,
396 | "metadata": {},
397 | "output_type": "execute_result"
398 | }
399 | ],
400 | "source": [
401 | "distance_2"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "It is computationally inefficient to loop over computing distances to all houses in our training dataset. Fortunately, many of the numpy functions can be vectorized, applying the same operation over multiple values or vectors. We now walk through this process. (The material up to #13 is specific to numpy; if you are using other languages such as R or Matlab, consult relevant manuals on vectorization.)\n",
409 | "\n",
410 | "Consider the following loop that computes the element-wise difference between the features of the query house (features_test[0]) and the first 3 training houses (features_train[0:3]):"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 19,
416 | "metadata": {
417 | "collapsed": false
418 | },
419 | "outputs": [
420 | {
421 | "name": "stdout",
422 | "output_type": "stream",
423 | "text": [
424 | "[ 0.00000000e+00 -1.20498190e-02 -5.14364795e-03 -5.50336860e-03\n",
425 | " -3.47633726e-03 -1.63756198e-04 -3.87821276e-03 1.29876855e-05\n",
426 | " 6.69281453e-04 -1.05552733e-02 -8.52950206e-03 2.08673616e-04\n",
427 | " 0.00000000e+00 -2.48168183e-02 0.00000000e+00 -1.70254220e-05\n",
428 | " 0.00000000e+00 -5.10236549e-02]\n",
429 | "[ 0.00000000e+00 -4.51868214e-03 -2.89330197e-03 1.30705004e-03\n",
430 | " -3.47633726e-03 -1.91048898e-04 -3.87821276e-03 6.16364736e-06\n",
431 | " 1.47606982e-03 -2.26610387e-03 0.00000000e+00 7.19763456e-04\n",
432 | " 0.00000000e+00 -1.45830788e-02 6.65082271e-02 4.23090220e-05\n",
433 | " 0.00000000e+00 -5.10236549e-02]\n",
434 | "[ 0.00000000e+00 -1.20498190e-02 3.72914476e-03 -8.32384500e-03\n",
435 | " -5.21450589e-03 -3.13866046e-04 -7.75642553e-03 1.56292487e-05\n",
436 | " 1.64764925e-03 -1.30002801e-02 -8.52950206e-03 1.60518166e-03\n",
437 | " 0.00000000e+00 -2.48168183e-02 0.00000000e+00 4.70885840e-05\n",
438 | " 0.00000000e+00 -5.10236549e-02]\n"
439 | ]
440 | }
441 | ],
442 | "source": [
443 | "for i in xrange(3):\n",
444 | " print features_train[i]-features_test[0]\n",
445 | " # should print 3 vectors of length 18"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": 20,
451 | "metadata": {
452 | "collapsed": false
453 | },
454 | "outputs": [
455 | {
456 | "name": "stdout",
457 | "output_type": "stream",
458 | "text": [
459 | "[[ 0.00000000e+00 -1.20498190e-02 -5.14364795e-03 -5.50336860e-03\n",
460 | " -3.47633726e-03 -1.63756198e-04 -3.87821276e-03 1.29876855e-05\n",
461 | " 6.69281453e-04 -1.05552733e-02 -8.52950206e-03 2.08673616e-04\n",
462 | " 0.00000000e+00 -2.48168183e-02 0.00000000e+00 -1.70254220e-05\n",
463 | " 0.00000000e+00 -5.10236549e-02]\n",
464 | " [ 0.00000000e+00 -4.51868214e-03 -2.89330197e-03 1.30705004e-03\n",
465 | " -3.47633726e-03 -1.91048898e-04 -3.87821276e-03 6.16364736e-06\n",
466 | " 1.47606982e-03 -2.26610387e-03 0.00000000e+00 7.19763456e-04\n",
467 | " 0.00000000e+00 -1.45830788e-02 6.65082271e-02 4.23090220e-05\n",
468 | " 0.00000000e+00 -5.10236549e-02]\n",
469 | " [ 0.00000000e+00 -1.20498190e-02 3.72914476e-03 -8.32384500e-03\n",
470 | " -5.21450589e-03 -3.13866046e-04 -7.75642553e-03 1.56292487e-05\n",
471 | " 1.64764925e-03 -1.30002801e-02 -8.52950206e-03 1.60518166e-03\n",
472 | " 0.00000000e+00 -2.48168183e-02 0.00000000e+00 4.70885840e-05\n",
473 | " 0.00000000e+00 -5.10236549e-02]]\n"
474 | ]
475 | }
476 | ],
477 | "source": [
478 | "print features_train[0:3] - features_test[0]"
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "Note that the output of this vectorized operation is identical to that of the loop above, which can be verified below:"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 21,
491 | "metadata": {
492 | "collapsed": false
493 | },
494 | "outputs": [
495 | {
496 | "name": "stdout",
497 | "output_type": "stream",
498 | "text": [
499 | "[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
500 | "[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
501 | "[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n"
502 | ]
503 | }
504 | ],
505 | "source": [
506 | "# verify that vectorization works\n",
507 | "results = features_train[0:3] - features_test[0]\n",
508 | "print results[0] - (features_train[0]-features_test[0])\n",
509 | "# should print all 0's if results[0] == (features_train[0]-features_test[0])\n",
510 | "print results[1] - (features_train[1]-features_test[0])\n",
511 | "# should print all 0's if results[1] == (features_train[1]-features_test[0])\n",
512 | "print results[2] - (features_train[2]-features_test[0])\n",
513 | "# should print all 0's if results[2] == (features_train[2]-features_test[0])"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "# Perform 1-nearest neighbor regression"
521 | ]
522 | },
523 | {
524 | "cell_type": "markdown",
525 | "metadata": {},
526 | "source": [
527 | "Now that we have the element-wise differences, it is not too hard to compute the Euclidean distances between our query house and all of the training houses. First, write a single-line expression to define a variable ‘diff’ such that ‘diff[i]’ gives the element-wise difference between the features of the query house and the i-th training house.\n",
528 | "\n",
529 | "To test your code, print diff[-1].sum(), which should be -0.0934339605842."
530 | ]
531 | },
532 | {
533 | "cell_type": "code",
534 | "execution_count": 22,
535 | "metadata": {
536 | "collapsed": false
537 | },
538 | "outputs": [
539 | {
540 | "data": {
541 | "text/plain": [
542 | "array([[ 0.00000000e+00, -1.20498190e-02, -5.14364795e-03, ...,\n",
543 | " -1.70254220e-05, 0.00000000e+00, -5.10236549e-02],\n",
544 | " [ 0.00000000e+00, -4.51868214e-03, -2.89330197e-03, ...,\n",
545 | " 4.23090220e-05, 0.00000000e+00, -5.10236549e-02],\n",
546 | " [ 0.00000000e+00, -1.20498190e-02, 3.72914476e-03, ...,\n",
547 | " 4.70885840e-05, 0.00000000e+00, -5.10236549e-02],\n",
548 | " ..., \n",
549 | " [ 0.00000000e+00, -3.01245476e-03, 8.35842791e-04, ...,\n",
550 | " -9.19146535e-06, 0.00000000e+00, -5.10236549e-02],\n",
551 | " [ 0.00000000e+00, -3.01245476e-03, 2.44323277e-03, ...,\n",
552 | " -1.63183862e-05, 0.00000000e+00, -5.10236549e-02],\n",
553 | " [ 0.00000000e+00, -3.01245476e-03, -3.92203156e-03, ...,\n",
554 | " 3.61719513e-05, 0.00000000e+00, -5.10236549e-02]])"
555 | ]
556 | },
557 | "execution_count": 22,
558 | "metadata": {},
559 | "output_type": "execute_result"
560 | }
561 | ],
562 | "source": [
563 | "diff = features_train[:] - features_test[0]\n",
564 | "diff"
565 | ]
566 | },
567 | {
568 | "cell_type": "code",
569 | "execution_count": 23,
570 | "metadata": {
571 | "collapsed": false
572 | },
573 | "outputs": [
574 | {
575 | "data": {
576 | "text/plain": [
577 | "array([[ 0.00000000e+00, -1.20498190e-02, -5.14364795e-03, ...,\n",
578 | " -1.70254220e-05, 0.00000000e+00, -5.10236549e-02],\n",
579 | " [ 0.00000000e+00, -4.51868214e-03, -2.89330197e-03, ...,\n",
580 | " 4.23090220e-05, 0.00000000e+00, -5.10236549e-02],\n",
581 | " [ 0.00000000e+00, -1.20498190e-02, 3.72914476e-03, ...,\n",
582 | " 4.70885840e-05, 0.00000000e+00, -5.10236549e-02],\n",
583 | " ..., \n",
584 | " [ 0.00000000e+00, -3.01245476e-03, 8.35842791e-04, ...,\n",
585 | " -9.19146535e-06, 0.00000000e+00, -5.10236549e-02],\n",
586 | " [ 0.00000000e+00, -3.01245476e-03, 2.44323277e-03, ...,\n",
587 | " -1.63183862e-05, 0.00000000e+00, -5.10236549e-02],\n",
588 | " [ 0.00000000e+00, -3.01245476e-03, -3.92203156e-03, ...,\n",
589 | " 3.61719513e-05, 0.00000000e+00, -5.10236549e-02]])"
590 | ]
591 | },
592 | "execution_count": 23,
593 | "metadata": {},
594 | "output_type": "execute_result"
595 | }
596 | ],
597 | "source": [
598 | "features_train - features_test[0]"
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "execution_count": 24,
604 | "metadata": {
605 | "collapsed": false
606 | },
607 | "outputs": [
608 | {
609 | "data": {
610 | "text/plain": [
611 | "-0.09343399874654644"
612 | ]
613 | },
614 | "execution_count": 24,
615 | "metadata": {},
616 | "output_type": "execute_result"
617 | }
618 | ],
619 | "source": [
620 | "diff[-1].sum()"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "The next step in computing the Euclidean distances is to take these feature-by-feature differences in ‘diff’, square each, and take the sum over feature indices. That is, compute the sum of squared feature differences for each training house (row in ‘diff’).\n",
628 | "\n",
629 | "By default, ‘np.sum’ sums up everything in the matrix and returns a single number. To instead sum only over a row or column, we need to specifiy the ‘axis’ parameter described in the np.sum documentation. In particular, ‘axis=1’ computes the sum across each row.\n",
630 | "\n"
631 | ]
632 | },
633 | {
634 | "cell_type": "code",
635 | "execution_count": 25,
636 | "metadata": {
637 | "collapsed": false
638 | },
639 | "outputs": [],
640 | "source": [
641 | "total_row = np.sum(diff**2, axis=1)"
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": 26,
647 | "metadata": {
648 | "collapsed": false
649 | },
650 | "outputs": [
651 | {
652 | "data": {
653 | "text/plain": [
654 | "(5527,)"
655 | ]
656 | },
657 | "execution_count": 26,
658 | "metadata": {},
659 | "output_type": "execute_result"
660 | }
661 | ],
662 | "source": [
663 | "total_row.shape"
664 | ]
665 | },
666 | {
667 | "cell_type": "code",
668 | "execution_count": 27,
669 | "metadata": {
670 | "collapsed": false
671 | },
672 | "outputs": [
673 | {
674 | "data": {
675 | "text/plain": [
676 | "(5527, 18)"
677 | ]
678 | },
679 | "execution_count": 27,
680 | "metadata": {},
681 | "output_type": "execute_result"
682 | }
683 | ],
684 | "source": [
685 | "diff.shape"
686 | ]
687 | },
688 | {
689 | "cell_type": "markdown",
690 | "metadata": {},
691 | "source": [
692 | "computes this sum of squared feature differences for all training houses. Verify that the two expressions\n"
693 | ]
694 | },
695 | {
696 | "cell_type": "code",
697 | "execution_count": 28,
698 | "metadata": {
699 | "collapsed": false
700 | },
701 | "outputs": [
702 | {
703 | "data": {
704 | "text/plain": [
705 | "0.0033070590284564457"
706 | ]
707 | },
708 | "execution_count": 28,
709 | "metadata": {},
710 | "output_type": "execute_result"
711 | }
712 | ],
713 | "source": [
714 | "np.sum(diff**2, axis=1)[15]"
715 | ]
716 | },
717 | {
718 | "cell_type": "code",
719 | "execution_count": 29,
720 | "metadata": {
721 | "collapsed": false
722 | },
723 | "outputs": [
724 | {
725 | "data": {
726 | "text/plain": [
727 | "0.0033070590284564453"
728 | ]
729 | },
730 | "execution_count": 29,
731 | "metadata": {},
732 | "output_type": "execute_result"
733 | }
734 | ],
735 | "source": [
736 | "np.sum(diff[15]**2)"
737 | ]
738 | },
739 | {
740 | "cell_type": "markdown",
741 | "metadata": {},
742 | "source": [
743 | "With this result in mind, write a single-line expression to compute the Euclidean distances from the query to all the instances. Assign the result to variable distances.\n",
744 | "\n",
745 | "Hint: don't forget to take the square root of the sum of squares.\n",
746 | "\n",
747 | "Hint: distances[100] should contain 0.0237082324496."
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": 30,
753 | "metadata": {
754 | "collapsed": false
755 | },
756 | "outputs": [
757 | {
758 | "data": {
759 | "text/plain": [
760 | "0.023708232416678195"
761 | ]
762 | },
763 | "execution_count": 30,
764 | "metadata": {},
765 | "output_type": "execute_result"
766 | }
767 | ],
768 | "source": [
769 | "np.sqrt(sum(diff[100]**2))"
770 | ]
771 | },
772 | {
773 | "cell_type": "markdown",
774 | "metadata": {},
775 | "source": [
776 | "Now you are ready to write a function that computes the distances from a query house to all training houses. The function should take two parameters: (i) the matrix of training features and (ii) the single feature vector associated with the query."
777 | ]
778 | },
779 | {
780 | "cell_type": "code",
781 | "execution_count": 31,
782 | "metadata": {
783 | "collapsed": true
784 | },
785 | "outputs": [],
786 | "source": [
787 | "def compute_distances(features_query):\n",
788 | " diff = features_train - features_test[features_query]\n",
789 | " distances = np.sqrt(np.sum(diff**2, axis=1))\n",
790 | " return distances"
791 | ]
792 | },
793 | {
794 | "cell_type": "code",
795 | "execution_count": 32,
796 | "metadata": {
797 | "collapsed": false
798 | },
799 | "outputs": [],
800 | "source": [
801 | "dist = compute_distances(2)"
802 | ]
803 | },
804 | {
805 | "cell_type": "code",
806 | "execution_count": 33,
807 | "metadata": {
808 | "collapsed": false
809 | },
810 | "outputs": [
811 | {
812 | "data": {
813 | "text/plain": [
814 | "array([ 0.01954476, 0.06861035, 0.02165079, ..., 0.02433478,\n",
815 | " 0.02622734, 0.02637942])"
816 | ]
817 | },
818 | "execution_count": 33,
819 | "metadata": {},
820 | "output_type": "execute_result"
821 | }
822 | ],
823 | "source": [
824 | "dist"
825 | ]
826 | },
827 | {
828 | "cell_type": "code",
829 | "execution_count": 34,
830 | "metadata": {
831 | "collapsed": false
832 | },
833 | "outputs": [
834 | {
835 | "data": {
836 | "text/plain": [
837 | "0.0028604955575117085"
838 | ]
839 | },
840 | "execution_count": 34,
841 | "metadata": {},
842 | "output_type": "execute_result"
843 | }
844 | ],
845 | "source": [
846 | "min(dist)"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": 35,
852 | "metadata": {
853 | "collapsed": false
854 | },
855 | "outputs": [
856 | {
857 | "data": {
858 | "text/plain": [
859 | "382"
860 | ]
861 | },
862 | "execution_count": 35,
863 | "metadata": {},
864 | "output_type": "execute_result"
865 | }
866 | ],
867 | "source": [
868 | "np.argmin(dist)"
869 | ]
870 | },
871 | {
872 | "cell_type": "markdown",
873 | "metadata": {},
874 | "source": [
875 | "Quiz Question: What is the predicted value of the query house based on 1-nearest neighbor regression?"
876 | ]
877 | },
878 | {
879 | "cell_type": "code",
880 | "execution_count": 36,
881 | "metadata": {
882 | "collapsed": false
883 | },
884 | "outputs": [
885 | {
886 | "data": {
887 | "text/plain": [
888 | "249000.0"
889 | ]
890 | },
891 | "execution_count": 36,
892 | "metadata": {},
893 | "output_type": "execute_result"
894 | }
895 | ],
896 | "source": [
897 | "training_output[382]"
898 | ]
899 | },
900 | {
901 | "cell_type": "markdown",
902 | "metadata": {},
903 | "source": [
904 | "#Perform k-nearest neighbor regression"
905 | ]
906 | },
907 | {
908 | "cell_type": "markdown",
909 | "metadata": {},
910 | "source": [
911 | "Using the functions above, implement a function that takes in\n",
912 | "\n",
913 | "the value of k;\n",
914 | "the feature matrix for the instances; and\n",
915 | "the feature of the query\n",
916 | "and returns the indices of the k closest training houses. For instance, with 2-nearest neighbor, a return value of [5, 10] would indicate that the 6th and 11th training houses are closest to the query house."
917 | ]
918 | },
919 | {
920 | "cell_type": "code",
921 | "execution_count": 37,
922 | "metadata": {
923 | "collapsed": true
924 | },
925 | "outputs": [],
926 | "source": [
927 | "def k_nearest_neighbors(k, feat_query):\n",
928 | " distance = compute_distances(feat_query)\n",
929 | "# print np.sort(distance)[:k]\n",
930 | " return np.argsort(distance)[0:k]"
931 | ]
932 | },
933 | {
934 | "cell_type": "markdown",
935 | "metadata": {},
936 | "source": [
937 | "Quiz Question: Take the query house to be third house of the test set (features_test[2]). What are the indices of the 4 training houses closest to the query house?"
938 | ]
939 | },
940 | {
941 | "cell_type": "code",
942 | "execution_count": 38,
943 | "metadata": {
944 | "collapsed": false
945 | },
946 | "outputs": [
947 | {
948 | "data": {
949 | "text/plain": [
950 | "array([ 382, 1149, 4087, 3142])"
951 | ]
952 | },
953 | "execution_count": 38,
954 | "metadata": {},
955 | "output_type": "execute_result"
956 | }
957 | ],
958 | "source": [
959 | "k_nearest_neighbors(4,2)"
960 | ]
961 | },
962 | {
963 | "cell_type": "markdown",
964 | "metadata": {},
965 | "source": [
966 | "Now that we know how to find the k-nearest neighbors, write a function that predicts the value of a given query house. For simplicity, take the average of the prices of the k nearest neighbors in the training set. The function should have the following parameters:\n",
967 | "\n",
968 | "the value of k;\n",
969 | "the feature matrix for the instances;\n",
970 | "the output values (prices) of the instances; and\n",
971 | "the feature of the query, whose price we’re predicting.\n",
972 | "The function should return a predicted value of the query house."
973 | ]
974 | },
975 | {
976 | "cell_type": "code",
977 | "execution_count": 42,
978 | "metadata": {
979 | "collapsed": true
980 | },
981 | "outputs": [],
982 | "source": [
983 | "def predict_output_of_query(k, features_train, output_train, features_query):\n",
984 | " prediction = np.sum(output_train[k_nearest_neighbors(k,features_query)])/k\n",
985 | " return prediction"
986 | ]
987 | },
988 | {
989 | "cell_type": "markdown",
990 | "metadata": {},
991 | "source": [
992 | "Quiz Question: Make predictions for the first 10 houses in the test set, using k=10. What is the index of the house in this query set that has the lowest predicted value? What is the predicted value of this house?"
993 | ]
994 | },
995 | {
996 | "cell_type": "code",
997 | "execution_count": 45,
998 | "metadata": {
999 | "collapsed": false
1000 | },
1001 | "outputs": [
1002 | {
1003 | "name": "stdout",
1004 | "output_type": "stream",
1005 | "text": [
1006 | "0 881300.0\n",
1007 | "1 431860.0\n",
1008 | "2 460595.0\n",
1009 | "3 430200.0\n",
1010 | "4 766750.0\n",
1011 | "5 667420.0\n",
1012 | "6 350032.0\n",
1013 | "7 512800.7\n",
1014 | "8 484000.0\n",
1015 | "9 457235.0\n"
1016 | ]
1017 | }
1018 | ],
1019 | "source": [
1020 | "for m in range(10):\n",
1021 | " print m, predict_output_of_query(10, features_train, training_output, m)"
1022 | ]
1023 | },
1024 | {
1025 | "cell_type": "markdown",
1026 | "metadata": {},
1027 | "source": [
1028 | "# Choosing the best value of k using a validation set"
1029 | ]
1030 | },
1031 | {
1032 | "cell_type": "markdown",
1033 | "metadata": {},
1034 | "source": [
1035 | "There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:\n",
1036 | "\n",
1037 | "For k in [1, 2, … 15]:\n",
1038 | "\n",
1039 | "Make predictions for the VALIDATION data using the k-nearest neighbors from the TRAINING data.\n",
1040 | "Compute the RSS on VALIDATION data\n",
1041 | "Report which k produced the lowest RSS on validation data."
1042 | ]
1043 | },
1044 | {
1045 | "cell_type": "code",
1046 | "execution_count": 50,
1047 | "metadata": {
1048 | "collapsed": false
1049 | },
1050 | "outputs": [
1051 | {
1052 | "ename": "IndexError",
1053 | "evalue": "arrays used as indices must be of integer (or boolean) type",
1054 | "output_type": "error",
1055 | "traceback": [
1056 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1057 | "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
1058 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mrss_all\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzeros\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m15\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mk\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m16\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mpredictions_k\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpredict_output_of_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeatures_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtraining_output\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeatures_valid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0mrss_all\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpredictions_k\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0mvalidating_output\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m**\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
1059 | "\u001b[0;32m\u001b[0m in \u001b[0;36mpredict_output_of_query\u001b[0;34m(k, features_train, output_train, features_query)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mpredict_output_of_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeatures_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeatures_query\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprediction\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_train\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk_nearest_neighbors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mfeatures_query\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mprediction\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
1060 | "\u001b[0;32m\u001b[0m in \u001b[0;36mk_nearest_neighbors\u001b[0;34m(k, feat_query)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mk_nearest_neighbors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeat_query\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdistance\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcompute_distances\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfeat_query\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;31m# print np.sort(distance)[:k]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margsort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdistance\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
1061 | "\u001b[0;32m\u001b[0m in \u001b[0;36mcompute_distances\u001b[0;34m(features_query)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mcompute_distances\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfeatures_query\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdiff\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfeatures_train\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mfeatures_test\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mfeatures_query\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdistances\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msqrt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdiff\u001b[0m\u001b[0;34m**\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdistances\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
1062 | "\u001b[0;31mIndexError\u001b[0m: arrays used as indices must be of integer (or boolean) type"
1063 | ]
1064 | }
1065 | ],
1066 | "source": [
1067 | "rss_all = np.zeros(15)\n",
1068 | "for k in range(1,16):\n",
1069 | " predictions_k = predict_output_of_query(k, features_train, training_output, features_valid)\n",
1070 | " rss_all[k-1] = np.sum((predictions_k-validating_output)**2)"
1071 | ]
1072 | },
1073 | {
1074 | "cell_type": "code",
1075 | "execution_count": null,
1076 | "metadata": {
1077 | "collapsed": true
1078 | },
1079 | "outputs": [],
1080 | "source": []
1081 | }
1082 | ],
1083 | "metadata": {
1084 | "kernelspec": {
1085 | "display_name": "Python 2",
1086 | "language": "python",
1087 | "name": "python2"
1088 | },
1089 | "language_info": {
1090 | "codemirror_mode": {
1091 | "name": "ipython",
1092 | "version": 2
1093 | },
1094 | "file_extension": ".py",
1095 | "mimetype": "text/x-python",
1096 | "name": "python",
1097 | "nbconvert_exporter": "python",
1098 | "pygments_lexer": "ipython2",
1099 | "version": "2.7.11"
1100 | }
1101 | },
1102 | "nbformat": 4,
1103 | "nbformat_minor": 0
1104 | }
1105 |
--------------------------------------------------------------------------------
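
The final cell of the notebook above raises an IndexError because compute_distances indexes features_test with whatever it receives, while the validation loop passes the entire validation feature matrix. A minimal sketch of the same helpers rewritten to take the query features directly (as the assignment text describes), followed by the k-selection loop; it assumes the normalized arrays features_train, features_valid, training_output, and validating_output built earlier in the notebook:

import numpy as np

def compute_distances(features_instances, features_query):
    # Euclidean distance from one query vector to every row of the matrix.
    diff = features_instances - features_query
    return np.sqrt(np.sum(diff**2, axis=1))

def k_nearest_neighbors(k, features_train, features_query):
    # Indices of the k training houses closest to the query vector.
    distances = compute_distances(features_train, features_query)
    return np.argsort(distances)[:k]

def predict_output_of_query(k, features_train, output_train, features_query):
    # Average price of the k nearest training neighbors of one query vector.
    neighbors = k_nearest_neighbors(k, features_train, features_query)
    return np.mean(output_train[neighbors])

def predict_output(k, features_train, output_train, features_query_matrix):
    # Predict every row of a query matrix (e.g. the validation set).
    return np.array([predict_output_of_query(k, features_train, output_train, q)
                     for q in features_query_matrix])

# Choose k on the validation set, as the last markdown cell describes.
rss_all = np.zeros(15)
for k in range(1, 16):
    predictions_k = predict_output(k, features_train, training_output, features_valid)
    rss_all[k - 1] = np.sum((predictions_k - validating_output)**2)
print('lowest validation RSS at k = %d' % (np.argmin(rss_all) + 1))

With this signature, the earlier single-query calls become, for example, k_nearest_neighbors(4, features_train, features_test[2]) and predict_output_of_query(10, features_train, training_output, features_test[0]).
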
/Machine_Learning_Regression/ML_UW_Regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AprilXiaoyanLiu/Machine-Learning-University-of-Washington/2edb9caced2cb4fda576121699ac228d206c6507/Machine_Learning_Regression/ML_UW_Regression.pdf
--------------------------------------------------------------------------------
/Machine_Learning_Regression/Week 5 - Lasso Regression Assignment 1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "In this assignment, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using GraphLab Create, though you can use other solvers). You will:\n",
8 | "\n",
9 | "Run LASSO with different L1 penalties.\n",
10 | "Choose best L1 penalty using a validation set.\n",
11 | "Choose best L1 penalty using a validation set, with additional constraint on the size of subset.\n"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {
18 | "collapsed": false
19 | },
20 | "outputs": [],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}\n",
25 | "sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "Create new features by performing following transformation on inputs:"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 2,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "from math import log, sqrt\n",
44 | "sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)\n",
45 | "sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)\n",
46 | "sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']\n",
47 | "sales['floors_square'] = sales['floors']*sales['floors']"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.\n",
55 | "On the other hand, taking square root of sqft_living will decrease the separation between big house and small house. The owner may not be exactly twice as happy for getting a house that is twice as big.\n"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "Using the entire house dataset, learn regression weights using an L1 penalty of 5e2. Make sure to add \"normalize=True\" when creating the Lasso object. Refer to the following code snippet for the list of features."
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 3,
68 | "metadata": {
69 | "collapsed": false
70 | },
71 | "outputs": [
72 | {
73 | "data": {
74 | "text/plain": [
75 | "Lasso(alpha=500.0, copy_X=True, fit_intercept=True, max_iter=1000,\n",
76 | " normalize=True, positive=False, precompute=False, random_state=None,\n",
77 | " selection='cyclic', tol=0.0001, warm_start=False)"
78 | ]
79 | },
80 | "execution_count": 3,
81 | "metadata": {},
82 | "output_type": "execute_result"
83 | }
84 | ],
85 | "source": [
86 | "from sklearn import linear_model # using scikit-learn\n",
87 | "\n",
88 | "all_features = ['bedrooms', 'bedrooms_square',\n",
89 | " 'bathrooms',\n",
90 | " 'sqft_living', 'sqft_living_sqrt',\n",
91 | " 'sqft_lot', 'sqft_lot_sqrt',\n",
92 | " 'floors', 'floors_square',\n",
93 | " 'waterfront', 'view', 'condition', 'grade',\n",
94 | " 'sqft_above',\n",
95 | " 'sqft_basement',\n",
96 | " 'yr_built', 'yr_renovated']\n",
97 | "\n",
98 | "model_all = linear_model.Lasso(alpha=5e2, normalize=True) # set parameters\n",
99 | "model_all.fit(sales[all_features], sales['price']) # learn weights"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "Quiz Question: Which features have been chosen by LASSO, i.e. which features were assigned nonzero weights?"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 4,
112 | "metadata": {
113 | "collapsed": false
114 | },
115 | "outputs": [
116 | {
117 | "data": {
118 | "text/plain": [
119 | "bedrooms 0.000000\n",
120 | "bedrooms_square 0.000000\n",
121 | "bathrooms 0.000000\n",
122 | "sqft_living 134.439314\n",
123 | "sqft_living_sqrt 0.000000\n",
124 | "sqft_lot 0.000000\n",
125 | "sqft_lot_sqrt 0.000000\n",
126 | "floors 0.000000\n",
127 | "floors_square 0.000000\n",
128 | "waterfront 0.000000\n",
129 | "view 24750.004586\n",
130 | "condition 0.000000\n",
131 | "grade 61749.103091\n",
132 | "sqft_above 0.000000\n",
133 | "sqft_basement 0.000000\n",
134 | "yr_built -0.000000\n",
135 | "yr_renovated 0.000000\n",
136 | "dtype: float64"
137 | ]
138 | },
139 | "execution_count": 4,
140 | "metadata": {},
141 | "output_type": "execute_result"
142 | }
143 | ],
144 | "source": [
145 | "pd.Series(model_all.coef_,index=all_features)"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "To find a good L1 penalty, we will explore multiple values using a validation set. Let us do three way split into train, validation, and test sets. Download the provided csv files containing training, validation and test sets."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 5,
158 | "metadata": {
159 | "collapsed": true
160 | },
161 | "outputs": [],
162 | "source": [
163 | "testing = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)\n",
164 | "training = pd.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)\n",
165 | "validation = pd.read_csv('wk3_kc_house_valid_data.csv', dtype=dtype_dict)"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 6,
171 | "metadata": {
172 | "collapsed": true
173 | },
174 | "outputs": [],
175 | "source": [
176 | "testing['sqft_living_sqrt'] = testing['sqft_living'].apply(sqrt)\n",
177 | "testing['sqft_lot_sqrt'] = testing['sqft_lot'].apply(sqrt)\n",
178 | "testing['bedrooms_square'] = testing['bedrooms']*testing['bedrooms']\n",
179 | "testing['floors_square'] = testing['floors']*testing['floors']\n",
180 | "\n",
181 | "training['sqft_living_sqrt'] = training['sqft_living'].apply(sqrt)\n",
182 | "training['sqft_lot_sqrt'] = training['sqft_lot'].apply(sqrt)\n",
183 | "training['bedrooms_square'] = training['bedrooms']*training['bedrooms']\n",
184 | "training['floors_square'] = training['floors']*training['floors']\n",
185 | "\n",
186 | "validation['sqft_living_sqrt'] = validation['sqft_living'].apply(sqrt)\n",
187 | "validation['sqft_lot_sqrt'] = validation['sqft_lot'].apply(sqrt)\n",
188 | "validation['bedrooms_square'] = validation['bedrooms']*validation['bedrooms']\n",
189 | "validation['floors_square'] = validation['floors']*validation['floors']"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": 7,
195 | "metadata": {
196 | "collapsed": false
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "array([ 1.00000000e+01, 3.16227766e+01, 1.00000000e+02,\n",
203 | " 3.16227766e+02, 1.00000000e+03, 3.16227766e+03,\n",
204 | " 1.00000000e+04, 3.16227766e+04, 1.00000000e+05,\n",
205 | " 3.16227766e+05, 1.00000000e+06, 3.16227766e+06,\n",
206 | " 1.00000000e+07])"
207 | ]
208 | },
209 | "execution_count": 7,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "l1_penalty = np.logspace(1, 7, num=13)\n",
216 | "l1_penalty"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Learn a model on TRAINING data using the specified l1_penalty. Make sure to specify normalize=True in the constructor:"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "Compute the RSS on VALIDATION for the current model (print or save the RSS)"
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "Quiz Question: Which was the best value for the l1_penalty, i.e. which value of l1_penalty produced the lowest RSS on VALIDATION data?\n",
238 | "\n"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 8,
244 | "metadata": {
245 | "collapsed": false
246 | },
247 | "outputs": [
248 | {
249 | "name": "stdout",
250 | "output_type": "stream",
251 | "text": [
252 | "10.0 3.982133273e+14\n",
253 | "31.6227766017 3.99041900253e+14\n",
254 | "100.0 4.29791604073e+14\n",
255 | "316.227766017 4.63739831045e+14\n",
256 | "1000.0 6.45898733634e+14\n",
257 | "3162.27766017 1.22250685943e+15\n",
258 | "10000.0 1.22250685943e+15\n",
259 | "31622.7766017 1.22250685943e+15\n",
260 | "100000.0 1.22250685943e+15\n",
261 | "316227.766017 1.22250685943e+15\n",
262 | "1000000.0 1.22250685943e+15\n",
263 | "3162277.66017 1.22250685943e+15\n",
264 | "10000000.0 1.22250685943e+15\n"
265 | ]
266 | }
267 | ],
268 | "source": [
269 | "for i in l1_penalty:\n",
270 | " train_model_all = linear_model.Lasso(i, normalize=True) # set parameters\n",
271 | " train_model_all.fit(training[all_features], training['price']) # learn weights\n",
272 | " y = pd.Series(train_model_all.coef_, index=all_features)\n",
273 | " RSS = np.sum((train_model_all.predict(validation[all_features])-validation['price'])**2)\n",
274 | " print i,RSS"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "Now that you have selected an L1 penalty, compute the RSS on TEST data for the model with the best L1 penalty."
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 9,
287 | "metadata": {
288 | "collapsed": false
289 | },
290 | "outputs": [],
291 | "source": [
292 | "train2_model_all = linear_model.Lasso(alpha=10.0, normalize=True)"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": 10,
298 | "metadata": {
299 | "collapsed": false
300 | },
301 | "outputs": [
302 | {
303 | "data": {
304 | "text/plain": [
305 | "Lasso(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=1000,\n",
306 | " normalize=True, positive=False, precompute=False, random_state=None,\n",
307 | " selection='cyclic', tol=0.0001, warm_start=False)"
308 | ]
309 | },
310 | "execution_count": 10,
311 | "metadata": {},
312 | "output_type": "execute_result"
313 | }
314 | ],
315 | "source": [
316 | "train2_model_all.fit(training[all_features], training['price'])"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 11,
322 | "metadata": {
323 | "collapsed": false
324 | },
325 | "outputs": [
326 | {
327 | "data": {
328 | "text/plain": [
329 | "15"
330 | ]
331 | },
332 | "execution_count": 11,
333 | "metadata": {},
334 | "output_type": "execute_result"
335 | }
336 | ],
337 | "source": [
338 | "np.count_nonzero(train2_model_all.coef_) + np.count_nonzero(train2_model_all.intercept_)"
339 | ]
340 | },
341 | {
342 | "cell_type": "markdown",
343 | "metadata": {},
344 | "source": [
345 | "What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive \"a rule of thumb\" --- an interpretable model that has only a few features in them.\n",
346 | "\n",
347 | "You are going to implement a simple, two phase procedure to achieve this goal:\n",
348 | "\n",
349 | "Explore a large range of ‘l1_penalty’ values to find a narrow region of ‘l1_penalty’ values where models are likely to have the desired number of non-zero weights.\n",
350 | "Further explore the narrow region you found to find a good value for ‘l1_penalty’ that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for ‘l1_penalty’."
351 | ]
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "Assign 7 to the variable ‘max_nonzeros’."
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "Exploring large range of l1_penalty"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "For l1_penalty in np.logspace(1, 4, num=20):"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {},
377 | "source": [
378 | "Quiz Question: What values did you find for l1_penalty_min and l1_penalty_max?"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 12,
384 | "metadata": {
385 | "collapsed": true
386 | },
387 | "outputs": [],
388 | "source": [
389 | "max_nonzeros = 7"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 13,
395 | "metadata": {
396 | "collapsed": false
397 | },
398 | "outputs": [],
399 | "source": [
400 | "alpha = np.logspace(1, 4, num=20)"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 14,
406 | "metadata": {
407 | "collapsed": false
408 | },
409 | "outputs": [],
410 | "source": [
411 | "counts_of_nonzeros = []\n",
412 | "for l1_penalty in alpha:\n",
413 | " train_model_2 = linear_model.Lasso(l1_penalty, normalize=True) # set parameters\n",
414 | " train_model_2.fit(training[all_features], training['price']) # learn weights\n",
415 | " if train_model_2.intercept_ <> 0:\n",
416 | " counts = np.count_nonzero(train_model_2.coef_) + np.count_nonzero(train_model_2.intercept_)\n",
417 | " counts_of_nonzeros.append((l1_penalty,counts))\n",
418 | " elif train_model_2.intercept_==0:\n",
419 | " counts = np.count_nonzero(train_model_2.coef_) + 1\n",
420 | " counts_of_nonzeros.append((l1_penalty,counts))\n",
421 | " \n",
422 | " \n",
423 | " "
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": 15,
429 | "metadata": {
430 | "collapsed": false
431 | },
432 | "outputs": [
433 | {
434 | "name": "stdout",
435 | "output_type": "stream",
436 | "text": [
437 | "[(10.0, 15), (14.384498882876629, 15), (20.691380811147901, 15), (29.763514416313178, 15), (42.813323987193932, 13), (61.584821106602639, 12), (88.586679041008225, 11), (127.42749857031335, 10), (183.29807108324357, 7), (263.66508987303581, 6), (379.26901907322497, 6), (545.55947811685144, 6), (784.75997035146065, 5), (1128.8378916846884, 3), (1623.776739188721, 3), (2335.7214690901214, 2), (3359.8182862837812, 1), (4832.9302385717519, 1), (6951.9279617756056, 1), (10000.0, 1)]\n"
438 | ]
439 | }
440 | ],
441 | "source": [
442 | "print counts_of_nonzeros"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": 16,
448 | "metadata": {
449 | "collapsed": false
450 | },
451 | "outputs": [],
452 | "source": [
453 | "list_greater = [k for k,v in counts_of_nonzeros if v > 7]"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": 17,
459 | "metadata": {
460 | "collapsed": false
461 | },
462 | "outputs": [],
463 | "source": [
464 | "l1_penalty_min = max(list_greater)"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 18,
470 | "metadata": {
471 | "collapsed": false
472 | },
473 | "outputs": [
474 | {
475 | "data": {
476 | "text/plain": [
477 | "127.42749857031335"
478 | ]
479 | },
480 | "execution_count": 18,
481 | "metadata": {},
482 | "output_type": "execute_result"
483 | }
484 | ],
485 | "source": [
486 | "l1_penalty_min"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": 19,
492 | "metadata": {
493 | "collapsed": false
494 | },
495 | "outputs": [
496 | {
497 | "data": {
498 | "text/plain": [
499 | "263.66508987303581"
500 | ]
501 | },
502 | "execution_count": 19,
503 | "metadata": {},
504 | "output_type": "execute_result"
505 | }
506 | ],
507 | "source": [
508 | "list_less = [m for m,n in counts_of_nonzeros if n < 7]\n",
509 | "l1_penalty_max = min(list_less)\n",
510 | "l1_penalty_max"
511 | ]
512 | },
513 | {
514 | "cell_type": "markdown",
515 | "metadata": {},
516 | "source": [
517 | "Exploring narrower range of l1_penalty\n",
518 | "\n",
519 | "We now explore the region of l1_penalty we found: between ‘l1_penalty_min’ and ‘l1_penalty_max’. We look for the L1 penalty in this range that produces exactly the right number of nonzeros and also minimizes RSS on the VALIDATION set.\n",
520 | "\n",
521 | "For l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):\n",
522 | "\n",
523 | "Fit a regression model with a given l1_penalty on TRAIN data. As before, use \"alpha=l1_penalty\" and \"normalize=True\".\n",
524 | "Measure the RSS of the learned model on the VALIDATION set\n",
525 | "Find the model that the lowest RSS on the VALIDATION set and has sparsity equal to ‘max_nonzeros’. (Again, take account of the intercept when counting the number of nonzeros.)"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": 20,
531 | "metadata": {
532 | "collapsed": false
533 | },
534 | "outputs": [],
535 | "source": [
536 | "counts_of_nonzeros_2 = []\n",
537 | "RSS_list = []\n",
538 | "for l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):\n",
539 | " train_model_3 = linear_model.Lasso(l1_penalty, normalize=True) # set parameters\n",
540 | " train_model_3.fit(training[all_features], training['price']) # learn weights\n",
541 | " RSS = np.sum((train_model_3.predict(validation[all_features])-validation['price'])**2)\n",
542 | " RSS_list.append((l1_penalty,RSS))\n",
543 | " if train_model_3.intercept_ <> 0:\n",
544 | " counts = np.count_nonzero(train_model_3.coef_) + np.count_nonzero(train_model_3.intercept_)\n",
545 | " counts_of_nonzeros_2.append((l1_penalty,counts))\n",
546 | " elif train_model_3.intercept_==0:\n",
547 | " counts = np.count_nonzero(train_model_3.coef_) + 1\n",
548 | " counts_of_nonzeros_2.append((l1_penalty,counts))\n",
549 | " "
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": 21,
555 | "metadata": {
556 | "collapsed": false
557 | },
558 | "outputs": [
559 | {
560 | "data": {
561 | "text/plain": [
562 | "[156.10909673930755,\n",
563 | " 163.27949628155611,\n",
564 | " 170.44989582380464,\n",
565 | " 177.6202953660532,\n",
566 | " 184.79069490830176,\n",
567 | " 191.96109445055032,\n",
568 | " 199.13149399279888]"
569 | ]
570 | },
571 | "execution_count": 21,
572 | "metadata": {},
573 | "output_type": "execute_result"
574 | }
575 | ],
576 | "source": [
577 | "max_7_list = [j for j,p in counts_of_nonzeros_2 if p == 7] \n",
578 | "max_7_list"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": 22,
584 | "metadata": {
585 | "collapsed": false
586 | },
587 | "outputs": [],
588 | "source": [
589 | "RSS_list_2 = [(y,x) for x,y in RSS_list if x in max_7_list]"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": 23,
595 | "metadata": {
596 | "collapsed": false
597 | },
598 | "outputs": [],
599 | "source": [
600 | "RSS_list_2.sort()"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 24,
606 | "metadata": {
607 | "collapsed": false
608 | },
609 | "outputs": [
610 | {
611 | "data": {
612 | "text/plain": [
613 | "[(440037365263316.94, 156.10909673930755),\n",
614 | " (440777489641605.56, 163.27949628155611),\n",
615 | " (441566698090138.94, 170.44989582380464),\n",
616 | " (442406413188665.06, 177.6202953660532),\n",
617 | " (443296716874313.06, 184.79069490830176),\n",
618 | " (444239780526141.2, 191.96109445055032),\n",
619 | " (445230739842613.8, 199.13149399279888)]"
620 | ]
621 | },
622 | "execution_count": 24,
623 | "metadata": {},
624 | "output_type": "execute_result"
625 | }
626 | ],
627 | "source": [
628 | "RSS_list_2"
629 | ]
630 | },
631 | {
632 | "cell_type": "code",
633 | "execution_count": 25,
634 | "metadata": {
635 | "collapsed": false
636 | },
637 | "outputs": [
638 | {
639 | "data": {
640 | "text/plain": [
641 | "Lasso(alpha=156.10909673930755, copy_X=True, fit_intercept=True,\n",
642 | " max_iter=1000, normalize=True, positive=False, precompute=False,\n",
643 | " random_state=None, selection='cyclic', tol=0.0001, warm_start=False)"
644 | ]
645 | },
646 | "execution_count": 25,
647 | "metadata": {},
648 | "output_type": "execute_result"
649 | }
650 | ],
651 | "source": [
652 | "train_model_4 = linear_model.Lasso(max_7_list[0], normalize=True) # set parameters\n",
653 | "train_model_4.fit(training[all_features], training['price']) # learn weights"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 26,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
665 | "text/plain": [
666 | "bedrooms -0.000000\n",
667 | "bedrooms_square -0.000000\n",
668 | "bathrooms 10610.890284\n",
669 | "sqft_living 163.380252\n",
670 | "sqft_living_sqrt 0.000000\n",
671 | "sqft_lot -0.000000\n",
672 | "sqft_lot_sqrt -0.000000\n",
673 | "floors 0.000000\n",
674 | "floors_square 0.000000\n",
675 | "waterfront 506451.687115\n",
676 | "view 41960.043555\n",
677 | "condition 0.000000\n",
678 | "grade 116253.553700\n",
679 | "sqft_above 0.000000\n",
680 | "sqft_basement 0.000000\n",
681 | "yr_built -2612.234880\n",
682 | "yr_renovated 0.000000\n",
683 | "dtype: float64"
684 | ]
685 | },
686 | "execution_count": 26,
687 | "metadata": {},
688 | "output_type": "execute_result"
689 | }
690 | ],
691 | "source": [
692 | "pd.Series(train_model_4.coef_, index = all_features)"
693 | ]
694 | },
695 | {
696 | "cell_type": "code",
697 | "execution_count": null,
698 | "metadata": {
699 | "collapsed": true
700 | },
701 | "outputs": [],
702 | "source": []
703 | }
704 | ],
705 | "metadata": {
706 | "kernelspec": {
707 | "display_name": "Python 2",
708 | "language": "python",
709 | "name": "python2"
710 | },
711 | "language_info": {
712 | "codemirror_mode": {
713 | "name": "ipython",
714 | "version": 2
715 | },
716 | "file_extension": ".py",
717 | "mimetype": "text/x-python",
718 | "name": "python",
719 | "nbconvert_exporter": "python",
720 | "pygments_lexer": "ipython2",
721 | "version": "2.7.11"
722 | }
723 | },
724 | "nbformat": 4,
725 | "nbformat_minor": 0
726 | }
727 |
--------------------------------------------------------------------------------
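Note on 'Week 5 - Lasso Regression Assignment 1.ipynb' above: the two-phase penalty search is spread over several cells. The sketch below restates the same selection logic in one function; it assumes the older scikit-learn API used in the notebook (Lasso accepted a normalize=True keyword, which has since been removed in recent releases) and the same training/validation frames, so treat it as a sketch rather than a drop-in replacement.

import numpy as np
from sklearn.linear_model import Lasso

def count_nonzero_weights(model):
    # coefficients plus the intercept, counting the intercept only when it is nonzero
    return np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)

def best_penalty_with_sparsity(alphas, X_train, y_train, X_valid, y_valid, max_nonzeros=7):
    # among penalties giving exactly max_nonzeros nonzero weights,
    # return the one with the lowest validation RSS
    best_alpha, best_rss = None, None
    for alpha in alphas:
        model = Lasso(alpha=alpha, normalize=True)  # 'normalize' assumes the old scikit-learn used above
        model.fit(X_train, y_train)
        if count_nonzero_weights(model) != max_nonzeros:
            continue
        rss = np.sum((model.predict(X_valid) - y_valid) ** 2)
        if best_rss is None or rss < best_rss:
            best_alpha, best_rss = alpha, rss
    return best_alpha, best_rss

Called with np.linspace(l1_penalty_min, l1_penalty_max, 20) and the training/validation feature frames, this reproduces the choice the notebook arrives at by hand (alpha of about 156.1, validation RSS of about 4.40e14 in the recorded output).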
/Machine_Learning_Regression/week 2 - multiple regression - Assignment 2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 22,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 23,
17 | "metadata": {
18 | "collapsed": true
19 | },
20 | "outputs": [],
21 | "source": [
22 | "import numpy as np"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 24,
28 | "metadata": {
29 | "collapsed": true
30 | },
31 | "outputs": [],
32 | "source": [
33 | "from math import sqrt"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 25,
39 | "metadata": {
40 | "collapsed": false
41 | },
42 | "outputs": [
43 | {
44 | "data": {
45 | "text/html": [
46 | "\n",
47 | "
\n",
48 | " \n",
49 | " \n",
50 | " | \n",
51 | " id | \n",
52 | " date | \n",
53 | " price | \n",
54 | " bedrooms | \n",
55 | " bathrooms | \n",
56 | " sqft_living | \n",
57 | " sqft_lot | \n",
58 | " floors | \n",
59 | " waterfront | \n",
60 | " view | \n",
61 | " ... | \n",
62 | " grade | \n",
63 | " sqft_above | \n",
64 | " sqft_basement | \n",
65 | " yr_built | \n",
66 | " yr_renovated | \n",
67 | " zipcode | \n",
68 | " lat | \n",
69 | " long | \n",
70 | " sqft_living15 | \n",
71 | " sqft_lot15 | \n",
72 | "
\n",
73 | " \n",
74 | " \n",
75 | " \n",
76 | " 0 | \n",
77 | " 7129300520 | \n",
78 | " 20141013T000000 | \n",
79 | " 221900 | \n",
80 | " 3 | \n",
81 | " 1.00 | \n",
82 | " 1180 | \n",
83 | " 5650 | \n",
84 | " 1 | \n",
85 | " 0 | \n",
86 | " 0 | \n",
87 | " ... | \n",
88 | " 7 | \n",
89 | " 1180 | \n",
90 | " 0 | \n",
91 | " 1955 | \n",
92 | " 0 | \n",
93 | " 98178 | \n",
94 | " 47.5112 | \n",
95 | " -122.257 | \n",
96 | " 1340 | \n",
97 | " 5650 | \n",
98 | "
\n",
99 | " \n",
100 | " 1 | \n",
101 | " 6414100192 | \n",
102 | " 20141209T000000 | \n",
103 | " 538000 | \n",
104 | " 3 | \n",
105 | " 2.25 | \n",
106 | " 2570 | \n",
107 | " 7242 | \n",
108 | " 2 | \n",
109 | " 0 | \n",
110 | " 0 | \n",
111 | " ... | \n",
112 | " 7 | \n",
113 | " 2170 | \n",
114 | " 400 | \n",
115 | " 1951 | \n",
116 | " 1991 | \n",
117 | " 98125 | \n",
118 | " 47.7210 | \n",
119 | " -122.319 | \n",
120 | " 1690 | \n",
121 | " 7639 | \n",
122 | "
\n",
123 | " \n",
124 | " 2 | \n",
125 | " 5631500400 | \n",
126 | " 20150225T000000 | \n",
127 | " 180000 | \n",
128 | " 2 | \n",
129 | " 1.00 | \n",
130 | " 770 | \n",
131 | " 10000 | \n",
132 | " 1 | \n",
133 | " 0 | \n",
134 | " 0 | \n",
135 | " ... | \n",
136 | " 6 | \n",
137 | " 770 | \n",
138 | " 0 | \n",
139 | " 1933 | \n",
140 | " 0 | \n",
141 | " 98028 | \n",
142 | " 47.7379 | \n",
143 | " -122.233 | \n",
144 | " 2720 | \n",
145 | " 8062 | \n",
146 | "
\n",
147 | " \n",
148 | " 3 | \n",
149 | " 2487200875 | \n",
150 | " 20141209T000000 | \n",
151 | " 604000 | \n",
152 | " 4 | \n",
153 | " 3.00 | \n",
154 | " 1960 | \n",
155 | " 5000 | \n",
156 | " 1 | \n",
157 | " 0 | \n",
158 | " 0 | \n",
159 | " ... | \n",
160 | " 7 | \n",
161 | " 1050 | \n",
162 | " 910 | \n",
163 | " 1965 | \n",
164 | " 0 | \n",
165 | " 98136 | \n",
166 | " 47.5208 | \n",
167 | " -122.393 | \n",
168 | " 1360 | \n",
169 | " 5000 | \n",
170 | "
\n",
171 | " \n",
172 | " 4 | \n",
173 | " 1954400510 | \n",
174 | " 20150218T000000 | \n",
175 | " 510000 | \n",
176 | " 3 | \n",
177 | " 2.00 | \n",
178 | " 1680 | \n",
179 | " 8080 | \n",
180 | " 1 | \n",
181 | " 0 | \n",
182 | " 0 | \n",
183 | " ... | \n",
184 | " 8 | \n",
185 | " 1680 | \n",
186 | " 0 | \n",
187 | " 1987 | \n",
188 | " 0 | \n",
189 | " 98074 | \n",
190 | " 47.6168 | \n",
191 | " -122.045 | \n",
192 | " 1800 | \n",
193 | " 7503 | \n",
194 | "
\n",
195 | " \n",
196 | "
\n",
197 | "
5 rows × 21 columns
\n",
198 | "
"
199 | ],
200 | "text/plain": [
201 | " id date price bedrooms bathrooms sqft_living \\\n",
202 | "0 7129300520 20141013T000000 221900 3 1.00 1180 \n",
203 | "1 6414100192 20141209T000000 538000 3 2.25 2570 \n",
204 | "2 5631500400 20150225T000000 180000 2 1.00 770 \n",
205 | "3 2487200875 20141209T000000 604000 4 3.00 1960 \n",
206 | "4 1954400510 20150218T000000 510000 3 2.00 1680 \n",
207 | "\n",
208 | " sqft_lot floors waterfront view ... grade sqft_above \\\n",
209 | "0 5650 1 0 0 ... 7 1180 \n",
210 | "1 7242 2 0 0 ... 7 2170 \n",
211 | "2 10000 1 0 0 ... 6 770 \n",
212 | "3 5000 1 0 0 ... 7 1050 \n",
213 | "4 8080 1 0 0 ... 8 1680 \n",
214 | "\n",
215 | " sqft_basement yr_built yr_renovated zipcode lat long \\\n",
216 | "0 0 1955 0 98178 47.5112 -122.257 \n",
217 | "1 400 1951 1991 98125 47.7210 -122.319 \n",
218 | "2 0 1933 0 98028 47.7379 -122.233 \n",
219 | "3 910 1965 0 98136 47.5208 -122.393 \n",
220 | "4 0 1987 0 98074 47.6168 -122.045 \n",
221 | "\n",
222 | " sqft_living15 sqft_lot15 \n",
223 | "0 1340 5650 \n",
224 | "1 1690 7639 \n",
225 | "2 2720 8062 \n",
226 | "3 1360 5000 \n",
227 | "4 1800 7503 \n",
228 | "\n",
229 | "[5 rows x 21 columns]"
230 | ]
231 | },
232 | "execution_count": 25,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "house_data = pd.read_csv('kc_house_data.csv')\n",
239 | "house_data.head()"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 26,
245 | "metadata": {
246 | "collapsed": true
247 | },
248 | "outputs": [],
249 | "source": [
250 | "train_data = pd.read_csv('kc_house_train_data.csv')"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": 27,
256 | "metadata": {
257 | "collapsed": true
258 | },
259 | "outputs": [],
260 | "source": [
261 | "test_data = pd.read_csv('kc_house_test_data.csv')"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {
267 | "collapsed": true
268 | },
269 | "source": [
270 | "Next write a function that takes a data set, a list of features (e.g. [‘sqft_living’, ‘bedrooms’]), to be used as inputs, and a name of the output (e.g. ‘price’). This function should return a features_matrix (2D array) consisting of first a column of ones followed by columns containing the values of the input features in the data set in the same order as the input list. It should also return an output_array which is an array of the values of the output in the data set (e.g. ‘price’)."
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 28,
276 | "metadata": {
277 | "collapsed": true
278 | },
279 | "outputs": [],
280 | "source": [
281 | "def get_numpy_data(data, features, output):\n",
282 | " data['constant'] = 1 # add a constant column to a dataframe\n",
283 | " # prepend variable 'constant' to the features list\n",
284 | " features = ['constant'] + features\n",
285 | " # select the columns of dataframe given by the ‘features’ list into the SFrame ‘features_sframe’\n",
286 | "\n",
287 | " # this will convert the features_sframe into a numpy matrix with GraphLab Create >= 1.7!!\n",
288 | " features_matrix = data[features].as_matrix(columns=None)\n",
289 | " # assign the column of data_sframe associated with the target to the variable ‘output_sarray’\n",
290 | "\n",
291 | " # this will convert the SArray into a numpy array:\n",
292 | " output_array = data[output].as_matrix(columns=None) # GraphLab Create>= 1.7!!\n",
293 | " return(features_matrix, output_array)"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "If the features matrix (including a column of 1s for the constant) is stored as a 2D array (or matrix) and the regression weights are stored as a 1D array then the predicted output is just the dot product between the features matrix and the weights (with the weights on the right). Write a function ‘predict_output’ which accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’. e.g. in python:"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 29,
306 | "metadata": {
307 | "collapsed": true
308 | },
309 | "outputs": [],
310 | "source": [
311 | "def predict_outcome(feature_matrix, weights):\n",
312 | " predictions = np.dot(feature_matrix, weights)\n",
313 | " return(predictions)"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "If we have a the values of a single input feature in an array ‘feature’ and the prediction ‘errors’ (predictions - output) then the derivative of the regression cost function with respect to the weight of ‘feature’ is just twice the dot product between ‘feature’ and ‘errors’. Write a function that accepts a ‘feature’ array and ‘error’ array and returns the ‘derivative’ (a single number). e.g. in python:"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 30,
326 | "metadata": {
327 | "collapsed": true
328 | },
329 | "outputs": [],
330 | "source": [
331 | "def feature_derivative(errors, feature):\n",
332 | " errors = predictions - output\n",
333 | " derivative = 2* np.dot(errors,feature)\n",
334 | " return(derivative)"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "Now we will use our predict_output and feature_derivative to write a gradient descent function. Although we can compute the derivative for all the features simultaneously (the gradient) we will explicitly loop over the features individually for simplicity. Write a gradient descent function that does the following:\n",
342 | "\n",
343 | "Accepts a numpy feature_matrix 2D array, a 1D output array, an array of initial weights, a step size and a convergence tolerance.\n",
344 | "While not converged updates each feature weight by subtracting the step size times the derivative for that feature given the current weights\n",
345 | "At each step computes the magnitude/length of the gradient (square root of the sum of squared components)\n",
346 | "When the magnitude of the gradient is smaller than the input tolerance returns the final weight vector."
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": 31,
352 | "metadata": {
353 | "collapsed": true
354 | },
355 | "outputs": [],
356 | "source": [
357 | "def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):\n",
358 | " converged = False\n",
359 | " weights = np.array(initial_weights)\n",
360 | " while not converged:\n",
361 | " # compute the predictions based on feature_matrix and weights:\n",
362 | " predictions = np.dot(feature_matrix, weights)\n",
363 | " # compute the errors as predictions - output:\n",
364 | " errors = predictions - output\n",
365 | " gradient_sum_squares = 0 # initialize the gradient\n",
366 | " # while not converged, update each weight individually:\n",
367 | " for i in range(len(weights)):\n",
368 | " # Recall that feature_matrix[:, i] is the feature column associated with weights[i]\n",
369 | " # compute the derivative for weight[i]:\n",
370 | " derivative = 2* np.dot(errors, feature_matrix[:,i])\n",
371 | " # add the squared derivative to the gradient magnitude\n",
372 | " gradient_sum_squares += derivative**2\n",
373 | " # update the weight based on step size and derivative:\n",
374 | " weights[i] = weights[i] - step_size * derivative\n",
375 | " gradient_magnitude = sqrt(gradient_sum_squares)\n",
376 | " if gradient_magnitude < tolerance:\n",
377 | " converged = True\n",
378 | " return(weights)"
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "Now we will run the regression_gradient_descent function on some actual data. In particular we will use the gradient descent to estimate the model from Week 1 using just an intercept and slope. Use the following parameters:\n",
386 | "\n",
387 | "features: ‘sqft_living’\n",
388 | "output: ‘price’\n",
389 | "initial weights: -47000, 1 (intercept, sqft_living respectively)\n",
390 | "step_size = 7e-12\n",
391 | "tolerance = 2.5e7"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 32,
397 | "metadata": {
398 | "collapsed": false
399 | },
400 | "outputs": [],
401 | "source": [
402 | "simple_features = ['sqft_living']\n",
403 | "my_output= 'price'\n",
404 | "(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)\n",
405 | "initial_weights = np.array([-47000., 1.])\n",
406 | "step_size = 7e-12\n",
407 | "tolerance = 2.5e7"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "Use these parameters to estimate the slope and intercept for predicting prices based only on ‘sqft_living’."
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 33,
420 | "metadata": {
421 | "collapsed": false
422 | },
423 | "outputs": [],
424 | "source": [
425 | "simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights, step_size, tolerance)"
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {},
431 | "source": [
432 | "Quiz Question: What is the value of the weight for sqft_living -- the second element of ‘simple_weights’ (rounded to 1 decimal place)?"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": 34,
438 | "metadata": {
439 | "collapsed": false
440 | },
441 | "outputs": [
442 | {
443 | "data": {
444 | "text/plain": [
445 | "array([-46999.88716555, 281.91211918])"
446 | ]
447 | },
448 | "execution_count": 34,
449 | "metadata": {},
450 | "output_type": "execute_result"
451 | }
452 | ],
453 | "source": [
454 | "simple_weights"
455 | ]
456 | },
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {},
460 | "source": [
461 | "Now build a corresponding ‘test_simple_feature_matrix’ and ‘test_output’ using test_data. Using ‘test_simple_feature_matrix’ and ‘simple_weights’ compute the predicted house prices on all the test data."
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 35,
467 | "metadata": {
468 | "collapsed": true
469 | },
470 | "outputs": [],
471 | "source": [
472 | "(test_simple_feature_matrix, test_output) = get_numpy_data(test_data,simple_features,my_output)"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": 36,
478 | "metadata": {
479 | "collapsed": true
480 | },
481 | "outputs": [],
482 | "source": [
483 | "predicted_house_prices = predict_outcome(test_simple_feature_matrix, simple_weights)"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "Quiz Question: What is the predicted price for the 1st house in the Test data set for model 1 (round to nearest dollar)?"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": 40,
496 | "metadata": {
497 | "collapsed": false
498 | },
499 | "outputs": [
500 | {
501 | "data": {
502 | "text/plain": [
503 | "356134.44325500238"
504 | ]
505 | },
506 | "execution_count": 40,
507 | "metadata": {},
508 | "output_type": "execute_result"
509 | }
510 | ],
511 | "source": [
512 | "predicted_house_prices[0]"
513 | ]
514 | },
515 | {
516 | "cell_type": "markdown",
517 | "metadata": {},
518 | "source": [
519 | "Now compute RSS on all test data for this model. Record the value and store it for later"
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": 45,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "275400044902128.31"
533 | ]
534 | },
535 | "execution_count": 45,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "RSS_model1 = np.sum((predicted_house_prices - test_output)**2)\n",
542 | "RSS_model1"
543 | ]
544 | },
545 | {
546 | "cell_type": "markdown",
547 | "metadata": {},
548 | "source": [
549 | "Now we will use the gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": 48,
555 | "metadata": {
556 | "collapsed": false
557 | },
558 | "outputs": [],
559 | "source": [
560 | "model_features = ['sqft_living', 'sqft_living15']\n",
561 | "my_output = 'price'\n",
562 | "(feature_matrix, output) = get_numpy_data(train_data, model_features,my_output)\n",
563 | "initial_weights = np.array([-100000., 1., 1.])\n",
564 | "step_size = 4e-12\n",
565 | "tolerance = 1e9"
566 | ]
567 | },
568 | {
569 | "cell_type": "markdown",
570 | "metadata": {},
571 | "source": [
572 | "Note that sqft_living_15 is the average square feet of the nearest 15 neighbouring houses.\n",
573 | "\n",
574 | "Run gradient descent on a model with ‘sqft_living’ and ‘sqft_living_15’ as well as an intercept with the above parameters. Save the resulting regression weights.\n"
575 | ]
576 | },
577 | {
578 | "cell_type": "code",
579 | "execution_count": 49,
580 | "metadata": {
581 | "collapsed": true
582 | },
583 | "outputs": [],
584 | "source": [
585 | "regression_weights = regression_gradient_descent(feature_matrix, output,initial_weights, step_size, tolerance)"
586 | ]
587 | },
588 | {
589 | "cell_type": "code",
590 | "execution_count": 50,
591 | "metadata": {
592 | "collapsed": false
593 | },
594 | "outputs": [
595 | {
596 | "data": {
597 | "text/plain": [
598 | "array([ -9.99999688e+04, 2.45072603e+02, 6.52795267e+01])"
599 | ]
600 | },
601 | "execution_count": 50,
602 | "metadata": {},
603 | "output_type": "execute_result"
604 | }
605 | ],
606 | "source": [
607 | "regression_weights"
608 | ]
609 | },
610 | {
611 | "cell_type": "markdown",
612 | "metadata": {},
613 | "source": [
614 | "Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": 53,
620 | "metadata": {
621 | "collapsed": true
622 | },
623 | "outputs": [],
624 | "source": [
625 | "(test_feature_matrix, test_output) = get_numpy_data(test_data,model_features,my_output)"
626 | ]
627 | },
628 | {
629 | "cell_type": "code",
630 | "execution_count": 54,
631 | "metadata": {
632 | "collapsed": true
633 | },
634 | "outputs": [],
635 | "source": [
636 | "predicted_house_prices_model2 = predict_outcome(test_feature_matrix, regression_weights)"
637 | ]
638 | },
639 | {
640 | "cell_type": "code",
641 | "execution_count": 55,
642 | "metadata": {
643 | "collapsed": false
644 | },
645 | "outputs": [
646 | {
647 | "data": {
648 | "text/plain": [
649 | "366651.41162949387"
650 | ]
651 | },
652 | "execution_count": 55,
653 | "metadata": {},
654 | "output_type": "execute_result"
655 | }
656 | ],
657 | "source": [
658 | "predicted_house_prices_model2[0]"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
665 | "What is the actual price for the 1st house in the Test data set"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 57,
671 | "metadata": {
672 | "collapsed": false
673 | },
674 | "outputs": [
675 | {
676 | "data": {
677 | "text/plain": [
678 | "310000.0"
679 | ]
680 | },
681 | "execution_count": 57,
682 | "metadata": {},
683 | "output_type": "execute_result"
684 | }
685 | ],
686 | "source": [
687 | "test_data['price'][0]"
688 | ]
689 | },
690 | {
691 | "cell_type": "markdown",
692 | "metadata": {},
693 | "source": [
694 | "Quiz Question: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?"
695 | ]
696 | },
697 | {
698 | "cell_type": "markdown",
699 | "metadata": {},
700 | "source": [
701 | "Now compute RSS on all test data for the second model. Record the value and store it for later."
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": 58,
707 | "metadata": {
708 | "collapsed": false
709 | },
710 | "outputs": [
711 | {
712 | "data": {
713 | "text/plain": [
714 | "270263443629803.56"
715 | ]
716 | },
717 | "execution_count": 58,
718 | "metadata": {},
719 | "output_type": "execute_result"
720 | }
721 | ],
722 | "source": [
723 | "RSS_model2 = np.sum((predicted_house_prices_model2 - test_output)**2)\n",
724 | "RSS_model2"
725 | ]
726 | },
727 | {
728 | "cell_type": "markdown",
729 | "metadata": {},
730 | "source": [
731 | "Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?"
732 | ]
733 | },
734 | {
735 | "cell_type": "code",
736 | "execution_count": 59,
737 | "metadata": {
738 | "collapsed": false
739 | },
740 | "outputs": [
741 | {
742 | "data": {
743 | "text/plain": [
744 | "True"
745 | ]
746 | },
747 | "execution_count": 59,
748 | "metadata": {},
749 | "output_type": "execute_result"
750 | }
751 | ],
752 | "source": [
753 | "RSS_model1 > RSS_model2"
754 | ]
755 | },
756 | {
757 | "cell_type": "code",
758 | "execution_count": null,
759 | "metadata": {
760 | "collapsed": true
761 | },
762 | "outputs": [],
763 | "source": []
764 | }
765 | ],
766 | "metadata": {
767 | "kernelspec": {
768 | "display_name": "Python 2",
769 | "language": "python",
770 | "name": "python2"
771 | },
772 | "language_info": {
773 | "codemirror_mode": {
774 | "name": "ipython",
775 | "version": 2
776 | },
777 | "file_extension": ".py",
778 | "mimetype": "text/x-python",
779 | "name": "python",
780 | "nbconvert_exporter": "python",
781 | "pygments_lexer": "ipython2",
782 | "version": "2.7.11"
783 | }
784 | },
785 | "nbformat": 4,
786 | "nbformat_minor": 0
787 | }
788 |
--------------------------------------------------------------------------------
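Summary of the update rule implemented by regression_gradient_descent in the notebook above (nothing new, just the formula the loop encodes). With predictions \hat{y} = Xw and errors e = \hat{y} - y, the partial derivative computed for each feature column x_j is

    \frac{\partial \mathrm{RSS}}{\partial w_j} = 2 \sum_i (\hat{y}_i - y_i)\, x_{ij} = 2\,(e \cdot x_j),

each pass updates

    w_j \leftarrow w_j - \eta \, \frac{\partial \mathrm{RSS}}{\partial w_j} \qquad (\eta \text{ is the step\_size argument}),

and the loop stops once the gradient magnitude \big( \sum_j (\partial \mathrm{RSS} / \partial w_j)^2 \big)^{1/2} falls below the tolerance argument.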
/Machine_Learning_Regression/week1_kc_house_linear regression_assignment1.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun May 22 14:49:24 2016
4 |
5 | @author: April
6 | """
7 |
8 | import pandas as pd
9 | import numpy as np
10 |
11 | house_data = pd.read_csv("{filepath}/kc_house_data.csv",dtype = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int})
12 | train_data = pd.read_csv("{filepath}kc_house_train_data.csv",dtype = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int})
13 | test_data = pd.read_csv("{filepath}kc_house_test_data.csv",dtype = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int})
14 |
15 |
16 | # use the closed form solution from lecture to calculate the slope and intercept
17 | def simple_linear_regression(input_feature,output):
18 | numerator = (input_feature * output).mean(axis=0) - (output.mean(axis=0))*(input_feature.mean(axis=0))
19 | denominator = (input_feature**2).mean(axis=0) - input_feature.mean(axis=0) * input_feature.mean(axis=0)
20 | slope = numerator/denominator
21 | intercept = output.mean(axis=0) - slope * (input_feature.mean(axis=0))
22 | return (intercept, slope)
23 |
24 |
25 | sqft_living = train_data['sqft_living']
26 | sqft_living_list = [i for i in train_data['sqft_living']]
27 | sqft_living_array = np.array(sqft_living_list)
28 |
29 |
30 | price_list = [m for m in train_data['price']]
31 | price_list_array = np.array(price_list)
32 |
33 |
34 | intercept_train,slope_train = simple_linear_regression(sqft_living_array, price_list_array)
35 |
36 |
37 | def get_regression_predictions(input_feature, intercept, slope):
38 | predicted_output = intercept + input_feature * slope
39 | return(predicted_output)
40 |
41 | # use the function to calculate the estimated slope and intercept on the training data to predict 'price' given 'sqft_living'
42 | input_feature = 2650
43 | print get_regression_predictions(2650, intercept_train, slope_train)
44 |
45 | # What is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data
46 | def get_residual_sum_of_squares(input_feature, output, intercept,slope):
47 | RSS = (((intercept + input_feature*slope) - output)**2).sum(axis=0)
48 | return(RSS)
49 |
50 | print get_residual_sum_of_squares(sqft_living_array,price_list_array,intercept_train,slope_train)
51 |
52 |
53 | # what is the estimated square-feet for a house costing $800,000?
54 | def inverse_regression_predictions(output, intercept, slope):
55 | estimated_input = (output - intercept)/slope
56 | return(estimated_input)
57 |
58 | output = 800000
59 | print inverse_regression_predictions(output,intercept_train,slope_train)
60 |
61 | # Which model (square feet or bedrooms) has lowest RSS on TEST data?
62 | sqft_living_array_test = np.array([a for a in test_data['sqft_living']])
63 | bedrooms_array_test = np.array([b for b in test_data['bedrooms']])
64 | price_array_test = np.array([c for c in test_data['price']])
65 | intercept_sqf,slope_sqf = simple_linear_regression(sqft_living_array_test,price_array_test)
66 | intercept_sqf
67 |
68 | intercept_br, slope_br = simple_linear_regression(bedrooms_array_test,price_array_test)
69 | RSS_sqf = get_residual_sum_of_squares(sqft_living_array_test,price_array_test,intercept_sqf,slope_sqf)
70 | RSS_br = get_residual_sum_of_squares(bedrooms_array_test,price_array_test,intercept_br,slope_br)
71 | print RSS_sqf - RSS_br
72 |
73 |
74 |
--------------------------------------------------------------------------------
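The simple_linear_regression function in the script above is the closed-form least-squares solution for a single feature; written out, the numerator and denominator lines compute

    \text{slope} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad \text{intercept} = \bar{y} - \text{slope}\cdot\bar{x},

where x is the input feature (e.g. sqft_living), y is the output (price), and bars denote sample means over the training data. get_regression_predictions and inverse_regression_predictions are then just the line \hat{y} = \text{intercept} + \text{slope}\cdot x and its inverse x = (y - \text{intercept})/\text{slope}.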
/Machine_Learning_Regression/week5-lasso regression assignment 2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import numpy as np"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 2,
17 | "metadata": {
18 | "collapsed": true
19 | },
20 | "outputs": [],
21 | "source": [
22 | "import pandas as pd"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 3,
28 | "metadata": {
29 | "collapsed": true
30 | },
31 | "outputs": [],
32 | "source": [
33 | "def get_numpy_data(data, features, output):\n",
34 | " data['constant'] = 1 # add a constant column to a dataframe\n",
35 | " # prepend variable 'constant' to the features list\n",
36 | " features = ['constant'] + features\n",
37 | " # select the columns of dataframe given by the ‘features’ list into the SFrame ‘features_sframe’\n",
38 | "\n",
39 | " # this will convert the features_sframe into a numpy matrix with GraphLab Create >= 1.7!!\n",
40 | " features_matrix = data[features].as_matrix(columns=None)\n",
41 | " # assign the column of data_sframe associated with the target to the variable ‘output_sarray’\n",
42 | "\n",
43 | " # this will convert the SArray into a numpy array:\n",
44 | " output_array = data[output].as_matrix(columns=None) \n",
45 | " return(features_matrix, output_array)"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "\n",
53 | "If the features matrix (including a column of 1s for the constant) is stored as a 2D array (or matrix) and the regression weights are stored as a 1D array then the predicted output is just the dot product between the features matrix and the weights (with the weights on the right). Write a function ‘predict_output’ which accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’. e.g. in python:"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 4,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "def predict_outcome(feature_matrix, weights):\n",
65 | " predictions = np.dot(feature_matrix, weights)\n",
66 | " return(predictions)"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "In the house dataset, features vary wildly in their relative magnitude: ‘sqft_living’ is very large overall compared to ‘bedrooms’, for instance. As a result, weight for ‘sqft_living’ would be much smaller than weight for ‘bedrooms’. This is problematic because “small” weights are dropped first as l1_penalty goes up.\n",
74 | "\n"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "To give equal considerations for all features, we need to normalize features as discussed in the lectures: we divide each feature by its 2-norm so that the transformed feature has norm 1."
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Write a short function called ‘normalize_features(feature_matrix)’, which normalizes columns of a given feature matrix. The function should return a pair ‘(normalized_features, norms)’, where the second item contains the norms of original features. As discussed in the lectures, we will use these norms to normalize the test data in the same way as we normalized the training data."
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "from math import sqrt\n",
96 | "def normalize_features(features):\n",
97 | " norms = np.sqrt(np.sum(features**2,axis=0)) \n",
98 | " normalized_features = features/norms \n",
99 | " \n",
100 | " return (normalized_features, norms)"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "Normalized Features Test"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "metadata": {
114 | "collapsed": true
115 | },
116 | "outputs": [],
117 | "source": [
118 | "def normalize_features(features):\n",
119 | " norms = np.sqrt(np.sum(features**2,axis=0))\n",
120 | " normlized_features = features/norms\n",
121 | " return (normlized_features, norms)"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 6,
127 | "metadata": {
128 | "collapsed": false
129 | },
130 | "outputs": [
131 | {
132 | "data": {
133 | "text/plain": [
134 | "(array([[ 0.6, 0.6, 0.6],\n",
135 | " [ 0.8, 0.8, 0.8]]), array([ 5., 10., 15.]))"
136 | ]
137 | },
138 | "execution_count": 6,
139 | "metadata": {},
140 | "output_type": "execute_result"
141 | }
142 | ],
143 | "source": [
144 | "normalize_features(np.array([[3,6,9],[4,8,12]]))"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 7,
150 | "metadata": {
151 | "collapsed": false
152 | },
153 | "outputs": [
154 | {
155 | "data": {
156 | "text/plain": [
157 | "array([ 5., 10., 15.])"
158 | ]
159 | },
160 | "execution_count": 7,
161 | "metadata": {},
162 | "output_type": "execute_result"
163 | }
164 | ],
165 | "source": [
166 | "vec = np.sqrt(np.sum((np.array([[3,6,9],[4,8,12]]))**2,axis=0))\n",
167 | "vec"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 8,
173 | "metadata": {
174 | "collapsed": false
175 | },
176 | "outputs": [
177 | {
178 | "data": {
179 | "text/plain": [
180 | "array([[ 3, 6, 9],\n",
181 | " [ 4, 8, 12]])"
182 | ]
183 | },
184 | "execution_count": 8,
185 | "metadata": {},
186 | "output_type": "execute_result"
187 | }
188 | ],
189 | "source": [
190 | "data = np.array([[3,6,9],[4,8,12]])\n",
191 | "data"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "Review of Coordinate Descent\n",
199 | "We seek to obtain a sparse set of weights by minimizing the LASSO cost function\n",
200 | "SUM[ (prediction - output)^2 ] + lambda*( |w[1]| + ... + |w[k]|).\n",
201 | "The absolute value sign makes the cost function non-differentiable, so simple gradient descent is not viable (you would need to implement a method called subgradient descent). Instead, we will use coordinate descent: at each iteration, we will fix all weights but weight i and find the value of weight i that minimizes the objective. That is, we look for\n",
202 | "argmin_{w[i]} [ SUM[ (prediction - output)^2 ] + lambda*( |w[1]| + ... + |w[k]|) ]\n",
203 | "where all weights other than w[i] are held to be constant. We will optimize one w[i] at a time, circling through the weights multiple times.\n",
204 | "Pick a coordinate i Compute w[i] that minimizes the LASSO cost function Repeat the two steps for all coordinates, multiple times"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | " For this assignment, we use cyclical coordinate descent with normalized features, where we cycle through coordinates 0 to (d-1) in order, and assume the features were normalized as discussed above. The formula for optimizing each coordinate is as follows:\n",
212 | "\n",
213 | "\n",
214 | "\u0001\u0001\n",
215 | "1\n",
216 | "2\n",
217 | "3\n",
218 | " ┌ (ro[i] + lambda/2) if ro[i] < -lambda/2\n",
219 | "w[i] = ├ 0 if -lambda/2 <= ro[i] <= lambda/2\n",
220 | " └ (ro[i] - lambda/2) if ro[i] > lambda/2\n"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "where\n",
228 | "ro[i] = SUM[ [feature_i]*(output - prediction + w[i]*[feature_i]) ]."
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "w[0] = ro[i]"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "9. Consider a simple model with 2 features: ‘sqft_living’ and ‘bedrooms’. The output is ‘price’.\n",
243 | "\n",
244 | "First, run get_numpy_data() (or equivalent) to obtain a feature matrix with 3 columns (constant column added). Use the entire ‘sales’ dataset for now.\n",
245 | "Normalize columns of the feature matrix. Save the norms of original features as ‘norms’.\n",
246 | "Set initial weights to [1,4,1].\n",
247 | "Make predictions with feature matrix and initial weights.\n",
248 | "Compute values of ro[i], where"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 9,
254 | "metadata": {
255 | "collapsed": false
256 | },
257 | "outputs": [],
258 | "source": [
259 | "dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}\n",
260 | "sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)\n",
261 | "train = pd.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)\n",
262 | "test = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 10,
268 | "metadata": {
269 | "collapsed": false
270 | },
271 | "outputs": [],
272 | "source": [
273 | "features = ['sqft_living','bedrooms']\n",
274 | "output = 'price'"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 11,
280 | "metadata": {
281 | "collapsed": false
282 | },
283 | "outputs": [],
284 | "source": [
285 | "features_matrix, output_array = get_numpy_data(sales, features, output)"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 12,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [
295 | {
296 | "data": {
297 | "text/plain": [
298 | "(array([[ 0.00680209, 0.00353021, 0.00583571],\n",
299 | " [ 0.00680209, 0.00768869, 0.00583571],\n",
300 | " [ 0.00680209, 0.00230361, 0.00389048],\n",
301 | " ..., \n",
302 | " [ 0.00680209, 0.00305154, 0.00389048],\n",
303 | " [ 0.00680209, 0.00478673, 0.00583571],\n",
304 | " [ 0.00680209, 0.00305154, 0.00389048]]),\n",
305 | " array([ 1.47013605e+02, 3.34257264e+05, 5.14075870e+02]))"
306 | ]
307 | },
308 | "execution_count": 12,
309 | "metadata": {},
310 | "output_type": "execute_result"
311 | }
312 | ],
313 | "source": [
314 | "simple_features_matrix,norms = normalize_features(features_matrix)\n",
315 | "simple_features_matrix, norms"
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "execution_count": 13,
321 | "metadata": {
322 | "collapsed": true
323 | },
324 | "outputs": [],
325 | "source": [
326 | "weights = [1,4,1]"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": 14,
332 | "metadata": {
333 | "collapsed": false
334 | },
335 | "outputs": [],
336 | "source": [
337 | "prediction = predict_outcome(simple_features_matrix, weights)"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "Compute the values of ro[i] for each feature in this simple model, using the formula given above, using the formula:\n",
345 | "ro[i] = SUM[ [feature_i]*(output - prediction + w[i]*[feature_i]) ]\n",
346 | "Hint: You can get a Numpy vector for feature_i using:\n",
347 | "simple_feature_matrix[:,i]"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": 15,
353 | "metadata": {
354 | "collapsed": false
355 | },
356 | "outputs": [],
357 | "source": [
358 | "ro_1 = np.dot(simple_features_matrix[:,1],(output_array - prediction + weights[1]*simple_features_matrix[:,1]))"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 16,
364 | "metadata": {
365 | "collapsed": true
366 | },
367 | "outputs": [],
368 | "source": [
369 | "ro_2 = np.dot(simple_features_matrix[:,2],(output_array - prediction + weights[2]*simple_features_matrix[:,2]))"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "10. Quiz Question: Recall that, whenever ro[i] falls between -l1_penalty/2 and l1_penalty/2, the corresponding weight w[i] is sent to zero. Now suppose we were to take one step of coordinate descent on either feature 1 or feature 2. What range of values of l1_penalty would not set w[1] zero, but would set w[2] to zero, if we were to take a step in that coordinate?"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 17,
382 | "metadata": {
383 | "collapsed": false
384 | },
385 | "outputs": [
386 | {
387 | "name": "stdout",
388 | "output_type": "stream",
389 | "text": [
390 | "87939470.8233 80966698.6662\n"
391 | ]
392 | }
393 | ],
394 | "source": [
395 | "print ro_1, ro_2"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": 18,
401 | "metadata": {
402 | "collapsed": true
403 | },
404 | "outputs": [],
405 | "source": [
406 | "range_l1penalty = [2*ro_2, 2*ro_1]"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 19,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "[161933397.33247885, 175878941.64650357]"
420 | ]
421 | },
422 | "execution_count": 19,
423 | "metadata": {},
424 | "output_type": "execute_result"
425 | }
426 | ],
427 | "source": [
428 | "range_l1penalty"
429 | ]
430 | },
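The range displayed above follows from the zero-threshold rule: a coordinate step sets w[i] to zero exactly when -l1_penalty/2 <= ro[i] <= l1_penalty/2, i.e. when l1_penalty >= 2*|ro[i]|. A small sketch, using the ro_1 and ro_2 values computed above, for checking candidate penalties:

```python
def step_sets_weight_to_zero(ro_i, l1_penalty):
    # A single coordinate descent step zeroes w[i] iff ro[i] lies in [-l1_penalty/2, l1_penalty/2].
    return -l1_penalty / 2. <= ro_i <= l1_penalty / 2.

# w[2] is zeroed while w[1] survives whenever 2*ro_2 <= l1_penalty < 2*ro_1;
# once l1_penalty >= 2*ro_1, both weights are zeroed.
for penalty in [1.4e8, 1.64e8, 1.73e8, 1.9e8, 2.3e8]:
    print penalty, step_sets_weight_to_zero(ro_1, penalty), step_sets_weight_to_zero(ro_2, penalty)
```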
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "Quiz Question: What range of values of l1_penalty would set both w[1] and w[2] to zero, if we were to take a step in that coordinate?"
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "Single Coordinate Descent Step"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "12. Using the formula above, implement coordinate descent that minimizes the cost function over a single feature i. Note that the intercept (weight 0) is not regularized. The function should accept feature matrix, output, current weights, l1 penalty, and index of feature to optimize over. The function should return new weight for feature i.\n",
450 | "\n",
451 | "e.g. in Python:"
452 | ]
453 | },
454 | {
455 | "cell_type": "code",
456 | "execution_count": 20,
457 | "metadata": {
458 | "collapsed": true
459 | },
460 | "outputs": [],
461 | "source": [
462 | "def lasso_coordinate_descent_step(i, feature_matrix, output, weights, l1_penalty):\n",
463 | " # compute prediction\n",
464 | " prediction = predict_outcome(feature_matrix, weights)\n",
465 | " # compute ro[i] = SUM[ [feature_i]*(output - prediction + weight[i]*[feature_i]) ]\n",
466 | " ro_i = np.dot(feature_matrix[:,i],(output - prediction + weights[i]*feature_matrix[:,i]))\n",
467 | " \n",
468 | " if i == 0: # intercept -- do not regularize\n",
469 | " new_weight_i = ro_i\n",
470 | " elif ro_i < -l1_penalty/2.:\n",
471 | " new_weight_i = ro_i + l1_penalty/2\n",
472 | " elif ro_i > l1_penalty/2.:\n",
473 | " new_weight_i = ro_i - l1_penalty/2\n",
474 | " else:\n",
475 | " new_weight_i = 0.\n",
476 | " \n",
477 | " return new_weight_i"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": 21,
483 | "metadata": {
484 | "collapsed": false
485 | },
486 | "outputs": [
487 | {
488 | "name": "stdout",
489 | "output_type": "stream",
490 | "text": [
491 | "0.425558846691\n"
492 | ]
493 | }
494 | ],
495 | "source": [
496 | "import math\n",
497 | "print lasso_coordinate_descent_step(1, np.array([[3./math.sqrt(13),1./math.sqrt(10)],\n",
498 | " [2./math.sqrt(13),3./math.sqrt(10)]]), np.array([1., 1.]), np.array([1., 4.]), 0.1)"
499 | ]
500 | },
501 | {
502 | "cell_type": "markdown",
503 | "metadata": {},
504 | "source": [
505 | "#Cyclical coordinate descent"
506 | ]
507 | },
508 | {
509 | "cell_type": "markdown",
510 | "metadata": {},
511 | "source": [
512 | "Now that we have a function that optimizes the cost function over a single coordinate, let us implement cyclical coordinate descent where we optimize coordinates 0, 1, ..., (d-1) in order and repeat.\n",
513 | "\n",
514 | "When do we know to stop? Each time we scan all the coordinates (features) once, we measure the change in weight for each coordinate. If no coordinate changes by more than a specified threshold, we stop.\n",
515 | "\n",
516 | "For each iteration:\n",
517 | "\n",
518 | "As you loop over features in order and perform coordinate descent, measure how much each coordinate changes.\n",
519 | "After the loop, if the maximum change across all coordinates is falls below the tolerance, stop. Otherwise, go back to the previous step.\n"
520 | ]
521 | },
522 | {
523 | "cell_type": "markdown",
524 | "metadata": {},
525 | "source": [
526 | "Return weights\n",
527 | "\n",
528 | "The function should accept the following parameters:\n",
529 | "\n",
530 | "Feature matrix\n",
531 | "Output array\n",
532 | "Initial weights\n",
533 | "L1 penalty\n",
534 | "Tolerance\n",
535 | "e.g. in Python:\n",
536 | "\n"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": 22,
542 | "metadata": {
543 | "collapsed": true
544 | },
545 | "outputs": [],
546 | "source": [
547 | "def lasso_cyclical_coordinate_descent(feature_matrix, output, initial_weights, l1_penalty, tolerance):\n",
548 | " max_change = tolerance*2\n",
549 | " weights = initial_weights\n",
550 | " while max_change > tolerance:\n",
551 | " max_change = 0\n",
552 | " for i in range(len(weights)):\n",
553 | " old_weight_i = weights[i]\n",
554 | " weights[i] = lasso_coordinate_descent_step(i, feature_matrix, output, weights, l1_penalty)\n",
555 | " change = np.abs(weights[i] - old_weight_i)\n",
556 | " if change>max_change:\n",
557 | " max_change = change\n",
558 | "# print max_change\n",
559 | " return weights"
560 | ]
561 | },
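One subtlety in the function above: `weights = initial_weights` binds a reference to the caller's numpy array rather than copying it, so the array passed in as initial_weights is overwritten in place. That matters further below, where initialize_weights is not re-created before every run. A defensive variant (a sketch; the notebook itself keeps the original behavior):

```python
def lasso_cyclical_coordinate_descent(feature_matrix, output, initial_weights, l1_penalty, tolerance):
    # Work on a copy so the caller's initial_weights array is left untouched.
    weights = np.array(initial_weights, dtype=float)
    max_change = tolerance * 2
    while max_change > tolerance:
        max_change = 0
        for i in range(len(weights)):
            old_weight_i = weights[i]
            weights[i] = lasso_coordinate_descent_step(i, feature_matrix, output, weights, l1_penalty)
            max_change = max(max_change, abs(weights[i] - old_weight_i))
    return weights
```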
562 | {
563 | "cell_type": "markdown",
564 | "metadata": {},
565 | "source": [
566 | "Using the following parameters, learn the weights on the sales dataset."
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 23,
572 | "metadata": {
573 | "collapsed": true
574 | },
575 | "outputs": [],
576 | "source": [
577 | "simple_features = ['sqft_living', 'bedrooms']\n",
578 | "my_output = 'price'\n",
579 | "initial_weights = np.zeros(3)\n",
580 | "l1_penalty = 1e7\n",
581 | "tolerance = 1.0"
582 | ]
583 | },
584 | {
585 | "cell_type": "markdown",
586 | "metadata": {},
587 | "source": [
588 | "Quiz Question: What is the RSS of the learned model on the normalized dataset?\n",
589 | "\n"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": 24,
595 | "metadata": {
596 | "collapsed": false
597 | },
598 | "outputs": [],
599 | "source": [
600 | "(simple_feature_matrix, output) = get_numpy_data(sales, simple_features, my_output)\n",
601 | "(normalized_simple_feature_matrix, simple_norms) = normalize_features(simple_feature_matrix)"
602 | ]
603 | },
604 | {
605 | "cell_type": "code",
606 | "execution_count": 25,
607 | "metadata": {
608 | "collapsed": false
609 | },
610 | "outputs": [],
611 | "source": [
612 | "new_weights = lasso_cyclical_coordinate_descent(normalized_simple_feature_matrix, output,\n",
613 | " initial_weights, l1_penalty, tolerance)"
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": 26,
619 | "metadata": {
620 | "collapsed": false
621 | },
622 | "outputs": [
623 | {
624 | "name": "stdout",
625 | "output_type": "stream",
626 | "text": [
627 | "[ 21624997.95951911 63157247.20788951 0. ]\n"
628 | ]
629 | }
630 | ],
631 | "source": [
632 | "print new_weights"
633 | ]
634 | },
635 | {
636 | "cell_type": "code",
637 | "execution_count": 27,
638 | "metadata": {
639 | "collapsed": false
640 | },
641 | "outputs": [
642 | {
643 | "name": "stdout",
644 | "output_type": "stream",
645 | "text": [
646 | "1.63049247672e+15\n"
647 | ]
648 | }
649 | ],
650 | "source": [
651 | "print np.sum((output-predict_outcome(normalized_simple_feature_matrix,new_weights))**2)"
652 | ]
653 | },
654 | {
655 | "cell_type": "markdown",
656 | "metadata": {},
657 | "source": [
658 | "#Evaluating LASSO fit with more features"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
665 | "Let us split the sales dataset into training and test sets. If you are using GraphLab Create, call ‘random_split’ with .8 ratio and seed=0. Otherwise, please down the corresponding csv files from the downloads section.\n",
666 | "\n"
667 | ]
668 | },
669 | {
670 | "cell_type": "markdown",
671 | "metadata": {},
672 | "source": [
673 | "Create a normalized feature matrix from the TRAINING data with the following set of features.\n",
674 | "\n",
675 | "bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated\n",
676 | "Make sure you store the norms for the normalization, since we’ll use them later.\n",
677 | "\n"
678 | ]
679 | },
680 | {
681 | "cell_type": "code",
682 | "execution_count": 28,
683 | "metadata": {
684 | "collapsed": false
685 | },
686 | "outputs": [],
687 | "source": [
688 | "train_features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','condition','grade','sqft_above','sqft_basement','yr_built','yr_renovated']"
689 | ]
690 | },
691 | {
692 | "cell_type": "code",
693 | "execution_count": 29,
694 | "metadata": {
695 | "collapsed": false
696 | },
697 | "outputs": [],
698 | "source": [
699 | "output = 'price'"
700 | ]
701 | },
702 | {
703 | "cell_type": "code",
704 | "execution_count": 30,
705 | "metadata": {
706 | "collapsed": true
707 | },
708 | "outputs": [],
709 | "source": [
710 | "train_feature_matrix, output_array = get_numpy_data(train, train_features, output)"
711 | ]
712 | },
713 | {
714 | "cell_type": "code",
715 | "execution_count": 31,
716 | "metadata": {
717 | "collapsed": false
718 | },
719 | "outputs": [],
720 | "source": [
721 | "train_normalized_features_matrix, normalization = normalize_features(train_feature_matrix)"
722 | ]
723 | },
724 | {
725 | "cell_type": "markdown",
726 | "metadata": {},
727 | "source": [
728 | "First, learn the weights with l1_penalty=1e7, on the training data. Initialize weights to all zeros, and set the tolerance=1. Call resulting weights’ weights1e7’, you will need them later."
729 | ]
730 | },
731 | {
732 | "cell_type": "code",
733 | "execution_count": 32,
734 | "metadata": {
735 | "collapsed": true
736 | },
737 | "outputs": [],
738 | "source": [
739 | "l1_penalty = 1e7\n",
740 | "initialize_weights = np.zeros(14)\n",
741 | "tolerance = 1"
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": 33,
747 | "metadata": {
748 | "collapsed": false
749 | },
750 | "outputs": [],
751 | "source": [
752 | "weights1e7 = lasso_cyclical_coordinate_descent(train_normalized_features_matrix, output_array,initialize_weights, l1_penalty, tolerance)"
753 | ]
754 | },
755 | {
756 | "cell_type": "code",
757 | "execution_count": 34,
758 | "metadata": {
759 | "collapsed": false
760 | },
761 | "outputs": [
762 | {
763 | "data": {
764 | "text/plain": [
765 | "intercept 23864692.509538\n",
766 | "bedrooms 0.000000\n",
767 | "bathrooms 0.000000\n",
768 | "sqft_living 30495548.132547\n",
769 | "sqft_lot 0.000000\n",
770 | "floors 0.000000\n",
771 | "waterfront 1901633.614756\n",
772 | "view 5705765.016733\n",
773 | "condition 0.000000\n",
774 | "grade 0.000000\n",
775 | "sqft_above 0.000000\n",
776 | "sqft_basement 0.000000\n",
777 | "yr_built 0.000000\n",
778 | "yr_renovated 0.000000\n",
779 | "dtype: float64"
780 | ]
781 | },
782 | "execution_count": 34,
783 | "metadata": {},
784 | "output_type": "execute_result"
785 | }
786 | ],
787 | "source": [
788 | "pd.Series(weights1e7,index=['intercept']+ train_features)"
789 | ]
790 | },
791 | {
792 | "cell_type": "code",
793 | "execution_count": 35,
794 | "metadata": {
795 | "collapsed": false
796 | },
797 | "outputs": [],
798 | "source": [
799 | "l1_penalty = 1e8\n",
800 | "initialize_weights = np.zeros(14)\n",
801 | "tolerance = 1.0"
802 | ]
803 | },
804 | {
805 | "cell_type": "code",
806 | "execution_count": 36,
807 | "metadata": {
808 | "collapsed": true
809 | },
810 | "outputs": [],
811 | "source": [
812 | "weights1e8 = lasso_cyclical_coordinate_descent(train_normalized_features_matrix, output_array,initialize_weights, l1_penalty, tolerance)"
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": 37,
818 | "metadata": {
819 | "collapsed": false
820 | },
821 | "outputs": [
822 | {
823 | "data": {
824 | "text/plain": [
825 | "intercept 53621004.689715\n",
826 | "bedrooms 0.000000\n",
827 | "bathrooms 0.000000\n",
828 | "sqft_living 0.000000\n",
829 | "sqft_lot 0.000000\n",
830 | "floors 0.000000\n",
831 | "waterfront 0.000000\n",
832 | "view 0.000000\n",
833 | "condition 0.000000\n",
834 | "grade 0.000000\n",
835 | "sqft_above 0.000000\n",
836 | "sqft_basement 0.000000\n",
837 | "yr_built 0.000000\n",
838 | "yr_renovated 0.000000\n",
839 | "dtype: float64"
840 | ]
841 | },
842 | "execution_count": 37,
843 | "metadata": {},
844 | "output_type": "execute_result"
845 | }
846 | ],
847 | "source": [
848 | "pd.Series(weights1e8, index=['intercept']+train_features)"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "metadata": {},
854 | "source": [
855 | "Finally, learn the weights with l1_penalty=1e4, on the training data. Initialize weights to all zeros, and set the tolerance=5e5. Call resulting weights ‘weights1e4’, you will need them later. (This case will take quite a bit longer to converge than the others above.)"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 38,
861 | "metadata": {
862 | "collapsed": true
863 | },
864 | "outputs": [],
865 | "source": [
866 | "l1_penalty = 1e4\n",
867 | "tolerance = 5e5\n",
868 | "weights1e4 = lasso_cyclical_coordinate_descent(train_normalized_features_matrix, output_array,initialize_weights, l1_penalty, tolerance)"
869 | ]
870 | },
871 | {
872 | "cell_type": "code",
873 | "execution_count": 39,
874 | "metadata": {
875 | "collapsed": false
876 | },
877 | "outputs": [
878 | {
879 | "data": {
880 | "text/plain": [
881 | "intercept 57481091.133021\n",
882 | "bedrooms -13652628.540224\n",
883 | "bathrooms 12462713.071262\n",
884 | "sqft_living 57942788.373312\n",
885 | "sqft_lot -1475769.694276\n",
886 | "floors -4904547.755466\n",
887 | "waterfront 5349050.186362\n",
888 | "view 5845253.562136\n",
889 | "condition -416038.969813\n",
890 | "grade 2682274.594885\n",
891 | "sqft_above 242649.685551\n",
892 | "sqft_basement -1285549.667681\n",
893 | "yr_built -54779474.227684\n",
894 | "yr_renovated 2167703.066102\n",
895 | "dtype: float64"
896 | ]
897 | },
898 | "execution_count": 39,
899 | "metadata": {},
900 | "output_type": "execute_result"
901 | }
902 | ],
903 | "source": [
904 | "pd.Series(weights1e4, index=['intercept']+train_features)"
905 | ]
906 | },
907 | {
908 | "cell_type": "markdown",
909 | "metadata": {},
910 | "source": [
911 | "#Rescaling learned weights"
912 | ]
913 | },
914 | {
915 | "cell_type": "markdown",
916 | "metadata": {},
917 | "source": [
918 | "Recall that we normalized our feature matrix, before learning the weights. To use these weights on a test set, we must normalize the test data in the same way. Alternatively, we can rescale the learned weights to include the normalization, so we never have to worry about normalizing the test data:\n",
919 | "\n",
920 | "In this case, we must scale the resulting weights so that we can make predictions with original features:\n",
921 | "\n",
922 | "Store the norms of the original features to a vector called ‘norms’:\n"
923 | ]
924 | },
925 | {
926 | "cell_type": "markdown",
927 | "metadata": {},
928 | "source": [
929 | "test_features_matrix, norms = normalize_features(train_feature_matrix)"
930 | ]
931 | },
932 | {
933 | "cell_type": "markdown",
934 | "metadata": {},
935 | "source": [
936 | "Run Lasso on the normalized features and obtain a ‘weights’ vector\n",
937 | "Compute the weights for the original features by performing element-wise division, i.e.\n"
938 | ]
939 | },
940 | {
941 | "cell_type": "markdown",
942 | "metadata": {},
943 | "source": [
944 | "weights_normalized = weights / norms"
945 | ]
946 | },
947 | {
948 | "cell_type": "markdown",
949 | "metadata": {},
950 | "source": [
951 | "Now, we can apply weights_normalized to the test data, without normalizing it!"
952 | ]
953 | },
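A quick sanity check on the rescaling logic, sketched with variables already defined in this notebook: because the normalized matrix equals the original matrix divided column-wise by the norms, predictions made with the rescaled weights on the original features should match predictions made with the learned weights on the normalized features.

```python
# X_normalized = X / norms (column-wise), so X_normalized.dot(w) == X.dot(w / norms).
pred_on_normalized = predict_outcome(normalized_simple_feature_matrix, new_weights)
pred_on_original = predict_outcome(simple_feature_matrix, new_weights / simple_norms)
print np.allclose(pred_on_normalized, pred_on_original)   # expected to print True
```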
954 | {
955 | "cell_type": "code",
956 | "execution_count": 41,
957 | "metadata": {
958 | "collapsed": false
959 | },
960 | "outputs": [
961 | {
962 | "data": {
963 | "text/plain": [
964 | "array([ 57481091.13302054, -13652628.5402242 , 12462713.07126241,\n",
965 | " 57942788.37331215, -1475769.69427564, -4904547.75546551,\n",
966 | " 5349050.18636169, 5845253.56213634, -416038.96981256,\n",
967 | " 2682274.59488508, 242649.68555077, -1285549.66768121,\n",
968 | " -54779474.22768354, 2167703.06610234])"
969 | ]
970 | },
971 | "execution_count": 41,
972 | "metadata": {},
973 | "output_type": "execute_result"
974 | }
975 | ],
976 | "source": [
977 | "weights1e7_normalized = weights1e7 / normalization\n",
978 | "weights1e8_normalized = weights1e8 / normalization\n",
979 | "weights1e4_normalized = weights1e4 / normalization\n",
980 | "weights1e8"
981 | ]
982 | },
983 | {
984 | "cell_type": "code",
985 | "execution_count": 41,
986 | "metadata": {
987 | "collapsed": true
988 | },
989 | "outputs": [],
990 | "source": [
991 | "(test_feature_matrix, test_output) = get_numpy_data(test, train_features, 'price')"
992 | ]
993 | },
994 | {
995 | "cell_type": "code",
996 | "execution_count": 42,
997 | "metadata": {
998 | "collapsed": false
999 | },
1000 | "outputs": [
1001 | {
1002 | "name": "stdout",
1003 | "output_type": "stream",
1004 | "text": [
1005 | "1.29085259902e+14\n"
1006 | ]
1007 | }
1008 | ],
1009 | "source": [
1010 | "print sum((test_output - predict_outcome(test_feature_matrix,weights1e4_normalized) )**2)"
1011 | ]
1012 | },
1013 | {
1014 | "cell_type": "code",
1015 | "execution_count": 43,
1016 | "metadata": {
1017 | "collapsed": false
1018 | },
1019 | "outputs": [
1020 | {
1021 | "name": "stdout",
1022 | "output_type": "stream",
1023 | "text": [
1024 | "1.63103564165e+14\n"
1025 | ]
1026 | }
1027 | ],
1028 | "source": [
1029 | "print sum((test_output - predict_outcome(test_feature_matrix,weights1e7_normalized) )**2)"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": 44,
1035 | "metadata": {
1036 | "collapsed": false
1037 | },
1038 | "outputs": [
1039 | {
1040 | "name": "stdout",
1041 | "output_type": "stream",
1042 | "text": [
1043 | "1.29085259902e+14\n"
1044 | ]
1045 | }
1046 | ],
1047 | "source": [
1048 | "print sum((test_output - predict_outcome(test_feature_matrix,weights1e8_normalized) )**2)"
1049 | ]
1050 | },
1051 | {
1052 | "cell_type": "code",
1053 | "execution_count": null,
1054 | "metadata": {
1055 | "collapsed": true
1056 | },
1057 | "outputs": [],
1058 | "source": []
1059 | }
1060 | ],
1061 | "metadata": {
1062 | "kernelspec": {
1063 | "display_name": "Python 2",
1064 | "language": "python",
1065 | "name": "python2"
1066 | },
1067 | "language_info": {
1068 | "codemirror_mode": {
1069 | "name": "ipython",
1070 | "version": 2
1071 | },
1072 | "file_extension": ".py",
1073 | "mimetype": "text/x-python",
1074 | "name": "python",
1075 | "nbconvert_exporter": "python",
1076 | "pygments_lexer": "ipython2",
1077 | "version": "2.7.11"
1078 | }
1079 | },
1080 | "nbformat": 4,
1081 | "nbformat_minor": 0
1082 | }
1083 |
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-0acPr-kd-trees-1475969153420.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AprilXiaoyanLiu/Machine-Learning-University-of-Washington/2edb9caced2cb4fda576121699ac228d206c6507/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-0acPr-kd-trees-1475969153420.png
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-7cVMj-locality-sensitive-hashing-1475969687954.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AprilXiaoyanLiu/Machine-Learning-University-of-Washington/2edb9caced2cb4fda576121699ac228d206c6507/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-7cVMj-locality-sensitive-hashing-1475969687954.png
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-itzL5-implementing-locality-sensitive-hashing-from-scratch-1475994803333.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AprilXiaoyanLiu/Machine-Learning-University-of-Washington/2edb9caced2cb4fda576121699ac228d206c6507/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-itzL5-implementing-locality-sensitive-hashing-from-scratch-1475994803333.png
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-tHtXY-k-means-1476159235037.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AprilXiaoyanLiu/Machine-Learning-University-of-Washington/2edb9caced2cb4fda576121699ac228d206c6507/Maching_Learning_Clustering-Retrieval/Quiz-Answers/screencapture-coursera-org-learn-ml-clustering-and-retrieval-exam-tHtXY-k-means-1476159235037.png
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/Quiz-Answers/test:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/test:
--------------------------------------------------------------------------------
1 | test
2 |
--------------------------------------------------------------------------------
/Maching_Learning_Clustering-Retrieval/week4_text_em_clustering_programming-assignment-2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Clustering text data with Gaussian mixtures"
8 | ]
9 | },
10 | {
11 | "cell_type": "raw",
12 | "metadata": {},
13 | "source": [
14 | "In a previous assignment, we explored K-means clustering for a high-dimensional Wikipedia dataset. We can also model this data with a mixture of Gaussians, though with increasing dimension we run into several important problems associated with using a full covariance matrix for each component.\n",
15 | "\n",
16 | "In this section, we will use an EM implementation to fit a Gaussian mixture model with diagonal covariances to a subset of the Wikipedia dataset."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## Overview"
24 | ]
25 | },
26 | {
27 | "cell_type": "raw",
28 | "metadata": {},
29 | "source": [
30 | "In a previous assignment, we explored k-means clustering for a high-dimensional Wikipedia dataset. We can also model this data with a mixture of Gaussians, though with increasing dimension we run into two important issues associated with using a full covariance matrix for each component.\n",
31 | "\n",
32 | "Computational cost becomes prohibitive in high dimensions: score calculations have complexity cubic in the number of dimensions M if the Gaussian has a full covariance matrix.\n",
33 | "A model with many parameters require more data: bserve that a full covariance matrix for an M-dimensional Gaussian will have M(M+1)/2 parameters to fit. With the number of parameters growing roughly as the square of the dimension, it may quickly become impossible to find a sufficient amount of data to make good inferences.\n",
34 | "Both of these issues are avoided if we require the covariance matrix of each component to be diagonal, as then it has only M parameters to fit and the score computation decomposes into M univariate score calculations. Recall from the lecture that the M-step for the full covariance is:\n",
35 | "\n",
36 | "Σ^k=1Nsoftk∑Ni=1rik(xi−μ^k)(xi−μ^k)T\n",
37 | "Note that this is a square matrix with M rows and M columns, and the above equation implies that the (v, w) element is computed by\n",
38 | "\n",
39 | "Σ^k,v,w=1Nsoftk∑Ni=1rik(xiv−μ^kv)(xiw−μ^kw)\n",
40 | "When we assume that this is a diagonal matrix, then non-diagonal elements are assumed to be zero and we only need to compute each of the M elements along the diagonal independently using the following equation.\n",
41 | "\n",
42 | "σ^2k,v=Σ^k,v,v=1Nsoftk∑Ni=1rik(xiv−μ^kv)2\n",
43 | "In this section, we will use an EM implementation to fit a Gaussian mixture model with diagonal covariances to a subset of the Wikipedia dataset. The implementation uses the above equation to compute each variance term.\n",
44 | "\n",
45 | "We'll begin by importing the dataset and coming up with a useful representation for each article. After running our algorithm on the data, we will explore the output to see whether we can give a meaningful interpretation to the fitted parameters in our model."
46 | ]
47 | },
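To make the diagonal M-step above concrete, here is a tiny dense numpy sketch of the mean and variance updates for a single cluster k; the data points and responsibilities are made up purely for illustration:

```python
import numpy as np

# Toy setup: 4 data points in 3 dimensions, and soft responsibilities r[i,k] for one cluster k.
x = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [2., 1., 0.],
              [1., 2., 1.]])
r_k = np.array([0.9, 0.1, 0.5, 0.5])
N_soft_k = r_k.sum()                                   # soft count for cluster k

mu_hat_k = (r_k[:, None] * x).sum(axis=0) / N_soft_k                          # weighted mean per dimension
sigma2_hat_k = (r_k[:, None] * (x - mu_hat_k) ** 2).sum(axis=0) / N_soft_k    # diagonal variances only

print mu_hat_k, sigma2_hat_k
```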
48 | {
49 | "cell_type": "code",
50 | "execution_count": 1,
51 | "metadata": {
52 | "collapsed": true
53 | },
54 | "outputs": [],
55 | "source": [
56 | "import pandas as pd \n",
57 | "import numpy as np\n",
58 | "from scipy.sparse import csr_matrix\n",
59 | "from scipy.sparse import spdiags\n",
60 | "from scipy.stats import multivariate_normal\n",
61 | "from copy import deepcopy\n",
62 | "from sklearn.metrics import pairwise_distances\n",
63 | "from sklearn.preprocessing import normalize"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### Load Wikipedia data and extract TF-IDF features"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 2,
76 | "metadata": {
77 | "collapsed": true
78 | },
79 | "outputs": [],
80 | "source": [
81 | "wiki = pd.read_csv('/Users/April/Downloads/people_wiki.csv')"
82 | ]
83 | },
84 | {
85 | "cell_type": "raw",
86 | "metadata": {},
87 | "source": [
88 | "As in the previous assignment, we extract the TF-IDF vector of each document.\n",
89 | "\n",
90 | "For your convenience, we extracted the TF-IDF vectors from the dataset. The vectors are packaged in a sparse matrix, where the i-th row gives the TF-IDF vectors for the i-th document. Each column corresponds to a unique word appearing in the dataset.\n",
91 | "\n"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 3,
97 | "metadata": {
98 | "collapsed": true
99 | },
100 | "outputs": [],
101 | "source": [
102 | "def load_sparse_csr(filename):\n",
103 | " loader = np.load(filename)\n",
104 | " data = loader['data']\n",
105 | " indices = loader['indices']\n",
106 | " indptr = loader['indptr']\n",
107 | " shape = loader['shape']\n",
108 | " \n",
109 | " return csr_matrix( (data, indices, indptr), shape)\n",
110 | "\n",
111 | "\n",
112 | "tf_idf = load_sparse_csr('/Users/April/Downloads/people_wiki_tf_idf.npz')\n",
113 | "import json\n",
114 | "with open('/Users/April/Downloads/people_wiki_map_index_to_word.json', 'r') as f: # Reads the list of most frequent words\n",
115 | " map_index_to_word = json.load(f)"
116 | ]
117 | },
118 | {
119 | "cell_type": "raw",
120 | "metadata": {},
121 | "source": [
122 | "As in the previous assignment, we will normalize each document's TF-IDF vector to be a unit vector."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 4,
128 | "metadata": {
129 | "collapsed": true
130 | },
131 | "outputs": [],
132 | "source": [
133 | "tf_idf = normalize(tf_idf)"
134 | ]
135 | },
136 | {
137 | "cell_type": "raw",
138 | "metadata": {},
139 | "source": [
140 | "(Optional) Extracting TF-IDF vectors yourself. We provide the pre-computed TF-IDF vectors to minimize potential compatibility issues. You are free to experiment with other tools to compute the TF-IDF vectors yourself. A good place to start is sklearn.TfidfVectorizer. Note. Due to variations in tokenization and other factors, your TF-IDF vectors may differ from the ones we provide. For the purpose the assessment, we ask you to use the vectors from 4_tf_idf.npz. (See my "
141 | ]
142 | },
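For readers who want to try the optional route, a minimal sklearn sketch is given below. It assumes the article text lives in the 'text' column of people_wiki.csv, and, as cautioned above, the resulting vectors will generally not match the provided people_wiki_tf_idf.npz, so quiz answers should still be computed from the provided file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

vectorizer = TfidfVectorizer()                        # default tokenization; differs from the course's preprocessing
tf_idf_own = vectorizer.fit_transform(wiki['text'])   # sparse matrix, one row per article
tf_idf_own = normalize(tf_idf_own)                    # unit-length rows, matching the provided vectors above
```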
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "### EM in high dimensions"
148 | ]
149 | },
150 | {
151 | "cell_type": "raw",
152 | "metadata": {},
153 | "source": [
154 | "EM for high-dimensional data requires some special treatment:\n",
155 | "\n",
156 | "E step and M step must be vectorized as much as possible, as explicit loops are dreadfully slow in Python.\n",
157 | "All operations must be cast in terms of sparse matrix operations, to take advantage of computational savings enabled by sparsity of data.\n",
158 | "Initially, some words may be entirely absent from a cluster, causing the M step to produce zero mean and variance for those words. This means any data point with one of those words will have 0 probability of being assigned to that cluster since the cluster allows for no variability (0 variance) around that count being 0 (0 mean). Since there is a small chance for those words to later appear in the cluster, we instead assign a small positive variance (~1e-10). Doing so also prevents numerical overflow."
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "### Log probability function for diagonal covariance Gaussian."
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 5,
171 | "metadata": {
172 | "collapsed": true
173 | },
174 | "outputs": [],
175 | "source": [
176 | "def diag(array):\n",
177 | " n = len(array)\n",
178 | " return spdiags(array, 0, n, n)\n",
179 | "\n",
180 | "def logpdf_diagonal_gaussian(x, mean, cov):\n",
181 | " '''\n",
182 | " Compute logpdf of a multivariate Gaussian distribution with diagonal covariance at a given point x.\n",
183 | " A multivariate Gaussian distribution with a diagonal covariance is equivalent\n",
184 | " to a collection of independent Gaussian random variables.\n",
185 | "\n",
186 | " x should be a sparse matrix. The logpdf will be computed for each row of x.\n",
187 | " mean and cov should be given as 1D numpy arrays\n",
188 | " mean[i] : mean of i-th variable\n",
189 | " cov[i] : variance of i-th variable'''\n",
190 | "\n",
191 | " n = x.shape[0]\n",
192 | " dim = x.shape[1]\n",
193 | " assert(dim == len(mean) and dim == len(cov))\n",
194 | "\n",
195 | " # multiply each i-th column of x by (1/(2*sigma_i)), where sigma_i is sqrt of variance of i-th variable.\n",
196 | " scaled_x = x.dot( diag(1./(2*np.sqrt(cov))) )\n",
197 | " # multiply each i-th entry of mean by (1/(2*sigma_i))\n",
198 | " scaled_mean = mean/(2*np.sqrt(cov))\n",
199 | "\n",
200 | " # sum of pairwise squared Eulidean distances gives SUM[(x_i - mean_i)^2/(2*sigma_i^2)]\n",
201 | " return -np.sum(np.log(np.sqrt(2*np.pi*cov))) - pairwise_distances(scaled_x, [scaled_mean], 'euclidean').flatten()**2"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "### EM algorithm for sparse data."
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": 6,
214 | "metadata": {
215 | "collapsed": true
216 | },
217 | "outputs": [],
218 | "source": [
219 | "def log_sum_exp(x, axis):\n",
220 | " '''Compute the log of a sum of exponentials'''\n",
221 | " x_max = np.max(x, axis=axis)\n",
222 | " if axis == 1:\n",
223 | " return x_max + np.log( np.sum(np.exp(x-x_max[:,np.newaxis]), axis=1) )\n",
224 | " else:\n",
225 | " return x_max + np.log( np.sum(np.exp(x-x_max), axis=0) )\n",
226 | "\n",
227 | "def EM_for_high_dimension(data, means, covs, weights, cov_smoothing=1e-5, maxiter=int(1e3), thresh=1e-4, verbose=False):\n",
228 | " # cov_smoothing: specifies the default variance assigned to absent features in a cluster.\n",
229 | " # If we were to assign zero variances to absent features, we would be overconfient,\n",
230 | " # as we hastily conclude that those featurese would NEVER appear in the cluster.\n",
231 | " # We'd like to leave a little bit of possibility for absent features to show up later.\n",
232 | " n = data.shape[0]\n",
233 | " dim = data.shape[1]\n",
234 | " mu = deepcopy(means)\n",
235 | " Sigma = deepcopy(covs)\n",
236 | " K = len(mu)\n",
237 | " weights = np.array(weights)\n",
238 | "\n",
239 | " ll = None\n",
240 | " ll_trace = []\n",
241 | "\n",
242 | " for i in range(maxiter):\n",
243 | " # E-step: compute responsibilities\n",
244 | " logresp = np.zeros((n,K))\n",
245 | " for k in xrange(K):\n",
246 | " logresp[:,k] = np.log(weights[k]) + logpdf_diagonal_gaussian(data, mu[k], Sigma[k])\n",
247 | " ll_new = np.sum(log_sum_exp(logresp, axis=1))\n",
248 | " if verbose:\n",
249 | " print(ll_new)\n",
250 | " logresp -= np.vstack(log_sum_exp(logresp, axis=1))\n",
251 | " resp = np.exp(logresp)\n",
252 | " counts = np.sum(resp, axis=0)\n",
253 | "\n",
254 | " # M-step: update weights, means, covariances\n",
255 | " weights = counts / np.sum(counts)\n",
256 | " for k in range(K):\n",
257 | " mu[k] = (diag(resp[:,k]).dot(data)).sum(axis=0)/counts[k]\n",
258 | " mu[k] = mu[k].A1\n",
259 | "\n",
260 | " Sigma[k] = diag(resp[:,k]).dot( data.multiply(data)-2*data.dot(diag(mu[k])) ).sum(axis=0) \\\n",
261 | " + (mu[k]**2)*counts[k]\n",
262 | " Sigma[k] = Sigma[k].A1 / counts[k] + cov_smoothing*np.ones(dim)\n",
263 | "\n",
264 | " # check for convergence in log-likelihood\n",
265 | " ll_trace.append(ll_new)\n",
266 | " if ll is not None and (ll_new-ll) < thresh and ll_new > -np.inf:\n",
267 | " ll = ll_new\n",
268 | " break\n",
269 | " else:\n",
270 | " ll = ll_new\n",
271 | "\n",
272 | " out = {'weights':weights,'means':mu,'covs':Sigma,'loglik':ll_trace,'resp':resp}\n",
273 | "\n",
274 | " return out"
275 | ]
276 | },
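The log_sum_exp helper above is what keeps the E-step numerically stable: responsibilities are computed in log space, and subtracting the row maximum before exponentiating avoids overflow. A tiny illustration with toy numbers:

```python
import numpy as np

big = np.array([[1000., 999.],
                [  -2.,    3.]])
print log_sum_exp(big, axis=1)              # stable even though exp(1000) overflows a float
print np.log(np.sum(np.exp(big), axis=1))   # the naive version returns inf for the first row
```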
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "### Initializing mean parameters using k-means."
282 | ]
283 | },
284 | {
285 | "cell_type": "raw",
286 | "metadata": {},
287 | "source": [
288 | "Recall from the lectures that EM for Gaussian mixtures is very sensitive to the choice of initial means. With a bad initial set of means, EM may produce clusters that span a large area and are mostly overlapping. To eliminate such bad outcomes, we first produce a suitable set of initial means by using the cluster centers from running k-means. That is, we first run k-means and then take the final set of means from the converged solution as the initial means in our EM algorithm."
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": 8,
294 | "metadata": {
295 | "collapsed": false
296 | },
297 | "outputs": [],
298 | "source": [
299 | "from sklearn.cluster import KMeans\n",
300 | "\n",
301 | "np.random.seed(5)\n",
302 | "num_clusters = 25\n",
303 | "\n",
304 | "# Use scikit-learn's k-means to simplify workflow\n",
305 | "kmeans_model = KMeans(n_clusters=num_clusters, n_init=5, max_iter=400, random_state=1, n_jobs=-1)\n",
306 | "kmeans_model.fit(tf_idf)\n",
307 | "centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_\n",
308 | "\n",
309 | "means = [centroid for centroid in centroids]"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "### Initializing weights"
317 | ]
318 | },
319 | {
320 | "cell_type": "raw",
321 | "metadata": {},
322 | "source": [
323 | "We will initialize each cluster weight to be the proportion of documents assigned to that cluster by k-means above."
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 9,
329 | "metadata": {
330 | "collapsed": true
331 | },
332 | "outputs": [],
333 | "source": [
334 | "num_docs = tf_idf.shape[0]\n",
335 | "weights = []\n",
336 | "for i in xrange(num_clusters):\n",
337 | " # Compute the number of data points assigned to cluster i:\n",
338 | " num_assigned = sum(cluster_assignment == i) # YOUR CODE HERE\n",
339 | " w = float(num_assigned) / num_docs\n",
340 | " weights.append(w)"
341 | ]
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {},
346 | "source": [
347 | "### Initializing covariances."
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "To initialize our covariance parameters, we compute σ^2k,j=∑Ni=1(xi,j−μ^k,j)2 for each feature j. For features with really tiny variances, we assign 1e-8 instead to prevent numerical instability. "
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 10,
360 | "metadata": {
361 | "collapsed": true
362 | },
363 | "outputs": [],
364 | "source": [
365 | "covs = []\n",
366 | "for i in xrange(num_clusters):\n",
367 | " member_rows = tf_idf[cluster_assignment==i]\n",
368 | " cov = (member_rows.multiply(member_rows) - 2*member_rows.dot(diag(means[i]))).sum(axis=0).A1 / member_rows.shape[0] \\\n",
369 | " + means[i]**2\n",
370 | " cov[cov < 1e-8] = 1e-8\n",
371 | " covs.append(cov)"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {},
377 | "source": [
378 | "## Running EM"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 11,
384 | "metadata": {
385 | "collapsed": false
386 | },
387 | "outputs": [
388 | {
389 | "name": "stdout",
390 | "output_type": "stream",
391 | "text": [
392 | "[252072234152.18665, 314189800465.00781, 314189826362.03009, 314189826362.03009]\n"
393 | ]
394 | }
395 | ],
396 | "source": [
397 | "out = EM_for_high_dimension(tf_idf, means, covs, weights, cov_smoothing=1e-10)\n",
398 | "print out['loglik'] # print history of log-likelihood over time"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
405 | "## Interpret clusters"
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "In contrast to k-means, EM is able to explicitly model clusters of varying sizes and proportions. The relative magnitude of variances in the word dimensions tell us much about the nature of the clusters.\n",
413 | "\n",
414 | "Write yourself a cluster visualizer as follows. Examining each cluster's mean vector, list the 5 words with the largest mean values (5 most common words in the cluster). For each word, also include the associated variance parameter (diagonal element of the covariance matrix).\n",
415 | "\n"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 52,
421 | "metadata": {
422 | "collapsed": false
423 | },
424 | "outputs": [],
425 | "source": [
426 | "def visualize_EM_clusters(tf_idf, means, covs, map_index_to_word):\n",
427 | " print('')\n",
428 | " print('==========================================================')\n",
429 | "\n",
430 | " num_clusters = len(means)\n",
431 | " for c in xrange(num_clusters):\n",
432 | " print('Cluster {0:d}: Largest mean parameters in cluster '.format(c))\n",
433 | " print('\\n{0: <12}{1: <12}{2: <12}'.format('Word', 'Mean', 'Variance'))\n",
434 | " \n",
435 | " # The k'th element of sorted_word_ids should be the index of the word \n",
436 | " # that has the k'th-largest value in the cluster mean. Hint: Use np.argsort().\n",
437 | " sorted_word_ids = np.argsort(means[c])[::-1] # YOUR CODE HERE\n",
438 | "\n",
439 | " for i in sorted_word_ids[:5]:\n",
440 | " print '{0: <12}{1:<10.2e}{2:10.2e}'.format({v:k for k, v in map_index_to_word.items()}[i], means[c][i], covs[c][i])\n",
441 | " \n",
442 | " print '\\n=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]===='\n"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": 53,
448 | "metadata": {
449 | "collapsed": false
450 | },
451 | "outputs": [
452 | {
453 | "name": "stdout",
454 | "output_type": "stream",
455 | "text": [
456 | "\n",
457 | "==========================================================\n",
458 | "Cluster 0: Largest mean parameters in cluster \n",
459 | "\n",
460 | "Word Mean Variance \n",
461 | "band 7.43e-02 3.92e-03\n",
462 | "jazz 4.55e-02 7.81e-03\n",
463 | "music 3.58e-02 1.52e-03\n",
464 | "album 3.34e-02 1.59e-03\n",
465 | "guitar 2.95e-02 2.89e-03\n",
466 | "\n",
467 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
468 | "Cluster 1: Largest mean parameters in cluster \n",
469 | "\n",
470 | "Word Mean Variance \n",
471 | "championships8.21e-02 5.92e-03\n",
472 | "olympics 5.54e-02 2.84e-03\n",
473 | "marathon 5.45e-02 2.30e-02\n",
474 | "metres 5.44e-02 1.07e-02\n",
475 | "she 5.19e-02 5.79e-03\n",
476 | "\n",
477 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
478 | "Cluster 2: Largest mean parameters in cluster \n",
479 | "\n",
480 | "Word Mean Variance \n",
481 | "racing 1.00e-01 7.90e-03\n",
482 | "chess 8.14e-02 3.18e-02\n",
483 | "formula 6.14e-02 1.35e-02\n",
484 | "championship5.58e-02 3.40e-03\n",
485 | "race 5.36e-02 3.83e-03\n",
486 | "\n",
487 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
488 | "Cluster 3: Largest mean parameters in cluster \n",
489 | "\n",
490 | "Word Mean Variance \n",
491 | "rugby 1.96e-01 1.21e-02\n",
492 | "cup 4.87e-02 2.52e-03\n",
493 | "against 4.59e-02 2.39e-03\n",
494 | "played 4.42e-02 1.24e-03\n",
495 | "wales 3.86e-02 5.20e-03\n",
496 | "\n",
497 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
498 | "Cluster 4: Largest mean parameters in cluster \n",
499 | "\n",
500 | "Word Mean Variance \n",
501 | "law 1.34e-01 1.11e-02\n",
502 | "court 8.13e-02 5.99e-03\n",
503 | "judge 6.09e-02 6.25e-03\n",
504 | "district 4.34e-02 4.82e-03\n",
505 | "justice 3.96e-02 3.95e-03\n",
506 | "\n",
507 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
508 | "Cluster 5: Largest mean parameters in cluster \n",
509 | "\n",
510 | "Word Mean Variance \n",
511 | "he 1.15e-02 6.59e-05\n",
512 | "that 9.21e-03 1.30e-04\n",
513 | "his 8.57e-03 5.00e-05\n",
514 | "president 7.03e-03 3.02e-04\n",
515 | "world 6.68e-03 2.23e-04\n",
516 | "\n",
517 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
518 | "Cluster 6: Largest mean parameters in cluster \n",
519 | "\n",
520 | "Word Mean Variance \n",
521 | "football 1.20e-01 4.43e-03\n",
522 | "afl 1.12e-01 1.22e-02\n",
523 | "australian 8.28e-02 1.87e-03\n",
524 | "season 5.76e-02 1.71e-03\n",
525 | "club 5.73e-02 2.04e-03\n",
526 | "\n",
527 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
528 | "Cluster 7: Largest mean parameters in cluster \n",
529 | "\n",
530 | "Word Mean Variance \n",
531 | "she 1.44e-01 3.34e-03\n",
532 | "her 8.60e-02 3.04e-03\n",
533 | "miss 1.76e-02 5.88e-03\n",
534 | "women 1.37e-02 1.25e-03\n",
535 | "womens 1.07e-02 1.06e-03\n",
536 | "\n",
537 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
538 | "Cluster 8: Largest mean parameters in cluster \n",
539 | "\n",
540 | "Word Mean Variance \n",
541 | "music 9.83e-02 4.65e-03\n",
542 | "orchestra 9.06e-02 9.78e-03\n",
543 | "symphony 6.01e-02 7.16e-03\n",
544 | "conductor 4.33e-02 6.49e-03\n",
545 | "opera 4.30e-02 9.59e-03\n",
546 | "\n",
547 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
548 | "Cluster 9: Largest mean parameters in cluster \n",
549 | "\n",
550 | "Word Mean Variance \n",
551 | "theatre 5.41e-02 7.46e-03\n",
552 | "actor 3.33e-02 1.58e-03\n",
553 | "television 3.11e-02 1.25e-03\n",
554 | "series 3.11e-02 1.53e-03\n",
555 | "film 2.88e-02 1.05e-03\n",
556 | "\n",
557 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
558 | "Cluster 10: Largest mean parameters in cluster \n",
559 | "\n",
560 | "Word Mean Variance \n",
561 | "league 5.85e-02 3.09e-03\n",
562 | "football 4.77e-02 2.76e-03\n",
563 | "club 4.71e-02 2.25e-03\n",
564 | "season 4.51e-02 2.03e-03\n",
565 | "cup 4.02e-02 2.80e-03\n",
566 | "\n",
567 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
568 | "Cluster 11: Largest mean parameters in cluster \n",
569 | "\n",
570 | "Word Mean Variance \n",
571 | "poetry 4.67e-02 8.26e-03\n",
572 | "book 4.25e-02 1.96e-03\n",
573 | "novel 3.92e-02 3.65e-03\n",
574 | "published 3.76e-02 1.26e-03\n",
575 | "books 3.11e-02 1.45e-03\n",
576 | "\n",
577 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
578 | "Cluster 12: Largest mean parameters in cluster \n",
579 | "\n",
580 | "Word Mean Variance \n",
581 | "research 4.16e-02 1.91e-03\n",
582 | "university 3.89e-02 8.29e-04\n",
583 | "professor 3.50e-02 1.21e-03\n",
584 | "science 2.78e-02 2.02e-03\n",
585 | "institute 2.08e-02 8.29e-04\n",
586 | "\n",
587 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
588 | "Cluster 13: Largest mean parameters in cluster \n",
589 | "\n",
590 | "Word Mean Variance \n",
591 | "hockey 2.17e-01 1.20e-02\n",
592 | "nhl 1.35e-01 1.24e-02\n",
593 | "ice 6.61e-02 3.09e-03\n",
594 | "season 5.27e-02 2.23e-03\n",
595 | "league 4.76e-02 1.58e-03\n",
596 | "\n",
597 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
598 | "Cluster 14: Largest mean parameters in cluster \n",
599 | "\n",
600 | "Word Mean Variance \n",
601 | "film 1.70e-01 6.29e-03\n",
602 | "films 5.37e-02 2.79e-03\n",
603 | "festival 4.58e-02 3.95e-03\n",
604 | "directed 3.32e-02 1.88e-03\n",
605 | "feature 3.24e-02 1.85e-03\n",
606 | "\n",
607 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
608 | "Cluster 15: Largest mean parameters in cluster \n",
609 | "\n",
610 | "Word Mean Variance \n",
611 | "party 4.89e-02 2.79e-03\n",
612 | "election 4.54e-02 2.38e-03\n",
613 | "minister 4.20e-02 4.64e-03\n",
614 | "elected 2.97e-02 8.46e-04\n",
615 | "member 2.11e-02 4.67e-04\n",
616 | "\n",
617 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
618 | "Cluster 16: Largest mean parameters in cluster \n",
619 | "\n",
620 | "Word Mean Variance \n",
621 | "tour 2.54e-01 1.36e-02\n",
622 | "pga 2.11e-01 2.13e-02\n",
623 | "golf 1.43e-01 1.75e-02\n",
624 | "open 7.25e-02 3.86e-03\n",
625 | "golfer 6.20e-02 2.30e-03\n",
626 | "\n",
627 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
628 | "Cluster 17: Largest mean parameters in cluster \n",
629 | "\n",
630 | "Word Mean Variance \n",
631 | "she 1.39e-01 2.90e-03\n",
632 | "her 9.75e-02 3.09e-03\n",
633 | "actress 6.77e-02 3.14e-03\n",
634 | "film 4.40e-02 2.38e-03\n",
635 | "role 4.09e-02 1.78e-03\n",
636 | "\n",
637 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
638 | "Cluster 18: Largest mean parameters in cluster \n",
639 | "\n",
640 | "Word Mean Variance \n",
641 | "church 1.22e-01 9.38e-03\n",
642 | "bishop 9.24e-02 1.27e-02\n",
643 | "lds 4.43e-02 9.05e-03\n",
644 | "diocese 4.42e-02 5.69e-03\n",
645 | "archbishop 4.36e-02 6.35e-03\n",
646 | "\n",
647 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
648 | "Cluster 19: Largest mean parameters in cluster \n",
649 | "\n",
650 | "Word Mean Variance \n",
651 | "baseball 1.10e-01 5.30e-03\n",
652 | "league 1.04e-01 2.99e-03\n",
653 | "major 5.21e-02 1.25e-03\n",
654 | "games 4.71e-02 2.10e-03\n",
655 | "season 4.53e-02 1.47e-03\n",
656 | "\n",
657 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
658 | "Cluster 20: Largest mean parameters in cluster \n",
659 | "\n",
660 | "Word Mean Variance \n",
661 | "radio 7.83e-02 6.17e-03\n",
662 | "news 5.96e-02 6.28e-03\n",
663 | "show 4.57e-02 2.75e-03\n",
664 | "bbc 3.37e-02 7.10e-03\n",
665 | "television 2.83e-02 1.17e-03\n",
666 | "\n",
667 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
668 | "Cluster 21: Largest mean parameters in cluster \n",
669 | "\n",
670 | "Word Mean Variance \n",
671 | "health 8.21e-02 1.34e-02\n",
672 | "medical 7.79e-02 6.17e-03\n",
673 | "medicine 7.07e-02 6.44e-03\n",
674 | "research 4.60e-02 2.47e-03\n",
675 | "clinical 3.20e-02 2.74e-03\n",
676 | "\n",
677 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
678 | "Cluster 22: Largest mean parameters in cluster \n",
679 | "\n",
680 | "Word Mean Variance \n",
681 | "art 1.44e-01 7.05e-03\n",
682 | "museum 7.66e-02 9.13e-03\n",
683 | "gallery 5.63e-02 6.12e-03\n",
684 | "artist 3.29e-02 9.84e-04\n",
685 | "arts 3.13e-02 1.48e-03\n",
686 | "\n",
687 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
688 | "Cluster 23: Largest mean parameters in cluster \n",
689 | "\n",
690 | "Word Mean Variance \n",
691 | "basketball 7.78e-02 9.63e-03\n",
692 | "coach 7.50e-02 8.86e-03\n",
693 | "football 5.59e-02 5.31e-03\n",
694 | "nba 4.10e-02 6.99e-03\n",
695 | "nfl 4.06e-02 6.27e-03\n",
696 | "\n",
697 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n",
698 | "Cluster 24: Largest mean parameters in cluster \n",
699 | "\n",
700 | "Word Mean Variance \n",
701 | "album 8.03e-02 4.25e-03\n",
702 | "music 4.76e-02 1.97e-03\n",
703 | "released 4.27e-02 1.56e-03\n",
704 | "song 3.73e-02 2.72e-03\n",
705 | "her 3.02e-02 2.91e-03\n",
706 | "\n",
707 | "=====================================================Quiz Question. Select all the topics that have a cluster in the model created above. [multiple choice]====\n"
708 | ]
709 | }
710 | ],
711 | "source": [
712 | "visualize_EM_clusters(tf_idf, out['means'], out['covs'], map_index_to_word)"
713 | ]
714 | },
715 | {
716 | "cell_type": "markdown",
717 | "metadata": {},
718 | "source": [
719 | "## Comparing to random initialization"
720 | ]
721 | },
722 | {
723 | "cell_type": "markdown",
724 | "metadata": {},
725 | "source": [
726 | "Create variables for randomly initializing the EM algorithm."
727 | ]
728 | },
729 | {
730 | "cell_type": "code",
731 | "execution_count": 63,
732 | "metadata": {
733 | "collapsed": true
734 | },
735 | "outputs": [],
736 | "source": [
737 | "np.random.seed(5)\n",
738 | "num_clusters = len(means)\n",
739 | "num_docs, num_words = tf_idf.shape\n",
740 | "\n",
741 | "random_means = []\n",
742 | "random_covs = []\n",
743 | "random_weights = []\n",
744 | "\n",
745 | "for k in range(num_clusters):\n",
746 | " \n",
747 | " # Create a numpy array of length num_words with random normally distributed values.\n",
748 | " # Use the standard univariate normal distribution (mean 0, variance 1).\n",
749 | " # YOUR CODE HERE\n",
750 | " mean = np.random.normal(0, 1, num_words)\n",
751 | " \n",
752 | " # Create a numpy array of length num_words with random values uniformly distributed between 1 and 5.\n",
753 | " # YOUR CODE HERE\n",
754 | " cov = np.random.uniform(1, 5, num_words)\n",
755 | "\n",
756 | " # Initially give each cluster equal weight.\n",
757 | " # YOUR CODE HERE\n",
758 | " weight = 1. / num_clusters\n",
759 | " \n",
760 | " random_means.append(mean)\n",
761 | " random_covs.append(cov)\n",
762 | " random_weights.append(weight)"
763 | ]
764 | },
765 | {
766 | "cell_type": "code",
767 | "execution_count": 67,
768 | "metadata": {
769 | "collapsed": false
770 | },
771 | "outputs": [
772 | {
773 | "data": {
774 | "text/plain": [
775 | "[0.04,\n",
776 | " 0.04,\n",
777 | " 0.04,\n",
778 | " 0.04,\n",
779 | " 0.04,\n",
780 | " 0.04,\n",
781 | " 0.04,\n",
782 | " 0.04,\n",
783 | " 0.04,\n",
784 | " 0.04,\n",
785 | " 0.04,\n",
786 | " 0.04,\n",
787 | " 0.04,\n",
788 | " 0.04,\n",
789 | " 0.04,\n",
790 | " 0.04,\n",
791 | " 0.04,\n",
792 | " 0.04,\n",
793 | " 0.04,\n",
794 | " 0.04,\n",
795 | " 0.04,\n",
796 | " 0.04,\n",
797 | " 0.04,\n",
798 | " 0.04,\n",
799 | " 0.04]"
800 | ]
801 | },
802 | "execution_count": 67,
803 | "metadata": {},
804 | "output_type": "execute_result"
805 | }
806 | ],
807 | "source": [
808 | "random_weights"
809 | ]
810 | },
811 | {
812 | "cell_type": "markdown",
813 | "metadata": {},
814 | "source": [
815 | "Quiz Question: Try fitting EM with the random initial parameters you created above. (Use cov_smoothing=1e-5.) Store the result to out_random_init. What is the final loglikelihood that the algorithm converges to?"
816 | ]
817 | },
818 | {
819 | "cell_type": "raw",
820 | "metadata": {},
821 | "source": [
822 | "out_random_init = EM_for_high_dimension(tf_idf, random_means, random_covs, random_weights, cov_smoothing=1e-5)"
823 | ]
824 | },
825 | {
826 | "cell_type": "markdown",
827 | "metadata": {},
828 | "source": [
829 | "Quiz Question: Is the final loglikelihood larger or smaller than the final loglikelihood we obtained above when initializing EM with the results from running k-means?"
830 | ]
831 | }
832 | ],
833 | "metadata": {
834 | "kernelspec": {
835 | "display_name": "Python 2",
836 | "language": "python",
837 | "name": "python2"
838 | },
839 | "language_info": {
840 | "codemirror_mode": {
841 | "name": "ipython",
842 | "version": 2
843 | },
844 | "file_extension": ".py",
845 | "mimetype": "text/x-python",
846 | "name": "python",
847 | "nbconvert_exporter": "python",
848 | "pygments_lexer": "ipython2",
849 | "version": "2.7.11"
850 | }
851 | },
852 | "nbformat": 4,
853 | "nbformat_minor": 0
854 | }
855 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Specialization - University of Washington
2 | Programming assignments for the Machine Learning Specialization courses from the University of Washington on Coursera.
3 |
4 | Techniques used: Python, pandas, numpy, scikit-learn, graphlab
5 |
6 | In terms of libraries and packages, I only used graphlab and SFrame for Machine Learning Foundations. For all the other courses (Regression, Classification, and Clustering) I used pandas for feature engineering and scikit-learn for modeling.
7 |
8 | ## Specialization Courses:
9 | - Machine Learning Foundations: A Case Study Approach
10 |
11 | Regression: Predicting House Prices (leverage Zillow data to build a linear regression model that predicts house prices)
12 |
13 | Classification: Analyzing Sentiment (build a logistic regression classifier to analyze product sentiment)
14 |
15 | Clustering and Similarity: Retrieving Documents (conduct cluster analysis for document retrieval using TF-IDF)
16 |
17 | Recommending Products: Build a matrix factorization model and leverage Jaccard similarity to recommend songs (a minimal Jaccard-similarity sketch follows this list)
18 |
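A minimal sketch of the Jaccard-similarity idea behind the song recommender; the song names and listener sets below are made-up examples, not the course data:

```python
def jaccard_similarity(users_a, users_b):
    """Jaccard similarity between the sets of users who played two songs."""
    a, b = set(users_a), set(users_b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

# Hypothetical listener sets per song (illustrative only).
listeners = {
    'song_1': {'u1', 'u2', 'u3'},
    'song_2': {'u2', 'u3', 'u4'},
    'song_3': {'u5'},
}

# Rank the other songs by similarity to 'song_1'.
scores = {song: jaccard_similarity(listeners['song_1'], users)
          for song, users in listeners.items() if song != 'song_1'}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```
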
19 | - Machine Learning: Regression
20 |
21 | Project Overview: How do you predict a house's price? How do you evaluate a model? How do you keep a model from overfitting?
22 |
23 | Simple Linear Regression: Implementing the closed-form solution for simple linear regression (a minimal sketch appears after this course's topic list)
24 |
25 | Multiple Linear Regression: Exploring multiple regression models for house price prediction; implementing gradient descent for multiple regression
26 |
27 | Assessing Performance
28 |
29 | Ridge regression
30 |
31 | Lasso regression
32 |
33 | Kernel regression
34 |
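A minimal NumPy sketch of the closed-form solution for simple linear regression; the square-footage and price arrays are toy values, not the course's house-sales data:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least-squares intercept and slope for y ~ w0 + w1 * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Toy data (illustrative only): price grows roughly linearly with square footage.
sqft = np.array([1000., 1500., 2000., 2500., 3000.])
price = np.array([310000., 440000., 630000., 740000., 900000.])
w0, w1 = simple_linear_regression(sqft, price)
print(w0, w1)  # intercept and slope of the fitted line
```
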
35 | - Machine Learning: Classification
36 |
37 | Project 1 Overview: Build classification models to predict whether an Amazon review is positive.
38 |
39 | Project 2 Overview: Is this loan safe or risky?
40 |
41 | In these assignments, I built logistic regression and decision tree models to predict whether a loan is risky or safe, and compared classification error across models, both by using scikit-learn and by implementing the algorithms (gradient ascent, greedy decision tree learning, etc.) from scratch (a minimal gradient-ascent sketch appears after this course's topic list).
42 |
43 |
44 | Linear Classifiers & Logistic Regression
45 |
46 | Learning Classifiers; Overfitting & Regularization in Logistic Regression
47 |
48 | Decision Trees
49 |
50 | Precision-Recall
51 |
52 | Stochastic Gradient Ascent
53 |
54 | SVM (external tutorial: http://www.svm-tutorial.com/2014/11/svm-understanding-math-part-2/)
55 |
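A minimal sketch of training a logistic regression classifier by gradient ascent on the log likelihood, in the spirit of the from-scratch assignments; the toy feature matrix, labels, step size, and iteration count are illustrative assumptions, not the course data or code:

```python
import numpy as np

def sigmoid(scores):
    return 1. / (1. + np.exp(-scores))

def logistic_regression_gradient_ascent(X, y, step_size=1e-2, n_iters=500):
    """Fit coefficients for labels in {+1, -1} by batch gradient ascent.

    X should already include a column of 1s for the intercept.
    """
    coefficients = np.zeros(X.shape[1])
    indicator = (y == +1).astype(float)      # 1 for positive examples, else 0
    for _ in range(n_iters):
        predictions = sigmoid(X.dot(coefficients))
        errors = indicator - predictions     # gradient of the log likelihood w.r.t. each score
        coefficients += step_size * X.T.dot(errors)
    return coefficients

# Toy example: intercept column plus two features.
X = np.array([[1., 2., 1.], [1., 0., -1.], [1., -1., -2.], [1., 3., 2.]])
y = np.array([+1, -1, -1, +1])
w = logistic_regression_gradient_ascent(X, y)
print(np.sign(X.dot(w)))  # predicted classes for the training points
```
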
56 | - Machine Learning: Clustering & Retrieval
57 |
58 | Nearest Neighbor Search
59 |
60 | Clustering with K-Means
61 |
62 | Mixture Models (implementing the Expectation-Maximization algorithm for Gaussian mixtures; clustering text data with Gaussian mixtures; a minimal scikit-learn counterpart appears after this course's topic list)
63 |
64 | Mixed Membership Modeling via Latent Dirichlet Allocation
65 |
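A minimal scikit-learn counterpart to the from-scratch EM assignment, fitting a diagonal-covariance Gaussian mixture; the toy document-word matrix, number of components, reg_covar value, and random seed are illustrative assumptions (the assignment implements EM directly rather than calling GaussianMixture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy dense TF-IDF-like matrix: 6 documents x 4 words (illustrative values only).
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.1, 0.0, 0.8, 0.2],
    [0.0, 0.9, 0.1, 0.0],
    [0.1, 0.8, 0.0, 0.1],
])

# Diagonal covariances mirror the per-word variances used in the assignment;
# reg_covar plays a role similar to the assignment's covariance smoothing.
gmm = GaussianMixture(n_components=3, covariance_type='diag',
                      reg_covar=1e-5, random_state=5)
gmm.fit(X)

print(gmm.weights_)     # mixture weights per cluster
print(gmm.predict(X))   # hard cluster assignment per document
print(gmm.score(X))     # average per-document log likelihood
```
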
66 | Others:
67 |
68 | computational cost (complexity)
69 | http://stackoverflow.com/questions/2307283/what-does-olog-n-mean-exactly
70 |
71 | bitwise operators (0 and 1)
72 | https://wiki.python.org/moin/BitwiseOperators
73 |
74 | an additional blog that helps with understanding LDA
75 | http://confusedlanguagetech.blogspot.com/
76 |
--------------------------------------------------------------------------------