├── .gitattributes
├── .gitignore
├── README.md
├── course-1
│   ├── analyzing-product-sentiment.ipynb
│   ├── deep-features-for-image-classification.ipynb
│   ├── deep-features-for-image-retrieval.ipynb
│   ├── document-retrieval.ipynb
│   ├── getting-started-with-ipython-notebook.ipynb
│   ├── getting-started-with-sframes.ipynb
│   ├── predicting-house-prices.ipynb
│   └── song-recommender.ipynb
├── course-2
│   ├── Overfitting_Demo_Ridge_Lasso.ipynb
│   ├── Philadelphia_Crime_Rate_noNA.csv
│   ├── PhillyCrime.ipynb
│   ├── kc_house_data.gl.zip
│   ├── kc_house_data_small.gl.zip
│   ├── numpy-tutorial.ipynb
│   ├── week-1-simple-regression-assignment-blank.ipynb
│   ├── week-2-multiple-regression-assignment-1-blank.ipynb
│   ├── week-2-multiple-regression-assignment-2-blank.ipynb
│   ├── week-3-polynomial-regression-assignment-blank.ipynb
│   ├── week-4-ridge-regression-assignment-1-blank.ipynb
│   ├── week-4-ridge-regression-assignment-2-blank.ipynb
│   ├── week-5-lasso-assignment-1-blank.ipynb
│   ├── week-5-lasso-assignment-2-blank.ipynb
│   └── week-6-local-regression-assignment-blank.ipynb
├── course-3
│   ├── amazon_baby.gl.zip
│   ├── amazon_baby_subset.gl.zip
│   ├── important_words.json
│   ├── indices-json
│   │   ├── module-10-assignment-train-idx.json
│   │   ├── module-10-assignment-validation-idx.json
│   │   ├── module-2-assignment-test-idx.json
│   │   ├── module-2-assignment-train-idx.json
│   │   ├── module-4-assignment-train-idx.json
│   │   ├── module-4-assignment-validation-idx.json
│   │   ├── module-5-assignment-1-train-idx.json
│   │   ├── module-5-assignment-1-validation-idx.json
│   │   ├── module-5-assignment-2-test-idx.json
│   │   ├── module-5-assignment-2-train-idx.json
│   │   ├── module-6-assignment-train-idx.json
│   │   ├── module-6-assignment-validation-idx.json
│   │   ├── module-8-assignment-1-train-idx.json
│   │   ├── module-8-assignment-1-validation-idx.json
│   │   ├── module-8-assignment-2-test-idx.json
│   │   ├── module-8-assignment-2-train-idx.json
│   │   ├── module-9-assignment-test-idx.json
│   │   └── module-9-assignment-train-idx.json
│   ├── lending-club-data.gl.zip
│   ├── module-10-online-learning-assignment-blank.ipynb
│   ├── module-2-linear-classifier-assignment-blank.ipynb
│   ├── module-3-linear-classifier-learning-assignment-blank.ipynb
│   ├── module-4-linear-classifier-regularization-assignment-blank.ipynb
│   ├── module-5-decision-tree-assignment-1-blank.ipynb
│   ├── module-5-decision-tree-assignment-2-blank.ipynb
│   ├── module-6-decision-tree-practical-assignment-blank.ipynb
│   ├── module-8-boosting-assignment-1-blank.ipynb
│   ├── module-8-boosting-assignment-2-blank.ipynb
│   ├── module-9-precision-recall-assignment-blank.ipynb
│   └── numpy-arrays
│       ├── module-10-assignment-numpy-arrays.npz
│       ├── module-3-assignment-numpy-arrays.npz
│       └── module-4-assignment-numpy-arrays.npz
└── course-4
    ├── 0_nearest-neighbors-features-and-metrics_blank.ipynb
    ├── 1_nearest-neighbors-lsh-implementation_blank.ipynb
    ├── 2_kmeans-with-text-data_blank.ipynb
    ├── 3_em-for-gmm_blank.ipynb
    ├── 4_em-with-text-data_blank.ipynb
    ├── 5_lda_blank.ipynb
    ├── 6_hierarchical_clustering_blank.ipynb
    ├── chosen_images.png
    ├── em_utilities.py
    ├── images.sf.zip
    ├── kmeans-arrays.npz
    ├── people_wiki.gl.zip
    └── topic_models.zip
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.zip filter=lfs diff=lfs merge=lfs -text
2 | course-4/kmeans-arrays.npz filter=lfs diff=lfs merge=lfs -text
3 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __MACOSX
2 | *.pyc
3 | .ipynb_checkpoints
4 | .DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Specialization
2 |
3 | ## Datasets
4 | ### Course 1
5 | **amazon_baby**
6 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/amazon_baby.gl.zip
7 |
8 | **home_data**
9 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/home_data.gl.zip
10 |
11 | **image_test_data**
12 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/image_test_data.gl.zip
13 |
14 | **image_train_data**
15 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/image_train_data.gl.zip
16 |
17 | **people_wiki.gl**
18 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/people_wiki.gl.zip
19 |
20 | **song_data**
21 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/song_data.gl.zip
22 |
23 |
24 | ### Course 2
25 | * https://s3.amazonaws.com/static.dato.com/files/coursera/course-2/kc_house_data.gl.zip
26 |
27 | ### References
28 |
29 | More information on the Amazon data set may be found [here](http://jmcauley.ucsd.edu/data/amazon/) as well as in the following paper.
30 |
31 | ```
32 | Inferring networks of substitutable and complementary products
33 | J. McAuley, R. Pandey, J. Leskovec
34 | Knowledge Discovery and Data Mining, 2015
35 | ```
36 |
--------------------------------------------------------------------------------
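Note (added for convenience, not part of the original repository): the README lists raw dataset URLs but no download steps. Below is a minimal sketch of how one of those archives could be fetched and unpacked with the Python 3 standard library. The URL is copied verbatim from the README and may no longer be reachable; the local paths are illustrative assumptions.

```python
# Minimal sketch (assumption-laden): download and unpack one dataset archive listed in the README.
# The S3 URL comes straight from the README and may be dead; DEST_DIR is an arbitrary local path.
import os
import urllib.request
import zipfile

URL = "https://s3.amazonaws.com/static.dato.com/files/coursera/course-1/home_data.gl.zip"
DEST_DIR = "course-1"  # illustrative target directory
archive_path = os.path.join(DEST_DIR, "home_data.gl.zip")

os.makedirs(DEST_DIR, exist_ok=True)
urllib.request.urlretrieve(URL, archive_path)  # fetch the zip archive
with zipfile.ZipFile(archive_path) as zf:
    zf.extractall(DEST_DIR)                    # unpacks the home_data.gl/ SFrame directory
print("Extracted to", os.path.join(DEST_DIR, "home_data.gl"))
```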
/course-1/getting-started-with-ipython-notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Installing Python and GraphLab Create"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Please follow the installation instructions here before getting started:\n",
15 | "\n",
16 | "\n",
17 | "## We have done\n",
18 | "* Installed Python\n",
19 | "* Started Ipython Notebook"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# Getting started with Python"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 1,
32 | "metadata": {
33 | "collapsed": false
34 | },
35 | "outputs": [
36 | {
37 | "name": "stdout",
38 | "output_type": "stream",
39 | "text": [
40 | "Hello World!\n"
41 | ]
42 | }
43 | ],
44 | "source": [
45 | "print 'Hello World!'"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Create some variables in Python"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 2,
58 | "metadata": {
59 | "collapsed": true
60 | },
61 | "outputs": [],
62 | "source": [
63 | "i = 4 #int"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 3,
69 | "metadata": {
70 | "collapsed": false
71 | },
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/plain": [
76 | "int"
77 | ]
78 | },
79 | "execution_count": 3,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "type(i)"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 4,
91 | "metadata": {
92 | "collapsed": true
93 | },
94 | "outputs": [],
95 | "source": [
96 | "f = 4.1 #float"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 5,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [
106 | {
107 | "data": {
108 | "text/plain": [
109 | "float"
110 | ]
111 | },
112 | "execution_count": 5,
113 | "metadata": {},
114 | "output_type": "execute_result"
115 | }
116 | ],
117 | "source": [
118 | "type(f)"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 6,
124 | "metadata": {
125 | "collapsed": true
126 | },
127 | "outputs": [],
128 | "source": [
129 | "b = True #boolean variable"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": 7,
135 | "metadata": {
136 | "collapsed": true
137 | },
138 | "outputs": [],
139 | "source": [
140 | "s = \"This is a string!\""
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 8,
146 | "metadata": {
147 | "collapsed": false
148 | },
149 | "outputs": [
150 | {
151 | "name": "stdout",
152 | "output_type": "stream",
153 | "text": [
154 | "This is a string!\n"
155 | ]
156 | }
157 | ],
158 | "source": [
159 | "print s"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "## Advanced python types"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 9,
172 | "metadata": {
173 | "collapsed": true
174 | },
175 | "outputs": [],
176 | "source": [
177 | "l = [3,1,2] #list"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 10,
183 | "metadata": {
184 | "collapsed": false
185 | },
186 | "outputs": [
187 | {
188 | "name": "stdout",
189 | "output_type": "stream",
190 | "text": [
191 | "[3, 1, 2]\n"
192 | ]
193 | }
194 | ],
195 | "source": [
196 | "print l"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 11,
202 | "metadata": {
203 | "collapsed": true
204 | },
205 | "outputs": [],
206 | "source": [
207 | "d = {'foo':1, 'bar':2.3, 's':'my first dictionary'} #dictionary"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 12,
213 | "metadata": {
214 | "collapsed": false
215 | },
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "{'s': 'my first dictionary', 'foo': 1, 'bar': 2.3}\n"
222 | ]
223 | }
224 | ],
225 | "source": [
226 | "print d"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 13,
232 | "metadata": {
233 | "collapsed": false
234 | },
235 | "outputs": [
236 | {
237 | "name": "stdout",
238 | "output_type": "stream",
239 | "text": [
240 | "1\n"
241 | ]
242 | }
243 | ],
244 | "source": [
245 | "print d['foo'] #element of a dictionary"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 14,
251 | "metadata": {
252 | "collapsed": false
253 | },
254 | "outputs": [],
255 | "source": [
256 | "n = None #Python's null type"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 15,
262 | "metadata": {
263 | "collapsed": false
264 | },
265 | "outputs": [
266 | {
267 | "data": {
268 | "text/plain": [
269 | "NoneType"
270 | ]
271 | },
272 | "execution_count": 15,
273 | "metadata": {},
274 | "output_type": "execute_result"
275 | }
276 | ],
277 | "source": [
278 | "type(n)"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "## Advanced printing"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 16,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "Our float value is 4.1. Our int value is 4.\n"
300 | ]
301 | }
302 | ],
303 | "source": [
304 | "print \"Our float value is %s. Our int value is %s.\" % (f,i) #Python is pretty good with strings"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "## Conditional statements in python"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 17,
317 | "metadata": {
318 | "collapsed": false
319 | },
320 | "outputs": [
321 | {
322 | "name": "stdout",
323 | "output_type": "stream",
324 | "text": [
325 | "i or f are both greater than 4.\n"
326 | ]
327 | }
328 | ],
329 | "source": [
330 | "if i == 1 and f > 4:\n",
331 | " print \"The value of i is 1 and f is greater than 4.\"\n",
332 | "elif i > 4 or f > 4:\n",
333 | " print \"i or f are both greater than 4.\"\n",
334 | "else:\n",
335 | " print \"both i and f are less than or equal to 4\"\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "## Conditional loops"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 18,
348 | "metadata": {
349 | "collapsed": false
350 | },
351 | "outputs": [
352 | {
353 | "name": "stdout",
354 | "output_type": "stream",
355 | "text": [
356 | "[3, 1, 2]\n"
357 | ]
358 | }
359 | ],
360 | "source": [
361 | "print l"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 19,
367 | "metadata": {
368 | "collapsed": false
369 | },
370 | "outputs": [
371 | {
372 | "name": "stdout",
373 | "output_type": "stream",
374 | "text": [
375 | "3\n",
376 | "1\n",
377 | "2\n"
378 | ]
379 | }
380 | ],
381 | "source": [
382 | "for e in l:\n",
383 | " print e"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "Note that in Python, we don't use {} or other markers to indicate the part of the loop that gets iterated. Instead, we just indent and align each of the iterated statements with spaces or tabs. (You can use as many as you want, as long as the lines are aligned.)"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": 20,
396 | "metadata": {
397 | "collapsed": false
398 | },
399 | "outputs": [
400 | {
401 | "name": "stdout",
402 | "output_type": "stream",
403 | "text": [
404 | "6\n",
405 | "7\n",
406 | "8\n",
407 | "9\n"
408 | ]
409 | }
410 | ],
411 | "source": [
412 | "counter = 6\n",
413 | "while counter < 10:\n",
414 | " print counter\n",
415 | " counter += 1"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {
421 | "collapsed": true
422 | },
423 | "source": [
424 | "# Creating functions in Python\n",
425 | "\n",
426 | "Again, we don't use {}, but just indent the lines that are part of the function."
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": 21,
432 | "metadata": {
433 | "collapsed": true
434 | },
435 | "outputs": [],
436 | "source": [
437 | "def add2(x):\n",
438 | " y = x + 2\n",
439 | " return y"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 22,
445 | "metadata": {
446 | "collapsed": true
447 | },
448 | "outputs": [],
449 | "source": [
450 | "i = 5"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": 23,
456 | "metadata": {
457 | "collapsed": false
458 | },
459 | "outputs": [
460 | {
461 | "data": {
462 | "text/plain": [
463 | "7"
464 | ]
465 | },
466 | "execution_count": 23,
467 | "metadata": {},
468 | "output_type": "execute_result"
469 | }
470 | ],
471 | "source": [
472 | "add2(i)"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "We can also define simple functions with lambdas:"
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": 24,
485 | "metadata": {
486 | "collapsed": true
487 | },
488 | "outputs": [],
489 | "source": [
490 | "square = lambda x: x*x"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {
497 | "collapsed": true
498 | },
499 | "outputs": [],
500 | "source": []
501 | }
502 | ],
503 | "metadata": {
504 | "kernelspec": {
505 | "display_name": "Python 2",
506 | "language": "python",
507 | "name": "python2"
508 | },
509 | "language_info": {
510 | "codemirror_mode": {
511 | "name": "ipython",
512 | "version": 2
513 | },
514 | "file_extension": ".py",
515 | "mimetype": "text/x-python",
516 | "name": "python",
517 | "nbconvert_exporter": "python",
518 | "pygments_lexer": "ipython2",
519 | "version": "2.7.11"
520 | }
521 | },
522 | "nbformat": 4,
523 | "nbformat_minor": 0
524 | }
525 |
--------------------------------------------------------------------------------
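Aside (not part of the original notebook): the last code cell above defines `square = lambda x: x*x` but never calls it, and the final cell is left empty. A quick usage sketch, written as plain Python rather than a notebook cell:

```python
# Quick usage sketch for the lambda defined at the end of the notebook above.
square = lambda x: x * x

print(square(5))                       # 25
print([square(e) for e in [3, 1, 2]])  # applied across the list l from the notebook: [9, 1, 4]
```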
/course-2/Philadelphia_Crime_Rate_noNA.csv:
--------------------------------------------------------------------------------
1 | HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
140463,14.0463,29.7,10,-1,Abington,Montgome
113033,11.3033,24.1,18,4,Ambler,Montgome
124186,12.4186,19.5,25,8,Aston,Delaware
110490,11.049,49.4,25,2.7,Bensalem,Bucks
79124,7.9124,54.1,19,3.9,Bristol B.,Bucks
92634,9.2634,48.6,20,0.6,Bristol T.,Bucks
89246,8.9246,30.8,15,-2.6,Brookhaven,Delaware
195145,19.5145,10.8,20,-3.5,Bryn Athyn,Montgome
297342,29.7342,20.2,14,0.6,Bryn Mawr,Montgome
264298,26.4298,20.4,26,6,Buckingham,Bucks
134342,13.4342,17.3,31,4.2,Chalfont,Bucks
147600,14.76,50.3,9,-1,Cheltenham,Montgome
77370,7.737,34.2,10,-1.2,Clifton,Delaware
170822,17.0822,33.7,32,2.4,Collegeville,Montgome
40642,4.0642,45.7,15,0,Darby Bor.,Delaware
71359,7.1359,22.3,8,1.6,Darby Town,Delaware
104923,10.4923,48.1,21,6.9,Downingtown,Chester
190317,19.0317,19.4,26,1.9,Doylestown,Bucks
215512,21.5512,71.9,26,5.8,E. Bradford,Chester
178105,17.8105,45.1,25,2.3,E. Goshen,Chester
131025,13.1025,31.3,19,-1.8,E. Norriton,Montgome
149844,14.9844,24.9,22,6.4,E. Pikeland,Chester
170556,17.0556,27.2,30,4.6,E. Whiteland,Chester
280969,28.0969,17.7,14,2.9,Easttown,Chester
114233,11.4233,29,30,1.3,Falls Town,Bucks
74502,7.4502,21.4,15,-3.2,Follcroft,Delaware
475112,47.5112,28.6,12,,Gladwyne,Montgome
97167,9.7167,29.3,10,0.2,Glenolden,Delaware
114572,11.4572,17.5,20,5.2,Hatboro,Montgome
436348,43.6348,16.5,10,-0.7,Haverford,Delaware
389302,38.9302,17.8,20,1.5,Horsham,Montgome
122392,12.2392,17.3,10,1.9,Jenkintown,Montgome
130436,13.0436,31.2,17,-0.4,L Southampton,Delaware
272790,27.279,14.5,20,-5.1,L. Gwynedd,Montgome
194435,19.4435,15.7,32,15,L. Makefield,Bucks
299621,29.9621,28.6,10,1.4,L. Merion,Montgome
210884,21.0884,20.8,20,0.1,L. Moreland,Montgome
112471,11.2471,29.3,35,3.4,Lansdale,Montgome
93738,9.3738,19.3,7,-0.4,Lansdown,Delaware
121024,12.1024,39.5,35,26.9,Limerick,Montgome
156035,15.6035,13,23,6.3,Malvern,Chester
185404,18.5404,24.1,10,0.9,Marple,Delaware
126160,12.616,38,20,-2.4,Media,Delaware
143072,14.3072,40.1,23,1.6,Middletown,Bucks
96769,9.6769,36.1,15,5.1,Morrisville,Bucks
94014,9.4014,26.6,14,0.5,Morton,Delaware
118214,11.8214,25.1,25,5.7,N. Wales,Montgome
157446,15.7446,14.6,15,3.1,Narberth,Montgome
150283,15.0283,18.2,15,0.9,Nether,Delaware
153842,15.3842,15.3,23,8.5,Newtown,Bucks
197214,19.7214,15.2,25,2.1,Newtown B.,Bucks
206127,20.6127,17.4,15,2.7,Newtown T.,Delaware
71981,7.1981,73.3,19,4.9,Norristown,Montgome
169401,16.9401,7.1,22,1.5,Northampton,Bucks
99843,9.9843,12.5,12,-3.7,Norwood,Delaware
60000,6,45.8,18,-1.4,"Phila, Far NE",Phila
28000,2.8,44.9,5.5,-8.4,"Phila, N",Phila
60000,6,65,9,-4.9,"Phila, NE",Phila
61800,6.18,49.9,9,-6.4,"Phila, NW",Phila
38000,3.8,54.8,4.5,-5.1,"Phila, SW",Phila
38000,3.8,53.5,2,-9.2,"Phila, South",Phila
42000,4.2,69.9,4,-5.7,"Phila, West",Phila
96200,9.62,366.1,0,4.8,"Phila,CC",Phila
103087,10.3087,24.6,24,3.9,Phoenixville,Chester
147720,14.772,58.6,25,1.5,Plymouth,Montgome
78175,7.8175,53.2,41,2.2,Pottstown,Montgome
92215,9.2215,17.4,14,7.8,Prospect Park,Delaware
271804,27.1804,15.5,17,1.2,Radnor,Delaware
119566,11.9566,14.5,12,-2.9,Ridley Park,Delaware
100231,10.0231,24.1,15,1.9,Ridley Town,Delaware
95831,9.5831,21.2,32,3.2,Royersford,Montgome
229711,22.9711,9.8,22,5.3,Schuylkill,Chester
74308,7.4308,29.9,7,1.8,Sharon Hill,Delaware
259506,25.9506,7.2,40,17.4,Solebury,Bucks
159573,15.9573,19.4,15,-2.1,Springfield,Montgome
147176,14.7176,41.1,12,-1.7,Springfield,Delaware
205732,20.5732,11.2,12,-0.2,Swarthmore,Delaware
215783,21.5783,21.2,20,1.1,Tredyffin,Chester
116710,11.671,42.8,20,12.9,U. Chichester,Delaware
359112,35.9112,9.4,36,4,U. Makefield,Bucks
189959,18.9959,61.7,22,-2.1,U. Merion,Montgome
133198,13.3198,19.4,22,-2,U. Moreland,Montgome
242821,24.2821,6.6,21,1.6,U. Providence,Delaware
142811,14.2811,15.9,20,-1.6,U. Southampton,Bucks
200498,20.0498,18.8,36,11,U. Uwchlan,Chester
199065,19.9065,13.2,20,7.8,Upper Darby,Montgome
93648,9.3648,34.5,8,-0.7,Upper Darby,Delaware
163001,16.3001,22.1,50,8,Uwchlan T.,Chester
436348,43.6348,22.1,15,1.3,Villanova,Montgome
124478,12.4478,71.9,22,4.6,W. Chester,Chester
168276,16.8276,31.9,26,5.9,W. Goshen,Chester
114157,11.4157,44.6,38,14.6,W. Whiteland,Chester
130088,13.0088,28.6,19,-0.2,Warminster,Bucks
152624,15.2624,24,19,23.1,Warrington,Bucks
174232,17.4232,13.8,25,4.7,Westtown,Chester
196515,19.6515,29.9,16,1.8,Whitemarsh,Montgome
232714,23.2714,9.9,21,0.2,Willistown,Chester
245920,24.592,22.6,10,0.3,Wynnewood,Montgome
130953,13.0953,13,24,5.2,Yardley,Bucks
--------------------------------------------------------------------------------
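Aside (not part of the original repository): the CSV above is consumed by course-2/PhillyCrime.ipynb through GraphLab Create. If you only want to inspect the file outside GraphLab, here is a minimal sketch using pandas (an assumed dependency; the course itself works with SFrames):

```python
# Minimal sketch (assumes pandas is installed): peek at the Philadelphia crime/house-price CSV.
import pandas as pd

crime = pd.read_csv("course-2/Philadelphia_Crime_Rate_noNA.csv")
print(crime.columns.tolist())                          # HousePrice, HsPrc ($10,000), CrimeRate, ...
print(crime[["HousePrice", "CrimeRate"]].describe())   # quick summary of the two key columns
```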
/course-2/kc_house_data.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:8cc2700874aa4aa3db885807ef76b9b9b1622323fe1a509272bad5e4288ade83
3 | size 930178
4 |
--------------------------------------------------------------------------------
/course-2/kc_house_data_small.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:f483f25d3832ada232530454d4ef9dc051c49209aee7a05466f5dff171ea4cf4
3 | size 381539
4 |
--------------------------------------------------------------------------------
/course-2/numpy-tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "#Numpy Tutorial"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Numpy is a computational library for Python that is optimized for operations on multi-dimensional arrays. In this notebook we will use numpy to work with 1-d arrays (often called vectors) and 2-d arrays (often called matrices).\n",
15 | "\n",
16 | "For a the full user guide and reference for numpy see: http://docs.scipy.org/doc/numpy/"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "metadata": {
23 | "collapsed": true
24 | },
25 | "outputs": [],
26 | "source": [
27 | "import numpy as np # importing this way allows us to refer to numpy as np"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "# Creating Numpy Arrays"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "New arrays can be made in several ways. We can take an existing list and convert it to a numpy array:"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {
48 | "collapsed": false
49 | },
50 | "outputs": [],
51 | "source": [
52 | "mylist = [1., 2., 3., 4.]\n",
53 | "mynparray = np.array(mylist)\n",
54 | "mynparray"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "You can initialize an array (of any dimension) of all ones or all zeroes with the ones() and zeros() functions:"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": null,
67 | "metadata": {
68 | "collapsed": false
69 | },
70 | "outputs": [],
71 | "source": [
72 | "one_vector = np.ones(4)\n",
73 | "print one_vector # using print removes the array() portion"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "one2Darray = np.ones((2, 4)) # an 2D array with 2 \"rows\" and 4 \"columns\"\n",
85 | "print one2Darray"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [],
95 | "source": [
96 | "zero_vector = np.zeros(4)\n",
97 | "print zero_vector"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "You can also initialize an empty array which will be filled with values. This is the fastest way to initialize a fixed-size numpy array however you must ensure that you replace all of the values."
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {
111 | "collapsed": false
112 | },
113 | "outputs": [],
114 | "source": [
115 | "empty_vector = np.empty(5)\n",
116 | "print empty_vector"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "#Accessing array elements"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "Accessing an array is straight forward. For vectors you access the index by referring to it inside square brackets. Recall that indices in Python start with 0."
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {
137 | "collapsed": false
138 | },
139 | "outputs": [],
140 | "source": [
141 | "mynparray[2]"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "2D arrays are accessed similarly by referring to the row and column index separated by a comma:"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {
155 | "collapsed": false
156 | },
157 | "outputs": [],
158 | "source": [
159 | "my_matrix = np.array([[1, 2, 3], [4, 5, 6]])\n",
160 | "print my_matrix"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "collapsed": false
168 | },
169 | "outputs": [],
170 | "source": [
171 | "print my_matrix[1, 2]"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Sequences of indices can be accessed using ':' for example"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "collapsed": false
186 | },
187 | "outputs": [],
188 | "source": [
189 | "print my_matrix[0:2, 2] # recall 0:2 = [0, 1]"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": null,
195 | "metadata": {
196 | "collapsed": false
197 | },
198 | "outputs": [],
199 | "source": [
200 | "print my_matrix[0, 0:3]"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "You can also pass a list of indices. "
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {
214 | "collapsed": false
215 | },
216 | "outputs": [],
217 | "source": [
218 | "fib_indices = np.array([1, 1, 2, 3])\n",
219 | "random_vector = np.random.random(10) # 10 random numbers between 0 and 1\n",
220 | "print random_vector"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {
227 | "collapsed": false
228 | },
229 | "outputs": [],
230 | "source": [
231 | "print random_vector[fib_indices]"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "You can also use true/false values to select values"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {
245 | "collapsed": false
246 | },
247 | "outputs": [],
248 | "source": [
249 | "my_vector = np.array([1, 2, 3, 4])\n",
250 | "select_index = np.array([True, False, True, False])\n",
251 | "print my_vector[select_index]"
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "For 2D arrays you can select specific columns and specific rows. Passing ':' selects all rows/columns"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {
265 | "collapsed": false
266 | },
267 | "outputs": [],
268 | "source": [
269 | "select_cols = np.array([True, False, True]) # 1st and 3rd column\n",
270 | "select_rows = np.array([False, True]) # 2nd row"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {
277 | "collapsed": false
278 | },
279 | "outputs": [],
280 | "source": [
281 | "print my_matrix[select_rows, :] # just 2nd row but all columns"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {
288 | "collapsed": false
289 | },
290 | "outputs": [],
291 | "source": [
292 | "print my_matrix[:, select_cols] # all rows and just the 1st and 3rd column"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "#Operations on Arrays"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "You can use the operations '\\*', '\\*\\*', '\\\\', '+' and '-' on numpy arrays and they operate elementwise."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {
313 | "collapsed": false
314 | },
315 | "outputs": [],
316 | "source": [
317 | "my_array = np.array([1., 2., 3., 4.])\n",
318 | "print my_array*my_array"
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": null,
324 | "metadata": {
325 | "collapsed": false
326 | },
327 | "outputs": [],
328 | "source": [
329 | "print my_array**2"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {
336 | "collapsed": false
337 | },
338 | "outputs": [],
339 | "source": [
340 | "print my_array - np.ones(4)"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {
347 | "collapsed": false
348 | },
349 | "outputs": [],
350 | "source": [
351 | "print my_array + np.ones(4)"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [],
361 | "source": [
362 | "print my_array / 3"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": false
370 | },
371 | "outputs": [],
372 | "source": [
373 | "print my_array / np.array([2., 3., 4., 5.]) # = [1.0/2.0, 2.0/3.0, 3.0/4.0, 4.0/5.0]"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "You can compute the sum with np.sum() and the average with np.average()"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": null,
386 | "metadata": {
387 | "collapsed": false
388 | },
389 | "outputs": [],
390 | "source": [
391 | "print np.sum(my_array)"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {
398 | "collapsed": false
399 | },
400 | "outputs": [],
401 | "source": [
402 | "print np.average(my_array)"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": null,
408 | "metadata": {
409 | "collapsed": false
410 | },
411 | "outputs": [],
412 | "source": [
413 | "print np.sum(my_array)/len(my_array)"
414 | ]
415 | },
416 | {
417 | "cell_type": "markdown",
418 | "metadata": {},
419 | "source": [
420 | "#The dot product"
421 | ]
422 | },
423 | {
424 | "cell_type": "markdown",
425 | "metadata": {},
426 | "source": [
427 | "An important mathematical operation in linear algebra is the dot product. \n",
428 | "\n",
429 | "When we compute the dot product between two vectors we are simply multiplying them elementwise and adding them up. In numpy you can do this with np.dot()"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {
436 | "collapsed": false
437 | },
438 | "outputs": [],
439 | "source": [
440 | "array1 = np.array([1., 2., 3., 4.])\n",
441 | "array2 = np.array([2., 3., 4., 5.])\n",
442 | "print np.dot(array1, array2)"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": null,
448 | "metadata": {
449 | "collapsed": false
450 | },
451 | "outputs": [],
452 | "source": [
453 | "print np.sum(array1*array2)"
454 | ]
455 | },
456 | {
457 | "cell_type": "markdown",
458 | "metadata": {},
459 | "source": [
460 | "Recall that the Euclidean length (or magnitude) of a vector is the squareroot of the sum of the squares of the components. This is just the squareroot of the dot product of the vector with itself:"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": null,
466 | "metadata": {
467 | "collapsed": false
468 | },
469 | "outputs": [],
470 | "source": [
471 | "array1_mag = np.sqrt(np.dot(array1, array1))\n",
472 | "print array1_mag"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {
479 | "collapsed": false
480 | },
481 | "outputs": [],
482 | "source": [
483 | "print np.sqrt(np.sum(array1*array1))"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "We can also use the dot product when we have a 2D array (or matrix). When you have an vector with the same number of elements as the matrix (2D array) has columns you can right-multiply the matrix by the vector to get another vector with the same number of elements as the matrix has rows. For example this is how you compute the predicted values given a matrix of features and an array of weights."
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {
497 | "collapsed": false
498 | },
499 | "outputs": [],
500 | "source": [
501 | "my_features = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])\n",
502 | "print my_features"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": null,
508 | "metadata": {
509 | "collapsed": false
510 | },
511 | "outputs": [],
512 | "source": [
513 | "my_weights = np.array([0.4, 0.5])\n",
514 | "print my_weights"
515 | ]
516 | },
517 | {
518 | "cell_type": "code",
519 | "execution_count": null,
520 | "metadata": {
521 | "collapsed": false
522 | },
523 | "outputs": [],
524 | "source": [
525 | "my_predictions = np.dot(my_features, my_weights) # note that the weights are on the right\n",
526 | "print my_predictions # which has 4 elements since my_features has 4 rows"
527 | ]
528 | },
529 | {
530 | "cell_type": "markdown",
531 | "metadata": {},
532 | "source": [
533 | "Similarly if you have a vector with the same number of elements as the matrix has *rows* you can left multiply them."
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": null,
539 | "metadata": {
540 | "collapsed": true
541 | },
542 | "outputs": [],
543 | "source": [
544 | "my_matrix = my_features\n",
545 | "my_array = np.array([0.3, 0.4, 0.5, 0.6])"
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": null,
551 | "metadata": {
552 | "collapsed": false
553 | },
554 | "outputs": [],
555 | "source": [
556 | "print np.dot(my_array, my_matrix) # which has 2 elements because my_matrix has 2 columns"
557 | ]
558 | },
559 | {
560 | "cell_type": "markdown",
561 | "metadata": {},
562 | "source": [
563 | "#Multiplying Matrices"
564 | ]
565 | },
566 | {
567 | "cell_type": "markdown",
568 | "metadata": {},
569 | "source": [
570 | "If we have two 2D arrays (matrices) matrix_1 and matrix_2 where the number of columns of matrix_1 is the same as the number of rows of matrix_2 then we can use np.dot() to perform matrix multiplication."
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "execution_count": null,
576 | "metadata": {
577 | "collapsed": false
578 | },
579 | "outputs": [],
580 | "source": [
581 | "matrix_1 = np.array([[1., 2., 3.],[4., 5., 6.]])\n",
582 | "print matrix_1"
583 | ]
584 | },
585 | {
586 | "cell_type": "code",
587 | "execution_count": null,
588 | "metadata": {
589 | "collapsed": false
590 | },
591 | "outputs": [],
592 | "source": [
593 | "matrix_2 = np.array([[1., 2.], [3., 4.], [5., 6.]])\n",
594 | "print matrix_2"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": null,
600 | "metadata": {
601 | "collapsed": false
602 | },
603 | "outputs": [],
604 | "source": [
605 | "print np.dot(matrix_1, matrix_2)"
606 | ]
607 | }
608 | ],
609 | "metadata": {
610 | "kernelspec": {
611 | "display_name": "Python 2",
612 | "language": "python",
613 | "name": "python2"
614 | },
615 | "language_info": {
616 | "codemirror_mode": {
617 | "name": "ipython",
618 | "version": 2
619 | },
620 | "file_extension": ".py",
621 | "mimetype": "text/x-python",
622 | "name": "python",
623 | "nbconvert_exporter": "python",
624 | "pygments_lexer": "ipython2",
625 | "version": "2.7.10"
626 | }
627 | },
628 | "nbformat": 4,
629 | "nbformat_minor": 0
630 | }
631 |
--------------------------------------------------------------------------------
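Aside (not part of the original notebook): the tutorial above ends with matrix multiplication, which is exactly the pattern the regression assignments in this course rely on. Here is a short illustrative follow-on showing predictions as a feature-matrix/weight-vector dot product and the residual sum of squares as the dot product of the error vector with itself; all values below are made up for illustration.

```python
# Illustrative follow-on to the numpy tutorial above (values are made up):
# predictions = features . weights, and RSS = residuals . residuals.
import numpy as np

features = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])  # same shape as my_features above
weights = np.array([0.4, 0.5])                                  # same values as my_weights above
observed = np.array([1.5, 3.2, 5.1, 6.9])                       # made-up observed outputs

predictions = np.dot(features, weights)  # one prediction per row of the feature matrix
residuals = observed - predictions       # elementwise errors
rss = np.dot(residuals, residuals)       # sum of squared residuals
print(predictions)  # [1.4 3.2 5.  6.8]
print(rss)          # 0.03 (approximately)
```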
/course-2/week-1-simple-regression-assignment-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 1: Simple Linear Regression"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:\n",
15 | "* Use graphlab SArray and SFrame functions to compute important summary statistics\n",
16 | "* Write a function to compute the Simple Linear Regression weights using the closed form solution\n",
17 | "* Write a function to make predictions of the output given the input feature\n",
18 | "* Turn the regression around to predict the input given the output\n",
19 | "* Compare two different models for predicting house prices\n",
20 | "\n",
21 | "In this notebook you will be provided with some already complete code as well as some code that you should complete yourself in order to answer quiz questions. The code we provide to complte is optional and is there to assist you with solving the problems but feel free to ignore the helper code and write your own."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# Fire up graphlab create"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "collapsed": false
36 | },
37 | "outputs": [],
38 | "source": [
39 | "import graphlab"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "# Load house sales data\n",
47 | "\n",
48 | "Dataset is from house sales in King County, the region where the city of Seattle, WA is located."
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {
55 | "collapsed": false
56 | },
57 | "outputs": [],
58 | "source": [
59 | "sales = graphlab.SFrame('kc_house_data.gl/')"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "# Split data into training and testing"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you). "
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "train_data,test_data = sales.random_split(.8,seed=0)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "# Useful SFrame summary functions"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "In order to make use of the closed form solution as well as take advantage of graphlab's built in functions we will review some important ones. In particular:\n",
99 | "* Computing the sum of an SArray\n",
100 | "* Computing the arithmetic average (mean) of an SArray\n",
101 | "* multiplying SArrays by constants\n",
102 | "* multiplying SArrays by other SArrays"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [],
112 | "source": [
113 | "# Let's compute the mean of the House Prices in King County in 2 different ways.\n",
114 | "prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray\n",
115 | "\n",
116 | "# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:\n",
117 | "sum_prices = prices.sum()\n",
118 | "num_houses = prices.size() # when prices is an SArray .size() returns its length\n",
119 | "avg_price_1 = sum_prices/num_houses\n",
120 | "avg_price_2 = prices.mean() # if you just want the average, the .mean() function\n",
121 | "print \"average price via method 1: \" + str(avg_price_1)\n",
122 | "print \"average price via method 2: \" + str(avg_price_2)"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "As we see we get the same answer both ways"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {
136 | "collapsed": false
137 | },
138 | "outputs": [],
139 | "source": [
140 | "# if we want to multiply every price by 0.5 it's a simple as:\n",
141 | "half_prices = 0.5*prices\n",
142 | "# Let's compute the sum of squares of price. We can multiply two SArrays of the same length elementwise also with *\n",
143 | "prices_squared = prices*prices\n",
144 | "sum_prices_squared = prices_squared.sum() # price_squared is an SArray of the squares and we want to add them up.\n",
145 | "print \"the sum of price squared is: \" + str(sum_prices_squared)"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "Aside: The python notation x.xxe+yy means x.xx \\* 10^(yy). e.g 100 = 10^2 = 1*10^2 = 1e2 "
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "# Build a generic simple linear regression function "
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "Armed with these SArray functions we can use the closed form solution found from lecture to compute the slope and intercept for a simple linear regression on observations stored as SArrays: input_feature, output.\n",
167 | "\n",
168 | "Complete the following function (or write your own) to compute the simple linear regression slope and intercept:"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": false
176 | },
177 | "outputs": [],
178 | "source": [
179 | "def simple_linear_regression(input_feature, output):\n",
180 | " # compute the sum of input_feature and output\n",
181 | " \n",
182 | " # compute the product of the output and the input_feature and its sum\n",
183 | " \n",
184 | " # compute the squared value of the input_feature and its sum\n",
185 | " \n",
186 | " # use the formula for the slope\n",
187 | " \n",
188 | " # use the formula for the intercept\n",
189 | " \n",
190 | " return (intercept, slope)"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "We can test that our function works by passing it something where we know the answer. In particular we can generate a feature and then put the output exactly on a line: output = 1 + 1\\*input_feature then we know both our slope and intercept should be 1"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {
204 | "collapsed": false,
205 | "scrolled": true
206 | },
207 | "outputs": [],
208 | "source": [
209 | "test_feature = graphlab.SArray(range(5))\n",
210 | "test_output = graphlab.SArray(1 + 1*test_feature)\n",
211 | "(test_intercept, test_slope) = simple_linear_regression(test_feature, test_output)\n",
212 | "print \"Intercept: \" + str(test_intercept)\n",
213 | "print \"Slope: \" + str(test_slope)"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Now that we know it works let's build a regression model for predicting price based on sqft_living. Rembember that we train on train_data!"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {
227 | "collapsed": false
228 | },
229 | "outputs": [],
230 | "source": [
231 | "sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])\n",
232 | "\n",
233 | "print \"Intercept: \" + str(sqft_intercept)\n",
234 | "print \"Slope: \" + str(sqft_slope)"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "# Predicting Values"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "Now that we have the model parameters: intercept & slope we can make predictions. Using SArrays it's easy to multiply an SArray by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope and intercept:"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {
255 | "collapsed": false
256 | },
257 | "outputs": [],
258 | "source": [
259 | "def get_regression_predictions(input_feature, intercept, slope):\n",
260 | " # calculate the predicted values:\n",
261 | " \n",
262 | " return predicted_values"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "Now that we can calculate a prediction given the slope and intercept let's make a prediction. Use (or alter) the following to find out the estimated price for a house with 2650 squarefeet according to the squarefeet model we estiamted above.\n",
270 | "\n",
271 | "**Quiz Question: Using your Slope and Intercept from (4), What is the predicted price for a house with 2650 sqft?**"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": null,
277 | "metadata": {
278 | "collapsed": false
279 | },
280 | "outputs": [],
281 | "source": [
282 | "my_house_sqft = 2650\n",
283 | "estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)\n",
284 | "print \"The estimated price for a house with %d squarefeet is $%.2f\" % (my_house_sqft, estimated_price)"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "# Residual Sum of Squares"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "Now that we have a model and can make predictions let's evaluate our model using Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals and the residuals is just a fancy word for the difference between the predicted output and the true output. \n",
299 | "\n",
300 | "Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": null,
306 | "metadata": {
307 | "collapsed": true
308 | },
309 | "outputs": [],
310 | "source": [
311 | "def get_residual_sum_of_squares(input_feature, output, intercept, slope):\n",
312 | " # First get the predictions\n",
313 | "\n",
314 | " # then compute the residuals (since we are squaring it doesn't matter which order you subtract)\n",
315 | "\n",
316 | " # square the residuals and add them up\n",
317 | "\n",
318 | " return(RSS)"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [],
335 | "source": [
336 | "print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope) # should be 0.0"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {},
342 | "source": [
343 | "Now use your function to calculate the RSS on training data from the squarefeet model calculated above.\n",
344 | "\n",
345 | "**Quiz Question: According to this function and the slope and intercept from the squarefeet model What is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data?**"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": null,
351 | "metadata": {
352 | "collapsed": false
353 | },
354 | "outputs": [],
355 | "source": [
356 | "rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)\n",
357 | "print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "# Predict the squarefeet given price"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "What if we want to predict the squarefoot given the price? Since we have an equation y = a + b\\*x we can solve the function for x. So that if we have the intercept (a) and the slope (b) and the price (y) we can solve for the estimated squarefeet (x).\n",
372 | "\n",
373 | "Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output."
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {
380 | "collapsed": true
381 | },
382 | "outputs": [],
383 | "source": [
384 | "def inverse_regression_predictions(output, intercept, slope):\n",
385 | " # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:\n",
386 | "\n",
387 | " return estimated_feature"
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "metadata": {},
393 | "source": [
394 | "Now that we have a function to compute the squarefeet given the price from our simple regression model let's see how big we might expect a house that costs $800,000 to be.\n",
395 | "\n",
396 | "**Quiz Question: According to this function and the regression slope and intercept from (3) what is the estimated square-feet for a house costing $800,000?**"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {
403 | "collapsed": false
404 | },
405 | "outputs": [],
406 | "source": [
407 | "my_house_price = 800000\n",
408 | "estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)\n",
409 | "print \"The estimated squarefeet for a house worth $%.2f is %d\" % (my_house_price, estimated_squarefeet)"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "# New Model: estimate prices from bedrooms"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "We have made one model for predicting house prices using squarefeet, but there are many other features in the sales SFrame. \n",
424 | "Use your simple linear regression function to estimate the regression parameters from predicting Prices based on number of bedrooms. Use the training data!"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "collapsed": false
432 | },
433 | "outputs": [],
434 | "source": [
435 | "# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'\n",
436 | "\n"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "# Test your Linear Regression Algorithm"
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.\n",
451 | "\n",
452 | "**Quiz Question: Which model (square feet or bedrooms) has lowest RSS on TEST data? Think about why this might be the case.**"
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "execution_count": null,
458 | "metadata": {
459 | "collapsed": false
460 | },
461 | "outputs": [],
462 | "source": [
463 | "# Compute RSS when using bedrooms on TEST data:\n"
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "metadata": {
470 | "collapsed": false
471 | },
472 | "outputs": [],
473 | "source": [
474 | "# Compute RSS when using squarefeet on TEST data:\n"
475 | ]
476 | }
477 | ],
478 | "metadata": {
479 | "kernelspec": {
480 | "display_name": "Python 2",
481 | "language": "python",
482 | "name": "python2"
483 | },
484 | "language_info": {
485 | "codemirror_mode": {
486 | "name": "ipython",
487 | "version": 2
488 | },
489 | "file_extension": ".py",
490 | "mimetype": "text/x-python",
491 | "name": "python",
492 | "nbconvert_exporter": "python",
493 | "pygments_lexer": "ipython2",
494 | "version": "2.7.11"
495 | }
496 | },
497 | "nbformat": 4,
498 | "nbformat_minor": 0
499 | }
500 |
--------------------------------------------------------------------------------
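Aside (not part of the original notebook): the blank cells above are meant to be completed by the learner, so they are left untouched. For reference only, here is the standard closed-form least-squares solution for simple linear regression that the notebook refers to, sketched with plain numpy instead of SArrays; the function and variable names are illustrative, not the assignment's required API.

```python
# Reference sketch of the closed-form least-squares fit for y = intercept + slope * x,
# written with plain numpy rather than GraphLab SArrays (names are illustrative).
import numpy as np

def fit_simple_linear_regression(x, y):
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy = (x * y).sum()
    sum_xx = (x * x).sum()
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Sanity check mirroring the notebook's test: points exactly on y = 1 + 1*x.
x = np.arange(5, dtype=float)
y = 1 + 1 * x
print(fit_simple_linear_regression(x, y))  # (1.0, 1.0)
```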
/course-2/week-2-multiple-regression-assignment-1-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 2: Multiple Regression (Interpretation)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "The goal of this first notebook is to explore multiple regression and feature engineering with existing graphlab functions.\n",
15 | "\n",
16 | "In this notebook you will use data on house sales in King County to predict prices using multiple regression. You will:\n",
17 | "* Use SFrames to do some feature engineering\n",
18 | "* Use built-in graphlab functions to compute the regression weights (coefficients/parameters)\n",
19 | "* Given the regression weights, predictors and outcome write a function to compute the Residual Sum of Squares\n",
20 | "* Look at coefficients and interpret their meanings\n",
21 | "* Evaluate multiple models via RSS"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# Fire up graphlab create"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "collapsed": true
36 | },
37 | "outputs": [],
38 | "source": [
39 | "import graphlab"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "# Load in house sales data\n",
47 | "\n",
48 | "Dataset is from house sales in King County, the region where the city of Seattle, WA is located."
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {
55 | "collapsed": false,
56 | "scrolled": true
57 | },
58 | "outputs": [],
59 | "source": [
60 | "sales = graphlab.SFrame('kc_house_data.gl/')"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "# Split data into training and testing.\n",
68 | "We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you). "
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "train_data,test_data = sales.random_split(.8,seed=0)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "# Learning a multiple regression model"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "Recall we can use the following code to learn a multiple regression model predicting 'price' based on the following features:\n",
94 | "example_features = ['sqft_living', 'bedrooms', 'bathrooms'] on training data with the following code:\n",
95 | "\n",
96 | "(Aside: We set validation_set = None to ensure that the results are always the same)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [],
106 | "source": [
107 | "example_features = ['sqft_living', 'bedrooms', 'bathrooms']\n",
108 | "example_model = graphlab.linear_regression.create(train_data, target = 'price', features = example_features, \n",
109 | " validation_set = None)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "Now that we have fitted the model we can extract the regression weights (coefficients) as an SFrame as follows:"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {
123 | "collapsed": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "example_weight_summary = example_model.get(\"coefficients\")\n",
128 | "print example_weight_summary"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "# Making Predictions\n",
136 | "\n",
137 | "In the gradient descent notebook we use numpy to do our regression. In this book we will use existing graphlab create functions to analyze multiple regressions. \n",
138 | "\n",
139 | "Recall that once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [],
149 | "source": [
150 | "example_predictions = example_model.predict(train_data)\n",
151 | "print example_predictions[0] # should be 271789.505878"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "# Compute RSS"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "Now that we can make predictions given the model, let's write a function to compute the RSS of the model. Complete the function below to calculate RSS given the model, data, and the outcome."
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {
172 | "collapsed": true
173 | },
174 | "outputs": [],
175 | "source": [
176 | "def get_residual_sum_of_squares(model, data, outcome):\n",
177 | " # First get the predictions\n",
178 | "    predictions = model.predict(data)\n",
179 | " # Then compute the residuals/errors\n",
180 | "    residuals = outcome - predictions\n",
181 | " # Then square and add them up\n",
182 | "    RSS = (residuals * residuals).sum()\n",
183 | " return(RSS) "
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "Test your function by computing the RSS on TEST data for the example model:"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "collapsed": false
198 | },
199 | "outputs": [],
200 | "source": [
201 | "rss_example_train = get_residual_sum_of_squares(example_model, test_data, test_data['price'])\n",
202 | "print rss_example_train # should be 2.7376153833e+14"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "# Create some new features"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms), we can also consider transformations of existing features, e.g. the log of the squarefeet, or even \"interaction\" features such as the product of bedrooms and bathrooms."
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "You will use the logarithm function to create a new feature, so first you should import it from the math library."
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {
230 | "collapsed": true
231 | },
232 | "outputs": [],
233 | "source": [
234 | "from math import log"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "Next create the following 4 new features as columns in both TEST and TRAIN data:\n",
242 | "* bedrooms_squared = bedrooms\\*bedrooms\n",
243 | "* bed_bath_rooms = bedrooms\\*bathrooms\n",
244 | "* log_sqft_living = log(sqft_living)\n",
245 | "* lat_plus_long = lat + long \n",
246 | "As an example here's the first one:"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {
253 | "collapsed": true
254 | },
255 | "outputs": [],
256 | "source": [
257 | "train_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x**2)\n",
258 | "test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x**2)"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {
265 | "collapsed": true
266 | },
267 | "outputs": [],
268 | "source": [
269 | "# create the remaining 3 features in both TEST and TRAIN data\n",
270 | "\n"
271 | ]
272 | },
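273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "One possible way to create the remaining three features is sketched below (just a sketch, assuming the train_data and test_data SFrames from above; any equivalent approach works):"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {
284 | "collapsed": true
285 | },
286 | "outputs": [],
287 | "source": [
288 | "# possible sketch: add the remaining features to both TRAIN and TEST data\n",
289 | "for data in (train_data, test_data):\n",
290 | "    data['bed_bath_rooms'] = data['bedrooms'] * data['bathrooms']\n",
291 | "    data['log_sqft_living'] = data['sqft_living'].apply(lambda x: log(x))\n",
292 | "    data['lat_plus_long'] = data['lat'] + data['long']"
293 | ]
294 | },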
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "* Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.\n",
278 | "* bedrooms times bathrooms gives what's called an \"interaction\" feature. It is large when *both* of them are large.\n",
279 | "* Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.\n",
280 | "* Adding latitude to longitude is totally nonsensical, but we will do it anyway (you'll see why)."
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "**Quiz Question: What is the mean (arithmetic average) value of your 4 new features on TEST data? (round to 2 digits)**"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {
294 | "collapsed": true
295 | },
296 | "outputs": [],
297 | "source": []
298 | },
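299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "A minimal sketch for computing these means (assuming the four new columns were added to test_data as above):"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {
310 | "collapsed": true
311 | },
312 | "outputs": [],
313 | "source": [
314 | "# possible sketch: mean of each new feature on TEST data, rounded to 2 digits\n",
315 | "new_features = ['bedrooms_squared', 'bed_bath_rooms', 'log_sqft_living', 'lat_plus_long']\n",
316 | "for feature in new_features:\n",
317 | "    print feature, round(test_data[feature].mean(), 2)"
318 | ]
319 | },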
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "# Learning Multiple Models"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features, the second model will add one more feature, and the third will add a few more:\n",
311 | "* Model 1: squarefeet, # bedrooms, # bathrooms, latitude & longitude\n",
312 | "* Model 2: add bedrooms\\*bathrooms\n",
313 | "* Model 3: Add log squarefeet, bedrooms squared, and the (nonsensical) latitude + longitude"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {
320 | "collapsed": true
321 | },
322 | "outputs": [],
323 | "source": [
324 | "model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']\n",
325 | "model_2_features = model_1_features + ['bed_bath_rooms']\n",
326 | "model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "Now that you have the features, learn the weights for the three different models for predicting target = 'price' using graphlab.linear_regression.create() and look at the value of the weights/coefficients:"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "metadata": {
340 | "collapsed": true
341 | },
342 | "outputs": [],
343 | "source": [
344 | "# Learn the three models: (don't forget to set validation_set = None)\n"
345 | ]
346 | },
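347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "As a sketch, the three models could be learned the same way as the example model above (the names model_1, model_2 and model_3 are just one possible choice):"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {
358 | "collapsed": true
359 | },
360 | "outputs": [],
361 | "source": [
362 | "# possible sketch: learn the three models (validation_set = None for reproducible results)\n",
363 | "model_1 = graphlab.linear_regression.create(train_data, target = 'price', features = model_1_features, validation_set = None)\n",
364 | "model_2 = graphlab.linear_regression.create(train_data, target = 'price', features = model_2_features, validation_set = None)\n",
365 | "model_3 = graphlab.linear_regression.create(train_data, target = 'price', features = model_3_features, validation_set = None)"
366 | ]
367 | },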
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {
351 | "collapsed": true
352 | },
353 | "outputs": [],
354 | "source": [
355 | "# Examine/extract each model's coefficients:\n"
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "**Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?**\n",
363 | "\n",
364 | "**Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?**\n",
365 | "\n",
366 | "Think about what this means."
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "# Comparing multiple models\n",
374 | "\n",
375 | "Now that you've learned three models and extracted the model weights we want to evaluate which model is best."
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "First use your functions from earlier to compute the RSS on TRAINING Data for each of the three models."
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "metadata": {
389 | "collapsed": true
390 | },
391 | "outputs": [],
392 | "source": [
393 | "# Compute the RSS on TRAINING data for each of the three models and record the values:\n"
394 | ]
395 | },
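396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "For example (a sketch, assuming model_1, model_2 and model_3 were learned as in the sketch above):"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": null,
406 | "metadata": {
407 | "collapsed": true
408 | },
409 | "outputs": [],
410 | "source": [
411 | "# possible sketch: RSS on TRAINING data for each of the three models\n",
412 | "for name, model in [('model 1', model_1), ('model 2', model_2), ('model 3', model_3)]:\n",
413 | "    print name, get_residual_sum_of_squares(model, train_data, train_data['price'])"
414 | ]
415 | },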
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "**Quiz Question: Which model (1, 2 or 3) has lowest RSS on TRAINING Data?** Is this what you expected?"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Now compute the RSS on TEST data for each of the three models."
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {
414 | "collapsed": true
415 | },
416 | "outputs": [],
417 | "source": [
418 | "# Compute the RSS on TESTING data for each of the three models and record the values:\n"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "**Quiz Question: Which model (1, 2 or 3) has lowest RSS on TESTING Data?** Is this what you expected? Think about the features that were added to each model from the previous."
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {
432 | "collapsed": true
433 | },
434 | "outputs": [],
435 | "source": []
436 | }
437 | ],
438 | "metadata": {
439 | "kernelspec": {
440 | "display_name": "Python 2",
441 | "language": "python",
442 | "name": "python2"
443 | },
444 | "language_info": {
445 | "codemirror_mode": {
446 | "name": "ipython",
447 | "version": 2
448 | },
449 | "file_extension": ".py",
450 | "mimetype": "text/x-python",
451 | "name": "python",
452 | "nbconvert_exporter": "python",
453 | "pygments_lexer": "ipython2",
454 | "version": "2.7.11"
455 | }
456 | },
457 | "nbformat": 4,
458 | "nbformat_minor": 0
459 | }
460 |
--------------------------------------------------------------------------------
/course-2/week-2-multiple-regression-assignment-2-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 2: Multiple Regression (gradient descent)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In the first notebook we explored multiple regression using graphlab create. Now we will use graphlab along with numpy to solve for the regression weights with gradient descent.\n",
15 | "\n",
16 | "In this notebook we will cover estimating multiple regression weights via gradient descent. You will:\n",
17 | "* Add a constant column of 1's to a graphlab SFrame to account for the intercept\n",
18 | "* Convert an SFrame into a Numpy array\n",
19 | "* Write a predict_output() function using Numpy\n",
20 | "* Write a numpy function to compute the derivative of the regression weights with respect to a single feature\n",
21 | "* Write gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance.\n",
22 | "* Use the gradient descent function to estimate regression weights for multiple features"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "# Fire up graphlab create"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "Make sure you have the latest version of graphlab (>= 1.7)"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {
43 | "collapsed": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "import graphlab"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "# Load in house sales data\n",
55 | "\n",
56 | "Dataset is from house sales in King County, the region where the city of Seattle, WA is located."
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {
63 | "collapsed": false
64 | },
65 | "outputs": [],
66 | "source": [
67 | "sales = graphlab.SFrame('kc_house_data.gl/')"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "If we want to do any \"feature engineering\" like creating new features or adjusting existing ones we should do this directly using the SFrames as seen in the other Week 2 notebook. For this notebook, however, we will work with the existing features."
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "# Convert to Numpy Array"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Although SFrames offer a number of benefits to users (especially when using Big Data and built-in graphlab functions) in order to understand the details of the implementation of algorithms it's important to work with a library that allows for direct (and optimized) matrix operations. Numpy is a Python solution to work with matrices (or any multi-dimensional \"array\").\n",
89 | "\n",
90 | "Recall that the predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for *all* the observations can be computed by right multiplying the \"feature matrix\" by the \"weight vector\". \n",
91 | "\n",
92 | "First we need to take the SFrame of our data and convert it into a 2D numpy array (also called a matrix). To do this we use graphlab's built in .to_dataframe() which converts the SFrame into a Pandas (another python library) dataframe. We can then use Panda's .as_matrix() to convert the dataframe into a numpy matrix."
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "metadata": {
99 | "collapsed": true
100 | },
101 | "outputs": [],
102 | "source": [
103 | "import numpy as np # note this allows us to refer to numpy as np instead "
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "Now we will write a function that will accept an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price') and will return two things:\n",
111 | "* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')\n",
112 | "* A numpy array containing the values of the output\n",
113 | "\n",
114 | "With this in mind, complete the following function (where there's an empty line you should write a line of code that does what the comment above indicates)\n",
115 | "\n",
116 | "**Please note you will need GraphLab Create version at least 1.7.1 in order for .to_numpy() to work!**"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {
123 | "collapsed": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "def get_numpy_data(data_sframe, features, output):\n",
128 | " data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame\n",
129 | " # add the column 'constant' to the front of the features list so that we can extract it along with the others:\n",
130 | " features = ['constant'] + features # this is how you combine two lists\n",
131 | " # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):\n",
132 | "    features_sframe = data_sframe[features]\n",
133 | " # the following line will convert the features_SFrame into a numpy matrix:\n",
134 | " feature_matrix = features_sframe.to_numpy()\n",
135 | " # assign the column of data_sframe associated with the output to the SArray output_sarray\n",
136 | "    output_sarray = data_sframe[output]\n",
137 | " # the following will convert the SArray into a numpy array by first converting it to a list\n",
138 | " output_array = output_sarray.to_numpy()\n",
139 | " return(feature_matrix, output_array)"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "For testing let's use the 'sqft_living' feature and a constant as our features and price as our output:"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {
153 | "collapsed": false
154 | },
155 | "outputs": [],
156 | "source": [
157 | "(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list\n",
158 | "print example_features[0,:] # this accesses the first row of the data the ':' indicates 'all columns'\n",
159 | "print example_output[0] # and the corresponding output"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "# Predicting output given regression weights"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0] and we wanted to compute the predicted output 1.0\*1.0 + 1.0\*1180.0 = 1181.0; this is the dot product between these two arrays. If they're numpy arrays we can use np.dot() to compute this:"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {
180 | "collapsed": false
181 | },
182 | "outputs": [],
183 | "source": [
184 | "my_weights = np.array([1., 1.]) # the example weights\n",
185 | "my_features = example_features[0,] # we'll use the first data point\n",
186 | "predicted_value = np.dot(my_features, my_weights)\n",
187 | "print predicted_value"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "np.dot() also works when dealing with a matrix and a vector. Recall that the predictions from all the observations are just the RIGHT (as in weights on the right) dot product between the features *matrix* and the weights *vector*. With this in mind, finish the following predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights:"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "collapsed": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "def predict_output(feature_matrix, weights):\n",
206 | " # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array\n",
207 | " # create the predictions vector by using np.dot()\n",
208 | "    predictions = np.dot(feature_matrix, weights)\n",
209 | " return(predictions)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "If you want to test your code run the following cell:"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {
223 | "collapsed": false
224 | },
225 | "outputs": [],
226 | "source": [
227 | "test_predictions = predict_output(example_features, my_weights)\n",
228 | "print test_predictions[0] # should be 1181.0\n",
229 | "print test_predictions[1] # should be 2571.0"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "# Computing the Derivative"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.\n",
244 | "\n",
245 | "Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:\n",
246 | "\n",
247 | "(w[0]\\*[CONSTANT] + w[1]\\*[feature_1] + ... + w[i] \\*[feature_i] + ... + w[k]\\*[feature_k] - output)^2\n",
248 | "\n",
249 | "Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:\n",
250 | "\n",
251 | "2\\*(w[0]\\*[CONSTANT] + w[1]\\*[feature_1] + ... + w[i] \\*[feature_i] + ... + w[k]\\*[feature_k] - output)\\* [feature_i]\n",
252 | "\n",
253 | "The term inside the parentheses is just the error (difference between prediction and output). So we can re-write this as:\n",
254 | "\n",
255 | "2\\*error\\*[feature_i]\n",
256 | "\n",
257 | "That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant then this is just twice the sum of the errors!\n",
258 | "\n",
259 | "Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors. \n",
260 | "\n",
261 | "With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points)."
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": null,
267 | "metadata": {
268 | "collapsed": true
269 | },
270 | "outputs": [],
271 | "source": [
272 | "def feature_derivative(errors, feature):\n",
273 | " # Assume that errors and feature are both numpy arrays of the same length (number of data points)\n",
274 | " # compute twice the dot product of these vectors as 'derivative' and return the value\n",
275 | "    derivative = 2 * np.dot(errors, feature)\n",
276 | " return(derivative)"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "To test your feature derivative, run the following:"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {
290 | "collapsed": false
291 | },
292 | "outputs": [],
293 | "source": [
294 | "(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') \n",
295 | "my_weights = np.array([0., 0.]) # this makes all the predictions 0\n",
296 | "test_predictions = predict_output(example_features, my_weights) \n",
297 | "# just like SFrames 2 numpy arrays can be elementwise subtracted with '-': \n",
298 | "errors = test_predictions - example_output # prediction errors in this case is just the -example_output\n",
299 | "feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the \":\" indicates \"all rows\"\n",
300 | "derivative = feature_derivative(errors, feature)\n",
301 | "print derivative\n",
302 | "print -np.sum(example_output)*2 # should be the same as derivative"
303 | ]
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "# Gradient Descent"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of *increase* and therefore the negative gradient is the direction of *decrease* and we're trying to *minimize* a cost function. \n",
317 | "\n",
318 | "The amount by which we move in the negative gradient *direction* is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.\n",
319 | "\n",
320 | "With this in mind, complete the gradient descent function below using your derivative function above. For each step in the gradient descent we update the weight for each feature before computing our stopping criteria."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "collapsed": true
328 | },
329 | "outputs": [],
330 | "source": [
331 | "from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {
338 | "collapsed": false
339 | },
340 | "outputs": [],
341 | "source": [
342 | "def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):\n",
343 | " converged = False \n",
344 | " weights = np.array(initial_weights) # make sure it's a numpy array\n",
345 | " while not converged:\n",
346 | " # compute the predictions based on feature_matrix and weights using your predict_output() function\n",
347 | "        predictions = predict_output(feature_matrix, weights)\n",
348 | " # compute the errors as predictions - output\n",
349 | "        errors = predictions - output\n",
350 | " gradient_sum_squares = 0 # initialize the gradient sum of squares\n",
351 | " # while we haven't reached the tolerance yet, update each feature's weight\n",
352 | " for i in range(len(weights)): # loop over each weight\n",
353 | " # Recall that feature_matrix[:, i] is the feature column associated with weights[i]\n",
354 | " # compute the derivative for weight[i]:\n",
355 | "            derivative = feature_derivative(errors, feature_matrix[:, i])\n",
356 | " # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)\n",
357 | "            gradient_sum_squares += derivative ** 2\n",
358 | " # subtract the step size times the derivative from the current weight\n",
359 | "            weights[i] = weights[i] - step_size * derivative\n",
360 | " # compute the square-root of the gradient sum of squares to get the gradient magnitude:\n",
361 | " gradient_magnitude = sqrt(gradient_sum_squares)\n",
362 | " if gradient_magnitude < tolerance:\n",
363 | " converged = True\n",
364 | " return(weights)"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature the gradient itself will be very large since the features are large (squarefeet) and the output is large (prices). So while you might expect \"tolerance\" to be small, small is only relative to the size of the features. \n",
372 | "\n",
373 | "For similar reasons the step size will be much smaller than you might expect but this is because the gradient has such large values."
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "# Running the Gradient Descent as Simple Regression"
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "First let's split the data into training and test data."
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {
394 | "collapsed": true
395 | },
396 | "outputs": [],
397 | "source": [
398 | "train_data,test_data = sales.random_split(.8,seed=0)"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
405 | "Although the gradient descent is designed for multiple regression, since the constant is now a feature we can use the gradient descent function to estimate the parameters in the simple regression on squarefeet. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "collapsed": false
413 | },
414 | "outputs": [],
415 | "source": [
416 | "# let's test out the gradient descent\n",
417 | "simple_features = ['sqft_living']\n",
418 | "my_output = 'price'\n",
419 | "(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)\n",
420 | "initial_weights = np.array([-47000., 1.])\n",
421 | "step_size = 7e-12\n",
422 | "tolerance = 2.5e7"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "Next run your gradient descent with the above parameters."
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {
436 | "collapsed": false
437 | },
438 | "outputs": [],
439 | "source": []
440 | },
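441 | {
442 | "cell_type": "markdown",
443 | "metadata": {},
444 | "source": [
445 | "For example (a sketch; simple_weights is just one possible name for the result):"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {
452 | "collapsed": false
453 | },
454 | "outputs": [],
455 | "source": [
456 | "# possible sketch: run gradient descent with the parameters defined above\n",
457 | "simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)\n",
458 | "print simple_weights"
459 | ]
460 | },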
441 | {
442 | "cell_type": "markdown",
443 | "metadata": {},
444 | "source": [
445 | "How do your weights compare to those achieved in week 1 (don't expect them to be exactly the same)? \n",
446 | "\n",
447 | "**Quiz Question: What is the value of the weight for sqft_living -- the second element of ‘simple_weights’ (rounded to 1 decimal place)?**"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {
461 | "collapsed": false
462 | },
463 | "outputs": [],
464 | "source": [
465 | "(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)"
466 | ]
467 | },
468 | {
469 | "cell_type": "markdown",
470 | "metadata": {},
471 | "source": [
472 | "Now compute your predictions using test_simple_feature_matrix and your weights from above."
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {
479 | "collapsed": true
480 | },
481 | "outputs": [],
482 | "source": []
483 | },
484 | {
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
488 | "**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?**"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": null,
494 | "metadata": {
495 | "collapsed": false
496 | },
497 | "outputs": [],
498 | "source": []
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output)."
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {
511 | "collapsed": false
512 | },
513 | "outputs": [],
514 | "source": []
515 | },
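516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "A minimal sketch (assuming the weights estimated above are stored in simple_weights, as in the earlier sketch):"
521 | ]
522 | },
523 | {
524 | "cell_type": "code",
525 | "execution_count": null,
526 | "metadata": {
527 | "collapsed": false
528 | },
529 | "outputs": [],
530 | "source": [
531 | "# possible sketch: predictions and RSS on TEST data for model 1\n",
532 | "model_1_predictions = predict_output(test_simple_feature_matrix, simple_weights)\n",
533 | "model_1_rss = ((model_1_predictions - test_output) ** 2).sum()\n",
534 | "print model_1_rss"
535 | ]
536 | },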
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "# Running a multiple regression"
521 | ]
522 | },
523 | {
524 | "cell_type": "markdown",
525 | "metadata": {},
526 | "source": [
527 | "Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:"
528 | ]
529 | },
530 | {
531 | "cell_type": "code",
532 | "execution_count": null,
533 | "metadata": {
534 | "collapsed": false
535 | },
536 | "outputs": [],
537 | "source": [
538 | "model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. \n",
539 | "my_output = 'price'\n",
540 | "(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)\n",
541 | "initial_weights = np.array([-100000., 1., 1.])\n",
542 | "step_size = 4e-12\n",
543 | "tolerance = 1e9"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "Use the above parameters to estimate the model weights. Record these values for your quiz."
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": null,
556 | "metadata": {
557 | "collapsed": false
558 | },
559 | "outputs": [],
560 | "source": []
561 | },
562 | {
563 | "cell_type": "markdown",
564 | "metadata": {},
565 | "source": [
566 | "Use your newly estimated weights and the predict_output function to compute the predictions on the TEST data. Don't forget to create a numpy array for these features from the test set first!"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "metadata": {
573 | "collapsed": false
574 | },
575 | "outputs": [],
576 | "source": []
577 | },
578 | {
579 | "cell_type": "markdown",
580 | "metadata": {},
581 | "source": [
582 | "**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?**"
583 | ]
584 | },
585 | {
586 | "cell_type": "code",
587 | "execution_count": null,
588 | "metadata": {
589 | "collapsed": false
590 | },
591 | "outputs": [],
592 | "source": []
593 | },
594 | {
595 | "cell_type": "markdown",
596 | "metadata": {},
597 | "source": [
598 | "What is the actual price for the 1st house in the test data set?"
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "execution_count": null,
604 | "metadata": {
605 | "collapsed": false
606 | },
607 | "outputs": [],
608 | "source": []
609 | },
610 | {
611 | "cell_type": "markdown",
612 | "metadata": {},
613 | "source": [
614 | "**Quiz Question: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?**"
615 | ]
616 | },
617 | {
618 | "cell_type": "markdown",
619 | "metadata": {},
620 | "source": [
621 | "Now use your predictions and the output to compute the RSS for model 2 on TEST data."
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "metadata": {
628 | "collapsed": false
629 | },
630 | "outputs": [],
631 | "source": []
632 | },
633 | {
634 | "cell_type": "markdown",
635 | "metadata": {},
636 | "source": [
637 | "**Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?**"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": null,
643 | "metadata": {
644 | "collapsed": true
645 | },
646 | "outputs": [],
647 | "source": []
648 | }
649 | ],
650 | "metadata": {
651 | "kernelspec": {
652 | "display_name": "Python 2",
653 | "language": "python",
654 | "name": "python2"
655 | },
656 | "language_info": {
657 | "codemirror_mode": {
658 | "name": "ipython",
659 | "version": 2
660 | },
661 | "file_extension": ".py",
662 | "mimetype": "text/x-python",
663 | "name": "python",
664 | "nbconvert_exporter": "python",
665 | "pygments_lexer": "ipython2",
666 | "version": "2.7.11"
667 | }
668 | },
669 | "nbformat": 4,
670 | "nbformat_minor": 0
671 | }
672 |
--------------------------------------------------------------------------------
/course-2/week-3-polynomial-regression-assignment-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 3: Assessing Fit (polynomial regression)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook you will compare different regression models in order to assess which model fits best. We will be using polynomial regression as a means to examine this topic. In particular you will:\n",
15 | "* Write a function to take an SArray and a degree and return an SFrame where each column is the SArray raised to a power up to the given degree, e.g. if degree = 3 then column 1 is the SArray, column 2 is the SArray squared, and column 3 is the SArray cubed\n",
16 | "* Use matplotlib to visualize polynomial regressions\n",
17 | "* Use matplotlib to visualize the same polynomial degree on different subsets of the data\n",
18 | "* Use a validation set to select a polynomial degree\n",
19 | "* Assess the final fit using test data\n",
20 | "\n",
21 | "We will continue to use the House data from previous notebooks."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# Fire up graphlab create"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "collapsed": true
36 | },
37 | "outputs": [],
38 | "source": [
39 | "import graphlab"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "Next we're going to write a polynomial function that takes an SArray and a maximal degree and returns an SFrame with columns containing the SArray to all the powers up to the maximal degree.\n",
47 | "\n",
48 | "The easiest way to apply a power to an SArray is to use .apply() with a lambda function. \n",
49 | "For example, to take the example array and compute the third power we can do as follows (note: running this cell the first time may take longer than expected since it loads graphlab):"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {
56 | "collapsed": false
57 | },
58 | "outputs": [],
59 | "source": [
60 | "tmp = graphlab.SArray([1., 2., 3.])\n",
61 | "tmp_cubed = tmp.apply(lambda x: x**3)\n",
62 | "print tmp\n",
63 | "print tmp_cubed"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "We can create an empty SFrame using graphlab.SFrame() and then add any columns to it with ex_sframe['column_name'] = value. For example, we create an empty SFrame and set the column 'power_1' to be the first power of tmp (i.e. tmp itself)."
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {
77 | "collapsed": false
78 | },
79 | "outputs": [],
80 | "source": [
81 | "ex_sframe = graphlab.SFrame()\n",
82 | "ex_sframe['power_1'] = tmp\n",
83 | "print ex_sframe"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "# Polynomial_sframe function"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Using the hints above complete the following function to create an SFrame consisting of the powers of an SArray up to a specific degree:"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "collapsed": true
105 | },
106 | "outputs": [],
107 | "source": [
108 | "def polynomial_sframe(feature, degree):\n",
109 | " # assume that degree >= 1\n",
110 | " # initialize the SFrame:\n",
111 | " poly_sframe = graphlab.SFrame()\n",
112 | " # and set poly_sframe['power_1'] equal to the passed feature\n",
113 | "    poly_sframe['power_1'] = feature\n",
114 | " # first check if degree > 1\n",
115 | " if degree > 1:\n",
116 | " # then loop over the remaining degrees:\n",
117 | " # range usually starts at 0 and stops at the endpoint-1. We want it to start at 2 and stop at degree\n",
118 | " for power in range(2, degree+1): \n",
119 | " # first we'll give the column a name:\n",
120 | " name = 'power_' + str(power)\n",
121 | " # then assign poly_sframe[name] to the appropriate power of feature\n",
122 | "            poly_sframe[name] = feature.apply(lambda x: x**power)\n",
123 | " return poly_sframe"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "To test your function, consider the smaller tmp variable and what you would expect the outcome of the following call to be:"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {
137 | "collapsed": false
138 | },
139 | "outputs": [],
140 | "source": [
141 | "print polynomial_sframe(tmp, 3)"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "# Visualizing polynomial regression"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "Let's use matplotlib to visualize what a polynomial regression looks like on some real data."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": null,
161 | "metadata": {
162 | "collapsed": true
163 | },
164 | "outputs": [],
165 | "source": [
166 | "sales = graphlab.SFrame('kc_house_data.gl/')"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "As in Week 3, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices."
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {
180 | "collapsed": false
181 | },
182 | "outputs": [],
183 | "source": [
184 | "sales = sales.sort(['sqft_living', 'price'])"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "Let's start with a degree 1 polynomial using 'sqft_living' (i.e. a line) to predict 'price' and plot what it looks like."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {
198 | "collapsed": false
199 | },
200 | "outputs": [],
201 | "source": [
202 | "poly1_data = polynomial_sframe(sales['sqft_living'], 1)\n",
203 | "poly1_data['price'] = sales['price'] # add price to the data since it's the target"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "NOTE: for all the models in this notebook use validation_set = None to ensure that all results are consistent across users."
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {
217 | "collapsed": false
218 | },
219 | "outputs": [],
220 | "source": [
221 | "model1 = graphlab.linear_regression.create(poly1_data, target = 'price', features = ['power_1'], validation_set = None)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {
228 | "collapsed": false
229 | },
230 | "outputs": [],
231 | "source": [
232 | "#let's take a look at the weights before we plot\n",
233 | "model1.get(\"coefficients\")"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {
240 | "collapsed": true
241 | },
242 | "outputs": [],
243 | "source": [
244 | "import matplotlib.pyplot as plt\n",
245 | "%matplotlib inline"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "collapsed": false
253 | },
254 | "outputs": [],
255 | "source": [
256 | "plt.plot(poly1_data['power_1'],poly1_data['price'],'.',\n",
257 | " poly1_data['power_1'], model1.predict(poly1_data),'-')"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "Let's unpack that plt.plot() command. The first pair of SArrays we passed are the 1st power of sqft and the actual price; we ask for these to be plotted as dots '.'. The next pair we pass is the 1st power of sqft and the predicted values from the linear model. We ask for these to be plotted as a line '-'. \n",
265 | "\n",
266 | "We can see, not surprisingly, that the predicted values all fall on a line, specifically the one with slope 280 and intercept -43579. What if we wanted to plot a second degree polynomial?"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {
273 | "collapsed": false
274 | },
275 | "outputs": [],
276 | "source": [
277 | "poly2_data = polynomial_sframe(sales['sqft_living'], 2)\n",
278 | "my_features = poly2_data.column_names() # get the name of the features\n",
279 | "poly2_data['price'] = sales['price'] # add price to the data since it's the target\n",
280 | "model2 = graphlab.linear_regression.create(poly2_data, target = 'price', features = my_features, validation_set = None)"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {
287 | "collapsed": false
288 | },
289 | "outputs": [],
290 | "source": [
291 | "model2.get(\"coefficients\")"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {
298 | "collapsed": false
299 | },
300 | "outputs": [],
301 | "source": [
302 | "plt.plot(poly2_data['power_1'],poly2_data['price'],'.',\n",
303 | " poly2_data['power_1'], model2.predict(poly2_data),'-')"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "The resulting model looks like half a parabola. Try on your own to see what the cubic looks like:"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": null,
316 | "metadata": {
317 | "collapsed": false
318 | },
319 | "outputs": [],
320 | "source": []
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {
326 | "collapsed": false
327 | },
328 | "outputs": [],
329 | "source": []
330 | },
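331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "For instance, the cubic fit could be produced and plotted the same way as the quadratic (a sketch; poly3_data and model3 are just one possible naming):"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "collapsed": false
343 | },
344 | "outputs": [],
345 | "source": [
346 | "# possible sketch: degree 3 polynomial fit and plot\n",
347 | "poly3_data = polynomial_sframe(sales['sqft_living'], 3)\n",
348 | "my_features = poly3_data.column_names()\n",
349 | "poly3_data['price'] = sales['price']\n",
350 | "model3 = graphlab.linear_regression.create(poly3_data, target = 'price', features = my_features, validation_set = None)\n",
351 | "plt.plot(poly3_data['power_1'], poly3_data['price'], '.',\n",
352 | "         poly3_data['power_1'], model3.predict(poly3_data), '-')"
353 | ]
354 | },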
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "Now try a 15th degree polynomial:"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "collapsed": false
343 | },
344 | "outputs": [],
345 | "source": []
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {
351 | "collapsed": false
352 | },
353 | "outputs": [],
354 | "source": []
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
360 | "What do you think of the 15th degree polynomial? Do you think this is appropriate? If we were to change the data do you think you'd get pretty much the same curve? Let's take a look."
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "# Changing the data and re-learning"
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "We're going to split the sales data into four subsets of roughly equal size. Then you will estimate a 15th degree polynomial model on all four subsets of the data. Print the coefficients (you should use .print_rows(num_rows = 16) to view all of them) and plot the resulting fit (as we did above). The quiz will ask you some questions about these results.\n",
375 | "\n",
376 | "To split the sales data into four subsets, we perform the following steps:\n",
377 | "* First split sales into 2 subsets with `.random_split(0.5, seed=0)`. \n",
378 | "* Next split the resulting subsets into 2 more subsets each. Use `.random_split(0.5, seed=0)`.\n",
379 | "\n",
380 | "We set `seed=0` in these steps so that different users get consistent results.\n",
381 | "You should end up with 4 subsets (`set_1`, `set_2`, `set_3`, `set_4`) of approximately equal size. "
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {
388 | "collapsed": true
389 | },
390 | "outputs": [],
391 | "source": []
392 | },
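393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "A sketch of the split described above (the intermediate names are one possible choice):"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {
404 | "collapsed": true
405 | },
406 | "outputs": [],
407 | "source": [
408 | "# possible sketch: split sales into four roughly equal subsets\n",
409 | "(semi_split1, semi_split2) = sales.random_split(0.5, seed=0)\n",
410 | "(set_1, set_2) = semi_split1.random_split(0.5, seed=0)\n",
411 | "(set_3, set_4) = semi_split2.random_split(0.5, seed=0)"
412 | ]
413 | },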
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "Fit a 15th degree polynomial on set_1, set_2, set_3, and set_4 using sqft_living to predict prices. Print the coefficients and make a plot of the resulting model."
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {
404 | "collapsed": false
405 | },
406 | "outputs": [],
407 | "source": []
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": null,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [],
416 | "source": []
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": null,
421 | "metadata": {
422 | "collapsed": false
423 | },
424 | "outputs": [],
425 | "source": []
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "collapsed": false
432 | },
433 | "outputs": [],
434 | "source": []
435 | },
436 | {
437 | "cell_type": "markdown",
438 | "metadata": {},
439 | "source": [
440 | "Some questions you will be asked on your quiz:\n",
441 | "\n",
442 | "**Quiz Question: Is the sign (positive or negative) for power_15 the same in all four models?**\n",
443 | "\n",
444 | "**Quiz Question: (True/False) the plotted fitted lines look the same in all four plots**"
445 | ]
446 | },
447 | {
448 | "cell_type": "markdown",
449 | "metadata": {},
450 | "source": [
451 | "# Selecting a Polynomial Degree"
452 | ]
453 | },
454 | {
455 | "cell_type": "markdown",
456 | "metadata": {},
457 | "source": [
458 | "Whenever we have a \"magic\" parameter like the degree of the polynomial there is one well-known way to select these parameters: validation set. (We will explore another approach in week 4).\n",
459 | "\n",
460 | "We split the sales dataset 3-way into training set, test set, and validation set as follows:\n",
461 | "\n",
462 | "* Split our sales data into 2 sets: `training_and_validation` and `testing`. Use `random_split(0.9, seed=1)`.\n",
463 | "* Further split our training data into two sets: `training` and `validation`. Use `random_split(0.5, seed=1)`.\n",
464 | "\n",
465 | "Again, we set `seed=1` to obtain consistent results for different users."
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {
472 | "collapsed": true
473 | },
474 | "outputs": [],
475 | "source": []
476 | },
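477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "A sketch of the 3-way split described above:"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {
488 | "collapsed": true
489 | },
490 | "outputs": [],
491 | "source": [
492 | "# possible sketch: 3-way split into training, validation and testing\n",
493 | "(training_and_validation, testing) = sales.random_split(0.9, seed=1)\n",
494 | "(training, validation) = training_and_validation.random_split(0.5, seed=1)"
495 | ]
496 | },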
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "Next you should write a loop that does the following:\n",
482 | "* For degree in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] (to get this in python type range(1, 15+1))\n",
483 | " * Build an SFrame of polynomial data of train_data['sqft_living'] at the current degree\n",
484 | " * hint: my_features = poly_data.column_names() gives you a list e.g. ['power_1', 'power_2', 'power_3'] which you might find useful for graphlab.linear_regression.create( features = my_features)\n",
485 | " * Add train_data['price'] to the polynomial SFrame\n",
486 | " * Learn a polynomial regression model to sqft vs price with that degree on TRAIN data\n",
487 | "    * Compute the RSS on VALIDATION data (here you will want to use .predict()) for that degree and you will need to make a polynomial SFrame using validation data.\n",
488 | "* Report which degree had the lowest RSS on validation data (remember python indexes from 0)\n",
489 | "\n",
490 | "(Note you can turn off the print out of linear_regression.create() with verbose = False)"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {
497 | "collapsed": false
498 | },
499 | "outputs": [],
500 | "source": []
501 | },
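502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "The loop described above might look like the following sketch (assuming the split names training and validation from the previous sketch):"
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": null,
512 | "metadata": {
513 | "collapsed": false
514 | },
515 | "outputs": [],
516 | "source": [
517 | "# possible sketch: RSS on VALIDATION data for each polynomial degree\n",
518 | "for degree in range(1, 15+1):\n",
519 | "    poly_data = polynomial_sframe(training['sqft_living'], degree)\n",
520 | "    my_features = poly_data.column_names()\n",
521 | "    poly_data['price'] = training['price']\n",
522 | "    model = graphlab.linear_regression.create(poly_data, target = 'price', features = my_features,\n",
523 | "                                              validation_set = None, verbose = False)\n",
524 | "    poly_validation = polynomial_sframe(validation['sqft_living'], degree)\n",
525 | "    errors = model.predict(poly_validation) - validation['price']\n",
526 | "    print degree, (errors * errors).sum()"
527 | ]
528 | },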
502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "**Quiz Question: Which degree (1, 2, …, 15) had the lowest RSS on Validation data?**"
507 | ]
508 | },
509 | {
510 | "cell_type": "markdown",
511 | "metadata": {},
512 | "source": [
513 | "Now that you have chosen the degree of your polynomial using validation data, compute the RSS of this model on TEST data. Report the RSS on your quiz."
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {
520 | "collapsed": false
521 | },
522 | "outputs": [],
523 | "source": []
524 | },
525 | {
526 | "cell_type": "markdown",
527 | "metadata": {},
528 | "source": [
529 | "**Quiz Question: what is the RSS on TEST data for the model with the degree selected from Validation data?**"
530 | ]
531 | },
532 | {
533 | "cell_type": "code",
534 | "execution_count": null,
535 | "metadata": {
536 | "collapsed": true
537 | },
538 | "outputs": [],
539 | "source": []
540 | }
541 | ],
542 | "metadata": {
543 | "kernelspec": {
544 | "display_name": "Python 2",
545 | "language": "python",
546 | "name": "python2"
547 | },
548 | "language_info": {
549 | "codemirror_mode": {
550 | "name": "ipython",
551 | "version": 2
552 | },
553 | "file_extension": ".py",
554 | "mimetype": "text/x-python",
555 | "name": "python",
556 | "nbconvert_exporter": "python",
557 | "pygments_lexer": "ipython2",
558 | "version": "2.7.11"
559 | }
560 | },
561 | "nbformat": 4,
562 | "nbformat_minor": 0
563 | }
564 |
--------------------------------------------------------------------------------
/course-2/week-4-ridge-regression-assignment-1-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 4: Ridge Regression (interpretation)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook, we will run ridge regression multiple times with different L2 penalties to see which one produces the best fit. We will revisit the example of polynomial regression as a means to see the effect of L2 regularization. In particular, we will:\n",
15 | "* Use a pre-built implementation of regression (GraphLab Create) to run polynomial regression\n",
16 | "* Use matplotlib to visualize polynomial regressions\n",
17 | "* Use a pre-built implementation of regression (GraphLab Create) to run polynomial regression, this time with L2 penalty\n",
18 | "* Use matplotlib to visualize polynomial regressions under L2 regularization\n",
19 | "* Choose best L2 penalty using cross-validation.\n",
20 | "* Assess the final fit using test data.\n",
21 | "\n",
22 | "We will continue to use the House data from previous notebooks. (In the next programming assignment for this module, you will implement your own ridge regression learning algorithm using gradient descent.)"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "# Fire up graphlab create"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {
36 | "collapsed": false
37 | },
38 | "outputs": [],
39 | "source": [
40 | "import graphlab"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "# Polynomial regression, revisited"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "We build on the material from Week 3, where we wrote the function to produce an SFrame with columns containing the powers of a given input. Copy and paste the function `polynomial_sframe` from Week 3:"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {
61 | "collapsed": true
62 | },
63 | "outputs": [],
64 | "source": [
65 | "def polynomial_sframe(feature, degree):\n",
66 | " "
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "Let's use matplotlib to visualize what a polynomial regression looks like on the house data."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "import matplotlib.pyplot as plt\n",
85 | "%matplotlib inline"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [],
95 | "source": [
96 | "sales = graphlab.SFrame('kc_house_data.gl/')"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "As in Week 3, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {
110 | "collapsed": false
111 | },
112 | "outputs": [],
113 | "source": [
114 | "sales = sales.sort(['sqft_living','price'])"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Let us revisit the 15th-order polynomial model using the 'sqft_living' input. Generate polynomial features up to degree 15 using `polynomial_sframe()` and fit a model with these features. When fitting the model, use an L2 penalty of `1e-5`:"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {
128 | "collapsed": true
129 | },
130 | "outputs": [],
131 | "source": [
132 | "l2_small_penalty = 1e-5"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Note: When we have so many features and so few data points, the solution can become highly numerically unstable, which can sometimes lead to strange unpredictable results. Thus, rather than using no regularization, we will introduce a tiny amount of regularization (`l2_penalty=1e-5`) to make the solution numerically stable. (In lecture, we discussed the fact that regularization can also help with numerical stability, and here we are seeing a practical example.)\n",
140 | "\n",
141 | "With the L2 penalty specified above, fit the model and print out the learned weights.\n",
142 | "\n",
143 | "Hint: make sure to add 'price' column to the new SFrame before calling `graphlab.linear_regression.create()`. Also, make sure GraphLab Create doesn't create its own validation set by using the option `validation_set=None` in this call."
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {
150 | "collapsed": false
151 | },
152 | "outputs": [],
153 | "source": []
154 | },
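155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "One possible sketch, following the hints above (and assuming polynomial_sframe has been copied from Week 3 as instructed; model15 is the name referenced later in this notebook):"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {
166 | "collapsed": false
167 | },
168 | "outputs": [],
169 | "source": [
170 | "# possible sketch: degree-15 fit with a tiny L2 penalty\n",
171 | "poly15_data = polynomial_sframe(sales['sqft_living'], 15)\n",
172 | "my_features = poly15_data.column_names()\n",
173 | "poly15_data['price'] = sales['price']\n",
174 | "model15 = graphlab.linear_regression.create(poly15_data, target = 'price', features = my_features,\n",
175 | "                                            l2_penalty = l2_small_penalty, validation_set = None)\n",
176 | "model15.get(\"coefficients\").print_rows(num_rows = 16)"
177 | ]
178 | },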
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "***QUIZ QUESTION: What's the learned value for the coefficient of feature `power_1`?***"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "# Observe overfitting"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "Recall from Week 3 that the polynomial fit of degree 15 changed wildly whenever the data changed. In particular, when we split the sales data into four subsets and fit the model of degree 15, the result came out to be very different for each subset. The model had a *high variance*. We will see in a moment that ridge regression reduces such variance. But first, we must reproduce the experiment we did in Week 3."
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "First, split the data into split the sales data into four subsets of roughly equal size and call them `set_1`, `set_2`, `set_3`, and `set_4`. Use `.random_split` function and make sure you set `seed=0`. "
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {
187 | "collapsed": true
188 | },
189 | "outputs": [],
190 | "source": [
191 | "(semi_split1, semi_split2) = sales.random_split(.5,seed=0)\n",
192 | "(set_1, set_2) = semi_split1.random_split(0.5, seed=0)\n",
193 | "(set_3, set_4) = semi_split2.random_split(0.5, seed=0)"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "Next, fit a 15th degree polynomial on `set_1`, `set_2`, `set_3`, and `set_4`, using 'sqft_living' to predict prices. Print the weights and make a plot of the resulting model.\n",
201 | "\n",
202 | "Hint: When calling `graphlab.linear_regression.create()`, use the same L2 penalty as before (i.e. `l2_small_penalty`). Also, make sure GraphLab Create doesn't create its own validation set by using the option `validation_set = None` in this call."
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {
209 | "collapsed": false
210 | },
211 | "outputs": [],
212 | "source": []
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {
218 | "collapsed": false,
219 | "scrolled": false
220 | },
221 | "outputs": [],
222 | "source": []
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {
228 | "collapsed": false
229 | },
230 | "outputs": [],
231 | "source": []
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {
237 | "collapsed": false
238 | },
239 | "outputs": [],
240 | "source": []
241 | },
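One possible way to organize the four fits described above (a sketch; `fit_poly15` is a helper name introduced here for illustration). The same helper can be reused in the next section by passing `l2_penalty=1e5`:

```python
def fit_poly15(data, l2_penalty):
    # Fit a degree-15 polynomial model of price on sqft_living, print its
    # coefficients, and plot the data together with the fitted curve.
    poly15 = polynomial_sframe(data['sqft_living'], 15)
    features15 = poly15.column_names()
    poly15['price'] = data['price']
    model = graphlab.linear_regression.create(poly15, target='price',
                                              features=features15,
                                              l2_penalty=l2_penalty,
                                              validation_set=None)
    print model['coefficients']
    plt.figure()
    plt.plot(poly15['power_1'], poly15['price'], '.',
             poly15['power_1'], model.predict(poly15), '-')
    return model

for subset in [set_1, set_2, set_3, set_4]:
    fit_poly15(subset, l2_small_penalty)
```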
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "The four curves should differ from one another a lot, as should the coefficients you learned.\n",
247 | "\n",
248 | "***QUIZ QUESTION: For the models learned in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature `power_1`?*** (For the purpose of answering this question, negative numbers are considered \"smaller\" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "# Ridge regression comes to rescue"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "Generally, whenever we see weights change so much in response to change in data, we believe the variance of our estimate to be large. Ridge regression aims to address this issue by penalizing \"large\" weights. (Weights of `model15` looked quite small, but they are not that small because 'sqft_living' input is in the order of thousands.)\n",
263 | "\n",
264 | "With the argument `l2_penalty=1e5`, fit a 15th-order polynomial model on `set_1`, `set_2`, `set_3`, and `set_4`. Other than the change in the `l2_penalty` parameter, the code should be the same as the experiment above. Also, make sure GraphLab Create doesn't create its own validation set by using the option `validation_set = None` in this call."
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {
271 | "collapsed": false,
272 | "scrolled": false
273 | },
274 | "outputs": [],
275 | "source": []
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {
281 | "collapsed": false,
282 | "scrolled": false
283 | },
284 | "outputs": [],
285 | "source": []
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {
291 | "collapsed": false
292 | },
293 | "outputs": [],
294 | "source": []
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {
300 | "collapsed": false
301 | },
302 | "outputs": [],
303 | "source": []
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "These curves should vary a lot less, now that you applied a high degree of regularization.\n",
310 | "\n",
311 | "***QUIZ QUESTION: For the models learned with the high level of regularization in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature `power_1`?*** (For the purpose of answering this question, negative numbers are considered \"smaller\" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "# Selecting an L2 penalty via cross-validation"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "Just like the polynomial degree, the L2 penalty is a \"magic\" parameter we need to select. We could use the validation set approach as we did in the last module, but that approach has a major disadvantage: it leaves fewer observations available for training. **Cross-validation** seeks to overcome this issue by using all of the training set in a smart way.\n",
326 | "\n",
327 | "We will implement a kind of cross-validation called **k-fold cross-validation**. The method gets its name because it involves dividing the training set into k segments of roughtly equal size. Similar to the validation set method, we measure the validation error with one of the segments designated as the validation set. The major difference is that we repeat the process k times as follows:\n",
328 | "\n",
329 | "Set aside segment 0 as the validation set, and fit a model on rest of data, and evalutate it on this validation set
\n",
330 | "Set aside segment 1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set
\n",
331 | "...
\n",
332 | "Set aside segment k-1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set\n",
333 | "\n",
334 | "After this process, we compute the average of the k validation errors, and use it as an estimate of the generalization error. Notice that all observations are used for both training and validation, as we iterate over segments of data. \n",
335 | "\n",
336 | "To estimate the generalization error well, it is crucial to shuffle the training data before dividing them into segments. GraphLab Create has a utility function for shuffling a given SFrame. We reserve 10% of the data as the test set and shuffle the remainder. (Make sure to use `seed=1` to get consistent answer.)"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "metadata": {
343 | "collapsed": true
344 | },
345 | "outputs": [],
346 | "source": [
347 | "(train_valid, test) = sales.random_split(.9, seed=1)\n",
348 | "train_valid_shuffled = graphlab.toolkits.cross_validation.shuffle(train_valid, random_seed=1)"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "Once the data is shuffled, we divide it into equal segments. Each segment should receive `n/k` elements, where `n` is the number of observations in the training set and `k` is the number of segments. Since the segment 0 starts at index 0 and contains `n/k` elements, it ends at index `(n/k)-1`. The segment 1 starts where the segment 0 left off, at index `(n/k)`. With `n/k` elements, the segment 1 ends at index `(n*2/k)-1`. Continuing in this fashion, we deduce that the segment `i` starts at index `(n*i/k)` and ends at `(n*(i+1)/k)-1`."
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "With this pattern in mind, we write a short loop that prints the starting and ending indices of each segment, just to make sure you are getting the splits right."
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": true
370 | },
371 | "outputs": [],
372 | "source": [
373 | "n = len(train_valid_shuffled)\n",
374 | "k = 10 # 10-fold cross-validation\n",
375 | "\n",
376 | "for i in xrange(k):\n",
377 | " start = (n*i)/k\n",
378 | " end = (n*(i+1))/k-1\n",
379 | " print i, (start, end)"
380 | ]
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {
385 | "collapsed": false
386 | },
387 | "source": [
388 | "Let us familiarize ourselves with array slicing with SFrame. To extract a continuous slice from an SFrame, use colon in square brackets. For instance, the following cell extracts rows 0 to 9 of `train_valid_shuffled`. Notice that the first index (0) is included in the slice but the last index (10) is omitted."
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {
395 | "collapsed": true
396 | },
397 | "outputs": [],
398 | "source": [
399 | "train_valid_shuffled[0:10] # rows 0 to 9"
400 | ]
401 | },
402 | {
403 | "cell_type": "markdown",
404 | "metadata": {},
405 | "source": [
406 | "Now let us extract individual segments with array slicing. Consider the scenario where we group the houses in the `train_valid_shuffled` dataframe into k=10 segments of roughly equal size, with starting and ending indices computed as above.\n",
407 | "Extract the fourth segment (segment 3) and assign it to a variable called `validation4`."
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {
414 | "collapsed": true
415 | },
416 | "outputs": [],
417 | "source": []
418 | },
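A sketch under the indexing scheme described above, with k = 10 and i = 3:

```python
n = len(train_valid_shuffled)
k = 10
start = (n * 3) / k          # first index of segment 3
end = (n * 4) / k - 1        # last (inclusive) index of segment 3
validation4 = train_valid_shuffled[start:end+1]
```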
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "To verify that we have the right elements extracted, run the following cell, which computes the average price of the fourth segment. When rounded to nearest whole number, the average should be $536,234."
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {
430 | "collapsed": true
431 | },
432 | "outputs": [],
433 | "source": [
434 | "print int(round(validation4['price'].mean(), 0))"
435 | ]
436 | },
437 | {
438 | "cell_type": "markdown",
439 | "metadata": {},
440 | "source": [
441 | "After designating one of the k segments as the validation set, we train a model using the rest of the data. To choose the remainder, we slice (0:start) and (end+1:n) of the data and paste them together. SFrame has `append()` method that pastes together two disjoint sets of rows originating from a common dataset. For instance, the following cell pastes together the first and last two rows of the `train_valid_shuffled` dataframe."
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": null,
447 | "metadata": {
448 | "collapsed": true
449 | },
450 | "outputs": [],
451 | "source": [
452 | "n = len(train_valid_shuffled)\n",
453 | "first_two = train_valid_shuffled[0:2]\n",
454 | "last_two = train_valid_shuffled[n-2:n]\n",
455 | "print first_two.append(last_two)"
456 | ]
457 | },
458 | {
459 | "cell_type": "markdown",
460 | "metadata": {},
461 | "source": [
462 | "Extract the remainder of the data after *excluding* fourth segment (segment 3) and assign the subset to `train4`."
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {
469 | "collapsed": true
470 | },
471 | "outputs": [],
472 | "source": []
473 | },
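Following the `append()` pattern just shown, a sketch of the complementary split (reusing `start` and `end` from the segment-3 slice above):

```python
# Everything before segment 3, followed by everything after it
train4 = train_valid_shuffled[0:start].append(train_valid_shuffled[end+1:n])
```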
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "To verify that we have the right elements extracted, run the following cell, which computes the average price of the data with fourth segment excluded. When rounded to nearest whole number, the average should be $539,450."
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {
485 | "collapsed": true
486 | },
487 | "outputs": [],
488 | "source": [
489 | "print int(round(train4['price'].mean(), 0))"
490 | ]
491 | },
492 | {
493 | "cell_type": "markdown",
494 | "metadata": {},
495 | "source": [
496 | "Now we are ready to implement k-fold cross-validation. Write a function that computes k validation errors by designating each of the k segments as the validation set. It accepts as parameters (i) `k`, (ii) `l2_penalty`, (iii) dataframe, (iv) name of output column (e.g. `price`) and (v) list of feature names. The function returns the average validation error using k segments as validation sets.\n",
497 | "\n",
498 | "* For each i in [0, 1, ..., k-1]:\n",
499 | " * Compute starting and ending indices of segment i and call 'start' and 'end'\n",
500 | " * Form validation set by taking a slice (start:end+1) from the data.\n",
501 | " * Form training set by appending slice (end+1:n) to the end of slice (0:start).\n",
502 | " * Train a linear model using training set just formed, with a given l2_penalty\n",
503 | " * Compute validation error using validation set just formed"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {
510 | "collapsed": false
511 | },
512 | "outputs": [],
513 | "source": [
514 | "def k_fold_cross_validation(k, l2_penalty, data, output_name, features_list):\n",
515 | " "
516 | ]
517 | },
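One possible completion of this function (a sketch; it relies on `graphlab.linear_regression.create` with `l2_penalty` and `validation_set=None`, as used throughout this notebook, and returns the average per-fold RSS):

```python
def k_fold_cross_validation(k, l2_penalty, data, output_name, features_list):
    n = len(data)
    total_error = 0.0
    for i in xrange(k):
        start = (n * i) / k
        end = (n * (i + 1)) / k - 1
        validation = data[start:end+1]                      # fold i
        training = data[0:start].append(data[end+1:n])      # everything else
        model = graphlab.linear_regression.create(training, target=output_name,
                                                  features=features_list,
                                                  l2_penalty=l2_penalty,
                                                  validation_set=None,
                                                  verbose=False)
        residuals = model.predict(validation) - validation[output_name]
        total_error += (residuals * residuals).sum()        # RSS on this fold
    return total_error / k
```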
518 | {
519 | "cell_type": "markdown",
520 | "metadata": {},
521 | "source": [
522 | "Once we have a function to compute the average validation error for a model, we can write a loop to find the model that minimizes the average validation error. Write a loop that does the following:\n",
523 | "* We will again be aiming to fit a 15th-order polynomial model using the `sqft_living` input\n",
524 | "* For `l2_penalty` in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, you can use this Numpy function: `np.logspace(1, 7, num=13)`.)\n",
525 | " * Run 10-fold cross-validation with `l2_penalty`\n",
526 | "* Report which L2 penalty produced the lowest average validation error.\n",
527 | "\n",
528 | "Note: since the degree of the polynomial is now fixed to 15, to make things faster, you should generate polynomial features in advance and re-use them throughout the loop. Make sure to use `train_valid_shuffled` when generating polynomial features!"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": null,
534 | "metadata": {
535 | "collapsed": true
536 | },
537 | "outputs": [],
538 | "source": []
539 | },
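A sketch of the search loop described above, assuming the `k_fold_cross_validation` sketch; `poly15_train`, `errors`, and `best_l2` are illustrative names:

```python
import numpy as np

# Generate the degree-15 features once, outside the loop
poly15_train = polynomial_sframe(train_valid_shuffled['sqft_living'], 15)
poly_features = poly15_train.column_names()
poly15_train['price'] = train_valid_shuffled['price']

errors = {}
for l2_penalty in np.logspace(1, 7, num=13):
    errors[l2_penalty] = k_fold_cross_validation(10, l2_penalty, poly15_train,
                                                 'price', poly_features)
    print l2_penalty, errors[l2_penalty]

best_l2 = min(errors, key=errors.get)
print 'Best L2 penalty:', best_l2
```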
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "***QUIZ QUESTIONS: What is the best value for the L2 penalty according to 10-fold validation?***"
545 | ]
546 | },
547 | {
548 | "cell_type": "markdown",
549 | "metadata": {},
550 | "source": [
551 | "You may find it useful to plot the k-fold cross-validation errors you have obtained to better understand the behavior of the method. "
552 | ]
553 | },
554 | {
555 | "cell_type": "code",
556 | "execution_count": null,
557 | "metadata": {
558 | "collapsed": true
559 | },
560 | "outputs": [],
561 | "source": [
562 | "# Plot the l2_penalty values in the x axis and the cross-validation error in the y axis.\n",
563 | "# Using plt.xscale('log') will make your plot more intuitive.\n",
564 | "\n"
565 | ]
566 | },
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
571 | "Once you found the best value for the L2 penalty using cross-validation, it is important to retrain a final model on all of the training data using this value of `l2_penalty`. This way, your final model will be trained on the entire dataset."
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": null,
577 | "metadata": {
578 | "collapsed": true
579 | },
580 | "outputs": [],
581 | "source": []
582 | },
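A sketch of the retraining step and the test-set RSS, assuming `poly15_train`, `poly_features`, and `best_l2` from the search above:

```python
final_model = graphlab.linear_regression.create(poly15_train, target='price',
                                                features=poly_features,
                                                l2_penalty=best_l2,
                                                validation_set=None)

# Evaluate on the held-out test set with the same degree-15 features
poly15_test = polynomial_sframe(test['sqft_living'], 15)
test_residuals = final_model.predict(poly15_test) - test['price']
print 'Test RSS:', (test_residuals * test_residuals).sum()
```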
583 | {
584 | "cell_type": "markdown",
585 | "metadata": {},
586 | "source": [
587 | "***QUIZ QUESTION: Using the best L2 penalty found above, train a model using all training data. What is the RSS on the TEST data of the model you learn with this L2 penalty? ***"
588 | ]
589 | },
590 | {
591 | "cell_type": "code",
592 | "execution_count": null,
593 | "metadata": {
594 | "collapsed": true
595 | },
596 | "outputs": [],
597 | "source": []
598 | }
599 | ],
600 | "metadata": {
601 | "kernelspec": {
602 | "display_name": "Python 2",
603 | "language": "python",
604 | "name": "python2"
605 | },
606 | "language_info": {
607 | "codemirror_mode": {
608 | "name": "ipython",
609 | "version": 2
610 | },
611 | "file_extension": ".py",
612 | "mimetype": "text/x-python",
613 | "name": "python",
614 | "nbconvert_exporter": "python",
615 | "pygments_lexer": "ipython2",
616 | "version": "2.7.11"
617 | }
618 | },
619 | "nbformat": 4,
620 | "nbformat_minor": 0
621 | }
622 |
--------------------------------------------------------------------------------
/course-2/week-4-ridge-regression-assignment-2-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 4: Ridge Regression (gradient descent)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook, you will implement ridge regression via gradient descent. You will:\n",
15 | "* Convert an SFrame into a Numpy array\n",
16 | "* Write a Numpy function to compute the derivative of the regression weights with respect to a single feature\n",
17 | "* Write gradient descent function to compute the regression weights given an initial weight vector, step size, tolerance, and L2 penalty"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "# Fire up graphlab create"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "Make sure you have the latest version of GraphLab Create (>= 1.7)"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "import graphlab"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "# Load in house sales data\n",
50 | "\n",
51 | "Dataset is from house sales in King County, the region where the city of Seattle, WA is located."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {
58 | "collapsed": false
59 | },
60 | "outputs": [],
61 | "source": [
62 | "sales = graphlab.SFrame('kc_house_data.gl/')"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "If we want to do any \"feature engineering\" like creating new features or adjusting existing ones we should do this directly using the SFrames as seen in the first notebook of Week 2. For this notebook, however, we will work with the existing features."
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "# Import useful functions from previous notebook"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "As in Week 2, we convert the SFrame into a 2D Numpy array. Copy and paste `get_numpy_data()` from the second notebook of Week 2."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {
90 | "collapsed": true
91 | },
92 | "outputs": [],
93 | "source": [
94 | "import numpy as np # note this allows us to refer to numpy as np instead "
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {
101 | "collapsed": true
102 | },
103 | "outputs": [],
104 | "source": []
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "Also, copy and paste the `predict_output()` function to compute the predictions for an entire matrix of features given the matrix and the weights:"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": true
118 | },
119 | "outputs": [],
120 | "source": []
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "# Computing the Derivative"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output, plus the L2 penalty term.\n",
134 | "```\n",
135 | "Cost(w)\n",
136 | "= SUM[ (prediction - output)^2 ]\n",
137 | "+ l2_penalty*(w[0]^2 + w[1]^2 + ... + w[k]^2).\n",
138 | "```\n",
139 | "\n",
140 | "Since the derivative of a sum is the sum of the derivatives, we can take the derivative of the first part (the RSS) as we did in the notebook for the unregularized case in Week 2 and add the derivative of the regularization part. As we saw, the derivative of the RSS with respect to `w[i]` can be written as: \n",
141 | "```\n",
142 | "2*SUM[ error*[feature_i] ].\n",
143 | "```\n",
144 | "The derivative of the regularization term with respect to `w[i]` is:\n",
145 | "```\n",
146 | "2*l2_penalty*w[i].\n",
147 | "```\n",
148 | "Summing both, we get\n",
149 | "```\n",
150 | "2*SUM[ error*[feature_i] ] + 2*l2_penalty*w[i].\n",
151 | "```\n",
152 | "That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself, plus `2*l2_penalty*w[i]`. \n",
153 | "\n",
154 | "**We will not regularize the constant.** Thus, in the case of the constant, the derivative is just twice the sum of the errors (without the `2*l2_penalty*w[0]` term).\n",
155 | "\n",
156 | "Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors, plus `2*l2_penalty*w[i]`.\n",
157 | "\n",
158 | "With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points). To decide when to we are dealing with the constant (so we don't regularize it) we added the extra parameter to the call `feature_is_constant` which you should set to `True` when computing the derivative of the constant and `False` otherwise."
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {
165 | "collapsed": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "def feature_derivative_ridge(errors, feature, weight, l2_penalty, feature_is_constant):\n",
170 | " # If feature_is_constant is True, derivative is twice the dot product of errors and feature\n",
171 | " \n",
172 | " # Otherwise, derivative is twice the dot product plus 2*l2_penalty*weight\n",
173 | " \n",
174 | " return derivative"
175 | ]
176 | },
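One possible completion, consistent with the test cell below (a sketch):

```python
def feature_derivative_ridge(errors, feature, weight, l2_penalty, feature_is_constant):
    if feature_is_constant:
        # Constant term: derivative is twice the dot product, no regularization
        derivative = 2 * np.dot(errors, feature)
    else:
        # Regular feature: add the L2 term 2*l2_penalty*weight
        derivative = 2 * np.dot(errors, feature) + 2 * l2_penalty * weight
    return derivative
```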
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "To test your feature derivartive run the following:"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {
188 | "collapsed": false
189 | },
190 | "outputs": [],
191 | "source": [
192 | "(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') \n",
193 | "my_weights = np.array([1., 10.])\n",
194 | "test_predictions = predict_output(example_features, my_weights) \n",
195 | "errors = test_predictions - example_output # prediction errors\n",
196 | "\n",
197 | "# next two lines should print the same values\n",
198 | "print feature_derivative_ridge(errors, example_features[:,1], my_weights[1], 1, False)\n",
199 | "print np.sum(errors*example_features[:,1])*2+20.\n",
200 | "print ''\n",
201 | "\n",
202 | "# next two lines should print the same values\n",
203 | "print feature_derivative_ridge(errors, example_features[:,0], my_weights[0], 1, True)\n",
204 | "print np.sum(errors)*2."
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "# Gradient Descent"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of *increase* and therefore the negative gradient is the direction of *decrease* and we're trying to *minimize* a cost function. \n",
219 | "\n",
220 | "The amount by which we move in the negative gradient *direction* is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. Unlike in Week 2, this time we will set a **maximum number of iterations** and take gradient steps until we reach this maximum number. If no maximum number is supplied, the maximum should be set 100 by default. (Use default parameter values in Python.)\n",
221 | "\n",
222 | "With this in mind, complete the following gradient descent function below using your derivative function above. For each step in the gradient descent, we update the weight for each feature before computing our stopping criteria."
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {
229 | "collapsed": false
230 | },
231 | "outputs": [],
232 | "source": [
233 | "def ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations=100):\n",
234 | " print 'Starting gradient descent with l2_penalty = ' + str(l2_penalty)\n",
235 | " \n",
236 | " weights = np.array(initial_weights) # make sure it's a numpy array\n",
237 | " iteration = 0 # iteration counter\n",
238 | " print_frequency = 1 # for adjusting frequency of debugging output\n",
239 | " \n",
240 | " #while not reached maximum number of iterations:\n",
241 | " iteration += 1 # increment iteration counter\n",
242 | " ### === code section for adjusting frequency of debugging output. ===\n",
243 | " if iteration == 10:\n",
244 | " print_frequency = 10\n",
245 | " if iteration == 100:\n",
246 | " print_frequency = 100\n",
247 | " if iteration%print_frequency==0:\n",
248 | " print('Iteration = ' + str(iteration))\n",
249 | " ### === end code section ===\n",
250 | " \n",
251 | " # compute the predictions based on feature_matrix and weights using your predict_output() function\n",
252 | "\n",
253 | " # compute the errors as predictions - output\n",
254 | "\n",
255 | " # from time to time, print the value of the cost function\n",
256 | " if iteration%print_frequency==0:\n",
257 | " print 'Cost function = ', str(np.dot(errors,errors) + l2_penalty*(np.dot(weights,weights) - weights[0]**2))\n",
258 | " \n",
259 | " for i in xrange(len(weights)): # loop over each weight\n",
260 | " # Recall that feature_matrix[:,i] is the feature column associated with weights[i]\n",
261 | " # compute the derivative for weight[i].\n",
262 | " #(Remember: when i=0, you are computing the derivative of the constant!)\n",
263 | "\n",
264 | " # subtract the step size times the derivative from the current weight\n",
265 | " \n",
266 | " print 'Done with gradient descent at iteration ', iteration\n",
267 | " print 'Learned weights = ', str(weights)\n",
268 | " return weights"
269 | ]
270 | },
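For reference, a sketch of how the skeleton above might be completed (the debugging printouts are omitted for brevity; this assumes `predict_output` and `feature_derivative_ridge` have been filled in):

```python
def ridge_regression_gradient_descent(feature_matrix, output, initial_weights,
                                      step_size, l2_penalty, max_iterations=100):
    weights = np.array(initial_weights)
    for iteration in xrange(max_iterations):
        # compute predictions and errors once per iteration
        predictions = predict_output(feature_matrix, weights)
        errors = predictions - output
        for i in xrange(len(weights)):
            # i == 0 is the constant term, which we do not regularize
            derivative = feature_derivative_ridge(errors, feature_matrix[:, i],
                                                  weights[i], l2_penalty, i == 0)
            weights[i] -= step_size * derivative
    return weights
```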
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "# Visualizing effect of L2 penalty"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "The L2 penalty gets its name because it causes weights to have small L2 norms than otherwise. Let's see how large weights get penalized. Let us consider a simple model with 1 feature:"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "simple_features = ['sqft_living']\n",
294 | "my_output = 'price'"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "Let us split the dataset into training set and test set. Make sure to use `seed=0`:"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {
308 | "collapsed": true
309 | },
310 | "outputs": [],
311 | "source": [
312 | "train_data,test_data = sales.random_split(.8,seed=0)"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "In this part, we will only use `'sqft_living'` to predict `'price'`. Use the `get_numpy_data` function to get a Numpy versions of your data with only this feature, for both the `train_data` and the `test_data`. "
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {
326 | "collapsed": true
327 | },
328 | "outputs": [],
329 | "source": [
330 | "(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)\n",
331 | "(simple_test_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "Let's set the parameters for our optimization:"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {
345 | "collapsed": true
346 | },
347 | "outputs": [],
348 | "source": [
349 | "initial_weights = np.array([0., 0.])\n",
350 | "step_size = 1e-12\n",
351 | "max_iterations=1000"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "First, let's consider no regularization. Set the `l2_penalty` to `0.0` and run your ridge regression algorithm to learn the weights of your model. Call your weights:\n",
359 | "\n",
360 | "`simple_weights_0_penalty`\n",
361 | "\n",
362 | "we'll use them later."
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": true
370 | },
371 | "outputs": [],
372 | "source": []
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {},
377 | "source": [
378 | "Next, let's consider high regularization. Set the `l2_penalty` to `1e11` and run your ridge regression algorithm to learn the weights of your model. Call your weights:\n",
379 | "\n",
380 | "`simple_weights_high_penalty`\n",
381 | "\n",
382 | "we'll use them later."
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "metadata": {
389 | "collapsed": true
390 | },
391 | "outputs": [],
392 | "source": []
393 | },
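A sketch of the two fits, assuming the gradient descent function above:

```python
# No regularization
simple_weights_0_penalty = ridge_regression_gradient_descent(
    simple_feature_matrix, output, initial_weights, step_size,
    0.0, max_iterations)

# High regularization
simple_weights_high_penalty = ridge_regression_gradient_descent(
    simple_feature_matrix, output, initial_weights, step_size,
    1e11, max_iterations)
```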
394 | {
395 | "cell_type": "markdown",
396 | "metadata": {},
397 | "source": [
398 | "This code will plot the two learned models. (The blue line is for the model with no regularization and the red line is for the one with high regularization.)"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {
405 | "collapsed": true
406 | },
407 | "outputs": [],
408 | "source": [
409 | "import matplotlib.pyplot as plt\n",
410 | "%matplotlib inline\n",
411 | "plt.plot(simple_feature_matrix,output,'k.',\n",
412 | " simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_0_penalty),'b-',\n",
413 | " simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_high_penalty),'r-')"
414 | ]
415 | },
416 | {
417 | "cell_type": "markdown",
418 | "metadata": {},
419 | "source": [
420 | "Compute the RSS on the TEST data for the following three sets of weights:\n",
421 | "1. The initial weights (all zeros)\n",
422 | "2. The weights learned with no regularization\n",
423 | "3. The weights learned with high regularization\n",
424 | "\n",
425 | "Which weights perform best?"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {
432 | "collapsed": true
433 | },
434 | "outputs": [],
435 | "source": []
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": null,
440 | "metadata": {
441 | "collapsed": true
442 | },
443 | "outputs": [],
444 | "source": []
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": null,
449 | "metadata": {
450 | "collapsed": true
451 | },
452 | "outputs": [],
453 | "source": []
454 | },
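A sketch of the RSS comparison on the test data; `rss_on_test` is a helper name introduced here for illustration:

```python
def rss_on_test(weights):
    # Residual sum of squares on the simple-feature test set for a given weight vector
    errors = predict_output(simple_test_feature_matrix, weights) - test_output
    return (errors * errors).sum()

print 'RSS (initial weights):     ', rss_on_test(initial_weights)
print 'RSS (no regularization):   ', rss_on_test(simple_weights_0_penalty)
print 'RSS (high regularization): ', rss_on_test(simple_weights_high_penalty)
```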
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {
458 | "collapsed": false
459 | },
460 | "source": [
461 | "***QUIZ QUESTIONS***\n",
462 | "1. What is the value of the coefficient for `sqft_living` that you learned with no regularization, rounded to 1 decimal place? What about the one with high regularization?\n",
463 | "2. Comparing the lines you fit with the with no regularization versus high regularization, which one is steeper?\n",
464 | "3. What are the RSS on the test data for each of the set of weights above (initial, no regularization, high regularization)? \n"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "# Running a multiple regression with L2 penalty"
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "Let us now consider a model with 2 features: `['sqft_living', 'sqft_living15']`."
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "First, create Numpy versions of your training and test data with these two features. "
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": null,
491 | "metadata": {
492 | "collapsed": true
493 | },
494 | "outputs": [],
495 | "source": [
496 | "model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. \n",
497 | "my_output = 'price'\n",
498 | "(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)\n",
499 | "(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)"
500 | ]
501 | },
502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "We need to re-inialize the weights, since we have one extra parameter. Let us also set the step size and maximum number of iterations."
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": null,
512 | "metadata": {
513 | "collapsed": true
514 | },
515 | "outputs": [],
516 | "source": [
517 | "initial_weights = np.array([0.0,0.0,0.0])\n",
518 | "step_size = 1e-12\n",
519 | "max_iterations = 1000"
520 | ]
521 | },
522 | {
523 | "cell_type": "markdown",
524 | "metadata": {},
525 | "source": [
526 | "First, let's consider no regularization. Set the `l2_penalty` to `0.0` and run your ridge regression algorithm to learn the weights of your model. Call your weights:\n",
527 | "\n",
528 | "`multiple_weights_0_penalty`"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": null,
534 | "metadata": {
535 | "collapsed": true
536 | },
537 | "outputs": [],
538 | "source": []
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "Next, let's consider high regularization. Set the `l2_penalty` to `1e11` and run your ridge regression algorithm to learn the weights of your model. Call your weights:\n",
545 | "\n",
546 | "`multiple_weights_high_penalty`"
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "execution_count": null,
552 | "metadata": {
553 | "collapsed": true
554 | },
555 | "outputs": [],
556 | "source": []
557 | },
558 | {
559 | "cell_type": "markdown",
560 | "metadata": {},
561 | "source": [
562 | "Compute the RSS on the TEST data for the following three sets of weights:\n",
563 | "1. The initial weights (all zeros)\n",
564 | "2. The weights learned with no regularization\n",
565 | "3. The weights learned with high regularization\n",
566 | "\n",
567 | "Which weights perform best?"
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": null,
573 | "metadata": {
574 | "collapsed": true
575 | },
576 | "outputs": [],
577 | "source": []
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": null,
582 | "metadata": {
583 | "collapsed": true
584 | },
585 | "outputs": [],
586 | "source": []
587 | },
588 | {
589 | "cell_type": "code",
590 | "execution_count": null,
591 | "metadata": {
592 | "collapsed": true
593 | },
594 | "outputs": [],
595 | "source": []
596 | },
597 | {
598 | "cell_type": "markdown",
599 | "metadata": {},
600 | "source": [
601 | "Predict the house price for the 1st house in the test set using the no regularization and high regularization models. (Remember that python starts indexing from 0.) How far is the prediction from the actual price? Which weights perform best for the 1st house?"
602 | ]
603 | },
604 | {
605 | "cell_type": "code",
606 | "execution_count": null,
607 | "metadata": {
608 | "collapsed": true
609 | },
610 | "outputs": [],
611 | "source": []
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": null,
616 | "metadata": {
617 | "collapsed": false
618 | },
619 | "outputs": [],
620 | "source": []
621 | },
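A sketch of the comparison for the first test house, assuming the multiple-regression weights learned above (`multiple_weights_0_penalty`, `multiple_weights_high_penalty`):

```python
first_house = test_feature_matrix[0]     # feature row for the 1st test house
actual_price = test_output[0]

pred_no_reg = np.dot(first_house, multiple_weights_0_penalty)
pred_high_reg = np.dot(first_house, multiple_weights_high_penalty)

print 'Actual price:        ', actual_price
print 'No regularization:   ', pred_no_reg, ' error:', abs(pred_no_reg - actual_price)
print 'High regularization: ', pred_high_reg, ' error:', abs(pred_high_reg - actual_price)
```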
622 | {
623 | "cell_type": "markdown",
624 | "metadata": {
625 | "collapsed": true
626 | },
627 | "source": [
628 | "***QUIZ QUESTIONS***\n",
629 | "1. What is the value of the coefficient for `sqft_living` that you learned with no regularization, rounded to 1 decimal place? What about the one with high regularization?\n",
630 | "2. What are the RSS on the test data for each of the set of weights above (initial, no regularization, high regularization)? \n",
631 | "3. We make prediction for the first house in the test set using two sets of weights (no regularization vs high regularization). Which weights make better prediction for that particular house?"
632 | ]
633 | },
634 | {
635 | "cell_type": "code",
636 | "execution_count": null,
637 | "metadata": {
638 | "collapsed": true
639 | },
640 | "outputs": [],
641 | "source": []
642 | }
643 | ],
644 | "metadata": {
645 | "kernelspec": {
646 | "display_name": "Python 2",
647 | "language": "python",
648 | "name": "python2"
649 | },
650 | "language_info": {
651 | "codemirror_mode": {
652 | "name": "ipython",
653 | "version": 2
654 | },
655 | "file_extension": ".py",
656 | "mimetype": "text/x-python",
657 | "name": "python",
658 | "nbconvert_exporter": "python",
659 | "pygments_lexer": "ipython2",
660 | "version": "2.7.11"
661 | }
662 | },
663 | "nbformat": 4,
664 | "nbformat_minor": 0
665 | }
666 |
--------------------------------------------------------------------------------
/course-2/week-5-lasso-assignment-1-blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression Week 5: Feature Selection and LASSO (Interpretation)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using GraphLab Create, though you can use other solvers). You will:\n",
15 | "* Run LASSO with different L1 penalties.\n",
16 | "* Choose best L1 penalty using a validation set.\n",
17 | "* Choose best L1 penalty using a validation set, with additional constraint on the size of subset.\n",
18 | "\n",
19 | "In the second notebook, you will implement your own LASSO solver, using coordinate descent. "
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# Fire up Graphlab Create"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": null,
32 | "metadata": {
33 | "collapsed": true
34 | },
35 | "outputs": [],
36 | "source": [
37 | "import graphlab"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "# Load in house sales data\n",
45 | "\n",
46 | "Dataset is from house sales in King County, the region where the city of Seattle, WA is located."
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {
53 | "collapsed": false,
54 | "scrolled": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "sales = graphlab.SFrame('kc_house_data.gl/')"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "# Create new features"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "As in Week 2, we consider features that are some transformations of inputs."
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "collapsed": true
80 | },
81 | "outputs": [],
82 | "source": [
83 | "from math import log, sqrt\n",
84 | "sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)\n",
85 | "sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)\n",
86 | "sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']\n",
87 | "\n",
88 | "# In the dataset, 'floors' was defined with type string, \n",
89 | "# so we'll convert them to float, before creating a new feature.\n",
90 | "sales['floors'] = sales['floors'].astype(float) \n",
91 | "sales['floors_square'] = sales['floors']*sales['floors']"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "* Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.\n",
99 | "* On the other hand, taking square root of sqft_living will decrease the separation between big house and small house. The owner may not be exactly twice as happy for getting a house that is twice as big."
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "# Learn regression weights with L1 penalty"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "Let us fit a model with all the features available, plus the features we just created above."
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {
120 | "collapsed": false
121 | },
122 | "outputs": [],
123 | "source": [
124 | "all_features = ['bedrooms', 'bedrooms_square',\n",
125 | " 'bathrooms',\n",
126 | " 'sqft_living', 'sqft_living_sqrt',\n",
127 | " 'sqft_lot', 'sqft_lot_sqrt',\n",
128 | " 'floors', 'floors_square',\n",
129 | " 'waterfront', 'view', 'condition', 'grade',\n",
130 | " 'sqft_above',\n",
131 | " 'sqft_basement',\n",
132 | " 'yr_built', 'yr_renovated']"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Applying L1 penalty requires adding an extra parameter (`l1_penalty`) to the linear regression call in GraphLab Create. (Other tools may have separate implementations of LASSO.) Note that it's important to set `l2_penalty=0` to ensure we don't introduce an additional L2 penalty."
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [],
149 | "source": [
150 | "model_all = graphlab.linear_regression.create(sales, target='price', features=all_features,\n",
151 | " validation_set=None, \n",
152 | " l2_penalty=0., l1_penalty=1e10)"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Find what features had non-zero weight."
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {
166 | "collapsed": true
167 | },
168 | "outputs": [],
169 | "source": []
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "Note that a majority of the weights have been set to zero. So by setting an L1 penalty that's large enough, we are performing a subset selection. \n",
176 | "\n",
177 | "***QUIZ QUESTION***:\n",
178 | "According to this list of weights, which of the features have been chosen? "
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "# Selecting an L1 penalty"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "To find a good L1 penalty, we will explore multiple values using a validation set. Let us do three way split into train, validation, and test sets:\n",
193 | "* Split our sales data into 2 sets: training and test\n",
194 | "* Further split our training data into two sets: train, validation\n",
195 | "\n",
196 | "Be *very* careful that you use seed = 1 to ensure you get the same answer!"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {
203 | "collapsed": true
204 | },
205 | "outputs": [],
206 | "source": [
207 | "(training_and_validation, testing) = sales.random_split(.9,seed=1) # initial train/test split\n",
208 | "(training, validation) = training_and_validation.random_split(0.5, seed=1) # split training into train and validate"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "Next, we write a loop that does the following:\n",
216 | "* For `l1_penalty` in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, type `np.logspace(1, 7, num=13)`.)\n",
217 | " * Fit a regression model with a given `l1_penalty` on TRAIN data. Specify `l1_penalty=l1_penalty` and `l2_penalty=0.` in the parameter list.\n",
218 | " * Compute the RSS on VALIDATION data (here you will want to use `.predict()`) for that `l1_penalty`\n",
219 | "* Report which `l1_penalty` produced the lowest RSS on validation data.\n",
220 | "\n",
221 | "When you call `linear_regression.create()` make sure you set `validation_set = None`.\n",
222 | "\n",
223 | "Note: you can turn off the print out of `linear_regression.create()` with `verbose = False`"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {
230 | "collapsed": true
231 | },
232 | "outputs": [],
233 | "source": []
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": null,
238 | "metadata": {
239 | "collapsed": false
240 | },
241 | "outputs": [],
242 | "source": []
243 | },
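A sketch of the validation loop described above; `best_l1` and `best_rss` are illustrative names:

```python
import numpy as np

best_l1, best_rss = None, None
for l1_penalty in np.logspace(1, 7, num=13):
    model = graphlab.linear_regression.create(training, target='price',
                                              features=all_features,
                                              validation_set=None, verbose=False,
                                              l2_penalty=0., l1_penalty=l1_penalty)
    errors = model.predict(validation) - validation['price']
    rss = (errors * errors).sum()
    print l1_penalty, rss
    if best_rss is None or rss < best_rss:
        best_l1, best_rss = l1_penalty, rss

print 'Best l1_penalty:', best_l1
```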
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "*** QUIZ QUESTION. *** What was the best value for the `l1_penalty`?"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {
255 | "collapsed": false
256 | },
257 | "outputs": [],
258 | "source": []
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "***QUIZ QUESTION***\n",
265 | "Also, using this value of L1 penalty, how many nonzero weights do you have?"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {
272 | "collapsed": false
273 | },
274 | "outputs": [],
275 | "source": []
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "# Limit the number of nonzero weights\n",
282 | "\n",
283 | "What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive \"a rule of thumb\" --- an interpretable model that has only a few features in them."
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "In this section, you are going to implement a simple, two phase procedure to achive this goal:\n",
291 | "1. Explore a large range of `l1_penalty` values to find a narrow region of `l1_penalty` values where models are likely to have the desired number of non-zero weights.\n",
292 | "2. Further explore the narrow region you found to find a good value for `l1_penalty` that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for `l1_penalty`."
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {
299 | "collapsed": true
300 | },
301 | "outputs": [],
302 | "source": [
303 | "max_nonzeros = 7"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "## Exploring the larger range of values to find a narrow range with the desired sparsity\n",
311 | "\n",
312 | "Let's define a wide range of possible `l1_penalty_values`:"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {
319 | "collapsed": false
320 | },
321 | "outputs": [],
322 | "source": [
323 | "l1_penalty_values = np.logspace(8, 10, num=20)"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "Now, implement a loop that search through this space of possible `l1_penalty` values:\n",
331 | "\n",
332 | "* For `l1_penalty` in `np.logspace(8, 10, num=20)`:\n",
333 | " * Fit a regression model with a given `l1_penalty` on TRAIN data. Specify `l1_penalty=l1_penalty` and `l2_penalty=0.` in the parameter list. When you call `linear_regression.create()` make sure you set `validation_set = None`\n",
334 | " * Extract the weights of the model and count the number of nonzeros. Save the number of nonzeros to a list.\n",
335 | " * *Hint: `model['coefficients']['value']` gives you an SArray with the parameters you learned. If you call the method `.nnz()` on it, you will find the number of non-zero parameters!* "
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "collapsed": true
343 | },
344 | "outputs": [],
345 | "source": []
346 | },
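A sketch of the exploration loop, using the `.nnz()` hint above:

```python
nonzeros = []
for l1_penalty in l1_penalty_values:
    model = graphlab.linear_regression.create(training, target='price',
                                              features=all_features,
                                              validation_set=None, verbose=False,
                                              l2_penalty=0., l1_penalty=l1_penalty)
    # count the non-zero coefficients for this penalty
    nonzeros.append((l1_penalty, model['coefficients']['value'].nnz()))

print nonzeros
```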
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "Out of this large range, we want to find the two ends of our desired narrow range of `l1_penalty`. At one end, we will have `l1_penalty` values that have too few non-zeros, and at the other end, we will have an `l1_penalty` that has too many non-zeros. \n",
352 | "\n",
353 | "More formally, find:\n",
354 | "* The largest `l1_penalty` that has more non-zeros than `max_nonzeros` (if we pick a penalty smaller than this value, we will definitely have too many non-zero weights)\n",
355 | " * Store this value in the variable `l1_penalty_min` (we will use it later)\n",
356 | "* The smallest `l1_penalty` that has fewer non-zeros than `max_nonzeros` (if we pick a penalty larger than this value, we will definitely have too few non-zero weights)\n",
357 | " * Store this value in the variable `l1_penalty_max` (we will use it later)\n",
358 | "\n",
359 | "\n",
360 | "*Hint: there are many ways to do this, e.g.:*\n",
361 | "* Programmatically within the loop above\n",
362 | "* Creating a list with the number of non-zeros for each value of `l1_penalty` and inspecting it to find the appropriate boundaries."
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": true
370 | },
371 | "outputs": [],
372 | "source": [
373 | "l1_penalty_min = \n",
374 | "l1_penalty_max = "
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "***QUIZ QUESTION.*** What values did you find for `l1_penalty_min` and `l1_penalty_max`, respectively? "
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 | "## Exploring the narrow range of values to find the solution with the right number of non-zeros that has lowest RSS on the validation set \n",
389 | "\n",
390 | "We will now explore the narrow region of `l1_penalty` values we found:"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "metadata": {
397 | "collapsed": true
398 | },
399 | "outputs": [],
400 | "source": [
401 | "l1_penalty_values = np.linspace(l1_penalty_min,l1_penalty_max,20)"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "* For `l1_penalty` in `np.linspace(l1_penalty_min,l1_penalty_max,20)`:\n",
409 | " * Fit a regression model with a given `l1_penalty` on TRAIN data. Specify `l1_penalty=l1_penalty` and `l2_penalty=0.` in the parameter list. When you call `linear_regression.create()` make sure you set `validation_set = None`\n",
410 | " * Measure the RSS of the learned model on the VALIDATION set\n",
411 | "\n",
412 | "Find the model that the lowest RSS on the VALIDATION set and has sparsity *equal* to `max_nonzeros`."
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {
419 | "collapsed": true
420 | },
421 | "outputs": [],
422 | "source": []
423 | },
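A sketch of the narrow-range search, keeping only models with exactly `max_nonzeros` non-zero weights; `best_l1` and `best_rss` are illustrative names:

```python
best_l1, best_rss = None, None
for l1_penalty in l1_penalty_values:
    model = graphlab.linear_regression.create(training, target='price',
                                              features=all_features,
                                              validation_set=None, verbose=False,
                                              l2_penalty=0., l1_penalty=l1_penalty)
    if model['coefficients']['value'].nnz() != max_nonzeros:
        continue   # skip models that don't have the desired sparsity
    errors = model.predict(validation) - validation['price']
    rss = (errors * errors).sum()
    if best_rss is None or rss < best_rss:
        best_l1, best_rss = l1_penalty, rss

print 'Best l1_penalty with', max_nonzeros, 'non-zeros:', best_l1
```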
424 | {
425 | "cell_type": "markdown",
426 | "metadata": {},
427 | "source": [
428 | "***QUIZ QUESTIONS***\n",
429 | "1. What value of `l1_penalty` in our narrow range has the lowest RSS on the VALIDATION set and has sparsity *equal* to `max_nonzeros`?\n",
430 | "2. What features in this model have non-zero coefficients?"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": null,
436 | "metadata": {
437 | "collapsed": false
438 | },
439 | "outputs": [],
440 | "source": []
441 | }
442 | ],
443 | "metadata": {
444 | "kernelspec": {
445 | "display_name": "Python 2",
446 | "language": "python",
447 | "name": "python2"
448 | },
449 | "language_info": {
450 | "codemirror_mode": {
451 | "name": "ipython",
452 | "version": 2
453 | },
454 | "file_extension": ".py",
455 | "mimetype": "text/x-python",
456 | "name": "python",
457 | "nbconvert_exporter": "python",
458 | "pygments_lexer": "ipython2",
459 | "version": "2.7.11"
460 | }
461 | },
462 | "nbformat": 4,
463 | "nbformat_minor": 0
464 | }
465 |
--------------------------------------------------------------------------------
/course-3/amazon_baby.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:25b07a853547c3af6408cc0c18435e7c36a352163144a4ffc525b019387b1b7a
3 | size 42295364
4 |
--------------------------------------------------------------------------------
/course-3/amazon_baby_subset.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:b8b313518cc52abddcf626097ec44bd5f4e94b7db093b332f9f6fe148c22dbcc
3 | size 13343615
4 |
--------------------------------------------------------------------------------
/course-3/important_words.json:
--------------------------------------------------------------------------------
1 | ["baby", "one", "great", "love", "use", "would", "like", "easy", "little", "seat", "old", "well", "get", "also", "really", "son", "time", "bought", "product", "good", "daughter", "much", "loves", "stroller", "put", "months", "car", "still", "back", "used", "recommend", "first", "even", "perfect", "nice", "bag", "two", "using", "got", "fit", "around", "diaper", "enough", "month", "price", "go", "could", "soft", "since", "buy", "room", "works", "made", "child", "keep", "size", "small", "need", "year", "big", "make", "take", "easily", "think", "crib", "clean", "way", "quality", "thing", "better", "without", "set", "new", "every", "cute", "best", "bottles", "work", "purchased", "right", "lot", "side", "happy", "comfortable", "toy", "able", "kids", "bit", "night", "long", "fits", "see", "us", "another", "play", "day", "money", "monitor", "tried", "thought", "never", "item", "hard", "plastic", "however", "disappointed", "reviews", "something", "going", "pump", "bottle", "cup", "waste", "return", "amazon", "different", "top", "want", "problem", "know", "water", "try", "received", "sure", "times", "chair", "find", "hold", "gate", "open", "bottom", "away", "actually", "cheap", "worked", "getting", "ordered", "came", "milk", "bad", "part", "worth", "found", "cover", "many", "design", "looking", "weeks", "say", "wanted", "look", "place", "purchase", "looks", "second", "piece", "box", "pretty", "trying", "difficult", "together", "though", "give", "started", "anything", "last", "company", "come", "returned", "maybe", "took", "broke", "makes", "stay", "instead", "idea", "head", "said", "less", "went", "working", "high", "unit", "seems", "picture", "completely", "wish", "buying", "babies", "won", "tub", "almost", "either"]
--------------------------------------------------------------------------------
/course-3/lending-club-data.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:b1f35902354f19610bc8065b51fffe9a6b33d9ca4dde06aa3314e382e5b707a0
3 | size 20322341
4 |
--------------------------------------------------------------------------------
/course-3/numpy-arrays/module-10-assignment-numpy-arrays.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/learnml/machine-learning-specialization/d8159cb9266e3c337b897de27dbd600d440cb870/course-3/numpy-arrays/module-10-assignment-numpy-arrays.npz
--------------------------------------------------------------------------------
/course-3/numpy-arrays/module-3-assignment-numpy-arrays.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/learnml/machine-learning-specialization/d8159cb9266e3c337b897de27dbd600d440cb870/course-3/numpy-arrays/module-3-assignment-numpy-arrays.npz
--------------------------------------------------------------------------------
/course-3/numpy-arrays/module-4-assignment-numpy-arrays.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/learnml/machine-learning-specialization/d8159cb9266e3c337b897de27dbd600d440cb870/course-3/numpy-arrays/module-4-assignment-numpy-arrays.npz
--------------------------------------------------------------------------------
/course-4/4_em-with-text-data_blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Fitting a diagonal covariance Gaussian mixture model to text data\n",
8 | "\n",
9 | "In a previous assignment, we explored k-means clustering for a high-dimensional Wikipedia dataset. We can also model this data with a mixture of Gaussians, though with increasing dimension we run into two important issues associated with using a full covariance matrix for each component.\n",
10 | " * Computational cost becomes prohibitive in high dimensions: score calculations have complexity cubic in the number of dimensions M if the Gaussian has a full covariance matrix.\n",
11 | " * A model with many parameters requires more data: observe that a full covariance matrix for an M-dimensional Gaussian will have M(M+1)/2 parameters to fit. With the number of parameters growing roughly as the square of the dimension, it may quickly become impossible to find a sufficient amount of data to make good inferences.\n",
12 | "\n",
13 | "Both of these issues are avoided if we require the covariance matrix of each component to be diagonal, as then it has only M parameters to fit and the score computation decomposes into M univariate score calculations. Recall from the lecture that the M-step for the full covariance is:\n",
14 | "\n",
15 | "\\begin{align*}\n",
16 | "\\hat{\\Sigma}_k &= \\frac{1}{N_k^{soft}} \\sum_{i=1}^N r_{ik} (x_i-\\hat{\\mu}_k)(x_i - \\hat{\\mu}_k)^T\n",
17 | "\\end{align*}\n",
18 | "\n",
19 | "Note that this is a square matrix with M rows and M columns, and the above equation implies that the (v, w) element is computed by\n",
20 | "\n",
21 | "\\begin{align*}\n",
22 | "\\hat{\\Sigma}_{k, v, w} &= \\frac{1}{N_k^{soft}} \\sum_{i=1}^N r_{ik} (x_{iv}-\\hat{\\mu}_{kv})(x_{iw} - \\hat{\\mu}_{kw})\n",
23 | "\\end{align*}\n",
24 | "\n",
25 | "When we assume that this is a diagonal matrix, the off-diagonal elements are zero by assumption and we only need to compute each of the M diagonal elements independently, using the following equation.\n",
26 | "\n",
27 | "\\begin{align*}\n",
28 | "\\hat{\\sigma}^2_{k, v} &= \\hat{\\Sigma}_{k, v, v} \\\\\n",
29 | "&= \\frac{1}{N_k^{soft}} \\sum_{i=1}^N r_{ik} (x_{iv}-\\hat{\\mu}_{kv})^2\n",
30 | "\\end{align*}\n",
31 | "\n",
32 | "In this section, we will use an EM implementation to fit a Gaussian mixture model with **diagonal** covariances to a subset of the Wikipedia dataset. The implementation uses the above equation to compute each variance term. \n",
33 | "\n",
34 | "We'll begin by importing the dataset and coming up with a useful representation for each article. After running our algorithm on the data, we will explore the output to see whether we can give a meaningful interpretation to the fitted parameters in our model."
35 | ]
36 | },
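  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the diagonal variance update above concrete, here is a small dense-NumPy sketch (illustrative only; the provided `em_utilities.py` performs the equivalent computation on sparse matrices). The names `X`, `resp_k`, and `mu_k` are placeholders for an N-by-M data matrix, the responsibilities $r_{ik}$ of one cluster $k$, and the updated mean $\\hat{\\mu}_k$.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def diagonal_variance_update(X, resp_k, mu_k):\n",
    "    # N_k^soft = sum_i r_ik\n",
    "    Nk_soft = resp_k.sum()\n",
    "    # sigma^2_{k,v} = (1/N_k^soft) * sum_i r_ik * (x_iv - mu_kv)^2, one value per dimension v\n",
    "    return (resp_k[:, np.newaxis] * (X - mu_k)**2).sum(axis=0) / Nk_soft\n",
    "```"
   ]
  },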
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "**Note to Amazon EC2 users**: To conserve memory, make sure to stop all the other notebooks before running this notebook."
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Import necessary packages"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "The following code block will check if you have the correct version of GraphLab Create. Any version 1.8.5 or later will do. To upgrade, read [this page](https://turi.com/download/upgrade-graphlab-create.html)."
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {
62 | "collapsed": false
63 | },
64 | "outputs": [],
65 | "source": [
66 | "import graphlab\n",
67 | "\n",
68 | "'''Check GraphLab Create version'''\n",
69 | "from distutils.version import StrictVersion\n",
70 | "assert (StrictVersion(graphlab.version) >= StrictVersion('1.8.5')), 'GraphLab Create must be version 1.8.5 or later.'"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "We also have a Python file containing implementations for several functions that will be used during the course of this assignment."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {
84 | "collapsed": false
85 | },
86 | "outputs": [],
87 | "source": [
88 | "from em_utilities import *"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "## Load Wikipedia data and extract TF-IDF features"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Load Wikipedia data and transform each of the first 5000 documents into a TF-IDF representation."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [],
112 | "source": [
113 | "wiki = graphlab.SFrame('people_wiki.gl/').head(5000)\n",
114 | "wiki['tf_idf'] = graphlab.text_analytics.tf_idf(wiki['text'])"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Using a utility we provide, we will create a sparse matrix representation of the documents. This is the same utility function you used during the previous assignment on k-means with text data."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {
128 | "collapsed": false
129 | },
130 | "outputs": [],
131 | "source": [
132 | "tf_idf, map_index_to_word = sframe_to_scipy(wiki, 'tf_idf')"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "As in the previous assignment, we will normalize each document's TF-IDF vector to be a unit vector. "
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [],
149 | "source": [
150 | "tf_idf = normalize(tf_idf)"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "We can check that the length (Euclidean norm) of each row is now 1.0, as expected."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {
164 | "collapsed": false,
165 | "scrolled": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "for i in range(5):\n",
170 | " doc = tf_idf[i]\n",
171 | " print(np.linalg.norm(doc.todense()))"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "## EM in high dimensions\n",
179 | "\n",
180 | "EM for high-dimensional data requires some special treatment:\n",
181 | " * E step and M step must be vectorized as much as possible, as explicit loops are dreadfully slow in Python.\n",
182 | " * All operations must be cast in terms of sparse matrix operations, to take advantage of computational savings enabled by sparsity of data.\n",
183 | " * Initially, some words may be entirely absent from a cluster, causing the M step to produce zero mean and variance for those words. This means any data point with one of those words will have 0 probability of being assigned to that cluster since the cluster allows for no variability (0 variance) around that count being 0 (0 mean). Since there is a small chance for those words to later appear in the cluster, we instead assign a small positive variance (~1e-10). Doing so also prevents numerical overflow.\n",
184 | " \n",
185 | "We provide the complete implementation for you in the file `em_utilities.py`. For those who are interested, you can read through the code to see how the sparse matrix implementation differs from the previous assignment. \n",
186 | "\n",
187 | "You are expected to answer some quiz questions using the results of clustering."
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "**Initializing mean parameters using k-means**\n",
195 | "\n",
196 | "Recall from the lectures that EM for Gaussian mixtures is very sensitive to the choice of initial means. With a bad initial set of means, EM may produce clusters that span a large area and are mostly overlapping. To eliminate such bad outcomes, we first produce a suitable set of initial means by using the cluster centers from running k-means. That is, we first run k-means and then take the final set of means from the converged solution as the initial means in our EM algorithm."
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {
203 | "collapsed": false
204 | },
205 | "outputs": [],
206 | "source": [
207 | "from sklearn.cluster import KMeans\n",
208 | "\n",
209 | "np.random.seed(5)\n",
210 | "num_clusters = 25\n",
211 | "\n",
212 | "# Use scikit-learn's k-means to simplify workflow\n",
213 | "#kmeans_model = KMeans(n_clusters=num_clusters, n_init=5, max_iter=400, random_state=1, n_jobs=-1) # uncomment to use parallelism -- may break on your installation\n",
214 | "kmeans_model = KMeans(n_clusters=num_clusters, n_init=5, max_iter=400, random_state=1, n_jobs=1)\n",
215 | "kmeans_model.fit(tf_idf)\n",
216 | "centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_\n",
217 | "\n",
218 | "means = [centroid for centroid in centroids]"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "**Initializing cluster weights**\n",
226 | "\n",
227 | "We will initialize each cluster weight to be the proportion of documents assigned to that cluster by k-means above."
228 | ]
229 | },
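  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hedged hint for the blank below, assuming `cluster_assignment` is the NumPy label array produced by k-means above (one possible approach, not the only one):\n",
    "\n",
    "```python\n",
    "# one way to count the members of cluster i (sketch)\n",
    "num_assigned = np.sum(cluster_assignment == i)\n",
    "```"
   ]
  },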
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {
234 | "collapsed": false
235 | },
236 | "outputs": [],
237 | "source": [
238 | "num_docs = tf_idf.shape[0]\n",
239 | "weights = []\n",
240 | "for i in xrange(num_clusters):\n",
241 | " # Compute the number of data points assigned to cluster i:\n",
242 | " num_assigned = ... # YOUR CODE HERE\n",
243 | " w = float(num_assigned) / num_docs\n",
244 | " weights.append(w)"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "**Initializing covariances**\n",
252 | "\n",
253 | "To initialize our covariance parameters, we compute $\\hat{\\sigma}_{k, j}^2 = \\frac{1}{N_k}\\sum_{i \\in C_k}(x_{i,j} - \\hat{\\mu}_{k, j})^2$ for each feature $j$, where $C_k$ is the set of $N_k$ documents that k-means assigned to cluster $k$. For features with really tiny variances, we assign 1e-8 instead to prevent numerical instability. We do this computation in a vectorized fashion in the following code block."
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {
260 | "collapsed": false
261 | },
262 | "outputs": [],
263 | "source": [
264 | "covs = []\n",
265 | "for i in xrange(num_clusters):\n",
266 | " member_rows = tf_idf[cluster_assignment==i]\n",
267 | " cov = (member_rows.multiply(member_rows) - 2*member_rows.dot(diag(means[i]))).sum(axis=0).A1 / member_rows.shape[0] \\\n",
268 | " + means[i]**2\n",
269 | " cov[cov < 1e-8] = 1e-8\n",
270 | " covs.append(cov)"
271 | ]
272 | },
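  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, the vectorized sparse expression above is just the expansion $(x_{ij} - \\hat{\\mu}_{k,j})^2 = x_{ij}^2 - 2 x_{ij}\\hat{\\mu}_{k,j} + \\hat{\\mu}_{k,j}^2$ averaged over the members of the cluster. The dense sketch below computes the same per-feature quantity for one cluster `i` inside the loop above; it is illustrative only (densifying the full matrix would be wasteful), and `cov_check` is just a placeholder name.\n",
    "\n",
    "```python\n",
    "# dense equivalent of the sparse computation above (sketch)\n",
    "member_dense = np.asarray(member_rows.todense())\n",
    "cov_check = ((member_dense - means[i])**2).mean(axis=0)\n",
    "```"
   ]
  },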
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "**Running EM**\n",
278 | "\n",
279 | "Now that we have initialized all of our parameters, run EM."
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {
286 | "collapsed": false,
287 | "scrolled": true
288 | },
289 | "outputs": [],
290 | "source": [
291 | "out = EM_for_high_dimension(tf_idf, means, covs, weights, cov_smoothing=1e-10)"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {
298 | "collapsed": false
299 | },
300 | "outputs": [],
301 | "source": [
302 | "out['loglik']"
303 | ]
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "## Interpret clustering results"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "In contrast to k-means, EM is able to explicitly model clusters of varying sizes and proportions. The relative magnitude of the variances in the word dimensions tells us much about the nature of the clusters.\n",
317 | "\n",
318 | "Write yourself a cluster visualizer as follows. Examining each cluster's mean vector, list the 5 words with the largest mean values (5 most common words in the cluster). For each word, also include the associated variance parameter (diagonal element of the covariance matrix). \n",
319 | "\n",
320 | "A sample output may be:\n",
321 | "```\n",
322 | "==========================================================\n",
323 | "Cluster 0: Largest mean parameters in cluster \n",
324 | "\n",
325 | "Word Mean Variance \n",
326 | "football 1.08e-01 8.64e-03\n",
327 | "season 5.80e-02 2.93e-03\n",
328 | "club 4.48e-02 1.99e-03\n",
329 | "league 3.94e-02 1.08e-03\n",
330 | "played 3.83e-02 8.45e-04\n",
331 | "...\n",
332 | "```"
333 | ]
334 | },
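  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hedged hint for the `sorted_word_ids` blank below (one possible approach): `np.argsort` sorts in ascending order, so reversing its output ranks words by descending mean.\n",
    "\n",
    "```python\n",
    "# one way to rank word indices by descending cluster mean (sketch)\n",
    "sorted_word_ids = np.argsort(means[c])[::-1]\n",
    "```"
   ]
  },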
335 | {
336 | "cell_type": "code",
337 | "execution_count": null,
338 | "metadata": {
339 | "collapsed": true
340 | },
341 | "outputs": [],
342 | "source": [
343 | "# Fill in the blanks\n",
344 | "def visualize_EM_clusters(tf_idf, means, covs, map_index_to_word):\n",
345 | " print('')\n",
346 | " print('==========================================================')\n",
347 | "\n",
348 | " num_clusters = len(means)\n",
349 | " for c in xrange(num_clusters):\n",
350 | " print('Cluster {0:d}: Largest mean parameters in cluster '.format(c))\n",
351 | " print('\\n{0: <12}{1: <12}{2: <12}'.format('Word', 'Mean', 'Variance'))\n",
352 | " \n",
353 | " # The k'th element of sorted_word_ids should be the index of the word \n",
354 | " # that has the k'th-largest value in the cluster mean. Hint: Use np.argsort().\n",
355 | " sorted_word_ids = ... # YOUR CODE HERE\n",
356 | "\n",
357 | " for i in sorted_word_ids[:5]:\n",
358 | "            print('{0: <12}{1:<10.2e}{2:10.2e}'.format(map_index_to_word['category'][i],\n",
359 | "                                                       means[c][i],\n",
360 | "                                                       covs[c][i]))\n",
361 | "        print('\\n==========================================================')"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {
368 | "collapsed": false
369 | },
370 | "outputs": [],
371 | "source": [
372 | "'''By EM'''\n",
373 | "visualize_EM_clusters(tf_idf, out['means'], out['covs'], map_index_to_word)"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "**Quiz Question**. Select all the topics that have a cluster in the model created above. [multiple choice]"
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "## Comparing to random initialization"
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "metadata": {
393 | "collapsed": false
394 | },
395 | "source": [
396 | "Create variables for randomly initializing the EM algorithm. Complete the following code block."
397 | ]
398 | },
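  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A possible sketch for the blanks below, following the comments in the code cell (one reasonable way to draw these, not the only one):\n",
    "\n",
    "```python\n",
    "# standard-normal means, variances uniform in [1, 5], equal weights (sketch)\n",
    "mean = np.random.randn(num_words)\n",
    "cov = np.random.uniform(1, 5, num_words)\n",
    "weight = 1.0 / num_clusters\n",
    "```"
   ]
  },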
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {
403 | "collapsed": true
404 | },
405 | "outputs": [],
406 | "source": [
407 | "np.random.seed(5) # See the note below for why we set seed=5.\n",
408 | "num_clusters = len(means)\n",
409 | "num_docs, num_words = tf_idf.shape\n",
410 | "\n",
411 | "random_means = []\n",
412 | "random_covs = []\n",
413 | "random_weights = []\n",
414 | "\n",
415 | "for k in range(num_clusters):\n",
416 | " \n",
417 | " # Create a numpy array of length num_words with random normally distributed values.\n",
418 | " # Use the standard univariate normal distribution (mean 0, variance 1).\n",
419 | " # YOUR CODE HERE\n",
420 | " mean = ...\n",
421 | " \n",
422 | " # Create a numpy array of length num_words with random values uniformly distributed between 1 and 5.\n",
423 | " # YOUR CODE HERE\n",
424 | " cov = ...\n",
425 | "\n",
426 | " # Initially give each cluster equal weight.\n",
427 | " # YOUR CODE HERE\n",
428 | " weight = ...\n",
429 | " \n",
430 | " random_means.append(mean)\n",
431 | " random_covs.append(cov)\n",
432 | " random_weights.append(weight)"
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "**Quiz Question**: Try fitting EM with the random initial parameters you created above. (Use `cov_smoothing=1e-5`.) Store the result in `out_random_init`. What is the final loglikelihood that the algorithm converges to?"
440 | ]
441 | },
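  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to run this, as a sketch: the call mirrors the earlier `EM_for_high_dimension` invocation, and `out_random_init` is the variable name requested above.\n",
    "\n",
    "```python\n",
    "# fit EM starting from the random initialization (sketch)\n",
    "out_random_init = EM_for_high_dimension(tf_idf, random_means, random_covs, random_weights,\n",
    "                                        cov_smoothing=1e-5)\n",
    "print(out_random_init['loglik'][-1])  # final loglikelihood in the trace\n",
    "```"
   ]
  },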
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {
446 | "collapsed": true
447 | },
448 | "outputs": [],
449 | "source": []
450 | },
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "**Quiz Question:** Is the final loglikelihood larger or smaller than the final loglikelihood we obtained above when initializing EM with the results from running k-means?"
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": null,
461 | "metadata": {
462 | "collapsed": true
463 | },
464 | "outputs": [],
465 | "source": []
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "**Quiz Question**: For the above model, `out_random_init`, use the `visualize_EM_clusters` method you created above. Are the clusters more or less interpretable than the ones found after initializing using k-means?"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {
478 | "collapsed": true
479 | },
480 | "outputs": [],
481 | "source": [
482 | "# YOUR CODE HERE. Use visualize_EM_clusters, which will require you to pass in tf_idf and map_index_to_word.\n",
483 | "..."
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {
489 | "collapsed": true
490 | },
491 | "source": [
492 | "**Note**: Random initialization may sometimes produce a better fit than k-means initialization. We do not claim that random initialization is always worse. However, this section does illustrate that random initialization often produces much worse clustering than its k-means counterpart. This is why we provide the particular random seed (`np.random.seed(5)`)."
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {
498 | "collapsed": true
499 | },
500 | "source": [
501 | "## Takeaway\n",
502 | "\n",
503 | "In this assignment we were able to apply the EM algorithm to a mixture of Gaussians model of text data. This was made possible by modifying the model to assume a diagonal covariance for each cluster, and by modifying the implementation to use a sparse matrix representation. In the second part you explored the role of k-means initialization on the convergence of the model as well as the interpretability of the clusters."
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {
510 | "collapsed": true
511 | },
512 | "outputs": [],
513 | "source": []
514 | }
515 | ],
516 | "metadata": {
517 | "anaconda-cloud": {},
518 | "kernelspec": {
519 | "display_name": "Python 2",
520 | "language": "python",
521 | "name": "python2"
522 | },
523 | "language_info": {
524 | "codemirror_mode": {
525 | "name": "ipython",
526 | "version": 2
527 | },
528 | "file_extension": ".py",
529 | "mimetype": "text/x-python",
530 | "name": "python",
531 | "nbconvert_exporter": "python",
532 | "pygments_lexer": "ipython2",
533 | "version": "2.7.11"
534 | }
535 | },
536 | "nbformat": 4,
537 | "nbformat_minor": 0
538 | }
539 |
--------------------------------------------------------------------------------
/course-4/6_hierarchical_clustering_blank.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Hierarchical Clustering"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Hierarchical clustering** refers to a class of clustering methods that seek to build a **hierarchy** of clusters, in which some clusters contain others. In this assignment, we will explore a top-down approach, recursively bipartitioning the data using k-means."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "**Note to Amazon EC2 users**: To conserve memory, make sure to stop all the other notebooks before running this notebook."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "## Import packages"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "The following code block will check if you have the correct version of GraphLab Create. Any version 1.8.5 or later will do. To upgrade, read [this page](https://turi.com/download/upgrade-graphlab-create.html)."
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [],
45 | "source": [
46 | "import graphlab\n",
47 | "import matplotlib.pyplot as plt\n",
48 | "import numpy as np\n",
49 | "import sys\n",
50 | "import os\n",
51 | "import time\n",
52 | "from scipy.sparse import csr_matrix\n",
53 | "from sklearn.cluster import KMeans\n",
54 | "from sklearn.metrics import pairwise_distances\n",
55 | "%matplotlib inline\n",
56 | "\n",
57 | "'''Check GraphLab Create version'''\n",
58 | "from distutils.version import StrictVersion\n",
59 | "assert (StrictVersion(graphlab.version) >= StrictVersion('1.8.5')), 'GraphLab Create must be version 1.8.5 or later.'"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "## Load the Wikipedia dataset"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {
73 | "collapsed": true
74 | },
75 | "outputs": [],
76 | "source": [
77 | "wiki = graphlab.SFrame('people_wiki.gl/')"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "As we did in previous assignments, let's extract the TF-IDF features:"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {
91 | "collapsed": true
92 | },
93 | "outputs": [],
94 | "source": [
95 | "wiki['tf_idf'] = graphlab.text_analytics.tf_idf(wiki['text'])"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "To run k-means on this dataset, we should convert the data matrix into a sparse matrix."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "from em_utilities import sframe_to_scipy # converter\n",
114 | "\n",
115 | "# This will take about a minute or two.\n",
116 | "tf_idf, map_index_to_word = sframe_to_scipy(wiki, 'tf_idf')"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "To be consistent with the k-means assignment, let's normalize all vectors to have unit norm."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {
130 | "collapsed": true
131 | },
132 | "outputs": [],
133 | "source": [
134 | "from sklearn.preprocessing import normalize\n",
135 | "tf_idf = normalize(tf_idf)"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "## Bipartition the Wikipedia dataset using k-means"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Recall our workflow for clustering text data with k-means:\n",
150 | "\n",
151 | "1. Load the dataframe containing a dataset, such as the Wikipedia text dataset.\n",
152 | "2. Extract the data matrix from the dataframe.\n",
153 | "3. Run k-means on the data matrix with some value of k.\n",
154 | "4. Visualize the clustering results using the centroids, cluster assignments, and the original dataframe. We keep the original dataframe around because the data matrix does not keep auxiliary information (in the case of the text dataset, the title of each article).\n",
155 | "\n",
156 | "Let us modify the workflow to perform bipartitioning:\n",
157 | "\n",
158 | "1. Load the dataframe containing a dataset, such as the Wikipedia text dataset.\n",
159 | "2. Extract the data matrix from the dataframe.\n",
160 | "3. Run k-means on the data matrix with k=2.\n",
161 | "4. Divide the data matrix into two parts using the cluster assignments.\n",
162 | "5. Divide the dataframe into two parts, again using the cluster assignments. This step is necessary to allow for visualization.\n",
163 | "6. Visualize the bipartition of data.\n",
164 | "\n",
165 | "We'd like to be able to repeat Steps 3-6 multiple times to produce a **hierarchy** of clusters such as the following:\n",
166 | "```\n",
167 | " (root)\n",
168 | " |\n",
169 | " +------------+-------------+\n",
170 | " | |\n",
171 | " Cluster Cluster\n",
172 | " +------+-----+ +------+-----+\n",
173 | " | | | |\n",
174 | " Cluster Cluster Cluster Cluster\n",
175 | "```\n",
176 | "Each **parent cluster** is bipartitioned to produce two **child clusters**. At the very top is the **root cluster**, which consists of the entire dataset.\n",
177 | "\n",
178 | "Now we write a wrapper function to bipartition a given cluster using k-means. There are three variables that together comprise the cluster:\n",
179 | "\n",
180 | "* `dataframe`: a subset of the original dataframe that corresponds to the member rows of the cluster\n",
181 | "* `matrix`: same set of rows, stored in sparse matrix format\n",
182 | "* `centroid`: the centroid of the cluster (not applicable for the root cluster)\n",
183 | "\n",
184 | "Rather than passing around the three variables separately, we package them into a Python dictionary. The wrapper function takes a single dictionary (representing a parent cluster) and returns two dictionaries (representing the child clusters)."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {
191 | "collapsed": false
192 | },
193 | "outputs": [],
194 | "source": [
195 | "def bipartition(cluster, maxiter=400, num_runs=4, seed=None):\n",
196 | " '''cluster: should be a dictionary containing the following keys\n",
197 | " * dataframe: original dataframe\n",
198 | " * matrix: same data, in matrix format\n",
199 | " * centroid: centroid for this particular cluster'''\n",
200 | " \n",
201 | " data_matrix = cluster['matrix']\n",
202 | " dataframe = cluster['dataframe']\n",
203 | " \n",
204 | " # Run k-means on the data matrix with k=2. We use scikit-learn here to simplify workflow.\n",
205 | " kmeans_model = KMeans(n_clusters=2, max_iter=maxiter, n_init=num_runs, random_state=seed, n_jobs=1)\n",
206 | " kmeans_model.fit(data_matrix)\n",
207 | " centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_\n",
208 | " \n",
209 | " # Divide the data matrix into two parts using the cluster assignments.\n",
210 | " data_matrix_left_child, data_matrix_right_child = data_matrix[cluster_assignment==0], \\\n",
211 | " data_matrix[cluster_assignment==1]\n",
212 | " \n",
213 | " # Divide the dataframe into two parts, again using the cluster assignments.\n",
214 | " cluster_assignment_sa = graphlab.SArray(cluster_assignment) # minor format conversion\n",
215 | " dataframe_left_child, dataframe_right_child = dataframe[cluster_assignment_sa==0], \\\n",
216 | " dataframe[cluster_assignment_sa==1]\n",
217 | " \n",
218 | " \n",
219 | " # Package relevant variables for the child clusters\n",
220 | " cluster_left_child = {'matrix': data_matrix_left_child,\n",
221 | " 'dataframe': dataframe_left_child,\n",
222 | " 'centroid': centroids[0]}\n",
223 | " cluster_right_child = {'matrix': data_matrix_right_child,\n",
224 | " 'dataframe': dataframe_right_child,\n",
225 | " 'centroid': centroids[1]}\n",
226 | " \n",
227 | " return (cluster_left_child, cluster_right_child)"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "The following cell performs bipartitioning of the Wikipedia dataset. Allow 20-60 seconds to finish.\n",
235 | "\n",
236 | "Note. For the purpose of the assignment, we set an explicit seed (`seed=1`) to produce identical outputs for every run. In practical applications, you might want to use different random seeds for all runs."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {
243 | "collapsed": false
244 | },
245 | "outputs": [],
246 | "source": [
247 | "wiki_data = {'matrix': tf_idf, 'dataframe': wiki} # no 'centroid' for the root cluster\n",
248 | "left_child, right_child = bipartition(wiki_data, maxiter=100, num_runs=6, seed=1)"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "Let's examine the contents of one of the two clusters, which we call the `left_child`, referring to the tree visualization above."
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": null,
261 | "metadata": {
262 | "collapsed": false
263 | },
264 | "outputs": [],
265 | "source": [
266 | "left_child"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "And here is the content of the other cluster we named `right_child`."
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "collapsed": false
281 | },
282 | "outputs": [],
283 | "source": [
284 | "right_child"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "## Visualize the bipartition"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "We provide you with a modified version of the visualization function from the k-means assignment. For each cluster, we print the top 5 words with highest TF-IDF weights in the centroid and display excerpts for the 8 nearest neighbors of the centroid."
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {
305 | "collapsed": false,
306 | "scrolled": true
307 | },
308 | "outputs": [],
309 | "source": [
310 | "def display_single_tf_idf_cluster(cluster, map_index_to_word):\n",
311 | "    '''map_index_to_word: SFrame specifying the mapping between words and column indices'''\n",
312 | " \n",
313 | " wiki_subset = cluster['dataframe']\n",
314 | " tf_idf_subset = cluster['matrix']\n",
315 | " centroid = cluster['centroid']\n",
316 | " \n",
317 | " # Print top 5 words with largest TF-IDF weights in the cluster\n",
318 | " idx = centroid.argsort()[::-1]\n",
319 | " for i in xrange(5):\n",
320 | " print('{0:s}:{1:.3f}'.format(map_index_to_word['category'][idx[i]], centroid[idx[i]])),\n",
321 | " print('')\n",
322 | " \n",
323 | " # Compute distances from the centroid to all data points in the cluster.\n",
324 | " distances = pairwise_distances(tf_idf_subset, [centroid], metric='euclidean').flatten()\n",
325 | " # compute nearest neighbors of the centroid within the cluster.\n",
326 | " nearest_neighbors = distances.argsort()\n",
327 | " # For 8 nearest neighbors, print the title as well as first 180 characters of text.\n",
328 | " # Wrap the text at 80-character mark.\n",
329 | " for i in xrange(8):\n",
330 | " text = ' '.join(wiki_subset[nearest_neighbors[i]]['text'].split(None, 25)[0:25])\n",
331 | " print('* {0:50s} {1:.5f}\\n {2:s}\\n {3:s}'.format(wiki_subset[nearest_neighbors[i]]['name'],\n",
332 | " distances[nearest_neighbors[i]], text[:90], text[90:180] if len(text) > 90 else ''))\n",
333 | " print('')"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "Let's visualize the two child clusters:"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {
347 | "collapsed": false
348 | },
349 | "outputs": [],
350 | "source": [
351 | "display_single_tf_idf_cluster(left_child, map_index_to_word)"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [],
361 | "source": [
362 | "display_single_tf_idf_cluster(right_child, map_index_to_word)"
363 | ]
364 | },
365 | {
366 | "cell_type": "markdown",
367 | "metadata": {},
368 | "source": [
369 | "The left cluster consists of athletes, whereas the right cluster consists of non-athletes. So far, we have a single-level hierarchy consisting of two clusters, as follows:"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "```\n",
377 | " Wikipedia\n",
378 | " +\n",
379 | " |\n",
380 | " +--------------------------+--------------------+\n",
381 | " | |\n",
382 | " + +\n",
383 | " Athletes Non-athletes\n",
384 | "```"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "Is this hierarchy good enough? **When building a hierarchy of clusters, we must keep our particular application in mind.** For instance, we might want to build a **directory** for Wikipedia articles. A good directory would let you quickly narrow down your search to a small set of related articles. The categories of athletes and non-athletes are too general to facilitate efficient search. For this reason, we decide to build another level into our hierarchy of clusters with the goal of getting more specific cluster structure at the lower level. To that end, we subdivide both the `athletes` and `non-athletes` clusters."
392 | ]
393 | },
394 | {
395 | "cell_type": "markdown",
396 | "metadata": {},
397 | "source": [
398 | "## Perform recursive bipartitioning"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
405 | "### Cluster of athletes"
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "To help identify the clusters we've built so far, let's give them easy-to-read aliases:"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {
419 | "collapsed": true
420 | },
421 | "outputs": [],
422 | "source": [
423 | "athletes = left_child\n",
424 | "non_athletes = right_child"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "Using the bipartition function, we produce two child clusters of the athlete cluster:"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [],
441 | "source": [
442 | "# Bipartition the cluster of athletes\n",
443 | "left_child_athletes, right_child_athletes = bipartition(athletes, maxiter=100, num_runs=6, seed=1)"
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "The left child cluster mainly consists of baseball players:"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {
457 | "collapsed": false
458 | },
459 | "outputs": [],
460 | "source": [
461 | "display_single_tf_idf_cluster(left_child_athletes, map_index_to_word)"
462 | ]
463 | },
464 | {
465 | "cell_type": "markdown",
466 | "metadata": {},
467 | "source": [
468 | "On the other hand, the right child cluster is a mix of players in association football, Australian rules football, and ice hockey:"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": null,
474 | "metadata": {
475 | "collapsed": false
476 | },
477 | "outputs": [],
478 | "source": [
479 | "display_single_tf_idf_cluster(right_child_athletes, map_index_to_word)"
480 | ]
481 | },
482 | {
483 | "cell_type": "markdown",
484 | "metadata": {},
485 | "source": [
486 | "Our hierarchy of clusters now looks like this:\n",
487 | "```\n",
488 | " Wikipedia\n",
489 | " +\n",
490 | " |\n",
491 | " +--------------------------+--------------------+\n",
492 | " | |\n",
493 | " + +\n",
494 | " Athletes Non-athletes\n",
495 | " +\n",
496 | " |\n",
497 | " +-----------+--------+\n",
498 | " | |\n",
499 | " | association football/\n",
500 | " + Austrailian rules football/\n",
501 | " baseball ice hockey\n",
502 | "```"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "Should we keep subdividing the clusters? If so, which cluster should we subdivide? To answer this question, we again think about our application. Since we organize our directory by topics, it would be nice to have topics that are about as coarse as each other. For instance, if one cluster is about baseball, we expect some other clusters about football, basketball, volleyball, and so forth. That is, **we would like to achieve a similar level of granularity for all clusters.**\n",
510 | "\n",
511 | "Notice that the right child cluster is coarser than the left child cluster. The right cluster possesses a greater variety of topics than the left (ice hockey/association football/Australian rules football vs. baseball). So the right child cluster should be subdivided further to produce finer child clusters."
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "Let's give the clusters aliases as well:"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "metadata": {
525 | "collapsed": true
526 | },
527 | "outputs": [],
528 | "source": [
529 | "baseball = left_child_athletes\n",
530 | "ice_hockey_football = right_child_athletes"
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "metadata": {},
536 | "source": [
537 | "### Cluster of ice hockey players and football players"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "In answering the following quiz question, take a look at the topics represented in the top documents (those closest to the centroid), as well as the list of words with highest TF-IDF weights.\n",
545 | "\n",
546 | "Let us bipartition the cluster of ice hockey and football players."
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "execution_count": null,
552 | "metadata": {
553 | "collapsed": false
554 | },
555 | "outputs": [],
556 | "source": [
557 | "left_child_ihs, right_child_ihs = bipartition(ice_hockey_football, maxiter=100, num_runs=6, seed=1)\n",
558 | "display_single_tf_idf_cluster(left_child_ihs, map_index_to_word)\n",
559 | "display_single_tf_idf_cluster(right_child_ihs, map_index_to_word)"
560 | ]
561 | },
562 | {
563 | "cell_type": "markdown",
564 | "metadata": {},
565 | "source": [
566 | "**Quiz Question**. Which diagram best describes the hierarchy right after splitting the `ice_hockey_football` cluster? Refer to the quiz form for the diagrams."
567 | ]
568 | },
569 | {
570 | "cell_type": "markdown",
571 | "metadata": {},
572 | "source": [
573 | "**Caution**. The granularity criterion is an imperfect heuristic and must be taken with a grain of salt. It takes a lot of manual intervention to obtain a good hierarchy of clusters.\n",
574 | "\n",
575 | "* **If a cluster is highly mixed, the top articles and words may not convey the full picture of the cluster.** Thus, we may be misled if we judge the purity of clusters solely by their top documents and words. \n",
576 | "* **Many interesting topics are hidden somewhere inside the clusters but do not appear in the visualization.** We may need to subdivide further to discover new topics. For instance, subdividing the `ice_hockey_football` cluster led to the appearance of runners and golfers."
577 | ]
578 | },
579 | {
580 | "cell_type": "markdown",
581 | "metadata": {},
582 | "source": [
583 | "### Cluster of non-athletes"
584 | ]
585 | },
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {},
589 | "source": [
590 | "Now let us subdivide the cluster of non-athletes."
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": null,
596 | "metadata": {
597 | "collapsed": false
598 | },
599 | "outputs": [],
600 | "source": [
601 | "# Bipartition the cluster of non-athletes\n",
602 | "left_child_non_athletes, right_child_non_athletes = bipartition(non_athletes, maxiter=100, num_runs=6, seed=1)"
603 | ]
604 | },
605 | {
606 | "cell_type": "code",
607 | "execution_count": null,
608 | "metadata": {
609 | "collapsed": false
610 | },
611 | "outputs": [],
612 | "source": [
613 | "display_single_tf_idf_cluster(left_child_non_athletes, map_index_to_word)"
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": null,
619 | "metadata": {
620 | "collapsed": false
621 | },
622 | "outputs": [],
623 | "source": [
624 | "display_single_tf_idf_cluster(right_child_non_athletes, map_index_to_word)"
625 | ]
626 | },
627 | {
628 | "cell_type": "markdown",
629 | "metadata": {},
630 | "source": [
631 | "Neither of the clusters shows clear topics, apart from the genders. Let us divide them further."
632 | ]
633 | },
634 | {
635 | "cell_type": "code",
636 | "execution_count": null,
637 | "metadata": {
638 | "collapsed": true
639 | },
640 | "outputs": [],
641 | "source": [
642 | "male_non_athletes = left_child_non_athletes\n",
643 | "female_non_athletes = right_child_non_athletes"
644 | ]
645 | },
646 | {
647 | "cell_type": "markdown",
648 | "metadata": {},
649 | "source": [
650 | "**Quiz Question**. Let us bipartition the clusters `male_non_athletes` and `female_non_athletes`. Which diagram best describes the resulting hierarchy of clusters for the non-athletes? Refer to the quiz for the diagrams.\n",
651 | "\n",
652 | "**Note**. Use `maxiter=100, num_runs=6, seed=1` for consistency of output."
653 | ]
654 | },
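  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of one way to produce and inspect the required splits, using the parameters given in the note above; the child-cluster names are placeholders.\n",
    "\n",
    "```python\n",
    "# bipartition each non-athlete cluster and visualize the children (sketch)\n",
    "left_male, right_male = bipartition(male_non_athletes, maxiter=100, num_runs=6, seed=1)\n",
    "left_female, right_female = bipartition(female_non_athletes, maxiter=100, num_runs=6, seed=1)\n",
    "\n",
    "display_single_tf_idf_cluster(left_male, map_index_to_word)\n",
    "display_single_tf_idf_cluster(right_male, map_index_to_word)\n",
    "display_single_tf_idf_cluster(left_female, map_index_to_word)\n",
    "display_single_tf_idf_cluster(right_female, map_index_to_word)\n",
    "```"
   ]
  },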
655 | {
656 | "cell_type": "code",
657 | "execution_count": null,
658 | "metadata": {
659 | "collapsed": false
660 | },
661 | "outputs": [],
662 | "source": []
663 | },
664 | {
665 | "cell_type": "code",
666 | "execution_count": null,
667 | "metadata": {
668 | "collapsed": false
669 | },
670 | "outputs": [],
671 | "source": []
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": null,
676 | "metadata": {
677 | "collapsed": true
678 | },
679 | "outputs": [],
680 | "source": []
681 | }
682 | ],
683 | "metadata": {
684 | "kernelspec": {
685 | "display_name": "Python 2",
686 | "language": "python",
687 | "name": "python2"
688 | },
689 | "language_info": {
690 | "codemirror_mode": {
691 | "name": "ipython",
692 | "version": 2
693 | },
694 | "file_extension": ".py",
695 | "mimetype": "text/x-python",
696 | "name": "python",
697 | "nbconvert_exporter": "python",
698 | "pygments_lexer": "ipython2",
699 | "version": "2.7.11"
700 | }
701 | },
702 | "nbformat": 4,
703 | "nbformat_minor": 0
704 | }
705 |
--------------------------------------------------------------------------------
/course-4/chosen_images.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/learnml/machine-learning-specialization/d8159cb9266e3c337b897de27dbd600d440cb870/course-4/chosen_images.png
--------------------------------------------------------------------------------
/course-4/em_utilities.py:
--------------------------------------------------------------------------------
1 | from scipy.sparse import csr_matrix
2 | from scipy.sparse import spdiags
3 | from scipy.stats import multivariate_normal
4 | import graphlab
5 | import numpy as np
6 | import sys
7 | import time
8 | from copy import deepcopy
9 | from sklearn.metrics import pairwise_distances
10 | from sklearn.preprocessing import normalize
11 |
12 | def sframe_to_scipy(x, column_name):
13 | '''
14 | Convert a dictionary column of an SFrame into a sparse matrix format where
15 | each (row_id, column_id, value) triple corresponds to the value of
16 | x[row_id][column_id], where column_id is a key in the dictionary.
17 |
18 | Example
19 | >>> sparse_matrix, map_key_to_index = sframe_to_scipy(sframe, column_name)
20 | '''
21 | assert x[column_name].dtype() == dict, \
22 | 'The chosen column must be dict type, representing sparse data.'
23 |
24 | # Create triples of (row_id, feature_id, count).
25 | # 1. Add a row number.
26 | x = x.add_row_number()
27 | # 2. Stack will transform x to have a row for each unique (row, key) pair.
28 | x = x.stack(column_name, ['feature', 'value'])
29 |
30 | # Map words into integers using a OneHotEncoder feature transformation.
31 | f = graphlab.feature_engineering.OneHotEncoder(features=['feature'])
32 | # 1. Fit the transformer using the above data.
33 | f.fit(x)
34 |     # 2. The transform takes the 'feature' column and adds a new column 'encoded_features'.
35 | x = f.transform(x)
36 | # 3. Get the feature mapping.
37 | mapping = f['feature_encoding']
38 | # 4. Get the feature id to use for each key.
39 | x['feature_id'] = x['encoded_features'].dict_keys().apply(lambda x: x[0])
40 |
41 | # Create numpy arrays that contain the data for the sparse matrix.
42 | i = np.array(x['id'])
43 | j = np.array(x['feature_id'])
44 | v = np.array(x['value'])
45 |     num_rows = x['id'].max() + 1
46 |     num_cols = x['feature_id'].max() + 1
47 | 
48 |     # Create a sparse matrix.
49 |     mat = csr_matrix((v, (i, j)), shape=(num_rows, num_cols))
50 |
51 | return mat, mapping
52 |
53 | def diag(array):
54 | n = len(array)
55 | return spdiags(array, 0, n, n)
56 |
57 | def logpdf_diagonal_gaussian(x, mean, cov):
58 | '''
59 | Compute logpdf of a multivariate Gaussian distribution with diagonal covariance at a given point x.
60 | A multivariate Gaussian distribution with a diagonal covariance is equivalent
61 | to a collection of independent Gaussian random variables.
62 |
63 | x should be a sparse matrix. The logpdf will be computed for each row of x.
64 | mean and cov should be given as 1D numpy arrays
65 | mean[i] : mean of i-th variable
66 | cov[i] : variance of i-th variable'''
67 |
68 | n = x.shape[0]
69 | dim = x.shape[1]
70 | assert(dim == len(mean) and dim == len(cov))
71 |
72 | # multiply each i-th column of x by (1/(2*sigma_i)), where sigma_i is sqrt of variance of i-th variable.
73 | scaled_x = x.dot( diag(1./(2*np.sqrt(cov))) )
74 | # multiply each i-th entry of mean by (1/(2*sigma_i))
75 | scaled_mean = mean/(2*np.sqrt(cov))
76 |
77 |     # sum of pairwise squared Euclidean distances gives SUM[(x_i - mean_i)^2/(2*sigma_i^2)]
78 | return -np.sum(np.log(np.sqrt(2*np.pi*cov))) - pairwise_distances(scaled_x, [scaled_mean], 'euclidean').flatten()**2
79 |
80 | def log_sum_exp(x, axis):
81 | '''Compute the log of a sum of exponentials'''
82 | x_max = np.max(x, axis=axis)
83 | if axis == 1:
84 | return x_max + np.log( np.sum(np.exp(x-x_max[:,np.newaxis]), axis=1) )
85 | else:
86 | return x_max + np.log( np.sum(np.exp(x-x_max), axis=0) )
87 |
88 | def EM_for_high_dimension(data, means, covs, weights, cov_smoothing=1e-5, maxiter=int(1e3), thresh=1e-4, verbose=False):
89 | # cov_smoothing: specifies the default variance assigned to absent features in a cluster.
90 |     #                If we were to assign zero variances to absent features, we would be overconfident,
91 |     #                as we would hastily conclude that those features would NEVER appear in the cluster.
92 | # We'd like to leave a little bit of possibility for absent features to show up later.
93 | n = data.shape[0]
94 | dim = data.shape[1]
95 | mu = deepcopy(means)
96 | Sigma = deepcopy(covs)
97 | K = len(mu)
98 | weights = np.array(weights)
99 |
100 | ll = None
101 | ll_trace = []
102 |
103 | for i in range(maxiter):
104 | # E-step: compute responsibilities
105 | logresp = np.zeros((n,K))
106 | for k in xrange(K):
107 | logresp[:,k] = np.log(weights[k]) + logpdf_diagonal_gaussian(data, mu[k], Sigma[k])
108 | ll_new = np.sum(log_sum_exp(logresp, axis=1))
109 | if verbose:
110 | print(ll_new)
111 | sys.stdout.flush()
112 | logresp -= np.vstack(log_sum_exp(logresp, axis=1))
113 | resp = np.exp(logresp)
114 | counts = np.sum(resp, axis=0)
115 |
116 | # M-step: update weights, means, covariances
117 | weights = counts / np.sum(counts)
118 | for k in range(K):
119 | mu[k] = (diag(resp[:,k]).dot(data)).sum(axis=0)/counts[k]
120 | mu[k] = mu[k].A1
121 |
122 | Sigma[k] = diag(resp[:,k]).dot( data.multiply(data)-2*data.dot(diag(mu[k])) ).sum(axis=0) \
123 | + (mu[k]**2)*counts[k]
124 | Sigma[k] = Sigma[k].A1 / counts[k] + cov_smoothing*np.ones(dim)
125 |
126 | # check for convergence in log-likelihood
127 | ll_trace.append(ll_new)
128 | if ll is not None and (ll_new-ll) < thresh and ll_new > -np.inf:
129 | ll = ll_new
130 | break
131 | else:
132 | ll = ll_new
133 |
134 | out = {'weights':weights,'means':mu,'covs':Sigma,'loglik':ll_trace,'resp':resp}
135 |
136 | return out
137 |
--------------------------------------------------------------------------------
/course-4/images.sf.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:34bba8f7341dcd0adc4506151a72fb0ccc26f0967c605d1eef49324730d1d0e5
3 | size 11687484
4 |
--------------------------------------------------------------------------------
/course-4/kmeans-arrays.npz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:525a3fe1b6b27bf8034aeaa2e35c20c697fd123a544c7dfafffe922a2341924a
3 | size 50329954
4 |
--------------------------------------------------------------------------------
/course-4/people_wiki.gl.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:757e1546d5a778939ac8a84f7eb7cb6ba501544acf2759f7783046ce474f28bc
3 | size 58269690
4 |
--------------------------------------------------------------------------------
/course-4/topic_models.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:8a2aff7b44795e80717b7a68e3c531dced7e81318662bc193d5a0c4870997614
3 | size 23249973
4 |
--------------------------------------------------------------------------------