├── .DS_Store
├── NYC-311-2M.db
├── Notebook0-basics.ipynb
├── Notebook10_NumpyMatrix_part2.ipynb
├── Notebook10_NumpyMatrix_part3.ipynb
├── Notebook10_NumpyScipy_part1.ipynb
├── Notebook11_MarkovChainAnalysis.ipynb
├── Notebook11_MarkovChainAnalysis_Ranking.ipynb
├── Notebook12_Algoriths_LinearLeastSq_part3.ipynb
├── Notebook12_Cost_part3.ipynb
├── Notebook12_Gradients_part2.ipynb
├── Notebook12_LinReg_ModelFitting_part1.ipynb
├── Notebook12_LinerRegNotes.ipynb
├── Notebook12_PerturbationTheory.ipynb
├── Notebook12_ThetaLMS.ipynb
├── Notebook13_LogisticRegression.ipynb
├── Notebook13_NumpyMatrixManipulation.ipynb
├── Notebook14_KMeans_Clustering.ipynb
├── Notebook15_Compression_via_PCA&SVD.ipynb
├── Notebook15_SVDimage_compression.ipynb
├── Notebook16_EiganFaces.ipynb
├── Notebook1_part1-collections.ipynb
├── Notebook1_part2-more_exercises.ipynb
├── Notebook4_part1_floating_points.ipynb
├── Notebook4_part2_floatingPoints.ipynb
├── Notebook5_part1_RegEx.ipynb
├── Notebook5_part2_RegExYelp.ipynb
├── Notebook5_part3_RegEx_hard.ipynb
├── Notebook6_part1_web_mining.ipynb
├── Notebook6_part2_BeautifulSoup.ipynb
├── Notebook6_part3_webMiningAPIs.ipynb
├── Notebook7_Pandas_TidyData.ipynb
├── Notebook8_bokeh_seaborn.ipynb
├── Notebook9_SQL_Part2.ipynb
├── Notebook9_SQL_relational_DBs.ipynb
├── README.md
├── Supplemental_notebook.ipynb
└── part1.ipynb


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maicat11/Computing-for-Data-Analysis/401be39dc4058c1e3204fd01b49dbede5e7853f1/.DS_Store


--------------------------------------------------------------------------------
/NYC-311-2M.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maicat11/Computing-for-Data-Analysis/401be39dc4058c1e3204fd01b49dbede5e7853f1/NYC-311-2M.db


--------------------------------------------------------------------------------
/Notebook0-basics.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "nbgrader": {
  7 |      "grade": false,
  8 |      "grade_id": "cell-166d0b0bd7d2633f",
  9 |      "locked": true,
 10 |      "schema_version": 1,
 11 |      "solution": false
 12 |     }
 13 |    },
 14 |    "source": [
 15 |     "# Python review: Values, variables, types, lists, and strings\n",
 16 |     "\n",
 17 |     "These first few notebooks are a set of exercises with two goals:\n",
 18 |     "\n",
 19 |     "1. Review the basics of Python\n",
 20 |     "2. Familiarize you with Jupyter\n",
 21 |     "\n",
 22 |     "Regarding the first goal, these initial notebooks cover material we think you should already know from [Chris Simpkins's](https://www.cc.gatech.edu/~simpkins/) [Python Bootcamp](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/syllabus.html). It is based specifically on his offering to incoming students of the Georgia Tech MS Analytics in [Fall 2016](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/august2016.html).\n",
 23 |     "\n",
 24 |     "Regarding the second goal, you'll observe that the bootcamp has each student install and work directly with the Python interpreter, which runs locally on his or her machine (e.g., see [Slide 5 of Chris's intro](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/slides/intro-python.html)). But in this course, we are using Jupyter Notebooks as the development environment. You can think of a Jupyter notebook as a web-based \"skin\" for running a Python interpreter---possibly hosted on a remote server, which is the case in this course. Here is a good tutorial on [Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)."
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "markdown",
 29 |    "metadata": {
 30 |     "nbgrader": {
 31 |      "grade": false,
 32 |      "grade_id": "cell-9c012c4d7a197d7d",
 33 |      "locked": true,
 34 |      "schema_version": 1,
 35 |      "solution": false
 36 |     }
 37 |    },
 38 |    "source": [
 39 |     "> **Note for [OMSA](https://pe.gatech.edu/master-science-degrees/online-master-science-analytics) students.** In this course we assume you are using [Vocareum's deployment](https://www.vocareum.com/) of Jupyter. You also have an option to use other Jupyter environments, including installing and running Jupyter on your own system. We can't provide technical support to you if you choose to go those routes, but if you'd like to do that anyway, we recommend [Microsoft Azure Notebooks](https://notebooks.azure.com/) as a web-hosted option, which we use in the on-campus class, or the Continuum Analytics [Anaconda distribution](https://www.continuum.io/downloads) as a locally installed option."
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "markdown",
 44 |    "metadata": {
 45 |     "nbgrader": {
 46 |      "grade": false,
 47 |      "grade_id": "cell-92648f77b2c73f26",
 48 |      "locked": true,
 49 |      "schema_version": 1,
 50 |      "solution": false
 51 |     }
 52 |    },
 53 |    "source": [
 54 |     "**Study hint: Read the test code!** You'll notice that most of the exercises below have a place for you to code up your answer followed by a \"test cell.\" That's a code cell that checks the output of your code to see whether it appears to produce correct results. You can often learn a lot by reading the test code. In fact, sometimes it gives you a hint about how to approach the problem. So we encourage you to read the test cells even when they seem cryptic; that crypticness is deliberate!"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {
 60 |     "nbgrader": {
 61 |      "grade": false,
 62 |      "grade_id": "cell-9a91d97e7aa9c67f",
 63 |      "locked": true,
 64 |      "schema_version": 1,
 65 |      "solution": false
 66 |     }
 67 |    },
 68 |    "source": [
 69 |     "**Exercise 0** (1 point). Run the code cell below. It should display the output string, `Hello, world!`."
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 1,
 75 |    "metadata": {
 76 |     "nbgrader": {
 77 |      "grade": true,
 78 |      "grade_id": "hello_world_test",
 79 |      "locked": true,
 80 |      "points": 1,
 81 |      "schema_version": 1,
 82 |      "solution": false
 83 |     }
 84 |    },
 85 |    "outputs": [
 86 |     {
 87 |      "name": "stdout",
 88 |      "output_type": "stream",
 89 |      "text": [
 90 |       "Hello, world!\n"
 91 |      ]
 92 |     }
 93 |    ],
 94 |    "source": [
 95 |     "print(\"Hello, world!\")"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {
101 |     "nbgrader": {
102 |      "grade": false,
103 |      "grade_id": "cell-2de1352e57946ac5",
104 |      "locked": true,
105 |      "schema_version": 1,
106 |      "solution": false
107 |     }
108 |    },
109 |    "source": [
110 |     "**Exercise 1** (`x_float_test`: 1 point). Create a variable named `x_float` whose numerical value is one (1) and whose type is *floating-point*."
111 |    ]
112 |   },
113 |   {
114 |    "cell_type": "code",
115 |    "execution_count": 2,
116 |    "metadata": {
117 |     "collapsed": true,
118 |     "nbgrader": {
119 |      "grade": false,
120 |      "grade_id": "x_float",
121 |      "locked": false,
122 |      "schema_version": 1,
123 |      "solution": true
124 |     }
125 |    },
126 |    "outputs": [],
127 |    "source": [
128 |     "#\n",
129 |     "# YOUR CODE HERE\n",
130 |     "#\n",
131 |     "x_float = float(1)"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": 3,
137 |    "metadata": {
138 |     "nbgrader": {
139 |      "grade": true,
140 |      "grade_id": "x_float_test",
141 |      "locked": true,
142 |      "points": 1,
143 |      "schema_version": 1,
144 |      "solution": false
145 |     }
146 |    },
147 |    "outputs": [
148 |     {
149 |      "name": "stdout",
150 |      "output_type": "stream",
151 |      "text": [
152 |       "\n",
153 |       "(Passed!)\n"
154 |      ]
155 |     }
156 |    ],
157 |    "source": [
158 |     "# `x_float_test`: Test cell\n",
159 |     "assert x_float == 1\n",
160 |     "assert type(x_float) is float\n",
161 |     "print(\"\\n(Passed!)\")"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "markdown",
166 |    "metadata": {
167 |     "nbgrader": {
168 |      "grade": false,
169 |      "grade_id": "cell-2b53cd92e2ac58a1",
170 |      "locked": true,
171 |      "schema_version": 1,
172 |      "solution": false
173 |     }
174 |    },
175 |    "source": [
176 |     "**Exercise 2** (`strcat_ba_test`: 1 point). Complete the following function, `strcat_ba(a, b)`, so that given two strings, `a` and `b`, it returns the concatenation of `b` followed by `a` (pay attention to the order in these instructions!)."
177 |    ]
178 |   },
179 |   {
180 |    "cell_type": "code",
181 |    "execution_count": 4,
182 |    "metadata": {
183 |     "collapsed": true,
184 |     "nbgrader": {
185 |      "grade": false,
186 |      "grade_id": "strcat_ba",
187 |      "locked": false,
188 |      "schema_version": 1,
189 |      "solution": true
190 |     }
191 |    },
192 |    "outputs": [],
193 |    "source": [
194 |     "def strcat_ba(a, b):\n",
195 |     "    assert type(a) is str\n",
196 |     "    assert type(b) is str\n",
197 |     "#\n",
198 |     "# YOUR CODE HERE\n",
199 |     "#\n",
200 |     "    return b + a"
201 |    ]
202 |   },
203 |   {
204 |    "cell_type": "code",
205 |    "execution_count": 5,
206 |    "metadata": {
207 |     "nbgrader": {
208 |      "grade": true,
209 |      "grade_id": "strcat_ba_test",
210 |      "locked": true,
211 |      "points": 1,
212 |      "schema_version": 1,
213 |      "solution": false
214 |     }
215 |    },
216 |    "outputs": [
217 |     {
218 |      "name": "stdout",
219 |      "output_type": "stream",
220 |      "text": [
221 |       "strcat_ba(\"ofrpy\", \"kek\") == \"kekofrpy\"\n",
222 |       "\n",
223 |       "(Passed!)\n"
224 |      ]
225 |     }
226 |    ],
227 |    "source": [
228 |     "# `strcat_ba_test`: Test cell\n",
229 |     "\n",
230 |     "# Workaround: Python 3.5.2 does not have `random.choices()` (available in 3.6+)\n",
231 |     "def random_letter():\n",
232 |     "    from random import choice\n",
233 |     "    return choice('abcdefghijklmnopqrstuvwxyz')\n",
234 |     "\n",
235 |     "def random_string(n, fun=random_letter):\n",
236 |     "    return ''.join([str(fun()) for _ in range(n)])\n",
237 |     "\n",
238 |     "a = random_string(5)\n",
239 |     "b = random_string(3)\n",
240 |     "c = strcat_ba(a, b)\n",
241 |     "print('strcat_ba(\"{}\", \"{}\") == \"{}\"'.format(a, b, c))\n",
242 |     "assert len(c) == len(a) + len(b)\n",
243 |     "assert c[:len(b)] == b\n",
244 |     "assert c[-len(a):] == a\n",
245 |     "print(\"\\n(Passed!)\")"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "markdown",
250 |    "metadata": {
251 |     "nbgrader": {
252 |      "grade": false,
253 |      "grade_id": "cell-75fe9a45cd6b3f9a",
254 |      "locked": true,
255 |      "schema_version": 1,
256 |      "solution": false
257 |     }
258 |    },
259 |    "source": [
260 |     "**Exercise 3** (`strcat_list_test`: 2 points). Complete the following function, `strcat_list(L)`, which generalizes the previous function: given a *list* of strings, `L`, it returns the concatenation of the strings in reverse order. For example:\n",
261 |     "\n",
262 |     "```python\n",
263 |     "    strcat_list(['abc', 'def', 'ghi']) == 'ghidefabc'\n",
264 |     "```"
265 |    ]
266 |   },
267 |   {
268 |    "cell_type": "code",
269 |    "execution_count": 8,
270 |    "metadata": {
271 |     "collapsed": true,
272 |     "nbgrader": {
273 |      "grade": false,
274 |      "grade_id": "strcat_list",
275 |      "locked": false,
276 |      "schema_version": 1,
277 |      "solution": true
278 |     }
279 |    },
280 |    "outputs": [],
281 |    "source": [
282 |     "def strcat_list(L):\n",
283 |     "    assert type(L) is list\n",
284 |     "#\n",
285 |     "# YOUR CODE HERE\n",
286 |     "#\n",
287 |     "    rev_cat = ''\n",
288 |     "    for item in reversed(L):\n",
289 |     "        rev_cat += item\n",
290 |     "    return rev_cat"
291 |    ]
292 |   },
293 |   {
294 |    "cell_type": "code",
295 |    "execution_count": 9,
296 |    "metadata": {
297 |     "nbgrader": {
298 |      "grade": true,
299 |      "grade_id": "strcat_list_test",
300 |      "locked": true,
301 |      "points": 2,
302 |      "schema_version": 1,
303 |      "solution": false
304 |     }
305 |    },
306 |    "outputs": [
307 |     {
308 |      "name": "stdout",
309 |      "output_type": "stream",
310 |      "text": [
311 |       "L == ['qsg', 'jul', 'mnp', 'jro', 'nws', 'pzq']\n",
312 |       "strcat_list(L) == 'pzqnwsjromnpjulqsg'\n",
313 |       "\n",
314 |       "(Passed!)\n"
315 |      ]
316 |     }
317 |    ],
318 |    "source": [
319 |     "# `strcat_list_test`: Test cell\n",
320 |     "n = 3\n",
321 |     "nL = 6\n",
322 |     "L = [random_string(n) for _ in range(nL)]\n",
323 |     "Lc = strcat_list(L)\n",
324 |     "\n",
325 |     "print('L == {}'.format(L))\n",
326 |     "print('strcat_list(L) == \\'{}\\''.format(Lc))\n",
327 |     "assert all([Lc[i*n:(i+1)*n] == L[nL-i-1] for i, x in zip(range(nL), L)])\n",
328 |     "print(\"\\n(Passed!)\")"
329 |    ]
330 |   },
331 |   {
332 |    "cell_type": "markdown",
333 |    "metadata": {
334 |     "nbgrader": {
335 |      "grade": false,
336 |      "grade_id": "cell-06e37dd37379b6b8",
337 |      "locked": true,
338 |      "schema_version": 1,
339 |      "solution": false
340 |     }
341 |    },
342 |    "source": [
343 |     "**Exercise 4** (`floor_fraction_test`: 1 point). Suppose you are given two variables, `a` and `b`, whose values are the real numbers, $a \\geq 0$ (non-negative) and $b > 0$ (positive). Complete the function, `floor_fraction(a, b)` so that it returns $\\left\\lfloor\\frac{a}{b}\\right\\rfloor$, that is, the *floor* of $\\frac{a}{b}$. The *type* of the returned value must be `int` (an integer)."
344 |    ]
345 |   },
346 |   {
347 |    "cell_type": "code",
348 |    "execution_count": 12,
349 |    "metadata": {
350 |     "collapsed": true,
351 |     "nbgrader": {
352 |      "grade": false,
353 |      "grade_id": "floor_fraction",
354 |      "locked": false,
355 |      "schema_version": 1,
356 |      "solution": true
357 |     }
358 |    },
359 |    "outputs": [],
360 |    "source": [
361 |     "def is_number(x):\n",
362 |     "    \"\"\"Returns `True` if `x` is a number-like type, e.g., `int`, `float`, `Decimal()`, ...\"\"\"\n",
363 |     "    from numbers import Number\n",
364 |     "    return isinstance(x, Number)\n",
365 |     "    \n",
366 |     "def floor_fraction(a, b):\n",
367 |     "    assert is_number(a) and a >= 0\n",
368 |     "    assert is_number(b) and b > 0\n",
369 |     "#\n",
370 |     "# YOUR CODE HERE\n",
371 |     "#\n",
372 |     "    return int(a/b)\n",
373 |     "    "
374 |    ]
375 |   },
376 |   {
377 |    "cell_type": "code",
378 |    "execution_count": 13,
379 |    "metadata": {
380 |     "nbgrader": {
381 |      "grade": true,
382 |      "grade_id": "floor_fraction_test",
383 |      "locked": true,
384 |      "points": 1,
385 |      "schema_version": 1,
386 |      "solution": false
387 |     }
388 |    },
389 |    "outputs": [
390 |     {
391 |      "name": "stdout",
392 |      "output_type": "stream",
393 |      "text": [
394 |       "floor_fraction(0.9805045805759878, 0.40871852802407027) == floor(2.398972675195786) == 2\n",
395 |       "\n",
396 |       "(Passed!)\n"
397 |      ]
398 |     }
399 |    ],
400 |    "source": [
401 |     "# `floor_fraction_test`: Test cell\n",
402 |     "from random import random\n",
403 |     "a = random()\n",
404 |     "b = random()\n",
405 |     "c = floor_fraction(a, b)\n",
406 |     "\n",
407 |     "print('floor_fraction({}, {}) == floor({}) == {}'.format(a, b, a/b, c))\n",
408 |     "assert b*c <= a <= b*(c+1)\n",
409 |     "assert type(c) is int\n",
410 |     "print('\\n(Passed!)')"
411 |    ]
412 |   },
413 |   {
414 |    "cell_type": "markdown",
415 |    "metadata": {
416 |     "nbgrader": {
417 |      "grade": false,
418 |      "grade_id": "cell-e98590d39e95bc25",
419 |      "locked": true,
420 |      "schema_version": 1,
421 |      "solution": false
422 |     }
423 |    },
424 |    "source": [
425 |     "**Exercise 5** (`ceiling_fraction_test`: 1 point). Complete the function, `ceiling_fraction(a, b)`, which for any numeric inputs, `a` and `b`, corresponding to real numbers, $a \\geq 0$ and $b > 0$, returns $\\left\\lceil\\frac{a}{b}\\right\\rceil$, that is, the *ceiling* of $\\frac{a}{b}$. The type of the returned value must be `int`."
426 |    ]
427 |   },
428 |   {
429 |    "cell_type": "code",
430 |    "execution_count": 14,
431 |    "metadata": {
432 |     "collapsed": true,
433 |     "nbgrader": {
434 |      "grade": false,
435 |      "grade_id": "ceiling_fraction",
436 |      "locked": false,
437 |      "schema_version": 1,
438 |      "solution": true
439 |     }
440 |    },
441 |    "outputs": [],
442 |    "source": [
443 |     "def ceiling_fraction(a, b):\n",
444 |     "    assert is_number(a) and a >= 0\n",
445 |     "    assert is_number(b) and b > 0\n",
446 |     "#\n",
447 |     "# YOUR CODE HERE\n",
448 |     "    # Ceiling as negated floor division, so an exact quotient is not bumped up by one.\n",
449 |     "    return -int(-a // b)"
450 |    ]
451 |   },
452 |   {
453 |    "cell_type": "code",
454 |    "execution_count": 15,
455 |    "metadata": {
456 |     "nbgrader": {
457 |      "grade": true,
458 |      "grade_id": "ceiling_fraction_test",
459 |      "locked": true,
460 |      "points": 1,
461 |      "schema_version": 1,
462 |      "solution": false
463 |     }
464 |    },
465 |    "outputs": [
466 |     {
467 |      "name": "stdout",
468 |      "output_type": "stream",
469 |      "text": [
470 |       "ceiling_fraction(0.8979882736065053, 0.8324033074774889) == ceiling(1.0787898913181457) == 2\n",
471 |       "\n",
472 |       "(Passed!)\n"
473 |      ]
474 |     }
475 |    ],
476 |    "source": [
477 |     "# `ceiling_fraction_test`: Test cell\n",
478 |     "from random import random\n",
479 |     "a = random()\n",
480 |     "b = random()\n",
481 |     "c = ceiling_fraction(a, b)\n",
482 |     "print('ceiling_fraction({}, {}) == ceiling({}) == {}'.format(a, b, a/b, c))\n",
483 |     "assert b*(c-1) <= a <= b*c\n",
484 |     "assert type(c) is int\n",
485 |     "print(\"\\n(Passed!)\")"
486 |    ]
487 |   },
488 |   {
489 |    "cell_type": "markdown",
490 |    "metadata": {},
491 |    "source": [
492 |     "**Exercise 6** (`report_exam_avg_test`: 1 point). Let `a`, `b`, and `c` represent three exam scores as numerical values. Complete the function, `report_exam_avg(a, b, c)`, so that it computes the equally weighted average score and returns the string `'Your average score: XX'`, where `XX` is the average rounded to one decimal place. For example:\n",
493 |     "\n",
494 |     "```python\n",
495 |     "    report_exam_avg(100, 95, 80) == 'Your average score: 91.7'\n",
496 |     "```"
497 |    ]
498 |   },
499 |   {
500 |    "cell_type": "code",
501 |    "execution_count": 27,
502 |    "metadata": {
503 |     "collapsed": true,
504 |     "nbgrader": {
505 |      "grade": false,
506 |      "grade_id": "cell-a117d6495f42b850",
507 |      "locked": false,
508 |      "schema_version": 1,
509 |      "solution": true
510 |     }
511 |    },
512 |    "outputs": [],
513 |    "source": [
514 |     "def report_exam_avg(a, b, c):\n",
515 |     "    assert is_number(a) and is_number(b) and is_number(c)\n",
516 |     "#\n",
517 |     "# YOUR CODE HERE\n",
518 |     "#\n",
519 |     "    avg_score = round(((a + b + c) / 3), 1)\n",
520 |     "    return \"Your average score: {}\".format(avg_score)"
521 |    ]
522 |   },
523 |   {
524 |    "cell_type": "code",
525 |    "execution_count": 29,
526 |    "metadata": {
527 |     "nbgrader": {
528 |      "grade": true,
529 |      "grade_id": "report_exam_avg_test",
530 |      "locked": true,
531 |      "points": 1,
532 |      "schema_version": 1,
533 |      "solution": false
534 |     }
535 |    },
536 |    "outputs": [
537 |     {
538 |      "name": "stdout",
539 |      "output_type": "stream",
540 |      "text": [
541 |       "Your average score: 91.7\n",
542 |       "Checking some additional randomly generated cases:\n",
543 |       "48.33258245396376, 50.442426025062744, 91.75699870990516 -> 'Your average score: 63.5' [0.010669062977219331]\n",
544 |       "35.4641280297894, 63.36452071896817, 19.903296253292147 -> 'Your average score: 39.6' [0.022684999316766152]\n",
545 |       "12.712685343435181, 87.91783899424385, 20.58290036382341 -> 'Your average score: 40.4' [0.004474900500819483]\n",
546 |       "95.8381834385127, 63.40664348059859, 45.20335845795377 -> 'Your average score: 68.1' [0.04939512568835388]\n",
547 |       "6.814313089670787, 84.59381135235066, 90.7578727720895 -> 'Your average score: 60.7' [0.021999071370307394]\n",
548 |       "93.50308463838849, 69.21269551706179, 13.635338619391979 -> 'Your average score: 58.8' [0.016293741719247617]\n",
549 |       "87.65740049353373, 99.51004413046226, 5.788171656622643 -> 'Your average score: 64.3' [0.01853876020622162]\n",
550 |       "7.674497245314571, 58.43286389008937, 71.55717716660797 -> 'Your average score: 45.9' [0.011820565996022955]\n",
551 |       "36.842555880064786, 67.78010593581831, 49.56061896877766 -> 'Your average score: 51.4' [0.005573071779735983]\n",
552 |       "98.67836876468702, 84.33581586521609, 75.29298941927082 -> 'Your average score: 86.1' [0.002391349724670514]\n",
553 |       "\n",
554 |       "(Passed!)\n"
555 |      ]
556 |     }
557 |    ],
558 |    "source": [
559 |     "# `report_exam_avg_test`: Test cell\n",
560 |     "msg = report_exam_avg(100, 95, 80)\n",
561 |     "print(msg)\n",
562 |     "assert msg == 'Your average score: 91.7'\n",
563 |     "\n",
564 |     "print(\"Checking some additional randomly generated cases:\")\n",
565 |     "for _ in range(10):\n",
566 |     "    ex1 = random() * 100\n",
567 |     "    ex2 = random() * 100\n",
568 |     "    ex3 = random() * 100\n",
569 |     "    msg = report_exam_avg(ex1, ex2, ex3)\n",
570 |     "    ex_rounded_avg = float(msg.split()[-1])\n",
571 |     "    abs_err = abs(ex_rounded_avg*3 - (ex1 + ex2 + ex3)) / 3\n",
572 |     "    print(\"{}, {}, {} -> '{}' [{}]\".format(ex1, ex2, ex3, msg, abs_err))\n",
573 |     "    assert abs_err <= 0.05\n",
574 |     "\n",
575 |     "print(\"\\n(Passed!)\")"
576 |    ]
577 |   },
578 |   {
579 |    "cell_type": "markdown",
580 |    "metadata": {
581 |     "nbgrader": {
582 |      "grade": false,
583 |      "grade_id": "cell-24a78862d8e3bba0",
584 |      "locked": true,
585 |      "schema_version": 1,
586 |      "solution": false
587 |     }
588 |    },
589 |    "source": [
590 |     "**Exercise 7** (`count_word_lengths_test`: 2 points). Write a function `count_word_lengths(s)` that, given a string consisting of words separated by spaces, returns a list containing the length of each word. Words will consist of lowercase alphabetic characters, and they may be separated by multiple consecutive spaces. If a string is empty or has no spaces, the function should return an empty list.\n",
591 |     "\n",
592 |     "For instance, in this code sample,\n",
593 |     "\n",
594 |     "```python\n",
595 |     "   count_word_lengths('the quick  brown   fox jumped over     the lazy  dog') == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n",
596 |     "```\n",
597 |     "\n",
598 |     "the input string consists of nine (9) words whose respective lengths are shown in the list."
599 |    ]
600 |   },
601 |   {
602 |    "cell_type": "code",
603 |    "execution_count": 35,
604 |    "metadata": {
605 |     "nbgrader": {
606 |      "grade": false,
607 |      "grade_id": "count_word_lengths",
608 |      "locked": false,
609 |      "schema_version": 1,
610 |      "solution": true
611 |     }
612 |    },
613 |    "outputs": [],
614 |    "source": [
615 |     "def count_word_lengths(s):\n",
616 |     "    assert type(s) is str\n",
617 |     "    assert all([x.isalpha() or x == ' ' for x in s])\n",
618 |     "#\n",
619 |     "# YOUR CODE HERE\n",
620 |     "#\n",
621 |     "    split_string = s.split(' ')\n",
622 |     "    list_lengths = [len(x) for x in split_string if x != '']\n",
623 |     "    if len(list_lengths) == 1:\n",
624 |     "        return []\n",
625 |     "    else:\n",
626 |     "        return list_lengths\n"
627 |    ]
628 |   },
629 |   {
630 |    "cell_type": "code",
631 |    "execution_count": 36,
632 |    "metadata": {
633 |     "nbgrader": {
634 |      "grade": true,
635 |      "grade_id": "count_word_lengths_test",
636 |      "locked": true,
637 |      "points": 2,
638 |      "schema_version": 1,
639 |      "solution": false
640 |     }
641 |    },
642 |    "outputs": [
643 |     {
644 |      "name": "stdout",
645 |      "output_type": "stream",
646 |      "text": [
647 |       "Test 1: count_word_lengths('the quick brown fox jumped over the lazy dog') == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n",
648 |       "Test 2: count_word_lengths('lfslpf  ib  dxjejrqc hlhu lxretibwcksunx') == '[6, 2, 8, 4, 14]'\n",
649 |       "  => 'lfslpf'\n",
650 |       "  => 'ib'\n",
651 |       "  => 'dxjejrqc'\n",
652 |       "  => 'hlhu'\n",
653 |       "  => 'lxretibwcksunx'\n",
654 |       "Test 3: Empty strings...\n",
655 |       "\n",
656 |       "(Passed!)\n"
657 |      ]
658 |     }
659 |    ],
660 |    "source": [
661 |     "# `count_word_lengths_test`: Test cell\n",
662 |     "\n",
663 |     "# Test 1: Example\n",
664 |     "qbf_str = 'the quick brown fox jumped over the lazy dog'\n",
665 |     "qbf_lens = count_word_lengths(qbf_str)\n",
666 |     "print(\"Test 1: count_word_lengths('{}') == {}\".format(qbf_str, qbf_lens))\n",
667 |     "assert qbf_lens == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n",
668 |     "\n",
669 |     "# Test 2: Random strings\n",
670 |     "from random import choice # 3.5.2 does not have `choices()` (available in 3.6+)\n",
671 |     "#return ''.join([choice('abcdefghijklmnopqrstuvwxyz') for _ in range(n)])\n",
672 |     "\n",
673 |     "def random_letter_or_space(pr_space=0.15):\n",
674 |     "    from random import choice, random\n",
675 |     "    is_space = (random() <= pr_space)\n",
676 |     "    if is_space:\n",
677 |     "        return ' '\n",
678 |     "    return random_letter()\n",
679 |     "\n",
680 |     "S_LEN = 40\n",
681 |     "W_SPACE = 1 / 6\n",
682 |     "rand_str = random_string(S_LEN, fun=random_letter_or_space)\n",
683 |     "rand_lens = count_word_lengths(rand_str)\n",
684 |     "print(\"Test 2: count_word_lengths('{}') == '{}'\".format(rand_str, rand_lens))\n",
685 |     "c = 0\n",
686 |     "while c < len(rand_str) and rand_str[c] == ' ':\n",
687 |     "    c += 1\n",
688 |     "for k in rand_lens:\n",
689 |     "    print(\"  => '{}'\".format(rand_str[c:c+k]))\n",
690 |     "    assert (c+k) == len(rand_str) or rand_str[c+k] == ' '\n",
691 |     "    c += k\n",
692 |     "    while c < len(rand_str) and rand_str[c] == ' ':\n",
693 |     "        c += 1\n",
694 |     "    \n",
695 |     "# Test 3: Empty string\n",
696 |     "print(\"Test 3: Empty strings...\")\n",
697 |     "assert count_word_lengths('') == []\n",
698 |     "assert count_word_lengths('   ') == []\n",
699 |     "\n",
700 |     "print(\"\\n(Passed!)\")"
701 |    ]
702 |   },
703 |   {
704 |    "cell_type": "code",
705 |    "execution_count": null,
706 |    "metadata": {},
707 |    "outputs": [],
708 |    "source": []
709 |   }
710 |  ],
711 |  "metadata": {
712 |   "celltoolbar": "Create Assignment",
713 |   "kernelspec": {
714 |    "display_name": "Python 3",
715 |    "language": "python",
716 |    "name": "python3"
717 |   },
718 |   "language_info": {
719 |    "codemirror_mode": {
720 |     "name": "ipython",
721 |     "version": 3
722 |    },
723 |    "file_extension": ".py",
724 |    "mimetype": "text/x-python",
725 |    "name": "python",
726 |    "nbconvert_exporter": "python",
727 |    "pygments_lexer": "ipython3",
728 |    "version": "3.5.2"
729 |   }
730 |  },
731 |  "nbformat": 4,
732 |  "nbformat_minor": 2
733 | }
734 | 
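
Exercises 4 and 5 in the notebook above ask for the floor and the ceiling of `a / b` as an `int`. The following is a minimal sketch of one way to do that with the standard `math` module, assuming `a >= 0` and `b > 0` as the exercises state. The names `floor_fraction_sketch` and `ceiling_fraction_sketch` are hypothetical and are not part of the notebook.

```python
import math


def floor_fraction_sketch(a, b):
    """Floor of a/b as an int, assuming a >= 0 and b > 0 (hypothetical helper)."""
    # In Python 3, math.floor returns an int for float input.
    return math.floor(a / b)


def ceiling_fraction_sketch(a, b):
    """Ceiling of a/b as an int, assuming a >= 0 and b > 0 (hypothetical helper)."""
    # math.ceil leaves an exact quotient unchanged: ceil(2.0) == 2.
    return math.ceil(a / b)


print(floor_fraction_sketch(7, 2))    # 3
print(ceiling_fraction_sketch(7, 2))  # 4
print(ceiling_fraction_sketch(4, 2))  # 2
```

The case `ceiling_fraction_sketch(4, 2)` is worth checking by hand: when `b` divides `a` exactly, the ceiling equals the quotient itself, so an approach that always adds one to the truncated quotient would overshoot.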


--------------------------------------------------------------------------------
/Notebook10_NumpyMatrix_part2.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "deletable": false,
  7 |     "nbgrader": {
  8 |      "grade": false,
  9 |      "grade_id": "cell-467487ba59ea183e",
 10 |      "locked": true,
 11 |      "schema_version": 1,
 12 |      "solution": false
 13 |     }
 14 |    },
 15 |    "source": [
 16 |     "# Part 2: Dense matrix storage \n",
 17 |     "This part of the lab is a brief introduction to efficient storage of matrices."
 18 |    ]
 19 |   },
 20 |   {
 21 |    "cell_type": "markdown",
 22 |    "metadata": {},
 23 |    "source": [
 24 |     "**Exercise 0** (ungraded). Import Numpy!"
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "code",
 29 |    "execution_count": 3,
 30 |    "metadata": {
 31 |     "collapsed": true,
 32 |     "nbgrader": {
 33 |      "grade": false,
 34 |      "grade_id": "cell-4263c0d16078cf0a",
 35 |      "locked": true,
 36 |      "schema_version": 1,
 37 |      "solution": false
 38 |     }
 39 |    },
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "import numpy as np"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "markdown",
 47 |    "metadata": {},
 48 |    "source": [
 49 |     "## Dense matrix storage: Column-major versus row-major layouts\n",
 50 |     "\n",
 51 |     "For linear algebra, we will be especially interested in 2-D arrays, which we will use to store matrices. For this common case, there is a subtle performance issue related to how matrices are stored in memory.\n",
 52 |     "\n",
 53 |     "By way of background, physical storage---whether it be memory or disk---is basically one big array. And because of how physical storage is implemented, it turns out that it is much faster to access consecutive elements in memory than, say, to jump around randomly."
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "markdown",
 58 |    "metadata": {},
 59 |    "source": [
 60 |     "A matrix is a two-dimensional object. Thus, when it is stored in memory, it must be mapped in some way to the one-dimensional physical array. There are many possible mappings, but the two most common conventions are known as the _column-major_ and _row-major_ layouts:\n",
 61 |     "\n",
 62 |     "<img src=\"matrix-layout.png\" alt=\"Column-major versus row-major matrix layouts\" width=\"640\">"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "markdown",
 67 |    "metadata": {},
 68 |    "source": [
 69 |     "**Exercise 1** (2 points). Let $A$ be an $m \\times n$ matrix stored in column-major format. Let $B$ be an $m \\times n$ matrix stored in row-major format.\n",
 70 |     "\n",
 71 |     "Based on the preceding discussion, recall that these objects will be mapped to 1-D arrays of length $mn$, behind the scenes. Let's call the 1-D array representations $\\hat{A}$ and $\\hat{B}$. Thus, the $(i, j)$ element of $A$, $a_{ij}$, will map to some element $\\hat{a}_u$ of $\\hat{A}$; similarly, $b_{ij}$ will map to some element $\\hat{b}_v$ of $\\hat{B}$.\n",
 72 |     "\n",
 73 |     "Determine formulae to compute the 1-D index values, $u$ and $v$, in terms of $\\{i, j, m, n\\}$. Assume that all indices are 0-based, i.e., $0 \\leq i \\leq m-1$, $0 \\leq j \\leq n-1$, and $0 \\leq u, v \\leq mn-1$."
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": 2,
 79 |    "metadata": {
 80 |     "collapsed": true,
 81 |     "deletable": false,
 82 |     "nbgrader": {
 83 |      "checksum": "e628bb9dc9f0e8a68ad52ba1d43caca4",
 84 |      "grade": false,
 85 |      "grade_id": "calc_u",
 86 |      "locked": false,
 87 |      "schema_version": 1,
 88 |      "solution": true
 89 |     }
 90 |    },
 91 |    "outputs": [],
 92 |    "source": [
 93 |     "def linearize_colmajor(i, j, m, n): # calculate `u`\n",
 94 |     "    \"\"\"\n",
 95 |     "    Returns the linear index for the `(i, j)` entry of\n",
 96 |     "    an `m`-by-`n` matrix stored in column-major order.\n",
 97 |     "    \"\"\"\n",
 98 |     "    # YOUR CODE HERE\n",
 99 |     "    # colmajor_linear_index \n",
100 |     "    u = i + m * j\n",
101 |     "    return u\n"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 3,
107 |    "metadata": {
108 |     "collapsed": true,
109 |     "deletable": false,
110 |     "nbgrader": {
111 |      "checksum": "a7ca53a5a658c1c36cbdf1a3ef1a03d9",
112 |      "grade": false,
113 |      "grade_id": "calc_v",
114 |      "locked": false,
115 |      "schema_version": 1,
116 |      "solution": true
117 |     }
118 |    },
119 |    "outputs": [],
120 |    "source": [
121 |     "def linearize_rowmajor(i, j, m, n): # calculate `v`\n",
122 |     "    \"\"\"\n",
123 |     "    Returns the linear index for the `(i, j)` entry of\n",
124 |     "    an `m`-by-`n` matrix stored in row-major order.\n",
125 |     "    \"\"\"\n",
126 |     "    # YOUR CODE HERE\n",
127 |     "    # rowmajor_linear_index\n",
128 |     "    v = i * n + j\n",
129 |     "    return v"
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "code",
134 |    "execution_count": 4,
135 |    "metadata": {
136 |     "deletable": false,
137 |     "nbgrader": {
138 |      "checksum": "917a5dd8ab9f91bc5fbae92435acd42d",
139 |      "grade": true,
140 |      "grade_id": "calc_uv_test",
141 |      "locked": true,
142 |      "points": 2,
143 |      "schema_version": 1,
144 |      "solution": false
145 |     }
146 |    },
147 |    "outputs": [
148 |     {
149 |      "name": "stdout",
150 |      "output_type": "stream",
151 |      "text": [
152 |       "(Passed.)\n"
153 |      ]
154 |     }
155 |    ],
156 |    "source": [
157 |     "# Test cell: `calc_uv_test`\n",
158 |     "\n",
159 |     "# Quick check (not exhaustive):\n",
160 |     "assert linearize_colmajor(7, 4, 10, 20) == 47\n",
161 |     "assert linearize_rowmajor(7, 4, 10, 20) == 144\n",
162 |     "\n",
163 |     "assert linearize_colmajor(10, 8, 86, 26) == 698\n",
164 |     "assert linearize_rowmajor(10, 8, 86, 26) == 268\n",
165 |     "\n",
166 |     "assert linearize_colmajor(8, 34, 17, 40) == 586\n",
167 |     "assert linearize_rowmajor(8, 34, 17, 40) == 354\n",
168 |     "\n",
169 |     "assert linearize_colmajor(32, 48, 37, 55) == 1808\n",
170 |     "assert linearize_rowmajor(32, 48, 37, 55) == 1808\n",
171 |     "\n",
172 |     "assert linearize_colmajor(24, 33, 57, 87) == 1905\n",
173 |     "assert linearize_rowmajor(24, 33, 57, 87) == 2121\n",
174 |     "\n",
175 |     "assert linearize_colmajor(10, 3, 19, 74) == 67\n",
176 |     "assert linearize_rowmajor(10, 3, 19, 74) == 743\n",
177 |     "\n",
178 |     "print (\"(Passed.)\")"
179 |    ]
180 |   },
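    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "As a cross-check on Exercise 1, the sketch below compares both index formulas against Numpy's `np.ravel_multi_index`, which linearizes a multi-index in either order (`order='F'` for column-major, `order='C'` for row-major). The helper names `colmajor_index` and `rowmajor_index` and the 3-by-4 shape are illustrative assumptions, not part of the exercise.\n",
    |     "\n",
    |     "```python\n",
    |     "import numpy as np\n",
    |     "\n",
    |     "def colmajor_index(i, j, m, n):\n",
    |     "    # Column-major: each column is contiguous and has m entries\n",
    |     "    return i + m * j\n",
    |     "\n",
    |     "def rowmajor_index(i, j, m, n):\n",
    |     "    # Row-major: each row is contiguous and has n entries\n",
    |     "    return i * n + j\n",
    |     "\n",
    |     "m, n = 3, 4  # arbitrary small shape, for illustration only\n",
    |     "for i in range(m):\n",
    |     "    for j in range(n):\n",
    |     "        assert colmajor_index(i, j, m, n) == np.ravel_multi_index((i, j), (m, n), order='F')\n",
    |     "        assert rowmajor_index(i, j, m, n) == np.ravel_multi_index((i, j), (m, n), order='C')\n",
    |     "```\n",
    |     "\n",
    |     "The loop finishes silently when both formulas agree with Numpy for every entry."
    |    ]
    |   },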
181 |   {
182 |    "cell_type": "markdown",
183 |    "metadata": {},
184 |    "source": [
185 |     "## Requesting a layout in Numpy\n",
186 |     "\n",
187 |     "In Numpy, you can ask for either layout. The default in Numpy is row-major.\n",
188 |     "\n",
189 |     "Historically, numerical linear algebra libraries were developed assuming column-major layout, which is the default when you declare a 2-D array in the Fortran programming language. By contrast, in the C and C++ programming languages, the default convention for a 2-D array is row-major layout. So the Numpy default is the C/C++ convention.\n",
190 |     "\n",
191 |     "In your programs, you can request either layout from Numpy using the `order` parameter. For linear algebra operations, which are common, we recommend using the column-major convention.\n",
192 |     "\n",
193 |     "In either case, here is how you would create column- and row-major matrices."
194 |    ]
195 |   },
196 |   {
197 |    "cell_type": "code",
198 |    "execution_count": 4,
199 |    "metadata": {},
200 |    "outputs": [],
201 |    "source": [
202 |     "n = 5000\n",
203 |     "A_colmaj = np.ones((n, n), order='F') # column-major (Fortran convention)\n",
204 |     "A_rowmaj = np.ones((n, n), order='C') # row-major (C/C++ convention)"
205 |    ]
206 |   },
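    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "To confirm which layout an array actually has, you can inspect its `flags` and `strides` attributes. The sketch below uses small 3-by-4 arrays (an arbitrary size chosen for readability) and assumes the default `float64` element type, which occupies 8 bytes.\n",
    |     "\n",
    |     "```python\n",
    |     "import numpy as np\n",
    |     "\n",
    |     "A_f = np.ones((3, 4), order='F')  # column-major\n",
    |     "A_c = np.ones((3, 4), order='C')  # row-major\n",
    |     "\n",
    |     "# The flags record which layout the underlying buffer uses\n",
    |     "print(A_f.flags['F_CONTIGUOUS'], A_f.flags['C_CONTIGUOUS'])\n",
    |     "print(A_c.flags['F_CONTIGUOUS'], A_c.flags['C_CONTIGUOUS'])\n",
    |     "\n",
    |     "# Strides are the byte steps taken per unit index along each axis\n",
    |     "print(A_f.strides)  # (8, 24): moving down a column is the short step\n",
    |     "print(A_c.strides)  # (32, 8): moving along a row is the short step\n",
    |     "```\n",
    |     "\n",
    |     "The axis with the smallest stride is the one whose elements sit next to each other in memory."
    |    ]
    |   },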
207 |   {
208 |    "cell_type": "markdown",
209 |    "metadata": {},
210 |    "source": [
211 |     "**Exercise 2** (1 point). Given a matrix $A$, write a function that scales each column, $A(:, j)$, by $j$. Then compare the speed of applying that function to matrices in row-major and column-major order."
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 1,
217 |    "metadata": {
218 |     "collapsed": true,
219 |     "deletable": false,
220 |     "nbgrader": {
221 |      "checksum": "8abc750df11036d09bd1e787452d2682",
222 |      "grade": false,
223 |      "grade_id": "scale_colwise",
224 |      "locked": false,
225 |      "schema_version": 1,
226 |      "solution": true
227 |     }
228 |    },
229 |    "outputs": [],
230 |    "source": [
231 |     "def scale_colwise(A):\n",
232 |     "    \"\"\"Given a Numpy matrix `A`, visits each column `A[:, j]`\n",
233 |     "    and scales it by `j`.\"\"\"\n",
234 |     "    assert type(A) is np.ndarray\n",
235 |     "    \n",
236 |     "    n_cols = A.shape[1] # number of columns\n",
237 |     "    # YOUR CODE HERE\n",
238 |     "    # Scaling all of `A` by `n_cols` would be wrong: each column\n",
239 |     "    # needs its own factor, so visit the columns one at a time.\n",
240 |     "    # Note that this updates `A` in place.\n",
241 |     "    for j in range(n_cols):\n",
242 |     "        A[:,j] *= j\n",
243 |     "    \n",
244 |     "    return A"
245 |    ]
246 |   },
247 |   {
248 |    "cell_type": "code",
249 |    "execution_count": 5,
250 |    "metadata": {
251 |     "nbgrader": {
252 |      "grade": true,
253 |      "grade_id": "scale_colwise_test",
254 |      "locked": true,
255 |      "points": 1,
256 |      "schema_version": 1,
257 |      "solution": false
258 |     }
259 |    },
260 |    "outputs": [
261 |     {
262 |      "name": "stdout",
263 |      "output_type": "stream",
264 |      "text": [
265 |       "120 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
266 |       "31.9 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
267 |      ]
268 |     }
269 |    ],
270 |    "source": [
271 |     "# Test (timing) cell: `scale_colwise_test`\n",
272 |     "\n",
273 |     "# Measure time to scale a row-major input column-wise\n",
274 |     "%timeit scale_colwise(A_rowmaj)\n",
275 |     "\n",
276 |     "# Measure time to scale a column-major input column-wise\n",
277 |     "%timeit scale_colwise(A_colmaj)"
278 |    ]
279 |   },
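    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "For comparison, the same column scaling can be written without an explicit Python loop by broadcasting a vector of scale factors across the rows. This is only a sketch: the name `scale_colwise_bcast` is hypothetical, and the exercise above intentionally uses the loop so that the effect of the memory layout is visible.\n",
    |     "\n",
    |     "```python\n",
    |     "import numpy as np\n",
    |     "\n",
    |     "def scale_colwise_bcast(A):\n",
    |     "    # Multiply column j by j, in place; the 1-D array of factors\n",
    |     "    # 0, 1, ..., n_cols-1 is broadcast across every row of A\n",
    |     "    A *= np.arange(A.shape[1])\n",
    |     "    return A\n",
    |     "\n",
    |     "A = np.ones((3, 4), order='F')\n",
    |     "print(scale_colwise_bcast(A))\n",
    |     "```\n",
    |     "\n",
    |     "Because the update is in place, the array keeps whatever layout it was created with."
    |    ]
    |   },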
280 |   {
281 |    "cell_type": "markdown",
282 |    "metadata": {
283 |     "collapsed": true
284 |    },
285 |    "source": [
286 |     "## Python vs. Numpy example: Matrix-vector multiply\n",
287 |     "\n",
288 |     "Look at the definition of matrix-vector multiplication from [Da Kuang's linear algebra notes](https://www.dropbox.com/s/f410k9fgd7iesdv/kuang-linalg-notes.pdf?dl=0). Let's benchmark a matrix-vector multiply in native Python, and compare that to doing the same operation in Numpy.\n",
289 |     "\n",
290 |     "First, some setup. (What does this code do?)"
291 |    ]
292 |   },
293 |   {
294 |    "cell_type": "code",
295 |    "execution_count": 8,
296 |    "metadata": {
297 |     "collapsed": true
298 |    },
299 |    "outputs": [],
300 |    "source": [
301 |     "# Dimensions; you might shrink this value for debugging\n",
302 |     "n = 2500"
303 |    ]
304 |   },
305 |   {
306 |    "cell_type": "code",
307 |    "execution_count": 9,
308 |    "metadata": {
309 |     "collapsed": true
310 |    },
311 |    "outputs": [],
312 |    "source": [
313 |     "# Generate random values, for use in populating the matrix and vector\n",
314 |     "from random import gauss\n",
315 |     "\n",
316 |     "# Native Python, using lists\n",
317 |     "A_py = [gauss(0, 1) for i in range(n*n)] # Assume: Column-major\n",
318 |     "x_py = [gauss(0, 1) for i in range(n)]\n"
319 |    ]
320 |   },
321 |   {
322 |    "cell_type": "code",
323 |    "execution_count": 10,
324 |    "metadata": {
325 |     "collapsed": true
326 |    },
327 |    "outputs": [],
328 |    "source": [
329 |     "# Convert values into Numpy arrays in column-major order\n",
330 |     "A_np = np.reshape(A_py, (n, n), order='F')\n",
331 |     "x_np = np.reshape(x_py, (n, 1), order='F')\n"
332 |    ]
333 |   },
334 |   {
335 |    "cell_type": "code",
336 |    "execution_count": 11,
337 |    "metadata": {},
338 |    "outputs": [
339 |     {
340 |      "data": {
341 |       "text/plain": [
342 |        "array([[  13.89207985],\n",
343 |        "       [ 104.68760455],\n",
344 |        "       [  76.12002576],\n",
345 |        "       ..., \n",
346 |        "       [   0.41580127],\n",
347 |        "       [ -19.92075803],\n",
348 |        "       [  41.50614829]])"
349 |       ]
350 |      },
351 |      "execution_count": 11,
352 |      "metadata": {},
353 |      "output_type": "execute_result"
354 |     }
355 |    ],
356 |    "source": [
357 |     "A_np.dot(x_np)"
358 |    ]
359 |   },
360 |   {
361 |    "cell_type": "code",
362 |    "execution_count": 12,
363 |    "metadata": {},
364 |    "outputs": [
365 |     {
366 |      "name": "stdout",
367 |      "output_type": "stream",
368 |      "text": [
369 |       "1.49 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
370 |      ]
371 |     }
372 |    ],
373 |    "source": [
374 |     "# Here is how you do a \"matvec\" in Numpy:\n",
375 |     "%timeit A_np.dot(x_np)"
376 |    ]
377 |   },
378 |   {
379 |    "cell_type": "markdown",
380 |    "metadata": {},
381 |    "source": [
382 |     "**Exercise 3** (3 points). Implement a matrix-vector product that operates on native Python lists. Assume the 1-D **column-major** storage of the matrix."
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "code",
387 |    "execution_count": 35,
388 |    "metadata": {
389 |     "collapsed": true,
390 |     "deletable": false,
391 |     "nbgrader": {
392 |      "checksum": "407dbfe44ba36ac12065b96d9142e3a4",
393 |      "grade": false,
394 |      "grade_id": "matvec_py",
395 |      "locked": false,
396 |      "schema_version": 1,
397 |      "solution": true
398 |     }
399 |    },
400 |    "outputs": [],
401 |    "source": [
402 |     "def matvec_py(m, n, A, x):\n",
403 |     "    \"\"\"\n",
404 |     "    Native Python-based matrix-vector multiply, using lists.\n",
405 |     "    The dimensions of the matrix A are m-by-n, and x is a\n",
406 |     "    vector of length n.\n",
407 |     "    \"\"\"\n",
408 |     "    assert type(A) is list and all([type(aij) is float for aij in A])\n",
409 |     "    assert type(x) is list\n",
410 |     "    assert len(x) >= n\n",
411 |     "    assert len(A) >= (m*n)\n",
412 |     "\n",
413 |     "    y = []\n",
414 |     "    \n",
415 |     "    # YOUR CODE HERE\n",
416 |     "    A_np = np.reshape(A, (m, n), order='F') # m-by-n array from the flat list\n",
417 |     "    \n",
418 |     "    # Each row has n entries, so the inner loop runs over the n columns\n",
419 |     "    for row in A_np:\n",
420 |     "        row_sum = []\n",
421 |     "        for j in range(n):\n",
422 |     "            row_sum.append(row[j]*x[j])\n",
423 |     "        y.append(sum(row_sum))\n",
424 |     "    \n",
425 |     "    return y\n"
426 |    ]
427 |   },
428 |   {
429 |    "cell_type": "code",
430 |    "execution_count": 36,
431 |    "metadata": {
432 |     "deletable": false,
433 |     "nbgrader": {
434 |      "checksum": "caf39b3909e91acb9536e726b45bc4cf",
435 |      "grade": true,
436 |      "grade_id": "matvec_py_test",
437 |      "locked": true,
438 |      "points": 3,
439 |      "schema_version": 1,
440 |      "solution": false
441 |     }
442 |    },
443 |    "outputs": [
444 |     {
445 |      "name": "stdout",
446 |      "output_type": "stream",
447 |      "text": [
448 |       "==> Error bound estimate:\n",
449 |       "         C*n*eps\n",
450 |       "         == 10*2500*2.22045e-16\n",
451 |       "         == 5.55112e-12\n",
452 |       "\n"
453 |      ]
454 |     },
455 |     {
456 |      "data": {
457 |       "text/latex": [
458 |        "$$||y_{\\textrm{np}} - y_{\\textrm{py}}||_{\\infty} = \\textrm{5.11591e-13} \\leq \\textrm{5.55112e-12}\\ (\\textrm{estimated bound})$$"
459 |       ],
460 |       "text/plain": [
461 |        "<IPython.core.display.Math object>"
462 |       ]
463 |      },
464 |      "metadata": {},
465 |      "output_type": "display_data"
466 |     },
467 |     {
468 |      "name": "stdout",
469 |      "output_type": "stream",
470 |      "text": [
471 |       "\n",
472 |       "(Passed!)\n"
473 |      ]
474 |     }
475 |    ],
476 |    "source": [
477 |     "# Test cell: `matvec_py_test`\n",
478 |     "\n",
479 |     "# Estimate a bound on the difference between these two\n",
480 |     "EPS = np.finfo (float).eps # \"machine epsilon\"\n",
481 |     "CONST = 10.0 # Some constant for the error bound\n",
482 |     "dy_max = CONST * n * EPS\n",
483 |     "\n",
484 |     "print (\"\"\"==> Error bound estimate:\n",
485 |     "         C*n*eps\n",
486 |     "         == %g*%g*%g\n",
487 |     "         == %g\n",
488 |     "\"\"\" % (CONST, n, EPS, dy_max))\n",
489 |     "\n",
490 |     "# Run the Numpy version and your code\n",
491 |     "y_np = A_np.dot (x_np)\n",
492 |     "y_py = matvec_py (n, n, A_py, x_py)\n",
493 |     "\n",
494 |     "# Compute the difference between these\n",
495 |     "dy = y_np - np.reshape (y_py, (n, 1), order='F')\n",
496 |     "dy_norm = np.linalg.norm (dy, ord=np.inf)\n",
497 |     "\n",
498 |     "# Summarize the results\n",
499 |     "from IPython.display import display, Math\n",
500 |     "\n",
501 |     "comparison = r\"\\leq\" if dy_norm <= dy_max else r\"\\gt\"\n",
502 |     "display (Math (\n",
503 |     "        r'||y_{\\textrm{np}} - y_{\\textrm{py}}||_{\\infty}'\n",
504 |     "        r' = \\textrm{%g} %s \\textrm{%g}\\ (\\textrm{estimated bound})'\n",
505 |     "        % (dy_norm, comparison, dy_max)\n",
506 |     "    ))\n",
507 |     "\n",
508 |     "if n <= 4: # Debug: Print all data for small inputs\n",
509 |     "    print (\"@A_np:\\n\", A_np)\n",
510 |     "    print (\"@x_np:\\n\", x_np)\n",
511 |     "    print (\"@y_np:\\n\", y_np)\n",
512 |     "    print (\"@A_py:\\n\", A_py)\n",
513 |     "    print (\"@x_py:\\n\", x_py)\n",
514 |     "    print (\"@y_py:\\n\", y_py)\n",
515 |     "    print (\"@dy:\\n\", dy)\n",
516 |     "\n",
517 |     "# Trigger an error on likely failure\n",
518 |     "assert dy_norm <= dy_max\n",
519 |     "print(\"\\n(Passed!)\")"
520 |    ]
521 |   },
522 |   {
523 |    "cell_type": "code",
524 |    "execution_count": 34,
525 |    "metadata": {
526 |     "nbgrader": {
527 |      "grade": false,
528 |      "grade_id": "cell-f0155950b35ebcf2",
529 |      "locked": true,
530 |      "schema_version": 1,
531 |      "solution": false
532 |     }
533 |    },
534 |    "outputs": [
535 |     {
536 |      "name": "stdout",
537 |      "output_type": "stream",
538 |      "text": [
539 |       "2.98 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
540 |      ]
541 |     }
542 |    ],
543 |    "source": [
544 |     "%timeit matvec_py (n, n, A_py, x_py)"
545 |    ]
546 |   },
547 |   {
548 |    "cell_type": "markdown",
549 |    "metadata": {
550 |     "nbgrader": {
551 |      "grade": false,
552 |      "grade_id": "cell-3c70eea14727218a",
553 |      "locked": true,
554 |      "schema_version": 1,
555 |      "solution": false
556 |     }
557 |    },
558 |    "source": [
559 |     "**Fin!** If you've reached this point and everything executed without error, you can submit this part and move on to the next one."
560 |    ]
561 |   }
562 |  ],
563 |  "metadata": {
564 |   "celltoolbar": "Create Assignment",
565 |   "kernelspec": {
566 |    "display_name": "Python 3",
567 |    "language": "python",
568 |    "name": "python3"
569 |   },
570 |   "language_info": {
571 |    "codemirror_mode": {
572 |     "name": "ipython",
573 |     "version": 3
574 |    },
575 |    "file_extension": ".py",
576 |    "mimetype": "text/x-python",
577 |    "name": "python",
578 |    "nbconvert_exporter": "python",
579 |    "pygments_lexer": "ipython3",
580 |    "version": "3.5.2"
581 |   }
582 |  },
583 |  "nbformat": 4,
584 |  "nbformat_minor": 1
585 | }
586 | 


--------------------------------------------------------------------------------
/Notebook12_LinerRegNotes.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Notes: Solving the linear regression problem\n",
  8 |     "\n",
  9 |     "In the linear regression problem, you are given a data matrix, $X$, and a response, $y$, and you want to find model parameters $\\theta$ that make $y \\approx X \\theta$. These notes sketch one method for solving this problem."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "markdown",
 14 |    "metadata": {},
 15 |    "source": [
 16 |     "## Notation\n",
 17 |     "\n",
 18 |     "Assume your data consists of $m$ observations and $n+1$ variables. One of these variables is the _response_ variable, $y$, which you want to predict from the other $n$ variables, $\\{x_0, \\ldots, x_{n-1}\\}$. You wish to fit a _linear model_ of the following form to these data,\n",
 19 |     "\n",
 20 |     "$$y_i \\approx x_{i,0} \\theta_0 + x_{i,1} \\theta_1 + \\cdots + x_{i,n-1} \\theta_{n-1} + \\theta_n,$$\n",
 21 |     "\n",
 22 |     "where $\\{\\theta_j | 0 \\leq j \\leq n\\}$ is the set of unknown coefficients. Your modeling task is to choose values for these coefficients that \"best fit\" the data."
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "markdown",
 27 |    "metadata": {},
 28 |    "source": [
 29 |     "You can arrange the observations into a tibble like this one:\n",
 30 |     "\n",
 31 |     "|     y      | x<sub>0</sub> | x<sub>1</sub> | $\\cdots$ | x<sub>n-1</sub> | x<sub>n</sub> |\n",
 32 |     "|:----------:|:-------------:|:-------------:|:--------:|:---------------:|:-------------:|\n",
 33 |     "|   $y_0$    |   $x_{0,0}$   |   $x_{0,1}$   | $\\cdots$ |   $x_{0,n-1}$   |      1.0      |\n",
 34 |     "|   $y_1$    |   $x_{1,0}$   |   $x_{1,1}$   | $\\cdots$ |   $x_{1,n-1}$   |      1.0      |\n",
 35 |     "|   $y_2$    |   $x_{2,0}$   |   $x_{2,1}$   | $\\cdots$ |   $x_{2,n-1}$   |      1.0      |\n",
 36 |     "|  $\\vdots$  |   $\\vdots$    |   $\\vdots$    | $\\vdots$ |    $\\vdots$     |   $\\vdots$    |\n",
 37 |     "|  $y_{m-1}$ |  $x_{m-1,0}$  |  $x_{m-1,1}$  | $\\cdots$ |  $x_{m-1,n-1}$  |      1.0      |\n",
 38 |     "\n",
 39 |     "This tibble includes an extra dummy variable, $x_n$, whose entries are all equal to 1.0. Treating each variable as a column vector, the modeling task is to find the vector $\\theta^T \\equiv (\\theta_0, \\theta_1, \\ldots, \\theta_{n})$ such that\n",
 40 |     "\n",
 41 |     "$$y \\approx X \\theta,$$\n",
 42 |     "\n",
 43 |     "where $y$ is the vector of responses and $X$ is the $m \\times (n+1)$ matrix whose columns are the corresponding vectors, $x_0$, $x_1$, $\\ldots$, $x_n$. The matrix $X$ composed this way from the predictors is sometimes referred to as the _(input) data matrix_."
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "markdown",
 48 |    "metadata": {},
 49 |    "source": [
 50 |     "So how should you choose $\\theta$? Suppose you are given $\\theta$. One way to measure its quality is to look at the difference between $y$ and the _(model) prediction_, $X \\theta$. A natural way to measure that difference is to use some vector norm, like the 2-norm (here, squared):\n",
 51 |     "\n",
 52 |     "$$ \\|X \\theta - y\\|_2^2 \\equiv \\|r\\|_2^2,$$\n",
 53 |     "\n",
 54 |     "where $r \\equiv X \\theta - y$ is the _residual error vector_ or just _residual_ for this model. Each element of $r$ is the residual for a given observation; thus, using the two-norm means each difference is squared, thereby \"penalizing\" larger differences more than smaller ones.\n",
 55 |     "\n",
 56 |     "> The additional squaring of $\\|r\\|_2$ could be interpreted similarly, though in reality it is chosen to simplify the math. In particular, recall (or convince yourself) that $\\|r\\|_2^2 = r^T r$."
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "markdown",
 61 |    "metadata": {},
 62 |    "source": [
 63 |     "Given this error measure, we can now formalize our mathematical goal as an optimization problem: compute the $\\theta$ that _minimizes_ this error:\n",
 64 |     "\n",
 65 |     "$$ \\theta^* = {\\arg\\min_\\theta} \\|X \\theta - y\\|_2^2. $$"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "markdown",
 70 |    "metadata": {},
 71 |    "source": [
 72 |     "## Solving the optimization problem\n",
 73 |     "\n",
 74 |     "Recall from calculus that you can minimize (or maximize) a differentiable function $f(x)$ in a single variable $x$ by computing its derivative $\\left.\\frac{df}{dx}\\right|_{x=x_*}$, setting it to zero, and then solving for $x_*$.\n",
 75 |     "\n",
 76 |     "> **Example.** Let $f(x) \\equiv a x^2 + b x + c$. Then its maximum or minimum occurs at\n",
 77 |     ">\n",
 78 |     "> $$\n",
 79 |     "    \\left. \\frac{df}{dx} \\right|_{x=x_*} = 2 a x_* + b = 0,\n",
 80 |     "  $$\n",
 81 |     ">\n",
 82 |     "> or when\n",
 83 |     "> \n",
 84 |     "> $$\n",
 85 |     "    x_* = -\\frac{b}{2 a}.\n",
 86 |     "  $$\n",
 87 |     ">\n",
 88 |     "> To show whether this value is a maximum, a minimum, or neither (an inflection point), you would look at the second derivative. But let's skip that detail for now."
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {},
 94 |    "source": [
 95 |     "In the setting of multivariable calculus, the procedure is the same. Let $g(\\theta)$ be the (scalar) function to minimize or maximize, where $\\theta$ is a vector. For vectors, the analogue of the first-derivative is the _gradient_. We define the gradient of a scalar function $g$ with respect to the vector variable $\\theta$ to be the vector\n",
 96 |     "\n",
 97 |     "$$\n",
 98 |     "\\nabla_\\theta g(\\theta) \\equiv\n",
 99 |     "  \\left(\\begin{array}{c}\n",
100 |     "    \\frac{\\partial g}{\\partial \\theta_0} \\\\\n",
101 |     "    \\frac{\\partial g}{\\partial \\theta_1} \\\\\n",
102 |     "    \\vdots \\\\\n",
103 |     "    \\frac{\\partial g}{\\partial \\theta_{n-1}}\n",
104 |     "  \\end{array}\\right),\n",
105 |     "$$\n",
106 |     "\n",
107 |     "where $\\frac{\\partial g}{\\partial \\theta_i}$ is the partial derivative of $g$ with respect to $\\theta_i$. (To compute a partial derivative with respect to $\\theta_i$, take the ordinary derivative with respect to $\\theta_i$ while treating all other $\\theta_{j \\neq i}$ as constants.) The gradient produces a _vector_ of these partial derivatives.\n",
108 |     "\n",
109 |     "> **Example.** Let $\\theta \\equiv \\left(\\begin{array}{c} \\theta_0 \\\\ \\theta_1 \\end{array}\\right)$ and $g(\\theta) \\equiv \\|\\theta\\|_2^2$. That is,\n",
110 |     ">\n",
111 |     "> $$ g(\\theta) = \\|\\theta\\|_2^2 \\Longrightarrow g(\\theta_0, \\theta_1) = \\theta_0^2 + \\theta_1^2. $$\n",
112 |     ">\n",
113 |     "> Then,\n",
114 |     ">\n",
115 |     "> $$\n",
116 |     "    \\nabla_\\theta\\, g(\\theta)\n",
117 |     "      = \\left(\\begin{array}{c}\n",
118 |     "          \\frac{\\partial g}{\\partial \\theta_0} \\\\\n",
119 |     "          \\frac{\\partial g}{\\partial \\theta_1}\n",
120 |     "        \\end{array}\\right)\n",
121 |     "      = \\left(\\begin{array}{c}\n",
122 |     "          \\frac{\\partial}{\\partial \\theta_0} (\\theta_0^2 + \\theta_1^2) \\\\\n",
123 |     "          \\frac{\\partial}{\\partial \\theta_1} (\\theta_0^2 + \\theta_1^2)\n",
124 |     "        \\end{array}\\right)\n",
125 |     "      = \\left(\\begin{array}{c}\n",
126 |     "          2 \\theta_0 \\\\\n",
127 |     "          2 \\theta_1\n",
128 |     "        \\end{array}\\right)\n",
129 |     "      = 2 \\theta.\n",
130 |     "  $$"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "markdown",
135 |    "metadata": {},
136 |    "source": [
137 |     "From the definition of the gradient, you should be able to verify the following identities. Below, take $v$ and $w$ to be vectors of length $n$ and $M$ to be an $n \\times n$ matrix.\n",
138 |     "\n",
139 |     "1. $\\nabla_v (v^T w) = w$.\n",
140 |     "2. $\\nabla_v (v^T v) = 2v$. (That is, generalize the example above to an $n$-vector.)\n",
141 |     "3. $\\nabla_v (v^T M v) = (M + M^T)v$."
142 |    ]
143 |   },
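    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "One way to convince yourself of these identities is to compare them against a finite-difference approximation of the gradient. The sketch below checks the third identity numerically; the helper `numerical_gradient`, the step size `h`, the seed, and the random test data are all assumptions made for illustration.\n",
    |     "\n",
    |     "```python\n",
    |     "import numpy as np\n",
    |     "\n",
    |     "def numerical_gradient(g, v, h=1e-6):\n",
    |     "    # Central differences: perturb one coordinate of v at a time\n",
    |     "    grad = np.zeros_like(v)\n",
    |     "    for k in range(len(v)):\n",
    |     "        e = np.zeros_like(v)\n",
    |     "        e[k] = h\n",
    |     "        grad[k] = (g(v + e) - g(v - e)) / (2 * h)\n",
    |     "    return grad\n",
    |     "\n",
    |     "rng = np.random.RandomState(0)  # fixed seed, arbitrary\n",
    |     "n = 4\n",
    |     "M = rng.randn(n, n)\n",
    |     "v = rng.randn(n)\n",
    |     "\n",
    |     "approx = numerical_gradient(lambda u: u.dot(M).dot(u), v)  # gradient of v^T M v\n",
    |     "exact = (M + M.T).dot(v)                                   # identity 3\n",
    |     "print(np.allclose(approx, exact, atol=1e-6))\n",
    |     "```\n",
    |     "\n",
    |     "The same helper can be pointed at $v^T w$ or $v^T v$ to check the first two identities."
    |    ]
    |   },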
144 |   {
145 |    "cell_type": "markdown",
146 |    "metadata": {},
147 |    "source": [
148 |     "**Computing the optimal parameters, $\\theta^*$.** Armed with the gradient, you are now ready to minimize $g(\\theta) \\equiv \\|X \\theta - y\\|_2^2$.\n",
149 |     "\n",
150 |     "In the same way that the derivative is zero at the minimum of a scalar function, the gradient will be zero at the minimum of $g(\\theta)$. So let's compute the gradient and set it to zero."
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "markdown",
155 |    "metadata": {},
156 |    "source": [
157 |     "When\n",
158 |     "\n",
159 |     "$$\n",
160 |     "  \\begin{eqnarray}\n",
161 |     "    \\left. \\nabla_\\theta\\, g(\\theta) \\right|_{\\theta^*} = 0,\n",
162 |     "  \\end{eqnarray}\n",
163 |     "$$\n",
164 |     "\n",
165 |     "then\n",
166 |     "\n",
167 |     "$$\n",
168 |     "\\begin{eqnarray}\n",
169 |     "  \\nabla_{\\theta^*} \\|X\\theta^* - y\\|_2^2\n",
170 |     "    & = & \\nabla_{\\theta^*} \\left( \\theta^{*T} X^T X \\theta^* - 2 \\theta^{*T} X^T y + y^T y \\right) \\\\\n",
171 |     "    & = & 2 (X^T X \\theta^* - X^T y) \\\\\n",
172 |     "    & = & 0.\n",
173 |     "\\end{eqnarray}\n",
174 |     "$$\n",
175 |     "\n",
176 |     "In other words, the $\\theta^*$ at the minimum is the solution of $X^T X \\theta^* = X^T y$. This system is known as the _normal equations_. If the data matrix $X$ has full column rank, then $X^T X$ is invertible and this system has a unique solution.\n",
177 |     "\n",
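    |     "As a minimal sketch, the normal equations can be solved directly with `np.linalg.solve` and compared against Numpy's least-squares routine, `np.linalg.lstsq`. The data here are synthetic: the sizes, the seed, the noise level, and the coefficients in `theta_true` are made up for illustration. Forming $X^T X$ explicitly is done only because it mirrors the derivation.\n",
    |     "\n",
    |     "```python\n",
    |     "import numpy as np\n",
    |     "\n",
    |     "rng = np.random.RandomState(0)  # fixed seed, arbitrary\n",
    |     "m, n = 50, 3\n",
    |     "X = np.hstack([rng.randn(m, n), np.ones((m, 1))])  # last column is the dummy x_n\n",
    |     "theta_true = np.array([1.0, -2.0, 0.5, 3.0])       # made-up coefficients\n",
    |     "y = X.dot(theta_true) + 0.01 * rng.randn(m)        # responses with small noise\n",
    |     "\n",
    |     "# Solve X^T X theta = X^T y directly\n",
    |     "theta_ne = np.linalg.solve(X.T.dot(X), X.T.dot(y))\n",
    |     "\n",
    |     "# Reference answer from Numpy's least-squares solver\n",
    |     "theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]\n",
    |     "print(np.allclose(theta_ne, theta_ls))\n",
    |     "```\n",
    |     "\n",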
178 |     "> Again, like the 1-D case, we've glossed over the fact that you need one more step to show that $\\theta^*$ actually minimizes $g(\\theta)$."
179 |    ]
180 |   }
181 |  ],
182 |  "metadata": {
183 |   "anaconda-cloud": {},
184 |   "celltoolbar": "Create Assignment",
185 |   "kernelspec": {
186 |    "display_name": "Python 3",
187 |    "language": "python",
188 |    "name": "python3"
189 |   },
190 |   "language_info": {
191 |    "codemirror_mode": {
192 |     "name": "ipython",
193 |     "version": 3
194 |    },
195 |    "file_extension": ".py",
196 |    "mimetype": "text/x-python",
197 |    "name": "python",
198 |    "nbconvert_exporter": "python",
199 |    "pygments_lexer": "ipython3",
200 |    "version": "3.5.2"
201 |   }
202 |  },
203 |  "nbformat": 4,
204 |  "nbformat_minor": 1
205 | }
206 | 


--------------------------------------------------------------------------------
/Notebook12_PerturbationTheory.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Notes: Perturbation theory and condition numbers\n",
  8 |     "\n",
  9 |     "Let's start by asking how \"hard\" it is to solve a given linear system, $Ax=b$. You will apply perturbation theory to answer this question.\n",
 10 |     "\n",
 11 |     "This notebook is only for your edification. You do not need to submit it, but you are responsible for understanding the concept of a _condition number_ and how to estimate it for a matrix using Numpy. (The code below shows you how!)"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "**Intuition: Continuous functions of a single variable.** To build your intuition, consider the simple case of a scalar function in a single continuous variable, $y = f(x)$. Suppose the input is perturbed by some amount, $\\Delta x$. The output will also change by some amount, $\\Delta y$. How large is $\\Delta y$ relative to $\\Delta x$?\n",
 19 |     "\n",
 20 |     "Supposing $\\Delta x$ is sufficiently small, you can approximate the change in the output by a Taylor series expansion of $f(x + \\Delta x)$:\n",
 21 |     "\n",
 22 |     "$$\n",
 23 |     "  y + \\Delta y = f(x + \\Delta x) = f(x) + \\Delta x \\frac{df}{dx} + O(\\Delta x^2).\n",
 24 |     "$$\n",
 25 |     "\n",
 26 |     "Since $\\Delta x$ is assumed to be \"small,\" we can approximate this relation by\n",
 27 |     "\n",
 28 |     "$$\n",
 29 |     "\\begin{eqnarray}\n",
 30 |     "    y + \\Delta y & \\approx & f(x) + \\Delta x \\frac{df}{dx} \\\\\n",
 31 |     "        \\Delta y & \\approx & \\Delta x \\frac{df}{dx}.\n",
 32 |     "\\end{eqnarray}\n",
 33 |     "$$\n",
 34 |     "\n",
 35 |     "This result should not be surprising: the first derivative measures the sensitivity of changes in the output to changes in the input. We will give the derivative a special name: it is the _(absolute) condition number_. If it is very large in the vicinity of $x$, then even small changes to the input will result in large changes in the output. Put differently, a large condition number indicates that the problem is intrinsically sensitive, so we should expect it may be difficult to construct an accurate algorithm.\n",
 36 |     "\n",
 37 |     "In addition to the absolute condition number, we can define a _relative_ condition number for the problem of evaluating $f(x)$.\n",
 38 |     "\n",
 39 |     "$$\n",
 40 |     "\\begin{eqnarray}\n",
 41 |     "                \\Delta y &  \\approx   & \\Delta x \\frac{df}{dx} \\\\\n",
 42 |     "                         & \\Downarrow & \\\\\n",
 43 |     "  \\frac{|\\Delta y|}{|y|} &  \\approx   & \\frac{|\\Delta x|}{|x|} \\cdot \\underbrace{\\frac{|df/dx| \\cdot |x|}{|f(x)|}}_{\\kappa_f(x)}.\n",
 44 |     "\\end{eqnarray}\n",
 45 |     "$$\n",
 46 |     "\n",
 47 |     "Here, the underbraced factor, defined to be $\\kappa_f(x)$, is the relative analogue of the absolute condition number. Again, its magnitude tells us whether the output is sensitive to the input."
 48 |    ]
 49 |   },
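    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "To make the relative condition number concrete, the sketch below evaluates $\\kappa_f(x)$ for two scalar functions. The helper `rel_cond` and the two test functions are illustrative choices, not part of the notes; the derivative is supplied by hand.\n",
    |     "\n",
    |     "```python\n",
    |     "import math\n",
    |     "\n",
    |     "def rel_cond(f, dfdx, x):\n",
    |     "    # kappa_f(x) = |f'(x)| * |x| / |f(x)|\n",
    |     "    return abs(dfdx(x)) * abs(x) / abs(f(x))\n",
    |     "\n",
    |     "# The square root is well-conditioned: kappa is 1/2 for every x > 0\n",
    |     "kappa_sqrt = rel_cond(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)\n",
    |     "\n",
    |     "# Subtracting nearly equal numbers is ill-conditioned: kappa blows up\n",
    |     "# as x approaches 1, where f(x) = x - 1 approaches zero\n",
    |     "kappa_sub = rel_cond(lambda x: x - 1.0, lambda x: 1.0, 1.0 + 1e-8)\n",
    |     "\n",
    |     "print(kappa_sqrt, kappa_sub)\n",
    |     "```\n",
    |     "\n",
    |     "A relative perturbation of the input to `sqrt` is roughly halved in the output, whereas the same relative perturbation near $x = 1$ in the subtraction is magnified enormously."
    |    ]
    |   },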
 50 |   {
 51 |    "cell_type": "markdown",
 52 |    "metadata": {},
 53 |    "source": [
 54 |     "**Perturbation theory for linear systems.** What if we perturb a linear system? How can we measure its sensitivity or \"intrinsic difficulty\" to solve?\n",
 55 |     "\n",
 56 |     "First, recall the following linear algebraic inequalities:\n",
 57 |     "\n",
 58 |     "* _Triangle inequality_: $\\|x + y\\|_2 \\leq \\|x\\|_2 + \\|y\\|_2$\n",
 59 |     "* _Norm of a matrix-vector product_: $\\|Ax\\|_2 \\leq \\|A\\|_F\\cdot\\|x\\|_2$\n",
 60 |     "* _Norm of matrix-matrix product_: $\\|AB\\|_F \\leq \\|A\\|_F\\cdot\\|B\\|_F$\n",
 61 |     "\n",
 62 |     "To simplify the notation a little, we will drop the \"$2$\" and \"$F$\" subscripts."
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "markdown",
 67 |    "metadata": {},
 68 |    "source": [
 69 |     "Suppose all of $A$, $b$, and the eventual solution $x$ undergo additive perturbations, denoted by $A + \\Delta A$, $b + \\Delta b$, and $x + \\Delta x$, respectively. Then, subtracting the original system from the perturbed system, you would obtain the following.\n",
 70 |     "\n",
 71 |     "$$\n",
 72 |     "\\begin{array}{rrcll}\n",
 73 |     "   &         (A + \\Delta A)(x + \\Delta x) & = & b + \\Delta b & \\\\\n",
 74 |     "- [&                                   Ax & = & b & ] \\\\\n",
 75 |     "\\hline\n",
 76 |     "   & \\Delta A x + (A + \\Delta A) \\Delta x & = & \\Delta b & \\\\\n",
 77 |     "\\end{array}\n",
 78 |     "$$"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "markdown",
 83 |    "metadata": {},
 84 |    "source": [
 85 |     "Now look more closely at the perturbation, $\\Delta x$, of the solution. Let $\\hat{x} \\equiv x + \\Delta x$ be the perturbed solution. Then the above can be rewritten as,\n",
 86 |     "\n",
 87 |     "$$\\Delta x = A^{-1} \\left(\\Delta b - \\Delta A \\hat{x}\\right),$$\n",
 88 |     "\n",
 89 |     "where we have assumed that $A$ is invertible. (That won't be true for our overdetermined system, but let's not worry about that for the moment.)"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "How large is $\\Delta x$? Let's use a norm to measure it and bound it using the identities above:\n",
 97 |     "\n",
 98 |     "$$\n",
 99 |     "\\begin{array}{rcl}\n",
100 |     "  \\|\\Delta x\\| &   =   & \\|A^{-1} \\left(\\Delta b - \\Delta A \\hat{x}\\right)\\| \\\\\n",
101 |     "               &  \\leq & \\|A^{-1}\\|\\cdot\\left(\\|\\Delta b\\| + \\|\\Delta A\\|\\cdot\\|\\hat{x}\\|\\right).\n",
102 |     "\\end{array}\n",
103 |     "$$"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "markdown",
108 |    "metadata": {},
109 |    "source": [
110 |     "You can rewrite this as follows:\n",
111 |     "\n",
112 |     "$$\n",
113 |     "\\begin{array}{rcl}\n",
114 |     "  \\frac{\\|\\Delta x\\|}\n",
115 |     "       {\\|\\hat{x}\\|}\n",
116 |     "    & \\leq &\n",
117 |     "    \\|A^{-1}\\| \\cdot \\|A\\| \\cdot \\left(\n",
118 |     "                                   \\frac{\\|\\Delta A\\|}\n",
119 |     "                                        {\\|A\\|}\n",
120 |     "                                   +\n",
121 |     "                                   \\frac{\\|\\Delta b\\|}\n",
122 |     "                                        {\\|A\\| \\cdot \\|\\hat{x}\\|}\n",
123 |     "                                 \\right).\n",
124 |     "\\end{array}\n",
125 |     "$$"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "This bound says that the relative error of the perturbed solution, compared to relative perturbations in $A$ and $b$, scales with the product, $\\|A^{-1}\\| \\cdot \\|A\\|$. This factor is the linear systems analogue of the condition number for evaluating the function $f(x)$! As such, we define\n",
133 |     "\n",
134 |     "$$\\kappa(A) \\equiv \\|A^{-1}\\| \\cdot \\|A\\|$$\n",
135 |     "\n",
136 |     "as the _condition number of $A$_ for solving linear systems."
137 |    ]
138 |   },
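The bound above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a random $5 \times 5$ system, perturbations of size about $10^{-6}$, and the 2-norm throughout (the spectral norm for matrices). None of these choices come from the original notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)

# Small additive perturbations of A and b (sizes chosen for illustration).
dA = 1e-6 * rng.standard_normal((n, n))
db = 1e-6 * rng.standard_normal(n)
x_hat = np.linalg.solve(A + dA, b + db)
dx = x_hat - x

norm = np.linalg.norm
kappa = norm(np.linalg.inv(A), 2) * norm(A, 2)  # kappa(A) = ||A^{-1}|| * ||A||

lhs = norm(dx) / norm(x_hat)
rhs = kappa * (norm(dA, 2) / norm(A, 2) + norm(db) / (norm(A, 2) * norm(x_hat)))
print("relative change in solution:", lhs)
print("bound from kappa(A):        ", rhs)
```

The product computed here should agree with `np.linalg.cond(A)`, which uses the 2-norm by default.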
139 |   {
140 |    "cell_type": "markdown",
141 |    "metadata": {},
142 |    "source": [
143 |     "**What values of $\\kappa(A)$ are \"large?\"** Generally, you want to compare $\\kappa(A)$ to $1/\\epsilon$, where $\\epsilon$ is _machine precision_, which is the [maximum relative error under rounding](https://sites.ualberta.ca/~kbeach/phys420_580_2010/docs/ACM-Goldberg.pdf). We may look more closely at floating-point representations later on, but for now, a good notional value for $\\epsilon$ is about $10^{-7}$ in single-precision and $10^{-15}$ in double-precision. (In Python, the default format for floating-point values is double-precision.)\n",
144 |     "\n",
145 |     "This analysis explains why solving the normal equations directly could lead to computational problems. In particular, one can show that $\\kappa(X^T X) \\approx \\kappa(X)^2$, which means forming $X^T X$ explicitly may make the problem harder to solve by a large amount!\n",
146 |     "\n",
147 |     "Another scenario in which $X$ will have a large condition number is when it has nearly collinear predictors. See the examples below."
148 |    ]
149 |   },
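Both claims in the preceding cell can be observed on synthetic data. The following is a sketch, assuming a random $100 \times 3$ data matrix whose third column is made nearly collinear with the second; the sizes, seed, and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))

# Make the third predictor almost a copy of the second (nearly collinear).
X[:, 2] = X[:, 1] + 1e-4 * rng.standard_normal(100)

kappa_X = np.linalg.cond(X)          # 2-norm condition number of X
kappa_XtX = np.linalg.cond(X.T @ X)  # condition number of the Gram matrix

print("kappa(X)     =", kappa_X)
print("kappa(X)^2   =", kappa_X**2)
print("kappa(X^T X) =", kappa_XtX)
```

The near-collinearity already inflates $\kappa(X)$, and forming $X^T X$ roughly squares it.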
150 |   {
151 |    "cell_type": "markdown",
152 |    "metadata": {
153 |     "collapsed": true,
154 |     "nbgrader": {
155 |      "grade": false,
156 |      "grade_id": "cell-921c6afe80942a78",
157 |      "locked": true,
158 |      "schema_version": 1,
159 |      "solution": false
160 |     }
161 |    },
162 |    "source": [
163 |     "**Fin!** That's the end of these notes."
164 |    ]
165 |   }
166 |  ],
167 |  "metadata": {
168 |   "anaconda-cloud": {},
169 |   "celltoolbar": "Create Assignment",
170 |   "kernelspec": {
171 |    "display_name": "Python 3",
172 |    "language": "python",
173 |    "name": "python3"
174 |   },
175 |   "language_info": {
176 |    "codemirror_mode": {
177 |     "name": "ipython",
178 |     "version": 3
179 |    },
180 |    "file_extension": ".py",
181 |    "mimetype": "text/x-python",
182 |    "name": "python",
183 |    "nbconvert_exporter": "python",
184 |    "pygments_lexer": "ipython3",
185 |    "version": "3.5.2"
186 |   }
187 |  },
188 |  "nbformat": 4,
189 |  "nbformat_minor": 1
190 | }
191 | 


--------------------------------------------------------------------------------
/Notebook13_NumpyMatrixManipulation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Mo' Numpy, Mo' Problems\n",
  8 |     "\n",
  9 |     "This notebook is a quick overview of additional functionality in Numpy not explicitly covered in some of the other notebooks in this course."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 1,
 15 |    "metadata": {
 16 |     "collapsed": true
 17 |    },
 18 |    "outputs": [],
 19 |    "source": [
 20 |     "import numpy as np"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "# Random numbers"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "Numpy has a rich collection of (pseudo)random number generators. Here is an example; see the documentation for the [`numpy.random`](https://docs.scipy.org/doc/numpy/reference/routines.random.html) module for more details."
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": 2,
 40 |    "metadata": {},
 41 |    "outputs": [
 42 |     {
 43 |      "data": {
 44 |       "text/plain": [
 45 |        "array([[  2, -10,   0],\n",
 46 |        "       [ -9,   4,  -9],\n",
 47 |        "       [ -1,   8,  -7],\n",
 48 |        "       [ -5,  -3,   4]])"
 49 |       ]
 50 |      },
 51 |      "execution_count": 2,
 52 |      "metadata": {},
 53 |      "output_type": "execute_result"
 54 |     }
 55 |    ],
 56 |    "source": [
 57 |     "A = np.random.randint(-10, 10, size=(4, 3))\n",
 58 |     "A"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "# Aggregations or reductions"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "markdown",
 70 |    "metadata": {},
 71 |    "source": [
 72 |     "Suppose you want to reduce the values of a Numpy array to a smaller number of values. Numpy provides a number of such functions that _aggregate_ values. Examples of aggregations include sums, min/max calculations, and averaging, among others."
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": 3,
 78 |    "metadata": {},
 79 |    "outputs": [
 80 |     {
 81 |      "name": "stdout",
 82 |      "output_type": "stream",
 83 |      "text": [
 84 |       "8 8\n",
 85 |       "-10 -10\n",
 86 |       "-26\n",
 87 |       "-2.16666666667\n",
 88 |       "5.69844033243\n"
 89 |      ]
 90 |     }
 91 |    ],
 92 |    "source": [
 93 |     "print(np.max(A), np.amax(A)) # np.max() and np.amax() are synonyms\n",
 94 |     "print(np.min(A), np.amin(A)) # same\n",
 95 |     "print(np.sum(A))\n",
 96 |     "print(np.mean(A))\n",
 97 |     "print(np.std(A))"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "markdown",
102 |    "metadata": {},
103 |    "source": [
104 |     "The above examples aggregate over all values. But you can also aggregate along a dimension using the optional `axis` parameter."
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "code",
109 |    "execution_count": 4,
110 |    "metadata": {},
111 |    "outputs": [
112 |     {
113 |      "name": "stdout",
114 |      "output_type": "stream",
115 |      "text": [
116 |       "Max in each column: [2 8 4]\n",
117 |       "Max in each row: [2 4 8 4]\n"
118 |      ]
119 |     }
120 |    ],
121 |    "source": [
122 |     "print(\"Max in each column:\", np.amax(A, axis=0)) # i.e., aggregate along axis 0, the rows, producing column maximums\n",
123 |     "print(\"Max in each row:\", np.amax(A, axis=1)) # i.e., aggregate along axis 1, the columns, producing row maximums"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {},
129 |    "source": [
130 |     "# Universal functions"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "markdown",
135 |    "metadata": {},
136 |    "source": [
137 |     "Universal functions apply a given function _elementwise_ to one or more Numpy objects.\n",
138 |     "\n",
139 |     "For instance, `np.abs(A)` takes the absolute value of each element."
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": 5,
145 |    "metadata": {},
146 |    "outputs": [
147 |     {
148 |      "name": "stdout",
149 |      "output_type": "stream",
150 |      "text": [
151 |       "[[  2 -10   0]\n",
152 |       " [ -9   4  -9]\n",
153 |       " [ -1   8  -7]\n",
154 |       " [ -5  -3   4]] \n",
155 |       "==>\n",
156 |       " [[ 2 10  0]\n",
157 |       " [ 9  4  9]\n",
158 |       " [ 1  8  7]\n",
159 |       " [ 5  3  4]]\n"
160 |      ]
161 |     }
162 |    ],
163 |    "source": [
164 |     "print(A, \"\\n==>\\n\", np.abs(A))"
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "markdown",
169 |    "metadata": {},
170 |    "source": [
171 |     "Some universal functions accept multiple, compatible arguments. Given two matrices, $A \\equiv (a_{ij})$ and $B \\equiv (b_{ij})$, the following example computes a matrix $C$ such that $c_{ij} = \\max(a_{ij}, b_{ij})$."
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "code",
176 |    "execution_count": 6,
177 |    "metadata": {},
178 |    "outputs": [
179 |     {
180 |      "data": {
181 |       "text/plain": [
182 |        "array([[ 0, -8,  9],\n",
183 |        "       [-9,  9,  4],\n",
184 |        "       [ 9, -5,  0],\n",
185 |        "       [-6, -2, -3]])"
186 |       ]
187 |      },
188 |      "execution_count": 6,
189 |      "metadata": {},
190 |      "output_type": "execute_result"
191 |     }
192 |    ],
193 |    "source": [
194 |     "B = np.random.randint(-10, 10, size=A.shape)\n",
195 |     "B"
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": 7,
201 |    "metadata": {},
202 |    "outputs": [
203 |     {
204 |      "data": {
205 |       "text/plain": [
206 |        "array([[ 2, -8,  9],\n",
207 |        "       [-9,  9,  4],\n",
208 |        "       [ 9,  8,  0],\n",
209 |        "       [-5, -2,  4]])"
210 |       ]
211 |      },
212 |      "execution_count": 7,
213 |      "metadata": {},
214 |      "output_type": "execute_result"
215 |     }
216 |    ],
217 |    "source": [
218 |     "C = np.maximum(A, B)\n",
219 |     "C"
220 |    ]
221 |   },
222 |   {
223 |    "cell_type": "markdown",
224 |    "metadata": {},
225 |    "source": [
226 |     "You can also build your own universal functions! For instance, suppose you want to compute, elementwise, $f(x) = e^{-x^2}$ and you have a scalar function that implements $f(x)$:"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": 8,
232 |    "metadata": {},
233 |    "outputs": [
234 |     {
235 |      "data": {
236 |       "text/plain": [
237 |        "0.01831563888873418"
238 |       ]
239 |      },
240 |      "execution_count": 8,
241 |      "metadata": {},
242 |      "output_type": "execute_result"
243 |     }
244 |    ],
245 |    "source": [
246 |     "def f(x):\n",
247 |     "    from math import exp\n",
248 |     "    return exp(-(x**2))\n",
249 |     "\n",
250 |     "f(-2) # i.e., exp(-4) ~= 0.01831563888873418"
251 |    ]
252 |   },
253 |   {
254 |    "cell_type": "markdown",
255 |    "metadata": {},
256 |    "source": [
257 |     "This function accepts 1 input (`x`) and returns a single output. The following will create a new Numpy universal function."
258 |    ]
259 |   },
260 |   {
261 |    "cell_type": "code",
262 |    "execution_count": 9,
263 |    "metadata": {},
264 |    "outputs": [
265 |     {
266 |      "name": "stdout",
267 |      "output_type": "stream",
268 |      "text": [
269 |       "[[  2 -10   0]\n",
270 |       " [ -9   4  -9]\n",
271 |       " [ -1   8  -7]\n",
272 |       " [ -5  -3   4]] \n",
273 |       "=>\n",
274 |       " [[0.01831563888873418 3.720075976020836e-44 1.0]\n",
275 |       " [6.639677199580735e-36 1.1253517471925912e-07 6.639677199580735e-36]\n",
276 |       " [0.36787944117144233 1.603810890548638e-28 5.242885663363464e-22]\n",
277 |       " [1.3887943864964021e-11 0.00012340980408667956 1.1253517471925912e-07]]\n"
278 |      ]
279 |     }
280 |    ],
281 |    "source": [
282 |     "f_np = np.frompyfunc(f, 1, 1) # Creates a universal function from `f()`\n",
283 |     "\n",
284 |     "print(A, \"\\n=>\\n\", f_np(A))"
285 |    ]
286 |   },
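One detail worth knowing: a function built with `np.frompyfunc` returns arrays of Python objects (dtype `object`), which is why the output above prints full-precision Python floats. The sketch below compares it with two alternatives, `np.vectorize` with an explicit output type and a direct expression built from Numpy's own `np.exp`. The small matrix is taken from the first rows of `A` above; the variable names are illustrative.

```python
import numpy as np
from math import exp

def f(x):
    return exp(-(x**2))

A = np.array([[2, -10, 0],
              [-9, 4, -9]])

f_obj = np.frompyfunc(f, 1, 1)            # 1 input, 1 output; yields object arrays
f_vec = np.vectorize(f, otypes=[float])   # same scalar function, float64 output

Y_obj = f_obj(A)
Y_vec = f_vec(A)
Y_native = np.exp(-(A.astype(float) ** 2))  # built-in ufuncs only

print(Y_obj.dtype, Y_vec.dtype, Y_native.dtype)
```

When the computation can be written with built-in universal functions, as in `Y_native`, that form is usually the fastest, since the other two still call the Python function once per element.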
287 |   {
288 |    "cell_type": "markdown",
289 |    "metadata": {},
290 |    "source": [
291 |     "# Broadcasting"
292 |    ]
293 |   },
294 |   {
295 |    "cell_type": "markdown",
296 |    "metadata": {},
297 |    "source": [
298 |     "Sometimes we want to combine operations on Numpy arrays that have different shapes but are _compatible_.\n",
299 |     "\n",
300 |     "In the following example, we want to add 3 elementwise to every value in `A`."
301 |    ]
302 |   },
303 |   {
304 |    "cell_type": "code",
305 |    "execution_count": 10,
306 |    "metadata": {},
307 |    "outputs": [
308 |     {
309 |      "name": "stdout",
310 |      "output_type": "stream",
311 |      "text": [
312 |       "[[  2 -10   0]\n",
313 |       " [ -9   4  -9]\n",
314 |       " [ -1   8  -7]\n",
315 |       " [ -5  -3   4]]\n",
316 |       "\n",
317 |       "[[ 5 -7  3]\n",
318 |       " [-6  7 -6]\n",
319 |       " [ 2 11 -4]\n",
320 |       " [-2  0  7]]\n"
321 |      ]
322 |     }
323 |    ],
324 |    "source": [
325 |     "print(A)\n",
326 |     "print()\n",
327 |     "print(A + 3)"
328 |    ]
329 |   },
330 |   {
331 |    "cell_type": "markdown",
332 |    "metadata": {},
333 |    "source": [
334 |     "Technically, `A` and `3` have different shapes: the former is a $4 \\times 3$ matrix, while the latter is a scalar ($1 \\times 1$). However, they are compatible because Numpy has a scheme to _extend_---or **broadcast**---the value 3 into an equivalent matrix object of the same shape, before combining them elementwise."
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "metadata": {},
340 |    "source": [
341 |     "To see a more sophisticated example, suppose each row `A[i, :]` holds the coordinates of a data point, and we want to compute the centroid (or \"center-of-mass,\" if we imagine each point is a unit mass). That's the same as computing the mean coordinate for each column:"
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": 11,
347 |    "metadata": {},
348 |    "outputs": [
349 |     {
350 |      "name": "stdout",
351 |      "output_type": "stream",
352 |      "text": [
353 |       "[[  2 -10   0]\n",
354 |       " [ -9   4  -9]\n",
355 |       " [ -1   8  -7]\n",
356 |       " [ -5  -3   4]] => [-3.25 -0.25 -3.  ]\n"
357 |      ]
358 |     }
359 |    ],
360 |    "source": [
361 |     "A_row_means = np.mean(A, axis=0)\n",
362 |     "print(A, \"=>\", A_row_means)"
363 |    ]
364 |   },
365 |   {
366 |    "cell_type": "markdown",
367 |    "metadata": {},
368 |    "source": [
369 |     "Now, suppose you want to shift the points so that their mean is zero. Even though they don't have the same shape, Numpy will interpret `A - A_row_means` in a way that effectively carries out this operation. That is, it will extend or \"replicate\" `A_row_means` into rows of a matrix of the same shape as `A` and then perform elementwise subtraction."
370 |    ]
371 |   },
372 |   {
373 |    "cell_type": "code",
374 |    "execution_count": 12,
375 |    "metadata": {},
376 |    "outputs": [
377 |     {
378 |      "data": {
379 |       "text/plain": [
380 |        "array([[ 5.25, -9.75,  3.  ],\n",
381 |        "       [-5.75,  4.25, -6.  ],\n",
382 |        "       [ 2.25,  8.25, -4.  ],\n",
383 |        "       [-1.75, -2.75,  7.  ]])"
384 |       ]
385 |      },
386 |      "execution_count": 12,
387 |      "metadata": {},
388 |      "output_type": "execute_result"
389 |     }
390 |    ],
391 |    "source": [
392 |     "A_row_centered = A - A_row_means\n",
393 |     "A_row_centered"
394 |    ]
395 |   },
396 |   {
397 |    "cell_type": "markdown",
398 |    "metadata": {},
399 |    "source": [
400 |     "Suppose you want to mean-center along the _columns_ rather than along the rows, so that the entries of each row average to zero. You could start by averaging across the columns (`axis=1`):"
401 |    ]
402 |   },
403 |   {
404 |    "cell_type": "code",
405 |    "execution_count": 13,
406 |    "metadata": {},
407 |    "outputs": [
408 |     {
409 |      "name": "stdout",
410 |      "output_type": "stream",
411 |      "text": [
412 |       "[[  2 -10   0]\n",
413 |       " [ -9   4  -9]\n",
414 |       " [ -1   8  -7]\n",
415 |       " [ -5  -3   4]] => [-2.66666667 -4.66666667  0.         -1.33333333]\n"
416 |      ]
417 |     }
418 |    ],
419 |    "source": [
420 |     "A_col_means = np.mean(A, axis=1)\n",
421 |     "print(A, \"=>\", A_col_means)"
422 |    ]
423 |   },
424 |   {
425 |    "cell_type": "markdown",
426 |    "metadata": {},
427 |    "source": [
428 |     "But the same operation will fail!"
429 |    ]
430 |   },
431 |   {
432 |    "cell_type": "code",
433 |    "execution_count": 14,
434 |    "metadata": {},
435 |    "outputs": [
436 |     {
437 |      "ename": "ValueError",
438 |      "evalue": "operands could not be broadcast together with shapes (4,3) (4,) ",
439 |      "output_type": "error",
440 |      "traceback": [
441 |       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
442 |       "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
443 |       "\u001b[0;32m<ipython-input-14-318dabe58a82>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mA\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mA_col_means\u001b[0m \u001b[0;31m# Fails, throwing a `ValueError`\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
444 |       "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (4,3) (4,) "
445 |      ]
446 |     }
447 |    ],
448 |    "source": [
449 |     "A - A_col_means # Fails, throwing a `ValueError`"
450 |    ]
451 |   },
452 |   {
453 |    "cell_type": "markdown",
454 |    "metadata": {},
455 |    "source": [
456 |     "The error reports that these shapes are not compatible. So how can you fix it?\n",
457 |     "\n",
458 |     "**The broadcasting rule.** One way is to learn Numpy's convention for **[broadcasting](https://docs.scipy.org/doc/numpy/reference/ufuncs.html#broadcasting)**. Numpy starts by looking at the shapes of the objects:"
459 |    ]
460 |   },
461 |   {
462 |    "cell_type": "code",
463 |    "execution_count": 15,
464 |    "metadata": {},
465 |    "outputs": [
466 |     {
467 |      "name": "stdout",
468 |      "output_type": "stream",
469 |      "text": [
470 |       "(4, 3) (3,)\n"
471 |      ]
472 |     }
473 |    ],
474 |    "source": [
475 |     "print(A.shape, A_row_means.shape)"
476 |    ]
477 |   },
478 |   {
479 |    "cell_type": "markdown",
480 |    "metadata": {},
481 |    "source": [
482 |     "These are compatible if, starting from _right_ to _left_, the dimensions match **or** one of the dimensions is 1. This convention of moving from right to left is referred to as matching the _trailing dimensions_. In this example, the rightmost dimensions of each object are both 3, so they match. Since `A_row_means` has no more dimensions, it can be replicated to match the remaining dimensions of `A`."
483 |    ]
484 |   },
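The right-to-left matching rule just described can be sketched in a few lines of plain Python. The function name `broadcast_shape` is hypothetical and the code is only an illustration of the rule, not Numpy's actual implementation; the result is compared against the shape Numpy itself reports through `np.broadcast`.

```python
from itertools import zip_longest
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Walk the dimensions from right to left; a missing dimension acts like 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError("incompatible dimensions: {} vs {}".format(a, b))
    return tuple(reversed(result))

print(broadcast_shape((4, 3), (3,)))       # trailing dimensions match
print(broadcast_shape((2, 1, 3), (4, 1)))  # 1s are stretched on both sides
print(np.broadcast(np.empty((2, 1, 3)), np.empty((4, 1))).shape)
```

Calling `broadcast_shape((4, 3), (2,))` raises a `ValueError`, because the trailing dimensions 3 and 2 differ and neither is 1.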
485 |   {
486 |    "cell_type": "markdown",
487 |    "metadata": {},
488 |    "source": [
489 |     "By contrast, consider the shapes of `A` and `A_col_means`:"
490 |    ]
491 |   },
492 |   {
493 |    "cell_type": "code",
494 |    "execution_count": 16,
495 |    "metadata": {},
496 |    "outputs": [
497 |     {
498 |      "name": "stdout",
499 |      "output_type": "stream",
500 |      "text": [
501 |       "(4, 3) (4,)\n"
502 |      ]
503 |     }
504 |    ],
505 |    "source": [
506 |     "print(A.shape, A_col_means.shape)"
507 |    ]
508 |   },
509 |   {
510 |    "cell_type": "markdown",
511 |    "metadata": {},
512 |    "source": [
513 |     "In this case, per the broadcasting rule, the trailing dimensions of 3 and 4 do not match. Therefore, the broadcast rule fails. One way to get the desired behavior is to modify `A_col_means` to have a unit trailing dimension. In this case, you can use Numpy's [`reshape()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html) to convert `A_col_means` into a shape that has an explicit trailing dimension of size 1."
514 |    ]
515 |   },
516 |   {
517 |    "cell_type": "code",
518 |    "execution_count": 17,
519 |    "metadata": {},
520 |    "outputs": [
521 |     {
522 |      "name": "stdout",
523 |      "output_type": "stream",
524 |      "text": [
525 |       "[[-2.66666667]\n",
526 |       " [-4.66666667]\n",
527 |       " [ 0.        ]\n",
528 |       " [-1.33333333]] => (4, 1)\n"
529 |      ]
530 |     }
531 |    ],
532 |    "source": [
533 |     "A_col_means2 = np.reshape(A_col_means, (len(A_col_means), 1))\n",
534 |     "print(A_col_means2, \"=>\", A_col_means2.shape)"
535 |    ]
536 |   },
537 |   {
538 |    "cell_type": "markdown",
539 |    "metadata": {},
540 |    "source": [
541 |     "Now the trailing dimension equals 1, so it can be matched against the trailing dimension of `A`. The next dimension is the same between the two objects, so Numpy knows it can replicate accordingly."
542 |    ]
543 |   },
544 |   {
545 |    "cell_type": "code",
546 |    "execution_count": 18,
547 |    "metadata": {},
548 |    "outputs": [
549 |     {
550 |      "name": "stdout",
551 |      "output_type": "stream",
552 |      "text": [
553 |       "[[  2 -10   0]\n",
554 |       " [ -9   4  -9]\n",
555 |       " [ -1   8  -7]\n",
556 |       " [ -5  -3   4]] \n",
557 |       "- [[-2.66666667]\n",
558 |       " [-4.66666667]\n",
559 |       " [ 0.        ]\n",
560 |       " [-1.33333333]]\n",
561 |       "=>\n",
562 |       " [[ 4.66666667 -7.33333333  2.66666667]\n",
563 |       " [-4.33333333  8.66666667 -4.33333333]\n",
564 |       " [-1.          8.         -7.        ]\n",
565 |       " [-3.66666667 -1.66666667  5.33333333]]\n"
566 |      ]
567 |     }
568 |    ],
569 |    "source": [
570 |     "print(A, \"\\n-\", A_col_means2)\n",
571 |     "print(\"=>\\n\", A - A_col_means2)"
572 |    ]
573 |   },
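As an alternative to `np.reshape`, Numpy offers two shorter ways to obtain the unit trailing dimension: the `keepdims=True` argument of the aggregation functions, and indexing with `np.newaxis`. The sketch below re-creates the matrix `A` from above so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

A = np.array([[ 2, -10,  0],
              [-9,   4, -9],
              [-1,   8, -7],
              [-5,  -3,  4]])

# keepdims=True leaves the reduced axis in place with size 1: shape (4, 1).
means_keep = np.mean(A, axis=1, keepdims=True)

# Indexing with np.newaxis appends a unit axis to the (4,) result: shape (4, 1).
means_new = np.mean(A, axis=1)[:, np.newaxis]

centered = A - means_keep  # broadcasts (4, 3) against (4, 1)
print(means_keep.shape, means_new.shape)
print(centered.mean(axis=1))  # each entry is zero up to rounding
```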
574 |   {
575 |    "cell_type": "markdown",
576 |    "metadata": {
577 |     "collapsed": true
578 |    },
579 |    "source": [
580 |     "**Fin!** That marks the end of this notebook. If you want to learn more, check out the second edition of [Python for Data Analysis](http://shop.oreilly.com/product/0636920050896.do) (released in October 2017)."
581 |    ]
582 |   }
583 |  ],
584 |  "metadata": {
585 |   "kernelspec": {
586 |    "display_name": "Python 3",
587 |    "language": "python",
588 |    "name": "python3"
589 |   },
590 |   "language_info": {
591 |    "codemirror_mode": {
592 |     "name": "ipython",
593 |     "version": 3
594 |    },
595 |    "file_extension": ".py",
596 |    "mimetype": "text/x-python",
597 |    "name": "python",
598 |    "nbconvert_exporter": "python",
599 |    "pygments_lexer": "ipython3",
600 |    "version": "3.5.2"
601 |   }
602 |  },
603 |  "nbformat": 4,
604 |  "nbformat_minor": 2
605 | }
606 | 


--------------------------------------------------------------------------------
/Notebook1_part2-more_exercises.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "nbgrader": {
  7 |      "grade": false,
  8 |      "grade_id": "cell-3b25f2b6cfc80b65",
  9 |      "locked": true,
 10 |      "schema_version": 1,
 11 |      "solution": false
 12 |     }
 13 |    },
 14 |    "source": [
 15 |     "# Python review: More exercises\n",
 16 |     "\n",
 17 |     "This notebook continues the review of Python basics based on [Chris Simpkins's](https://www.cc.gatech.edu/~simpkins/) [Python Bootcamp](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/syllabus.html).\n",
 18 |     "\n",
 19 |     "This particular notebook adapts the exercises that appeared with the [\"Functional Programming\" slides](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/slides/functional-programming.html) of the Fall 2016 offering."
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {
 25 |     "nbgrader": {
 26 |      "grade": false,
 27 |      "grade_id": "cell-f3331b5182117a1f",
 28 |      "locked": true,
 29 |      "schema_version": 1,
 30 |      "solution": false
 31 |     }
 32 |    },
 33 |    "source": [
 34 |     "Consider the following dataset of exam grades, organized as a 2-D table and stored in Python as a \"list of lists\" under the variable name, `grades`."
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": 3,
 40 |    "metadata": {
 41 |     "collapsed": true,
 42 |     "nbgrader": {
 43 |      "grade": false,
 44 |      "grade_id": "cell-9dc72b683a8858c7",
 45 |      "locked": true,
 46 |      "schema_version": 1,
 47 |      "solution": false
 48 |     }
 49 |    },
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "grades = [\n",
 53 |     "    # First line is descriptive header. Subsequent lines hold data\n",
 54 |     "    ['Student', 'Exam 1', 'Exam 2', 'Exam 3'],\n",
 55 |     "    ['Thorny', '100', '90', '80'],\n",
 56 |     "    ['Mac', '88', '99', '111'],\n",
 57 |     "    ['Farva', '45', '56', '67'],\n",
 58 |     "    ['Rabbit', '59', '61', '67'],\n",
 59 |     "    ['Ursula', '73', '79', '83'],\n",
 60 |     "    ['Foster', '89', '97', '101']\n",
 61 |     "]"
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {
 67 |     "nbgrader": {
 68 |      "grade": false,
 69 |      "grade_id": "cell-04082681e80572d5",
 70 |      "locked": true,
 71 |      "schema_version": 1,
 72 |      "solution": false
 73 |     }
 74 |    },
 75 |    "source": [
 76 |     "**Exercise 0** (`students_test`: 1 point). Write some code that computes a new list named `students[:]`, which holds the names of the students as they appear from \"top to bottom\" in the table."
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": 4,
 82 |    "metadata": {
 83 |     "nbgrader": {
 84 |      "grade": false,
 85 |      "grade_id": "students",
 86 |      "locked": false,
 87 |      "schema_version": 1,
 88 |      "solution": true
 89 |     }
 90 |    },
 91 |    "outputs": [
 92 |     {
 93 |      "data": {
 94 |       "text/plain": [
 95 |        "['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']"
 96 |       ]
 97 |      },
 98 |      "execution_count": 4,
 99 |      "metadata": {},
100 |      "output_type": "execute_result"
101 |     }
102 |    ],
103 |    "source": [
104 |     "#\n",
105 |     "# YOUR CODE HERE\n",
106 |     "#\n",
107 |     "students = [x[0] for x in grades if x[0] != 'Student']"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": 5,
113 |    "metadata": {
114 |     "nbgrader": {
115 |      "grade": true,
116 |      "grade_id": "students_test",
117 |      "locked": true,
118 |      "points": 1,
119 |      "schema_version": 1,
120 |      "solution": false
121 |     }
122 |    },
123 |    "outputs": [
124 |     {
125 |      "name": "stdout",
126 |      "output_type": "stream",
127 |      "text": [
128 |       "['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']\n",
129 |       "\n",
130 |       "(Passed!)\n"
131 |      ]
132 |     }
133 |    ],
134 |    "source": [
135 |     "# `students_test`: Test cell\n",
136 |     "print(students)\n",
137 |     "assert type(students) is list\n",
138 |     "assert students == ['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']\n",
139 |     "print(\"\\n(Passed!)\")"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "markdown",
144 |    "metadata": {
145 |     "nbgrader": {
146 |      "grade": false,
147 |      "grade_id": "cell-e5e0181d53efed56",
148 |      "locked": true,
149 |      "schema_version": 1,
150 |      "solution": false
151 |     }
152 |    },
153 |    "source": [
154 |     "**Exercise 1** (`assignments_test`: 1 point). Write some code to compute a new list named `assignments[:]`, to hold the names of the class assignments. (These appear in the descriptive header element of `grades`.)"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "code",
159 |    "execution_count": 6,
160 |    "metadata": {
161 |     "nbgrader": {
162 |      "grade": false,
163 |      "grade_id": "assignments",
164 |      "locked": false,
165 |      "schema_version": 1,
166 |      "solution": true
167 |     }
168 |    },
169 |    "outputs": [
170 |     {
171 |      "data": {
172 |       "text/plain": [
173 |        "['Exam 1', 'Exam 2', 'Exam 3']"
174 |       ]
175 |      },
176 |      "execution_count": 6,
177 |      "metadata": {},
178 |      "output_type": "execute_result"
179 |     }
180 |    ],
181 |    "source": [
182 |     "#\n",
183 |     "# YOUR CODE HERE\n",
184 |     "#\n",
185 |     "assignments = [x for x in grades[0] if x != 'Student']"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "code",
190 |    "execution_count": 7,
191 |    "metadata": {
192 |     "nbgrader": {
193 |      "grade": true,
194 |      "grade_id": "assignments_test",
195 |      "locked": true,
196 |      "points": 1,
197 |      "schema_version": 1,
198 |      "solution": false
199 |     }
200 |    },
201 |    "outputs": [
202 |     {
203 |      "name": "stdout",
204 |      "output_type": "stream",
205 |      "text": [
206 |       "['Exam 1', 'Exam 2', 'Exam 3']\n",
207 |       "\n",
208 |       "(Passed!)\n"
209 |      ]
210 |     }
211 |    ],
212 |    "source": [
213 |     "# `assignments_test`: Test cell\n",
214 |     "print(assignments)\n",
215 |     "assert type(assignments) is list\n",
216 |     "assert assignments == ['Exam 1', 'Exam 2', 'Exam 3']\n",
217 |     "print(\"\\n(Passed!)\")"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "markdown",
222 |    "metadata": {
223 |     "nbgrader": {
224 |      "grade": false,
225 |      "grade_id": "cell-1bd41417aad245fa",
226 |      "locked": true,
227 |      "schema_version": 1,
228 |      "solution": false
229 |     }
230 |    },
231 |    "source": [
232 |     "**Exercise 2** (`grade_lists_test`: 1 point). Write some code to compute a new _dictionary_, named `grade_lists`, that maps names of students to _lists_ of their exam grades. The grades should be converted from strings to integers. For instance, `grade_lists['Thorny'] == [100, 90, 80]`."
233 |    ]
234 |   },
235 |   {
236 |    "cell_type": "code",
237 |    "execution_count": 13,
238 |    "metadata": {
239 |     "nbgrader": {
240 |      "grade": false,
241 |      "grade_id": "grade_lists",
242 |      "locked": false,
243 |      "schema_version": 1,
244 |      "solution": true
245 |     }
246 |    },
247 |    "outputs": [],
248 |    "source": [
249 |     "# Create a dict mapping names to lists of grades.\n",
250 |     "#\n",
251 |     "# YOUR CODE HERE\n",
252 |     "#\n",
253 |     "grade_lists = {x[0]: [int(x[1]), int(x[2]), int(x[3])] for x in grades if x[0] != 'Student'}\n"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "code",
258 |    "execution_count": 14,
259 |    "metadata": {
260 |     "nbgrader": {
261 |      "grade": true,
262 |      "grade_id": "grade_lists_test",
263 |      "locked": true,
264 |      "points": 1,
265 |      "schema_version": 1,
266 |      "solution": false
267 |     }
268 |    },
269 |    "outputs": [
270 |     {
271 |      "name": "stdout",
272 |      "output_type": "stream",
273 |      "text": [
274 |       "{'Foster': [89, 97, 101], 'Ursula': [73, 79, 83], 'Mac': [88, 99, 111], 'Farva': [45, 56, 67], 'Rabbit': [59, 61, 67], 'Thorny': [100, 90, 80]}\n",
275 |       "\n",
276 |       "(Passed!)\n"
277 |      ]
278 |     }
279 |    ],
280 |    "source": [
281 |     "# `grade_lists_test`: Test cell\n",
282 |     "print(grade_lists)\n",
283 |     "assert type(grade_lists) is dict, \"Did not create a dictionary.\"\n",
284 |     "assert len(grade_lists) == len(grades)-1, \"Dictionary has the wrong number of entries.\"\n",
285 |     "assert {'Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster'} == set(grade_lists.keys()), \"Dictionary has the wrong keys.\"\n",
286 |     "assert grade_lists['Thorny'] == [100, 90, 80], 'Wrong grades for: Thorny'\n",
287 |     "assert grade_lists['Mac'] == [88, 99, 111], 'Wrong grades for: Mac'\n",
288 |     "assert grade_lists['Farva'] == [45, 56, 67], 'Wrong grades for: Farva'\n",
289 |     "assert grade_lists['Rabbit'] == [59, 61, 67], 'Wrong grades for: Rabbit'\n",
290 |     "assert grade_lists['Ursula'] == [73, 79, 83], 'Wrong grades for: Ursula'\n",
291 |     "assert grade_lists['Foster'] == [89, 97, 101], 'Wrong grades for: Foster'\n",
292 |     "print(\"\\n(Passed!)\")"
293 |    ]
294 |   },
295 |   {
296 |    "cell_type": "markdown",
297 |    "metadata": {
298 |     "nbgrader": {
299 |      "grade": false,
300 |      "grade_id": "cell-a628c6c0f63e7e7c",
301 |      "locked": true,
302 |      "schema_version": 1,
303 |      "solution": false
304 |     }
305 |    },
306 |    "source": [
307 |     "**Exercise 3** (`grade_dicts_test`: 2 points). Write some code to compute a new dictionary, `grade_dicts`, that maps names of students to _dictionaries_ containing their scores. Each entry of this scores dictionary should be keyed on assignment name and hold the corresponding grade as an integer. For instance, `grade_dicts['Thorny']['Exam 1'] == 100`."
308 |    ]
309 |   },
310 |   {
311 |    "cell_type": "code",
312 |    "execution_count": 25,
313 |    "metadata": {
314 |     "nbgrader": {
315 |      "grade": false,
316 |      "grade_id": "grade_dicts",
317 |      "locked": false,
318 |      "schema_version": 1,
319 |      "solution": true
320 |     }
321 |    },
322 |    "outputs": [],
323 |    "source": [
324 |     "# Create a dict mapping names to dictionaries of grades.\n",
325 |     "#\n",
326 |     "# YOUR CODE HERE\n",
327 |     "#\n",
328 |     "grade_dicts = {}\n",
329 |     "for key, value in grade_lists.items():\n",
330 |     "    grade_dicts[key] = {x:y for x,y in zip(assignments, value)}\n"
331 |    ]
332 |   },
333 |   {
334 |    "cell_type": "code",
335 |    "execution_count": 26,
336 |    "metadata": {
337 |     "nbgrader": {
338 |      "grade": true,
339 |      "grade_id": "grade_dicts_test",
340 |      "locked": true,
341 |      "points": 2,
342 |      "schema_version": 1,
343 |      "solution": false
344 |     }
345 |    },
346 |    "outputs": [
347 |     {
348 |      "name": "stdout",
349 |      "output_type": "stream",
350 |      "text": [
351 |       "{'Ursula': {'Exam 2': 79, 'Exam 3': 83, 'Exam 1': 73}, 'Thorny': {'Exam 2': 90, 'Exam 3': 80, 'Exam 1': 100}, 'Mac': {'Exam 2': 99, 'Exam 3': 111, 'Exam 1': 88}, 'Farva': {'Exam 2': 56, 'Exam 3': 67, 'Exam 1': 45}, 'Rabbit': {'Exam 2': 61, 'Exam 3': 67, 'Exam 1': 59}, 'Foster': {'Exam 2': 97, 'Exam 3': 101, 'Exam 1': 89}}\n",
352 |       "\n",
353 |       "(Passed!)\n"
354 |      ]
355 |     }
356 |    ],
357 |    "source": [
358 |     "# `grade_dicts_test`: Test cell\n",
359 |     "print(grade_dicts)\n",
360 |     "assert type(grade_dicts) is dict, \"Did not create a dictionary.\"\n",
361 |     "assert len(grade_dicts) == len(grades)-1, \"Dictionary has the wrong number of entries.\"\n",
362 |     "assert {'Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster'} == set(grade_dicts.keys()), \"Dictionary has the wrong keys.\"\n",
363 |     "assert grade_dicts['Foster']['Exam 1'] == 89, 'Wrong score'\n",
364 |     "assert grade_dicts['Foster']['Exam 3'] == 101, 'Wrong score'\n",
365 |     "assert grade_dicts['Foster']['Exam 2'] == 97, 'Wrong score'\n",
366 |     "assert grade_dicts['Ursula']['Exam 1'] == 73, 'Wrong score'\n",
367 |     "assert grade_dicts['Ursula']['Exam 3'] == 83, 'Wrong score'\n",
368 |     "assert grade_dicts['Ursula']['Exam 2'] == 79, 'Wrong score'\n",
369 |     "assert grade_dicts['Rabbit']['Exam 1'] == 59, 'Wrong score'\n",
370 |     "assert grade_dicts['Rabbit']['Exam 3'] == 67, 'Wrong score'\n",
371 |     "assert grade_dicts['Rabbit']['Exam 2'] == 61, 'Wrong score'\n",
372 |     "assert grade_dicts['Mac']['Exam 1'] == 88, 'Wrong score'\n",
373 |     "assert grade_dicts['Mac']['Exam 3'] == 111, 'Wrong score'\n",
374 |     "assert grade_dicts['Mac']['Exam 2'] == 99, 'Wrong score'\n",
375 |     "assert grade_dicts['Farva']['Exam 1'] == 45, 'Wrong score'\n",
376 |     "assert grade_dicts['Farva']['Exam 3'] == 67, 'Wrong score'\n",
377 |     "assert grade_dicts['Farva']['Exam 2'] == 56, 'Wrong score'\n",
378 |     "assert grade_dicts['Thorny']['Exam 1'] == 100, 'Wrong score'\n",
379 |     "assert grade_dicts['Thorny']['Exam 3'] == 80, 'Wrong score'\n",
380 |     "assert grade_dicts['Thorny']['Exam 2'] == 90, 'Wrong score'\n",
381 |     "print(\"\\n(Passed!)\")"
382 |    ]
383 |   },
384 |   {
385 |    "cell_type": "markdown",
386 |    "metadata": {
387 |     "nbgrader": {
388 |      "grade": false,
389 |      "grade_id": "cell-840a57a4b61944e5",
390 |      "locked": true,
391 |      "schema_version": 1,
392 |      "solution": false
393 |     }
394 |    },
395 |    "source": [
396 |     "**Exercise 4** (`avg_grades_by_student_test`: 1 point). Write some code to compute a dictionary named `avg_grades_by_student` that maps each student to his or her average exam score. For instance, `avg_grades_by_student['Thorny'] == 90`.\n",
397 |     "\n",
398 |     "> **Hint.** The [`statistics`](https://docs.python.org/3.5/library/statistics.html) module of Python has at least one helpful function."
399 |    ]
400 |   },
401 |   {
402 |    "cell_type": "code",
403 |    "execution_count": 31,
404 |    "metadata": {
405 |     "nbgrader": {
406 |      "grade": false,
407 |      "grade_id": "avg_grades_by_student",
408 |      "locked": false,
409 |      "schema_version": 1,
410 |      "solution": true
411 |     }
412 |    },
413 |    "outputs": [],
414 |    "source": [
415 |     "# Create a dict mapping names to grade averages.\n",
416 |     "#\n",
417 |     "# YOUR CODE HERE\n",
418 |     "#\n",
419 |     "avg_grades_by_student = {name: sum(scores)/len(scores) for name, scores in grade_lists.items()}\n"
420 |    ]
421 |   },
422 |   {
423 |    "cell_type": "code",
424 |    "execution_count": 30,
425 |    "metadata": {
426 |     "nbgrader": {
427 |      "grade": true,
428 |      "grade_id": "avg_grades_by_student_test",
429 |      "locked": true,
430 |      "points": 1,
431 |      "schema_version": 1,
432 |      "solution": false
433 |     }
434 |    },
435 |    "outputs": [
436 |     {
437 |      "name": "stdout",
438 |      "output_type": "stream",
439 |      "text": [
440 |       "{'Ursula': 78.33333333333333, 'Thorny': 90.0, 'Mac': 99.33333333333333, 'Farva': 56.0, 'Rabbit': 62.333333333333336, 'Foster': 95.66666666666667}\n",
441 |       "\n",
442 |       "(Passed!)\n"
443 |      ]
444 |     }
445 |    ],
446 |    "source": [
447 |     "# `avg_grades_by_student_test`: Test cell\n",
448 |     "print(avg_grades_by_student)\n",
449 |     "assert type(avg_grades_by_student) is dict, \"Did not create a dictionary.\"\n",
450 |     "assert len(avg_grades_by_student) == len(students), \"Output has the wrong number of students.\"\n",
451 |     "assert abs(avg_grades_by_student['Mac'] - 99.33333333333333) <= 4e-15, 'Mean is incorrect'\n",
452 |     "assert abs(avg_grades_by_student['Foster'] - 95.66666666666667) <= 4e-15, 'Mean is incorrect'\n",
453 |     "assert abs(avg_grades_by_student['Farva'] - 56) <= 4e-15, 'Mean is incorrect'\n",
454 |     "assert abs(avg_grades_by_student['Rabbit'] - 62.333333333333336) <= 4e-15, 'Mean is incorrect'\n",
455 |     "assert abs(avg_grades_by_student['Thorny'] - 90) <= 4e-15, 'Mean is incorrect'\n",
456 |     "assert abs(avg_grades_by_student['Ursula'] - 78.33333333333333) <= 4e-15, 'Mean is incorrect'\n",
457 |     "print(\"\\n(Passed!)\")"
458 |    ]
459 |   },
460 |   {
461 |    "cell_type": "markdown",
462 |    "metadata": {
463 |     "nbgrader": {
464 |      "grade": false,
465 |      "grade_id": "cell-3f31ab810dcb86d1",
466 |      "locked": true,
467 |      "schema_version": 1,
468 |      "solution": false
469 |     }
470 |    },
471 |    "source": [
472 |     "**Exercise 5** (`grades_by_assignment_test`: 2 points). Write some code to compute a dictionary named `grades_by_assignment`, whose keys are assignment (exam) names and whose values are lists of scores over all students on that assignment. For instance, `grades_by_assignment['Exam 1'] == [100, 88, 45, 59, 73, 89]`."
473 |    ]
474 |   },
475 |   {
476 |    "cell_type": "code",
477 |    "execution_count": 38,
478 |    "metadata": {
479 |     "nbgrader": {
480 |      "grade": false,
481 |      "grade_id": "grades_by_assignment",
482 |      "locked": false,
483 |      "schema_version": 1,
484 |      "solution": true
485 |     }
486 |    },
487 |    "outputs": [],
488 |    "source": [
489 |     "#\n",
490 |     "# YOUR CODE HERE\n",
491 |     "#\n",
492 |     "grades_by_assignment = {}\n",
493 |     "\n",
494 |     "for i, assignment in enumerate(assignments, start=1):  # column i holds this exam's scores\n",
495 |     "    grades_by_assignment[assignment] = [int(row[i]) for row in grades[1:]]  # skip the header row"
496 |    ]
497 |   },
498 |   {
499 |    "cell_type": "code",
500 |    "execution_count": 39,
501 |    "metadata": {
502 |     "nbgrader": {
503 |      "grade": true,
504 |      "grade_id": "grades_by_assignment_test",
505 |      "locked": true,
506 |      "points": 2,
507 |      "schema_version": 1,
508 |      "solution": false
509 |     }
510 |    },
511 |    "outputs": [
512 |     {
513 |      "name": "stdout",
514 |      "output_type": "stream",
515 |      "text": [
516 |       "{'Exam 2': [90, 99, 56, 61, 79, 97], 'Exam 3': [80, 111, 67, 67, 83, 101], 'Exam 1': [100, 88, 45, 59, 73, 89]}\n",
517 |       "\n",
518 |       "(Passed!)\n"
519 |      ]
520 |     }
521 |    ],
522 |    "source": [
523 |     "# `grades_by_assignment_test`: Test cell\n",
524 |     "print(grades_by_assignment)\n",
525 |     "assert type(grades_by_assignment) is dict, \"Output is not a dictionary.\"\n",
526 |     "assert len(grades_by_assignment) == 3, \"Wrong number of assignments.\"\n",
527 |     "assert grades_by_assignment['Exam 1'] == [100, 88, 45, 59, 73, 89], 'Wrong grades list'\n",
528 |     "assert grades_by_assignment['Exam 3'] == [80, 111, 67, 67, 83, 101], 'Wrong grades list'\n",
529 |     "assert grades_by_assignment['Exam 2'] == [90, 99, 56, 61, 79, 97], 'Wrong grades list'\n",
530 |     "print(\"\\n(Passed!)\")"
531 |    ]
532 |   },
533 |   {
534 |    "cell_type": "markdown",
535 |    "metadata": {
536 |     "nbgrader": {
537 |      "grade": false,
538 |      "grade_id": "cell-d763d8a25d8cac78",
539 |      "locked": true,
540 |      "schema_version": 1,
541 |      "solution": false
542 |     }
543 |    },
544 |    "source": [
545 |     "**Exercise 6** (`avg_grades_by_assignment_test`: 1 point). Write some code to compute a dictionary, `avg_grades_by_assignment`, which maps each exam to its average score."
546 |    ]
547 |   },
548 |   {
549 |    "cell_type": "code",
550 |    "execution_count": 41,
551 |    "metadata": {
552 |     "nbgrader": {
553 |      "grade": false,
554 |      "grade_id": "avg_grades_by_assignment",
555 |      "locked": false,
556 |      "schema_version": 1,
557 |      "solution": true
558 |     }
559 |    },
560 |    "outputs": [],
561 |    "source": [
562 |     "# Create a dict mapping items to average for that item across all students.\n",
563 |     "#\n",
564 |     "# YOUR CODE HERE\n",
565 |     "#\n",
566 |     "avg_grades_by_assignment = {exam: sum(scores)/len(scores) for exam, scores in grades_by_assignment.items()}"
567 |    ]
568 |   },
569 |   {
570 |    "cell_type": "code",
571 |    "execution_count": 42,
572 |    "metadata": {
573 |     "nbgrader": {
574 |      "grade": true,
575 |      "grade_id": "avg_grades_by_assignment_test",
576 |      "locked": true,
577 |      "points": 1,
578 |      "schema_version": 1,
579 |      "solution": false
580 |     }
581 |    },
582 |    "outputs": [
583 |     {
584 |      "name": "stdout",
585 |      "output_type": "stream",
586 |      "text": [
587 |       "{'Exam 2': 80.33333333333333, 'Exam 3': 84.83333333333333, 'Exam 1': 75.66666666666667}\n",
588 |       "\n",
589 |       "(Passed!)\n"
590 |      ]
591 |     }
592 |    ],
593 |    "source": [
594 |     "# `avg_grades_by_assignment_test`: Test cell\n",
595 |     "print(avg_grades_by_assignment)\n",
596 |     "assert type(avg_grades_by_assignment) is dict\n",
597 |     "assert len(avg_grades_by_assignment) == 3\n",
598 |     "assert abs((100+88+45+59+73+89)/6 - avg_grades_by_assignment['Exam 1']) <= 7e-15\n",
599 |     "assert abs((80+111+67+67+83+101)/6 - avg_grades_by_assignment['Exam 3']) <= 7e-15\n",
600 |     "assert abs((90+99+56+61+79+97)/6 - avg_grades_by_assignment['Exam 2']) <= 7e-15\n",
601 |     "print(\"\\n(Passed!)\")"
602 |    ]
603 |   },
604 |   {
605 |    "cell_type": "markdown",
606 |    "metadata": {
607 |     "nbgrader": {
608 |      "grade": false,
609 |      "grade_id": "cell-7d85977d9fab2482",
610 |      "locked": true,
611 |      "schema_version": 1,
612 |      "solution": false
613 |     }
614 |    },
615 |    "source": [
616 |     "**Exercise 7** (`rank_test`: 2 points). Write some code to create a new list, `rank`, which contains the names of students in order by _decreasing_ score. That is, `rank[0]` should contain the name of the top student (highest average exam score), and `rank[-1]` should have the name of the bottom student (lowest average exam score)."
617 |    ]
618 |   },
619 |   {
620 |    "cell_type": "code",
621 |    "execution_count": 48,
622 |    "metadata": {
623 |     "nbgrader": {
624 |      "grade": false,
625 |      "grade_id": "rank",
626 |      "locked": false,
627 |      "schema_version": 1,
628 |      "solution": true
629 |     }
630 |    },
631 |    "outputs": [],
632 |    "source": [
633 |     "#\n",
634 |     "# YOUR CODE HERE\n",
635 |     "#\n",
636 |     "rank_scores = sorted([(y,x) for x,y in avg_grades_by_student.items()], reverse=True)\n",
637 |     "rank = [x[1] for x in rank_scores]"
638 |    ]
639 |   },
640 |   {
641 |    "cell_type": "code",
642 |    "execution_count": 49,
643 |    "metadata": {
644 |     "nbgrader": {
645 |      "grade": true,
646 |      "grade_id": "rank_test",
647 |      "locked": true,
648 |      "points": 2,
649 |      "schema_version": 1,
650 |      "solution": false
651 |     }
652 |    },
653 |    "outputs": [
654 |     {
655 |      "name": "stdout",
656 |      "output_type": "stream",
657 |      "text": [
658 |       "['Mac', 'Foster', 'Thorny', 'Ursula', 'Rabbit', 'Farva']\n",
659 |       "\n",
660 |       "=== Ranking ===\n",
661 |       "1. Mac: 99.33333333333333\n",
662 |       "2. Foster: 95.66666666666667\n",
663 |       "3. Thorny: 90.0\n",
664 |       "4. Ursula: 78.33333333333333\n",
665 |       "5. Rabbit: 62.333333333333336\n",
666 |       "6. Farva: 56.0\n",
667 |       "\n",
668 |       "(Passed!)\n"
669 |      ]
670 |     }
671 |    ],
672 |    "source": [
673 |     "# `rank_test`: Test cell\n",
674 |     "print(rank)\n",
675 |     "print(\"\\n=== Ranking ===\")\n",
676 |     "for i, s in enumerate(rank):\n",
677 |     "    print(\"{}. {}: {}\".format(i+1, s, avg_grades_by_student[s]))\n",
678 |     "    \n",
679 |     "assert rank == ['Mac', 'Foster', 'Thorny', 'Ursula', 'Rabbit', 'Farva']\n",
680 |     "for i in range(len(rank)-1):\n",
681 |     "    assert avg_grades_by_student[rank[i]] >= avg_grades_by_student[rank[i+1]]\n",
682 |     "print(\"\\n(Passed!)\")"
683 |    ]
684 |   },
685 |   {
686 |    "cell_type": "code",
687 |    "execution_count": null,
688 |    "metadata": {
689 |     "collapsed": true
690 |    },
691 |    "outputs": [],
692 |    "source": []
693 |   }
694 |  ],
695 |  "metadata": {
696 |   "celltoolbar": "Create Assignment",
697 |   "kernelspec": {
698 |    "display_name": "Python 3",
699 |    "language": "python",
700 |    "name": "python3"
701 |   },
702 |   "language_info": {
703 |    "codemirror_mode": {
704 |     "name": "ipython",
705 |     "version": 3
706 |    },
707 |    "file_extension": ".py",
708 |    "mimetype": "text/x-python",
709 |    "name": "python",
710 |    "nbconvert_exporter": "python",
711 |    "pygments_lexer": "ipython3",
712 |    "version": "3.5.2"
713 |   }
714 |  },
715 |  "nbformat": 4,
716 |  "nbformat_minor": 1
717 | }
718 | 


--------------------------------------------------------------------------------
/Notebook5_part2_RegExYelp.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "nbgrader": {
  7 |      "grade": false,
  8 |      "grade_id": "cell-81740ad10bcffdd8",
  9 |      "locked": true,
 10 |      "schema_version": 1,
 11 |      "solution": false
 12 |     }
 13 |    },
 14 |    "source": [
 15 |     "# Part 1 of 2: Processing an HTML file\n",
 16 |     "\n",
 17 |     "One of the richest sources of information is [the Web](http://www.computerhistory.org/revolution/networking/19/314)! In this notebook, we ask you to use string processing and regular expressions to mine a web page, which is stored in HTML format."
 18 |    ]
 19 |   },
 20 |   {
 21 |    "cell_type": "markdown",
 22 |    "metadata": {
 23 |     "nbgrader": {
 24 |      "grade": false,
 25 |      "grade_id": "cell-e1821fbeefa0e2c2",
 26 |      "locked": true,
 27 |      "schema_version": 1,
 28 |      "solution": false
 29 |     }
 30 |    },
 31 |    "source": [
 32 |     "**The data: Yelp! reviews.** The data you will work with is a snapshot of a recent search on the [Yelp! site](https://yelp.com) for the best fried chicken restaurants in Atlanta. That snapshot is hosted here: https://cse6040.gatech.edu/datasets/yelp-example\n",
 33 |     "\n",
 34 |     "If you go ahead and open that site, you'll see that it contains a ranked list of places:\n",
 35 |     "\n",
 36 |     "![Top 10 Fried Chicken Spots in ATL as of September 12, 2017](https://cse6040.gatech.edu/datasets/yelp-example/ranked-list-snapshot.png)"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "markdown",
 41 |    "metadata": {
 42 |     "nbgrader": {
 43 |      "grade": false,
 44 |      "grade_id": "cell-fe765896f1d25066",
 45 |      "locked": true,
 46 |      "schema_version": 1,
 47 |      "solution": false
 48 |     }
 49 |    },
 50 |    "source": [
 51 |     "**Your task.** In this part of this assignment, we'd like you to write some code to extract this list."
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "markdown",
 56 |    "metadata": {
 57 |     "nbgrader": {
 58 |      "grade": false,
 59 |      "grade_id": "cell-95c9a0ef4d1838e1",
 60 |      "locked": true,
 61 |      "schema_version": 1,
 62 |      "solution": false
 63 |     }
 64 |    },
 65 |    "source": [
 66 |     "## Getting the data\n",
 67 |     "\n",
 68 |     "First things first: you need an HTML file. The following Python code will download a particular web page that we've prepared for this exercise and store it locally in a file.\n",
 69 |     "\n",
 70 |     "> If the file exists, this command will not overwrite it. By not doing so, we can reduce accesses to the server that hosts the file. Also, if an error occurs during the download, this cell may report that the downloaded file is corrupt; in that case, you should try re-running the cell."
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": 3,
 76 |    "metadata": {
 77 |     "nbgrader": {
 78 |      "grade": false,
 79 |      "grade_id": "cell-af1ae6df64a1fd40",
 80 |      "locked": true,
 81 |      "schema_version": 1,
 82 |      "solution": false
 83 |     }
 84 |    },
 85 |    "outputs": [
 86 |     {
 87 |      "ename": "UnicodeDecodeError",
 88 |      "evalue": "'charmap' codec can't decode byte 0x81 in position 711138: character maps to <undefined>",
 89 |      "output_type": "error",
 90 |      "traceback": [
 91 |       "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
 92 |       "\u001b[1;31mUnicodeDecodeError\u001b[0m                        Traceback (most recent call last)",
 93 |       "\u001b[1;32m<ipython-input-3-868d042d28fa>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m     15\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     16\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'yelp.htm'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'r'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 17\u001b[1;33m     \u001b[0myelp_html\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     18\u001b[0m     \u001b[0mchecksum\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mhashlib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmd5\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0myelp_html\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhexdigest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     19\u001b[0m     \u001b[1;32massert\u001b[0m \u001b[0mchecksum\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m\"4a74a0ee9cefee773e76a22a52d45a8e\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"Downloaded file has incorrect checksum!\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
 94 |       "\u001b[1;32mC:\\Anaconda3\\lib\\encodings\\cp1252.py\u001b[0m in \u001b[0;36mdecode\u001b[1;34m(self, input, final)\u001b[0m\n\u001b[0;32m     21\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     22\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 23\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcharmap_decode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mdecoding_table\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     24\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     25\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mStreamWriter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCodec\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mStreamWriter\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
 95 |       "\u001b[1;31mUnicodeDecodeError\u001b[0m: 'charmap' codec can't decode byte 0x81 in position 711138: character maps to <undefined>"
 96 |      ]
 97 |     }
 98 |    ],
 99 |    "source": [
100 |     "import requests\n",
101 |     "import os\n",
102 |     "import hashlib\n",
103 |     "\n",
104 |     "if os.path.exists('.voc'):\n",
105 |     "    data_url = 'https://cse6040.gatech.edu/datasets/yelp-example/yelp.htm'\n",
106 |     "else:\n",
107 |     "    data_url = 'https://github.com/cse6040/labs-fa17/raw/master/datasets/yelp.htm'\n",
108 |     "\n",
109 |     "if not os.path.exists('yelp.htm'):\n",
110 |     "    print(\"Downloading: {} ...\".format(data_url))\n",
111 |     "    r = requests.get(data_url)\n",
112 |     "    with open('yelp.htm', 'w', encoding=r.encoding) as f:\n",
113 |     "        f.write(r.text)\n",
114 |     "\n",
115 |     "with open('yelp.htm', 'r', encoding='utf-8') as f:\n",
116 |     "    yelp_html = f.read().encode(encoding='utf-8')\n",
117 |     "    checksum = hashlib.md5(yelp_html).hexdigest()\n",
118 |     "    assert checksum == \"4a74a0ee9cefee773e76a22a52d45a8e\", \"Downloaded file has incorrect checksum!\"\n",
119 |     "    \n",
120 |     "print(\"'yelp.htm' is ready!\")"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "markdown",
125 |    "metadata": {
126 |     "nbgrader": {
127 |      "grade": false,
128 |      "grade_id": "cell-afee39f0b7aee426",
129 |      "locked": true,
130 |      "schema_version": 1,
131 |      "solution": false
132 |     }
133 |    },
134 |    "source": [
135 |     "**Viewing the raw HTML in your web browser.** The file you just downloaded is the raw HTML version of the data described previously. Before moving on, you should go back to that site and use your web browser to view the HTML source for the web page. Do that now to get an idea of what is in that file.\n",
136 |     "\n",
137 |     "> If you don't know how to view the page source in your browser, try the instructions on [this site](http://www.wikihow.com/View-Source-Code)."
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "markdown",
142 |    "metadata": {
143 |     "nbgrader": {
144 |      "grade": false,
145 |      "grade_id": "cell-993d633285178cf8",
146 |      "locked": true,
147 |      "schema_version": 1,
148 |      "solution": false
149 |     }
150 |    },
151 |    "source": [
152 |     "**Reading the HTML file into a Python string.** Let's also open the file in Python and read its contents into a string named `yelp_html`."
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": 3,
158 |    "metadata": {},
159 |    "outputs": [
160 |     {
161 |      "name": "stdout",
162 |      "output_type": "stream",
163 |      "text": [
164 |       "*** type(yelp_html) == <class 'str'> ***\n",
165 |       "*** Contents (first 1000 characters) ***\n",
166 |       "<!DOCTYPE html>\n",
167 |       "<!-- saved from url=(0079)https://www.yelp.com/search?find_desc=fried+chicken&find_loc=Atlanta%2C+GA&ns=1 -->\n",
168 |       "<html xmlns:fb=\"http://www.facebook.com/2008/fbml\" class=\"js gr__yelp_com\" lang=\"en\"><!--<![endif]--><head data-component-bound=\"true\"><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"><link type=\"text/css\" rel=\"stylesheet\" href=\"./Best Fried chicken in Atlanta, GA - Yelp_files/css\"><style type=\"text/css\">.gm-style .gm-style-cc span,.gm-style .gm-style-cc a,.gm-style .gm-style-mtc div{font-size:10px}\n",
169 |       "</style><style type=\"text/css\">@media print {  .gm-style .gmnoprint, .gmnoprint {    display:none  }}@media screen {  .gm-style .gmnoscreen, .gmnoscreen {    display:none  }}</style><style type=\"text/css\">.gm-style-pbc{transition:opacity ease-in-out;background-color:rgba(0,0,0,0.45);text-align:center}.gm-style-pbt{font-size:22px;color:white;font-family:Roboto,Arial,sans-serif;position:relative;margin:0;top:50%;-webkit-transform:translateY(-50%);-ms- ...\n"
170 |      ]
171 |     }
172 |    ],
173 |    "source": [
174 |     "with open('yelp.htm', encoding='utf-8') as yelp_file:\n",
175 |     "    yelp_html = yelp_file.read()\n",
176 |     "    \n",
177 |     "# Print first few hundred characters of this string:\n",
178 |     "print(\"*** type(yelp_html) == {} ***\".format(type(yelp_html)))\n",
179 |     "n = 1000\n",
180 |     "print(\"*** Contents (first {} characters) ***\\n{} ...\".format(n, yelp_html[:n]))"
181 |    ]
182 |   },
183 |   {
184 |    "cell_type": "markdown",
185 |    "metadata": {
186 |     "nbgrader": {
187 |      "grade": false,
188 |      "grade_id": "cell-02895e5c5a7d18be",
189 |      "locked": true,
190 |      "schema_version": 1,
191 |      "solution": false
192 |     }
193 |    },
194 |    "source": [
195 |     "Oy, what a mess! It would be great to have some code that reads and processes the information contained within this file."
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "markdown",
200 |    "metadata": {
201 |     "nbgrader": {
202 |      "grade": false,
203 |      "grade_id": "cell-6481539b4054dbde",
204 |      "locked": true,
205 |      "schema_version": 1,
206 |      "solution": false
207 |     }
208 |    },
209 |    "source": [
210 |     "## Exercise (5 points): Extracting the ranking\n",
211 |     "\n",
212 |     "Write some Python code to create a variable named `rankings`, which is a list of dictionaries set up as follows:\n",
213 |     "\n",
214 |     "* `rankings[i]` is a dictionary corresponding to the restaurant whose rank is `i+1`. For example, from the screenshot above, `rankings[0]` should be a dictionary with information about Gus's World Famous Fried Chicken.\n",
215 |     "* Each dictionary, `rankings[i]`, should have these keys:\n",
216 |     "    * `rankings[i]['name']`: The name of the restaurant, a string.\n",
217 |     "    * `rankings[i]['stars']`: The star rating, as a string, e.g., `'4.5'`, `'4.0'`.\n",
218 |     "    * `rankings[i]['numrevs']`: The number of reviews, as an **integer.**\n",
219 |     "    * `rankings[i]['price']`: The price range, as dollar signs, e.g., `'$'`, `'$$'`, `'$$$'`, or `'$$$$'`.\n",
220 |     "    \n",
221 |     "Of course, since the current topic is regular expressions, you might try to apply them (possibly combined with other string manipulation methods) to find the particular patterns that yield the desired information."
222 |    ]
223 |   },
224 |   {
225 |    "cell_type": "code",
226 |    "execution_count": 69,
227 |    "metadata": {
228 |     "nbgrader": {
229 |      "grade": false,
230 |      "grade_id": "rankings",
231 |      "locked": false,
232 |      "schema_version": 1,
233 |      "solution": true
234 |     },
235 |     "scrolled": true
236 |    },
237 |    "outputs": [
238 |     {
239 |      "data": {
240 |       "text/plain": [
241 |        "['Gus’s World Famous Fried Chicken',\n",
242 |        " 'South City Kitchen - Midtown\\n\\n\\n\\n',\n",
243 |        " 'Mary Mac’s Tea Room\\n\\n\\n\\n\\n\\n\\n\\n     ',\n",
244 |        " 'Busy Bee Cafe\\n\\n\\n\\n\\n\\n\\n\\n           ',\n",
245 |        " 'Richards’ Southern Fried\\n\\n\\n\\n\\n\\n\\n\\n',\n",
246 |        " 'Greens & Gravy\\n\\n\\n\\n\\n\\n\\n\\n          ',\n",
247 |        " 'Colonnade Restaurant\\n\\n\\n\\n\\n\\n\\n\\n    ',\n",
248 |        " 'South City Kitchen Buckhead\\n\\n\\n\\n\\n',\n",
249 |        " 'Poor Calvin’s\\n\\n\\n\\n\\n\\n\\n\\n           ',\n",
250 |        " ' Rock’s Chicken & Fries\\n\\n\\n\\n\\n\\n\\n\\n ']"
251 |       ]
252 |      },
253 |      "execution_count": 69,
254 |      "metadata": {},
255 |      "output_type": "execute_result"
256 |     }
257 |    ],
258 |    "source": [
259 |     "from bs4 import BeautifulSoup\n",
260 |     "import re\n",
261 |     "\n",
262 |     "yelp_soup = BeautifulSoup(yelp_html, 'html.parser')\n",
263 |     "soup_li = yelp_soup.find_all('li', attrs={'class':\"regular-search-result\"})\n",
264 |     "\n",
265 |     "names_raw = []\n",
266 |     "reviews_raw = []\n",
267 |     "\n",
268 |     "for result in soup_li:\n",
269 |     "    # Slice fixed character offsets; fragile, since it assumes this page's exact layout\n",
270 |     "    names_raw.append(result.get_text()[25:57])\n",
271 |     "    reviews_raw.append(result.get_text()[57:100])\n",
272 |     "\n",
273 |     "names_raw\n",
274 |     "#reviews_raw"
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "code",
279 |    "execution_count": 70,
280 |    "metadata": {},
281 |    "outputs": [
282 |     {
283 |      "data": {
284 |       "text/plain": [
285 |        "['Gus’s World Famous Fried Chicken',\n",
286 |        " 'South City Kitchen - Midtown',\n",
287 |        " 'Mary Mac’s Tea Room',\n",
288 |        " 'Busy Bee Cafe',\n",
289 |        " 'Richards’ Southern Fried',\n",
290 |        " 'Greens &amp; Gravy',\n",
291 |        " 'Colonnade Restaurant',\n",
292 |        " 'South City Kitchen Buckhead',\n",
293 |        " 'Poor Calvin’s',\n",
294 |        " 'Rock’s Chicken &amp; Fries']"
295 |       ]
296 |      },
297 |      "execution_count": 70,
298 |      "metadata": {},
299 |      "output_type": "execute_result"
300 |     }
301 |    ],
302 |    "source": [
303 |     "names = []\n",
304 |     "name_matcher = re.compile('.*')\n",
305 |     "for name in names_raw:\n",
306 |     "    name_match = name_matcher.search(name)\n",
307 |     "    names.append(name_match.group().strip())\n",
308 |     "\n",
309 |     "names = [line.replace('&','&amp;') for line in names]\n",
310 |     "names"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "code",
315 |    "execution_count": 71,
316 |    "metadata": {},
317 |    "outputs": [
318 |     {
319 |      "data": {
320 |       "text/plain": [
321 |        "[549, 1777, 2241, 481, 108, 93, 350, 248, 1558, 67]"
322 |       ]
323 |      },
324 |      "execution_count": 71,
325 |      "metadata": {},
326 |      "output_type": "execute_result"
327 |     }
328 |    ],
329 |    "source": [
330 |     "reviews = []\n",
331 |     "cost = []\n",
332 |     "review_matcher = re.compile(r'[\\d]+')\n",
333 |     "cost_matcher = re.compile(r'[\\$]+')\n",
334 |     "\n",
335 |     "for review in reviews_raw:\n",
336 |     "    review_match = review_matcher.search(review)\n",
337 |     "    reviews.append(int(review_match.group()))\n",
338 |     "    cost_match = cost_matcher.search(review)\n",
339 |     "    cost.append(cost_match.group())\n",
340 |     "reviews\n",
341 |     "#cost"
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": 72,
347 |    "metadata": {},
348 |    "outputs": [
349 |     {
350 |      "data": {
351 |       "text/plain": [
352 |        "[\"Gus's World Famous Fried Chicken\",\n",
353 |        " '4.0 star rating',\n",
354 |        " 'V D.',\n",
355 |        " 'South City Kitchen - Midtown',\n",
356 |        " '4.5 star rating',\n",
357 |        " 'Tori P.',\n",
358 |        " \"Mary Mac's Tea Room\",\n",
359 |        " '4.0 star rating',\n",
360 |        " 'Monique V.',\n",
361 |        " 'Busy Bee Cafe',\n",
362 |        " '4.0 star rating',\n",
363 |        " 'Joe G.',\n",
364 |        " \"Richards' Southern Fried\",\n",
365 |        " '4.0 star rating',\n",
366 |        " 'Kurtis K.',\n",
367 |        " 'Greens & Gravy',\n",
368 |        " '3.5 star rating',\n",
369 |        " 'Tammy J.',\n",
370 |        " 'Colonnade Restaurant',\n",
371 |        " '4.0 star rating',\n",
372 |        " 'Peter S.',\n",
373 |        " 'South City Kitchen Buckhead',\n",
374 |        " '4.5 star rating',\n",
375 |        " 'T. M.',\n",
376 |        " \"Poor Calvin's\",\n",
377 |        " '4.5 star rating',\n",
378 |        " 'Monique V.',\n",
379 |        " \"Rock's Chicken & Fries\",\n",
380 |        " '4.0 star rating',\n",
381 |        " 'Sabri3l A.']"
382 |       ]
383 |      },
384 |      "execution_count": 72,
385 |      "metadata": {},
386 |      "output_type": "execute_result"
387 |     }
388 |    ],
389 |    "source": [
390 |     "rates_raw = []\n",
391 |     "for result in soup_li:\n",
392 |     "    soup_img = result.find_all('img')\n",
393 |     "    for image in soup_img:\n",
394 |     "        rates_raw.append(image.get('alt', ''))\n",
395 |     "\n",
396 |     "rates_raw"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "code",
401 |    "execution_count": 73,
402 |    "metadata": {},
403 |    "outputs": [
404 |     {
405 |      "data": {
406 |       "text/plain": [
407 |        "['4.0', '4.5', '4.0', '4.0', '4.0', '3.5', '4.0', '4.5', '4.5', '4.0']"
408 |       ]
409 |      },
410 |      "execution_count": 73,
411 |      "metadata": {},
412 |      "output_type": "execute_result"
413 |     }
414 |    ],
415 |    "source": [
416 |     "rates = []\n",
417 |     "rate_matcher = re.compile(r'[\\d\\.]+')\n",
418 |     "for item in rates_raw:\n",
419 |     "    if item[:1].isnumeric():  # slice, not index, so an empty alt string is skipped\n",
420 |     "        rate_match = rate_matcher.search(item)\n",
421 |     "        rates.append(rate_match.group())\n",
422 |     "rates"
423 |    ]
424 |   },
425 |   {
426 |    "cell_type": "code",
427 |    "execution_count": 74,
428 |    "metadata": {},
429 |    "outputs": [
430 |     {
431 |      "data": {
432 |       "text/plain": [
433 |        "[{'name': 'Gus’s World Famous Fried Chicken',\n",
434 |        "  'numrevs': 549,\n",
435 |        "  'price': '$$',\n",
436 |        "  'stars': '4.0'},\n",
437 |        " {'name': 'South City Kitchen - Midtown',\n",
438 |        "  'numrevs': 1777,\n",
439 |        "  'price': '$$',\n",
440 |        "  'stars': '4.5'},\n",
441 |        " {'name': 'Mary Mac’s Tea Room',\n",
442 |        "  'numrevs': 2241,\n",
443 |        "  'price': '$$',\n",
444 |        "  'stars': '4.0'},\n",
445 |        " {'name': 'Busy Bee Cafe', 'numrevs': 481, 'price': '$$', 'stars': '4.0'},\n",
446 |        " {'name': 'Richards’ Southern Fried',\n",
447 |        "  'numrevs': 108,\n",
448 |        "  'price': '$$',\n",
449 |        "  'stars': '4.0'},\n",
450 |        " {'name': 'Greens &amp; Gravy', 'numrevs': 93, 'price': '$$', 'stars': '3.5'},\n",
451 |        " {'name': 'Colonnade Restaurant',\n",
452 |        "  'numrevs': 350,\n",
453 |        "  'price': '$$',\n",
454 |        "  'stars': '4.0'},\n",
455 |        " {'name': 'South City Kitchen Buckhead',\n",
456 |        "  'numrevs': 248,\n",
457 |        "  'price': '$$',\n",
458 |        "  'stars': '4.5'},\n",
459 |        " {'name': 'Poor Calvin’s', 'numrevs': 1558, 'price': '$$', 'stars': '4.5'},\n",
460 |        " {'name': 'Rock’s Chicken &amp; Fries',\n",
461 |        "  'numrevs': 67,\n",
462 |        "  'price': '$',\n",
463 |        "  'stars': '4.0'}]"
464 |       ]
465 |      },
466 |      "execution_count": 74,
467 |      "metadata": {},
468 |      "output_type": "execute_result"
469 |     }
470 |    ],
471 |    "source": [
472 |     "rankings = [{'name': a, 'stars': b, 'numrevs': c, 'price': d} for a, b, c, d in zip(names, rates, reviews, cost)]\n",
473 |     "rankings"
474 |    ]
475 |   },
476 |   {
477 |    "cell_type": "code",
478 |    "execution_count": 75,
479 |    "metadata": {
480 |     "nbgrader": {
481 |      "grade": true,
482 |      "grade_id": "rankings_test",
483 |      "locked": true,
484 |      "points": 5,
485 |      "schema_version": 1,
486 |      "solution": false
487 |     }
488 |    },
489 |    "outputs": [
490 |     {
491 |      "name": "stdout",
492 |      "output_type": "stream",
493 |      "text": [
494 |       "=== Rankings ===\n",
495 |       "1. Gus’s World Famous Fried Chicken ($$): 4.0 stars based on 549 reviews\n",
496 |       "2. South City Kitchen - Midtown ($$): 4.5 stars based on 1777 reviews\n",
497 |       "3. Mary Mac’s Tea Room ($$): 4.0 stars based on 2241 reviews\n",
498 |       "4. Busy Bee Cafe ($$): 4.0 stars based on 481 reviews\n",
499 |       "5. Richards’ Southern Fried ($$): 4.0 stars based on 108 reviews\n",
500 |       "6. Greens &amp; Gravy ($$): 3.5 stars based on 93 reviews\n",
501 |       "7. Colonnade Restaurant ($$): 4.0 stars based on 350 reviews\n",
502 |       "8. South City Kitchen Buckhead ($$): 4.5 stars based on 248 reviews\n",
503 |       "9. Poor Calvin’s ($$): 4.5 stars based on 1558 reviews\n",
504 |       "10. Rock’s Chicken &amp; Fries ($): 4.0 stars based on 67 reviews\n",
505 |       "\n",
506 |       "(Passed!)\n"
507 |      ]
508 |     }
509 |    ],
510 |    "source": [
511 |     "# Test cell: `rankings_test`\n",
512 |     "\n",
513 |     "assert type(rankings) is list, \"`rankings` must be a list\"\n",
514 |     "assert all([type(r) is dict for r in rankings]), \"All `rankings[i]` must be dictionaries\"\n",
515 |     "\n",
516 |     "print(\"=== Rankings ===\")\n",
517 |     "for i, r in enumerate(rankings):\n",
518 |     "    print(\"{}. {} ({}): {} stars based on {} reviews\".format(i+1,\n",
519 |     "                                                             r['name'],\n",
520 |     "                                                             r['price'],\n",
521 |     "                                                             r['stars'],\n",
522 |     "                                                             r['numrevs']))\n",
523 |     "\n",
524 |     "assert rankings[0] == {'numrevs': 549, 'name': 'Gus’s World Famous Fried Chicken', 'stars': '4.0', 'price': '$$'}\n",
525 |     "assert rankings[1] == {'numrevs': 1777, 'name': 'South City Kitchen - Midtown', 'stars': '4.5', 'price': '$$'}\n",
526 |     "assert rankings[2] == {'numrevs': 2241, 'name': 'Mary Mac’s Tea Room', 'stars': '4.0', 'price': '$$'}\n",
527 |     "assert rankings[3] == {'numrevs': 481, 'name': 'Busy Bee Cafe', 'stars': '4.0', 'price': '$$'}\n",
528 |     "assert rankings[4] == {'numrevs': 108, 'name': 'Richards’ Southern Fried', 'stars': '4.0', 'price': '$$'}\n",
529 |     "assert rankings[5] == {'numrevs': 93, 'name': 'Greens &amp; Gravy', 'stars': '3.5', 'price': '$$'}\n",
530 |     "assert rankings[6] == {'numrevs': 350, 'name': 'Colonnade Restaurant', 'stars': '4.0', 'price': '$$'}\n",
531 |     "assert rankings[7] == {'numrevs': 248, 'name': 'South City Kitchen Buckhead', 'stars': '4.5', 'price': '$$'}\n",
532 |     "assert rankings[8] == {'numrevs': 1558, 'name': 'Poor Calvin’s', 'stars': '4.5', 'price': '$$'}\n",
533 |     "assert rankings[9] == {'numrevs': 67, 'name': 'Rock’s Chicken &amp; Fries', 'stars': '4.0', 'price': '$'}\n",
534 |     "\n",
535 |     "print(\"\\n(Passed!)\")"
536 |    ]
537 |   },
538 |   {
539 |    "cell_type": "markdown",
540 |    "metadata": {
541 |     "collapsed": true,
542 |     "nbgrader": {
543 |      "grade": false,
544 |      "grade_id": "cell-b3bde66e454dc063",
545 |      "locked": true,
546 |      "schema_version": 1,
547 |      "solution": false
548 |     }
549 |    },
550 |    "source": [
551 |     "**Fin!** This cell marks the end of Part 1. Don't forget to save, restart and rerun all cells, and submit it. When you are done, proceed to Part 2."
552 |    ]
553 |   }
554 |  ],
555 |  "metadata": {
556 |   "celltoolbar": "Create Assignment",
557 |   "kernelspec": {
558 |    "display_name": "Python 3",
559 |    "language": "python",
560 |    "name": "python3"
561 |   },
562 |   "language_info": {
563 |    "codemirror_mode": {
564 |     "name": "ipython",
565 |     "version": 3
566 |    },
567 |    "file_extension": ".py",
568 |    "mimetype": "text/x-python",
569 |    "name": "python",
570 |    "nbconvert_exporter": "python",
571 |    "pygments_lexer": "ipython3",
572 |    "version": "3.6.2"
573 |   }
574 |  },
575 |  "nbformat": 4,
576 |  "nbformat_minor": 1
577 | }
578 | 
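
The `rankings` exercise in the notebook above relies on fixed character offsets to separate names from review counts. As a hedged alternative, the sketch below assembles the same dictionary layout purely from regular expressions. The `sample` records, the `build_rankings` helper, and the exact shape of the raw strings are assumptions made for illustration; they are not taken from the actual Yelp page.

```python
import re

# Patterns for the three fields that need extracting.
REVIEW_RE = re.compile(r'(\d+)\s+reviews?')   # e.g. "481 reviews"
PRICE_RE = re.compile(r'\$+')                 # e.g. "$$"
STARS_RE = re.compile(r'\d\.\d')              # e.g. "4.0"

def build_rankings(records):
    """Build a list of dicts from (name, stars, details) raw strings.

    Records missing any of the three fields are skipped rather than
    raising an exception.
    """
    rankings = []
    for name_raw, stars_raw, details_raw in records:
        reviews = REVIEW_RE.search(details_raw)
        price = PRICE_RE.search(details_raw)
        stars = STARS_RE.search(stars_raw)
        if not (reviews and price and stars):
            continue
        rankings.append({
            'name': name_raw.strip(),
            'stars': stars.group(),              # kept as a string
            'numrevs': int(reviews.group(1)),    # converted to an integer
            'price': price.group(),
        })
    return rankings

# Hypothetical raw strings, loosely modeled on the notebook's output.
sample = [
    ("Busy Bee Cafe\n\n\n   ", "4.0 star rating", "\n 481 reviews\n  $$ Southern"),
    ("Colonnade Restaurant\n", "4.0 star rating", " 350 reviews $$"),
    ("Unknown", "Joe G.", "no data"),  # skipped: no rating, count, or price
]
rankings_demo = build_rankings(sample)
print(rankings_demo)
```

Matching on the word `reviews` and on runs of `$` avoids depending on where each field happens to fall in the text, at the cost of assuming those tokens always appear.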


--------------------------------------------------------------------------------
/Notebook5_part3_RegEx_hard.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Part 2 of 2 (OPTIONAL): An extreme case of regular expression processing\n",
  8 |     "\n",
  9 |     "> This part is **OPTIONAL**. That is, while there are exercises, they are worth 0 points each. Rather, this notebook is designed for those of you who may have a deeper interest in computational aspects of the course material and would like to explore that.\n",
 10 |     "\n",
 11 |     "There is a beautiful theory underlying regular expressions, and efficient regular expression processing is regarded as one of the classic problems of computer science. In the last part of this lab, you will explore a bit of that theory, albeit by experiment.\n",
 12 |     "\n",
 13 |     "In particular, the code cells below will walk you through a simple example of the potentially **hidden cost** of regular expression parsing. And if you really want to geek out, look at the article from which this example is taken: https://swtch.com/~rsc/regexp/regexp1.html"
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "## Quick review\n",
 21 |     "\n",
 22 |     "**Exercise 0** (ungraded) Let $a^n$ be a shorthand notation for a string in which $a$ is repeated $n$ times. For example, $a^3$ is the same as $aaa$ and $a^6$ is the same as $aaaaaa$. Write a function to generate the string for $a^n$, given a string $a$ and an integer $n \\geq 1$."
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": null,
 28 |    "metadata": {
 29 |     "collapsed": true,
 30 |     "nbgrader": {
 31 |      "grade": false,
 32 |      "grade_id": "rep_str",
 33 |      "locked": false,
 34 |      "schema_version": 1,
 35 |      "solution": true
 36 |     }
 37 |    },
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "def rep_str (a, n):\n",
 41 |     "    \"\"\"Returns a string consisting of an input string repeated a given number of times.\"\"\"\n",
 42 |     "    assert type(a) is str and n >= 1\n",
 43 |     "#\n",
 44 |     "# YOUR CODE HERE\n",
 45 |     "#\n"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": null,
 51 |    "metadata": {
 52 |     "nbgrader": {
 53 |      "grade": true,
 54 |      "grade_id": "rep_str_test",
 55 |      "locked": true,
 56 |      "points": 0,
 57 |      "schema_version": 1,
 58 |      "solution": false
 59 |     }
 60 |    },
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "# Test cell: `rep_str_test`\n",
 64 |     "\n",
 65 |     "def check_fixed(a, n, ans):\n",
 66 |     "    msg = \"Testing: '{}'^{} -> '{}'\".format(a, n, ans)\n",
 67 |     "    print(msg)\n",
 68 |     "    assert rep_str(a, n) == ans, \"Case failed!\"\n",
 69 |     "    \n",
 70 |     "check_fixed('a', 3, 'aaa')\n",
 71 |     "check_fixed('cat', 4, 'catcatcatcat')\n",
 72 |     "check_fixed('', 100, '')\n",
 73 |     "\n",
 74 |     "def check_rand():\n",
 75 |     "    from random import choice, randint\n",
 76 |     "    a = ''.join([choice([chr(k) for k in range(ord('a'), ord('z')+1)]) for _ in range(randint(1, 5))])\n",
 77 |     "    n = randint(1, 10)\n",
 78 |     "    msg = \"Testing: '{}'^{}\".format(a, n)\n",
 79 |     "    print(msg)\n",
 80 |     "    s_you = rep_str(a, n)\n",
 81 |     "    for k in range(0, n*len(a), len(a)):\n",
 82 |     "        assert s_you[k:(k+len(a))] == a, \"Your result, '{}', is not correct at position {} [{}].\".format(s_you, k, s_you[k:(k+len(a))])\n",
 83 |     "    \n",
 84 |     "for _ in range(10):\n",
 85 |     "    check_rand()\n",
 86 |     "\n",
 87 |     "print(\"\\n(Passed!)\")"
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "markdown",
 92 |    "metadata": {},
 93 |    "source": [
 94 |     "## An initial experiment\n",
 95 |     "\n",
 96 |     "Intuitively, you should expect (or hope) that the time to determine whether a string of length $n$ matches a given pattern will be proportional to $n$. Let's see if this holds when matching simple input strings of repeated letters against a pattern designed to match such strings."
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "code",
101 |    "execution_count": null,
102 |    "metadata": {
103 |     "collapsed": true
104 |    },
105 |    "outputs": [],
106 |    "source": [
107 |     "import re"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": null,
113 |    "metadata": {},
114 |    "outputs": [],
115 |    "source": [
116 |     "# Set up an input problem\n",
117 |     "n = 3\n",
118 |     "s_n = rep_str ('a', n) # Input string\n",
119 |     "pattern = '^a{%d}$' % n # Pattern to match it exactly\n",
120 |     "\n",
121 |     "# Test it\n",
122 |     "print (\"Matching input '{}' against pattern '{}'...\".format (s_n, pattern))\n",
123 |     "assert re.match (pattern, s_n) is not None\n",
124 |     "\n",
125 |     "# Benchmark it & report time, normalized to 'n'\n",
126 |     "timing = %timeit -q -o re.match (pattern, s_n)\n",
127 |     "t_avg = sum (timing.all_runs) / len (timing.all_runs) / timing.loops / n * 1e9\n",
128 |     "print (\"Average time per match per `n`: {:.1f} ns\".format (t_avg))"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "markdown",
133 |    "metadata": {},
134 |    "source": [
135 |     "Before moving on, be sure you understand what the above benchmark is doing. For more on the Jupyter \"magic\" command, `%timeit`, see: http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=magic#magic-magic"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "markdown",
140 |    "metadata": {},
141 |    "source": [
142 |     "**Exercise 1** (ungraded) Repeat the above experiment for various values of `n`. To help keep track of the results, feel free to create new code cells that repeat the benchmark for different values of `n`. Explain what you observe. Can you conclude that matching simple regular expression patterns of the form `^a{n}$` against input strings of the form $a^n$ does, indeed, scale linearly?"
143 |    ]
144 |   },
145 |   {
146 |    "cell_type": "code",
147 |    "execution_count": null,
148 |    "metadata": {
149 |     "nbgrader": {
150 |      "grade": false,
151 |      "grade_id": "experiment1",
152 |      "locked": false,
153 |      "schema_version": 1,
154 |      "solution": true
155 |     }
156 |    },
157 |    "outputs": [],
158 |    "source": [
159 |     "# Use this code cell (and others, if you wish) to set up an experiment\n",
160 |     "# to test whether matching simple patterns behaves at worst linearly\n",
161 |     "# in the length of the input.\n",
162 |     "\n",
163 |     "#\n",
164 |     "# YOUR CODE HERE\n",
165 |     "#\n"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "markdown",
170 |    "metadata": {
171 |     "nbgrader": {
172 |      "grade": true,
173 |      "grade_id": "results1",
174 |      "locked": false,
175 |      "points": 0,
176 |      "schema_version": 1,
177 |      "solution": true
178 |     }
179 |    },
180 |    "source": [
181 |     "**Answer.** To see asymptotically linear behavior, you'll need to try some fairly large values of $n$, e.g., a thousand, ten thousand, a hundred thousand, and a million."
182 |    ]
183 |   },
184 |   {
185 |    "cell_type": "markdown",
186 |    "metadata": {},
187 |    "source": [
188 |     "## A more complex pattern\n",
189 |     "\n",
190 |     "Consider a regular expression of the form:\n",
191 |     "\n",
192 |     "$$(a?)^n(a^n)$$\n",
193 |     "\n",
194 |     "For instance, when $n=3$, the regular expression pattern is `(a?){3}a{3}`, which is equivalent to `a?a?a?aaa`. Start by convincing yourself that an input string of the form,\n",
195 |     "\n",
196 |     "$$a^n = \\underbrace{aa\\cdots a}_{n \\mbox{ occurrences}}$$\n",
197 |     "\n",
198 |     "should match this pattern. Here is some code to set up an experiment to benchmark this case."
199 |    ]
200 |   },
201 |   {
202 |    "cell_type": "code",
203 |    "execution_count": null,
204 |    "metadata": {},
205 |    "outputs": [],
206 |    "source": [
207 |     "def setup_inputs(n):\n",
208 |     "    \"\"\"Sets up the 'complex pattern example' above.\"\"\"\n",
209 |     "    s_n = rep_str('a', n)\n",
210 |     "    p_n = \"^(a?){%d}(a{%d})$\" % (n, n)\n",
211 |     "    print (\"[n={}] Matching pattern '{}' against input '{}'...\".format(n, p_n, s_n))\n",
212 |     "    assert re.match(p_n, s_n) is not None\n",
213 |     "    return (p_n, s_n)\n",
214 |     "\n",
215 |     "n = 3\n",
216 |     "p_n, s_n = setup_inputs(n)\n",
217 |     "timing = %timeit -q -o re.match(p_n, s_n)\n",
218 |     "t_n = sum(timing.all_runs) / len(timing.all_runs) / timing.loops / n * 1e9\n",
219 |     "print (\"==> Time per run per `n`: {} ns\".format(t_n))"
220 |    ]
221 |   },
222 |   {
223 |    "cell_type": "markdown",
224 |    "metadata": {},
225 |    "source": [
226 |     "**Exercise 2** (ungraded) Repeat the above experiment but for different values of $n$, such as $n \\in \\{3, 6, 9, 12, 15, 18\\}$. As before, feel free to use the code cell below or make new code cells to contain the code for your experiments. Summarize what you observe. How does the execution time vary with $n$? Can you explain this behavior?"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": null,
232 |    "metadata": {
233 |     "nbgrader": {
234 |      "grade": false,
235 |      "grade_id": "experiment2",
236 |      "locked": false,
237 |      "schema_version": 1,
238 |      "solution": true
239 |     }
240 |    },
241 |    "outputs": [],
242 |    "source": [
243 |     "# Use this code cell (and others, if you wish) to set up an experiment\n",
244 |     "# to test how the time to match the more complex pattern grows with\n",
245 |     "# the length of the input.\n",
246 |     "\n",
247 |     "#\n",
248 |     "# YOUR CODE HERE\n",
249 |     "#\n"
250 |    ]
251 |   },
252 |   {
253 |    "cell_type": "markdown",
254 |    "metadata": {
255 |     "nbgrader": {
256 |      "grade": true,
257 |      "grade_id": "results2",
258 |      "locked": false,
259 |      "points": 0,
260 |      "schema_version": 1,
261 |      "solution": true
262 |     }
263 |    },
264 |    "source": [
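265 |     "**Answer.** Here, you should observe growth that is far faster than linear and closer to exponential than polynomial. Python's `re` module uses a backtracking matcher, which in the worst case explores on the order of $2^n$ ways of choosing which `a?` terms consume a character. Here are some results we collected, for instance.\n",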
266 |     "\n",
267 |     "|    n    |  t (ns)    |\n",
268 |     "|---------|------------|\n",
269 |     "|       3 | 945.8      |\n",
270 |     "|       6 | 1611.7     |\n",
271 |     "|       9 | 7040.1     |\n",
272 |     "|      12 | 41166.1    |\n",
273 |     "|      15 | 254927.4   |\n",
274 |     "|      18 | 1724843.9  |"
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "markdown",
279 |    "metadata": {
280 |     "collapsed": true
281 |    },
282 |    "source": [
283 |     "**Fin!** This cell marks the end of Part 2, which is the final part of this assignment. Don't forget to save, restart and rerun all cells, and submit it."
284 |    ]
285 |   }
286 |  ],
287 |  "metadata": {
288 |   "celltoolbar": "Create Assignment",
289 |   "kernelspec": {
290 |    "display_name": "Python 3",
291 |    "language": "python",
292 |    "name": "python3"
293 |   },
294 |   "language_info": {
295 |    "codemirror_mode": {
296 |     "name": "ipython",
297 |     "version": 3
298 |    },
299 |    "file_extension": ".py",
300 |    "mimetype": "text/x-python",
301 |    "name": "python",
302 |    "nbconvert_exporter": "python",
303 |    "pygments_lexer": "ipython3",
304 |    "version": "3.5.2"
305 |   }
306 |  },
307 |  "nbformat": 4,
308 |  "nbformat_minor": 1
309 | }
310 | 
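
The experiments in the notebook above use the `%timeit` magic, which is only available inside IPython or Jupyter. As a hedged sketch, the same measurement can be run as a plain script with the standard `timeit` module. The `rep_str` body shown here is one possible solution to Exercise 0, and `time_match` and `results` are hypothetical names introduced for illustration. Unlike the notebook, this version reports the time per match and does not divide by `n`.

```python
import re
import timeit

def rep_str(a, n):
    """Returns a string consisting of an input string repeated a given number of times."""
    assert type(a) is str and n >= 1
    return a * n  # string repetition

def time_match(n, number=20):
    """Average time in nanoseconds to match a^n against (a?){n}(a{n})."""
    s_n = rep_str('a', n)
    p_n = "^(a?){%d}(a{%d})$" % (n, n)
    assert re.match(p_n, s_n) is not None  # the input must match the pattern
    total = timeit.timeit(lambda: re.match(p_n, s_n), number=number)
    return total / number * 1e9

# Keep n small: the matching time grows very quickly with n.
results = {n: time_match(n) for n in (3, 6, 9, 12, 15)}
for n, t in results.items():
    print("n={:2d}: {:12.1f} ns per match".format(n, t))
```

Absolute timings depend on the machine and the Python version, so the interesting quantity is the ratio between successive rows rather than any single number.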


--------------------------------------------------------------------------------
/Notebook6_part2_BeautifulSoup.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "nbgrader": {
  7 |      "grade": false,
  8 |      "grade_id": "cell-c075516d369362c8",
  9 |      "locked": true,
 10 |      "schema_version": 1,
 11 |      "solution": false
 12 |     }
 13 |    },
 14 |    "source": [
 15 |     "## Part 1: Tools to process HTML\n",
 16 |     "\n",
 17 |     "In Part 0, you downloaded real web pages and manipulated them using \"conventional\" string processing tools, like `str` functions or regular expressions.\n",
 18 |     "\n",
 19 |     "However, web pages are stored in HTML (hypertext markup language), which is a highly structured format. As such, it makes sense to use specialized tools to understand and process its structure. That's the subject of this notebook."
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {
 25 |     "nbgrader": {
 26 |      "grade": false,
 27 |      "grade_id": "cell-43e7e5ddf6ea837d",
 28 |      "locked": true,
 29 |      "schema_version": 1,
 30 |      "solution": false
 31 |     }
 32 |    },
 33 |    "source": [
 34 |     "## Parsing HTML: The Beautiful Soup module\n",
 35 |     "\n",
 36 |     "One such package to help process HTML is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/). The following is a quick tutorial on how to use it.\n",
 37 |     "\n",
 38 |     "Any HTML document may be modeled as an object in computer science known as a [tree](https://en.wikipedia.org/wiki/Tree_(data_structure)):\n",
 39 |     "\n",
 40 |     "![HTML as a tree](./html-slide.png)\n",
 41 |     "\n",
 42 |     "There are different ways to define trees, but for our purposes, the following will be sufficient."
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "markdown",
 47 |    "metadata": {
 48 |     "nbgrader": {
 49 |      "grade": false,
 50 |      "grade_id": "cell-c52d068f57ed8895",
 51 |      "locked": true,
 52 |      "schema_version": 1,
 53 |      "solution": false
 54 |     }
 55 |    },
 56 |    "source": [
 57 |     "Consider a tree to be a collection of _nodes_, which are the labeled boxes in the figure, and _edges_, which are the line segments connecting nodes, with the following special structure.\n",
 58 |     "\n",
 59 |     "* The node at the top is called the _root_. Here, the root is labeled `html` and abstractly represents the entire HTML document.\n",
 60 |     "* Regard each edge as always \"pointing\" from the node at its top end to the node at its bottom end. For any edge, the node at its top end is the _parent_ and the node at the bottom end is a _child_. Like real families, a parent can be a child. For example, the node labeled `head` is the child of `html` and the parent of `meta`, `title`, and `style`.\n",
 61 |     "* A _descendant_ of a node $x$ is any node $y$ for which there is a path from $x$ going down to $y$. For example, the node labeled `6x span` is a descendant of the node `body`. All other nodes are descendants of the root.\n",
 62 |     "* Any node with _no_ descendants is a _leaf_.\n",
 63 |     "* Any node that is neither a root nor a leaf is an _internal node_.\n",
 64 |     "* There are no _cycles_. A cycle would be a loop. For instance, if you were to add an edge between the two lower rightmost nodes, both labeled `strong`, that would create a loop and the object would no longer be a tree.\n",
 65 |     "\n",
 66 |     "> For whatever reason, [computer scientists usually view trees upside down](https://www.quora.com/Why-are-trees-in-computer-science-generally-drawn-upside-down-from-how-trees-are-in-real-life), with the \"root\" at the top and the \"leaves\" at the bottom."
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {
 72 |     "nbgrader": {
 73 |      "grade": false,
 74 |      "grade_id": "cell-f50c89e450930a0f",
 75 |      "locked": true,
 76 |      "schema_version": 1,
 77 |      "solution": false
 78 |     }
 79 |    },
 80 |    "source": [
 81 |     "The Beautiful Soup package gives you a data structure for traversing this tree. For instance, consider an HTML file with the contents below, shown both as code and pictorially."
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": 1,
 87 |    "metadata": {
 88 |     "nbgrader": {
 89 |      "grade": false,
 90 |      "grade_id": "cell-3e6d9fce3f852df9",
 91 |      "locked": true,
 92 |      "schema_version": 1,
 93 |      "solution": false
 94 |     }
 95 |    },
 96 |    "outputs": [
 97 |     {
 98 |      "name": "stdout",
 99 |      "output_type": "stream",
100 |      "text": [
101 |       "\n",
102 |       "<html>\n",
103 |       "  <body>\n",
104 |       "    <p>First paragraph.</p>\n",
105 |       "    <p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>\n",
106 |       "    <p>Third paragraph.</p>\n",
107 |       "  </body>\n",
108 |       "</html>\n",
109 |       "\n"
110 |      ]
111 |     }
112 |    ],
113 |    "source": [
114 |     "some_page = \"\"\"\n",
115 |     "<html>\n",
116 |     "  <body>\n",
117 |     "    <p>First paragraph.</p>\n",
118 |     "    <p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>\n",
119 |     "    <p>Third paragraph.</p>\n",
120 |     "  </body>\n",
121 |     "</html>\n",
122 |     "\"\"\"\n",
123 |     "print(some_page)"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {
129 |     "nbgrader": {
130 |      "grade": false,
131 |      "grade_id": "cell-b10719fa97510cdf",
132 |      "locked": true,
133 |      "schema_version": 1,
134 |      "solution": false
135 |     }
136 |    },
137 |    "source": [
138 |     "![Two visual representations of `some_page`](./html-viz.png)"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "markdown",
143 |    "metadata": {
144 |     "nbgrader": {
145 |      "grade": false,
146 |      "grade_id": "cell-2ffdaae04974caa4",
147 |      "locked": true,
148 |      "schema_version": 1,
149 |      "solution": false
150 |     }
151 |    },
152 |    "source": [
153 |     "**Exercise 0.** Besides HTML files, what else have we seen in this class that could be represented by a tree? Briefly and roughly explain what and how."
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "markdown",
158 |    "metadata": {
159 |     "nbgrader": {
160 |      "grade": true,
161 |      "grade_id": "ex3",
162 |      "locked": false,
163 |      "points": 0,
164 |      "schema_version": 1,
165 |      "solution": true
166 |     }
167 |    },
168 |    "source": [
169 |     "**Answer.** One thing that has a natural tree representation is a Python program! For example, can you draw the following program as a tree?\n",
170 |     "\n",
171 |     "```python\n",
172 |     "import re\n",
173 |     "\n",
174 |     "def scan_lines(text, pattern):\n",
175 |     "    matches = []\n",
176 |     "    for line in text.split('\\n'):\n",
177 |     "        if re.search(pattern, line) is not None:\n",
178 |     "            matches.append(True)\n",
179 |     "        else:\n",
180 |     "            matches.append(False)\n",
181 |     "    return matches\n",
182 |     "```"
183 |    ]
184 |   },
185 |   {
186 |    "cell_type": "markdown",
187 |    "metadata": {
188 |     "nbgrader": {
189 |      "grade": false,
190 |      "grade_id": "cell-f7d60ea487aa01e5",
191 |      "locked": true,
192 |      "schema_version": 1,
193 |      "solution": false
194 |     }
195 |    },
196 |    "source": [
197 |     "## Using Beautiful Soup\n",
198 |     "\n",
199 |     "Here is how you might use Beautiful Soup to inspect the structure of `some_page`.\n",
200 |     "\n",
201 |     "Let's start by taking the contents of the page above (`some_page`) and asking Beautiful Soup to process it. Let's store the result in an object named `soup`, and then explore its contents:"
202 |    ]
203 |   },
204 |   {
205 |    "cell_type": "code",
206 |    "execution_count": 2,
207 |    "metadata": {
208 |     "nbgrader": {
209 |      "grade": false,
210 |      "grade_id": "cell-3e24c3ba0a76eef7",
211 |      "locked": true,
212 |      "schema_version": 1,
213 |      "solution": false
214 |     }
215 |    },
216 |    "outputs": [
217 |     {
218 |      "name": "stdout",
219 |      "output_type": "stream",
220 |      "text": [
221 |       "1. soup == <html>\n",
222 |       "<body>\n",
223 |       "<p>First paragraph.</p>\n",
224 |       "<p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>\n",
225 |       "<p>Third paragraph.</p>\n",
226 |       "</body>\n",
227 |       "</html>\n",
228 |       "\n",
229 |       "\n",
230 |       "2. soup.html == <html>\n",
231 |       "<body>\n",
232 |       "<p>First paragraph.</p>\n",
233 |       "<p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>\n",
234 |       "<p>Third paragraph.</p>\n",
235 |       "</body>\n",
236 |       "</html>\n",
237 |       "\n",
238 |       "3. soup.html.body == <body>\n",
239 |       "<p>First paragraph.</p>\n",
240 |       "<p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>\n",
241 |       "<p>Third paragraph.</p>\n",
242 |       "</body>\n",
243 |       "\n",
244 |       "4. soup.html.body.p == <p>First paragraph.</p>\n",
245 |       "\n",
246 |       "5. soup.html.body.contents == <class 'list'> :: ['\\n', <p>First paragraph.</p>, '\\n', <p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>, '\\n', <p>Third paragraph.</p>, '\\n']\n"
247 |      ]
248 |     }
249 |    ],
250 |    "source": [
251 |     "from bs4 import BeautifulSoup\n",
252 |     "\n",
253 |     "soup = BeautifulSoup(some_page, \"lxml\")\n",
254 |     "\n",
255 |     "print('1. soup ==', soup) # Print the HTML contents\n",
256 |     "print('\\n2. soup.html ==', soup.html) # Root of the tree\n",
257 |     "print('\\n3. soup.html.body ==', soup.html.body) # A child tag\n",
258 |     "print('\\n4. soup.html.body.p ==', soup.html.body.p) # Another child tag\n",
259 |     "print('\\n5. soup.html.body.contents ==', type(soup.html.body.contents), '::', soup.html.body.contents)"
260 |    ]
261 |   },
262 |   {
263 |    "cell_type": "markdown",
264 |    "metadata": {
265 |     "nbgrader": {
266 |      "grade": false,
267 |      "grade_id": "cell-b878deab62a98518",
268 |      "locked": true,
269 |      "schema_version": 1,
270 |      "solution": false
271 |     }
272 |    },
273 |    "source": [
274 |     "Observe that the `.` notation allows us to reference HTML tags---that is, the stuff enclosed in angle brackets in the original HTML, e.g., `<html> ... </html>`, `<body> ... </body>`---as they are nested. But in the case of the `<body> ... </body>` tag, there are multiple subtags. Evidently, `soup.html.body.contents` contains these, as a list, which we know how to manipulate."
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "code",
279 |    "execution_count": 4,
280 |    "metadata": {},
281 |    "outputs": [
282 |     {
283 |      "name": "stdout",
284 |      "output_type": "stream",
285 |      "text": [
286 |       "[   0] <class 'bs4.element.NavigableString'> \n",
287 |       "\t==> '\n",
288 |       "'\n",
289 |       "[   1] <class 'bs4.element.Tag'> \n",
290 |       "\t==> '<p>First paragraph.</p>'\n",
291 |       "[   2] <class 'bs4.element.NavigableString'> \n",
292 |       "\t==> '\n",
293 |       "'\n",
294 |       "[   3] <class 'bs4.element.Tag'> \n",
295 |       "\t==> '<p>Second paragraph, which links to the <a href=\"http://www.gatech.edu\">Georgia Tech website</a>.</p>'\n",
296 |       "[   4] <class 'bs4.element.NavigableString'> \n",
297 |       "\t==> '\n",
298 |       "'\n",
299 |       "[   5] <class 'bs4.element.Tag'> \n",
300 |       "\t==> '<p>Third paragraph.</p>'\n",
301 |       "[   6] <class 'bs4.element.NavigableString'> \n",
302 |       "\t==> '\n",
303 |       "'\n",
304 |       "['Second paragraph, which links to the ', <a href=\"http://www.gatech.edu\">Georgia Tech website</a>, '.']\n"
305 |      ]
306 |     }
307 |    ],
308 |    "source": [
309 |     "# Enumerate all children (tags and strings) of the <body> ... </body> tag:\n",
310 |     "for i, elem in enumerate (soup.html.body.contents):\n",
311 |     "    print (\"[{:4d}]\".format (i), type (elem), '\\n\\t==>', \"'{}'\".format (elem))\n",
312 |     "\n",
313 |     "# Reference one of these, element 3:\n",
314 |     "elem3 = soup.html.body.contents[3]\n",
315 |     "print(elem3.contents)"
316 |    ]
317 |   },
318 |   {
319 |    "cell_type": "markdown",
320 |    "metadata": {
321 |     "nbgrader": {
322 |      "grade": false,
323 |      "grade_id": "cell-334edd5f6b7fac7b",
324 |      "locked": true,
325 |      "schema_version": 1,
326 |      "solution": false
327 |     }
328 |    },
329 |    "source": [
330 |     "**Exercise 1.** Write a statement that navigates to the tag representing the GT website link. Store this resulting tag object in a variable called `link`."
331 |    ]
332 |   },
333 |   {
334 |    "cell_type": "code",
335 |    "execution_count": 41,
336 |    "metadata": {
337 |     "nbgrader": {
338 |      "grade": false,
339 |      "grade_id": "ex5",
340 |      "locked": false,
341 |      "schema_version": 1,
342 |      "solution": true
343 |     }
344 |    },
345 |    "outputs": [
346 |     {
347 |      "name": "stdout",
348 |      "output_type": "stream",
349 |      "text": [
350 |       "<a href=\"http://www.gatech.edu\">Georgia Tech website</a>\n"
351 |      ]
352 |     }
353 |    ],
354 |    "source": [
355 |     "# YOUR CODE HERE\n",
356 |     "soup = BeautifulSoup(some_page, 'lxml')\n",
357 |     "link = soup.html.body.find_all('p')[1].find('a')\n",
358 |     "print(link)"
359 |    ]
360 |   },
361 |   {
362 |    "cell_type": "code",
363 |    "execution_count": 42,
364 |    "metadata": {
365 |     "collapsed": true,
366 |     "nbgrader": {
367 |      "grade": true,
368 |      "grade_id": "ex5_test",
369 |      "locked": true,
370 |      "points": 0,
371 |      "schema_version": 1,
372 |      "solution": false
373 |     }
374 |    },
375 |    "outputs": [],
376 |    "source": [
377 |     "# Checks your link. Can you understand what it is doing?\n",
378 |     "import bs4\n",
379 |     "assert type(link) is bs4.element.Tag\n",
380 |     "assert link.name == 'a'\n",
381 |     "assert link['href'] == 'http://www.gatech.edu'\n",
382 |     "assert link.contents == ['Georgia Tech website']"
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "markdown",
387 |    "metadata": {
388 |     "nbgrader": {
389 |      "grade": false,
390 |      "grade_id": "cell-4b2368e9947f3054",
391 |      "locked": true,
392 |      "schema_version": 1,
393 |      "solution": false
394 |     }
395 |    },
396 |    "source": [
397 |     "### Other navigation tools\n",
398 |     "\n",
399 |     "This lab includes a static copy of the Yelp! results for a search of \"universities\" in ATL. Let's start by downloading this file."
400 |    ]
401 |   },
402 |   {
403 |    "cell_type": "code",
404 |    "execution_count": 64,
405 |    "metadata": {
406 |     "nbgrader": {
407 |      "grade": false,
408 |      "grade_id": "cell-6c5218127e70c014",
409 |      "locked": true,
410 |      "schema_version": 1,
411 |      "solution": false
412 |     }
413 |    },
414 |    "outputs": [
415 |     {
416 |      "ename": "UnicodeDecodeError",
417 |      "evalue": "'charmap' codec can't decode byte 0x81 in position 257112: character maps to <undefined>",
418 |      "output_type": "error",
419 |      "traceback": [
420 |       "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
421 |       "\u001b[1;31mUnicodeDecodeError\u001b[0m                        Traceback (most recent call last)",
422 |       "\u001b[1;32m<ipython-input-64-debd4db37969>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m     20\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     21\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0myelp_htm\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'r'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 22\u001b[1;33m     \u001b[0myelp_html\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     23\u001b[0m     \u001b[0mchecksum\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mhashlib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmd5\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0myelp_html\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhexdigest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     24\u001b[0m     \u001b[1;32massert\u001b[0m \u001b[0mchecksum\u001b[0m \u001b[1;33m==\u001b[0m \u001b[0myelp_htm_checksum\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"Downloaded file has incorrect checksum!\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
423 |       "\u001b[1;32mC:\\Anaconda3\\lib\\encodings\\cp1252.py\u001b[0m in \u001b[0;36mdecode\u001b[1;34m(self, input, final)\u001b[0m\n\u001b[0;32m     21\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     22\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 23\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcharmap_decode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mdecoding_table\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     24\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     25\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mStreamWriter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCodec\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mStreamWriter\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
424 |       "\u001b[1;31mUnicodeDecodeError\u001b[0m: 'charmap' codec can't decode byte 0x81 in position 257112: character maps to <undefined>"
425 |      ]
426 |     }
427 |    ],
428 |    "source": [
429 |     "# Run me: Code to download sample HTML file\n",
430 |     "\n",
431 |     "import requests\n",
432 |     "import os\n",
433 |     "import hashlib\n",
434 |     "\n",
435 |     "yelp_htm = 'yelp_atl_unies.html'\n",
436 |     "yelp_htm_checksum = 'a940e7cd0c8c408a5dd2098a87303afe'\n",
437 |     "\n",
438 |     "if os.path.exists('.voc'):\n",
439 |     "    data_url = 'https://cse6040.gatech.edu/datasets/yelp-example-uni/{}'.format(yelp_htm)\n",
440 |     "else:\n",
441 |     "    data_url = 'https://github.com/cse6040/labs-fa17/raw/master/datasets/{}'.format(yelp_htm)\n",
442 |     "\n",
443 |     "if not os.path.exists(yelp_htm):\n",
444 |     "    print(\"Downloading: {} ...\".format(data_url))\n",
445 |     "    r = requests.get(data_url)\n",
446 |     "    with open(yelp_htm, 'w', encoding=r.encoding) as f:\n",
447 |     "        f.write(r.text)\n",
448 |     "\n",
449 |     "with open(yelp_htm, 'r') as f:\n",
450 |     "    yelp_html = f.read().encode(encoding='utf-8')\n",
451 |     "    checksum = hashlib.md5(yelp_html).hexdigest()\n",
452 |     "    assert checksum == yelp_htm_checksum, \"Downloaded file has incorrect checksum!\"\n",
453 |     "    \n",
454 |     "print(\"'{}' is ready!\".format(yelp_htm))"
455 |    ]
456 |   },
457 |   {
458 |    "cell_type": "markdown",
459 |    "metadata": {
460 |     "nbgrader": {
461 |      "grade": false,
462 |      "grade_id": "cell-68bbb05561a090b8",
463 |      "locked": true,
464 |      "schema_version": 1,
465 |      "solution": false
466 |     }
467 |    },
468 |    "source": [
469 |     "Next, inspect and run this code, which prints the top (number one) result."
470 |    ]
471 |   },
472 |   {
473 |    "cell_type": "code",
474 |    "execution_count": 82,
475 |    "metadata": {},
476 |    "outputs": [
477 |     {
478 |      "name": "stdout",
479 |      "output_type": "stream",
480 |      "text": [
481 |       "The number 1 ATL university according to Yelp!:\n",
482 |       "Georgia Institute of Technology\n"
483 |      ]
484 |     }
485 |    ],
486 |    "source": [
487 |     "# uni_html_text = open (yelp_html, 'r').read()\n",
488 |     "uni_html_text = open(r\"C:\\Users\\maica\\Desktop\\DS_Courses\\6_ComputingForDA\\yelp_atl_unies.txt\", 'r', encoding='utf-8').read()\n",
489 |     "\n",
490 |     "uni_soup = BeautifulSoup(uni_html_text, \"lxml\")\n",
491 |     "\n",
492 |     "print(\"The number 1 ATL university according to Yelp!:\")\n",
493 |     "\n",
494 |     "uni_1 = uni_soup.html.body \\\n",
495 |     "    .contents[7] \\\n",
496 |     "    .contents[9] \\\n",
497 |     "    .contents[3] \\\n",
498 |     "    .contents[1] \\\n",
499 |     "    .contents[3] \\\n",
500 |     "    .contents[1] \\\n",
501 |     "    .contents[1] \\\n",
502 |     "    .contents[7] \\\n",
503 |     "    .contents[3] \\\n",
504 |     "    .contents[5] \\\n",
505 |     "    .contents[1] \\\n",
506 |     "    .contents[1] \\\n",
507 |     "    .contents[1] \\\n",
508 |     "    .contents[1] \\\n",
509 |     "    .contents[3] \\\n",
510 |     "    .contents[1] \\\n",
511 |     "    .contents[1] \\\n",
512 |     "    .contents[1] \\\n",
513 |     "    .contents[0] \\\n",
514 |     "    .contents[0]\n",
515 |     "    \n",
516 |     "print(uni_1)"
517 |    ]
518 |   },
519 |   {
520 |    "cell_type": "markdown",
521 |    "metadata": {
522 |     "nbgrader": {
523 |      "grade": false,
524 |      "grade_id": "cell-cfa7415b772a15ad",
525 |      "locked": true,
526 |      "schema_version": 1,
527 |      "solution": false
528 |     }
529 |    },
530 |    "source": [
531 |     "We hope it is self-evident that the above method to navigate to a particular tag or element is not terribly productive or robust, particularly if there are small modifications to the HTML.\n",
532 |     "\n",
533 |     "Here is an alternative. Inspect the raw HTML and observe that every non-ad search result appears in a tag of the form,\n",
534 |     "\n",
535 |     "```html\n",
536 |     "<span class=\"indexed-biz-name\">1.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" href=\"/biz/georgia-institute-of-technology-atlanta-2\" data-hovercard-id=\"gBX8UvhOwtdD5tGJeU-hxg\" ><span >Georgia Institute of Technology</span></a>\n",
537 |     "</span>\n",
538 |     "```\n",
539 |     "\n",
540 |     "Beautiful Soup gives us a way to search for specific tags."
541 |    ]
542 |   },
543 |   {
544 |    "cell_type": "code",
545 |    "execution_count": 72,
546 |    "metadata": {
547 |     "nbgrader": {
548 |      "grade": false,
549 |      "grade_id": "cell-d121f56efa636843",
550 |      "locked": true,
551 |      "schema_version": 1,
552 |      "solution": false
553 |     }
554 |    },
555 |    "outputs": [
556 |     {
557 |      "name": "stdout",
558 |      "output_type": "stream",
559 |      "text": [
560 |       "*** First 5 of 30 results ***\n",
561 |       "\n",
562 |       "[<span class=\"indexed-biz-name\">1.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" data-hovercard-id=\"gBX8UvhOwtdD5tGJeU-hxg\" href=\"/biz/georgia-institute-of-technology-atlanta-2\"><span>Georgia Institute of Technology</span></a>\n",
563 |       "</span>, <span class=\"indexed-biz-name\">2.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" data-hovercard-id=\"13oCD5wffSr2ypav9MpCsQ\" href=\"/biz/emory-university-atlanta-2\"><span>Emory <span class=\"highlighted\">University</span></span></a>\n",
564 |       "</span>, <span class=\"indexed-biz-name\">3.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" data-hovercard-id=\"jAebE83Ox0lPCNsJoQII4A\" href=\"/biz/spelman-college-atlanta\"><span>Spelman College</span></a>\n",
565 |       "</span>, <span class=\"indexed-biz-name\">4.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" data-hovercard-id=\"8YFAsc5dK1K6FYbn9sVQ7g\" href=\"/biz/oglethorpe-university-atlanta\"><span>Oglethorpe <span class=\"highlighted\">University</span></span></a>\n",
566 |       "</span>, <span class=\"indexed-biz-name\">5.         <a class=\"biz-name js-analytics-click\" data-analytics-label=\"biz-name\" data-hovercard-id=\"o5tSSq2nJA-vseLTEDW9UA\" href=\"/biz/georgia-state-university-atlanta-3\"><span>Georgia State <span class=\"highlighted\">University</span></span></a>\n",
567 |       "</span>]\n"
568 |      ]
569 |     }
570 |    ],
571 |    "source": [
572 |     "indexed_unies = uni_soup.find_all(attrs={'class': 'indexed-biz-name'})\n",
573 |     "print(\"*** First 5 of {} results ***\\n\\n{}\".format(len(indexed_unies), indexed_unies[:5]))"
574 |    ]
575 |   },
576 |   {
577 |    "cell_type": "markdown",
578 |    "metadata": {
579 |     "nbgrader": {
580 |      "grade": false,
581 |      "grade_id": "cell-d26c59c6b89e51b8",
582 |      "locked": true,
583 |      "schema_version": 1,
584 |      "solution": false
585 |     }
586 |    },
587 |    "source": [
588 |     "**Exercise 2.** Based on the above, write a function that, given a Yelp! search results page such as `uni_soup` above, returns the name of the number 1 indexed search result."
589 |    ]
590 |   },
591 |   {
592 |    "cell_type": "code",
593 |    "execution_count": 83,
594 |    "metadata": {
595 |     "collapsed": true,
596 |     "nbgrader": {
597 |      "grade": false,
598 |      "grade_id": "ex6",
599 |      "locked": false,
600 |      "schema_version": 1,
601 |      "solution": true
602 |     }
603 |    },
604 |    "outputs": [],
605 |    "source": [
606 |     "def get_top_yelp_result(soup):\n",
607 |     "    \"\"\"Given a Yelp! search result as a Beautiful Soup page,\n",
608 |     "    returns the name of the number 1 indexed result.\n",
609 |     "    \"\"\"\n",
610 |     "    # YOUR CODE HERE\n",
611 |     "    indexed_search = soup.find_all(attrs={'class': 'indexed-biz-name'})\n",
612 |     "    top_result = indexed_search[0].find('a', attrs={'class': \"biz-name js-analytics-click\"}).span.get_text()\n",
613 |     "    return top_result"
614 |    ]
615 |   },
616 |   {
617 |    "cell_type": "code",
618 |    "execution_count": 84,
619 |    "metadata": {
620 |     "nbgrader": {
621 |      "grade": true,
622 |      "grade_id": "ex6_test",
623 |      "locked": true,
624 |      "points": 0,
625 |      "schema_version": 1,
626 |      "solution": false
627 |     }
628 |    },
629 |    "outputs": [
630 |     {
631 |      "name": "stdout",
632 |      "output_type": "stream",
633 |      "text": [
634 |       "Georgia Institute of Technology\n"
635 |      ]
636 |     }
637 |    ],
638 |    "source": [
639 |     "print(get_top_yelp_result(uni_soup))\n",
640 |     "assert get_top_yelp_result(uni_soup) == 'Georgia Institute of Technology'"
641 |    ]
642 |   },
643 |   {
644 |    "cell_type": "markdown",
645 |    "metadata": {
646 |     "nbgrader": {
647 |      "grade": false,
648 |      "grade_id": "cell-c21b203ae30914dc",
649 |      "locked": true,
650 |      "schema_version": 1,
651 |      "solution": false
652 |     }
653 |    },
654 |    "source": [
655 |     "This mini-tutorial only scratches the surface of what is possible with Beautiful Soup. As always, refer to the [package's documentation](https://www.crummy.com/software/BeautifulSoup/) for all the awesome deets!"
656 |    ]
657 |   }
658 |  ],
659 |  "metadata": {
660 |   "celltoolbar": "Create Assignment",
661 |   "kernelspec": {
662 |    "display_name": "Python 3",
663 |    "language": "python",
664 |    "name": "python3"
665 |   },
666 |   "language_info": {
667 |    "codemirror_mode": {
668 |     "name": "ipython",
669 |     "version": 3
670 |    },
671 |    "file_extension": ".py",
672 |    "mimetype": "text/x-python",
673 |    "name": "python",
674 |    "nbconvert_exporter": "python",
675 |    "pygments_lexer": "ipython3",
676 |    "version": "3.6.2"
677 |   }
678 |  },
679 |  "nbformat": 4,
680 |  "nbformat_minor": 2
681 | }
682 | 


--------------------------------------------------------------------------------
/Notebook9_SQL_relational_DBs.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "deletable": false,
  7 |     "nbgrader": {
  8 |      "grade": false,
  9 |      "grade_id": "cell-2d7bc35c8a49ea43",
 10 |      "locked": true,
 11 |      "schema_version": 1,
 12 |      "solution": false
 13 |     }
 14 |    },
 15 |    "source": [
 16 |     "# Lesson 0: SQLite\n",
 17 |     "\n",
 18 |     "The de facto language for managing relational databases is the Structured Query Language, or SQL (\"sequel\").\n",
 19 |     "\n",
 20 |     "Many commercial and open-source relational database management systems (RDBMS) support SQL. The one we will consider in this class is the simplest, called [sqlite3](https://www.sqlite.org/). It stores the database in a simple file and can be run in a \"standalone\" mode from the command line. However, we will, naturally, [invoke it from Python](https://docs.python.org/3/library/sqlite3.html). But all of the basic techniques apply to any commercial SQL backend.\n",
 21 |     "\n",
 22 |     "With a little luck, you _might_ by the end of this class understand this [xkcd comic on SQL injection attacks](http://xkcd.com/327)."
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "markdown",
 27 |    "metadata": {
 28 |     "deletable": false,
 29 |     "nbgrader": {
 30 |      "checksum": "631d516583b07e1831b9aa0d773bf48e",
 31 |      "grade": false,
 32 |      "grade_id": "name_gpa",
 33 |      "locked": true,
 34 |      "schema_version": 1,
 35 |      "solution": false
 36 |     }
 37 |    },
 38 |    "source": [
 39 |     "## Getting started\n",
 40 |     "\n",
 41 |     "In Python, you _connect_ to an `sqlite3` database by creating a _connection object_."
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "**Exercise 0** (ungraded). Run this code cell to get started."
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "code",
 53 |    "execution_count": 1,
 54 |    "metadata": {
 55 |     "collapsed": true,
 56 |     "deletable": false,
 57 |     "nbgrader": {
 58 |      "checksum": "b11295002cc2b9549d6a2b01721b6701",
 59 |      "grade": true,
 60 |      "grade_id": "who__test",
 61 |      "locked": true,
 62 |      "points": 1,
 63 |      "schema_version": 1,
 64 |      "solution": false
 65 |     },
 66 |     "nbpresent": {
 67 |      "id": "60ff4b48-4f6f-4052-bbb9-8f7d7459f5f9"
 68 |     }
 69 |    },
 70 |    "outputs": [],
 71 |    "source": [
 72 |     "import sqlite3 as db\n",
 73 |     "\n",
 74 |     "# Connect to a database (or create one if it doesn't exist)\n",
 75 |     "conn = db.connect('example.db')"
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "markdown",
 80 |    "metadata": {
 81 |     "nbpresent": {
 82 |      "id": "bca69dbc-e662-4485-ace9-5bc32f914fac"
 83 |     }
 84 |    },
 85 |    "source": [
 86 |     "The `sqlite` engine maintains a database as a file; in this example, the name of that file is `example.db`.\n",
 87 |     "\n",
 88 |     "> **Important usage note!** If the named file does **not** yet exist, this code creates it. However, if the database has been created before, this same code will open it. This fact can be important when you are debugging. For example, if your code depends on the database not existing initially, then you may need to remove the file first."
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {
 94 |     "nbpresent": {
 95 |      "id": "7706d36d-b4af-4f8e-b419-20cfce1b7ba3"
 96 |     }
 97 |    },
 98 |    "source": [
 99 |     "You issue commands to the database through an object called a _cursor_."
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 2,
105 |    "metadata": {
106 |     "collapsed": true,
107 |     "nbpresent": {
108 |      "id": "f798db7f-53dd-43ca-b03e-e12be2ad7231"
109 |     }
110 |    },
111 |    "outputs": [],
112 |    "source": [
113 |     "# Create a 'cursor' for executing commands\n",
114 |     "c = conn.cursor()"
115 |    ]
116 |   },
117 |   {
118 |    "cell_type": "markdown",
119 |    "metadata": {
120 |     "nbpresent": {
121 |      "id": "f7caca1e-bb84-434d-a422-840ac1c6b7b7"
122 |     }
123 |    },
124 |    "source": [
125 |     "A cursor tracks the current state of the database, and you will mostly be using the cursor to issue commands that modify or query the database."
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {
131 |     "nbpresent": {
132 |      "id": "ad7d1ae0-681e-4a83-b3ef-6b416cf48f31"
133 |     }
134 |    },
135 |    "source": [
136 |     "## Tables and Basic Queries\n",
137 |     "\n",
138 |     "The central object of a relational database is a _table_. It's identical to what you called a \"tibble\" in the tidy data lab: observations as rows, variables as columns. In the relational database world, we sometimes refer to rows as _items_ or _records_ and to columns as _attributes_. We'll use all of these terms interchangeably in this course."
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "markdown",
143 |    "metadata": {},
144 |    "source": [
145 |     "Let's look at a concrete example. Suppose we wish to maintain a database of Georgia Tech students, whose attributes are their names and Georgia Tech-issued ID numbers. You might start by creating a table named `Students` to hold this data. You can create the table using the command, [`create table`](https://www.sqlite.org/lang_createtable.html).\n",
146 |     "\n",
147 |     "> Note: If you try to create a table that already exists, it will **fail**. If you are trying to carry out these exercises from scratch, you may need to remove any existing `example.db` file or destroy any existing table; you can do the latter with the SQL command, `drop table if exists Students`."
148 |    ]
149 |   },
150 |   {
151 |    "cell_type": "code",
152 |    "execution_count": 3,
153 |    "metadata": {
154 |     "nbpresent": {
155 |      "id": "ad171c63-8b27-4921-9d6c-ebb09921fca4"
156 |     }
157 |    },
158 |    "outputs": [
159 |     {
160 |      "data": {
161 |       "text/plain": [
162 |        "<sqlite3.Cursor at 0x1fec0f45b90>"
163 |       ]
164 |      },
165 |      "execution_count": 3,
166 |      "metadata": {},
167 |      "output_type": "execute_result"
168 |     }
169 |    ],
170 |    "source": [
171 |     "c.execute(\"create table Students (gtid integer, name text)\")"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "markdown",
176 |    "metadata": {
177 |     "nbpresent": {
178 |      "id": "bd814451-e891-48e7-8bd2-c90105cae261"
179 |     }
180 |    },
181 |    "source": [
182 |     "To populate the table with items, you can use the command, [`insert into`](https://www.sqlite.org/lang_insert.html)."
183 |    ]
184 |   },
185 |   {
186 |    "cell_type": "code",
187 |    "execution_count": 4,
188 |    "metadata": {
189 |     "nbpresent": {
190 |      "id": "2cc9a1f8-f514-4dd8-9d43-0663b4a4c70d"
191 |     }
192 |    },
193 |    "outputs": [
194 |     {
195 |      "data": {
196 |       "text/plain": [
197 |        "<sqlite3.Cursor at 0x1fec0f45b90>"
198 |       ]
199 |      },
200 |      "execution_count": 4,
201 |      "metadata": {},
202 |      "output_type": "execute_result"
203 |     }
204 |    ],
205 |    "source": [
206 |     "c.execute(\"insert into Students values (123, 'Vuduc')\")\n",
207 |     "c.execute(\"insert into Students values (456, 'Chau')\")\n",
208 |     "c.execute(\"insert into Students values (381, 'Bader')\")\n",
209 |     "c.execute(\"insert into Students values (991, 'Sokol')\")"
210 |    ]
211 |   },
212 |   {
213 |    "cell_type": "markdown",
214 |    "metadata": {},
215 |    "source": [
216 |     "**Commitment issues.** The commands above modify the database. However, these are temporary modifications and aren't actually saved to the database until you say so. (_Aside:_ Why would you want such behavior?) The way to do that is to issue a _commit_ operation from the _connection_ object.\n",
217 |     "\n",
218 |     "> There are some subtleties related to when you actually need to commit, since the SQLite database engine does commit at certain points as discussed [here](https://stackoverflow.com/questions/13642956/commit-behavior-and-atomicity-in-python-sqlite3-module). However, it's probably simpler to issue commits explicitly whenever you intend for changes to take effect."
219 |    ]
220 |   },
221 |   {
222 |    "cell_type": "code",
223 |    "execution_count": 5,
224 |    "metadata": {
225 |     "collapsed": true,
226 |     "nbgrader": {
227 |      "grade": false,
228 |      "grade_id": "cell-de29bd964e0fed8a",
229 |      "locked": true,
230 |      "schema_version": 1,
231 |      "solution": false
232 |     }
233 |    },
234 |    "outputs": [],
235 |    "source": [
236 |     "conn.commit()"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "markdown",
241 |    "metadata": {
242 |     "nbpresent": {
243 |      "id": "0580d369-3b2a-4011-8f93-53681beda02a"
244 |     }
245 |    },
246 |    "source": [
247 |     "Another common operation is to perform a bunch of insertions into a table from a list of tuples. In this case, you can use `executemany()`."
248 |    ]
249 |   },
250 |   {
251 |    "cell_type": "code",
252 |    "execution_count": 6,
253 |    "metadata": {
254 |     "collapsed": true,
255 |     "nbpresent": {
256 |      "id": "425a8dc2-8e05-4370-9244-eb9e134bc16e"
257 |     }
258 |    },
259 |    "outputs": [],
260 |    "source": [
261 |     "# An important (and secure!) idiom\n",
262 |     "more_students = [(723, 'Rozga'),\n",
263 |     "                 (882, 'Zha'),\n",
264 |     "                 (401, 'Park'),\n",
265 |     "                 (377, 'Vetter'),\n",
266 |     "                 (904, 'Brown')]\n",
267 |     "\n",
268 |     "c.executemany('insert into Students values (?, ?)', more_students)\n",
269 |     "conn.commit()"
270 |    ]
271 |   },
272 |   {
273 |    "cell_type": "markdown",
274 |    "metadata": {
275 |     "nbpresent": {
276 |      "id": "13374b53-847e-4db5-ab8d-0deca7250bbd"
277 |     }
278 |    },
279 |    "source": [
280 |     "Given a table, the most common operation is a _query_, which asks for some subset or transformation of the data. The simplest kind of query is called a [`select`](https://www.sqlite.org/lang_select.html).\n",
281 |     "\n",
282 |     "The following example selects all rows (items) from the `Students` table."
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "code",
287 |    "execution_count": 7,
288 |    "metadata": {
289 |     "nbpresent": {
290 |      "id": "e0b70aa2-b610-44a0-bea3-f72ebcffd5bd"
291 |     }
292 |    },
293 |    "outputs": [
294 |     {
295 |      "name": "stdout",
296 |      "output_type": "stream",
297 |      "text": [
298 |       "Your results: 9 \n",
299 |       "The entries of Students:\n",
300 |       " [(123, 'Vuduc'), (456, 'Chau'), (381, 'Bader'), (991, 'Sokol'), (723, 'Rozga'), (882, 'Zha'), (401, 'Park'), (377, 'Vetter'), (904, 'Brown')]\n"
301 |      ]
302 |     }
303 |    ],
304 |    "source": [
305 |     "c.execute(\"select * from Students\")\n",
306 |     "results = c.fetchall()\n",
307 |     "print(\"Your results:\", len(results), \"\\nThe entries of Students:\\n\", results)"
308 |    ]
309 |   },
310 |   {
311 |    "cell_type": "markdown",
312 |    "metadata": {
313 |     "nbgrader": {
314 |      "grade": false,
315 |      "grade_id": "cell-457e323adba0c030",
316 |      "locked": true,
317 |      "schema_version": 1,
318 |      "solution": false
319 |     },
320 |     "nbpresent": {
321 |      "id": "5be67adf-58b9-48b8-a447-6d435a60b32a"
322 |     }
323 |    },
324 |    "source": [
325 |     "**Exercise 1** (2 points). Suppose we wish to maintain a second table, called `Takes`, which records classes that students have taken and the grades they earn.\n",
326 |     "\n",
327 |     "In particular, each row of `Takes` stores a student by his/her GT ID, the course he/she took, and the grade he/she earned. More formally, suppose this table is defined as follows:"
328 |    ]
329 |   },
330 |   {
331 |    "cell_type": "code",
332 |    "execution_count": 10,
333 |    "metadata": {
334 |     "nbgrader": {
335 |      "grade": false,
336 |      "grade_id": "cell-0f6a7ffdcf640bb5",
337 |      "locked": true,
338 |      "schema_version": 1,
339 |      "solution": false
340 |     },
341 |     "nbpresent": {
342 |      "id": "ead189ed-6c04-4e3a-a285-903d08eb20b4"
343 |     }
344 |    },
345 |    "outputs": [
346 |     {
347 |      "data": {
348 |       "text/plain": [
349 |        "<sqlite3.Cursor at 0x1fec0f45b90>"
350 |       ]
351 |      },
352 |      "execution_count": 10,
353 |      "metadata": {},
354 |      "output_type": "execute_result"
355 |     }
356 |    ],
357 |    "source": [
358 |     "# Run this cell\n",
359 |     "c.execute('drop table if exists Takes')\n",
360 |     "c.execute('create table Takes (gtid integer, course text, grade real)')"
361 |    ]
362 |   },
363 |   {
364 |    "cell_type": "markdown",
365 |    "metadata": {
366 |     "nbgrader": {
367 |      "grade": false,
368 |      "grade_id": "cell-a238171e2e6ed8f3",
369 |      "locked": true,
370 |      "schema_version": 1,
371 |      "solution": false
372 |     },
373 |     "nbpresent": {
374 |      "id": "5bf1708e-da01-422e-a002-114ba03032be"
375 |     }
376 |    },
377 |    "source": [
378 |     "Write a command to insert the following records into the `Takes` table.\n",
379 |     "\n",
380 |     "* Vuduc: CSE 6040 - A (4.0), ISYE 6644 - B (3.0), MGMT 8803 - D (1.0)\n",
381 |     "* Sokol: CSE 6040 - A (4.0), ISYE 6740 - A (4.0)\n",
382 |     "* Chau: CSE 6040 - A (4.0), CSE 6740 - C (2.0), MGMT 8803 - B (3.0)"
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "code",
387 |    "execution_count": 14,
388 |    "metadata": {
389 |     "deletable": false,
390 |     "nbgrader": {
391 |      "checksum": "ef802a157691401d9221693b8cf3bf3c",
392 |      "grade": false,
393 |      "grade_id": "insert_many",
394 |      "locked": false,
395 |      "schema_version": 1,
396 |      "solution": true
397 |     },
398 |     "nbpresent": {
399 |      "id": "fbe2ca1e-a5dd-4633-bd6a-3f8ea5d446f3"
400 |     }
401 |    },
402 |    "outputs": [
403 |     {
404 |      "name": "stdout",
405 |      "output_type": "stream",
406 |      "text": [
407 |       "Your results: 8 \n",
408 |       "The entries of Takes: [(123, 'CSE 6040', 4.0), (123, 'ISYE 6644', 3.0), (123, 'MGMT 8803', 1.0), (991, 'CSE 6040', 4.0), (991, 'ISYE 6740', 4.0), (456, 'CSE 6040', 4.0), (456, 'CSE 6740', 2.0), (456, 'MGMT 8803', 3.0)]\n"
409 |      ]
410 |     }
411 |    ],
412 |    "source": [
413 |     "# YOUR CODE HERE\n",
414 |     "take_entries = [(123, 'CSE 6040', 4.0), (123, 'ISYE 6644', 3.0), (123, 'MGMT 8803', 1.0),\n",
415 |     "               (991, 'CSE 6040', 4.0), (991, 'ISYE 6740', 4.0),\n",
416 |     "               (456, 'CSE 6040', 4.0), (456, 'CSE 6740', 2.0), (456, 'MGMT 8803', 3.0)]\n",
417 |     "\n",
418 |     "\n",
419 |     "c.executemany('insert into Takes values (?, ?, ?)', take_entries)\n",
420 |     "conn.commit()\n",
421 |     "\n",
422 |     "# Displays the results of your code\n",
423 |     "c.execute('select * from Takes')\n",
424 |     "results = c.fetchall()\n",
425 |     "print(\"Your results:\", len(results), \"\\nThe entries of Takes:\", results)"
426 |    ]
427 |   },
428 |   {
429 |    "cell_type": "code",
430 |    "execution_count": 15,
431 |    "metadata": {
432 |     "deletable": false,
433 |     "nbgrader": {
434 |      "checksum": "3474ee2d282adaeeb16a4399f7e0a191",
435 |      "grade": true,
436 |      "grade_id": "insert_many__test",
437 |      "locked": true,
438 |      "points": 2,
439 |      "schema_version": 1,
440 |      "solution": false
441 |     },
442 |     "nbpresent": {
443 |      "id": "a6361dd2-a7f3-4bb1-b4a1-ce9d9ed34c7f"
444 |     }
445 |    },
446 |    "outputs": [
447 |     {
448 |      "name": "stdout",
449 |      "output_type": "stream",
450 |      "text": [
451 |       "\n",
452 |       "(Passed.)\n"
453 |      ]
454 |     }
455 |    ],
456 |    "source": [
457 |     "# Test cell: `insert_many__test`\n",
458 |     "\n",
459 |     "# Close the database and reopen it\n",
460 |     "conn.close()\n",
461 |     "conn = db.connect('example.db')\n",
462 |     "c = conn.cursor()\n",
463 |     "c.execute('select * from Takes')\n",
464 |     "results = c.fetchall()\n",
465 |     "\n",
466 |     "if len(results) == 0:\n",
467 |     "    print(\"*** No matching records. Did you remember to commit the results? ***\")\n",
468 |     "assert len(results) == 8, \"The `Takes` table has {} when it should have {}.\".format(len(results), 8)\n",
469 |     "\n",
470 |     "assert (123, 'CSE 6040', 4.0) in results\n",
471 |     "assert (123, 'ISYE 6644', 3.0) in results\n",
472 |     "assert (123, 'MGMT 8803', 1.0) in results\n",
473 |     "assert (991, 'CSE 6040', 4.0) in results\n",
474 |     "assert (991, 'ISYE 6740', 4.0) in results\n",
475 |     "assert (456, 'CSE 6040', 4.0) in results\n",
476 |     "assert (456, \"CSE 6740\", 2.0) in results\n",
477 |     "assert (456, \"MGMT 8803\", 3.0) in results\n",
478 |     "\n",
479 |     "print(\"\\n(Passed.)\")"
480 |    ]
481 |   },
482 |   {
483 |    "cell_type": "markdown",
484 |    "metadata": {
485 |     "nbpresent": {
486 |      "id": "ffeff9c0-6b08-4771-b885-b57cb3ff5ca5"
487 |     }
488 |    },
489 |    "source": [
490 |     "# Lesson 1: Join queries\n",
491 |     "\n",
492 |     "The main type of query that combines information from multiple tables is the _join query_. Recall from our discussion of tibbles these four types:\n",
493 |     "\n",
494 |     "- `inner-join(A, B)`: Keep rows of `A` and `B` only where `A` and `B` match\n",
495 |     "- `outer-join(A, B)`: Keep all rows of `A` and `B`, but merge matching rows and fill in missing values with some default (`NaN` in Pandas, `NULL` in SQL)\n",
496 |     "- `left-join(A, B)`: Keep all rows of `A` but only merge matches from `B`.\n",
497 |     "- `right-join(A, B)`: Keep all rows of `B` but only merge matches from `A`.\n",
498 |     "\n",
499 |     "If you are a visual person, see [this page](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins) for illustrations of the different join types.\n",
500 |     "\n",
501 |     "In SQL, you can use the `where` clause of a `select` statement to specify how to match rows from the tables being joined. For example, recall that the `Takes` table stores the classes taken by each student, but identifies each student only by GT ID. Suppose we want a report that shows each student's name rather than his/her ID. We can get the matching name from the `Students` table. Here is a query to accomplish this matching:"
502 |    ]
503 |   },
504 |   {
505 |    "cell_type": "code",
506 |    "execution_count": 16,
507 |    "metadata": {
508 |     "nbpresent": {
509 |      "id": "0061492b-418b-42f2-a4eb-f6a1ad798350"
510 |     }
511 |    },
512 |    "outputs": [
513 |     {
514 |      "name": "stdout",
515 |      "output_type": "stream",
516 |      "text": [
517 |       "('Vuduc', 'CSE 6040', 4.0)\n",
518 |       "('Vuduc', 'ISYE 6644', 3.0)\n",
519 |       "('Vuduc', 'MGMT 8803', 1.0)\n",
520 |       "('Chau', 'CSE 6040', 4.0)\n",
521 |       "('Chau', 'CSE 6740', 2.0)\n",
522 |       "('Chau', 'MGMT 8803', 3.0)\n",
523 |       "('Sokol', 'CSE 6040', 4.0)\n",
524 |       "('Sokol', 'ISYE 6740', 4.0)\n"
525 |      ]
526 |     }
527 |    ],
528 |    "source": [
529 |     "# See all (name, course, grade) tuples\n",
530 |     "query = '''\n",
531 |     "    select Students.name, Takes.course, Takes.grade\n",
532 |     "        from Students, Takes\n",
533 |     "        where Students.gtid=Takes.gtid\n",
534 |     "'''\n",
535 |     "\n",
536 |     "for match in c.execute(query): # Note this alternative idiom for iterating over query results\n",
537 |     "    print(match)"
538 |    ]
539 |   },
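    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "The `where`-based join above can also be spelled with an explicit `inner join ... on` clause. The following is only a minimal sketch under stated assumptions: it builds small throwaway copies of `Students` and `Takes` in an in-memory database (the `:memory:` connection and the names `demo_conn`, `demo_c`, `implicit_rows`, and `explicit_rows` are made up for this illustration and are separate from `example.db`) and checks that both spellings return the same rows.\n",
    |     "\n",
    |     "```python\n",
    |     "import sqlite3\n",
    |     "\n",
    |     "# Assumption: a throwaway in-memory database, separate from example.db\n",
    |     "demo_conn = sqlite3.connect(':memory:')\n",
    |     "demo_c = demo_conn.cursor()\n",
    |     "demo_c.execute('create table Students (gtid integer, name text)')\n",
    |     "demo_c.execute('create table Takes (gtid integer, course text, grade real)')\n",
    |     "demo_c.executemany('insert into Students values (?, ?)',\n",
    |     "                   [(123, 'Vuduc'), (456, 'Chau'), (381, 'Bader')])\n",
    |     "demo_c.executemany('insert into Takes values (?, ?, ?)',\n",
    |     "                   [(123, 'CSE 6040', 4.0), (456, 'CSE 6740', 2.0)])\n",
    |     "\n",
    |     "# Implicit inner join: the matching rule lives in the where clause\n",
    |     "implicit_rows = demo_c.execute('''\n",
    |     "    select Students.name, Takes.course, Takes.grade\n",
    |     "        from Students, Takes\n",
    |     "        where Students.gtid = Takes.gtid\n",
    |     "''').fetchall()\n",
    |     "\n",
    |     "# Explicit inner join: the same matching rule, stated in the on clause\n",
    |     "explicit_rows = demo_c.execute('''\n",
    |     "    select Students.name, Takes.course, Takes.grade\n",
    |     "        from Students inner join Takes\n",
    |     "        on Students.gtid = Takes.gtid\n",
    |     "''').fetchall()\n",
    |     "\n",
    |     "# Row order is not guaranteed without order by, so compare sorted copies\n",
    |     "print(sorted(implicit_rows) == sorted(explicit_rows))\n",
    |     "demo_conn.close()\n",
    |     "```\n",
    |     "\n",
    |     "In this sketch, the student with no courses (Bader) appears in neither result, which is what distinguishes an inner join from a left join."
    |    ]
    |   },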
540 |   {
541 |    "cell_type": "markdown",
542 |    "metadata": {
543 |     "nbpresent": {
544 |      "id": "c229a881-8cf8-44ee-ba29-63386a9ad337"
545 |     }
546 |    },
547 |    "source": [
548 |     "**Exercise 2** (2 points). Define a query to select only the names and grades of students _who took CSE 6040_. The code below will execute your query and store the results in a list `results1` of tuples, where each tuple is a `(name, grade)` pair; thus, you should structure your query to match this format."
549 |    ]
550 |   },
551 |   {
552 |    "cell_type": "code",
553 |    "execution_count": 20,
554 |    "metadata": {
555 |     "deletable": false,
556 |     "nbgrader": {
557 |      "checksum": "0dfcf9af496ec5c807a97a27069c240a",
558 |      "grade": false,
559 |      "grade_id": "join1",
560 |      "locked": false,
561 |      "schema_version": 1,
562 |      "solution": true
563 |     },
564 |     "nbpresent": {
565 |      "id": "0a053476-f540-480a-b3cc-28d23e2d6511"
566 |     }
567 |    },
568 |    "outputs": [
569 |     {
570 |      "data": {
571 |       "text/plain": [
572 |        "[('Vuduc', 4.0), ('Chau', 4.0), ('Sokol', 4.0)]"
573 |       ]
574 |      },
575 |      "execution_count": 20,
576 |      "metadata": {},
577 |      "output_type": "execute_result"
578 |     }
579 |    ],
580 |    "source": [
581 |     "# Define `query` with your query:\n",
582 |     "# YOUR CODE HERE\n",
583 |     "query = \"SELECT Students.name, Takes.grade FROM Students LEFT JOIN Takes ON Students.gtid = Takes.gtid WHERE Takes.course == 'CSE 6040'\"\n",
584 |     "\n",
585 |     "c.execute(query)\n",
586 |     "results1 = c.fetchall()\n",
587 |     "results1"
588 |    ]
589 |   },
590 |   {
591 |    "cell_type": "code",
592 |    "execution_count": 21,
593 |    "metadata": {
594 |     "deletable": false,
595 |     "nbgrader": {
596 |      "checksum": "6adb27164d87eda12aec79e0b8d8ecec",
597 |      "grade": true,
598 |      "grade_id": "join1__test",
599 |      "locked": true,
600 |      "points": 2,
601 |      "schema_version": 1,
602 |      "solution": false
603 |     },
604 |     "nbpresent": {
605 |      "id": "e1fe9c03-2112-4b49-8112-e12bb7bb10df"
606 |     }
607 |    },
608 |    "outputs": [
609 |     {
610 |      "name": "stdout",
611 |      "output_type": "stream",
612 |      "text": [
613 |       "Your results: [('Vuduc', 4.0), ('Chau', 4.0), ('Sokol', 4.0)]\n",
614 |       "\n",
615 |       "(Passed.)\n"
616 |      ]
617 |     }
618 |    ],
619 |    "source": [
620 |     "# Test cell: `join1__test`\n",
621 |     "\n",
622 |     "print (\"Your results:\", results1)\n",
623 |     "\n",
624 |     "assert type(results1) is list\n",
625 |     "assert len(results1) == 3, \"Your query produced {} results instead of {}.\".format(len(results1), 3)\n",
626 |     "\n",
627 |     "assert set(results1) == {('Vuduc', 4.0), ('Sokol', 4.0), ('Chau', 4.0)}\n",
628 |     "\n",
629 |     "print(\"\\n(Passed.)\")"
630 |    ]
631 |   },
632 |   {
633 |    "cell_type": "markdown",
634 |    "metadata": {
635 |     "nbpresent": {
636 |      "id": "0449ec70-9765-4cd3-bdeb-e79c1652f9fd"
637 |     }
638 |    },
639 |    "source": [
640 |     "For contrast, let's do a quick exercise that executes a [left join](http://www.sqlitetutorial.net/sqlite-left-join/).\n",
641 |     "\n",
642 |     "**Exercise 3** (2 points). Execute a left join that uses `Students` as the left table, `Takes` as the right table, and selects a student's name and course grade. Write your query as a string variable named `query`, which the subsequent code will execute."
643 |    ]
644 |   },
645 |   {
646 |    "cell_type": "code",
647 |    "execution_count": 23,
648 |    "metadata": {
649 |     "nbgrader": {
650 |      "grade": false,
651 |      "grade_id": "cell-6eaa56837de291f4",
652 |      "locked": false,
653 |      "schema_version": 1,
654 |      "solution": true
655 |     },
656 |     "nbpresent": {
657 |      "id": "55b9eb3b-2bf7-47c6-9599-fd751c6f2266"
658 |     }
659 |    },
660 |    "outputs": [
661 |     {
662 |      "name": "stdout",
663 |      "output_type": "stream",
664 |      "text": [
665 |       "0 -> ('Vuduc', 1.0)\n",
666 |       "1 -> ('Vuduc', 3.0)\n",
667 |       "2 -> ('Vuduc', 4.0)\n",
668 |       "3 -> ('Chau', 2.0)\n",
669 |       "4 -> ('Chau', 3.0)\n",
670 |       "5 -> ('Chau', 4.0)\n",
671 |       "6 -> ('Bader', None)\n",
672 |       "7 -> ('Sokol', 4.0)\n",
673 |       "8 -> ('Sokol', 4.0)\n",
674 |       "9 -> ('Rozga', None)\n",
675 |       "10 -> ('Zha', None)\n",
676 |       "11 -> ('Park', None)\n",
677 |       "12 -> ('Vetter', None)\n",
678 |       "13 -> ('Brown', None)\n"
679 |      ]
680 |     }
681 |    ],
682 |    "source": [
683 |     "# Define `query` string here:\n",
684 |     "# YOUR CODE HERE\n",
685 |     "query = \"SELECT Students.name, Takes.grade FROM Students LEFT JOIN Takes ON Students.gtid=Takes.gtid\"\n",
686 |     "\n",
687 |     "# Executes your `query` string:\n",
688 |     "c.execute(query)\n",
689 |     "matches = c.fetchall()\n",
690 |     "for i, match in enumerate(matches):\n",
691 |     "    print(i, \"->\", match)"
692 |    ]
693 |   },
694 |   {
695 |    "cell_type": "code",
696 |    "execution_count": 24,
697 |    "metadata": {
698 |     "nbgrader": {
699 |      "grade": true,
700 |      "grade_id": "left_join_test",
701 |      "locked": true,
702 |      "points": 2,
703 |      "schema_version": 1,
704 |      "solution": false
705 |     }
706 |    },
707 |    "outputs": [
708 |     {
709 |      "name": "stdout",
710 |      "output_type": "stream",
711 |      "text": [
712 |       "\n",
713 |       "(Passed!)\n"
714 |      ]
715 |     }
716 |    ],
717 |    "source": [
718 |     "# Test cell: `left_join_test`\n",
719 |     "\n",
720 |     "assert set(matches) == {('Vuduc', 4.0), ('Chau', 2.0), ('Park', None), ('Vuduc', 1.0), ('Chau', 3.0), ('Zha', None), ('Brown', None), ('Vetter', None), ('Vuduc', 3.0), ('Bader', None), ('Rozga', None), ('Chau', 4.0), ('Sokol', 4.0)}\n",
721 |     "print(\"\\n(Passed!)\")"
722 |    ]
723 |   },
724 |   {
725 |    "cell_type": "markdown",
726 |    "metadata": {
727 |     "nbpresent": {
728 |      "id": "42844d75-fe64-44e9-81fd-971d31426c8c"
729 |     }
730 |    },
731 |    "source": [
732 |     "## Aggregations\n",
733 |     "\n",
734 |     "Another common style of query is an _aggregation_, which is a summary of information across multiple records, rather than the raw records themselves.\n",
735 |     "\n",
736 |     "For instance, suppose we want to compute the GPA for each unique GT ID from the `Takes` table. Here is a query that does it:"
737 |    ]
738 |   },
739 |   {
740 |    "cell_type": "code",
741 |    "execution_count": 25,
742 |    "metadata": {
743 |     "nbpresent": {
744 |      "id": "2d78b9ac-8a7d-4a89-99ae-86038d3184f5"
745 |     }
746 |    },
747 |    "outputs": [
748 |     {
749 |      "name": "stdout",
750 |      "output_type": "stream",
751 |      "text": [
752 |       "(123, 2.6666666666666665)\n",
753 |       "(456, 3.0)\n",
754 |       "(991, 4.0)\n"
755 |      ]
756 |     }
757 |    ],
758 |    "source": [
759 |     "query = '''\n",
760 |     "    select gtid, avg(grade)\n",
761 |     "        from Takes \n",
762 |     "        group by gtid\n",
763 |     "'''\n",
764 |     "\n",
765 |     "for match in c.execute(query):\n",
766 |     "    print(match)"
767 |    ]
768 |   },
769 |   {
770 |    "cell_type": "markdown",
771 |    "metadata": {
772 |     "nbpresent": {
773 |      "id": "173eeb1b-177c-4121-aaf2-5349eb2ee43a"
774 |     }
775 |    },
776 |    "source": [
777 |     "Some other useful SQL aggregators include `min`, `max`, `sum`, and `count`."
778 |    ]
779 |   },
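    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "As a rough illustration of those aggregators, the sketch below computes several summaries per GT ID in one query. It assumes a throwaway in-memory copy of part of `Takes`; the names `agg_conn`, `agg_c`, and `summary` are hypothetical and not part of the notebook's `example.db` session.\n",
    |     "\n",
    |     "```python\n",
    |     "import sqlite3\n",
    |     "\n",
    |     "# Assumption: a throwaway in-memory table holding a few rows of Takes\n",
    |     "agg_conn = sqlite3.connect(':memory:')\n",
    |     "agg_c = agg_conn.cursor()\n",
    |     "agg_c.execute('create table Takes (gtid integer, course text, grade real)')\n",
    |     "agg_c.executemany('insert into Takes values (?, ?, ?)',\n",
    |     "                  [(123, 'CSE 6040', 4.0), (123, 'ISYE 6644', 3.0), (123, 'MGMT 8803', 1.0),\n",
    |     "                   (991, 'CSE 6040', 4.0), (991, 'ISYE 6740', 4.0)])\n",
    |     "\n",
    |     "# One output row per gtid: course count, lowest, highest, and total grade\n",
    |     "summary = agg_c.execute('''\n",
    |     "    select gtid, count(*), min(grade), max(grade), sum(grade)\n",
    |     "        from Takes\n",
    |     "        group by gtid\n",
    |     "        order by gtid\n",
    |     "''').fetchall()\n",
    |     "\n",
    |     "for row in summary:\n",
    |     "    print(row)\n",
    |     "agg_conn.close()\n",
    |     "```\n",
    |     "\n",
    |     "Each aggregator is applied separately within every `group by` group, so a single query can return several summaries side by side."
    |    ]
    |   },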
780 |   {
781 |    "cell_type": "markdown",
782 |    "metadata": {
783 |     "nbpresent": {
784 |      "id": "4e2c8ddf-fca1-4b7f-ab11-ce70adcd4ca6"
785 |     }
786 |    },
787 |    "source": [
788 |     "## Cleanup\n",
789 |     "\n",
790 |     "As one final bit of information, it's good practice to shut down the cursor and connection, the same way you close files."
791 |    ]
792 |   },
793 |   {
794 |    "cell_type": "code",
795 |    "execution_count": 26,
796 |    "metadata": {
797 |     "collapsed": true,
798 |     "nbgrader": {
799 |      "grade": false,
800 |      "grade_id": "cell-61f9b7c91d99417d",
801 |      "locked": true,
802 |      "schema_version": 1,
803 |      "solution": false
804 |     },
805 |     "nbpresent": {
806 |      "id": "a36899b0-ee8d-46a7-9aba-4a978c6ddd85"
807 |     }
808 |    },
809 |    "outputs": [],
810 |    "source": [
811 |     "c.close()\n",
812 |     "conn.close()"
813 |    ]
814 |   },
815 |   {
816 |    "cell_type": "markdown",
817 |    "metadata": {
818 |     "nbgrader": {
819 |      "grade": false,
820 |      "grade_id": "cell-f72ae23344171e43",
821 |      "locked": true,
822 |      "schema_version": 1,
823 |      "solution": false
824 |     }
825 |    },
826 |    "source": [
827 |     "**What next?** It's now a good time to look at a different tutorial which reviews this material and introduces some additional topics: [A thorough guide to SQLite database operations in Python](http://sebastianraschka.com/Articles/2014_sqlite_in_python_tutorial.html)."
828 |    ]
829 |   },
830 |   {
831 |    "cell_type": "code",
832 |    "execution_count": null,
833 |    "metadata": {
834 |     "collapsed": true
835 |    },
836 |    "outputs": [],
837 |    "source": []
838 |   }
839 |  ],
840 |  "metadata": {
841 |   "anaconda-cloud": [],
842 |   "celltoolbar": "Create Assignment",
843 |   "kernelspec": {
844 |    "display_name": "Python 3",
845 |    "language": "python",
846 |    "name": "python3"
847 |   },
848 |   "language_info": {
849 |    "codemirror_mode": {
850 |     "name": "ipython",
851 |     "version": 3
852 |    },
853 |    "file_extension": ".py",
854 |    "mimetype": "text/x-python",
855 |    "name": "python",
856 |    "nbconvert_exporter": "python",
857 |    "pygments_lexer": "ipython3",
858 |    "version": "3.6.2"
859 |   },
860 |   "nbpresent": {
861 |    "slides": [],
862 |    "themes": []
863 |   }
864 |  },
865 |  "nbformat": 4,
866 |  "nbformat_minor": 1
867 | }
868 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## CSE 6040x: Intro to Computing for Data Analysis
 2 | Instructor: Professor Richard (Rich) Vuduc
 3 | Co-creators: Vaishnavi Eleti and Rachel Wiseley
 4 | 
 5 | ## Course description.
 6 | This course is your hands-on introduction to programming
 7 | techniques relevant to data analysis and machine learning. Most of the programming
 8 | exercises will be based on Python and SQL.
 9 | 
10 | ## What will you learn?
11 | You will build, "from scratch," the basic components of a data
12 | analysis pipeline: collection, preprocessing, storage, analysis, and visualization. You will
13 | see many examples of high-level data analysis questions, concepts and techniques for
14 | formalizing those questions into mathematical or computational tasks, and methods for
15 | translating those tasks into code. Beyond programming languages and best practices,
16 | you’ll learn elementary data processing algorithms, notions of program correctness and
17 | efficiency, and numerical methods for linear algebra and mathematical optimization.
18 | 
19 | ## Philosophy and approach.
20 | The basic philosophy of this course is that you'll learn the
21 | material best by a combination of reading, thinking, and most importantly, actively
22 | doing. Therefore, you should make an effort to complete all assignments, including any
23 | "optional" parts.
24 | 


--------------------------------------------------------------------------------
/Supplemental_notebook.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## 1. What You See is What You Get"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "A lot of confusion in the first labs came from the differences between how humans understand information, how computers (or rather, Python) store it, and, more importantly, how Python prints it. Let's look at a simple number."
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": null,
 20 |    "metadata": {
 21 |     "collapsed": true
 22 |    },
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "print (42)"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "We can store it in a variable and print it out. It doesn't sound confusing at all!"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {
 39 |     "collapsed": true
 40 |    },
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "x = 42\n",
 44 |     "print (x)"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "markdown",
 49 |    "metadata": {},
 50 |    "source": [
 51 |     "Here, the variable `x` stores the number **42** as an integer. However, we can store the same number as a different type or within another data structure: as a float, as a string, or as part of a list or a tuple. Depending on the type of the variable, Python will print it **slightly** differently."
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "code",
 56 |    "execution_count": null,
 57 |    "metadata": {
 58 |     "collapsed": true,
 59 |     "scrolled": true
 60 |    },
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "x_float = float(42)\n",
 64 |     "x_scientific = 42e0\n",
 65 |     "x_str = '42'\n",
 66 |     "\n",
 67 |     "print ('42 as a float', x_float)\n",
 68 |     "print ('42 as a float in scientific notation', x_scientific)\n",
 69 |     "print ('42 as a string', x_str)"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "markdown",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "So far it looks pretty normal. Converting with `float` adds a decimal point to the integer. A literal in scientific notation is always a `float` in Python (*serious scientists don't work with integers!*). `42` as a string prints exactly like the integer, but the two behave differently. The difference isn't visible until we make the string part of a collection, for example a list."
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": null,
 82 |    "metadata": {
 83 |     "collapsed": true
 84 |    },
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "x_list = [x, x_str, x_float, x_scientific]\n",
 88 |     "print (\"All 42 in a list:\", x_list)"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {},
 94 |    "source": [
 95 |     "Here is the thing: Python won't show you quotes when you print a string by itself, but if you print a string that sits inside another object, it encloses the string in single quotes. So each time you see these single quotes, you should understand that it's a string and not a number (at least for Python)!\n",
 96 |     "\n",
 97 |     "Let's see how this looks for other collections:"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": null,
103 |    "metadata": {
104 |     "collapsed": true
105 |    },
106 |    "outputs": [],
107 |    "source": [
108 |     "x_tuple = tuple(x_list)\n",
109 |     "x_set = set(x_list)\n",
110 |     "x_dict = {x_str : x, x_tuple : x_list}\n",
111 |     "\n",
112 |     "print (\"All 42 in a list:\", x_list)\n",
113 |     "print (\"All 42 in a tuple:\", x_tuple)\n",
114 |     "print (\"All 42 in a set:\", x_set)\n",
115 |     "print (\"A dict of 42 in different flavors\", x_dict)"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "markdown",
120 |    "metadata": {},
121 |    "source": [
122 |     "Wow, now you should be **extreeeeeemely** watchful! \n",
123 |     "\n",
124 |     "* Look how the shape of the brackets differs between a tuple and a list: lists use **[brackets]**, whereas tuples use **(parentheses)**.\n",
125 |     "\n",
126 |     "* Look how both sets and dicts use **{braces}**. That might create some confusion, but each element of a set is just an object, whereas each element of a dictionary is a **key : value** pair separated by a **: (colon)**.\n",
127 |     "\n",
128 |     "* Although 42 as an integer and 42.0 as a float are objects of different types, a set treats them as equal because they compare equal (both are exactly 42). '42' as a string, on the other hand, is not a number at all. That's why the set has only two elements: 42 as an integer and '42' as a string."
129 |    ]
130 |   },
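    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "The set behavior described in the last bullet can be checked directly. The short sketch below uses hypothetical names (`x_int`, `merged`) and simply compares the values and their hashes before building a set from them.\n",
    |     "\n",
    |     "```python\n",
    |     "x_int, x_float, x_str = 42, 42.0, '42'\n",
    |     "\n",
    |     "print(x_int == x_float)              # the int and the float compare equal\n",
    |     "print(hash(x_int) == hash(x_float))  # equal objects must also hash equally\n",
    |     "print(x_int == x_str)                # a number never equals a string\n",
    |     "\n",
    |     "# A set keeps one representative per group of equal elements\n",
    |     "merged = set([x_int, x_str, x_float])\n",
    |     "print(len(merged))\n",
    |     "```\n",
    |     "\n",
    |     "Because the integer is inserted first, it is the representative that the set keeps; the later, equal float is treated as a duplicate."
    |    ]
    |   },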
131 |   {
132 |    "cell_type": "markdown",
133 |    "metadata": {},
134 |    "source": [
135 |     "If you are confused, you can always use the **type()** function to figure out the type of the object you have:"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "code",
140 |    "execution_count": null,
141 |    "metadata": {
142 |     "collapsed": true
143 |    },
144 |    "outputs": [],
145 |    "source": [
146 |     "print ('42 as integer', x, \"variable type is\", type(x))\n",
147 |     "print ('42 as a float', x_float, \"variable type is\", type(x_float))\n",
148 |     "print ('42 as a float in scientific notation', x_scientific, \"variable type is\", type(x_scientific))\n",
149 |     "print ('42 as a string', x_str, \"variable type is\", type(x_str))"
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "markdown",
154 |    "metadata": {},
155 |    "source": [
156 |     "The **type()** function can be extremely useful while debugging. However, quite often a simple print and a little attention to what's printed is enough to figure out what's going on."
157 |    ]
158 |   },
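    |   {
    |    "cell_type": "markdown",
    |    "metadata": {},
    |    "source": [
    |     "The reason a string shows quotes inside a list but not on its own is that `print` uses the `str()` form of the object it is given, while containers build their text from the `repr()` form of each element. A minimal sketch (the name `shown` is made up for this example):\n",
    |     "\n",
    |     "```python\n",
    |     "x_str = '42'\n",
    |     "\n",
    |     "print(str(x_str))    # the form print uses for a bare string: no quotes\n",
    |     "print(repr(x_str))   # the form a container uses for its elements: quoted\n",
    |     "\n",
    |     "# The list's text is assembled from the repr() of each element\n",
    |     "shown = str([42, x_str])\n",
    |     "print(shown)\n",
    |     "```\n",
    |     "\n",
    |     "So when a printed value is ambiguous, printing `repr(value)` is a quick way to see whether it is a string."
    |    ]
    |   },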
159 |   {
160 |    "cell_type": "markdown",
161 |    "metadata": {},
162 |    "source": [
163 |     "Some objects print differently from what you might expect. For example, consider a **frozenset**, the built-in immutable version of a Python set."
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "code",
168 |    "execution_count": null,
169 |    "metadata": {
170 |     "collapsed": true
171 |    },
172 |    "outputs": [],
173 |    "source": [
174 |     "x_frozenset = frozenset(x_list)\n",
175 |     "print (\"Here's a set of 42:\\n\", x_set)\n",
176 |     "print (\"Here's a frozenset of 42:\\n\", x_frozenset)"
177 |    ]
178 |   },
179 |   {
180 |    "cell_type": "markdown",
181 |    "metadata": {},
182 |    "source": [
183 |     "As you see, a set and a frozenset look very different when printed. A frozenset, like lots of other objects in Python, includes its type name when you print it. That makes it really hard to confuse a set with a frozenset!"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "markdown",
188 |    "metadata": {},
189 |    "source": [
190 |     "If you want something similar for your own class, you can do it rather easily in Python. You just need to add a special `__str__` method to your custom class, which defines the string representation of your object."
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": null,
196 |    "metadata": {
197 |     "collapsed": true
198 |    },
199 |    "outputs": [],
200 |    "source": [
201 |     "class my_42:\n",
202 |     "    def __init__(self):\n",
203 |     "        self.n = 42\n",
204 |     "    def __str__(self):\n",
205 |     "        return 'Member of class my_42(' + str(self.n) + ')'\n",
206 |     "print ('Just 42:',42)\n",
207 |     "print ('New class:', my_42())"
208 |    ]
209 |   },
210 |   {
211 |    "cell_type": "markdown",
212 |    "metadata": {},
213 |    "source": [
214 |     "**Exercise.** Now let's put this knowledge into practice and play with a function that takes a list and returns exactly the same list if every element of the list is a string. Otherwise, it returns a new list with all non-string elements converted to strings. To avoid confusion, the function also returns a flag variable showing whether the list has been modified. Try adding some print statements to investigate the types of the elements in the list, how the elements are printed out, and what the whole list looks like before and after type conversion."
215 |    ]
216 |   },
217 |   {
218 |    "cell_type": "code",
219 |    "execution_count": null,
220 |    "metadata": {
221 |     "collapsed": true
222 |    },
223 |    "outputs": [],
224 |    "source": [
225 |     "def list_converter(l):\n",
226 |     "    \"\"\"\n",
227 |     "    l - input list\n",
228 |     "    \n",
229 |     "    Returns a list where all elements have been stringified\n",
230 |     "    as well as a flag to indicate if the list has been modified\n",
231 |     "    \"\"\"\n",
232 |     "    assert (type(l) == list)\n",
233 |     "    flag = False\n",
234 |     "    for el in l:\n",
235 |     "        # print the type of el\n",
236 |     "        if type(el) != str:\n",
237 |     "            flag = True\n",
238 |     "    if flag:\n",
239 |     "        new_list = []\n",
240 |     "        for el in l:\n",
241 |     "            # how would each element be printed out? what's the element type?\n",
242 |     "            new_list.append(str(el))\n",
243 |     "            # print what the new list looks like\n",
244 |     "        return new_list, flag\n",
245 |     "    else:\n",
246 |     "        return l, flag"
247 |    ]
248 |   },
249 |   {
250 |    "cell_type": "code",
251 |    "execution_count": null,
252 |    "metadata": {
253 |     "collapsed": true
254 |    },
255 |    "outputs": [],
256 |    "source": [
257 |     "# `list_converter_test`: Test cell\n",
258 |     "l = ['4', 8, 15, '16', 23, 42]\n",
259 |     "l_true = ['4', '8', '15', '16', '23', '42']\n",
260 |     "new_l, flag = list_converter(l)\n",
261 |     "\n",
262 |     "print (\"list_converter({}) -> {} [True: {}], new list flag is {}\".format(l, new_l, l_true, flag))\n",
263 |     "assert new_l == l_true\n",
264 |     "assert flag\n",
265 |     "\n",
266 |     "new_l, flag = list_converter(l_true)\n",
267 |     "print (\"list_converter({}) -> {} [True: {}], new list flag is {}\".format(l_true, new_l, l_true, flag))\n",
268 |     "assert new_l == l_true\n",
269 |     "assert not flag"
270 |    ]
271 |   }
272 |  ],
273 |  "metadata": {
274 |   "kernelspec": {
275 |    "display_name": "Python 3",
276 |    "language": "python",
277 |    "name": "python3"
278 |   },
279 |   "language_info": {
280 |    "codemirror_mode": {
281 |     "name": "ipython",
282 |     "version": 3
283 |    },
284 |    "file_extension": ".py",
285 |    "mimetype": "text/x-python",
286 |    "name": "python",
287 |    "nbconvert_exporter": "python",
288 |    "pygments_lexer": "ipython3",
289 |    "version": "3.5.2"
290 |   }
291 |  },
292 |  "nbformat": 4,
293 |  "nbformat_minor": 2
294 | }
295 | 


--------------------------------------------------------------------------------
/part1.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "nbgrader": {
  7 |      "grade": false,
  8 |      "grade_id": "cell-d9da995b9796eee6",
  9 |      "locked": true,
 10 |      "schema_version": 1,
 11 |      "solution": false
 12 |     }
 13 |    },
 14 |    "source": [
 15 |     "# Sample notebook: Part 1\n",
 16 |     "\n",
 17 |     "This notebook is Part 1 of two parts (Parts 0 and 1); in the computer science tradition, we number beginning at 0. Together, the two parts comprise an ungraded *lab notebook assignment* (or just *lab* or *assignment*). Although it is ungraded, use it as practice for completing and submitting an assignment."
 18 |    ]
 19 |   },
 20 |   {
 21 |    "cell_type": "markdown",
 22 |    "metadata": {},
 23 |    "source": [
 24 |     "## Getting input data\n",
 25 |     "\n",
 26 |     "Throughout the course, we'll use a variety of methods to get data for use in the notebook environment.\n",
 27 |     "\n",
 28 |     "One technique is to use [magic commands or shell commands](https://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained). These are code-like constructs that are specific to Jupyter but outside the base language (e.g., Python). They typically appear on lines of code prefixed by `!` or `%`.\n",
 29 |     "\n",
 30 |     "Here is an example that downloads a file containing a secret message.\n",
 31 |     "\n",
 32 |     "> This example is a *shell command*. It invokes a command-line utility called `curl` to do the download, which you can read more about [here](https://curl.haxx.se/docs/manpage.html)."
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": 1,
 38 |    "metadata": {
 39 |     "nbgrader": {
 40 |      "grade": false,
 41 |      "grade_id": "cell-4f9c6b528b06bfcc",
 42 |      "locked": true,
 43 |      "schema_version": 1,
 44 |      "solution": false
 45 |     }
 46 |    },
 47 |    "outputs": [
 48 |     {
 49 |      "name": "stdout",
 50 |      "output_type": "stream",
 51 |      "text": [
 52 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
 53 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
 54 |       "100   215  100   215    0     0    356      0 --:--:-- --:--:-- --:--:--   357\n",
 55 |       "\n",
 56 |       "=== Files in the current directory (from a shell command) ===\n",
 57 |       "\n",
 58 |       "total 40\n",
 59 |       "drwxrwx--- 6 ccc_v1_w_862db_41314                   48 4096 Aug 23 07:07 .\n",
 60 |       "drwxrwx--- 5 ccc_v1_w_862db_41314                   48 4096 Aug 23 06:34 ..\n",
 61 |       "-rw-rw-r-- 1 ccc_v1_w_862db_41314                   48   52 Aug 23 06:34 .gitconfig\n",
 62 |       "drwxr-xr-x 2 ccc_v1_w_862db_41314 ccc_v1_s79867__49148 4096 Aug 23 06:55 .ipynb_checkpoints\n",
 63 |       "drwxr-xr-x 5 ccc_v1_w_862db_41314 ccc_v1_s79867__49148 4096 Aug 23 06:54 .ipython\n",
 64 |       "drwx------ 3 ccc_v1_w_862db_41314 ccc_v1_s79867__49148 4096 Aug 23 06:54 .local\n",
 65 |       "-rw-r--r-- 1 ccc_v1_w_862db_41314 ccc_v1_s79867__49148  215 Aug 23 07:07 message_in_a_bottle.txt.zip\n",
 66 |       "-rwxrwx--- 1 ccc_v1_w_862db_41314                   48 4838 Aug 23 07:04 part1.ipynb\n",
 67 |       "dr-x------ 2 root                 root                 4096 Aug 23 06:34 .voc\n",
 68 |       "\n",
 69 |       "=== Files in the current directory (from Python) ===\n",
 70 |       "['part1.ipynb', '.gitconfig', '.ipython', '.local', '.voc', '.ipynb_checkpoints', 'message_in_a_bottle.txt.zip']\n"
 71 |      ]
 72 |     }
 73 |    ],
 74 |    "source": [
 75 |     "# Download:\n",
 76 |     "!curl -O https://cse6040.gatech.edu/datasets/message_in_a_bottle.txt.zip\n",
 77 |     "\n",
 78 |     "# Confirm (from shell):\n",
 79 |     "!echo && echo \"=== Files in the current directory (from a shell command) ===\" && echo && ls -al\n",
 80 |     "\n",
 81 |     "# Confirm (from Python):\n",
 82 |     "import os\n",
 83 |     "print(\"\\n=== Files in the current directory (from Python) ===\\n{}\".format(os.listdir('.')))"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "metadata": {
 89 |     "nbgrader": {
 90 |      "grade": false,
 91 |      "grade_id": "cell-1fc294f3ae6ef303",
 92 |      "locked": true,
 93 |      "schema_version": 1,
 94 |      "solution": false
 95 |     }
 96 |    },
 97 |    "source": [
 98 |     "**Exercise 0** (1 point). In the code cell below, create a variable named `filename` and initialize it to a string containing the name `message_in_a_bottle.txt.zip`. The test cell that follows will open this file, assuming it is available in the current working directory, unpack it, and then print its contents."
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "code",
103 |    "execution_count": 2,
104 |    "metadata": {
105 |     "collapsed": true,
106 |     "nbgrader": {
107 |      "grade": false,
108 |      "grade_id": "filename",
109 |      "locked": false,
110 |      "schema_version": 1,
111 |      "solution": true
112 |     }
113 |    },
114 |    "outputs": [],
115 |    "source": [
116 |     "uncompressed_name = 'message_in_a_bottle.txt'\n",
117 |     "compressed_extension = '.zip'\n",
118 |     "\n",
119 |     "#\n",
120 |     "# YOUR CODE HERE\n",
121 |     "#\n",
122 |     "filename = \"message_in_a_bottle.txt.zip\""
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": 3,
128 |    "metadata": {
129 |     "nbgrader": {
130 |      "grade": true,
131 |      "grade_id": "filename_test",
132 |      "locked": true,
133 |      "points": 1,
134 |      "schema_version": 1,
135 |      "solution": false
136 |     }
137 |    },
138 |    "outputs": [
139 |     {
140 |      "name": "stdout",
141 |      "output_type": "stream",
142 |      "text": [
143 |       "`filename`: 'message_in_a_bottle.txt.zip'\n",
144 |       "\n",
145 |       "=== BEGIN MESSAGE ===\n",
146 |       "Good luck, kiddos!\n",
147 |       "=== END MESSAGE ===\n"
148 |      ]
149 |     }
150 |    ],
151 |    "source": [
152 |     "# Test cell: `filename_test`\n",
153 |     "\n",
154 |     "print(\"`filename`: '{}'\".format(filename))\n",
155 |     "from zipfile import ZipFile\n",
156 |     "with ZipFile(filename, 'r') as input_zip:\n",
157 |     "    with input_zip.open(filename[:-4], 'r') as input_file:\n",
158 |     "        message = input_file.readline().decode('utf-8')\n",
159 |     "print(\"\\n=== BEGIN MESSAGE ===\\n{}=== END MESSAGE ===\".format(message))"
160 |    ]
161 |   },
162 |   {
163 |    "cell_type": "markdown",
164 |    "metadata": {
165 |     "nbgrader": {
166 |      "grade": false,
167 |      "grade_id": "cell-7b5dfffc0f015b87",
168 |      "locked": true,
169 |      "schema_version": 1,
170 |      "solution": false
171 |     }
172 |    },
173 |    "source": [
174 |     "This is the end of Part 1. If everything seems to have worked, try submitting it!"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "code",
179 |    "execution_count": null,
180 |    "metadata": {
181 |     "collapsed": true
182 |    },
183 |    "outputs": [],
184 |    "source": []
185 |   }
186 |  ],
187 |  "metadata": {
188 |   "celltoolbar": "Create Assignment",
189 |   "kernelspec": {
190 |    "display_name": "Python 3",
191 |    "language": "python",
192 |    "name": "python3"
193 |   },
194 |   "language_info": {
195 |    "codemirror_mode": {
196 |     "name": "ipython",
197 |     "version": 3
198 |    },
199 |    "file_extension": ".py",
200 |    "mimetype": "text/x-python",
201 |    "name": "python",
202 |    "nbconvert_exporter": "python",
203 |    "pygments_lexer": "ipython3",
204 |    "version": "3.5.2"
205 |   }
206 |  },
207 |  "nbformat": 4,
208 |  "nbformat_minor": 2
209 | }
210 | 


--------------------------------------------------------------------------------