├── .DS_Store
├── NYC-311-2M.db
├── Notebook0-basics.ipynb
├── Notebook10_NumpyMatrix_part2.ipynb
├── Notebook10_NumpyMatrix_part3.ipynb
├── Notebook10_NumpyScipy_part1.ipynb
├── Notebook11_MarkovChainAnalysis.ipynb
├── Notebook11_MarkovChainAnalysis_Ranking.ipynb
├── Notebook12_Algoriths_LinearLeastSq_part3.ipynb
├── Notebook12_Cost_part3.ipynb
├── Notebook12_Gradients_part2.ipynb
├── Notebook12_LinReg_ModelFitting_part1.ipynb
├── Notebook12_LinerRegNotes.ipynb
├── Notebook12_PerturbationTheory.ipynb
├── Notebook12_ThetaLMS.ipynb
├── Notebook13_LogisticRegression.ipynb
├── Notebook13_NumpyMatrixManipulation.ipynb
├── Notebook14_KMeans_Clustering.ipynb
├── Notebook15_Compression_via_PCA&SVD.ipynb
├── Notebook15_SVDimage_compression.ipynb
├── Notebook16_EiganFaces.ipynb
├── Notebook1_part1-collections.ipynb
├── Notebook1_part2-more_exercises.ipynb
├── Notebook4_part1_floating_points.ipynb
├── Notebook4_part2_floatingPoints.ipynb
├── Notebook5_part1_RegEx.ipynb
├── Notebook5_part2_RegExYelp.ipynb
├── Notebook5_part3_RegEx_hard.ipynb
├── Notebook6_part1_web_mining.ipynb
├── Notebook6_part2_BeautifulSoup.ipynb
├── Notebook6_part3_webMiningAPIs.ipynb
├── Notebook7_Pandas_TidyData.ipynb
├── Notebook8_bokeh_seaborn.ipynb
├── Notebook9_SQL_Part2.ipynb
├── Notebook9_SQL_relational_DBs.ipynb
├── README.md
├── Supplemental_notebook.ipynb
└── part1.ipynb

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maicat11/Computing-for-Data-Analysis/401be39dc4058c1e3204fd01b49dbede5e7853f1/.DS_Store

--------------------------------------------------------------------------------
/NYC-311-2M.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/maicat11/Computing-for-Data-Analysis/401be39dc4058c1e3204fd01b49dbede5e7853f1/NYC-311-2M.db

--------------------------------------------------------------------------------
/Notebook0-basics.ipynb:
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "nbgrader": { 7 | "grade": false, 8 | "grade_id": "cell-166d0b0bd7d2633f", 9 | "locked": true, 10 | "schema_version": 1, 11 | "solution": false 12 | } 13 | }, 14 | "source": [ 15 | "# Python review: Values, variables, types, lists, and strings\n", 16 | "\n", 17 | "These first few notebooks are a set of exercises with two goals:\n", 18 | "\n", 19 | "1. Review the basics of Python\n", 20 | "2. Familiarize you with Jupyter\n", 21 | "\n", 22 | "Regarding the first goal, these initial notebooks cover material we think you should already know from [Chris Simpkins's](https://www.cc.gatech.edu/~simpkins/) [Python Bootcamp](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/syllabus.html). It is based specifically on his offering to incoming students of the Georgia Tech MS Analytics in [Fall 2016](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/august2016.html).\n", 23 | "\n", 24 | "Regarding the second goal, you'll observe that the bootcamp has each student install and work directly with the Python interpreter, which runs locally on his or her machine (e.g., see [Slide 5 of Chris's intro](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/slides/intro-python.html)). But in this course, we are using Jupyter Notebooks as the development environment. You can think of a Jupyter notebook as a web-based \"skin\" for running a Python interpreter---possibly hosted on a remote server, which is the case in this course. Here is a good tutorial on [Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "nbgrader": { 31 | "grade": false, 32 | "grade_id": "cell-9c012c4d7a197d7d", 33 | "locked": true, 34 | "schema_version": 1, 35 | "solution": false 36 | } 37 | }, 38 | "source": [ 39 | "> **Note for [OMSA](https://pe.gatech.edu/master-science-degrees/online-master-science-analytics) students.** In this course we assume you are using [Vocareum's deployment](https://www.vocareum.com/) of Jupyter. You also have an option to use other Jupyter environments, including installing and running Jupyter on your own system. We can't provide technical support to you if you choose to go those routes, but if you'd like to do that anyway, we recommend [Microsoft Azure Notebooks](https://notebooks.azure.com/) as a web-hosted option, which we use in the on-campus class, or the Continuum Analytics [Anaconda distribution](https://www.continuum.io/downloads) as a locally installed option." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": { 45 | "nbgrader": { 46 | "grade": false, 47 | "grade_id": "cell-92648f77b2c73f26", 48 | "locked": true, 49 | "schema_version": 1, 50 | "solution": false 51 | } 52 | }, 53 | "source": [ 54 | "**Study hint: Read the test code!** You'll notice that most of the exercises below have a place for you to code up your answer followed by a \"test cell.\" That's a code cell that checks the output of your code to see whether it appears to produce correct results. You can often learn a lot by reading the test code. In fact, sometimes it gives you a hint about how to approach the problem. As such, we encourage you to try to read the test cells even if they seem cryptic, which is deliberate!" 
55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "nbgrader": { 61 | "grade": false, 62 | "grade_id": "cell-9a91d97e7aa9c67f", 63 | "locked": true, 64 | "schema_version": 1, 65 | "solution": false 66 | } 67 | }, 68 | "source": [ 69 | "**Exercise 0** (1 point). Run the code cell below. It should display the output string, `Hello, world!`." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 1, 75 | "metadata": { 76 | "nbgrader": { 77 | "grade": true, 78 | "grade_id": "hello_world_test", 79 | "locked": true, 80 | "points": 1, 81 | "schema_version": 1, 82 | "solution": false 83 | } 84 | }, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Hello, world!\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "print(\"Hello, world!\")" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "nbgrader": { 102 | "grade": false, 103 | "grade_id": "cell-2de1352e57946ac5", 104 | "locked": true, 105 | "schema_version": 1, 106 | "solution": false 107 | } 108 | }, 109 | "source": [ 110 | "**Exercise 1** (`x_float_test`: 1 point). Create a variable named `x_float` whose numerical value is one (1) and whose type is *floating-point*." 
111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 2, 116 | "metadata": { 117 | "collapsed": true, 118 | "nbgrader": { 119 | "grade": false, 120 | "grade_id": "x_float", 121 | "locked": false, 122 | "schema_version": 1, 123 | "solution": true 124 | } 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "#\n", 129 | "# YOUR CODE HERE\n", 130 | "#\n", 131 | "x_float = float(1)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 3, 137 | "metadata": { 138 | "nbgrader": { 139 | "grade": true, 140 | "grade_id": "x_float_test", 141 | "locked": true, 142 | "points": 1, 143 | "schema_version": 1, 144 | "solution": false 145 | } 146 | }, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "\n", 153 | "(Passed!)\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "# `x_float_test`: Test cell\n", 159 | "assert x_float == 1\n", 160 | "assert type(x_float) is float\n", 161 | "print(\"\\n(Passed!)\")" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": { 167 | "nbgrader": { 168 | "grade": false, 169 | "grade_id": "cell-2b53cd92e2ac58a1", 170 | "locked": true, 171 | "schema_version": 1, 172 | "solution": false 173 | } 174 | }, 175 | "source": [ 176 | "**Exercise 2** (`strcat_ba_test`: 1 point). Complete the following function, `strcat_ba(a, b)`, so that given two strings, `a` and `b`, it returns the concatenation of `b` followed by `a` (pay attention to the order in these instructions!)." 
177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 4, 182 | "metadata": { 183 | "collapsed": true, 184 | "nbgrader": { 185 | "grade": false, 186 | "grade_id": "strcat_ba", 187 | "locked": false, 188 | "schema_version": 1, 189 | "solution": true 190 | } 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "def strcat_ba(a, b):\n", 195 | " assert type(a) is str\n", 196 | " assert type(b) is str\n", 197 | "#\n", 198 | "# YOUR CODE HERE\n", 199 | "#\n", 200 | " return b + a" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 5, 206 | "metadata": { 207 | "nbgrader": { 208 | "grade": true, 209 | "grade_id": "strcat_ba_test", 210 | "locked": true, 211 | "points": 1, 212 | "schema_version": 1, 213 | "solution": false 214 | } 215 | }, 216 | "outputs": [ 217 | { 218 | "name": "stdout", 219 | "output_type": "stream", 220 | "text": [ 221 | "strcat_ba(\"ofrpy\", \"kek\") == \"kekofrpy\"\n", 222 | "\n", 223 | "(Passed!)\n" 224 | ] 225 | } 226 | ], 227 | "source": [ 228 | "# `strcat_ba_test`: Test cell\n", 229 | "\n", 230 | "# Workaround: # Python 3.5.2 does not have `random.choices()` (available in 3.6+)\n", 231 | "def random_letter():\n", 232 | " from random import choice\n", 233 | " return choice('abcdefghijklmnopqrstuvwxyz')\n", 234 | "\n", 235 | "def random_string(n, fun=random_letter):\n", 236 | " return ''.join([str(fun()) for _ in range(n)])\n", 237 | "\n", 238 | "a = random_string(5)\n", 239 | "b = random_string(3)\n", 240 | "c = strcat_ba(a, b)\n", 241 | "print('strcat_ba(\"{}\", \"{}\") == \"{}\"'.format(a, b, c))\n", 242 | "assert len(c) == len(a) + len(b)\n", 243 | "assert c[:len(b)] == b\n", 244 | "assert c[-len(a):] == a\n", 245 | "print(\"\\n(Passed!)\")" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": { 251 | "nbgrader": { 252 | "grade": false, 253 | "grade_id": "cell-75fe9a45cd6b3f9a", 254 | "locked": true, 255 | "schema_version": 1, 256 | "solution": false 257 | } 258 | }, 259 | 
"source": [ 260 | "**Exercise 3** (`strcat_list_test`: 2 points). Complete the following function, `strcat_list(L)`, which generalizes the previous function: given a *list* of strings, `L[:]`, returns the concatenation of the strings in reverse order. For example:\n", 261 | "\n", 262 | "```python\n", 263 | " strcat_list(['abc', 'def', 'ghi']) == 'ghidefabc'\n", 264 | "```" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": { 271 | "collapsed": true, 272 | "nbgrader": { 273 | "grade": false, 274 | "grade_id": "strcat_list", 275 | "locked": false, 276 | "schema_version": 1, 277 | "solution": true 278 | } 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "def strcat_list(L):\n", 283 | " assert type(L) is list\n", 284 | "#\n", 285 | "# YOUR CODE HERE\n", 286 | "#\n", 287 | " rev_cat = ''\n", 288 | " for item in reversed(L):\n", 289 | " rev_cat += item\n", 290 | " return rev_cat" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 9, 296 | "metadata": { 297 | "nbgrader": { 298 | "grade": true, 299 | "grade_id": "strcat_list_test", 300 | "locked": true, 301 | "points": 2, 302 | "schema_version": 1, 303 | "solution": false 304 | } 305 | }, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "L == ['qsg', 'jul', 'mnp', 'jro', 'nws', 'pzq']\n", 312 | "strcat_list(L) == 'pzqnwsjromnpjulqsg'\n", 313 | "\n", 314 | "(Passed!)\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "# `strcat_list_test`: Test cell\n", 320 | "n = 3\n", 321 | "nL = 6\n", 322 | "L = [random_string(n) for _ in range(nL)]\n", 323 | "Lc = strcat_list(L)\n", 324 | "\n", 325 | "print('L == {}'.format(L))\n", 326 | "print('strcat_list(L) == \\'{}\\''.format(Lc))\n", 327 | "assert all([Lc[i*n:(i+1)*n] == L[nL-i-1] for i, x in zip(range(nL), L)])\n", 328 | "print(\"\\n(Passed!)\")" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": { 334 | "nbgrader": { 335 | 
"grade": false, 336 | "grade_id": "cell-06e37dd37379b6b8", 337 | "locked": true, 338 | "schema_version": 1, 339 | "solution": false 340 | } 341 | }, 342 | "source": [ 343 | "**Exercise 4** (`floor_fraction_test`: 1 point). Suppose you are given two variables, `a` and `b`, whose values are the real numbers, $a \\geq 0$ (non-negative) and $b > 0$ (positive). Complete the function, `floor_fraction(a, b)` so that it returns $\\left\\lfloor\\frac{a}{b}\\right\\rfloor$, that is, the *floor* of $\\frac{a}{b}$. The *type* of the returned value must be `int` (an integer)." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": { 350 | "collapsed": true, 351 | "nbgrader": { 352 | "grade": false, 353 | "grade_id": "floor_fraction", 354 | "locked": false, 355 | "schema_version": 1, 356 | "solution": true 357 | } 358 | }, 359 | "outputs": [], 360 | "source": [ 361 | "def is_number(x):\n", 362 | " \"\"\"Returns `True` if `x` is a number-like type, e.g., `int`, `float`, `Decimal()`, ...\"\"\"\n", 363 | " from numbers import Number\n", 364 | " return isinstance(x, Number)\n", 365 | " \n", 366 | "def floor_fraction(a, b):\n", 367 | " assert is_number(a) and a >= 0\n", 368 | " assert is_number(b) and b > 0\n", 369 | "#\n", 370 | "# YOUR CODE HERE\n", 371 | "#\n", 372 | " return int(a/b)\n", 373 | " " 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 13, 379 | "metadata": { 380 | "nbgrader": { 381 | "grade": true, 382 | "grade_id": "floor_fraction_test", 383 | "locked": true, 384 | "points": 1, 385 | "schema_version": 1, 386 | "solution": false 387 | } 388 | }, 389 | "outputs": [ 390 | { 391 | "name": "stdout", 392 | "output_type": "stream", 393 | "text": [ 394 | "floor_fraction(0.9805045805759878, 0.40871852802407027) == floor(2.398972675195786) == 2\n", 395 | "\n", 396 | "(Passed!)\n" 397 | ] 398 | } 399 | ], 400 | "source": [ 401 | "# `floor_fraction_test`: Test cell\n", 402 | "from random import random\n", 403 
| "a = random()\n", 404 | "b = random()\n", 405 | "c = floor_fraction(a, b)\n", 406 | "\n", 407 | "print('floor_fraction({}, {}) == floor({}) == {}'.format(a, b, a/b, c))\n", 408 | "assert b*c <= a <= b*(c+1)\n", 409 | "assert type(c) is int\n", 410 | "print('\\n(Passed!)')" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": { 416 | "nbgrader": { 417 | "grade": false, 418 | "grade_id": "cell-e98590d39e95bc25", 419 | "locked": true, 420 | "schema_version": 1, 421 | "solution": false 422 | } 423 | }, 424 | "source": [ 425 | "**Exercise 5** (`ceiling_fraction_test`: 1 point). Complete the function, `ceiling_fraction(a, b)`, which for any numeric inputs, `a` and `b`, corresponding to real numbers, $a \\geq 0$ and $b > 0$, returns $\\left\\lceil\\frac{a}{b}\\right\\rceil$, that is, the *ceiling* of $\\frac{a}{b}$. The type of the returned value must be `int`." 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 14, 431 | "metadata": { 432 | "collapsed": true, 433 | "nbgrader": { 434 | "grade": false, 435 | "grade_id": "ceiling_fraction", 436 | "locked": false, 437 | "schema_version": 1, 438 | "solution": true 439 | } 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "def ceiling_fraction(a, b):\n", 444 | " assert is_number(a) and a >= 0\n", 445 | " assert is_number(b) and b > 0\n", 446 | "#\n", 447 | "# YOUR CODE HERE\n", 448 | "#\n", 449 | " return int(a // b) + (a % b > 0) # floor, plus 1 only when there is a remainder" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 15, 455 | "metadata": { 456 | "nbgrader": { 457 | "grade": true, 458 | "grade_id": "ceiling_fraction_test", 459 | "locked": true, 460 | "points": 1, 461 | "schema_version": 1, 462 | "solution": false 463 | } 464 | }, 465 | "outputs": [ 466 | { 467 | "name": "stdout", 468 | "output_type": "stream", 469 | "text": [ 470 | "ceiling_fraction(0.8979882736065053, 0.8324033074774889) == ceiling(1.0787898913181457) == 2\n", 471 | "\n", 472 | "(Passed!)\n" 473 | ] 474 | } 475 | ], 476 | 
"source": [ 477 | "# `ceiling_fraction_test`: Test cell\n", 478 | "from random import random\n", 479 | "a = random()\n", 480 | "b = random()\n", 481 | "c = ceiling_fraction(a, b)\n", 482 | "print('ceiling_fraction({}, {}) == ceiling({}) == {}'.format(a, b, a/b, c))\n", 483 | "assert b*(c-1) <= a <= b*c\n", 484 | "assert type(c) is int\n", 485 | "print(\"\\n(Passed!)\")" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "**Exercise 6** (`report_exam_avg_test`: 1 point). Let `a`, `b`, and `c` represent three exam scores as numerical values. Complete the function, `report_exam_avg(a, b, c)` so that it computes the average score (equally weighted) and returns the string, `'Your average score: XX'`, where `XX` is the average rounded to one decimal place. For example:\n", 493 | "\n", 494 | "```python\n", 495 | " report_exam_avg(100, 95, 80) == 'Your average score: 91.7'\n", 496 | "```" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 27, 502 | "metadata": { 503 | "collapsed": true, 504 | "nbgrader": { 505 | "grade": false, 506 | "grade_id": "cell-a117d6495f42b850", 507 | "locked": false, 508 | "schema_version": 1, 509 | "solution": true 510 | } 511 | }, 512 | "outputs": [], 513 | "source": [ 514 | "def report_exam_avg(a, b, c):\n", 515 | " assert is_number(a) and is_number(b) and is_number(c)\n", 516 | "#\n", 517 | "# YOUR CODE HERE\n", 518 | "#\n", 519 | " avg_score = round(((a + b + c) / 3), 1)\n", 520 | " return \"Your average score: {}\".format(avg_score)" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 29, 526 | "metadata": { 527 | "nbgrader": { 528 | "grade": true, 529 | "grade_id": "report_exam_avg_test", 530 | "locked": true, 531 | "points": 1, 532 | "schema_version": 1, 533 | "solution": false 534 | } 535 | }, 536 | "outputs": [ 537 | { 538 | "name": "stdout", 539 | "output_type": "stream", 540 | "text": [ 541 | "Your average score: 91.7\n", 542 | 
"Checking some additional randomly generated cases:\n", 543 | "48.33258245396376, 50.442426025062744, 91.75699870990516 -> 'Your average score: 63.5' [0.010669062977219331]\n", 544 | "35.4641280297894, 63.36452071896817, 19.903296253292147 -> 'Your average score: 39.6' [0.022684999316766152]\n", 545 | "12.712685343435181, 87.91783899424385, 20.58290036382341 -> 'Your average score: 40.4' [0.004474900500819483]\n", 546 | "95.8381834385127, 63.40664348059859, 45.20335845795377 -> 'Your average score: 68.1' [0.04939512568835388]\n", 547 | "6.814313089670787, 84.59381135235066, 90.7578727720895 -> 'Your average score: 60.7' [0.021999071370307394]\n", 548 | "93.50308463838849, 69.21269551706179, 13.635338619391979 -> 'Your average score: 58.8' [0.016293741719247617]\n", 549 | "87.65740049353373, 99.51004413046226, 5.788171656622643 -> 'Your average score: 64.3' [0.01853876020622162]\n", 550 | "7.674497245314571, 58.43286389008937, 71.55717716660797 -> 'Your average score: 45.9' [0.011820565996022955]\n", 551 | "36.842555880064786, 67.78010593581831, 49.56061896877766 -> 'Your average score: 51.4' [0.005573071779735983]\n", 552 | "98.67836876468702, 84.33581586521609, 75.29298941927082 -> 'Your average score: 86.1' [0.002391349724670514]\n", 553 | "\n", 554 | "(Passed!)\n" 555 | ] 556 | } 557 | ], 558 | "source": [ 559 | "# `report_exam_avg_test`: Test cell\n", 560 | "msg = report_exam_avg(100, 95, 80)\n", 561 | "print(msg)\n", 562 | "assert msg == 'Your average score: 91.7'\n", 563 | "\n", 564 | "print(\"Checking some additional randomly generated cases:\")\n", 565 | "for _ in range(10):\n", 566 | " ex1 = random() * 100\n", 567 | " ex2 = random() * 100\n", 568 | " ex3 = random() * 100\n", 569 | " msg = report_exam_avg(ex1, ex2, ex3)\n", 570 | " ex_rounded_avg = float(msg.split()[-1])\n", 571 | " abs_err = abs(ex_rounded_avg*3 - (ex1 + ex2 + ex3)) / 3\n", 572 | " print(\"{}, {}, {} -> '{}' [{}]\".format(ex1, ex2, ex3, msg, abs_err))\n", 573 | " assert abs_err <= 0.05\n", 
574 | "\n", 575 | "print(\"\\n(Passed!)\")" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": { 581 | "nbgrader": { 582 | "grade": false, 583 | "grade_id": "cell-24a78862d8e3bba0", 584 | "locked": true, 585 | "schema_version": 1, 586 | "solution": false 587 | } 588 | }, 589 | "source": [ 590 | "**Exercise 7** (`count_word_lengths_test`: 2 points). Write a function `count_word_lengths(s)` that, given a string consisting of words separated by spaces, returns a list containing the length of each word. Words will consist of lowercase alphabetic characters, and they may be separated by multiple consecutive spaces. If a string is empty or has no spaces, the function should return an empty list.\n", 591 | "\n", 592 | "For instance, in this code sample,\n", 593 | "\n", 594 | "```python\n", 595 | " count_word_lengths('the quick brown fox jumped over the lazy dog') == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n", 596 | "```\n", 597 | "\n", 598 | "the input string consists of nine (9) words whose respective lengths are shown in the list." 
599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": 35, 604 | "metadata": { 605 | "nbgrader": { 606 | "grade": false, 607 | "grade_id": "count_word_lengths", 608 | "locked": false, 609 | "schema_version": 1, 610 | "solution": true 611 | } 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "def count_word_lengths(s):\n", 616 | " assert all([x.isalpha() or x == ' ' for x in s])\n", 617 | " assert type(s) is str\n", 618 | "#\n", 619 | "# YOUR CODE HERE\n", 620 | "#\n", 621 | " split_string = s.split(' ')\n", 622 | " list_lengths = [len(x) for x in split_string if x != '']\n", 623 | " if len(list_lengths) == 1:\n", 624 | " return []\n", 625 | " else:\n", 626 | " return list_lengths\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 36, 632 | "metadata": { 633 | "nbgrader": { 634 | "grade": true, 635 | "grade_id": "count_word_lengths_test", 636 | "locked": true, 637 | "points": 2, 638 | "schema_version": 1, 639 | "solution": false 640 | } 641 | }, 642 | "outputs": [ 643 | { 644 | "name": "stdout", 645 | "output_type": "stream", 646 | "text": [ 647 | "Test 1: count_word_lengths('the quick brown fox jumped over the lazy dog') == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n", 648 | "Test 2: count_word_lengths('lfslpf ib dxjejrqc hlhu lxretibwcksunx') == '[6, 2, 8, 4, 14]'\n", 649 | " => 'lfslpf'\n", 650 | " => 'ib'\n", 651 | " => 'dxjejrqc'\n", 652 | " => 'hlhu'\n", 653 | " => 'lxretibwcksunx'\n", 654 | "Test 3: Empty strings...\n", 655 | "\n", 656 | "(Passed!)\n" 657 | ] 658 | } 659 | ], 660 | "source": [ 661 | "# `count_word_lengths_test`: Test cell\n", 662 | "\n", 663 | "# Test 1: Example\n", 664 | "qbf_str = 'the quick brown fox jumped over the lazy dog'\n", 665 | "qbf_lens = count_word_lengths(qbf_str)\n", 666 | "print(\"Test 1: count_word_lengths('{}') == {}\".format(qbf_str, qbf_lens))\n", 667 | "assert qbf_lens == [3, 5, 5, 3, 6, 4, 3, 4, 3]\n", 668 | "\n", 669 | "# Test 2: Random strings\n", 670 | "from random import 
choice # 3.5.2 does not have `choices()` (available in 3.6+)\n", 671 | "#return ''.join([choice('abcdefghijklmnopqrstuvwxyz') for _ in range(n)])\n", 672 | "\n", 673 | "def random_letter_or_space(pr_space=0.15):\n", 674 | " from random import choice, random\n", 675 | " is_space = (random() <= pr_space)\n", 676 | " if is_space:\n", 677 | " return ' '\n", 678 | " return random_letter()\n", 679 | "\n", 680 | "S_LEN = 40\n", 681 | "W_SPACE = 1 / 6\n", 682 | "rand_str = random_string(S_LEN, fun=random_letter_or_space)\n", 683 | "rand_lens = count_word_lengths(rand_str)\n", 684 | "print(\"Test 2: count_word_lengths('{}') == '{}'\".format(rand_str, rand_lens))\n", 685 | "c = 0\n", 686 | "while c < len(rand_str) and rand_str[c] == ' ':\n", 687 | " c += 1\n", 688 | "for k in rand_lens:\n", 689 | " print(\" => '{}'\".format (rand_str[c:c+k]))\n", 690 | " assert (c+k) == len(rand_str) or rand_str[c+k] == ' '\n", 691 | " c += k\n", 692 | " while c < len(rand_str) and rand_str[c] == ' ':\n", 693 | " c += 1\n", 694 | " \n", 695 | "# Test 3: Empty string\n", 696 | "print(\"Test 3: Empty strings...\")\n", 697 | "assert count_word_lengths('') == []\n", 698 | "assert count_word_lengths(' ') == []\n", 699 | "\n", 700 | "print(\"\\n(Passed!)\")" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [] 709 | } 710 | ], 711 | "metadata": { 712 | "celltoolbar": "Create Assignment", 713 | "kernelspec": { 714 | "display_name": "Python 3", 715 | "language": "python", 716 | "name": "python3" 717 | }, 718 | "language_info": { 719 | "codemirror_mode": { 720 | "name": "ipython", 721 | "version": 3 722 | }, 723 | "file_extension": ".py", 724 | "mimetype": "text/x-python", 725 | "name": "python", 726 | "nbconvert_exporter": "python", 727 | "pygments_lexer": "ipython3", 728 | "version": "3.5.2" 729 | } 730 | }, 731 | "nbformat": 4, 732 | "nbformat_minor": 2 733 | } 734 | 
-------------------------------------------------------------------------------- /Notebook10_NumpyMatrix_part2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": false, 7 | "nbgrader": { 8 | "grade": false, 9 | "grade_id": "cell-467487ba59ea183e", 10 | "locked": true, 11 | "schema_version": 1, 12 | "solution": false 13 | } 14 | }, 15 | "source": [ 16 | "# Part 2: Dense matrix storage \n", 17 | "This part of the lab is a brief introduction to efficient storage of matrices." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "**Exercise 0** (ungraded). Import Numpy!" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 3, 30 | "metadata": { 31 | "collapsed": true, 32 | "nbgrader": { 33 | "grade": false, 34 | "grade_id": "cell-4263c0d16078cf0a", 35 | "locked": true, 36 | "schema_version": 1, 37 | "solution": false 38 | } 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "import numpy as np" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Dense matrix storage: Column-major versus row-major layouts\n", 50 | "\n", 51 | "For linear algebra, we will be especially interested in 2-D arrays, which we will use to store matrices. For this common case, there is a subtle performance issue related to how matrices are stored in memory.\n", 52 | "\n", 53 | "By way of background, physical storage---whether it be memory or disk---is basically one big array. And because of how physical storage is implemented, it turns out that it is much faster to access consecutive elements in memory than, say, to jump around randomly." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "A matrix is a two-dimensional object. 
Thus, when it is stored in memory, it must be mapped in some way to the one-dimensional physical array. There are many possible mappings, but the two most common conventions are known as the _column-major_ and _row-major_ layouts:\n", 61 | "\n", 62 | "*(Figure omitted: illustration of the column-major and row-major layouts.)*" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "**Exercise 1** (2 points). Let $A$ be an $m \\times n$ matrix stored in column-major format. Let $B$ be an $m \\times n$ matrix stored in row-major format.\n", 70 | "\n", 71 | "Based on the preceding discussion, recall that these objects will be mapped to 1-D arrays of length $mn$, behind the scenes. Let's call the 1-D array representations $\\hat{A}$ and $\\hat{B}$. Thus, the $(i, j)$ element of $A$, $a_{ij}$, will map to some element $\\hat{a}_u$ of $\\hat{A}$; similarly, $b_{ij}$ will map to some element $\\hat{b}_v$ of $\\hat{B}$.\n", 72 | "\n", 73 | "Determine formulae to compute the 1-D index values, $u$ and $v$, in terms of $\\{i, j, m, n\\}$. Assume that all indices are 0-based, i.e., $0 \\leq i \\leq m-1$, $0 \\leq j \\leq n-1$, and $0 \\leq u, v \\leq mn-1$." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 2, 79 | "metadata": { 80 | "collapsed": true, 81 | "deletable": false, 82 | "nbgrader": { 83 | "checksum": "e628bb9dc9f0e8a68ad52ba1d43caca4", 84 | "grade": false, 85 | "grade_id": "calc_u", 86 | "locked": false, 87 | "schema_version": 1, 88 | "solution": true 89 | } 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "def linearize_colmajor(i, j, m, n): # calculate `u`\n", 94 | " \"\"\"\n", 95 | " Returns the linear index for the `(i, j)` entry of\n", 96 | " an `m`-by-`n` matrix stored in column-major order.\n", 97 | " \"\"\"\n", 98 | " # YOUR CODE HERE\n", 99 | " # colmajor_linear_index \n", 100 | " u = i + m * j\n", 101 | " return u\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 3, 107 | "metadata": { 108 | "collapsed": true, 109 | "deletable": false, 110 | "nbgrader": { 111 | "checksum": "a7ca53a5a658c1c36cbdf1a3ef1a03d9", 112 | "grade": false, 113 | "grade_id": "calc_v", 114 | "locked": false, 115 | "schema_version": 1, 116 | "solution": true 117 | } 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "def linearize_rowmajor(i, j, m, n): # calculate `v`\n", 122 | " \"\"\"\n", 123 | " Returns the linear index for the `(i, j)` entry of\n", 124 | " an `m`-by-`n` matrix stored in row-major order.\n", 125 | " \"\"\"\n", 126 | " # YOUR CODE HERE\n", 127 | " # rowmajor_linear_index\n", 128 | " v = i * n + j\n", 129 | " return v" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 4, 135 | "metadata": { 136 | "deletable": false, 137 | "nbgrader": { 138 | "checksum": "917a5dd8ab9f91bc5fbae92435acd42d", 139 | "grade": true, 140 | "grade_id": "calc_uv_test", 141 | "locked": true, 142 | "points": 2, 143 | "schema_version": 1, 144 | "solution": false 145 | } 146 | }, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "(Passed.)\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "# Test cell: 
`calc_uv_test`\n", 158 | "\n", 159 | "# Quick check (not exhaustive):\n", 160 | "assert linearize_colmajor(7, 4, 10, 20) == 47\n", 161 | "assert linearize_rowmajor(7, 4, 10, 20) == 144\n", 162 | "\n", 163 | "assert linearize_colmajor(10, 8, 86, 26) == 698\n", 164 | "assert linearize_rowmajor(10, 8, 86, 26) == 268\n", 165 | "\n", 166 | "assert linearize_colmajor(8, 34, 17, 40) == 586\n", 167 | "assert linearize_rowmajor(8, 34, 17, 40) == 354\n", 168 | "\n", 169 | "assert linearize_colmajor(32, 48, 37, 55) == 1808\n", 170 | "assert linearize_rowmajor(32, 48, 37, 55) == 1808\n", 171 | "\n", 172 | "assert linearize_colmajor(24, 33, 57, 87) == 1905\n", 173 | "assert linearize_rowmajor(24, 33, 57, 87) == 2121\n", 174 | "\n", 175 | "assert linearize_colmajor(10, 3, 19, 74) == 67\n", 176 | "assert linearize_rowmajor(10, 3, 19, 74) == 743\n", 177 | "\n", 178 | "print (\"(Passed.)\")" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Requesting a layout in Numpy\n", 186 | "\n", 187 | "In Numpy, you can ask for either layout. The default in Numpy is row-major.\n", 188 | "\n", 189 | "Historically, numerical linear algebra libraries were developed assuming column-major layout. This layout happens to be the default when you declare a 2-D array in the Fortran programming language. By contrast, in the C and C++ programming languages, the default convention for a 2-D array is row-major layout. So the Numpy default is the C/C++ convention.\n", 190 | "\n", 191 | "In your programs, you can request either order from Numpy using the `order` parameter. For linear algebra operations, which are common, we recommend using the column-major convention.\n", 192 | "\n", 193 | "In either case, here is how you would create column- and row-major matrices." 
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 4, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "n = 5000\n", 203 | "A_colmaj = np.ones((n, n), order='F') # column-major (Fortran convention)\n", 204 | "A_rowmaj = np.ones((n, n), order='C') # row-major (C/C++ convention)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "**Exercise 2** (1 point). Given a matrix $A$, write a function that scales each column, $A(:, j)$ by $j$. Then compare the speed of applying that function to matrices in row and column major order." 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 1, 217 | "metadata": { 218 | "collapsed": true, 219 | "deletable": false, 220 | "nbgrader": { 221 | "checksum": "8abc750df11036d09bd1e787452d2682", 222 | "grade": false, 223 | "grade_id": "scale_colwise", 224 | "locked": false, 225 | "schema_version": 1, 226 | "solution": true 227 | } 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "def scale_colwise(A):\n", 232 | " \"\"\"Given a Numpy matrix `A`, visits each column `A[:, j]`\n", 233 | " and scales it by `j`.\"\"\"\n", 234 | " assert type(A) is np.ndarray\n", 235 | " \n", 236 | " n_cols = A.shape[1] # number of columns\n", 237 | " # YOUR CODE HERE\n", 238 | " # A = n_cols * A # my code...not matrix notation\n", 239 | " \n", 240 | " # their answer\n", 241 | " for j in range(n_cols):\n", 242 | " A[:,j] *= j\n", 243 | " \n", 244 | " return A" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 5, 250 | "metadata": { 251 | "nbgrader": { 252 | "grade": true, 253 | "grade_id": "scale_colwise_test", 254 | "locked": true, 255 | "points": 1, 256 | "schema_version": 1, 257 | "solution": false 258 | } 259 | }, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | "120 ms ± 109 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n", 266 | "31.9 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 267 | ] 268 | } 269 | ], 270 | "source": [ 271 | "# Test (timing) cell: `scale_colwise_test`\n", 272 | "\n", 273 | "# Measure time to scale a row-major input column-wise\n", 274 | "%timeit scale_colwise(A_rowmaj)\n", 275 | "\n", 276 | "# Measure time to scale a column-major input column-wise\n", 277 | "%timeit scale_colwise(A_colmaj)" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": { 283 | "collapsed": true 284 | }, 285 | "source": [ 286 | "## Python vs. Numpy example: Matrix-vector multiply\n", 287 | "\n", 288 | "Look at the definition of matrix-vector multiplication from [Da Kuang's linear algebra notes](https://www.dropbox.com/s/f410k9fgd7iesdv/kuang-linalg-notes.pdf?dl=0). Let's benchmark a matrix-vector multiply in native Python, and compare that to doing the same operation in Numpy.\n", 289 | "\n", 290 | "First, some setup. (What does this code do?)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 8, 296 | "metadata": { 297 | "collapsed": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "# Dimensions; you might shrink this value for debugging\n", 302 | "n = 2500" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 9, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "# Generate random values, for use in populating the matrix and vector\n", 314 | "from random import gauss\n", 315 | "\n", 316 | "# Native Python, using lists\n", 317 | "A_py = [gauss(0, 1) for i in range(n*n)] # Assume: Column-major\n", 318 | "x_py = [gauss(0, 1) for i in range(n)]\n" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 10, 324 | "metadata": { 325 | "collapsed": true 326 | }, 327 | "outputs": [], 328 | "source": [ 329 | "# Convert values into Numpy arrays in column-major order\n", 330 | "A_np = np.reshape(A_py, 
(n, n), order='F')\n", 331 | "x_np = np.reshape(x_py, (n, 1), order='F')\n" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 11, 337 | "metadata": {}, 338 | "outputs": [ 339 | { 340 | "data": { 341 | "text/plain": [ 342 | "array([[ 13.89207985],\n", 343 | " [ 104.68760455],\n", 344 | " [ 76.12002576],\n", 345 | " ..., \n", 346 | " [ 0.41580127],\n", 347 | " [ -19.92075803],\n", 348 | " [ 41.50614829]])" 349 | ] 350 | }, 351 | "execution_count": 11, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "A_np.dot(x_np)" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 12, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "name": "stdout", 367 | "output_type": "stream", 368 | "text": [ 369 | "1.49 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "# Here is how you do a \"matvec\" in Numpy:\n", 375 | "%timeit A_np.dot(x_np)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "**Exercise 3** (3 points). Implement a matrix-vector product that operates on native Python lists. Assume the 1-D **column-major** storage of the matrix." 
383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 35, 388 | "metadata": { 389 | "collapsed": true, 390 | "deletable": false, 391 | "nbgrader": { 392 | "checksum": "407dbfe44ba36ac12065b96d9142e3a4", 393 | "grade": false, 394 | "grade_id": "matvec_py", 395 | "locked": false, 396 | "schema_version": 1, 397 | "solution": true 398 | } 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "def matvec_py(m, n, A, x):\n", 403 | " \"\"\"\n", 404 | " Native Python-based matrix-vector multiply, using lists.\n", 405 | " The dimensions of the matrix A are m-by-n, and x is a\n", 406 | " vector of length n.\n", 407 | " \"\"\"\n", 408 | " assert type(A) is list and all([type(aij) is float for aij in A])\n", 409 | " assert type(x) is list\n", 410 | " assert len(x) >= n\n", 411 | " assert len(A) >= (m*n)\n", 412 | "\n", 413 | " y = []\n", 414 | " \n", 415 | " # YOUR CODE HERE\n", 416 | " A_np = np.reshape(A, (m, n), order='F')\n", 417 | " \n", 418 | " for row in A_np:\n", 419 | " row_sum = []\n", 420 | " for i in range(n): # each row has n entries, one per column\n", 421 | " row_sum.append(row[i]*x[i])\n", 422 | " y.append(sum(row_sum))\n", 423 | "# print(row_sum)\n", 424 | "# print(y)\n", 425 | " return y\n" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 36, 431 | "metadata": { 432 | "deletable": false, 433 | "nbgrader": { 434 | "checksum": "caf39b3909e91acb9536e726b45bc4cf", 435 | "grade": true, 436 | "grade_id": "matvec_py_test", 437 | "locked": true, 438 | "points": 3, 439 | "schema_version": 1, 440 | "solution": false 441 | } 442 | }, 443 | "outputs": [ 444 | { 445 | "name": "stdout", 446 | "output_type": "stream", 447 | "text": [ 448 | "==> Error bound estimate:\n", 449 | " C*n*eps\n", 450 | " == 10*2500*2.22045e-16\n", 451 | " == 5.55112e-12\n", 452 | "\n" 453 | ] 454 | }, 455 | { 456 | "data": { 457 | "text/latex": [ 458 | "$$||y_{\\textrm{np}} - y_{\\textrm{py}}||_{\\infty} = \\textrm{5.11591e-13} \\leq \\textrm{5.55112e-12}\\ (\\textrm{estimated bound})$$" 
459 | ], 460 | "text/plain": [ 461 | "" 462 | ] 463 | }, 464 | "metadata": {}, 465 | "output_type": "display_data" 466 | }, 467 | { 468 | "name": "stdout", 469 | "output_type": "stream", 470 | "text": [ 471 | "\n", 472 | "(Passed!)\n" 473 | ] 474 | } 475 | ], 476 | "source": [ 477 | "# Test cell: `matvec_py_test`\n", 478 | "\n", 479 | "# Estimate a bound on the difference between these two\n", 480 | "EPS = np.finfo (float).eps # \"machine epsilon\"\n", 481 | "CONST = 10.0 # Some constant for the error bound\n", 482 | "dy_max = CONST * n * EPS\n", 483 | "\n", 484 | "print (\"\"\"==> Error bound estimate:\n", 485 | " C*n*eps\n", 486 | " == %g*%g*%g\n", 487 | " == %g\n", 488 | "\"\"\" % (CONST, n, EPS, dy_max))\n", 489 | "\n", 490 | "# Run the Numpy version and your code\n", 491 | "y_np = A_np.dot (x_np)\n", 492 | "y_py = matvec_py (n, n, A_py, x_py)\n", 493 | "\n", 494 | "# Compute the difference between these\n", 495 | "dy = y_np - np.reshape (y_py, (n, 1), order='F')\n", 496 | "dy_norm = np.linalg.norm (dy, ord=np.inf)\n", 497 | "\n", 498 | "# Summarize the results\n", 499 | "from IPython.display import display, Math\n", 500 | "\n", 501 | "comparison = r\"\\leq\" if dy_norm <= dy_max else r\"\\gt\"\n", 502 | "display (Math (\n", 503 | " r'||y_{\\textrm{np}} - y_{\\textrm{py}}||_{\\infty}'\n", 504 | " r' = \\textrm{%g} %s \\textrm{%g}\\ (\\textrm{estimated bound})'\n", 505 | " % (dy_norm, comparison, dy_max)\n", 506 | " ))\n", 507 | "\n", 508 | "if n <= 4: # Debug: Print all data for small inputs\n", 509 | " print (\"@A_np:\\n\", A_np)\n", 510 | " print (\"@x_np:\\n\", x_np)\n", 511 | " print (\"@y_np:\\n\", y_np)\n", 512 | " print (\"@A_py:\\n\", A_py)\n", 513 | " print (\"@x_py:\\n\", x_py)\n", 514 | " print (\"@y_py:\\n\", y_py)\n", 515 | " print (\"@dy:\\n\", dy)\n", 516 | "\n", 517 | "# Trigger an error on likely failure\n", 518 | "assert dy_norm <= dy_max\n", 519 | "print(\"\\n(Passed!)\")" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | 
"execution_count": 34, 525 | "metadata": { 526 | "nbgrader": { 527 | "grade": false, 528 | "grade_id": "cell-f0155950b35ebcf2", 529 | "locked": true, 530 | "schema_version": 1, 531 | "solution": false 532 | } 533 | }, 534 | "outputs": [ 535 | { 536 | "name": "stdout", 537 | "output_type": "stream", 538 | "text": [ 539 | "2.98 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 540 | ] 541 | } 542 | ], 543 | "source": [ 544 | "%timeit matvec_py (n, n, A_py, x_py)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": { 550 | "nbgrader": { 551 | "grade": false, 552 | "grade_id": "cell-3c70eea14727218a", 553 | "locked": true, 554 | "schema_version": 1, 555 | "solution": false 556 | } 557 | }, 558 | "source": [ 559 | "**Fin!** If you've reached this point and everything executed without error, you can submit this part and move on to the next one." 560 | ] 561 | } 562 | ], 563 | "metadata": { 564 | "celltoolbar": "Create Assignment", 565 | "kernelspec": { 566 | "display_name": "Python 3", 567 | "language": "python", 568 | "name": "python3" 569 | }, 570 | "language_info": { 571 | "codemirror_mode": { 572 | "name": "ipython", 573 | "version": 3 574 | }, 575 | "file_extension": ".py", 576 | "mimetype": "text/x-python", 577 | "name": "python", 578 | "nbconvert_exporter": "python", 579 | "pygments_lexer": "ipython3", 580 | "version": "3.5.2" 581 | } 582 | }, 583 | "nbformat": 4, 584 | "nbformat_minor": 1 585 | } 586 | -------------------------------------------------------------------------------- /Notebook12_LinerRegNotes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Notes: Solving the linear regression problem\n", 8 | "\n", 9 | "In the linear regression problem, you have a data matrix, $X$, and a response $y$, you want to find model parameters $\\theta$ that make $y \\approx X \\theta$. 
These notes sketch one method for solving this problem." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Notation\n", 17 | "\n", 18 | "Assume your data consists of $m$ observations and $n+1$ variables. One of these variables is the _response_ variable, $y$, which you want to predict from the other $n$ variables, $\\{x_0, \\ldots, x_{n-1}\\}$. You wish to fit a _linear model_ of the following form to these data,\n", 19 | "\n", 20 | "$$y_i \\approx x_{i,0} \\theta_0 + x_{i,1} \\theta_1 + \\cdots + x_{i,n-1} \\theta_{n-1} + \\theta_n,$$\n", 21 | "\n", 22 | "where $\\{\\theta_j | 0 \\leq j \\leq n\\}$ is the set of unknown coefficients. Your modeling task is to choose values for these coefficients that \"best fit\" the data." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "You can arrange the observations into a tibble like this one:\n", 30 | "\n", 31 | "| $y$ | $x_0$ | $x_1$ | $\\cdots$ | $x_{n-1}$ | $x_n$ |\n", 32 | "|:----------:|:-------------:|:-------------:|:--------:|:---------------:|:-------------:|\n", 33 | "| $y_0$ | $x_{0,0}$ | $x_{0,1}$ | $\\cdots$ | $x_{0,n-1}$ | 1.0 |\n", 34 | "| $y_1$ | $x_{1,0}$ | $x_{1,1}$ | $\\cdots$ | $x_{1,n-1}$ | 1.0 |\n", 35 | "| $y_2$ | $x_{2,0}$ | $x_{2,1}$ | $\\cdots$ | $x_{2,n-1}$ | 1.0 |\n", 36 | "| $\\vdots$ | $\\vdots$ | $\\vdots$ | $\\vdots$ | $\\vdots$ | 1.0 |\n", 37 | "| $y_{m-1}$ | $x_{m-1,0}$ | $x_{m-1,1}$ | $\\cdots$ | $x_{m-1,n-1}$ | 1.0 |\n", 38 | "\n", 39 | "This tibble includes an extra dummy variable, $x_n$, whose entries are all equal to 1.0. Treating each variable as a column vector, the modeling task is to find the vector $\\theta^T \\equiv (\\theta_0, \\theta_1, \\ldots, \\theta_{n})$ such that\n", 40 | "\n", 41 | "$$y \\approx X \\theta,$$\n", 42 | "\n", 43 | "where $y$ is the vector of responses and $X$ is the $m \\times (n+1)$ matrix whose columns are the corresponding vectors, $x_0$, $x_1$, $\\ldots$, $x_n$. 
The matrix $X$ composed this way from the predictors is sometimes referred to as the _(input) data matrix_." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "So how should you choose $\\theta$? Suppose you are given $\\theta$. One way to measure its quality is to look at the difference between $y$ and the _(model) prediction_, $X \\theta$. A natural way to measure that difference is to use some vector norm, like the 2-norm (here, squared):\n", 51 | "\n", 52 | "$$ \\|X \\theta - y\\|_2^2 \\equiv \\|r\\|_2^2,$$\n", 53 | "\n", 54 | "where $r \\equiv X \\theta - y$ is the _residual error vector_ or just _residual_ for this model. Each element of $r$ is the residual for a given observation; thus, using the two-norm means each difference is squared, thereby \"penalizing\" larger differences more than smaller ones.\n", 55 | "\n", 56 | "> The additional squaring of $\\|r\\|_2$ could be interpreted similarly, though in reality it is chosen to simplify the math. In particular, recall (or convince yourself) that $\\|r\\|_2^2 = r^T r$." 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "Given this error measure, we can now formalize our mathematical goal as an optimization problem: compute the $\\theta$ that _minimizes_ this error:\n", 64 | "\n", 65 | "$$ \\theta_* = {\\arg\\min_\\theta} \\|X \\theta - y\\|_2^2. $$" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Solving the optimization problem\n", 73 | "\n", 74 | "Recall from calculus that you can minimize (or maximize) a continuous function $f(x)$ in a single variable $x$ by computing its derivative $\\left.\\frac{df}{dx}\\right|_{x=x_*}$, setting it to zero, and then solving for $x_*$.\n", 75 | "\n", 76 | "> **Example.** Let $f(x) \\equiv a x^2 + b x + c$. Then its maximum or minimum occurs at\n", 77 | ">\n", 78 | "> $$\n", 79 | " \\left. 
\\frac{df}{dx} \\right|_{x=x_*} = 2 a x_* + b = 0,\n", 80 | " $$\n", 81 | ">\n", 82 | "> or when\n", 83 | "> \n", 84 | "> $$\n", 85 | " x_* = -\\frac{b}{2 a}.\n", 86 | " $$\n", 87 | ">\n", 88 | "> To show whether this value is a maximum, a minimum, or a saddle-point, you would look at the second derivative. But let's skip that detail for now." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "In the setting of multivariable calculus, the procedure is the same. Let $g(\\theta)$ be the (scalar) function to minimize or maximize, where $\\theta$ is a vector. For vectors, the analogue of the first-derivative is the _gradient_. We define the gradient of a scalar function $g$ with respect to the vector variable $\\theta$ to be the vector\n", 96 | "\n", 97 | "$$\n", 98 | "\\nabla_\\theta g(\\theta) \\equiv\n", 99 | " \\left(\\begin{array}{c}\n", 100 | " \\frac{\\partial g}{\\partial \\theta_0} \\\\\n", 101 | " \\frac{\\partial g}{\\partial \\theta_1} \\\\\n", 102 | " \\vdots \\\\\n", 103 | " \\frac{\\partial g}{\\partial \\theta_{n-1}}\n", 104 | " \\end{array}\\right),\n", 105 | "$$\n", 106 | "\n", 107 | "where $\\frac{\\partial g}{\\partial \\theta_i}$ is the partial derivative of $g$ with respect to $\\theta_i$. (To compute a partial derivative with respect to $\\theta_i$, take the ordinary derivative with respect to $\\theta_i$ while treating all other $\\theta_{j \\neq i}$ as constants.) The gradient produces a _vector_ of these partial derivatives.\n", 108 | "\n", 109 | "> **Example.** Let $\\theta \\equiv \\left(\\begin{array}{c} \\theta_0 \\\\ \\theta_1 \\end{array}\\right)$ and $g(\\theta) \\equiv \\|\\theta\\|_2^2$. That is,\n", 110 | ">\n", 111 | "> $$ g(\\theta) = \\|\\theta\\|_2^2 \\Longrightarrow g(\\theta_0, \\theta_1) = \\theta_0^2 + \\theta_1^2. 
$$\n", 112 | ">\n", 113 | "> Then,\n", 114 | ">\n", 115 | "> $$\n", 116 | " \\nabla_\\theta\\, g(\\theta)\n", 117 | " = \\left(\\begin{array}{c}\n", 118 | " \\frac{\\partial g}{\\partial \\theta_0} \\\\\n", 119 | " \\frac{\\partial g}{\\partial \\theta_1}\n", 120 | " \\end{array}\\right)\n", 121 | " = \\left(\\begin{array}{c}\n", 122 | " \\frac{\\partial}{\\partial \\theta_0} (\\theta_0^2 + \\theta_1^2) \\\\\n", 123 | " \\frac{\\partial}{\\partial \\theta_1} (\\theta_0^2 + \\theta_1^2)\n", 124 | " \\end{array}\\right)\n", 125 | " = \\left(\\begin{array}{c}\n", 126 | " 2 \\theta_0 \\\\\n", 127 | " 2 \\theta_1\n", 128 | " \\end{array}\\right)\n", 129 | " = 2 \\theta.\n", 130 | " $$" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "From the definition of the gradient, you should be able to verify the following identities. Below, take $v$ and $w$ to be vectors of length $n$ and $M$ to be an $n \\times n$ matrix.\n", 138 | "\n", 139 | "1. $\\nabla_v (v^T w) = w$.\n", 140 | "2. $\\nabla_v (v^T v) = 2v$. (That is, generalize the example above to an $n$-vector.)\n", 141 | "3. $\\nabla_v (v^T M v) = (M + M^T)v$." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "**Computing the optimal parameters, $\\theta^*$.** Armed with the gradient, you are now ready to minimize $g(\\theta) \\equiv \\|X \\theta - y\\|_2^2$.\n", 149 | "\n", 150 | "In the same way that the derivative is zero at the minimum of a scalar function, the gradient will be zero at the minimum of $g(\\theta)$. So let's compute the gradient and set it to zero." 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "When\n", 158 | "\n", 159 | "$$\n", 160 | " \\begin{eqnarray}\n", 161 | " \\left. 
\\nabla_\\theta\\, g(\\theta) \\right|_{\\theta^*} = 0,\n", 162 | " \\end{eqnarray}\n", 163 | "$$\n", 164 | "\n", 165 | "then\n", 166 | "\n", 167 | "$$\n", 168 | "\\begin{eqnarray}\n", 169 | " \\nabla_{\\theta^*} \\|X\\theta^* - y\\|_2^2\n", 170 | " & = & \\nabla_{\\theta^*} \\left( \\theta^{*T} X^T X \\theta^* - 2 \\theta^{*T} X^T y + y^T y \\right) \\\\\n", 171 | " & = & 2 (X^T X \\theta^* - X^T y) \\\\\n", 172 | " & = & 0.\n", 173 | "\\end{eqnarray}\n", 174 | "$$\n", 175 | "\n", 176 | "In other words, the $\\theta^*$ at the minimum is the solution of $X^T X \\theta^* = X^T y$. This system is known as the _normal equations_. If the data matrix $X$ has full column rank, then $X^T X$ is invertible and this system has a unique solution.\n", 177 | "\n", 178 | "> Again, like the 1-D case, we've glossed over the fact that you need one more step to show that $\\theta^*$ is a minimizer of $g(\\theta)$, rather than a maximizer or a saddle point." 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "anaconda-cloud": {}, 184 | "celltoolbar": "Create Assignment", 185 | "kernelspec": { 186 | "display_name": "Python 3", 187 | "language": "python", 188 | "name": "python3" 189 | }, 190 | "language_info": { 191 | "codemirror_mode": { 192 | "name": "ipython", 193 | "version": 3 194 | }, 195 | "file_extension": ".py", 196 | "mimetype": "text/x-python", 197 | "name": "python", 198 | "nbconvert_exporter": "python", 199 | "pygments_lexer": "ipython3", 200 | "version": "3.5.2" 201 | } 202 | }, 203 | "nbformat": 4, 204 | "nbformat_minor": 1 205 | } 206 | -------------------------------------------------------------------------------- /Notebook12_PerturbationTheory.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Notes: Perturbation theory and condition numbers\n", 8 | "\n", 9 | "Let's start by asking how \"hard\" it is to solve a given linear system, $Ax=b$. 
You will apply perturbation theory to answer this question.\n", 10 | "\n", 11 | "This notebook is only for your edification. You do not need to submit it, but you are responsible for understanding the concept of a _condition number_ and how to estimate it for a matrix using Numpy. (The code below shows you how!)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "**Intuition: Continuous functions of a single variable.** To build your intuition, consider the simple case of a scalar function in a single continuous variable, $y = f(x)$. Suppose the input is perturbed by some amount, $\\Delta x$. The output will also change by some amount, $\\Delta y$. How large is $\\Delta y$ relative to $\\Delta x$?\n", 19 | "\n", 20 | "Supposing $\\Delta x$ is sufficiently small, you can approximate the change in the output by a Taylor series expansion of $f(x + \\Delta x)$:\n", 21 | "\n", 22 | "$$\n", 23 | " y + \\Delta y = f(x + \\Delta x) = f(x) + \\Delta x \\frac{df}{dx} + O(\\Delta x^2).\n", 24 | "$$\n", 25 | "\n", 26 | "Since $\\Delta x$ is assumed to be \"small,\" we can approximate this relation by\n", 27 | "\n", 28 | "$$\n", 29 | "\\begin{eqnarray}\n", 30 | " y + \\Delta y & \\approx & f(x) + \\Delta x \\frac{df}{dx} \\\\\n", 31 | " \\Delta y & \\approx & \\Delta x \\frac{df}{dx}.\n", 32 | "\\end{eqnarray}\n", 33 | "$$\n", 34 | "\n", 35 | "This result should not be surprising: the first derivative measures the sensitivity of changes in the output to changes in the input. We will give the derivative a special name: it is the _(absolute) condition number_. If it is very large in the vicinity of $x$, then even small changes to the input will result in large changes in the output. 
Put differently, a large condition number indicates that the problem is intrinsically sensitive, so we should expect it may be difficult to construct an accurate algorithm.\n", 36 | "\n", 37 | "In addition to the absolute condition number, we can define a _relative_ condition number for the problem of evaluating $f(x)$.\n", 38 | "\n", 39 | "$$\n", 40 | "\\begin{eqnarray}\n", 41 | " \\Delta y & \\approx & \\Delta x \\frac{df}{dx} \\\\\n", 42 | " & \\Downarrow & \\\\\n", 43 | " \\frac{|\\Delta y|}{|y|} & \\approx & \\frac{|\\Delta x|}{|x|} \\cdot \\underbrace{\\frac{|df/dx| \\cdot |x|}{|f(x)|}}_{\\kappa_f(x)}.\n", 44 | "\\end{eqnarray}\n", 45 | "$$\n", 46 | "\n", 47 | "Here, the underbraced factor, defined to be $\\kappa_f(x)$, is the relative analogue of the absolute condition number. Again, its magnitude tells us whether the output is sensitive to the input." 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "**Perturbation theory for linear systems.** What if we perturb a linear system? How can we measure its sensitivity or \"intrinsic difficulty\" to solve?\n", 55 | "\n", 56 | "First, recall the following linear algebraic inequalities:\n", 57 | "\n", 58 | "* _Triangle inequality_: $\\|x + y\\|_2 \\leq \\|x\\|_2 + \\|y\\|_2$\n", 59 | "* _Norm of a matrix-vector product_: $\\|Ax\\|_2 \\leq \\|A\\|_F\\cdot\\|x\\|_2$\n", 60 | "* _Norm of a matrix-matrix product_: $\\|AB\\|_F \\leq \\|A\\|_F\\cdot\\|B\\|_F$\n", 61 | "\n", 62 | "To simplify the notation a little, we will drop the \"$2$\" and \"$F$\" subscripts." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "Suppose all of $A$, $b$, and the eventual solution $x$ undergo additive perturbations, denoted by $A + \\Delta A$, $b + \\Delta b$, and $x + \\Delta x$, respectively. 
Then, subtracting the original system from the perturbed system, you would obtain the following.\n", 70 | "\n", 71 | "$$\n", 72 | "\\begin{array}{rrcll}\n", 73 | " & (A + \\Delta A)(x + \\Delta x) & = & b + \\Delta b & \\\\\n", 74 | "- [& Ax & = & b & ] \\\\\n", 75 | "\\hline\n", 76 | " & \\Delta A x + (A + \\Delta A) \\Delta x & = & \\Delta b & \\\\\n", 77 | "\\end{array}\n", 78 | "$$" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Now look more closely at the perturbation, $\\Delta x$, of the solution. Let $\\hat{x} \\equiv x + \\Delta x$ be the perturbed solution. Then the above can be rewritten as,\n", 86 | "\n", 87 | "$$\\Delta x = A^{-1} \\left(\\Delta b - \\Delta A \\hat{x}\\right),$$\n", 88 | "\n", 89 | "where we have assumed that $A$ is invertible. (That won't be true for our overdetermined system, but let's not worry about that for the moment.)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "How large is $\\Delta x$? 
Let's use a norm to measure it and bound it using the norm properties recalled above:\n", 97 | "\n", 98 | "$$\n", 99 | "\\begin{array}{rcl}\n", 100 | " \\|\\Delta x\\| & = & \\|A^{-1} \\left(\\Delta b - \\Delta A \\hat{x}\\right)\\| \\\\\n", 101 | " & \\leq & \\|A^{-1}\\|\\cdot\\left(\\|\\Delta b\\| + \\|\\Delta A\\|\\cdot\\|\\hat{x}\\|\\right).\n", 102 | "\\end{array}\n", 103 | "$$" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "You can rewrite this as follows:\n", 111 | "\n", 112 | "$$\n", 113 | "\\begin{array}{rcl}\n", 114 | " \\frac{\\|\\Delta x\\|}\n", 115 | " {\\|\\hat{x}\\|}\n", 116 | " & \\leq &\n", 117 | " \\|A^{-1}\\| \\cdot \\|A\\| \\cdot \\left(\n", 118 | " \\frac{\\|\\Delta A\\|}\n", 119 | " {\\|A\\|}\n", 120 | " +\n", 121 | " \\frac{\\|\\Delta b\\|}\n", 122 | " {\\|A\\| \\cdot \\|\\hat{x}\\|}\n", 123 | " \\right).\n", 124 | "\\end{array}\n", 125 | "$$" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "This bound says that the relative error of the perturbed solution, compared to relative perturbations in $A$ and $b$, scales with the product, $\\|A^{-1}\\| \\cdot \\|A\\|$. This factor is the linear systems analogue of the condition number for evaluating the function $f(x)$! As such, we define\n", 133 | "\n", 134 | "$$\\kappa(A) \\equiv \\|A^{-1}\\| \\cdot \\|A\\|$$\n", 135 | "\n", 136 | "as the _condition number of $A$_ for solving linear systems." 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "**What values of $\\kappa(A)$ are \"large?\"** Generally, you want to compare $\\kappa(A)$ to $1/\\epsilon$, where $\\epsilon$ is _machine precision_, which is the [maximum relative error under rounding](https://sites.ualberta.ca/~kbeach/phys420_580_2010/docs/ACM-Goldberg.pdf). 
We may look more closely at floating-point representations later on, but for now, a good notional value for $\\epsilon$ is about $10^{-7}$ in single-precision and $10^{-15}$ in double-precision. (In Python, the default format for floating-point values is double-precision.)\n", 144 | "\n", 145 | "This analysis explains why solving the normal equations directly could lead to computational problems. In particular, one can show that $\\kappa(X^T X) \\approx \\kappa(X)^2$, which means forming $X^T X$ explicitly may make the problem harder to solve by a large amount!\n", 146 | "\n", 147 | "Another scenario in which $X$ will have a large condition number is when it has nearly collinear predictors. See the examples below." 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "collapsed": true, 154 | "nbgrader": { 155 | "grade": false, 156 | "grade_id": "cell-921c6afe80942a78", 157 | "locked": true, 158 | "schema_version": 1, 159 | "solution": false 160 | } 161 | }, 162 | "source": [ 163 | "**Fin!** That's the end of these notes." 
164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "anaconda-cloud": {}, 169 | "celltoolbar": "Create Assignment", 170 | "kernelspec": { 171 | "display_name": "Python 3", 172 | "language": "python", 173 | "name": "python3" 174 | }, 175 | "language_info": { 176 | "codemirror_mode": { 177 | "name": "ipython", 178 | "version": 3 179 | }, 180 | "file_extension": ".py", 181 | "mimetype": "text/x-python", 182 | "name": "python", 183 | "nbconvert_exporter": "python", 184 | "pygments_lexer": "ipython3", 185 | "version": "3.5.2" 186 | } 187 | }, 188 | "nbformat": 4, 189 | "nbformat_minor": 1 190 | } 191 | -------------------------------------------------------------------------------- /Notebook13_NumpyMatrixManipulation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Mo' Numpy, Mo' Problems\n", 8 | "\n", 9 | "This notebook is a quick overview of additional functionality in Numpy not explicitly covered in some of the other notebooks in this course." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# Random numbers" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "Numpy has a rich collection of (pseudo)random number generators. Here is an example; see the documentation for [numpy.random()](https://docs.scipy.org/doc/numpy/reference/routines.random.html) for more details." 
35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "array([[ 2, -10, 0],\n", 46 | " [ -9, 4, -9],\n", 47 | " [ -1, 8, -7],\n", 48 | " [ -5, -3, 4]])" 49 | ] 50 | }, 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "A = np.random.randint(-10, 10, size=(4, 3))\n", 58 | "A" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "# Aggregations or reductions" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "Suppose you want to reduce the values of a Numpy array to a smaller number of values. Numpy provides a number of such functions that _aggregate_ values. Examples of aggregations include sums, min/max calculations, and averaging, among others." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "8 8\n", 85 | "-10 -10\n", 86 | "-26\n", 87 | "-2.16666666667\n", 88 | "5.69844033243\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "print(np.max(A), np.amax(A)) # np.max() and np.amax() are synonyms\n", 94 | "print(np.min(A), np.amin(A)) # same\n", 95 | "print(np.sum(A))\n", 96 | "print(np.mean(A))\n", 97 | "print(np.std(A))" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "The above examples aggregate over all values. But you can also aggregate along a dimension using the optional `axis` parameter." 
105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Max in each column: [2 8 4]\n", 117 | "Max in each row: [2 4 8 4]\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "print(\"Max in each column:\", np.amax(A, axis=0)) # i.e., aggregate along axis 0, the rows, producing column maximums\n", 123 | "print(\"Max in each row:\", np.amax(A, axis=1)) # i.e., aggregate along axis 1, the columns, producing row maximums" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "# Universal functions" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "Universal functions apply a given function _elementwise_ to one or more Numpy objects.\n", 138 | "\n", 139 | "For instance, `np.abs(A)` takes the absolute value of each element." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "[[ 2 -10 0]\n", 152 | " [ -9 4 -9]\n", 153 | " [ -1 8 -7]\n", 154 | " [ -5 -3 4]] \n", 155 | "==>\n", 156 | " [[ 2 10 0]\n", 157 | " [ 9 4 9]\n", 158 | " [ 1 8 7]\n", 159 | " [ 5 3 4]]\n" 160 | ] 161 | } 162 | ], 163 | "source": [ 164 | "print(A, \"\\n==>\\n\", np.abs(A))" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Some universal functions accept multiple, compatible arguments. Given two matrices, $A \\equiv (a_{ij})$ and $B \\equiv (b_{ij})$, the following example, computes a matrix $C$ such that $c_{ij} = \\max(a_{ij}, b_{ij})$." 
172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "array([[ 0, -8, 9],\n", 183 | " [-9, 9, 4],\n", 184 | " [ 9, -5, 0],\n", 185 | " [-6, -2, -3]])" 186 | ] 187 | }, 188 | "execution_count": 6, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "B = np.random.randint(-10, 10, size=A.shape)\n", 195 | "B" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "array([[ 2, -8, 9],\n", 207 | " [-9, 9, 4],\n", 208 | " [ 9, 8, 0],\n", 209 | " [-5, -2, 4]])" 210 | ] 211 | }, 212 | "execution_count": 7, 213 | "metadata": {}, 214 | "output_type": "execute_result" 215 | } 216 | ], 217 | "source": [ 218 | "C = np.maximum(A, B)\n", 219 | "C" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "You can also build your own universal functions! For instance, suppose you want to compute, elementwise, $f(x) = e^{-x^2}$ and you have a scalar function that implements $f(x)$:" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 8, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "0.01831563888873418" 238 | ] 239 | }, 240 | "execution_count": 8, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "def f(x):\n", 247 | " from math import exp\n", 248 | " return exp(-(x**2))\n", 249 | "\n", 250 | "f(-2) # i.e., exp(-4) ~= 0.01831563888873418" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "This function accepts 1 input (`x`) and returns a single output. The following will create a new Numpy universal function." 
258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 9, 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "name": "stdout", 267 | "output_type": "stream", 268 | "text": [ 269 | "[[ 2 -10 0]\n", 270 | " [ -9 4 -9]\n", 271 | " [ -1 8 -7]\n", 272 | " [ -5 -3 4]] \n", 273 | "=>\n", 274 | " [[0.01831563888873418 3.720075976020836e-44 1.0]\n", 275 | " [6.639677199580735e-36 1.1253517471925912e-07 6.639677199580735e-36]\n", 276 | " [0.36787944117144233 1.603810890548638e-28 5.242885663363464e-22]\n", 277 | " [1.3887943864964021e-11 0.00012340980408667956 1.1253517471925912e-07]]\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "f_np = np.frompyfunc(f, 1, 1) # Creates a universal function from `f()`\n", 283 | "\n", 284 | "print(A, \"\\n=>\\n\", f_np(A))" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "# Broadcasting" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "Sometimes we want to combine operations on Numpy arrays that have different shapes but are _compatible_.\n", 299 | "\n", 300 | "In the following example, we want to add 3 elementwise to every value in `A`." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 10, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "name": "stdout", 310 | "output_type": "stream", 311 | "text": [ 312 | "[[ 2 -10 0]\n", 313 | " [ -9 4 -9]\n", 314 | " [ -1 8 -7]\n", 315 | " [ -5 -3 4]]\n", 316 | "\n", 317 | "[[ 5 -7 3]\n", 318 | " [-6 7 -6]\n", 319 | " [ 2 11 -4]\n", 320 | " [-2 0 7]]\n" 321 | ] 322 | } 323 | ], 324 | "source": [ 325 | "print(A)\n", 326 | "print()\n", 327 | "print(A + 3)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "Technically, `A` and `3` have different shapes: the former is a $4 \\times 3$ matrix, while the latter is a scalar ($1 \\times 1$). 
However, they are compatible because Numpy has a scheme to _extend_---or **broadcast**---the value 3 into an equivalent matrix object of the same shape before combining them elementwise." 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "To see a more sophisticated example, suppose each row `A[i, :]` holds the coordinates of a data point, and we want to compute the centroid (or \"center-of-mass,\" if we imagine each point is a unit mass). That's the same as computing the mean coordinate for each column:" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 11, 347 | "metadata": {}, 348 | "outputs": [ 349 | { 350 | "name": "stdout", 351 | "output_type": "stream", 352 | "text": [ 353 | "[[ 2 -10 0]\n", 354 | " [ -9 4 -9]\n", 355 | " [ -1 8 -7]\n", 356 | " [ -5 -3 4]] => [-3.25 -0.25 -3. ]\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "A_row_means = np.mean(A, axis=0)\n", 362 | "print(A, \"=>\", A_row_means)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "Now, suppose you want to shift the points so that their mean is zero. Even though `A` and `A_row_means` don't have the same shape, Numpy will interpret `A - A_row_means` in a way that effectively carries out this operation. That is, it will extend or \"replicate\" `A_row_means` into rows of a matrix of the same shape as `A` and then perform elementwise subtraction." 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 12, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/plain": [ 380 | "array([[ 5.25, -9.75, 3. ],\n", 381 | " [-5.75, 4.25, -6. ],\n", 382 | " [ 2.25, 8.25, -4. ],\n", 383 | " [-1.75, -2.75, 7. 
]])" 384 | ] 385 | }, 386 | "execution_count": 12, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "A_row_centered = A - A_row_means\n", 393 | "A_row_centered" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "Suppose you want to mean-center the _columns_ instead of the rows. You could start by computing column means:" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 13, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "name": "stdout", 410 | "output_type": "stream", 411 | "text": [ 412 | "[[ 2 -10 0]\n", 413 | " [ -9 4 -9]\n", 414 | " [ -1 8 -7]\n", 415 | " [ -5 -3 4]] => [-2.66666667 -4.66666667 0. -1.33333333]\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "A_col_means = np.mean(A, axis=1)\n", 421 | "print(A, \"=>\", A_col_means)" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "But the same subtraction, `A - A_col_means`, will fail!"
429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 14, 434 | "metadata": {}, 435 | "outputs": [ 436 | { 437 | "ename": "ValueError", 438 | "evalue": "operands could not be broadcast together with shapes (4,3) (4,) ", 439 | "output_type": "error", 440 | "traceback": [ 441 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 442 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", 443 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mA\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mA_col_means\u001b[0m \u001b[0;31m# Fails, throwing a `ValueError`\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 444 | "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (4,3) (4,) " 445 | ] 446 | } 447 | ], 448 | "source": [ 449 | "A - A_col_means # Fails, throwing a `ValueError`" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "The error reports that these shapes are not compatible. So how can you fix it?\n", 457 | "\n", 458 | "**The broadcasting rule.** One way is to learn Numpy's convention for **[broadcasting](https://docs.scipy.org/doc/numpy/reference/ufuncs.html#broadcasting)**. Numpy starts by looking at the shapes of the objects:" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 15, 464 | "metadata": {}, 465 | "outputs": [ 466 | { 467 | "name": "stdout", 468 | "output_type": "stream", 469 | "text": [ 470 | "(4, 3) (3,)\n" 471 | ] 472 | } 473 | ], 474 | "source": [ 475 | "print(A.shape, A_row_means.shape)" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "These are compatible if, starting from _right_ to _left_, the dimensions match **or** one of the dimensions is 1. This convention of moving from right to left is referred to as matching the _trailing dimensions_. 
In this example, the rightmost dimensions of each object are both 3, so they match. Since `A_row_means` has no more dimensions, it can be replicated to match the remaining dimensions of `A`." 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "By contrast, consider the shapes of `A` and `A_col_means`:" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 16, 495 | "metadata": {}, 496 | "outputs": [ 497 | { 498 | "name": "stdout", 499 | "output_type": "stream", 500 | "text": [ 501 | "(4, 3) (4,)\n" 502 | ] 503 | } 504 | ], 505 | "source": [ 506 | "print(A.shape, A_col_means.shape)" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "In this case, per the broadcasting rule, the trailing dimensions of 3 and 4 do not match. Therefore, the broadcast rule fails. One way to get the desired behavior is to modify `A_col_means` to have a unit trailing dimension. In this case, you can use Numpy's [`reshape()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html) to convert `A_col_means` into a shape that has an explicit trailing dimension of size 1." 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": 17, 519 | "metadata": {}, 520 | "outputs": [ 521 | { 522 | "name": "stdout", 523 | "output_type": "stream", 524 | "text": [ 525 | "[[-2.66666667]\n", 526 | " [-4.66666667]\n", 527 | " [ 0. ]\n", 528 | " [-1.33333333]] => (4, 1)\n" 529 | ] 530 | } 531 | ], 532 | "source": [ 533 | "A_col_means2 = np.reshape(A_col_means, (len(A_col_means), 1))\n", 534 | "print(A_col_means2, \"=>\", A_col_means2.shape)" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "metadata": {}, 540 | "source": [ 541 | "Now the trailing dimension equals 1, so it can be matched against the trailing dimension of `A`. The next dimension is the same between the two objects, so Numpy knows it can replicate accordingly." 
542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 18, 547 | "metadata": {}, 548 | "outputs": [ 549 | { 550 | "name": "stdout", 551 | "output_type": "stream", 552 | "text": [ 553 | "[[ 2 -10 0]\n", 554 | " [ -9 4 -9]\n", 555 | " [ -1 8 -7]\n", 556 | " [ -5 -3 4]] \n", 557 | "- [[-2.66666667]\n", 558 | " [-4.66666667]\n", 559 | " [ 0. ]\n", 560 | " [-1.33333333]]\n", 561 | "=>\n", 562 | " [[ 4.66666667 -7.33333333 2.66666667]\n", 563 | " [-4.33333333 8.66666667 -4.33333333]\n", 564 | " [-1. 8. -7. ]\n", 565 | " [-3.66666667 -1.66666667 5.33333333]]\n" 566 | ] 567 | } 568 | ], 569 | "source": [ 570 | "print(A, \"\\n-\", A_col_means2)\n", 571 | "print(\"=>\\n\", A - A_col_means2)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": { 577 | "collapsed": true 578 | }, 579 | "source": [ 580 | "**Fin!** That marks the end of this notebook. If you want to learn more, check out the second edition of [Python for Data Analysis](http://shop.oreilly.com/product/0636920050896.do) (released in October 2017)." 
581 | ] 582 | } 583 | ], 584 | "metadata": { 585 | "kernelspec": { 586 | "display_name": "Python 3", 587 | "language": "python", 588 | "name": "python3" 589 | }, 590 | "language_info": { 591 | "codemirror_mode": { 592 | "name": "ipython", 593 | "version": 3 594 | }, 595 | "file_extension": ".py", 596 | "mimetype": "text/x-python", 597 | "name": "python", 598 | "nbconvert_exporter": "python", 599 | "pygments_lexer": "ipython3", 600 | "version": "3.5.2" 601 | } 602 | }, 603 | "nbformat": 4, 604 | "nbformat_minor": 2 605 | } 606 | -------------------------------------------------------------------------------- /Notebook1_part2-more_exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "nbgrader": { 7 | "grade": false, 8 | "grade_id": "cell-3b25f2b6cfc80b65", 9 | "locked": true, 10 | "schema_version": 1, 11 | "solution": false 12 | } 13 | }, 14 | "source": [ 15 | "# Python review: More exercises\n", 16 | "\n", 17 | "This notebook continues the review of Python basics based on [Chris Simpkins's](https://www.cc.gatech.edu/~simpkins/) [Python Bootcamp](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/syllabus.html).\n", 18 | "\n", 19 | "This particular notebook adapts the exercises that appeared with the [\"Functional Programming\" slides](https://www.cc.gatech.edu/~simpkins/teaching/python-bootcamp/slides/functional-programming.html) of the Fall 2016 offering." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": { 25 | "nbgrader": { 26 | "grade": false, 27 | "grade_id": "cell-f3331b5182117a1f", 28 | "locked": true, 29 | "schema_version": 1, 30 | "solution": false 31 | } 32 | }, 33 | "source": [ 34 | "Consider the following dataset of exam grades, organized as a 2-D table and stored in Python as a \"list of lists\" under the variable name, `grades`." 
35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": { 41 | "collapsed": true, 42 | "nbgrader": { 43 | "grade": false, 44 | "grade_id": "cell-9dc72b683a8858c7", 45 | "locked": true, 46 | "schema_version": 1, 47 | "solution": false 48 | } 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "grades = [\n", 53 | " # First line is descriptive header. Subsequent lines hold data\n", 54 | " ['Student', 'Exam 1', 'Exam 2', 'Exam 3'],\n", 55 | " ['Thorny', '100', '90', '80'],\n", 56 | " ['Mac', '88', '99', '111'],\n", 57 | " ['Farva', '45', '56', '67'],\n", 58 | " ['Rabbit', '59', '61', '67'],\n", 59 | " ['Ursula', '73', '79', '83'],\n", 60 | " ['Foster', '89', '97', '101']\n", 61 | "]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": { 67 | "nbgrader": { 68 | "grade": false, 69 | "grade_id": "cell-04082681e80572d5", 70 | "locked": true, 71 | "schema_version": 1, 72 | "solution": false 73 | } 74 | }, 75 | "source": [ 76 | "**Exercise 0** (`students_test`: 1 point). Write some code that computes a new list named `students[:]`, which holds the names of the students as they appear from \"top to bottom\" in the table."
77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": { 83 | "nbgrader": { 84 | "grade": false, 85 | "grade_id": "students", 86 | "locked": false, 87 | "schema_version": 1, 88 | "solution": true 89 | } 90 | }, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']" 96 | ] 97 | }, 98 | "execution_count": 4, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "#\n", 105 | "# YOUR CODE HERE\n", 106 | "#\n", 107 | "students = [x[0] for x in grades if x[0] != 'Student']" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "metadata": { 114 | "nbgrader": { 115 | "grade": true, 116 | "grade_id": "students_test", 117 | "locked": true, 118 | "points": 1, 119 | "schema_version": 1, 120 | "solution": false 121 | } 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']\n", 129 | "\n", 130 | "(Passed!)\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "# `students_test`: Test cell\n", 136 | "print(students)\n", 137 | "assert type(students) is list\n", 138 | "assert students == ['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']\n", 139 | "print(\"\\n(Passed!)\")" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": { 145 | "nbgrader": { 146 | "grade": false, 147 | "grade_id": "cell-e5e0181d53efed56", 148 | "locked": true, 149 | "schema_version": 1, 150 | "solution": false 151 | } 152 | }, 153 | "source": [ 154 | "**Exercise 1** (`assignments_test`: 1 point). Write some code to compute a new list named `assignments[:]`, to hold the names of the class assignments. 
(These appear in the descriptive header element of `grades`.)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 6, 160 | "metadata": { 161 | "nbgrader": { 162 | "grade": false, 163 | "grade_id": "assignments", 164 | "locked": false, 165 | "schema_version": 1, 166 | "solution": true 167 | } 168 | }, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "['Exam 1', 'Exam 2', 'Exam 3']" 174 | ] 175 | }, 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "#\n", 183 | "# YOUR CODE HERE\n", 184 | "#\n", 185 | "assignments = [x for x in grades[0] if x != 'Student']" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 7, 191 | "metadata": { 192 | "nbgrader": { 193 | "grade": true, 194 | "grade_id": "assignments_test", 195 | "locked": true, 196 | "points": 1, 197 | "schema_version": 1, 198 | "solution": false 199 | } 200 | }, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "['Exam 1', 'Exam 2', 'Exam 3']\n", 207 | "\n", 208 | "(Passed!)\n" 209 | ] 210 | } 211 | ], 212 | "source": [ 213 | "# `assignments_test`: Test cell\n", 214 | "print(assignments)\n", 215 | "assert type(assignments) is list\n", 216 | "assert assignments == ['Exam 1', 'Exam 2', 'Exam 3']\n", 217 | "print(\"\\n(Passed!)\")" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "nbgrader": { 224 | "grade": false, 225 | "grade_id": "cell-1bd41417aad245fa", 226 | "locked": true, 227 | "schema_version": 1, 228 | "solution": false 229 | } 230 | }, 231 | "source": [ 232 | "**Exercise 2** (`grade_lists_test`: 1 point). Write some code to compute a new _dictionary_, named `grade_lists`, that maps names of students to _lists_ of their exam grades. The grades should be converted from strings to integers. For instance, `grade_lists['Thorny'] == [100, 90, 80]`." 
233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 13, 238 | "metadata": { 239 | "nbgrader": { 240 | "grade": false, 241 | "grade_id": "grade_lists", 242 | "locked": false, 243 | "schema_version": 1, 244 | "solution": true 245 | } 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "# Create a dict mapping names to lists of grades.\n", 250 | "#\n", 251 | "# YOUR CODE HERE\n", 252 | "#\n", 253 | "grade_lists = {x[0]: [int(x[1]), int(x[2]), int(x[3])] for x in grades if x[0] != 'Student'}\n" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 14, 259 | "metadata": { 260 | "nbgrader": { 261 | "grade": true, 262 | "grade_id": "grade_lists_test", 263 | "locked": true, 264 | "points": 1, 265 | "schema_version": 1, 266 | "solution": false 267 | } 268 | }, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "{'Foster': [89, 97, 101], 'Ursula': [73, 79, 83], 'Mac': [88, 99, 111], 'Farva': [45, 56, 67], 'Rabbit': [59, 61, 67], 'Thorny': [100, 90, 80]}\n", 275 | "\n", 276 | "(Passed!)\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "# `grade_lists_test`: Test cell\n", 282 | "print(grade_lists)\n", 283 | "assert type(grade_lists) is dict, \"Did not create a dictionary.\"\n", 284 | "assert len(grade_lists) == len(grades)-1, \"Dictionary has the wrong number of entries.\"\n", 285 | "assert {'Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster'} == set(grade_lists.keys()), \"Dictionary has the wrong keys.\"\n", 286 | "assert grade_lists['Thorny'] == [100, 90, 80], 'Wrong grades for: Thorny'\n", 287 | "assert grade_lists['Mac'] == [88, 99, 111], 'Wrong grades for: Mac'\n", 288 | "assert grade_lists['Farva'] == [45, 56, 67], 'Wrong grades for: Farva'\n", 289 | "assert grade_lists['Rabbit'] == [59, 61, 67], 'Wrong grades for: Rabbit'\n", 290 | "assert grade_lists['Ursula'] == [73, 79, 83], 'Wrong grades for: Ursula'\n", 291 | "assert grade_lists['Foster'] == [89, 97, 101], 
'Wrong grades for: Foster'\n", 292 | "print(\"\\n(Passed!)\")" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": { 298 | "nbgrader": { 299 | "grade": false, 300 | "grade_id": "cell-a628c6c0f63e7e7c", 301 | "locked": true, 302 | "schema_version": 1, 303 | "solution": false 304 | } 305 | }, 306 | "source": [ 307 | "**Exercise 3** (`grade_dicts_test`: 2 points). Write some code to compute a new dictionary, `grade_dicts`, that maps names of students to _dictionaries_ containing their scores. Each entry of this scores dictionary should be keyed on assignment name and hold the corresponding grade as an integer. For instance, `grade_dicts['Thorny']['Exam 1'] == 100`." 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 25, 313 | "metadata": { 314 | "nbgrader": { 315 | "grade": false, 316 | "grade_id": "grade_dicts", 317 | "locked": false, 318 | "schema_version": 1, 319 | "solution": true 320 | } 321 | }, 322 | "outputs": [], 323 | "source": [ 324 | "# Create a dict mapping names to dictionaries of grades.\n", 325 | "#\n", 326 | "# YOUR CODE HERE\n", 327 | "#\n", 328 | "grade_dicts = {}\n", 329 | "for key, value in grade_lists.items():\n", 330 | " grade_dicts[key] = {x:y for x,y in zip(assignments, value)}\n" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 26, 336 | "metadata": { 337 | "nbgrader": { 338 | "grade": true, 339 | "grade_id": "grade_dicts_test", 340 | "locked": true, 341 | "points": 2, 342 | "schema_version": 1, 343 | "solution": false 344 | } 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "{'Ursula': {'Exam 2': 79, 'Exam 3': 83, 'Exam 1': 73}, 'Thorny': {'Exam 2': 90, 'Exam 3': 80, 'Exam 1': 100}, 'Mac': {'Exam 2': 99, 'Exam 3': 111, 'Exam 1': 88}, 'Farva': {'Exam 2': 56, 'Exam 3': 67, 'Exam 1': 45}, 'Rabbit': {'Exam 2': 61, 'Exam 3': 67, 'Exam 1': 59}, 'Foster': {'Exam 2': 97, 'Exam 3': 101, 'Exam 1': 89}}\n", 352 | 
"\n", 353 | "(Passed!)\n" 354 | ] 355 | } 356 | ], 357 | "source": [ 358 | "# `grade_dicts_test`: Test cell\n", 359 | "print(grade_dicts)\n", 360 | "assert type(grade_dicts) is dict, \"Did not create a dictionary.\"\n", 361 | "assert len(grade_dicts) == len(grades)-1, \"Dictionary has the wrong number of entries.\"\n", 362 | "assert {'Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster'} == set(grade_dicts.keys()), \"Dictionary has the wrong keys.\"\n", 363 | "assert grade_dicts['Foster']['Exam 1'] == 89, 'Wrong score'\n", 364 | "assert grade_dicts['Foster']['Exam 3'] == 101, 'Wrong score'\n", 365 | "assert grade_dicts['Foster']['Exam 2'] == 97, 'Wrong score'\n", 366 | "assert grade_dicts['Ursula']['Exam 1'] == 73, 'Wrong score'\n", 367 | "assert grade_dicts['Ursula']['Exam 3'] == 83, 'Wrong score'\n", 368 | "assert grade_dicts['Ursula']['Exam 2'] == 79, 'Wrong score'\n", 369 | "assert grade_dicts['Rabbit']['Exam 1'] == 59, 'Wrong score'\n", 370 | "assert grade_dicts['Rabbit']['Exam 3'] == 67, 'Wrong score'\n", 371 | "assert grade_dicts['Rabbit']['Exam 2'] == 61, 'Wrong score'\n", 372 | "assert grade_dicts['Mac']['Exam 1'] == 88, 'Wrong score'\n", 373 | "assert grade_dicts['Mac']['Exam 3'] == 111, 'Wrong score'\n", 374 | "assert grade_dicts['Mac']['Exam 2'] == 99, 'Wrong score'\n", 375 | "assert grade_dicts['Farva']['Exam 1'] == 45, 'Wrong score'\n", 376 | "assert grade_dicts['Farva']['Exam 3'] == 67, 'Wrong score'\n", 377 | "assert grade_dicts['Farva']['Exam 2'] == 56, 'Wrong score'\n", 378 | "assert grade_dicts['Thorny']['Exam 1'] == 100, 'Wrong score'\n", 379 | "assert grade_dicts['Thorny']['Exam 3'] == 80, 'Wrong score'\n", 380 | "assert grade_dicts['Thorny']['Exam 2'] == 90, 'Wrong score'\n", 381 | "print(\"\\n(Passed!)\")" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": { 387 | "nbgrader": { 388 | "grade": false, 389 | "grade_id": "cell-840a57a4b61944e5", 390 | "locked": true, 391 | "schema_version": 1, 392 | "solution": false 393 | 
} 394 | }, 395 | "source": [ 396 | "**Exercise 4** (`avg_grades_by_student_test`: 1 point). Write some code to compute a dictionary named `avg_grades_by_student` that maps each student to his or her average exam score. For instance, `avg_grades_by_student['Thorny'] == 90`.\n", 397 | "\n", 398 | "> **Hint.** The [`statistics`](https://docs.python.org/3.5/library/statistics.html) module of Python has at least one helpful function." 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 31, 404 | "metadata": { 405 | "nbgrader": { 406 | "grade": false, 407 | "grade_id": "avg_grades_by_student", 408 | "locked": false, 409 | "schema_version": 1, 410 | "solution": true 411 | } 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "# Create a dict mapping names to grade averages.\n", 416 | "#\n", 417 | "# YOUR CODE HERE\n", 418 | "#\n", 419 | "avg_grades_by_student = {name: sum(score)/3 for name, score in grade_lists.items()}\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 30, 425 | "metadata": { 426 | "nbgrader": { 427 | "grade": true, 428 | "grade_id": "avg_grades_by_student_test", 429 | "locked": true, 430 | "points": 1, 431 | "schema_version": 1, 432 | "solution": false 433 | } 434 | }, 435 | "outputs": [ 436 | { 437 | "name": "stdout", 438 | "output_type": "stream", 439 | "text": [ 440 | "{'Ursula': 78.33333333333333, 'Thorny': 90.0, 'Mac': 99.33333333333333, 'Farva': 56.0, 'Rabbit': 62.333333333333336, 'Foster': 95.66666666666667}\n", 441 | "\n", 442 | "(Passed!)\n" 443 | ] 444 | } 445 | ], 446 | "source": [ 447 | "# `avg_grades_by_student_test`: Test cell\n", 448 | "print(avg_grades_by_student)\n", 449 | "assert type(avg_grades_by_student) is dict, \"Did not create a dictionary.\"\n", 450 | "assert len(avg_grades_by_student) == len(students), \"Output has the wrong number of students.\"\n", 451 | "assert abs(avg_grades_by_student['Mac'] - 99.33333333333333) <= 4e-15, 'Mean is incorrect'\n", 452 | "assert 
abs(avg_grades_by_student['Foster'] - 95.66666666666667) <= 4e-15, 'Mean is incorrect'\n", 453 | "assert abs(avg_grades_by_student['Farva'] - 56) <= 4e-15, 'Mean is incorrect'\n", 454 | "assert abs(avg_grades_by_student['Rabbit'] - 62.333333333333336) <= 4e-15, 'Mean is incorrect'\n", 455 | "assert abs(avg_grades_by_student['Thorny'] - 90) <= 4e-15, 'Mean is incorrect'\n", 456 | "assert abs(avg_grades_by_student['Ursula'] - 78.33333333333333) <= 4e-15, 'Mean is incorrect'\n", 457 | "print(\"\\n(Passed!)\")" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": { 463 | "nbgrader": { 464 | "grade": false, 465 | "grade_id": "cell-3f31ab810dcb86d1", 466 | "locked": true, 467 | "schema_version": 1, 468 | "solution": false 469 | } 470 | }, 471 | "source": [ 472 | "**Exercise 5** (`grades_by_assignment_test`: 2 points). Write some code to compute a dictionary named `grades_by_assignment`, whose keys are assignment (exam) names and whose values are lists of scores over all students on that assignment. For instance, `grades_by_assignment['Exam 1'] == [100, 88, 45, 59, 73, 89]`." 
473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 38, 478 | "metadata": { 479 | "nbgrader": { 480 | "grade": false, 481 | "grade_id": "grades_by_assignment", 482 | "locked": false, 483 | "schema_version": 1, 484 | "solution": true 485 | } 486 | }, 487 | "outputs": [], 488 | "source": [ 489 | "#\n", 490 | "# YOUR CODE HERE\n", 491 | "#\n", 492 | "grades_by_assignment = {}\n", 493 | "\n", 494 | "for i in range(1, len(assignments) + 1):\n", 495 | " grades_by_assignment['Exam {}'.format(i)] = [int(x[i]) for x in grades if x[i] != 'Exam {}'.format(i)]" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 39, 501 | "metadata": { 502 | "nbgrader": { 503 | "grade": true, 504 | "grade_id": "grades_by_assignment_test", 505 | "locked": true, 506 | "points": 2, 507 | "schema_version": 1, 508 | "solution": false 509 | } 510 | }, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "{'Exam 2': [90, 99, 56, 61, 79, 97], 'Exam 3': [80, 111, 67, 67, 83, 101], 'Exam 1': [100, 88, 45, 59, 73, 89]}\n", 517 | "\n", 518 | "(Passed!)\n" 519 | ] 520 | } 521 | ], 522 | "source": [ 523 | "# `grades_by_assignment_test`: Test cell\n", 524 | "print(grades_by_assignment)\n", 525 | "assert type(grades_by_assignment) is dict, \"Output is not a dictionary.\"\n", 526 | "assert len(grades_by_assignment) == 3, \"Wrong number of assignments.\"\n", 527 | "assert grades_by_assignment['Exam 1'] == [100, 88, 45, 59, 73, 89], 'Wrong grades list'\n", 528 | "assert grades_by_assignment['Exam 3'] == [80, 111, 67, 67, 83, 101], 'Wrong grades list'\n", 529 | "assert grades_by_assignment['Exam 2'] == [90, 99, 56, 61, 79, 97], 'Wrong grades list'\n", 530 | "print(\"\\n(Passed!)\")" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": { 536 | "nbgrader": { 537 | "grade": false, 538 | "grade_id": "cell-d763d8a25d8cac78", 539 | "locked": true, 540 | "schema_version": 1, 541 | "solution": false 
542 | } 543 | }, 544 | "source": [ 545 | "**Exercise 6** (`avg_grades_by_assignment_test`: 1 point). Write some code to compute a dictionary, `avg_grades_by_assignment`, which maps each exam to its average score." 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 41, 551 | "metadata": { 552 | "nbgrader": { 553 | "grade": false, 554 | "grade_id": "avg_grades_by_assignment", 555 | "locked": false, 556 | "schema_version": 1, 557 | "solution": true 558 | } 559 | }, 560 | "outputs": [], 561 | "source": [ 562 | "# Create a dict mapping items to average for that item across all students.\n", 563 | "#\n", 564 | "# YOUR CODE HERE\n", 565 | "#\n", 566 | "avg_grades_by_assignment = {exam: sum(scores)/len(scores) for exam, scores in grades_by_assignment.items()}" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 42, 572 | "metadata": { 573 | "nbgrader": { 574 | "grade": true, 575 | "grade_id": "avg_grades_by_assignment_test", 576 | "locked": true, 577 | "points": 1, 578 | "schema_version": 1, 579 | "solution": false 580 | } 581 | }, 582 | "outputs": [ 583 | { 584 | "name": "stdout", 585 | "output_type": "stream", 586 | "text": [ 587 | "{'Exam 2': 80.33333333333333, 'Exam 3': 84.83333333333333, 'Exam 1': 75.66666666666667}\n", 588 | "\n", 589 | "(Passed!)\n" 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "# `avg_grades_by_assignment_test`: Test cell\n", 595 | "print(avg_grades_by_assignment)\n", 596 | "assert type(avg_grades_by_assignment) is dict\n", 597 | "assert len(avg_grades_by_assignment) == 3\n", 598 | "assert abs((100+88+45+59+73+89)/6 - avg_grades_by_assignment['Exam 1']) <= 7e-15\n", 599 | "assert abs((80+111+67+67+83+101)/6 - avg_grades_by_assignment['Exam 3']) <= 7e-15\n", 600 | "assert abs((90+99+56+61+79+97)/6 - avg_grades_by_assignment['Exam 2']) <= 7e-15\n", 601 | "print(\"\\n(Passed!)\")" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": { 607 | "nbgrader": { 608 | "grade": false, 
609 | "grade_id": "cell-7d85977d9fab2482", 610 | "locked": true, 611 | "schema_version": 1, 612 | "solution": false 613 | } 614 | }, 615 | "source": [ 616 | "**Exercise 7** (`rank_test`: 2 points). Write some code to create a new list, `rank`, which contains the names of students in order by _decreasing_ score. That is, `rank[0]` should contain the name of the top student (highest average exam score), and `rank[-1]` should have the name of the bottom student (lowest average exam score)." 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 48, 622 | "metadata": { 623 | "nbgrader": { 624 | "grade": false, 625 | "grade_id": "rank", 626 | "locked": false, 627 | "schema_version": 1, 628 | "solution": true 629 | } 630 | }, 631 | "outputs": [], 632 | "source": [ 633 | "#\n", 634 | "# YOUR CODE HERE\n", 635 | "#\n", 636 | "rank_scores = sorted([(y,x) for x,y in avg_grades_by_student.items()], reverse=True)\n", 637 | "rank = [x[1] for x in rank_scores]" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 49, 643 | "metadata": { 644 | "nbgrader": { 645 | "grade": true, 646 | "grade_id": "rank_test", 647 | "locked": true, 648 | "points": 2, 649 | "schema_version": 1, 650 | "solution": false 651 | } 652 | }, 653 | "outputs": [ 654 | { 655 | "name": "stdout", 656 | "output_type": "stream", 657 | "text": [ 658 | "['Mac', 'Foster', 'Thorny', 'Ursula', 'Rabbit', 'Farva']\n", 659 | "\n", 660 | "=== Ranking ===\n", 661 | "1. Mac: 99.33333333333333\n", 662 | "2. Foster: 95.66666666666667\n", 663 | "3. Thorny: 90.0\n", 664 | "4. Ursula: 78.33333333333333\n", 665 | "5. Rabbit: 62.333333333333336\n", 666 | "6. Farva: 56.0\n", 667 | "\n", 668 | "(Passed!)\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "# `rank_test`: Test cell\n", 674 | "print(rank)\n", 675 | "print(\"\\n=== Ranking ===\")\n", 676 | "for i, s in enumerate(rank):\n", 677 | " print(\"{}. 
{}: {}\".format(i+1, s, avg_grades_by_student[s]))\n", 678 | " \n", 679 | "assert rank == ['Mac', 'Foster', 'Thorny', 'Ursula', 'Rabbit', 'Farva']\n", 680 | "for i in range(len(rank)-1):\n", 681 | " assert avg_grades_by_student[rank[i]] >= avg_grades_by_student[rank[i+1]]\n", 682 | "print(\"\\n(Passed!)\")" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": null, 688 | "metadata": { 689 | "collapsed": true 690 | }, 691 | "outputs": [], 692 | "source": [] 693 | } 694 | ], 695 | "metadata": { 696 | "celltoolbar": "Create Assignment", 697 | "kernelspec": { 698 | "display_name": "Python 3", 699 | "language": "python", 700 | "name": "python3" 701 | }, 702 | "language_info": { 703 | "codemirror_mode": { 704 | "name": "ipython", 705 | "version": 3 706 | }, 707 | "file_extension": ".py", 708 | "mimetype": "text/x-python", 709 | "name": "python", 710 | "nbconvert_exporter": "python", 711 | "pygments_lexer": "ipython3", 712 | "version": "3.5.2" 713 | } 714 | }, 715 | "nbformat": 4, 716 | "nbformat_minor": 1 717 | } 718 | -------------------------------------------------------------------------------- /Notebook5_part2_RegExYelp.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "nbgrader": { 7 | "grade": false, 8 | "grade_id": "cell-81740ad10bcffdd8", 9 | "locked": true, 10 | "schema_version": 1, 11 | "solution": false 12 | } 13 | }, 14 | "source": [ 15 | "# Part 1 of 2: Processing an HTML file\n", 16 | "\n", 17 | "One of the richest sources of information is [the Web](http://www.computerhistory.org/revolution/networking/19/314)! In this notebook, we ask you to use string processing and regular expressions to mine a web page, which is stored in HTML format." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "nbgrader": { 24 | "grade": false, 25 | "grade_id": "cell-e1821fbeefa0e2c2", 26 | "locked": true, 27 | "schema_version": 1, 28 | "solution": false 29 | } 30 | }, 31 | "source": [ 32 | "**The data: Yelp! reviews.** The data you will work with is a snapshot of a recent search on the [Yelp! site](https://yelp.com) for the best fried chicken restaurants in Atlanta. That snapshot is hosted here: https://cse6040.gatech.edu/datasets/yelp-example\n", 33 | "\n", 34 | "If you go ahead and open that site, you'll see that it contains a ranked list of places:\n", 35 | "\n", 36 | "![Top 10 Fried Chicken Spots in ATL as of September 12, 2017](https://cse6040.gatech.edu/datasets/yelp-example/ranked-list-snapshot.png)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "nbgrader": { 43 | "grade": false, 44 | "grade_id": "cell-fe765896f1d25066", 45 | "locked": true, 46 | "schema_version": 1, 47 | "solution": false 48 | } 49 | }, 50 | "source": [ 51 | "**Your task.** In this part of this assignment, we'd like you to write some code to extract this list." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": { 57 | "nbgrader": { 58 | "grade": false, 59 | "grade_id": "cell-95c9a0ef4d1838e1", 60 | "locked": true, 61 | "schema_version": 1, 62 | "solution": false 63 | } 64 | }, 65 | "source": [ 66 | "## Getting the data\n", 67 | "\n", 68 | "First things first: you need an HTML file. The following Python code will download a particular web page that we've prepared for this exercise and store it locally in a file.\n", 69 | "\n", 70 | "> If the file exists, this command will not overwrite it. By not doing so, we can reduce accesses to the server that hosts the file. Also, if an error occurs during the download, this cell may report that the downloaded file is corrupt; in that case, you should try re-running the cell." 
71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "nbgrader": { 78 | "grade": false, 79 | "grade_id": "cell-af1ae6df64a1fd40", 80 | "locked": true, 81 | "schema_version": 1, 82 | "solution": false 83 | } 84 | }, 85 | "outputs": [ 86 | { 87 | "ename": "UnicodeDecodeError", 88 | "evalue": "'charmap' codec can't decode byte 0x81 in position 711138: character maps to <undefined>", 89 | "output_type": "error", 90 | "traceback": [ 91 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 92 | "\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", 93 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 15\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 16\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'yelp.htm'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'r'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 17\u001b[1;33m \u001b[0myelp_html\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 18\u001b[0m \u001b[0mchecksum\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mhashlib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmd5\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0myelp_html\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhexdigest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 19\u001b[0m \u001b[1;32massert\u001b[0m \u001b[0mchecksum\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m\"4a74a0ee9cefee773e76a22a52d45a8e\"\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[1;34m\"Downloaded file has incorrect checksum!\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 94 | "\u001b[1;32mC:\\Anaconda3\\lib\\encodings\\cp1252.py\u001b[0m in \u001b[0;36mdecode\u001b[1;34m(self, input, final)\u001b[0m\n\u001b[0;32m 21\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mIncrementalDecoder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 22\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 23\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcharmap_decode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0minput\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mdecoding_table\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 24\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 25\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mStreamWriter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCodec\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mcodecs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mStreamWriter\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 95 | "\u001b[1;31mUnicodeDecodeError\u001b[0m: 'charmap' codec can't decode byte 0x81 in position 711138: character maps to <undefined>" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "import requests\n", 101 | "import os\n", 102 | "import hashlib\n", 103 | "\n", 104 | "if os.path.exists('.voc'):\n", 105 | " data_url = 
'https://cse6040.gatech.edu/datasets/yelp-example/yelp.htm'\n", 106 | "else:\n", 107 | " data_url = 'https://github.com/cse6040/labs-fa17/raw/master/datasets/yelp.htm'\n", 108 | "\n", 109 | "if not os.path.exists('yelp.htm'):\n", 110 | " print(\"Downloading: {} ...\".format(data_url))\n", 111 | " r = requests.get(data_url)\n", 112 | " with open('yelp.htm', 'w', encoding=r.encoding) as f:\n", 113 | " f.write(r.text)\n", 114 | "\n", 115 | "with open('yelp.htm', 'r', encoding='utf-8') as f:\n", 116 | " yelp_html = f.read().encode(encoding='utf-8')\n", 117 | " checksum = hashlib.md5(yelp_html).hexdigest()\n", 118 | " assert checksum == \"4a74a0ee9cefee773e76a22a52d45a8e\", \"Downloaded file has incorrect checksum!\"\n", 119 | " \n", 120 | "print(\"'yelp.htm' is ready!\")" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": { 126 | "nbgrader": { 127 | "grade": false, 128 | "grade_id": "cell-afee39f0b7aee426", 129 | "locked": true, 130 | "schema_version": 1, 131 | "solution": false 132 | } 133 | }, 134 | "source": [ 135 | "**Viewing the raw HTML in your web browser.** The file you just downloaded is the raw HTML version of the data described previously. Before moving on, you should go back to that site and use your web browser to view the HTML source for the web page. Do that now to get an idea of what is in that file.\n", 136 | "\n", 137 | "> If you don't know how to view the page source in your browser, try the instructions on [this site](http://www.wikihow.com/View-Source-Code)." 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": { 143 | "nbgrader": { 144 | "grade": false, 145 | "grade_id": "cell-993d633285178cf8", 146 | "locked": true, 147 | "schema_version": 1, 148 | "solution": false 149 | } 150 | }, 151 | "source": [ 152 | "**Reading the HTML file into a Python string.** Let's also open the file in Python and read its contents into a string named `yelp_html`." 
153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 3, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "*** type(yelp_html) == <class 'str'> ***\n", 165 | "*** Contents (first 1000 characters) ***\n", 166 | "\n", 167 | "\n", 168 | "