├── .gitignore ├── .ipynb_checkpoints ├── Intro_ML_sklearn-checkpoint.ipynb ├── Notebook_anatomy-checkpoint.ipynb └── OCR_Example-checkpoint.ipynb ├── 00.Setup and Primers.ipynb ├── 01.ML 101 and Intro to scikit-learn.ipynb ├── 02.Our Dataset.ipynb ├── 03.Feature Engineering.ipynb ├── 04.Supervised Learning.ipynb ├── 05.Unsupervised Learning.ipynb ├── 06.Model Evaluation.ipynb ├── 07.Pipelines and Parameter Tuning.ipynb ├── Intro_ML_sklearn.ipynb ├── LICENSE ├── LICENSE_Jake_Vanderplas ├── Notebook_anatomy.ipynb ├── OCR_Example.ipynb ├── README.md ├── Resources.ipynb ├── Untitled.ipynb ├── archive └── Intro_ML_sklearn.ipynb ├── fig_code ├── ML_flow_chart.py ├── __init__.py ├── data.py ├── figures.py ├── helpers.py ├── linear_regression.py ├── sgd_separator.py └── svm_gui.py ├── imgs ├── iris-setosa.jpg ├── iris_petal_sepal.png ├── iris_with_labels.jpg ├── irises.png ├── ml_map.png ├── ml_process.png ├── ml_process_by_micheleenharris.png ├── ml_process_mharris2.png ├── pca1.png └── sgd_boundary_scatter.png ├── misc └── Wine Quality Python3.ipynb └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Notebook_anatomy-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic Anatomy of a Notebook and General Guide\n", 8 | "* Note this is a Python 3-flavored Jupyter notebook" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### My Disclaimers:\n", 16 | "1. 
Notebooks are no substitute for an IDE for developing apps.\n", 17 | "* Notebooks are not suitable for debugging code (yet).\n", 18 | "* They are no substitute for publication-quality publishing; however, they are very useful for interactive blogging.\n", 19 | "* My main use of notebooks is for interactive teaching and as a playground for some code that I might like to share at some point (I can add useful and pretty markup text, pics, videos, etc.).\n", 20 | "* I'm also a fan because GitHub renders ipynb files nicely (even better than R Markdown for some reason)." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### In this figure are a few labels of notebook parts I will refer to\n", 28 | "![Parts](https://www.packtpub.com/sites/default/files/Article-Images/B01727_03.png)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "\n", 36 | "#### OK, change this cell to markdown to see some examples (you'll recognize this if you speak markdown)\n", 37 | "# This will be Heading1\n", 38 | "1. 
first thing\n", 39 | "* second thing\n", 40 | "* third thing\n", 41 | "\n", 42 | "A horizontal rule:\n", 43 | "\n", 44 | "---\n", 45 | "> Indented text\n", 46 | "\n", 47 | "Code snippet:\n", 48 | "\n", 49 | "```python\n", 50 | "import numpy as np\n", 51 | "a2d = np.random.randn(100).reshape(10, 10)\n", 52 | "```\n", 53 | "\n", 54 | "LaTeX inline equation:\n", 55 | "\n", 56 | "$\Delta =\sum_{i=1}^N w_i (x_i - \bar{x})^2$\n", 57 | "\n", 58 | "Markdown table:\n", 59 | "\n", 60 | "First Header | Second Header\n", 61 | "------------- | -------------\n", 62 | "Content Cell | Content Cell\n", 63 | "Content Cell | Content Cell\n", 64 | "\n", 65 | "HTML:\n", 66 | "\n", 67 | "\"You" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Shortcuts!!!\n", 75 | "* A complete list is [here](https://sowingseasons.com/blog/reference/2016/01/jupyter-keyboard-shortcuts/23298516), but these are my favorites:\n", 76 | "\n", 77 | "Mode | What | Shortcut\n", 78 | "------------- | ------------- | -------------\n", 79 | "Command (Press `Esc` to enter) | Run cell | Shift-Enter\n", 80 | "Command | Add cell below | B\n", 81 | "Command | Add cell above | A\n", 82 | "Command | Delete a cell | d-d\n", 83 | "Command | Go into edit mode | Enter\n", 84 | "Edit (Press `Enter` to enable) | Indent | Ctrl-]\n", 85 | "Edit | Unindent | Ctrl-[\n", 86 | "Edit | Comment section | Ctrl-/\n", 87 | "\n", 88 | "Try some below" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### As you can see on your Jupyter homepage, you can open up any notebook\n", 105 | "NB: You can return to the homepage by clicking the Jupyter icon in the very upper left corner at any time\n", 106 | "### You can also Upload a notebook (button on upper right)\n", 107 | "![Upload 
button](http://www.ciser.cornell.edu/data_wrangling/python_intro/images/JupyterUpload.gif)\n", 108 | "### As well as start a new notebook with a specific kernel (button to the right of Upload)\n", 109 | "![New menu](https://www.ibm.com/developerworks/community/blogs/jfp/resource/BLOGS_UPLOADED_IMAGES/irkernel48.png)\n", 110 | "\n", 111 | "> So, what's that number after `In` or `Out`? That's the order of running this cell relative to other cells (useful for keeping track of what order cells have been run). When you save this notebook, that number along with any output shown will also be saved. To reset a notebook, go to Cell -> All Output -> Clear and then Save it.\n", 112 | "\n", 113 | "You can do something like this to render a publicly available notebook on GitHub statically (I do this as a backup for presentations and course stuff):\n", 114 | "\n", 115 | "```\n", 116 | "http://nbviewer.jupyter.org/github/<username>/<repository>/blob/master/<notebook-name>.ipynb\n", 117 | "```\n", 118 | "like:
\n", 119 | "http://nbviewer.jupyter.org/github/michhar/rpy2_sample_notebooks/blob/master/TestingRpy2.ipynb\n", 120 | "\n", 121 | "
\n", 122 | "Also, you can upload or start a new interactive, free notebook by going here:
\n", 123 | "https://tmpnb.org\n", 124 | "
\n", 125 | "\n", 126 | "

\n", 127 | "Also, if you are a MSFTE ask me about some other options not in GA yet (sorry public...).\n", 128 | "\n", 129 | "> The nifty thing about Jupyter notebooks (and the .ipynb files which you can download and upload) is that you can share these. They are just written in JSON language. I put them up in places like GitHub and point people in that direction. \n", 130 | "\n", 131 | "> Some people (like [this guy](http://www.r-bloggers.com/why-i-dont-like-jupyter-fka-ipython-notebook/) who misses the point I think) really dislike notebooks, but they are really good for what they are good at - sharing code ideas plus neat notes and stuff in dev, teaching interactively, even chaining languages together in a polyglot style. And doing all of this on github works really well (as long as you remember to always clear your output before checking in - version control can get a bit crazy otherwise).\n", 132 | "\n", 133 | "### Some additional features\n", 134 | "* tab completion\n", 135 | "* function introspection\n", 136 | "* help" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "?sum()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "import json\n", 159 | "?json" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "source": [ 168 | "The MIT License (MIT)
\n", 169 | "Copyright (c) 2016 Micheleen Harris" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": { 176 | "collapsed": true 177 | }, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.4.3" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 0 203 | } 204 | -------------------------------------------------------------------------------- /00.Setup and Primers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tutorial Setup" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Check your install" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import numpy" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import matplotlib" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "import sklearn" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "import pandas" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": 
{}, 64 | "source": [ 65 | "Finding the location of an installed package and its version:" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "numpy.__path__" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "numpy.__version__" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "Or check it all at once: pip install version_information and check versions with a magic command." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "!pip install version_information" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "%load_ext version_information\n", 117 | "%version_information numpy, scipy, matplotlib, pandas, tensorflow, sklearn, skflow" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## A NumPy primer" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### NumPy array dtypes and shapes" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "import numpy as np" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": { 149 | "collapsed": false 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "a = np.array([1, 2, 3])" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": false 161 | }, 162 | 
"outputs": [], 163 | "source": [ 164 | "a" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "collapsed": false 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "b = np.array([[0, 2, 4], [1, 3, 5]])" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "b" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "collapsed": false 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "b.shape" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "b.dtype" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "a.shape" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "a.dtype" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "np.zeros(5)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "np.ones(shape=(3, 4), dtype=np.int32)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "### Common array operations" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "collapsed": false 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "c = b * 0.5" 271 | ] 272 | }, 273 | { 274 | 
"cell_type": "code", 275 | "execution_count": null, 276 | "metadata": { 277 | "collapsed": false 278 | }, 279 | "outputs": [], 280 | "source": [ 281 | "c" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": { 288 | "collapsed": false 289 | }, 290 | "outputs": [], 291 | "source": [ 292 | "c.shape" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "c.dtype" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "a" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": { 321 | "collapsed": false 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "d = a + c" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "d" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": { 343 | "collapsed": false 344 | }, 345 | "outputs": [], 346 | "source": [ 347 | "d[0]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": false 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "d[0, 0]" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "d[:, 0]" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": { 376 | "collapsed": false 377 | }, 378 | "outputs": [], 379 | "source": [ 380 | "d.sum()" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": { 387 | "collapsed": false 
388 | }, 389 | "outputs": [], 390 | "source": [ 391 | "d.mean()" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "collapsed": false 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "d.sum(axis=0)" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": { 409 | "collapsed": false 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "d.mean(axis=1)" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "### Reshaping and inplace update" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "e = np.arange(12)" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "e" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": { 449 | "collapsed": false 450 | }, 451 | "outputs": [], 452 | "source": [ 453 | "f = e.reshape(3, 4)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": { 460 | "collapsed": false 461 | }, 462 | "outputs": [], 463 | "source": [ 464 | "f" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": { 471 | "collapsed": false 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "e" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": null, 481 | "metadata": { 482 | "collapsed": false 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "e[5:] = 0" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": false 494 | }, 495 | "outputs": [], 496 | "source": [ 497 | "e" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 
| "execution_count": null, 503 | "metadata": { 504 | "collapsed": false 505 | }, 506 | "outputs": [], 507 | "source": [ 508 | "f" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "### Combining arrays" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": { 522 | "collapsed": false 523 | }, 524 | "outputs": [], 525 | "source": [ 526 | "a" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": { 533 | "collapsed": false 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "b" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": { 544 | "collapsed": false 545 | }, 546 | "outputs": [], 547 | "source": [ 548 | "d" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": false 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "np.concatenate([a, a, a])" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": { 566 | "collapsed": false 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "np.vstack([a, b, d])" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "collapsed": false 578 | }, 579 | "outputs": [], 580 | "source": [ 581 | "np.hstack([b, d])" 582 | ] 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "Also see this fun \"100 numpy exercises\" [site](https://github.com/rougier/numpy-100)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "## A Matplotlib primer" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "collapsed": false 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "%matplotlib inline" 607 | ] 608 | }, 609 | { 610 | "cell_type": 
"code", 611 | "execution_count": null, 612 | "metadata": { 613 | "collapsed": false 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "import matplotlib.pyplot as plt" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "x = np.linspace(0, 2, 10)" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": null, 634 | "metadata": { 635 | "collapsed": false 636 | }, 637 | "outputs": [], 638 | "source": [ 639 | "x" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "metadata": { 646 | "collapsed": false 647 | }, 648 | "outputs": [], 649 | "source": [ 650 | "plt.plot(x, 'o-');" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": null, 656 | "metadata": { 657 | "collapsed": false 658 | }, 659 | "outputs": [], 660 | "source": [ 661 | "plt.plot(x, x, 'o-', label='linear')\n", 662 | "plt.plot(x, x ** 2, 'x-', label='quadratic')\n", 663 | "\n", 664 | "plt.legend(loc='best')\n", 665 | "plt.title('Linear vs Quadratic progression')\n", 666 | "plt.xlabel('Input')\n", 667 | "plt.ylabel('Output');" 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": null, 673 | "metadata": { 674 | "collapsed": false 675 | }, 676 | "outputs": [], 677 | "source": [ 678 | "samples = np.random.normal(loc=1.0, scale=0.5, size=1000)" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": null, 684 | "metadata": { 685 | "collapsed": false 686 | }, 687 | "outputs": [], 688 | "source": [ 689 | "samples.shape" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "collapsed": false 697 | }, 698 | "outputs": [], 699 | "source": [ 700 | "samples.dtype" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": { 707 | "collapsed": false 708 | }, 709 | 
"outputs": [], 710 | "source": [ 711 | "samples[:30]" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [], 721 | "source": [ 722 | "plt.hist(samples, bins=50);" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": { 729 | "collapsed": false 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "samples_1 = np.random.normal(loc=1, scale=.5, size=10000)\n", 734 | "samples_2 = np.random.standard_t(df=10, size=10000)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": null, 740 | "metadata": { 741 | "collapsed": false 742 | }, 743 | "outputs": [], 744 | "source": [ 745 | "bins = np.linspace(-3, 3, 50)\n", 746 | "_ = plt.hist(samples_1, bins=bins, alpha=0.5, label='samples 1')\n", 747 | "_ = plt.hist(samples_2, bins=bins, alpha=0.5, label='samples 2')\n", 748 | "plt.legend(loc='upper left');" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "collapsed": false 756 | }, 757 | "outputs": [], 758 | "source": [ 759 | "plt.scatter(samples_1, samples_2, alpha=0.1)" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "metadata": { 766 | "collapsed": true 767 | }, 768 | "outputs": [], 769 | "source": [ 770 | "samples_3 = np.random.normal(loc=2, scale=.5, size=10000)" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": { 777 | "collapsed": false 778 | }, 779 | "outputs": [], 780 | "source": [ 781 | "fig = plt.figure()\n", 782 | "ax1 = fig.add_subplot(111)\n", 783 | "ax1.scatter(samples_1, samples_2, alpha=0.1, c='b', marker=\"s\", label='first')\n", 784 | "ax1.scatter(samples_3, samples_2, alpha=0.1, c='r', marker=\"o\", label='second')\n", 785 | "plt.show()" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | 
"Credits\n", 793 | "=======\n", 794 | "\n", 795 | "Most of this material is adapted from the Olivier Grisel's 2015 tutorial:\n", 796 | "\n", 797 | "[https://github.com/ogrisel/parallel_ml_tutorial](https://github.com/ogrisel/parallel_ml_tutorial)\n", 798 | "\n", 799 | "Original author:\n", 800 | "\n", 801 | "- Olivier Grisel [@ogrisel](https://twitter.com/ogrisel) | http://ogrisel.com\n" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": null, 807 | "metadata": { 808 | "collapsed": true 809 | }, 810 | "outputs": [], 811 | "source": [] 812 | } 813 | ], 814 | "metadata": { 815 | "kernelspec": { 816 | "display_name": "Python 3", 817 | "language": "python", 818 | "name": "python3" 819 | }, 820 | "language_info": { 821 | "codemirror_mode": { 822 | "name": "ipython", 823 | "version": 3 824 | }, 825 | "file_extension": ".py", 826 | "mimetype": "text/x-python", 827 | "name": "python", 828 | "nbconvert_exporter": "python", 829 | "pygments_lexer": "ipython3", 830 | "version": "3.5.1" 831 | } 832 | }, 833 | "nbformat": 4, 834 | "nbformat_minor": 0 835 | } 836 | -------------------------------------------------------------------------------- /01.ML 101 and Intro to scikit-learn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Machine Learning with `scikit-learn`\n", 8 | "* With Jupyter\n", 9 | "* This notebook should be compatible with python 2 and 3" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Machine Learning 101\n", 17 | "\n", 18 | "It's said in different ways, but I like the way Jake VanderPlas defines ML:\n", 19 | "\n", 20 | "> Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather that just storing and 
retrieving data items like a database system would do.\n", 21 | "\n", 22 | "He goes on to say:\n", 23 | "\n", 24 | "\"Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.\"

\n", 25 | "\n", 26 | "\n", 27 | "(more [here](http://www.astroml.org/sklearn_tutorial/general_concepts.html))

\n", 28 | "\n", 29 | "ML is much more than writing a program. ML experts write clever and robust algorithms which can generalize to answer different, but specific questions. There are still types of questions that a certain algorithm can not or should not be used to answer. I say answer instead of solve, because even with an answer one should evaluate whether it is a good answer or bad answer. Also, just an in statistics, one needs to be careful about assumptions and limitations of an algorithm and the subsequent model that is built from it.
\n", 30 | "\n", 31 | "

Here's my hand-drawn diagram of the machine learning process.
\n", 32 | "\n", 33 | "\"Smiley\n", 34 | "

\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### First example\n", 42 | "Below, we are going to show a simple case of classification in a picture.

\n", 43 | "\n", 44 | "\n", 45 | "\n", 46 | "In the figure we show a collection of 2D data, colored by their class labels (imagine one class is labeled \"red\" and the other \"blue\").\n", 47 | "The `fig_code` module is credited to Jake VanderPlas and was cloned from his github repo [here](https://github.com/jakevdp/sklearn_pycon2015) - also on our repo is his license file since he asked us to include that if we use his source code. :)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 5, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "# Imports for python 2/3 compatibility\n", 59 | "\n", 60 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 61 | "\n", 62 | "# For python 2, comment these out:\n", 63 | "# from builtins import range" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "# Plot settings for notebook\n", 75 | "\n", 76 | "# so that plots show up in notebook\n", 77 | "%matplotlib inline\n", 78 | "import matplotlib.pyplot as plt\n", 79 | "\n", 80 | "# suppress future warning (some code is borrowed/adapted)\n", 81 | "import warnings\n", 82 | "warnings.filterwarnings('ignore')" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 6, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW8AAAD7CAYAAAClvBX1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XlcVdX6x/HPYpQZFQUVzXkA5wEVLTG1TMtMS1NLM200\ntTKtW+aQ1vU2D1reyga1zFuWppmziOaQ4TzhAKiIICKCiDKcs35/YPzUjSlwYHPgeb9evATOw97P\nAfmyztp7r6201gghhLAvDmY3IIQQouAkvIUQwg5JeAshhB2S8BZCCDsk4S2EEHZIwlsIIeyQU0nt\nSCkl5yQKIUQhaK3V9Z8r0ZG31rpQb5MnTy7015a2N3kupe+trDwPeS6l960oz+VGZNpECCHskIS3\nEELYIbsI77CwMLNbsBl5LqVPWXkeIM+ltCqO56L+aU7FpjtSSpfUvoQQoqxQSqHNPmAphBDCNiS8\nhRDCDkl4F7OjR4+yYsUKoqOjzW5FCFGGSHgXo1mzPiWkQ0denfZv2rQL4YsvvzS7JSFEGSEHLIvJ\nqVOnCG7ajDfm/0aV6jU5fSKGKcPu4+jhw1StWtXs9oQQdkIOWJawEydOUK1WbapUrwlAtVp1qBJQ\ng7i4OJM7E0KUBRLexaRBgwYkxB3nyJ4dABzasY1zSQnUrVvX5M6EEGVBiS1MVd74+fkx95tvGDps\nGK7uHmRdyuCH77/H19fX7NaEEGWAzHkXs0uXLhEfH0+NGjWoUKGC2e0IIezMjea8JbyFEKIUkwOW\nQghRhsictyiy7OxsnJycUMowOCg1srKy2LhxI9nZ2XTq1AkvLy+zWxKiSGTkLQotJSWFu3reg7uH\nB55e3nzyySdmt5SvCxcu0LFTZ559fhwTJr1Bs+YtOHnypNltCVEkMvIWhTbyyafQnpX5atMhkhPi\neevZwTRu3JgePXqY3do1ZvznP3gF1GLc1A9QSvHz5x/w4rjx/Pi/H8xuTYhCk5G3KLSNERHcP2IM\nTs4u+NesTWiv/kRERJjdlkF0dAyN24bmTesEtetETGyMyV0JUTQS3qLQ/AMCiD64B8i9P+mJqH1U\nq1bN5K6M2oe0Y/PyRVy+lIElJ4cNv3xPu3btzG5LiCKxyamCSqlYIBWwAtla65B8auRUwTImIiKC\nvg/0o0Wnrpw9HYeHswPh69aW+PnsycnJbNq0CXd3d8LCwnB2dr7mcYvFwvARI/l50SIcHR1p264d\ni39eJActhV0o1vO8lVLRQButdco/1Eh4l0HR0dGEh4fj4+PDfffdh4uLS4nu/+DBg3S9sxs1GzQm\nLeUclX28WLt6FW5ubobac+fOkZ2dTdWqVUv1mTFCXK24wzsGaKu1Tv6HGglvYXNdu3WnTkhXegwY\nhtVq5ZMJT9G/552MHz/e7NaEsInivkhHA6uVUtuVUk/YaJtC3NTJkydp0qYDAA4ODjRoGULs8RMm\ndyVE8bPVqYKdtNanlVJVyA3xg1rrTdcXTZkyJe/9sLCwMnV3aGGOkJB2rF74DcNenk5GehpbV/zC\n5H9NMLstIQotPDyc8PDwm9bZfG0TpdRk4ILW+v3rPi/TJsLmUlJSuP+BfuzYEYklO4dnnn2W9959\nR+a0RZlRbHPeSil3wEFrna6U8gBWAVO11quuq5PwFsVCa01KSgqurq54eHiY3Y4QNnWj8LbFtIk/\n8ItSSl/Z3nfXB7cQxUkpRaVKlcxuQ4gSJUvClkIWi4WlS5dy5swZOnfuTFBQkNktCSFMIut52wmL\nxULv+/oQGxdPjXqN2LlxLV9+/l/69etndmtCCBMU57SJXdJak5SUhI+PD66urma3k2fx4sUcj09g\n4pxfcHRy4ujenTz97BMS3kKIa5TLtU1iYmJoEtyUBg0bUbFSZWbOmmV2S3kSExOp2aAJjk65f1dr\nNw4mJTkZq9VqcmdCiNKkXIb3QwMfpu3dDzB7/V7+vXAVb0x/k
23btpndFgChoaHs2LCa2Kj9WHJy\n+Pm/79MhtBMODuXyRyWEuIFylwhWq5VdOyK5e9AIAKrWqEWr27uzfft2kzvL1bJlSz756EPefnYw\nw0MbcObwHv73wwKz2xJClDLlLrwdHByoVr0GB3dsBSAr8zLH9u2kVq1aJnf2/wYNGsS55LNkZGTw\nx8aIUrnMqhDCXOXybJO1a9fy0ICBNGzRhvjYY3Tq2IHv58+z26vyzp8/T0ZGBgEBATafXklNTWXW\nrFmcOZNEjx7d6d27t023L4T4Z3L3+Kt069aN3bt28vLop/l+7jd2G9xaa14c9xI1AmvStHkLQjp0\n5MyZMzbbfnp6Oh1CO7FycyRntDtPPvscH3388TU1n372Gf4B1fD28WX4iJFkZmbm22d8fDxJSUk2\n602I8q5cjrzLih9++IHXpk7nlc8W4OHty4KP3sTxQhKLf15kk+1/++23zPx6Pi9+8DUA8bHHmD7i\nAVLOnQPgt99+Y+TTz/L8e1/iU7kKX00bT4cWQXz80Ud520hLS6NP3wfYtWsXOTnZ9L2/L998/RVO\nTuX2LFUhCkRG3mXQ9r/+IqTHfXj6VEQpxZ39hhAZGWmz7V+8eBHfylXzPvb1q8qlS5f4+4/w7ytX\n0u2hodRq0ASfSn70f2YCv6+4dmWECS+/gvKqzKxVO5i5IpK9h6P5+LrRuxCi4CS87VjdOnWI2rEV\nS04OAPv+3ETt2rVttv277rqLyPCVbFn5K3HRh5kzbTz39+2bN8XkV6kSCbHH8urjY49SufK1a4xs\n/+svuvQdhIOjI65ubnTs1Y/tf9nuD4wQ5ZVMm9ixrKwset/Xh6Mxx/H1q0JC7DHWrF5FcHDwLW/D\nYrEwbdp0fl68GE9PT96YMpnu3bvnPb5582aef3EcZ8+epXu3bnz4wfu4u7sDufeODGnfgYC6jfGq\nVJk/Vy/j1yWL6dy5c97XDxg4CKtvAP2fHofVauXzyS9we6sgpk+bZuglPT2d2NhYqlevLgtNCXGF\nrG1SRlksFrZs2cKFCxfo0KEDFStWLNDXv/raRBb/vopBz79OcmI8c/8zkTWrVtK6detb+vrz58+z\ncOFCMjIy6NWrF40aNbrm8bi4OO7oEoa7b2UyL2VQ0duTdWtW4+npeU3d+vXreXDAALx8KnIu6Qwf\nfvgBjw8fXqDnIkRZJOEt8lW7bj1Gvf05gfVyQ/enz96lkZ87b731ls32kZ6ezpYtW3BxcSE0NNRw\nd/fMzEyqBwby9LRPaNq+c+6B0ZH9idz+J3Xr1rVZH0LYI1mYSuTL1bUC6ann8z5OTz1PhUDbTll4\nenrSo0ePGz5++vRpnJxdaNo+d7qleu161GvSjEOHDkl4C3EDEt7l3Ouv/YtxE8Zw95AnOZcYz+6N\nq/jq/ZJdKsDf35/MS5c4um8n9Zu2IjnxNDFR+6lXr16J9iGEPZHwLuceeeQRqlSpwi+Ll+Af4MNn\n27ZRo0aNEu3Bzc2Nb7/5mseGP0aNOvWIiznGpNcn5s2fZ2dnk5aWRqVKlezyYiohioPMeYsCO3r0\nKOvWrcPb25u+fftSoUIFm2w3ISGBqKgoatWqRZ06dQCYO28ezz47CuWg8PcP4LelvxoOit6KtLQ0\nnJ2dcXNzs0mvQpQUuUjHZKtWraJ+w0b4VqxEn74PcO7KVYr2ZsOGDYS078DC5WuZ8dEsOt1+BxkZ\nGTbZdkBAAF26dMkL7n379vHCi+OY/M1iPg/fzx39h9K3X/8CbfPChQvc1fMe/AOqUbFSJZ5/4UVk\nECHKAgnvEhAVFcXDgwbz0POT+M+i9WS6ePPw4CFmtwXkjnbXrFlDVFTULdU/N/Z5HnttBiMnv8eE\nWd/j5FWJOXPmFEtvkZGRNGt/O4F1GwLQ/aGhRB87WqA/Fi+Me4lsF0++2LCfT1ZsZ/madXz11VfF\n0q8QJUnCuwSEh4fTuksPm
nfsgnfFygwZN4n1a9dgsVhM7ev3338nKLgp416bTOjtd/D6pMk3/Zqk\nxERqN2oK5L6cq9kgiISEhGLpr2bNmsQc2svlS7lhHX1gN+7uHgWa+ti8eQvdBw7H0ckJT29fOt/7\nEJu3lo4bbwhRFBLeJaBixYoknozNe7meGHccD08vU++OY7FYGPLII4x59wtemb2Qt35YzRdz5tx0\nbZSwrmEs/uJDsjIvc/p4NH/89hNdu3Ytlh67du1Kt7AuvD64J59MeJL3xz7G11/NKdBBy1o1a3J4\nV+7ZM1prju6J5LaagcXSrxAlyWYHLJVSDsBfQJzWuk8+j5fbA5ZZWVmE3dmNy9qRwAZN2LpyCW9N\nn8bIESNKvJf58+fz7bz5ODg4sHFjBHM2Hsp7bObLTzF25DAGDBhww69PTU1lyKNDWbnid9zc3Pn3\nv99i1LPPFlu/Wms2b95MfHw8bdq0KfB531FRUYR1vZOaDZqQnnaeCo6KiPD1his8hSitiv0KS6XU\nC0AbwFvC2ygzM5P58+eTmJjIHXfccc36HyVlzldfMWnqNAaMfpWM9DTmvjOZviNG02f4KBJPxjJt\nZH82hq8nKCjoptuyWCw4ODjYxal7ycnJRERE4OrqSrdu3XB1dTW7JSFuWbGGt1IqEPgaeBN4UcK7\ndArp0JHuw0bTrMMdACz9djbLvpmFl48vaSnneOedt3n6qadM7lIIcbXivjz+A2A84GOj7YlioJTC\netVBUqslh4cHDuSlcS9StWpVfH19TexOCFEQRQ5vpVRvIFFrvUspFQbc8HX0lClT8t4PCwsjLCys\nqLsXBfD8mNE8P248/Z55iYwLaaz6/kvWrV1Dw4YNzW5NCHFFeHg44eHhN60r8rSJUuot4BEgB3AD\nvICftdZDr6uTaZNSYMmSJXw7bz6uri6Me+EF2rZta3ZLN5SWloa7u7vcMk2UayWyJKxSqgswTua8\nRVGcPHmSPn0f4NDBgzgoxXvvvydz8aLckiVhhd14ePAQ6od0YfwXv5B4MpZJTw+kZYsWdOjQwezW\nhCg1bHqViNZ6Q36jblF+5OTkMOq50Xh6eeNbsRJTpk4t0FoiWmv+3LaVe4c+jVKKgFp1aNPlbrZt\nk6sihbiaXGEpbOrNN99i4587ePeXDbwxfznzFvyPOQVYS0QpRbVq1Tm8O/dKz5zsbGIO7i7xZWqF\nKO1kSVhhUx07dabb0NEEh3QCYOOyRSTt28qPCxfc8jZWr17NwIcHEdwulPjYozRt0piff/oRR0fH\n4mo7z/Hjx5n+5lsknT3L3Xf14OmnnrKLC5FE2SVz3qJE+Pn5EXcsKi+8T0VHUcOvcoG20aNHDyL/\n2s7WrVvx8/OjW7duJbIOzJkzZ+jQMZSOvR+kZsidvP/Jp8SfimfatDeKfd9CFJSMvIVN7d+/ny5d\nu9I8tCvZmZeJ2beTbVu32MW0x2effcYPv63h6WkfAZAUH8ekR3pxPsU+114XZYOMvEWJCA4OZmdk\nJEuXLsXJyYl+8+bg5+dndlu3xGq14njVOeVOzs6mL9srxI3IyLscycnJ4V+vvsbC//0PNzd3Jr/+\nGoMHDza7rVLj1KlTtGzdmruHPElg3YYs/Xomd3XpzAfvv2d2a6IcK5GLdG7SgIS3yV59bSJLV69j\n2MtvkpaSzOzXx7Bg/jy6detmdmulxqFDh5g4aTJnrxywnDB+fIkcKBXiRiS8BY2DmjJ04jvUadIM\ngGVzZ1PRksbMjz82uTMhxI3IDYgFXl5eJCfG532ckngaby8vEzsSQhSWjLztzLp16zhw4ABNmjQp\n8HTHqlWreHjwEMIeGMyF8+fYv3k9f23/k+rVqxdTt0KIopKRdxnwyr9eZejjI1ka8SfDRjzBhJdf\nKdDX33XXXaxa8TtB/l6EtWzMjsi/JLiL6OzZswx9bDht2oXw6LDHSEpKMrslUU7IyNtOnDh
xguYt\nW/H2onC8fCuSnprChP5d2Rn5F7Vr1za7vXIpOzubdu07UL1xS0J63Mv2Nb9xcn8kkdv/xNnZ2ez2\nRBkhI287l5SURJWAanj5VgTA06ciVQKqy0jPRAcOHODc+VQeeWkKjVq2Y8i4yaReSGf//v1mtybK\nAQlvO9GoUSPSz6fwx++/kJOdzeaVS0g7d5bGjRub3Vq55ezsTFZmJpacHAAsOTlkXr6Mi4uLyZ2J\n8kCmTezIrl27GDhoMEcPR1GvQUN++P47WrdubXZb5ZbVaqXXvfdx7lI2rcN6sjN8Jb4VHPn9t2Ul\nshaLKB/kPO8yxGq1SjiUEpmZmbzz7rvs33+A4OAgxr/0Eq6urma3JcoQCW8hRJmntUZrne/gJikp\nibi4ODIzM8nKysr7t3bt2gQHBxvqt2zZwurVqw31PXr0oH///ob6uXPn8vHHH19Tm5WVxTPPPMPE\niRML/ZxkYSohhE1YLJZrAsrV1RUfHx9DXWxsLAcOHMgLsb/rg4KC6Nixo6F+zZo1LFq0yBCWffr0\nYcSIEYb6zz//nDfeeMNQ/9JLL/H2228b6pcuXcrHH3+Mq6srLi4uef8OGDAg3/DOzs4mOzsbNzc3\nfHx88upr1qyZ7/elW7duNGnSJK/u7314e3vfyre1wGTkLUQplZmZSVpa2jXBlJmZSaVKlfINkAMH\nDrB161ZDmLVt25aePXsa6pctW8acOXMM2x84cCAvvviiof6jjz7ixRdfRGt9TQA+++yzTJkyxVD/\n008/MWfOHEOY9ezZk4ceeshQv2PHDrZu3WoI14YNGxIUFGSoT01NJS0tzVBf1taikWkTIW7i3Llz\nxMXFGUaKgYGB+Y7MIiMjWbFihaH+9ttvZ+DAgYb6n376iXfffdcQlkOHDmX69OmG+tmzZzNx4kRD\nOD366KOMHz/eUL9y5Up++OEHXF1drwnMzp07c8899xjqjxw5wr59+wzhWr169Xz/OORcOavGyUle\nsJckCe8y6o8//uCxx0cQd/IErVq34fv580y/aEdrTXZ2dl44OTs75/vSMS4uLu9l9dVh1qBBAzp1\n6mSo37RpEwsXLsyr+/tr7rrrLp566ilD/bx585g4caIhXJ944glmzZplqP/uu++YMWPGNUHp4uLC\nAw88wDPPPGOo37x5M8uWLbsm+FxdXWnZsiWdO3c21MfHx3P8+HFDGPv6+uY77SAESHiXSSdOnKBF\nq1Y8PHYidYNbELHkfxzYvJZ1a9dw2223GeqPHDnCli1brjmYkpmZSfPmzenVq5ehfs2aNcyePdsQ\nrn369OGVV4yX5n/55ZeMGjWKrKwsnJyc8sJp5MiR+c5BLlmyhJkzZxrCrHv37jz66KOG+t27dxMR\nEWEI1wYNGtCiRQtDfWpqKikpKYZ6FxcXuS+lsBvFFt5KKVcgAnC58rZEa/1qPnV2H96pqamcOnXK\nEGYBAQE0bdrUUL93716WL19uqA8JCWHIkCGG+mXLljFjxgxDuD744IP5ht/YsWOZ9emneHj74uzi\ngpOTM0mn43jh+ed59913DfXr1q3jm2++MYRlhw4d6Nu3r6H+6NGj7Ny501BfvXp16tSpY6jPzs7G\nYrHg4uIipzIKYSPFOvJWSrlrrTOUUo7AH8A4rfUf19UUOrwjIiJISUnJC7PatWvn+7J0+/btfPfd\nd4awvOOOO3juuecM9YsWLWLChAmG+iFDhvDll18a6n/88UcmTZp0TZC5urrSu3fvfA/wbNu2jUWL\nFhnqmzVrlu+KgKdPn+bYsWOGkaKvry+VKxtv4rt582YefmQo0xesxNnFleTE00zoF0ZS0hk8PDxu\n9dsrhCjFSmTaRCnlDoQDj2mtD1z3WKHD++mnn+b06dN5YRYWFsYTTzxhqNu3bx+rV682hGW9evVo\n27atoT41NZWkpCRDvYuLi10clNFaM+DhQew9GEW9Zm3
YGbGaF8aOZvxLL5ndmhDCRop75O0ARAL1\ngNla6wn51Nj9tElpZLVa+fnnnzl+/Dht2rQhLCzM7JaEEDZUrBfpaK2tQCullDewSinVRWu94fq6\nq88FDQsLk6CxAQcHBx588EGz2xBC2Eh4eDjh4eE3rbP52SZKqdeBDK31e9d9XkbeQghRQMW2nrdS\nyk8p5XPlfTegB7CrqNsVQghxY7aYNqkGfKtyT5x1AOZprdfaYLtCCCFuQC7SEUKIUkxugyaEEGWI\nhLcQQtih0n8lihBCmEBrbVi7/Eb//tNjLVu2zHehtaKS8BZCmOZG4VjQgCxo/a3WKKUMV19f/+8/\nPebq6nrDmzcUlRywFKIMs1qtxRpuRanJysrCarXeUhgWNDyvX6a3MKFbWm7sILdBE6IYaK0LFI62\nHD3eymM5OTmFDq4bPebu7n5LQXsrNY6OjrI8byHJyFuUalprcnJybBZqtq7Jzs7G2dm5QIFlq5Fl\nfvXOzs7X9OHs7CzhaOfkZgwiX38flLF1uNmqPisrC0dHxxsG162MIG0x2vynl+YSjqI4SXibqCAH\nZYr7AEx+NQ4ODjYJwuIYWTo7O5eKeUchzFKmw/ufDsqU9AGY/Oqvv9u2LV8m22JEKeEoROll1wcs\n+/fvT1RU1A2D9EYHZYoSblcflClqgMpBGSGErdnFyPvAgQNYrdYbhqYclBFClFVletpECCHKKlmY\nSgghyhAJbyGEsEMS3kIIYYckvIUQwg5JeAshhB2S8BZCCDsk4S2EEHbILq6wFEKIm/l7DSGlFBUq\nVDA8furUKaKjow1LWTRo0IBWrVoZ6jds2MBvv/1mWAKjV69eDBo0yFA/Z84c3n33XUP9mDFjeOON\nN2z+fCW8hRC35EZrCHl5eVGlShVD/ZEjR9i1a5ehvmXLlnTp0sVQv2LFCr7//nvD8hf9+/fnmWee\nMdTPmjWLiRMn5tX9fWOHcePGMX36dEP9+vXrmT17tmFJi759++Yb3i4uLlSuXNlQ36RJk3y/P/fd\ndx8dO3Y0XAXu7u5+K9/eAivyFZZKqUBgLuAPWIEvtNYf51MnV1gKUQAXLlzg7NmzhsD09/enXr16\nAISHh3P06FGaNWtGhQoViIiIMIRfx44d6dOnj2H7v/zyC59++qlh+0OGDOFf//qXof6DDz7glVde\nMYTTk08+mW/9kiVLmDdvnqG+e/fu+fazd+9eduzYYaivU6cO9evXN9RnZGSQmZlZ5tcQKrbL45VS\nAUCA1nqXUsoTiATu11ofuq5OwluYSmtNdnY2kDuqul5iYmLey+qrA7BOnTq0bt3aUL9161Z+/fVX\nQ1jeeeedDB061FC/YMEC3nrrLcP2R4wYwdtvv22onzNnDtOmTTOs5TNo0CDGjBnDi2PH8OP3cwny\nc2N3wkX6PjQQJxdXw00h2rdvT7du3Qzbj4mJ4ciRI4awrFq1KlWrVi3Mt1gUgxJb20QptRj4RGu9\n9rrPS3iXcX/f9eb6kZy7uzt+fn6G+piYGHbu3GmoDwoKomvXrob69evXM3fuXEN9r169GDNmjKH+\n66+/Zvz48dfUOjk58dxzz/HBBx8Y6mfMmMGcOXPw8fGhYsWKeWHWu3dvRowYYaj/888/Wb16tSH8\ngoODCQkJMdQnJSWRkJBgCGMPDw/c3Nxu9dsM5C7W1iW0PR91r4aniyPJGdmMWXWK2JNxVK5cuUDb\nEqVbiSwJq5SqDbQEttlyuyJ/GRkZJCcnG0ZylStXzvdl5sGDB1m/fr2hvk2bNjzwwAOG+t9//52P\nP/7YUN+vXz8mT55sqJ89ezajR482hNOwYcPynYM8ePAg8+bNM9TnN38K4O/vz+23326or127dr71\nAwYM4N57772m1sEh/xOs3pw2lY/ff4/GVTw4cDSBoVOnMWbs8/nW/i0kJCTfkL6RKlWq3PC5FVRi\nYiI1fN3xdMldi72
yuzO+HhVISkqS8C4nbDbyvjJlEg5M01ovyedxffUvfFhYGGFhYTbZty1ZrVas\nVitOTsa/a2fPniUmJsZw84XAwEDatGljqN+xYweLFy821IeGhvL4448b6hcvXsyUKVMMYTlo0CA+\n+ugjQ/38+fN55ZVXDEvl9uvXj1deecVQHxERwYIFCwwjxbZt23LPPfcY6o8fP87+/fvzfVldvXp1\nQ73W2i7nHGNiYmjToikfdKtORTcnki5m88KaUxyLPWGzsLW1s2fP0rh+PZ5r6U2rah5sOH6BhdHZ\nRB8/iaurq9ntiSIIDw8nPDw87+OpU6cW37SJUsoJWAb8rrU2pgxFmzZZuXIlycnJZGZmkpmZSf36\n9enevbuhbsuWLXz11VeG8LvzzjsZN26coX7hwoWMHTvWcGOHkSNH8sUXXxjqFy9ezJtvvmkY+d19\n992MGjXKUL9jxw6WLVtmqG/SpAmdO3c21CclJREXF2cIYw8PDzw9PQv1vRM3t2nTJp4e3I+3Ov//\niPX5dYksWb2BZs2amdjZP4uIiGDQgAdJTEqmTq1Aflr8Ky1atDC7LWFjxTrnrZSaC5zVWr/4DzWF\nDu8XXniBxMTEvDDr1KlTvgeEDh06REREhCH8brvttnx/CS9evMiFCxeuOQ3IycnJLkePovCSk5Np\nWLcOL7T1pbm/B3+dSmf23nRiTsbh4eFhdns3lZWVle8BWFE2FOfZJp2ACGAvoK+8vaq1XnFdnRyw\nFKXW2rVrGfhgf7IyM/Hw8GDRkl8JDQ01uy0h5E46QtyMxWIhJSWFSpUq3fDAphAlTcJbCCHskNwG\nTQghyhAJbyGEsEMS3kIIYYckvEWZY7VaeXP6NJo2qk/bFk1ZssRwzZgQdk/CW5Q5M/79JvM//YDH\naufQq2IqI4c9woYNG8xuSwibkrNNRJnTrFEDht6WTSO/3MWefjmYjEeHfsz8bLbJnQlRcHK2iSi1\nLl++zGv/epme3cIYPeoZUlJSirQ9N3d30jIteR+nZWnc7OBKSSEKQkbeokRlZmZy5swZ/P39cXFx\nQWvNvT3vIvXoLu6o4cqupCziHf3YFrmz0AssLV26lOGPDObeOm5cyNZsTMhh2187qFOnTqG2d/Lk\nSY4dO0b9+vUJDAws1DaEKCwZeQvTrVixgmpVq9C6WRDV/auwbt06Tp06xZYtm3mxXWU6BHrxVMtK\nZJxL5M8//yz0fu677z5+WbYcr9D+1Ov1WJGCe86XX9I8qAljHxtIs6DGfPPN14XuSwhbkpG3KBHJ\nyck0qFu0hwJFAAARwElEQVSbCSGVCKrizp7Ei8zYnEhVv8okJZ1h8h2BNPRzQ2vNhA1JfLlwMbff\nfrupPSckJNC4fj3+HRZADW8X4tIyeXVDIkeiY0vtUrGi7CmRmzEIcSOHDx8mwNuNoCq5N2Nt7u+B\nl5PmwVpwuXoVJq0/ycjWVYm5YMXVt0qBbnJQXI4fP041X3dqeOeu2Bfo7UpVbzdOnCi963yL8kOm\nTUSJqFmzJvHnL5J0MfcekonpWVzItNDU34Pu9XxpHejDmvOeBHbuw/qNf5SKGwrUq1ePhLRLHD13\nGYAjyZc4m55J3bp1Te5MCBl5ixISGBjIlKnTeHnqZOpX8WDP8SQeDK6EbwUntNZcxolJU6fx8MMP\nm91qHj8/P776dh7Dhz6Kr7sLqZey+Hb+93h7e7Nnzx4AgoODcXR0NLlTUR7JnLcoUYcOHeLw4cOs\nW7uGJQvm0r2mKzFpVhIcK7ItcmepvPlBWloaJ0+epGbNmiiluLtbV07GHAWgZt0GrFyzDi8vL5O7\nFGWVLAkrSp0ff/yRdWtWUb1GTcaMHYuPj4/ZLd3UC2NGs3fFQp5rUwmAmZHnaH7PQN7/6BOTOxNl\nlYS3EDZwV9c7aGeJpn1g7kh7W9wFtjvWZdX6CJM7E2WVnOcthA00bd6SbQlZWLXGY
tVsS8iiWYtW\nZrclyiEZeYsiyczMZMa/32J35HYaBTXltdcn2eWd7rXWZGdn3/RGvhcuXOCeHt2IORoFQJ36jVix\nZp1dPmdhH2TaRNjc35e2pxzZRWiACzuSssmsdBsb/tiCk5P9nMj0/Xff8ewzT3Ex4xKtWzbn5yXL\nuHTpEvPnz0NrzeDBQ2jUqFFevcVi4eDBgwA0adJEzjYRxUrCW9hcdHQ07Vu34L89a+DkoLBqzfNr\nE/lp+Wratm1rdnu3ZPfu3XS7ozOvh/pRy8eV/x1M4aDVj7iTcdxewwUFhMdlsjZ8A61ayfSIKHky\n5y1szmKx4OigcLjy30oBTo4O5OTkmNpXQWzevJmQGh7UqVgBRwfFQ00qcujgQe6t48rwFn481sKP\nhxq4M33KJLNbFeIaNglvpdQcpVSiUmqPLbYn7EO9evWo37Axn+44x+6Ei3y1JwX3ilVo3bq12a3d\nsoCAAGJSs7FYc18VHjt3GWdHRRV357waP3cnLqSlmtWiEPmy1cj7a+BuG21L2AkHBweWr1pD/bC+\nrEivgl+7u1kXsemmB/1Kkz59+lCvWRv+tTGJ97edYWr4Sbrc5sMP+85y9NxlolMus/DwRfoNHGR2\nq0Jcw2Zz3kqp24ClWuvmN3hc5rxFqWSxWFi+fDmff/5fLh38g9Ht/Fl2OIWlUSmk52gmTZ3G+Akv\no5Rh2vGWtr1o0SLi4+Pp0KEDHTp0KIZnIMqyYj9gKeEtioPFYiny2Rxaa5KTk/Hx8cHZ2fmGdYcO\nHSK0fTsGNvKkipsTC6LSGT56HK++NrFQ+7VYLPS9txfReyOp5+PE1lMXmfafd3jqqacL+1REOVQq\nwnvy5Ml5H4eFhREWFmaTfYuyZ+/evQzodz+Ho2OpVb0aC35cVKhR6+HDh7m3510kJCZi1TDr088Y\n9thjN6yPjIzkjUmvkZaaSv8Bgxg1enShRtyQe/OJ0cMH858uVXFyUMRfyGLcmlOkpV+U0wvFDYWH\nhxMeHp738dSpU80Pbxl5i1tx+fJl6tWuxUN1nOhymzfb49P5Yl86UUejqVSpUoG2FdyoAZ290+nd\nwJe4tEwmbUxi3cY/aN483/+mNjV//nzmTB/PC619gdxXAAMXHSM55XypXIBLlE4lcaqguvImRJFE\nR0fjpHO4s44Pjg6KDoFeBHi5sG/fvgJtJyMjg8PHYuhVP3fBq0BvV1pW82THjh3F0bZBaGgou0+n\nszvhIpdzrCw4kELToCYS3MImbHWq4PfAZqChUuqEUmq4LbYryic/Pz9S0i9x/lLu+eIXsywknM+g\natWqBdqOm5sb3p4eHE7OvZlCZo6VYymXqVmzps17zk/dunX54cdFfHEoi0d+ieaUe22W/PZ7iexb\nlH1yhaUolaa/MZXPPnqfFv5u7E+6zAMPP8KHn8ws8HaWLl3KsCGDCQrw5HjKJXr0uo8538wt9Dx2\nSdm0aRMzP3wfiyWHJ58dTY8ePcxuSZhELo8XdmfTpk3s2bOHhg0b0r1790JvJyYmhsjISKpVq0Zo\naKhdBPf9vXsyoKEnTg6KH6LSmbvgf9xzzz1mtyZMIOEthJ0Y9FB/fGP/4J4GFQGIiE1jn1tDVqxd\nb3Jnwgxy93hRpsTGxhIREcH58+dp37497dq1w8GhbCzVY7XmrhnzN0eH3M8JcTUJb2F3li9fzpCB\nD1HLQ3HqQhYWrejQqTO/Ll/xjxfhFJTWmvnz57Nr5w4aNmrMiBEjSmSp2yeeeY6H+6/B1VHh5KD4\ndn8as74cW+z7FfZFpk2EXdFaE+BXmRdaexFUxZ3MHCvjVsXi6uLK85P/zahRo2y2r2eefIL1yxYR\nUtWJfSlWAoPbsnjZ8hIZ4a9YsYIP3/0PVouFp54bS//+/Yt9n6J0kmkTUSZkZ2eTfD6Vxn65pw26\nOjnQoJIbl3IsHIk6ZLP9JCQk8N138/m8V03cn
R2536J5ft1Wdu7cSZs2bWy2nxvp2bMnPXv2LPb9\nCPtVNiYJRZmVnJzMxo0bOXbsGAAuLi4ENWrAssMpAJxKy2LH6XROX1K0btvOZvtNT0/Hw9UFN6fc\nXxFnR4Wvmwvp6ek224cQRSHTJqLUWr9+PQ890JdqPq6cSrnI6DHPM3X6mxw5coR7enTj1KlT5Fg1\nzk6ODB06jM8+/8JmpwFaLBZaNg0i2DmVrrd5EHk6g9/jNQeijuDl5WWTfQhxK+RUQWFXtNb4+1Vi\ndAsvWgR4kHo5hwnhiSxZsYaQkBC01iQmJpKamkrFihULfPXlrYiPj+eJ4cPYs2c39erV4/OvvqVh\nw4Z5jx8+fJgTJ04QFBRE9erVbb5/IUDmvIWdSUtLI/1iBi0CAgDwqeBE4yoeHD58mJCQEJRSBAQE\nEHDl8eJQvXp1flu5Ot/H3pw2lffffYdalTyITb7IN/O+4/777y+2XoS4noS3KJW8vb2p6OvD1rgL\ndAj0IuliNvvPpNO0aVOzW2Pv3r18+N67fHBnNXzdnDiS7MawR4dw5uw5u7qLkLBvEt6iVFJK8cuv\ny7iv1z0siMogOf0SU9+YTsuWLc1ujejoaOr7eeLrlvvr06CyG04KkpKSqFGjhsndifJCwluUWiEh\nIcScOEl0dDT+/v5UqVLF7JYACAoK4nDSBeLS3Aj0duWvU+k4Obvi7+9vdmuiHJEDlkIUwtdff8XY\n0c/h7eZCltWBxUuXERoaanZbogySs02EsLHU1FQSExOpWbMmbm5uZrcjyigJbyGEsEMlcRs0IYQQ\nJUTCWwgh7JCEtxClkMViQaYZxT+R8BaiFLlw4QJ9et+DWwVXvD3d+fD998xuSZRScsBSiFLk0cEP\nc/qvtTzdqhLnLuXwxuazfD53Ab169TK7NWESOWAphB3YsH49DzXyxsXRgQBPF7rWcGH9unU22XZU\nVBSd2rejauWKdL29E7GxsTbZrjCHTcJbKdVTKXVIKXVYKfWyLbYpRHlUtWpVolMuA7krK8amQzUb\nrFiYnp5Ot7A7aGqNY8btlamVfoweXbuQlZVV5G0LcxQ5vJVSDsBM4G4gGBiklGpc1O0KUR59/Nl/\n+WJvGjN3pvDGlmTS3avy5JNPFnm7e/bswdtJ07uBL37uzvRvUpGsi2kcPXrUBl0LM9hibZMQ4IjW\n+jiAUuoH4H7AdvekEqKUslqtrFixgoSEBNq3b09wcHCRthcaGspfO3ezdu1aPD096du3r02u3vT2\n9ubcxctk5lhxdXIgI9tCakYm3t7eRd62MIctwrsGcPKqj+PIDXQhyjSr1crA/g+wa+tGavu6Mj4+\nnU8//5KBAwcWabt169albt26NuoyV3BwMHf2uJupf6yjWUUHIs9aGDR4CIGBgTbdjyg5sqqgEIW0\ncuVKdm3dyIw7quLsqIiuXYEnR45gwIABNrsdm60opZi/YCHz588nKuoQ9zdrXuQ/MsJctgjvU0Ct\nqz4OvPI5gylTpuS9HxYWRlhYmA12L4Q5EhISqO3rirNjblDX8XUl49JlMjMzqVChgsndGTk4ODB0\n6FCz2xA3ER4eTnh4+E3rinyet1LKEYgCugGngT+BQVrrg9fVyXneolCioqIY9dRIYmNiaRsSwqzZ\nn1O5cmWz2+LAgQPc3jGEiR2rUKeiKz8fOs/enMrs3Lvf7NZEGVKsqwoqpXoCH5F79socrfWMfGok\nvEWBpaSkENy4Ib1rOtG8qhurYtNJ9qzFH9u2l4qpiR9//JEnRjzOxYxLNAtqzC9Lf+O2224zuy1R\nhsiSsMIuLV++nNdHDWdyx0oAWLVm+LITHDwSXaw3Hy4IrXWpnSoR9k+usBR2yd3dnbTL2VisuX/4\nM7KtZGZbSlVQKqVKVT+ifJCzTUSp1rlzZ6rXbcQ7fx6lia8DfyRk8/jjw/H19TW7NSFMJdMmotS7\nfPkyM2fOJ
PbYUdp16MjQoUNLxXy3ECVB5ryFEMIOyZy3EEKUIRLeQghhhyS8hRDCDkl4CyGEHZLw\nFkIIOyThLYQQdkjCWwgh7JCEtxBC2CEJbyGEsEMS3kIIYYckvIUQwg5JeAshhB2S8BZCCDsk4S2E\nEHZIwlsIIeyQhLcQQtghCW8hhLBDEt5CCGGHihTeSqkHlVL7lFIWpVRrWzUlhBDinxV15L0XeADY\nYINehBBC3CKnonyx1joKQMmtvIUQokTJnLcQQtihm468lVKrAf+rPwVo4DWt9dLiakwIIcSN3TS8\ntdY9bLWzKVOm5L0fFhZGWFiYrTYthBBlQnh4OOHh4TetU1rrIu9MKbUeeElrHfkPNdoW+xJCiPJE\nKYXW2nBcsainCvZVSp0EOgDLlFK/F2V7Qgghbo1NRt63tCMZeQshRIEVy8hbCCGEOSS8hRDCDtlF\neN/KkVd7Ic+l9CkrzwPkuZRWxfFcJLxLmDyX0qesPA+Q51JaldvwFkIIcS0JbyGEsEMleqpgiexI\nCCHKmPxOFSyx8BZCCGE7Mm0ihBB2SMJbCCHskN2Et73fck0p1VMpdUgpdVgp9bLZ/RSFUmqOUipR\nKbXH7F6KQikVqJRap5Tar5Taq5QaY3ZPhaWUclVKbVNK7bzyfN4yu6eiUEo5KKV2KKV+NbuXolBK\nxSqldl/5ufxpy23bTXhjx7dcU0o5ADOBu4FgYJBSqrG5XRXJ1+Q+F3uXA7yotQ4GOgKj7PXnorXO\nBLpqrVsBzYE7lVKdTG6rKMYCB8xuwgasQJjWupXWOsSWG7ab8NZaR2mtj5B7Mwh7EwIc0Vof11pn\nAz8A95vcU6FprTcBKWb3UVRa6wSt9a4r76cDB4Ea5nZVeFrrjCvvupL7u22XPyOlVCDQC/jS7F5s\nQFFMOWs34W3nagAnr/o4DjsOibJIKVUbaAlsM7eTwrsy1bATSADCtdb2OnL9ABhP7h277J0GViul\ntiulnrDlhot0A2Jbk1uuCTMopTyBn4CxV0bgdklrbQVaKaW8gVVKqS5aa7uaZlRK9QYStda7lFJh\n2Ocr7at10lqfVkpVITfED1555VpkpSq8bXnLtVLmFFDrqo8Dr3xOmEwp5URucM/TWi8xux9b0Fqn\nKaV+A9pif8eIOgF9lFK9ADfASyk1V2s91OS+CkVrffrKv0lKqV/InUK1SXjb67SJvf013g7UV0rd\nppRyAR4G7PooOrk/A3v7OeTnK+CA1vojsxspCqWUn1LK58r7bkAPYJe5XRWc1vpVrXUtrXVdcn9P\n1tlrcCul3K+8qkMp5QHcBeyz1fbtJrzt+ZZrWmsL8BywCtgP/KC1PmhuV4WnlPoe2Aw0VEqdUEoN\nN7unwrhyNsYQcs/M2Hnl1LSeZvdVSNWA9VfmvLcCv2qt15rcU3nnD2y66meyVGu9ylYbl8vjhRDC\nDtnNyFsIIcT/k/AWQgg7JOEthBB2SMJbCCHskIS3EELYIQlvIYSwQxLeQghhhyS8hRDCDv0f9u2d\nn3VIdQUAAAAASUVORK5CYII=\n", 95 | "text/plain": [ 96 | "" 97 | ] 98 | }, 99 | "metadata": {}, 100 | "output_type": "display_data" 101 | } 102 | ], 103 | "source": [ 104 | "# Import an example plot from the figures directory\n", 105 | "from fig_code import plot_sgd_separator\n", 106 | "plot_sgd_separator()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Above is the vector which best 
separates the two classes, \"red\" and \"blue\", using a classification algorithm called Stochastic Gradient Descent (don't worry about the details yet). The confidence intervals are shown as dashed lines.

\n", 114 | "\n", 115 | "\n", 116 | "This demonstrates a very important aspect of ML: the algorithm is generalizable, i.e., if we add some new data, a new point, the algorithm can predict whether it should be in the \"red\" or \"blue\" category.\n", 117 | "

\n", 118 | "Here are some details of the code used above:" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "from sklearn.linear_model import SGDClassifier\n", 130 | "from sklearn.datasets import make_blobs\n", 131 | "import numpy as np\n", 132 | "\n", 133 | "# we create 50 separable points\n", 134 | "X, y = make_blobs(n_samples=50, centers=2,\n", 135 | " random_state=0, cluster_std=0.60)\n", 136 | "\n", 137 | "# what's in X and what's in y??\n", 138 | "print(X[0:10,:])\n", 139 | "print(y[0:10])\n", 140 | "\n", 141 | "target_names = np.array(['blue', 'red']) # <-- what am I naming here?" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "clf = SGDClassifier(loss=\"hinge\", alpha=0.01,\n", 153 | " max_iter=200, fit_intercept=True)\n", 154 | "\n", 155 | "# fit the model -> more details later\n", 156 | "clf.fit(X, y)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Add some of your own data and make a prediction in the cell below.\n", 164 | "\n", 165 | "Data could be a single x, y point or array of x, y points. e.g. 
`[[0, 5]]`" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [], 175 | "source": [ 176 | "X_test = [] # <-- your data here (as 2D array)\n", 177 | "y_pred = clf.predict(___) # <-- what goes here?\n", 178 | "\n", 179 | "# predictions (decode w/ target names list)\n", 180 | "target_names[y_pred]" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "\n", 188 | "> ML TIP: ML can only answer 5 questions:\n", 189 | "* How much/how many?\n", 190 | "* Which category?\n", 191 | "* Which group?\n", 192 | "* Is it weird?\n", 193 | "* Which action?\n", 194 | "\n", 195 | "

explained well by Brandon Rohrer [here](https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Data-Science-for-Rest-of-Us)

\n", 196 | "\n", 197 | "As far as algorithms for learning a model (i.e. running some training data through an algorithm), it's nice to think of them in two different ways (with the help of the [machine learning wikipedia article](https://en.wikipedia.org/wiki/Machine_learning)). \n", 198 | "\n", 199 | "The first way of thinking about ML is by the type of information or **input** given to a system. So, given those criteria, there are three classical categories:\n", 200 | "1. Supervised learning - we get the data and the labels\n", 201 | "2. Unsupervised learning - only get the data (no labels)\n", 202 | "3. Reinforcement learning - reward/penalty based information (feedback)\n", 203 | "\n", 204 | "Another way of categorizing ML approaches is to think of the desired **output**:\n", 205 | "1. Classification\n", 206 | "2. Regression\n", 207 | "3. Clustering\n", 208 | "4. Density estimation\n", 209 | "5. Dimensionality reduction\n", 210 | "\n", 211 | "--> This second approach (by desired output) is how `sklearn` categorizes its ML algorithms.

\n", 212 | "\n", 213 | "\n", 214 | "\n", 215 | "The problem solved in supervised learning (e.g. classification, regression)\n", 216 | "\n", 217 | "Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.\n", 218 | "

\n", 219 | "\n", 220 | "All supervised estimators in `sklearn` implement a `fit(X, y)` method to fit the model and a `predict(X)` method that, given unlabeled observations X, returns the predicted labels y.

\n", 221 | "\n", 222 | "\n", 223 | "Common algorithms you will use to train a model and then use to predict the labels of unknown observations are: classification and regression. There are many types of classification and regression (for examples check out the `sklearn` algorithm cheatsheet below).\n", 224 | "\n", 225 | "The problem solved in unsupervised learning (e.g. dimensionality reduction, clustering)\n", 226 | "In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data.\n", 227 | "

\n", 228 | "Unsupervised models have a `fit()`, `transform()` and/or `fit_transform()` in `sklearn`.\n", 229 | "\n", 230 | "#### There are some instances where ML is just not needed or appropriate for solving a problem.\n", 231 | "Some examples are pattern matching (e.g. regex), group-by and data mining in general (discovery vs. prediction)." 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "#### EXERCISE: Should I use ML or can I get away with something else?\n", 239 | "\n", 240 | "* Looking back at previous years, by what percent did housing prices increase over each decade?
\n", 241 | "* Looking back at previous years, and given the relationship between housing prices and mean income in my area, how much will a house in my area cost in two years, given my income?
\n", 242 | "* A vacuum like a Roomba has to decide whether to vacuum the living room again or return to its base.
\n", 243 | "* Is this image a cat or dog?
\n", 244 | "* Are orange tabby cats more common than other breeds in Austin, Texas?
\n", 245 | "* Using my database on housing prices, group my housing prices by whether or not the house is under 10 miles from a school.
\n", 246 | "* What is the weather going to be like tomorrow?
\n", 247 | "* What is the purpose of life?" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "## A very brief introduction to scikit-learn (aka `sklearn`)\n", 255 | "\n", 256 | "As a gentle intro, it is helpful to think of the `sklearn` approach as having layers of abstraction. This famous quote certainly applies:\n", 257 | "\n", 258 | "> Easy reading is damn hard writing, and vice versa.
\n", 259 | "--Nathaniel Hawthorne\n", 260 | "\n", 261 | "In `sklearn`, you'll find you have a common programming choice: to do things very explicitly, e.g. pre-process data one step at a time, perhaps do a transformation like PCA, split data into training and test sets, define a classifier or learner with desired parameters, train the classifier, use the classifier to predict on a test set and then analyze how well it did. \n", 262 | "\n", 263 | "A different approach and something `sklearn` offers is to combine some or all of the steps above into a pipeline, so to speak. For instance, one could define a pipeline which does all of these steps at one time and perhaps even pits multiple learners against one another or does some parameter tuning with a grid search (examples will be shown towards the end). This is what is meant here by layers of abstraction.

\n", 264 | "\n", 265 | "So, in this particular module, for the most part, we will try to be explicit regarding our process and give some useful tips on options for a more automated or pipelined approach. Just note, once you've mastered the explicit approaches you might want to explore `sklearn`'s `GridSearchCV` and `Pipeline` classes.\n", 266 | "

\n", 267 | "Here is `sklearn`'s algorithm diagram - (note, this is not an exhaustive list of model options offered in `sklearn`, but serves as a good algorithm guide). The interactive version is [here](http://scikit-learn.org/stable/tutorial/machine_learning_map/).\n", 268 | "![](imgs/ml_map.png)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "### Your first model - a quick multiclass logistic regression\n", 276 | "* `sklearn` comes with many datasets ready-to-go for `sklearn`'s algorithms like the `iris` data set\n", 277 | "* In the next module we'll explore the `iris` dataset in detail\n", 278 | "---\n", 279 | "Below, notice some methods like *`fit`, `predict` and `predict_proba`*. Many of the classifiers you'll see will share method names like these. (Note this is a supervised learning classifier)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "collapsed": true 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "from sklearn.datasets import load_iris\n", 291 | "iris = load_iris()\n", 292 | "\n", 293 | "# Leave one value out from training set - that will be test later on\n", 294 | "X_train, y_train = iris.data[:-1,:], iris.target[:-1]" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "from sklearn.linear_model import LogisticRegression\n", 306 | "\n", 307 | "# our model - a multiclass regression\n", 308 | "logistic = LogisticRegression()\n", 309 | "\n", 310 | "# train on iris training set\n", 311 | "logistic.fit(X_train, y_train)\n", 312 | "\n", 313 | "# place data in array of arrays (1D -> 2D)\n", 314 | "X_test = iris.data[-1,:].reshape(1, -1)\n", 315 | "\n", 316 | "y_predict = logistic.predict(X_test)\n", 317 | "\n", 318 | "print('Predicted class %s, real class %s' % (\n", 319 | " y_predict, 
iris.target[-1]))\n", 320 | "\n", 321 | "print('Probabilities of membership in each class: %s' % \n", 322 | " logistic.predict_proba(X_test))" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "QUESTION:\n", 330 | "* What would have been good to do before plunging right in to a logistic regression model?" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "#### Some terms you will encounter as a Machine Learnest\n", 338 | "\n", 339 | "Term | Definition\n", 340 | "------------- | -------------\n", 341 | "Training set | set of data used to learn a model\n", 342 | "Test set | set of data used to test a model\n", 343 | "Feature | a variable (continuous, discrete, categorical, etc.) aka column\n", 344 | "Target | Label (associated with dependent variable, what we predict)\n", 345 | "Learner | Model or algorithm\n", 346 | "Fit, Train | learn a model with an ML algorithm using a training set\n", 347 | "Predict | w/ supervised learning, give a label to an unknown datum(data), w/ unsupervised decide if new data is weird, in which group, or what to do next with the new data\n", 348 | "Accuracy | percentage of correct predictions ((TP + TN) / total)\n", 349 | "Precision | percentage of correct positive predictions (TP / (FP + TP))\n", 350 | "Recall | percentage of positive cases caught (TP / (FN + TP))" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "> PRO TIP: Are you a statistician? Want to talk like a machine learning expert? 
Here you go (from the friendly people at SAS ([here](http://www.sas.com/it_it/insights/analytics/machine-learning.html))): \n", 358 | "\n", 359 | "A Statistician Would Say | A Machine Learnest Would Say\n", 360 | "------------- | -------------\n", 361 | "dependent variable | target\n", 362 | "variable | feature\n", 363 | "transformation | feature creation\n" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "Created by a Microsoft Employee.\n", 371 | "\t\n", 372 | "The MIT License (MIT)
\n", 373 | "Copyright (c) 2016 Micheleen Harris" 374 | ] 375 | } 376 | ], 377 | "metadata": { 378 | "kernelspec": { 379 | "display_name": "Python 3", 380 | "language": "python", 381 | "name": "python3" 382 | }, 383 | "language_info": { 384 | "codemirror_mode": { 385 | "name": "ipython", 386 | "version": 3 387 | }, 388 | "file_extension": ".py", 389 | "mimetype": "text/x-python", 390 | "name": "python", 391 | "nbconvert_exporter": "python", 392 | "pygments_lexer": "ipython3", 393 | "version": "3.4.3" 394 | } 395 | }, 396 | "nbformat": 4, 397 | "nbformat_minor": 0 398 | } 399 | -------------------------------------------------------------------------------- /02.Our Dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# (Collect Data), Visualize and Explore\n", 8 | "* Well, the collection has already been done for us and this dataset is included with `sklearn`\n", 9 | "* In reality, many datasets will need to go through a preprocessing and exploratory data analysis step. `sklearn` has many tools for this.\n", 10 | "\n", 11 | "\"Smiley
\n", 12 | "\n", 13 | "## The Dataset - Fisher's Irises" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a\n", 21 | "**two-dimensional array or matrix**. The arrays can be\n", 22 | "either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.\n", 23 | "The size of the array is expected to be `n_samples x n_features`.\n", 24 | "\n", 25 | "- **n_samples:** The number of samples: each sample is an item to process (e.g. classify).\n", 26 | " A sample can be a document, a picture, a sound, a video, an astronomical object,\n", 27 | " a row in database or CSV file,\n", 28 | " or whatever you can describe with a fixed set of quantitative traits.\n", 29 | "- **n_features:** The number of features or distinct traits that can be used to describe each\n", 30 | " item in a quantitative manner. Features are generally real-valued, but may be boolean or\n", 31 | " discrete-valued in some cases.

\n", 32 | "\n", 33 | "The number of features must be fixed in advance. However it can be very high dimensional\n", 34 | "(e.g. millions of features) with most of them being zeros for a given sample. This is a case\n", 35 | "where `scipy.sparse` matrices can be useful, in that they are\n", 36 | "much more memory-efficient than numpy arrays.\n", 37 | "

\n", 38 | "If there are labels or targets, they need to be stored in **one-dimensional arrays or lists**." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Today we are going to use the `iris` dataset which comes with `sklearn`. It's fairly small as we'll see shortly." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "> Remember our ML TIP: Ask sharp questions.

e.g. Of the three given classes, which type of flower is this (pictured below) closest to?\n", 53 | "\n", 54 | "\"iris\n", 55 | "

from http://www.madlantern.com/photography/wild-iris

\n", 56 | "\n", 57 | "### Labels (species names/classes):\n", 58 | "\"iris" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | " TIP: Commonly, machine learning algorithms will require your data to be standardized, normalized or even regularized and preprocessed. In `sklearn` the data must also take on a certain structure.\n" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# Imports for python 2/3 compatibility\n", 77 | "\n", 78 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 79 | "\n", 80 | "# For python 2, uncomment this:\n", 81 | "# from builtins import range" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "QUICK QUESTION:\n", 89 | "1. What do you expect this data set to be if you are trying to recognize an iris species?\n", 90 | "* For our `[n_samples x n_features]` data array, what do you think\n", 91 | " * the samples are?\n", 92 | " * the features are?" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "collapsed": false 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "from sklearn import datasets\n", 104 | "\n", 105 | "iris = datasets.load_iris()\n", 106 | "\n", 107 | "print(type(iris.data))\n", 108 | "print(type(iris.target))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## Let's Dive In!" 
116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "%matplotlib inline\n", 127 | "import pandas as pd\n", 128 | "import numpy as np\n", 129 | "import matplotlib.pyplot as plt" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "#### Features (aka columns in data)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "import pandas as pd\n", 148 | "from sklearn import datasets\n", 149 | "iris = datasets.load_iris()\n", 150 | "\n", 151 | "# converting to dataframe for clearer printing\n", 152 | "pd.DataFrame({'feature name': iris.feature_names})" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "#### Targets (aka labels)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "import pandas as pd\n", 171 | "from sklearn import datasets\n", 172 | "iris = datasets.load_iris()\n", 173 | "\n", 174 | "# converting to dataframe for clearer printing\n", 175 | "pd.DataFrame({'target name': iris.target_names})" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "> `sklearn` TIP: all included datasets for have at least `feature_names` and sometimes `target_names`" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Get to know the data - explore\n", 190 | "* Features (columns/measurements) come from this diagram\n", 191 | "\"iris\n", 192 | "* Shape\n", 193 | "* Peek at data\n", 194 | "* Summaries" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 
201 | "Shape and representation" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "import pandas as pd\n", 213 | "from sklearn import datasets\n", 214 | "\n", 215 | "iris = datasets.load_iris()\n", 216 | "\n", 217 | "# How many data points (rows) x how many features (columns)\n", 218 | "print(iris.data.shape)\n", 219 | "print(iris.target.shape)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "Sneak a peek at data (and a reminder of your `pandas` dataframe methods)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "collapsed": false 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "# convert to pandas df (adding real column names) to use some pandas functions (head, describe...)\n", 238 | "iris.df = pd.DataFrame(iris.data, \n", 239 | "                      columns = iris.feature_names)\n", 240 | "\n", 241 | "\n", 242 | "# first few rows\n", 243 | "iris.df.head()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Describe the dataset with some summary statistics" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "# summary stats\n", 262 | "iris.df.describe()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "* We don't have to do much with the `iris` dataset. It has no missing values. It's already in numpy arrays and has the correct shape for `sklearn`. However we could try standardization and/or normalization. 
(later, in the transforms section, we will show one hot encoding, a preprocessing step)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "## Visualize\n", 277 | "* There are many ways to visualize and here's a boxplot:" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 11, 283 | "metadata": { 284 | "collapsed": false 285 | }, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "" 291 | ] 292 | }, 293 | "execution_count": 11, 294 | "metadata": {}, 295 | "output_type": "execute_result" 296 | }, 297 | { 298 | "data": { 299 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeEAAAFeCAYAAACy1qeuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X90VOWdx/HPkAkZyAxLAjN0j1LAWMoP0WpigcPhx4Ld\n8iu7ipk2AQKU6NYIdhVpaWg1kJ5SiBXdEmHJqVuKilTs6LHIEesiorSBwGERUiKIEH+tJEggmSQa\nhtz9g+2UlB8ZMjM8k8n79Rdz780z32Qe5jPPnec+12ZZliUAAHDNdTFdAAAAnRUhDACAIYQwAACG\nEMIAABhCCAMAYAghDACAIYQwEKN2796tzMzMkLdHwoEDB1RYWNiu53n88ce1c+fOsGtobGzUvffe\nq+bm5rDbAmIdIQwg6MiRIzpx4sRV/9z+/ft19OhRjRo1KuwaunfvrqlTp+qJJ54Iuy0g1tlNFwB0\nNI2NjSooKNCHH34om82mm266SUVFRZKkN998U2vWrFEgEJDD4dCiRYt0yy23qKSkRIcPH9apU6dU\nU1OjwYMH6+c//7mSk5P15ptvau3atQoEAjp16pT+9V//Vf/+7/8eUi1nz57VL3/5S5WXl6ulpUWD\nBw/WT3/6UyUnJ2v8+PGaNm2a/vznP+t///d/NWnSJP3whz+UJJWWlur3v/+9kpOTlZGRoTfeeEPP\nP/+8Vq1aJb/fr8WLF+vOO+9UQ0ODFixYoA8++EDNzc362c9+pvT09IvqWLVqlXJzc4OPX3zxRa1b\nt04JCQlKSUnR8uXL9eGHH2rlypXyeDw6cuSIunXrpgceeEDPPPOMjh8/rm9961sqKCiQJE2aNEmP\nPfaY7r33XqWmpob7kgGxywJwVV5++WXrnnvusSzLss6dO2c98sgj1ocffmgdP37cmjp1qnX69GnL\nsizryJEj1qhRo6ympiZr1apV1ujRo63PP//csizLWrBggbVixQrLsixr1qxZVlVVlWVZlnXixAlr\nyJAhVm1trbVr1y5r6tSpFz3/hdtLSkqs4uLi4L6VK1daS5cutSzLsv7pn/4p+ByfffaZdfPNN1sf\nf/yxtWPHDmvSpElWfX29ZVmWtXjxYmv8+PGWZVmWz+ezvv/97wefZ+jQoda7775rWZZl/eY3v7Hm\nzJlzUT11dXXWN77xDevs2bOWZVnWoUOHrBEjRlifffaZZVmW9dvf/tYqLCwMtnfo0CHLsizrnnvu\nsbKzs61AIGCdOnXKGjp0qFVdXR1s9wc/+IHl8/lCfl2AjoiRMHCV0tPT9eSTTyo3N1ejRo3S7Nmz\n1bdvX23YsEEnT57UnDlzZP3/arB2u11V
VVWSpG9/+9vBUV1WVpZ+8Ytf6Ec/+pHWrFmj7du365VX\nXtEHH3wgSWpqagqplu3bt6u+vj74XWwgEFCvXr2C+ydMmCBJ6tOnj3r16qUzZ85ox44dmjhxopxO\npyRpxowZKisru2T7ffv21bBhwyRJgwcPls/nu+iYqqoqeTwe2e3n307Kyso0evRo9enTR5I0a9Ys\nSee/Y77uuus0aNAgSdJXv/pVuVyu4GjZ6XTqzJkzcrvdwf3Hjh0L6e8AdFSEMHCVrr/+er3++uva\nvXu3ysrKNHv2bD3yyCNqaWnRyJEjtXLlyuCxn376qb7yla/oj3/8YzCkJMmyLCUkJKipqUl33nmn\n/vmf/1kZGRnKysrSG2+8EQzxtpw7d04/+clPNHr0aEnnT5V/+eWXwf0Oh6PV8ZZlyW63t2q/S5fL\nTw25sGabzXbJurp06aJz584FHyckJMhmswUfNzc369NPP5Ukde3a9bLtX+p3+/vjgXjDxCzgKj3/\n/PP68Y9/rFGjRunhhx/W6NGjdeTIEY0YMUI7d+4Mjmbffvtt3XXXXcFZvtu2bZPf71dLS4teeOEF\njR8/XlVVVWpsbNSDDz6ocePGadeuXTp79myrULuS0aNH67nnnlNzc7NaWlr06KOPtjmhaezYsXr9\n9dfl9/slnf/+9q+hmZCQoEAgcFV/j759++rzzz8P/p7Dhw/Xn/70J508eVKStGHDBhUXF19Vm5L0\n8ccf64YbbrjqnwM6EkbCwFW68847VV5ersmTJ6tbt2667rrrNHv2bLlcLhUVFWnBggWSzgfamjVr\ngqPR3r1769/+7d906tQpZWRk6Pvf/74SExM1btw4TZo0SR6PR7fddpuGDh2qDz/8UImJiW3Wcv/9\n96u4uFh33XWXLMvS4MGDtWjRIklqNRq98PGIESPk9XqVnZ0th8Ohr33ta+rWrZsk6dZbb9WTTz6p\nBx54oNVEqytxuVzKyMjQrl27NHr0aA0cOFA/+tGPlJeXJ5vNJrfbrWXLlrV5avnvR8//8z//o2XL\nloVUA9BR2axQz3sBaLeSkhKdPHlSS5YsMV2KDh48qH379gVDdt26dXr33XdbnUa/Wvv27dN//ud/\nau3atRGp8aWXXtL7778fnM0NxKs2T0dblqXFixcrJydHM2fOvOjT7LZt25SVlaXs7Gxt2rQpaoUC\niIz+/ftrz549yszMVGZmpsrKyvTjH/84rDZvvfVW3XDDDXrnnXfCrq+hoUGbN2/WAw88EHZbQKxr\ncyT89ttvy+fz6YknntCf/vQnbdy4Ub/61a8knZ+JOXnyZPl8PiUlJSknJ0elpaVc1wcAQAjaHAkn\nJSWpvr5elmWpvr6+1fdUR48eVb9+/eR0OpWYmKj09HSVl5dHtWAAAOJFmxOz0tPT9eWXX2rixIk6\nffp0q+98/H6/XC5X8HFycrLq6+ujUykAAHGmzZHwr3/9a912223aunWrXnnlFS1atCh4KYLT6Qxe\n5iCd/y6nR48eV2wvEAjt0gsAAOJdmyPhxsbG4Mo6LpdLgUBALS0tkqS0tDRVVVWprq5ODodD5eXl\nysvLu2J7tbWNESi7c3C7Xaqp4cwCIoP+hEijT4XO7XZdcnubE7Pq6upUUFCg2tpanTt3TrNmzZJl\nWWpqapLX69X27dtVUlIiy7KUlZWlnJycKxbCCxY6Ojgiif6ESKNPha7dIRxpvGCho4MjkuhPiDT6\nVOguF8IsWwkAgCGEMAAAhhDCAAAYQggDAGAId1ECAASNGTNclZWHIt7uoEGDtWPHroi329ERwgCA\noKsJSo/HpepqZkeHg9PRAAAYQggDAGAIIQwAaJfCQtMVdHyEMACgXZYsMV1Bx0cIAwBgCCEMAIAh\nhDAAAIYQwgAAGEIIAwDahYlZ4SOEAQDtsnSp6Qo6PkIYAABDCGEAAAwhhAEAMIQQBgDAEEIYANAu\nrB0d
PkIYANAuXKIUPkIYAABDCGEAAAwhhAEAMIQQBgDAEEIYANAuTMwKHyEMAGgX1o4OHyEMAIAh\nhDAAAIbY2zrgpZdeks/nk81m05dffqnKykrt3LlTTqdTkrRu3Tq9+OKLSk1NlSQVFRWpf//+US0a\nAIB40GYI33XXXbrrrrsknQ/YrKysYABLUkVFhYqLizVkyJDoVQkAQBwK+XT0gQMH9P7778vr9bba\nXlFRobVr12r69OkqLS2NeIEAgNjE2tHhCzmES0tLNX/+/Iu2T5kyRUuXLtX69eu1d+9evfXWWxEt\nEAAQm7hEKXw2y7Kstg6qr6/X9OnT9Yc//OGifX6/P3h6esOGDTpz5ozy8/Mv21YgcE52e0IYJQMA\nEB/a/E5YksrLyzVixIiLtvv9fmVmZmrLli1yOBwqKytTVlbWFduqrW1sX6WdkNvtUk1NvekyECfo\nT4g0+lTo3G7XJbeHFMLHjh1T3759g483b96spqYmeb1eLVy4ULm5uUpKStLIkSM1ZsyYyFQMAECc\nC+l0dCTxqSl0fMpEJNGfEGn0qdBdbiTMYh0AgHZhYlb4CGEAQLuwdnT4CGEAAAwhhAEAMIQQBgDA\nEEIYAABDCGEAQLuwdnT4CGEAQLtwiVL4CGEAAAwhhAEAMIQQBgDAEEIYAABDCGEAQLswMSt8hDAA\noF1YOzp8hDAAAIYQwgAAGEIIAwBgCCEMAIAhhDAAoF1YOzp8hDAAoF24RCl8hDAAAIYQwgAAGEII\nAwBgCCEMAIAhhDAAoF2YmBU+QhgA0C6sHR0+QhgAAEMIYQAADCGEAQAwhBAGAMAQe1sHvPTSS/L5\nfLLZbPryyy9VWVmpnTt3yul0SpK2bdum1atXy2636+6775bX64160QAA81g7Onw2y7KsUA8uKirS\n4MGDg0EbCAQ0efJk+Xw+JSUlKScnR6WlpUpNTb1sGzU19eFX3Um43S7+XogY+hMijT4VOrfbdcnt\nIZ+OPnDggN5///1WI92jR4+qX79+cjqdSkxMVHp6usrLy8OvFgCATiDkEC4tLdX8+fNbbfP7/XK5\n/pbuycnJqq/nUxEAAKEIKYTr6+t1/PhxffOb32y13el0yu/3Bx83NDSoR48eka0QAIA41ebELEkq\nLy/XiBEjLtqelpamqqoq1dXVyeFwqLy8XHl5eVdsKyWlu+z2hPZV2wld7nsEQJJuuukmVVRURKXt\noUOH6uDBg1FpG/GD96jwhBTCx44dU9++fYOPN2/erKamJnm9XhUUFGju3LmyLEter1cej+eKbdXW\nNoZXcSfCpAe05c03/xzysR6PS9XVV9ef6H+4kqeecmnePPpIKC73YeWqZkdHAv+pQ0cII5LaE8LA\nldCnQhf27GgAABBZhDAAAIYQwkAnwepGQOwhhIFOghuwA7EnpNnRAICOa+BAp06ftkWlbY8nOpco\n9exp6fBhf9sHdnCEMADEudOnbVGZxRzNKziiFe6xhtPRAAAYQggDAGAIIQx0EkzMAmIPIQx0EkuX\nmq4AwN8jhAEAMIQQBgDAEEIYAABDCGEAAAwhhIFOgrWjgdhDCAOdBJcoAbGHEAYAwBBCGAAAQwhh\nAAAMIYQBADCEEAY6CSZmAbGHEAY6CdaOBmIPIQwAgCGEMAAAhhDCAAAYQggDAGAIIQx0EqwdDcQe\nQhjoJLhECYg9hDAAAIYQwgAAGEIIAwBgiD2Ug0pLS7Vt2zYFAgHNnDlTd955Z3DfunXr9OKLLyo1\nNVWSVFRUpP79+0elWAAA4kmbIbx7927t27dPGzduVGNjo55++ulW+ysqKlRcXKwhQ4ZErUgA4Vuy\nRJo3z3QVAC7U5unod955RwMHDtT999+v/Px8jR8/vtX+iooKrV27VtOnT1dpaWnUCgUQHtaOBmJP\nmyPh2tpaffrpp1q7dq0++ugj5efn67XXXgvunzJlimbMmCGn06l58+
bprbfe0tixY6NaNAAA8aDN\nEO7Zs6fS0tJkt9s1YMAAJSUl6dSpU8HvgGfPni2n0ylJGjt2rP7yl79cMYRTUrrLbk+IUPnxz+12\nmS4BcYT+1Dkd0E1yeyqi0rY7Kq1KBzRUbvfBKLUeO9oM4fT0dD3zzDOaM2eOTpw4oS+++EIpKSmS\nJL/fr8zMTG3ZskUOh0NlZWXKysq6Ynu1tY2RqbwTcLtdqqmpN10G4gb9qbMapoOqro78ax/N96hh\nHpeq46i/Xu4DcJshPG7cOO3Zs0dZWVmyLEuPPvqoXn31VTU1Ncnr9WrhwoXKzc1VUlKSRo4cqTFj\nxkS8eAAA4pHNsizrWj4hn8RDx0gYkfTUUy7Nm0d/6ow8HleHGwlHq2ZTLjcSZrEOoJNg7Wgg9hDC\nAAAYQggDAGAIIQwAgCGEMAAAhhDCQCfBxCwg9hDCQCfB2tFA7CGEAQAwhBAGAMAQQhgAAEMIYQAA\nDGnzBg4Arq2BA506fdoWlbY9nujcyrBnT0uHD/uj0jYQzwhhIMacPm3rkIvtA7h6nI4GAMAQQhgA\nAEMIYQAADCGEAQAwhIlZANAJRG/yXPRm3HcGhDAAxLlozLaXzgd7tNruLDgdDQCAIYQwAACGEMIA\nABhCCAMAYAghDABol8JC0xV0fIQwAKBdliwxXUHHRwgDAGAIIQwAgCGEMAAAhhDCAAAYQggDANqF\niVnhCymES0tLlZ2draysLL388sut9m3btk1ZWVnKzs7Wpk2bolIkACD2LF1quoKOr80bOOzevVv7\n9u3Txo0b1djYqKeffjq4LxAIaPny5fL5fEpKSlJOTo4mTJig1NTUqBYNAEA8aHMk/M4772jgwIG6\n//77lZ+fr/Hjxwf3HT16VP369ZPT6VRiYqLS09NVXl4e1YIBAIgXbY6Ea2tr9emnn2rt2rX66KOP\nlJ+fr9dee02S5Pf75XL97V6SycnJqq/ntlYAAISizRDu2bOn0tLSZLfbNWDAACUlJenUqVNKTU2V\n0+mU3+8PHtvQ0KAePXpcsb2UlO6y2xPCr7yTcLujdSNuxLJove7R7E/01c6J1z08bYZwenq6nnnm\nGc2ZM0cnTpzQF198oZSUFElSWlqaqqqqVFdXJ4fDofLycuXl5V2xvdraxshU3gm43S7V1HBmofOJ\nzuse3f5EX+2MCgt53UN1uQ8rbYbwuHHjtGfPHmVlZcmyLD366KN69dVX1dTUJK/Xq4KCAs2dO1eW\nZcnr9crj8US8eABA7FmyRKqpMV1Fx2azLMu6lk/Ip6bQMRLunDwel6qrO9ZIOFo1I7bxHhW6y42E\nWawDAABDCGEAAAwhhAEAMIQQBgC0C2tHh4+JWTGMSQ+d0wnPSN2kCtNlXJWDGqo+1X82XQauMSbk\nha7dlygBuLaG6WCHmx09zONStXgzBq4Wp6MBADCEEAYAwBBCGAAAQwhhAEC7FBaarqDjI4QBAO3C\nJUrhI4QBADCES5SuoTFjhquy8lBU2h40aLB27NgVlbYBANFBCF9DVxuSLNYBAPGN09EAABhCCAMA\n2oWJWeEjhGMYHRxALFu61HQFHR8hHMPo4AAQ3whhAAAMIYQBADCEEAYAwBBCGADQLqwdHT5COIbR\nwQHEMq7gCB8hHMPo4AAQ3whhAAAMIYQBADCEEAYAwBDuogTEII/HFaWWo9Nuz55WVNrFtRetW65y\nu9VLs1mWdU3/93BrvtA99ZRL8+bx90JkeDwuVVfTnxA53G41dG73pT8Aczo6hrF2NADEN0IYAABD\nQvpOeNq0aXI6nZKk66+/XsuWLQvuW7dunV588UWlpqZKkoqKitS/f//IVwoAQJxpM4Sbm5slSevX\nr7/k/oqKChUXF2vIkCGRrQwAgDjX5unoyspKNTY2Ki8vT3PmzNH+/ftb7a+oqNDatWs1ffp0lZaW\nRq1QAOFhGVQg9rQ5EnY4HMrLy5
PX69Xx48d17733auvWrerS5Xx+T5kyRTNmzJDT6dS8efP01ltv\naezYsVEvvDPgTRORtGSJVFNjugoAF2rzEqXm5mZZlqWkpCRJktfrVUlJifr06SNJ8vv9we+LN2zY\noDNnzig/P/+y7QUC52S3J0SqfgAAOqw2R8I+n0/vvfeeCgsLdeLECTU0NMjtdks6H8CZmZnasmWL\nHA6HysrKlJWVdcX2amsbI1N5J8A1eIgk+hMijT4VustdJ9zmSDgQCGjx4sX65JNPZLPZtHDhQn38\n8cdqamqS1+vVq6++qt/85jdKSkrSyJEjNX/+/CsWwgsWOjo4Ion+hEijT4Wu3SEcabxgoaODI5Lo\nT4g0+lToWDEL6OS4PzUQewjhGMabJiKJZVCB2EMIxzDeNAEgvhHCAAAYQggDAGAIIQwAgCGEMNBJ\nsAwqEHsI4RjGmyYiidn2QOwhhGMYb5oAEN8IYQAADCGEAQAwhBAGAMAQQhjoJJhjAMQeQjiG8aaJ\nSGIZVCD2EMIxjDdNAIhvhDAAAIYQwgAAGEIIAwBgCCEMdBIsgwrEHptlWda1fMKamvpr+XRRN3Cg\nU6dP20yXcVV69rR0+LDfdBm4xtxuV9z9/4NZ9KnQud2uS263X+M64s7p0zZVV0enE0arg3s8l+4M\nAIBri9PRAAAYQggDAGAIIQwAgCGEMNBJsAwqEHsIYaCTYBlUIPYQwgAAGEIIAwBgCCEMAIAhhDAA\nAIaEtGLWtGnT5HQ6JUnXX3+9li1bFty3bds2rV69Wna7XXfffbe8Xm90KgUQFtaOBmJPmyHc3Nws\nSVq/fv1F+wKBgJYvXy6fz6ekpCTl5ORowoQJSk1NjXylAMKyZIlUU2O6CgAXavN0dGVlpRobG5WX\nl6c5c+Zo//79wX1Hjx5Vv3795HQ6lZiYqPT0dJWXl0e1YAAA4kWbI2GHw6G8vDx5vV4dP35c9957\nr7Zu3aouXbrI7/fL5frbzQCSk5NVX88dNQAACEWbIdy/f3/169cv+O+ePXuqpqZGffr0kdPplN//\nt1viNTQ0qEePHldsLyWlu+z2hDDLjh0HdJPcnoqote+OQpsHNFRu98EotIxYd7nbqQHtRZ8KT5sh\n7PP59N5776mwsFAnTpxQQ0OD3O7z0ZCWlqaqqirV1dXJ4XCovLxceXl5V2yvtrYxMpXHiGE62OFu\nZTjM41I19wDtdLj3KyKNPhW6dt9POCsrS4sXL9aMGTNks9m0bNkybdmyRU1NTfJ6vSooKNDcuXNl\nWZa8Xq88Hk/EiwcQviVLpHnzTFcB4EI2y7Ksa/mE8fapyeNxdbiRcDRrRuzidUekMRIO3eVGwizW\nAQCAIYQwAACGEMIAABhCCAMAYEhIa0cDiE1jxgxXZeWhkI+/mosXBg0arB07drWjKgChIoSBDuxq\nQpKZrEDs4XQ0AACGEMIAABhCCAMAYAghDACAIYQwAACGMDs6AjyeaN7KK/Jt9+x5TZcLBwBcBiEc\npmguiM+C+wAQ3zgdDQCAIYQwAACGEMIAABhCCAMAYAghHMMKC01XAACIJkI4hi1ZYroCAEA0EcIA\nABhCCAMAYAghDACAIYQwAACGEMIxjIlZABDfCOEYtnSp6QoAANFECAMAYAghDACAIYQwAACGEMIA\nABhCCMcw1o4GgPgWUgh//vnnGjdunI4dO9Zq+7p16zR16lTNmjVLs2bN0vHjx6NRY6fFJUoAEN/s\nbR0QCARUWFgoh8Nx0b6KigoVFxdryJAhUSkOAIB41uZIeMWKFcrJyZHH47loX0VFhdauXavp06er\ntLQ0KgUCABCvrhjCPp9PvXr10qhRo2RZ1kX7p0yZoqVLl2r9+vXau3ev3nrrragVCgBAvLFZl0rX\n/zdz5kzZbDZJUmVlpQYMGKA1a9aoV69ekiS/3y+n0ylJ2rBhg86cOaP8/PwrPmEgcE52e0Kk6gcA
\noMO64nfCzz77bPDfubm5KioqahXAmZmZ2rJlixwOh8rKypSVldXmE9bWNoZZcufx1FMuzZtXb7oM\nxAm326WaGvoTIoc+FTq323XJ7SFfovTXEfHmzZu1adMmOZ1OLVy4ULm5uZo5c6YGDhyoMWPGRKZa\nSGLtaACId1c8HR0NfGoKncfjUnU1fy9EBqMWRBp9KnRhj4QBAEBkEcIAABhCCAMAYAghHMNYOxoA\n4hshHMNYOxoA4hshDACAIYQwAACGEMIAABhCCAMAYAghHMOYmAUA8Y0QjmGsHQ0A8Y0QBgDAkCve\nyhCRNWbMcFVWHrqqn/F4Qjtu0KDB2rFjVzuqAgCYQghfQ1cbktyhBADiG6ejAQAwhBAGAMAQQhgA\nAEMIYQAADCGEAQAwhBAGAMAQQhgAAEMIYQAADCGEAQAwhBAGAMAQQhgAAEMIYQAADCGEAQAwhBAG\nAMAQQhgAAEMIYQAADCGEAQAwJKQQ/vzzzzVu3DgdO3as1fZt27YpKytL2dnZ2rRpU1QKBAAgXtnb\nOiAQCKiwsFAOh+Oi7cuXL5fP51NSUpJycnI0YcIEpaamRq1YAADiSZsj4RUrVignJ0cej6fV9qNH\nj6pfv35yOp1KTExUenq6ysvLo1YoAADx5ooh7PP51KtXL40aNUqWZbXa5/f75XK5go+Tk5NVX18f\nnSoBAIhDVzwd7fP5ZLPZtHPnTlVWVmrRokVas2aNevXqJafTKb/fHzy2oaFBPXr0aPMJ3W5Xm8fg\nb/h7IZLoT4g0+lR4rhjCzz77bPDfubm5KioqUq9evSRJaWlpqqqqUl1dnRwOh8rLy5WXlxfdagEA\niCNtTsz6K5vNJknavHmzmpqa5PV6VVBQoLlz58qyLHm93ou+NwYAAJdns/7+y14AAHBNsFgHAACG\nEMIAABhCCAMAYAghHCMOHz6sPXv2mC4Dcebtt9++6iVlS0pK9Lvf/S5KFSEWXU0/OXnypIqKii67\nv7KyUqtXr45UaXGPiVkxoqSkRL1791Z2drbpUtDJlZSUyO1267vf/a7pUoC4F/IlSmif48ePq6Cg\nQHa7XZZl6Ze//KU2bNigvXv36ty5c/re976nb3zjG/L5fOratauGDh2quro6/cd//IeSkpKUkpKi\nZcuWqbm5WQ899JAsy1Jzc7OWLFmiQYMGaeXKlaqoqFBtba0GDRqkZcuWmf6VEQEPPPCAZs+erYyM\nDB08eFCrVq1S7969VVVVJcuy9OCDD+r2229XZmam+vfvr65du2rGjBlasWKFEhMT5XA49Ktf/Upb\nt27VBx98oIcfflirV6/Wf//3f6ulpUU5OTn6zne+o//6r//Sli1bZLfbdfvtt+vhhx9uVceKFSu0\nd+9e2Ww2TZ06Vbm5uSooKFBtba3OnDmj0tLSVivnoWO4sH8dOHBA3/ve9zR9+nR997vf1X333aeU\nlBSNHTtWt99+u4qKiuR0OpWamqqkpCTNnz9fCxYs0O9+9zv9y7/8i775zW/qvffek81m0+rVq/WX\nv/xFGzdu1MqVK7Vp0yZt3LhRlmVp/Pjxmj9/vp577jm9/vrr+uKLL5SSkqKSkhLZ7Z03ijrvb36N\n7Ny5U7fccot++MMfqry8XG+88YY++eQTPffcc2pubtZ3vvMdPfvss5o2bZrcbreGDRumCRMmaOPG\njXK73XrmmWf01FNPacSIEUpJSVFxcbGOHDmipqYm+f1+/cM//IOefvppWZalKVOmqLq6muu144DX\n65XP51NGRoZ8Pp/GjBmjzz77TD//+c91+vRpzZw5U5s3b1ZDQ4PmzZunQYMGqbi4WJMmTdLs2bO1\nbds21dXVSTp/jf+hQ4f0zjvv6Pe//70CgYAef/xxHT58WFu3btULL7ygLl266Ac/+IG2b98erGH7\n9u365JNgac+5AAAEiklEQVRP9MILLygQCGjGjBkaPny4JGnk
yJGaPXu2iT8NIuDC/vXSSy/poYce\n0okTJySdv2veyy+/rISEBE2bNk2PPfaY0tLS9MQTT6i6ulrS39aN8Pv9yszM1E9/+lMtXLhQO3bs\nUO/evWWz2XTq1Cn9+te/1h/+8Ad17dpVK1euVENDg06fPq3f/va3kqS8vDwdOHBAt956q5k/RAwg\nhKPM6/WqtLRUeXl56tGjh77+9a/r4MGDmjVrlizL0rlz5/Txxx8Hjz916pRcLpfcbrckKSMjQ088\n8YQWLVqk48ePKz8/X4mJicrPz5fD4dDJkyf18MMPq3v37mpqalIgEDD1qyKCRo8erccee0xnzpzR\nnj171NLSor1792r//v3BflNbWytJGjBggCTpvvvu05o1azR79mx95Stf0c033xxs79ixY8HHdrtd\nixYt0muvvaZbbrlFXbqcnxpy22236ciRI8GfOXr0qNLT04M/c/PNN+v9999v9ZzomP6+fw0dOjS4\n7/rrr1dCQoIkqbq6WmlpaZLOvxdt2bLlorYGDx4sSfrHf/xHNTc3B7d/9NFHGjhwoLp27SpJWrBg\ngSQpMTFRCxYsULdu3VRdXd3p37OYmBVlb7zxhjIyMrRu3Tp9+9vfls/n0/Dhw7V+/XqtW7dOEydO\n1Fe/+lXZbDa1tLQoNTVVfr9fJ0+elCTt3r1b/fv3165du+R2u/X000/rvvvu08qVK7Vjxw599tln\nevzxx/XQQw+pqanpohttoGOy2WyaOHGilixZom9961u68cYblZmZqfXr12vNmjWaPHmyevbsGTxW\nkl555RXdfffdWr9+vW688Ua98MILwfZuuOEGVVRUSJLOnj2rvLw8DRgwQO+++65aWlpkWZb27NnT\nKlxvvPFG7d27N/gz+/btC+7/a3CjY/r7/nXh6/nX/iSdD9ajR49Kkvbv339Vz9G3b1998MEHOnv2\nrCTpwQcfDJ4NXLlypR555BGdO3eu079nMRKOsmHDhgVvfNHS0qJVq1bplVde0YwZM9TU1KQ77rhD\n3bt310033RQ87fOzn/1M8+fPV5cuXdSjRw8tX75c0vlPks8//7xaWlo0f/58fe1rXwuOfHr37q1b\nbrlF1dXVuu666wz/1oiEu+++W3fccYf++Mc/qlevXnrkkUeUm5urhoYG5eTkyGaztXrDvPnmm/WT\nn/xE3bp1U0JCgoqKirR7925J0qBBgzR69GhlZ2fLsizl5OTo61//uiZOnBjclpGRoTvuuEOVlZWS\npLFjx6qsrEzZ2dk6e/asJk+eHBz1oOP7a/96/fXXtWvXruD2C/vUo48+qsWLFys5OVmJiYnq06dP\nqzYuPPbCf0tSamqq7rnnHs2cOVM2m03jx4/XsGHD1L17d82cOVMpKSkaMmRI8BR3Z8XsaADAJT33\n3HOaPHmyUlJS9OSTT6pr1666//77TZcVVxgJAwAuqXfv3po7d666d+8ul8ulFStWmC4p7jASBgDA\nEGZXAABgCCEMAIAhhDAAAIYQwgAAGEIIAwBgCCEMAIAh/weWZ/IADA/0rAAAAABJRU5ErkJggg==\n", 300 | "text/plain": [ 301 | "" 302 | ] 303 | }, 304 | "metadata": {}, 305 | "output_type": "display_data" 306 | } 307 | ], 308 | "source": [ 309 | "import matplotlib.pyplot as plt\n", 310 | "\n", 311 | "import pandas as pd\n", 312 | "from sklearn import datasets\n", 313 | "\n", 314 | "iris = datasets.load_iris()\n", 315 | "\n", 316 | "iris.df = pd.DataFrame(iris.data, \n", 317 | " columns = 
iris.feature_names)\n", 318 | "\n", 319 | "iris.df['target'] = iris.target\n", 320 | "\n", 321 | "df = iris.df.loc[:, ['sepal length (cm)', 'target']]\n", 322 | "\n", 323 | "df['idx'] = list(range(0, 50))*3\n", 324 | "df = df.pivot(index = 'idx', columns = 'target')\n", 325 | "df = np.array(df)\n", 326 | "plt.boxplot(df, labels = iris.target_names)\n", 327 | "plt.title('sepal length (cm)')" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "**Using `pairplot` from `seaborn` is a quick way to see which features separate out our data**" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": false 342 | }, 343 | "outputs": [], 344 | "source": [ 345 | "import seaborn as sb\n", 346 | "sb.pairplot(pd.DataFrame(iris.data, columns = iris.feature_names))" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "### Preprocessing (Bonus Material)" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": { 359 | "collapsed": true 360 | }, 361 | "source": [ 
362 | "What you might have to do before using a learner in `sklearn`:\n", 363 | "1. Non-numerics transformed to numeric (tip: use applymap() method from `pandas`)\n", 364 | "* Fill in missing values\n", 365 | "* Standardization\n", 366 | "* Normalization\n", 367 | "* Encoding categorical features (e.g. one-hot encoding or dummy variables)\n", 368 | "\n", 369 | "Features should end up in a numpy.ndarray (hence numeric) and labels in a list.\n", 370 | "\n", 371 | "Data options:\n", 372 | "* Use pre-processed [datasets](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) from scikit-learn\n", 373 | "* [Create your own](http://scikit-learn.org/stable/datasets/index.html#sample-generators)\n", 374 | "* Read from a file\n", 375 | "\n", 376 | "If you use your own data or \"real-world\" data you will likely have to do some data wrangling and need to leverage `pandas` for some data manipulation." 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "#### Standardization - make our data look like a standard Gaussian distribution (commonly needed for `sklearn` learners)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "> FYI: you'll commonly see the data or feature set (ML word for data without its labels) represented as a capital X and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array or list of lists and the targets are a 1D array or simple list." 
391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": { 397 | "collapsed": false 398 | }, 399 | "outputs": [], 400 | "source": [ 401 | "# Standardization aka scaling\n", 402 | "from sklearn import preprocessing, datasets\n", 403 | "import pandas as pd\n", 404 | "# make sure we have iris loaded\n", 405 | "iris = datasets.load_iris()\n", 406 | "\n", 407 | "X, y = iris.data, iris.target\n", 408 | "\n", 409 | "# scale it to a gaussian distribution\n", 410 | "X_scaled = preprocessing.scale(X)\n", 411 | "\n", 412 | "# how does it look now\n", 413 | "pd.DataFrame(X_scaled).head()" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [], 423 | "source": [ 424 | "# let's just confirm our standardization worked (mean is 0 w/ unit variance)\n", 425 | "pd.DataFrame(X_scaled).describe()\n", 426 | "\n", 427 | "# also could:\n", 428 | "#print(X_scaled.mean(axis = 0))\n", 429 | "#print(X_scaled.std(axis = 0))" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "> PRO TIP: To save our standardization and reapply later (say to the test set or some new data), create a transformer object like so:\n", 437 | "```python\n", 438 | "scaler = preprocessing.StandardScaler().fit(X_train)\n", 439 | "# apply to a new dataset (e.g. 
test set):\n", 440 | "scaler.transform(X_test)\n", 441 | "```" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "#### Normalization - scaling samples individually to have unit norm\n", 449 | "* This type of scaling is really important if doing some downstream transformations and learning (see sklearn docs [here](http://scikit-learn.org/stable/modules/preprocessing.html#normalization) for more) where similarity of pairs of samples is examined\n", 450 | "* A basic intro to normalization and the unit vector can be found [here](http://freetext.org/Introduction_to_Linear_Algebra/Basic_Vector_Operations/Normalization/)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": { 457 | "collapsed": false 458 | }, 459 | "outputs": [], 460 | "source": [ 461 | "# Normalization - scale each sample to unit norm\n", 462 | "from sklearn import preprocessing, datasets\n", 463 | "import pandas as pd\n", 464 | "# make sure we have iris loaded\n", 465 | "iris = datasets.load_iris()\n", 466 | "\n", 467 | "X, y = iris.data, iris.target\n", 468 | "\n", 469 | "# normalize each sample (row) to unit l1 norm\n", 470 | "X_norm = preprocessing.normalize(X, norm='l1')\n", 471 | "\n", 472 | "# how does it look now\n", 473 | "pd.DataFrame(X_norm).tail()" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "collapsed": false 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "# let's just confirm our normalization worked (summary stats of the normalized data)\n", 485 | "pd.DataFrame(X_norm).describe()\n", 486 | "\n", 487 | "# cumulative sum of normalized and original data:\n", 488 | "#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())\n", 489 | "#print(pd.DataFrame(X).cumsum().tail())\n", 490 | "\n", 491 | "# unit norm (convert to unit vectors) - all row sums should be 1 now\n", 492 | "X_norm.sum(axis = 1)" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 
498 | "source": [ 499 | "> PRO TIP: To save our normalization (like standardization above) and reapply later (say to the test set or some new data), create a transformer object like so:\n", 500 | "```python\n", 501 | "normalizer = preprocessing.Normalizer().fit(X_train)\n", 502 | "# apply to a new dataset (e.g. test set):\n", 503 | "normalizer.transform(X_test) \n", 504 | "```" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "Created by a Microsoft Employee.\n", 512 | "\t\n", 513 | "The MIT License (MIT)
\n", 514 | "Copyright (c) 2016 Micheleen Harris" 515 | ] 516 | } 517 | ], 518 | "metadata": { 519 | "kernelspec": { 520 | "display_name": "Python 3", 521 | "language": "python", 522 | "name": "python3" 523 | }, 524 | "language_info": { 525 | "codemirror_mode": { 526 | "name": "ipython", 527 | "version": 3 528 | }, 529 | "file_extension": ".py", 530 | "mimetype": "text/x-python", 531 | "name": "python", 532 | "nbconvert_exporter": "python", 533 | "pygments_lexer": "ipython3", 534 | "version": "3.4.3" 535 | } 536 | }, 537 | "nbformat": 4, 538 | "nbformat_minor": 0 539 | } 540 | -------------------------------------------------------------------------------- /03.Feature Engineering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# (Clean Data) and Transform Data\n", 8 | "\"Smiley
" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Make the learning easier or better beforehand - feature reduction/selection/creation\n", 16 | "* SelectKBest\n", 17 | "* PCA\n", 18 | "* One-Hot Encoder\n", 19 | "\n", 20 | "Just to remind you, the features of the irises we are dealing with on the flower are:\n", 21 | "![Iris with labels](imgs/iris_with_labels.jpg)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# Imports for python 2/3 compatibility\n", 33 | "\n", 34 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 35 | "\n", 36 | "# For python 2, comment these out:\n", 37 | "# from builtins import range" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Selecting k top scoring features (also dimensionality reduction)\n", 45 | "* Considered unsupervised learning" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "%matplotlib inline\n", 57 | "import numpy as np\n", 58 | "import pandas as pd\n", 59 | "import matplotlib.pyplot as plt" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "# SelectKBest for selecting top-scoring features\n", 71 | "\n", 72 | "from sklearn import datasets\n", 73 | "from sklearn.feature_selection import SelectKBest, chi2\n", 74 | "\n", 75 | "# Our nice, clean data (it's not always going to be this easy)\n", 76 | "iris = datasets.load_iris()\n", 77 | "X, y = iris.data, iris.target\n", 78 | "\n", 79 | "print('Original shape:', X.shape)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | 
"collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "# Let's add a NEW feature - a ratio of two of the iris measurements\n", 91 | "df = pd.DataFrame(X, columns = iris.feature_names)\n", 92 | "df['petal width / sepal width'] = df['petal width (cm)'] / df['sepal width (cm)']\n", 93 | "new_feature_names = df.columns\n", 94 | "print('New feature names:', list(new_feature_names))\n", 95 | "\n", 96 | "# We've now added a new column to our data\n", 97 | "X = np.array(df)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "# Perform feature selection\n", 109 | "# input is scoring function (here chi2) to get univariate p-values\n", 110 | "# and number of top-scoring features (k) - here we get the top 3\n", 111 | "dim_red = SelectKBest(chi2, k = 3)\n", 112 | "dim_red.fit(X, y)\n", 113 | "X_t = dim_red.transform(X)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "# Show scores, features selected and new shape\n", 125 | "print('Scores:', dim_red.scores_)\n", 126 | "print('New shape:', X_t.shape)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# Get back the selected columns\n", 138 | "selected = dim_red.get_support() # boolean values\n", 139 | "selected_names = new_feature_names[selected]\n", 140 | "\n", 141 | "print('Top k features: ', list(selected_names))" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "**Note on scoring function selection in `SelectKBest` transformations:**\n", 149 | "* For regression - 
[f_regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)\n", 150 | "* For classification - [chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2), [f_classif](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)\n" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "### Principal component analysis (aka PCA)\n", 158 | "* Reduces dimensions (number of features), based on what information explains the most variance (or signal)\n", 159 | "* Considered unsupervised learning\n", 160 | "* Useful for very large feature space (e.g. say the botanist in charge of the iris dataset measured 100 more parts of the flower and thus there were 104 columns instead of 4)\n", 161 | "* More about PCA on wikipedia [here](https://en.wikipedia.org/wiki/Principal_component_analysis)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "# PCA for dimensionality reduction\n", 173 | "\n", 174 | "from sklearn import decomposition\n", 175 | "from sklearn import datasets\n", 176 | "\n", 177 | "iris = datasets.load_iris()\n", 178 | "\n", 179 | "X, y = iris.data, iris.target\n", 180 | "\n", 181 | "# perform principal component analysis\n", 182 | "pca = decomposition.PCA(.95)\n", 183 | "pca.fit(X)\n", 184 | "X_t = pca.transform(X)\n", 185 | "(X_t[:, 0])\n", 186 | "\n", 187 | "# import numpy and matplotlib for plotting (and set some stuff)\n", 188 | "import numpy as np\n", 189 | "np.set_printoptions(suppress=True)\n", 190 | "import matplotlib.pyplot as plt\n", 191 | "%matplotlib inline\n", 192 | "\n", 193 | "# let's separate out data based on first two principal components\n", 194 | "x1, x2 = 
X_t[:, 0], X_t[:, 1]\n", 195 | "\n", 196 | "\n", 197 | "# please don't worry about details of the plotting below \n", 198 | "# (note: you can get the iris names below from iris.target_names, also in docs)\n", 199 | "c1 = np.array(list('rbg')) # colors\n", 200 | "colors = c1[y] # y coded by color\n", 201 | "classes = iris.target_names[y] # y coded by iris name\n", 202 | "for (i, cla) in enumerate(set(classes)):\n", 203 | "    xc = [p for (j, p) in enumerate(x1) if classes[j] == cla]\n", 204 | "    yc = [p for (j, p) in enumerate(x2) if classes[j] == cla]\n", 205 | "    cols = [c for (j, c) in enumerate(colors) if classes[j] == cla]\n", 206 | "    plt.scatter(xc, yc, c = cols, label = cla)\n", 207 | "    plt.ylabel('Principal Component 2')\n", 208 | "    plt.xlabel('Principal Component 1')\n", 209 | "plt.legend(loc = 4)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "### More feature selection methods [here](http://scikit-learn.org/stable/modules/feature_selection.html)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "### One Hot Encoding\n", 224 | "* It's an operation on feature labels - a method of dummying variables\n", 225 | "* Expands the feature space by nature of transform - later this can be processed further with a dimensionality reduction (the dummied variables are now their own features)\n", 226 | "* FYI: One hot encoding variables is needed for python ML module `tensorflow`\n", 227 | "* Can do this with `pandas` method or a `sklearn` one-hot-encoder system" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "#### `pandas` method" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": { 241 | "collapsed": false 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "# Dummy variables with pandas built-in function\n", 246 | "\n", 247 | "import pandas as pd\n", 248 | "from 
sklearn import datasets\n", 249 | "\n", 250 | "iris = datasets.load_iris()\n", 251 | "X, y = iris.data, iris.target\n", 252 | "\n", 253 | "# Convert to dataframe and add a column with actual iris species name\n", 254 | "data = pd.DataFrame(X, columns = iris.feature_names)\n", 255 | "data['target_name'] = iris.target_names[y]\n", 256 | "\n", 257 | "df = pd.get_dummies(data, prefix = ['target_name'])\n", 258 | "df.head()" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "#### `sklearn` method" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": { 272 | "collapsed": false 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "# OneHotEncoder for dummying variables\n", 277 | "import pandas as pd\n", 278 | "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n", 279 | "from sklearn import datasets\n", 280 | "\n", 281 | "iris = datasets.load_iris()\n", 282 | "X, y = iris.data, iris.target\n", 283 | "\n", 284 | "# We encode both our categorical variable and its labels\n", 285 | "enc = OneHotEncoder()\n", 286 | "label_enc = LabelEncoder() # remember the labels here\n", 287 | "\n", 288 | "# Encode labels (can use for discrete numerical values as well)\n", 289 | "data_label_encoded = label_enc.fit_transform(y)\n", 290 | "\n", 291 | "# Encode and \"dummy\" variables\n", 292 | "data_feature_one_hot_encoded = enc.fit_transform(y.reshape(-1, 1))\n", 293 | "print(data_feature_one_hot_encoded.shape)\n", 294 | "\n", 295 | "num_dummies = data_feature_one_hot_encoded.shape[1]\n", 296 | "df = pd.DataFrame(data_feature_one_hot_encoded.toarray(), columns = label_enc.inverse_transform(range(num_dummies)))\n", 297 | "\n", 298 | "df.head()" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Created by a Microsoft Employee.\n", 306 | "\t\n", 307 | "The MIT License (MIT)
\n", 308 | "Copyright (c) 2016 Micheleen Harris" 309 | ] 310 | } 311 | ], 312 | "metadata": { 313 | "kernelspec": { 314 | "display_name": "Python 3", 315 | "language": "python", 316 | "name": "python3" 317 | }, 318 | "language_info": { 319 | "codemirror_mode": { 320 | "name": "ipython", 321 | "version": 3 322 | }, 323 | "file_extension": ".py", 324 | "mimetype": "text/x-python", 325 | "name": "python", 326 | "nbconvert_exporter": "python", 327 | "pygments_lexer": "ipython3", 328 | "version": "3.4.3" 329 | } 330 | }, 331 | "nbformat": 4, 332 | "nbformat_minor": 0 333 | } 334 | -------------------------------------------------------------------------------- /04.Supervised Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Learning Algorithms - Supervised Learning\n", 8 | "\n", 9 | "> Reminder: All supervised estimators in scikit-learn implement a `fit(X, y)` method to fit the model and a `predict(X)` method that, given unlabeled observations X, returns the predicted labels y. (direct quote from `sklearn` docs)\n", 10 | "\n", 11 | "* Given that Iris is a fairly small, labeled dataset with relatively few features...what algorithm would you start with and why?" 
12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "> \"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.\"\n", 19 | "\n", 20 | "> \"Different estimators are better suited for different types of data and different problems.\"\n", 21 | "\n", 22 | "-Choosing the Right Estimator from sklearn docs\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# Imports for python 2/3 compatibility\n", 34 | "\n", 35 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 36 | "\n", 37 | "# For python 2, comment these out:\n", 38 | "# from builtins import range" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "An estimator for recognizing a new iris from its measurements\n", 46 | "\n", 47 | "> Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.\n", 48 | "\n", 49 | "Let's give it a try! 
(We are actually going to hold out a small percentage of the `iris` dataset and check our predictions against the labels)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 1, 55 | "metadata": { 56 | "collapsed": false 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "from sklearn.datasets import load_iris\n", 61 | "from sklearn.model_selection import train_test_split\n", 62 | "\n", 63 | "# Let's load the iris dataset\n", 64 | "iris = load_iris()\n", 65 | "X, y = iris.data, iris.target\n", 66 | "\n", 67 | "# split data into training and test sets using the handy train_test_split func\n", 68 | "# in this split, we are \"holding out\" 30% of the values and labels (placed into X_test and y_test)\n", 69 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "# Let's try a decision tree classification method\n", 81 | "from sklearn import tree\n", 82 | "\n", 83 | "t = tree.DecisionTreeClassifier(max_depth = 4,\n", 84 | " criterion = 'entropy', \n", 85 | " class_weight = 'balanced',\n", 86 | " random_state = 2)\n", 87 | "t.fit(X_train, y_train)\n", 88 | "\n", 89 | "t.score(X_test, y_test) # what performance metric is this?" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": false 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "# What was the label associated with this test sample? 
(\"held out\" sample's original label)\n", 101 | "# Let's predict on our \"held out\" samples\n", 102 | "y_pred = t.predict(X_test)\n", 103 | "print(y_pred)\n", 104 | "\n", 105 | "# fill in the blank below\n", 106 | "\n", 107 | "# how did our prediction do for the first sample in the test dataset?\n", 108 | "print(\"Prediction: %d, Original label: %d\" % (y_pred[0], ___)) # <-- fill in blank" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# Here's a nifty way to cross-validate (useful for quick model evaluation!)\n", 120 | "from sklearn import model_selection\n", 121 | "\n", 122 | "t = tree.DecisionTreeClassifier(max_depth = 4,\n", 123 | " criterion = 'entropy', \n", 124 | " class_weight = 'balanced',\n", 125 | " random_state = 2)\n", 126 | "\n", 127 | "# splits, fits and predicts all in one with a score (does this multiple times)\n", 128 | "score = model_selection.cross_val_score(t, X, y)\n", 129 | "score" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "source": [ 138 | "QUESTIONS: What do these scores tell you? Are they too high or too low, do you think? If it's 1.0, what does that mean?" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### What does the graph look like for this decision tree? i.e. 
what are the \"questions\" and \"decisions\" for this tree...\n", 146 | "* Note: You need both Graphviz app and the python package `graphviz` (It's worth it for this cool decision tree graph, I promise!)\n", 147 | "* To install both on OS X:\n", 148 | "```\n", 149 | "sudo port install graphviz\n", 150 | "sudo pip install graphviz\n", 151 | "```\n", 152 | "* For general Installation see [this guide](http://graphviz.readthedocs.org/en/latest/manual.html)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 2, 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "image/svg+xml": [ 165 | "\n", 166 | "\n", 168 | "\n", 170 | "\n", 171 | "\n", 173 | "\n", 174 | "Tree\n", 175 | "\n", 176 | "\n", 177 | "0\n", 178 | "\n", 179 | "petal width (cm) ≤ 0.8\n", 180 | "entropy = 1.585\n", 181 | "samples = 105\n", 182 | "value = [35.0, 35.0, 35.0]\n", 183 | "class = setosa\n", 184 | "\n", 185 | "\n", 186 | "1\n", 187 | "\n", 188 | "entropy = 0.0\n", 189 | "samples = 38\n", 190 | "value = [35.0, 0.0, 0.0]\n", 191 | "class = setosa\n", 192 | "\n", 193 | "\n", 194 | "0->1\n", 195 | "\n", 196 | "\n", 197 | "True\n", 198 | "\n", 199 | "\n", 200 | "2\n", 201 | "\n", 202 | "petal width (cm) ≤ 1.65\n", 203 | "entropy = 1.0\n", 204 | "samples = 67\n", 205 | "value = [0.0, 35.0, 35.0]\n", 206 | "class = virginica\n", 207 | "\n", 208 | "\n", 209 | "0->2\n", 210 | "\n", 211 | "\n", 212 | "False\n", 213 | "\n", 214 | "\n", 215 | "3\n", 216 | "\n", 217 | "petal length (cm) ≤ 4.95\n", 218 | "entropy = 0.4382\n", 219 | "samples = 38\n", 220 | "value = [0.0, 34.0278, 3.3871]\n", 221 | "class = versicolor\n", 222 | "\n", 223 | "\n", 224 | "2->3\n", 225 | "\n", 226 | "\n", 227 | "\n", 228 | "\n", 229 | "8\n", 230 | "\n", 231 | "petal length (cm) ≤ 4.85\n", 232 | "entropy = 0.1936\n", 233 | "samples = 29\n", 234 | "value = [0.0, 0.9722, 31.6129]\n", 235 | "class = virginica\n", 236 | "\n", 237 | "\n", 238 | "2->8\n", 239 | "\n", 
240 | "\n", 241 | "\n", 242 | "\n", 243 | "4\n", 244 | "\n", 245 | "entropy = 0.0\n", 246 | "samples = 34\n", 247 | "value = [0.0, 33.0556, 0.0]\n", 248 | "class = versicolor\n", 249 | "\n", 250 | "\n", 251 | "3->4\n", 252 | "\n", 253 | "\n", 254 | "\n", 255 | "\n", 256 | "5\n", 257 | "\n", 258 | "petal width (cm) ≤ 1.55\n", 259 | "entropy = 0.7656\n", 260 | "samples = 4\n", 261 | "value = [0.0, 0.9722, 3.3871]\n", 262 | "class = virginica\n", 263 | "\n", 264 | "\n", 265 | "3->5\n", 266 | "\n", 267 | "\n", 268 | "\n", 269 | "\n", 270 | "6\n", 271 | "\n", 272 | "entropy = -0.0\n", 273 | "samples = 3\n", 274 | "value = [0.0, 0.0, 3.3871]\n", 275 | "class = virginica\n", 276 | "\n", 277 | "\n", 278 | "5->6\n", 279 | "\n", 280 | "\n", 281 | "\n", 282 | "\n", 283 | "7\n", 284 | "\n", 285 | "entropy = 0.0\n", 286 | "samples = 1\n", 287 | "value = [0.0, 0.9722, 0.0]\n", 288 | "class = versicolor\n", 289 | "\n", 290 | "\n", 291 | "5->7\n", 292 | "\n", 293 | "\n", 294 | "\n", 295 | "\n", 296 | "9\n", 297 | "\n", 298 | "sepal width (cm) ≤ 3.1\n", 299 | "entropy = 0.7656\n", 300 | "samples = 4\n", 301 | "value = [0.0, 0.9722, 3.3871]\n", 302 | "class = virginica\n", 303 | "\n", 304 | "\n", 305 | "8->9\n", 306 | "\n", 307 | "\n", 308 | "\n", 309 | "\n", 310 | "12\n", 311 | "\n", 312 | "entropy = 0.0\n", 313 | "samples = 25\n", 314 | "value = [0.0, 0.0, 28.2258]\n", 315 | "class = virginica\n", 316 | "\n", 317 | "\n", 318 | "8->12\n", 319 | "\n", 320 | "\n", 321 | "\n", 322 | "\n", 323 | "10\n", 324 | "\n", 325 | "entropy = 0.0\n", 326 | "samples = 3\n", 327 | "value = [0.0, 0.0, 3.3871]\n", 328 | "class = virginica\n", 329 | "\n", 330 | "\n", 331 | "9->10\n", 332 | "\n", 333 | "\n", 334 | "\n", 335 | "\n", 336 | "11\n", 337 | "\n", 338 | "entropy = 0.0\n", 339 | "samples = 1\n", 340 | "value = [0.0, 0.9722, 0.0]\n", 341 | "class = versicolor\n", 342 | "\n", 343 | "\n", 344 | "9->11\n", 345 | "\n", 346 | "\n", 347 | "\n", 348 | "\n", 349 | "\n" 350 | ], 351 | "text/plain": [ 
352 | "" 353 | ] 354 | }, 355 | "execution_count": 2, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "from sklearn.tree import export_graphviz\n", 362 | "import graphviz\n", 363 | "\n", 364 | "# Let's rerun the decision tree classifier\n", 365 | "from sklearn import tree\n", 366 | "\n", 367 | "t = tree.DecisionTreeClassifier(max_depth = 4,\n", 368 | " criterion = 'entropy', \n", 369 | " class_weight = 'balanced',\n", 370 | " random_state = 2)\n", 371 | "t.fit(X_train, y_train)\n", 372 | "\n", 373 | "t.score(X_test, y_test) # what performance metric is this?\n", 374 | "\n", 375 | "export_graphviz(t, out_file=\"mytree.dot\", \n", 376 | " feature_names=iris.feature_names, \n", 377 | " class_names=iris.target_names, \n", 378 | " filled=True, rounded=True, \n", 379 | " special_characters=True)\n", 380 | "\n", 381 | "with open(\"mytree.dot\") as f:\n", 382 | " dot_graph = f.read()\n", 383 | "\n", 384 | "graphviz.Source(dot_graph, format = 'png')" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "### From Decision Tree to Random Forest" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "collapsed": false 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "from sklearn.datasets import load_iris\n", 403 | "import pandas as pd\n", 404 | "import numpy as np\n", 405 | "\n", 406 | "iris = load_iris()\n", 407 | "X, y = iris.data, iris.target\n", 408 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "from sklearn.ensemble import RandomForestClassifier\n", 420 | "\n", 421 | "forest = RandomForestClassifier(max_depth=4,\n", 422 | " criterion = 'entropy', \n", 423 | " n_estimators = 100, \n", 424 | " 
class_weight = 'balanced',\n", 425 | " n_jobs = -1,\n", 426 | " random_state = 2)\n", 427 | "\n", 428 | "#forest = RandomForestClassifier()\n", 429 | "forest.fit(X_train, y_train)\n", 430 | "\n", 431 | "y_preds = iris.target_names[forest.predict(X_test)]\n", 432 | "\n", 433 | "forest.score(X_test, y_test)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "# Here's a nifty way to cross-validate (useful for model evaluation!)\n", 445 | "from sklearn import model_selection\n", 446 | "\n", 447 | "# reinitialize classifier\n", 448 | "forest = RandomForestClassifier(max_depth=4,\n", 449 | " criterion = 'entropy', \n", 450 | " n_estimators = 100, \n", 451 | " class_weight = 'balanced',\n", 452 | " n_jobs = -1,\n", 453 | " random_state = 2)\n", 454 | "\n", 455 | "score = model_selection.cross_val_score(forest, X, y)\n", 456 | "score" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "QUESTION: Compared with the single decision tree, what do these accuracy scores tell you? Do they seem more reasonable?" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "### Splitting into train and test set vs. cross-validation" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "

We can be explicit and use the `train_test_split` method in scikit-learn ( [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) ), as in the following (and as shown above for the `iris` data):

\n", 478 | "\n", 479 | "```python\n", 480 | "# Create some data by hand and place 70% into a training set and the rest into a test set\n", 481 | "# Here we are using labeled features (X - feature data, y - labels) in our made-up data\n", 482 | "import numpy as np\n", 483 | "from sklearn import linear_model\n", 484 | "from sklearn.model_selection import train_test_split\n", 485 | "X, y = np.arange(10).reshape((5, 2)), range(5)\n", 486 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)\n", 487 | "clf = linear_model.LinearRegression()\n", 488 | "clf.fit(X_train, y_train)\n", 489 | "```\n", 490 | "\n", 491 | "OR\n", 492 | "\n", 493 | "Be more concise:\n", 494 | "\n", 495 | "```python\n", 496 | "import numpy as np\n", 497 | "from sklearn import model_selection, linear_model\n", 498 | "X, y = np.arange(10).reshape((5, 2)), range(5)\n", 499 | "clf = linear_model.LinearRegression()\n", 500 | "score = model_selection.cross_val_score(clf, X, y)\n", 501 | "```\n", 502 | "\n", 503 | "

There is also a `cross_val_predict` method that creates estimates rather than scores; it is very useful for evaluating models via cross-validation ( [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_predict.html) )" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "Created by a Microsoft Employee.\n", 511 | "\t\n", 512 | "The MIT License (MIT)
\n", 513 | "Copyright (c) 2016 Micheleen Harris" 514 | ] 515 | } 516 | ], 517 | "metadata": { 518 | "kernelspec": { 519 | "display_name": "Python 3", 520 | "language": "python", 521 | "name": "python3" 522 | }, 523 | "language_info": { 524 | "codemirror_mode": { 525 | "name": "ipython", 526 | "version": 3 527 | }, 528 | "file_extension": ".py", 529 | "mimetype": "text/x-python", 530 | "name": "python", 531 | "nbconvert_exporter": "python", 532 | "pygments_lexer": "ipython3", 533 | "version": "3.4.3" 534 | } 535 | }, 536 | "nbformat": 4, 537 | "nbformat_minor": 0 538 | } 539 | -------------------------------------------------------------------------------- /06.Model Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Evaluating Models" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "# Imports for python 2/3 compatibility\n", 19 | "\n", 20 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 21 | "\n", 22 | "# For python 2, comment these out:\n", 23 | "# from builtins import range" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Evaluating using metrics\n", 31 | "* Confusion matrix - visually inspect quality of a classifier's predictions (more [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)) - very useful to see if a particular class is problematic\n", 32 | "\n", 33 | "Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix." 
34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "import pandas as pd\n", 45 | "\n", 46 | "# import model algorithm and data\n", 47 | "from sklearn import svm, datasets\n", 48 | "\n", 49 | "# import splitter\n", 50 | "from sklearn.model_selection import train_test_split\n", 51 | "\n", 52 | "# import metrics\n", 53 | "from sklearn.metrics import confusion_matrix\n", 54 | "\n", 55 | "# feature data (X) and labels (y)\n", 56 | "iris = datasets.load_iris()\n", 57 | "X, y = iris.data, iris.target\n", 58 | "\n", 59 | "# split data into training and test sets\n", 60 | "X_train, X_test, y_train, y_test = \\\n", 61 | " train_test_split(X, y, test_size = 0.30, random_state = 42)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "# perform the classification step and run a prediction on the test set from above\n", 73 | "clf = svm.SVC(kernel = 'linear', C = 0.01)\n", 74 | "y_pred = clf.fit(X_train, y_train).predict(X_test)\n", 75 | "\n", 76 | "pd.DataFrame({'Prediction': iris.target_names[y_pred],\n", 77 | " 'Actual': iris.target_names[y_test]})" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "# accuracy score\n", 89 | "clf.score(X_test, y_test)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "# Define a plotting function for confusion matrices \n", 101 | "# (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)\n", 102 | "\n", 103 | "import matplotlib.pyplot as plt\n", 104 | "import numpy as np\n", 105 | "\n", 106 | "def plot_confusion_matrix(cm, target_names, 
title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):\n", 107 | " plt.imshow(cm, interpolation = 'nearest', cmap = cmap)\n", 108 | " plt.tight_layout()\n", 109 | " \n", 110 | " # Add class labels to x and y axes\n", 111 | " tick_marks = np.arange(len(target_names))\n", 112 | " plt.xticks(tick_marks, target_names, rotation=45)\n", 113 | " plt.yticks(tick_marks, target_names)\n", 114 | " \n", 115 | " plt.ylabel('True Label')\n", 116 | " plt.xlabel('Predicted Label')\n", 117 | " \n", 118 | " plt.colorbar()" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "Numbers in confusion matrix:\n", 126 | "* on-diagonal - counts of points for which the predicted label is equal to the true label\n", 127 | "* off-diagonal - counts of mislabeled points" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": { 134 | "collapsed": false 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "%matplotlib inline\n", 139 | "\n", 140 | "cm = confusion_matrix(y_test, y_pred)\n", 141 | "\n", 142 | "# see the actual counts\n", 143 | "print(cm)\n", 144 | "\n", 145 | "# visually inspect how the classifier did at matching predictions to true labels\n", 146 | "plot_confusion_matrix(cm, iris.target_names)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "* Classification reports - a text report with important classification metrics (e.g. 
precision, recall)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": false 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "from sklearn.metrics import classification_report\n", 165 | "\n", 166 | "# Using the test and prediction sets from above\n", 167 | "print(classification_report(y_test, y_pred, target_names = iris.target_names))" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "# Another example with some toy data\n", 179 | "\n", 180 | "y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']\n", 181 | "y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']\n", 182 | "\n", 183 | "# How did our predictor do?\n", 184 | "print(classification_report(y_test, ___, target_names = ___)) # <-- fill in the blanks" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "QUICK QUESTION: Is it better to have too many false positives or too many false negatives?" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Evaluating Models and Under/Over-Fitting\n", 199 | "* Over-fitting or under-fitting can be visualized as below and tuned, as we will see later, with `GridSearchCV` parameter tuning\n", 200 | "* A validation curve gives one an idea of the relationship of model complexity to model performance.\n", 201 | "* For this examination it would help to understand the idea of the [bias-variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).\n", 202 | "* A learning curve helps answer the question of whether there is an added benefit to adding more training data to a model. It is also a tool for investigating whether an estimator is more affected by variance error or bias error." 
203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "source": [ 211 | "PARTING THOUGHT: When a given parameter is increased or decreased, does it cause overfitting or underfitting? What are the implications of each case?" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "collapsed": true 218 | }, 219 | "source": [ 220 | "Created by a Microsoft Employee.\n", 221 | "\t\n", 222 | "The MIT License (MIT)
\n", 223 | "Copyright (c) 2016 Micheleen Harris" 224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.4.3" 244 | } 245 | }, 246 | "nbformat": 4, 247 | "nbformat_minor": 0 248 | } 249 | -------------------------------------------------------------------------------- /07.Pipelines and Parameter Tuning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Search for best parameters and create a pipeline" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Easy reading...create and use a pipeline" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "> Pipelining (as an aside to this section)\n", 22 | "* `Pipeline(steps=[...])` - where `steps` is a list of (name, transform) tuples; data flows through each transform in order, and the final step may be an estimator\n", 23 | "* For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up.\n", 24 | "\n", 25 | "See a full example [here](http://scikit-learn.org/stable/auto_examples/feature_stacker.html)\n", 26 | "\n", 27 | "Note: If you wish to perform multiple transformations in your pipeline try [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 
| "# Imports for python 2/3 compatibility\n", 37 | "\n", 38 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 39 | "\n", 40 | "# For python 2, comment these out:\n", 41 | "# from builtins import range" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "from sklearn.model_selection import train_test_split\n", 51 | "from sklearn.svm import SVC\n", 52 | "from sklearn.pipeline import Pipeline\n", 53 | "from sklearn.feature_selection import SelectKBest, chi2\n", 54 | "from sklearn.datasets import load_iris\n", 55 | "\n", 56 | "iris = load_iris()\n", 57 | "X, y = iris.data, iris.target\n", 58 | "\n", 59 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)\n", 60 | "\n", 61 | "# a feature selection instance\n", 62 | "selection = SelectKBest(chi2, k = 2)\n", 63 | "\n", 64 | "# classification instance\n", 65 | "clf = SVC(kernel = 'linear')\n", 66 | "\n", 67 | "# make a pipeline\n", 68 | "pipeline = Pipeline([(\"feature selection\", selection), (\"classification\", clf)])\n", 69 | "\n", 70 | "# train the model on the training set only, so the held-out test set stays unseen\n", 71 | "pipeline.fit(X_train, y_train)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "# Mlxtend (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks.\n", 81 | "# Homepage: http://rasbt.github.io/mlxtend/\n", 82 | "\n", 83 | "!pip install msgpack mlxtend" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "import numpy as np\n", "import matplotlib.pyplot as plt\n", 93 | "from mlxtend.plotting import plot_decision_regions\n", 94 | "\n", 95 | "# Obtain estimated test set labels using the pipeline we created\n", 96 | "y_pred = pipeline.predict(X_test)\n", 97 | "\n", 98 | "# We use mlxtend to show the decision regions of the final 
SVC\n", 99 | "fig, axarr = plt.subplots(1, 2, figsize=(12,5), sharex=True, sharey=True)\n", 100 | "\n", 101 | "# Plot the decision regions for X_train and y_train. The bare clf expects input already reduced\n", 102 | "# by the SelectKBest step (the pipeline applies it internally), so we transform explicitly here:\n", 103 | "X_train_transformed = selection.transform(X_train)\n", 104 | "X_test_transformed = selection.transform(X_test)\n", 105 | "\n", 106 | "plot_decision_regions(X_train_transformed, y_train, clf=clf, legend=2, ax= axarr[0])\n", 107 | "axarr[0].set_title(\"Decision Region (Trained)\")\n", 108 | "\n", 109 | "plot_decision_regions(X_test_transformed, y_pred, clf=clf, legend=2, ax= axarr[1])\n", 110 | "axarr[1].set_title(\"Decision Region (Predicted)\")\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Last, but not least, Searching Parameter Space with `GridSearchCV`" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "from sklearn.model_selection import GridSearchCV\n", 127 | "\n", 128 | "from sklearn.preprocessing import PolynomialFeatures\n", 129 | "from sklearn.linear_model import LinearRegression\n", 130 | "\n", 131 | "poly = PolynomialFeatures(include_bias = False)\n", 132 | "lm = LinearRegression()\n", 133 | "\n", 134 | "pipeline = Pipeline([(\"polynomial_features\", poly),\n", 135 | " (\"linear_regression\", lm)])\n", 136 | "\n", 137 | "param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),\n", 138 | " linear_regression__normalize = [False, True])\n", 139 | "\n", 140 | "grid_search = GridSearchCV(pipeline, param_grid=param_grid)\n", 141 | "grid_search.fit(X, y)\n", 142 | "print(grid_search.best_params_)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Created by a Microsoft Employee.\n", 150 | "\t\n", 151 | "The MIT License (MIT)
\n", 152 | "Copyright (c) 2016 Micheleen Harris" 153 | ] 154 | } 155 | ], 156 | "metadata": { 157 | "kernelspec": { 158 | "display_name": "Python 3", 159 | "language": "python", 160 | "name": "python3" 161 | }, 162 | "language_info": { 163 | "codemirror_mode": { 164 | "name": "ipython", 165 | "version": 3 166 | }, 167 | "file_extension": ".py", 168 | "mimetype": "text/x-python", 169 | "name": "python", 170 | "nbconvert_exporter": "python", 171 | "pygments_lexer": "ipython3", 172 | "version": "3.6.5" 173 | } 174 | }, 175 | "nbformat": 4, 176 | "nbformat_minor": 1 177 | } 178 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Micheleen Harris 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /LICENSE_Jake_Vanderplas: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015, Jake Vanderplas 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without modification, 5 | are permitted provided that the following conditions are met: 6 | 7 | Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | Redistributions in binary form must reproduce the above copyright notice, this 11 | list of conditions and the following disclaimer in the documentation and/or 12 | other materials provided with the distribution. 13 | 14 | Neither the name of the {organization} nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR 22 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 25 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
28 | -------------------------------------------------------------------------------- /Notebook_anatomy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic Anatomy of a Notebook and General Guide\n", 8 | "* Note: this is a Python 3-flavored Jupyter notebook" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### My Disclaimers:\n", 16 | "1. Notebooks are no substitute for an IDE for developing apps.\n", 17 | "* Notebooks are not suitable for debugging code (yet).\n", 18 | "* They are no substitute for publication-quality publishing; however, they are very useful for interactive blogging.\n", 19 | "* My main use of notebooks is for interactive teaching and as a playground for some code that I might like to share at some point (I can add useful and pretty markup text, pics, videos, etc.).\n", 20 | "* I'm also a fan because GitHub renders ipynb files nicely (even better than R Markdown for some reason)."
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Shortcuts!!!\n", 28 | "* A complete list is [here](https://sowingseasons.com/blog/reference/2016/01/jupyter-keyboard-shortcuts/23298516), but these are my favorites:\n", 29 | "\n", 30 | "Mode | What | Shortcut\n", 31 | "------------- | ------------- | -------------\n", 32 | "Either (Press `Esc` to enter) | Run cell | Shift-Enter\n", 33 | "Command | Add cell below | B\n", 34 | "Command | Add cell above | A\n", 35 | "Command | Delete a cell | d-d\n", 36 | "Command | Go into edit mode | Enter\n", 37 | "Edit (Press `Enter` to enable) | Indent | Ctrl-]\n", 38 | "Edit | Unindent | Ctrl-[\n", 39 | "Edit | Comment section | Ctrl-/\n", 40 | "Edit | Function introspection | Shift-Tab\n", 41 | "\n", 42 | "Try some below" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "metadata": { 49 | "collapsed": false 50 | }, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "hello world!\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "print('hello world!')" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### This figure labels a few notebook parts I will refer to\n", 69 | "![Parts](https://www.packtpub.com/sites/default/files/Article-Images/B01727_03.png)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "\n", 77 | "#### OK, change this cell to markdown to see some examples (you'll recognize this if you speak markdown)\n", 78 | "# This will be Heading1\n", 79 | "1. 
first thing\n", 80 | "* second thing\n", 81 | "* third thing\n", 82 | "\n", 83 | "A horizontal rule:\n", 84 | "\n", 85 | "---\n", 86 | "> Indented text\n", 87 | "\n", 88 | "Code snippet:\n", 89 | "\n", 90 | "```python\n", 91 | "import numpy as np\n", 92 | "a2d = np.random.randn(100).reshape(10, 10)\n", 93 | "```\n", 94 | "\n", 95 | "LaTeX inline equation:\n", 96 | "\n", 97 | "$\\Delta =\\sum_{i=1}^N w_i (x_i - \\bar{x})^2$\n", 98 | "\n", 99 | "Markdown table:\n", 100 | "\n", 101 | "First Header | Second Header\n", 102 | "------------- | -------------\n", 103 | "Content Cell | Content Cell\n", 104 | "Content Cell | Content Cell\n", 105 | "\n", 106 | "HTML:\n", 107 | "\n", 108 | "\"You" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### As you can see on your jupyter homepage, you can open up any notebook\n", 116 | "NB: You can return to the homepage by clicking the Jupyter icon in the very upper left corner at any time\n", 117 | "### You can also Upload a notebook (button on upper right)\n", 118 | "![Upload button](http://www.ciser.cornell.edu/data_wrangling/python_intro/images/JupyterUpload.gif)\n", 119 | "### As well as start a new notebook with a specific kernel (button to the right of Upload)\n", 120 | "![New menu](https://www.ibm.com/developerworks/community/blogs/jfp/resource/BLOGS_UPLOADED_IMAGES/irkernel48.png)\n", 121 | "\n", 122 | "> So, what's that number after `In` or `Out`? That's the order of running this cell relative to other cells (useful for keeping track of what order cells have been run). When you save this notebook that number along with any output shown will also be saved. 
To reset a notebook go to Cell -> All Output -> Clear and then Save it.\n", 123 | "\n", 124 | "You can do something like this to render a publicly available notebook on github statically (this I do as a backup for presentations and course stuff):\n", 125 | "\n", 126 | "```\n", 127 | "http://nbviewer.jupyter.org/github/<username>/<repo>/blob/master/<notebook>.ipynb\n", 128 | "```\n", 129 | "like:
\n", 130 | "http://nbviewer.jupyter.org/github/michhar/rpy2_sample_notebooks/blob/master/TestingRpy2.ipynb\n", 131 | "\n", 132 | "
\n", 133 | "Also, you can upload or start a new interactive, free notebook by going here:
\n", 134 | "https://tmpnb.org\n", 135 | "
\n", 136 | "\n", 137 | "> The nifty thing about Jupyter notebooks (and the .ipynb files which you can download and upload) is that you can share these. They are just written in JSON language. I put them up in places like GitHub and point people in that direction. \n", 138 | "\n", 139 | "> Some people (like [this guy](http://www.r-bloggers.com/why-i-dont-like-jupyter-fka-ipython-notebook/) who misses the point I think) really dislike notebooks, but they are really good for what they are good at - sharing code ideas plus neat notes and stuff in dev, teaching interactively, even chaining languages together in a polyglot style. And doing all of this on github works really well (as long as you remember to always clear your output before checking in - version control can get a bit crazy otherwise).\n", 140 | "\n", 141 | "### Some additional features\n", 142 | "* tab completion\n", 143 | "* function introspection\n", 144 | "* help" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 2, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "import json" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "# hit Tab at end of this to see all methods\n", 167 | "json.\n", 168 | "\n", 169 | "# hit Shift-Tab within parenthesis of method to see full docstring\n", 170 | "json.loads()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "?sum()" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [], 191 | "source": [ 192 | "import json\n", 193 | "?json" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": { 199 | "collapsed": true 200 | 
}, 201 | "source": [ 202 | "The MIT License (MIT)
\n", 203 | "Copyright (c) 2016 Micheleen Harris" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "collapsed": true 211 | }, 212 | "outputs": [], 213 | "source": [] 214 | } 215 | ], 216 | "metadata": { 217 | "kernelspec": { 218 | "display_name": "Python 3", 219 | "language": "python", 220 | "name": "python3" 221 | }, 222 | "language_info": { 223 | "codemirror_mode": { 224 | "name": "ipython", 225 | "version": 3 226 | }, 227 | "file_extension": ".py", 228 | "mimetype": "text/x-python", 229 | "name": "python", 230 | "nbconvert_exporter": "python", 231 | "pygments_lexer": "ipython3", 232 | "version": "3.5.1" 233 | } 234 | }, 235 | "nbformat": 4, 236 | "nbformat_minor": 0 237 | } 238 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## What's in this tutorial 2 | 3 | The notebooks are a modular introduction to machine learning in python using `scikit-learn` with examples and tips. 4 | 5 | The material is in jupyter notebook format and was designed to be compatible with Python >= 2.6 or >= 3.3. To use these notebooks interatively (intended use), you will need a jupyter/ipython notebook install (see below). 6 | 7 | Also, included is a brief introductory guide to jupyter notebooks in Notebook_anatomy notebook. If you are unfamiliar with jupyter/ipython notebooks, please take some time to look at this file. 8 | 9 | ## Installation Notes 10 | 11 | > For a quick deployment, simply click the `launch binder` link at the bottom of this page. However, we recommend a local install for more customizable setups, flexibility and possiblities. 12 | 13 | ### Setting up a development environment 14 | 15 | > Note: the requirements.txt file above is a snapshot of the latest `pip` installed packages from a successful ML ecosystem. 
`conda` should install the best dependencies for the `scikit-learn` version used and may have different versions. 16 | 17 | It is generally best practice to have a distinct development environment for various Python projects. There are multiple options available to do this, such as virtualenv and Conda. For this project, we will be using the [Conda](https://www.continuum.io/why-anaconda) environment. 18 | 19 | To get started, you can install [miniconda3](http://conda.pydata.org/docs/install/quick.html) to get python3 as well as python2. 20 | 21 | If you already have Python installed, you can install Conda via `pip`: 22 | 23 | ``` 24 | pip install auxlib conda 25 | ``` 26 | 27 | ### Initializing a Conda environment 28 | 29 | * To set up a python 2.7 development environment in addition to your python 3 conda install for this project (done after installing [miniconda3](http://conda.pydata.org/docs/install/quick.html)), you can run: 30 | * `conda create --name sklearn python=2` 31 | * This installs into `C:\Miniconda3\envs\python2\` so I added this to the system path (on Windows) 32 | * On Linux and OS/X, this depends on where the Python Framework is installed. 
On OS/X using Homebrew, this installs into `/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/python2/bin` 33 | * See [here](http://conda.pydata.org/docs/py2or3.html) for more detailed instructions 34 | 35 | * To activate the development environment, from the `bin` folder of your conda environment, run 36 | * Windows: `activate sklearn` 37 | * Linux/OSX: `source activate sklearn` 38 | 39 | * Ensure ipython/ipython2 is installed in the Python environment 40 | * Windows: `c:\Miniconda3\envs\python2\Scripts\ipython2.exe kernel install --name python2 --display-name "Python 2"` 41 | * Linux/OSX: `ipython2 kernel install --name python2 --display-name "Python 2"` (may need `sudo`) 42 | 43 | * If, at any point, you desire to exit the development environment, simply type the following: 44 | * Windows: `deactivate` 45 | * Linux/OSX: `source deactivate` 46 | 47 | 48 | ### Installing jupyter notebook locally 49 | 50 | The easiest way to install [jupyter notebook](http://jupyter.org/) is via `conda install`: 51 | * Run `conda install jupyter` from your terminal. Linux/OSX may require `sudo` permissions. 52 | * Navigate to the directory containing this repository, and execute `jupyter notebook`. This will start a notebook service locally for accessing notebooks in your browser. Drill down on the home page to your notebook of interest. 53 | 54 | For a notebook primer go to `Notebook_anatomy.ipynb` in this repo. The very short story is: to execute a cell just hit Shift-Enter. There are many more shortcuts in the primer. 
55 | 56 | ## Installing python packages 57 | 58 | This tutorial requires the following packages: 59 | 60 | * numpy version 1.5 or later: http://www.numpy.org/ 61 | * scipy version 0.10 or later: http://www.scipy.org/ 62 | * pandas http://pandas.pydata.org/ 63 | * matplotlib version 1.3 or later: http://matplotlib.org/ 64 | * scikit-learn version 0.14 or later: http://scikit-learn.org 65 | * jupyter http://jupyter.readthedocs.org/en/latest/install.html 66 | 67 | You can use your development environment of choice, but if you used `conda` as described above, simply run: 68 | ``` 69 | $ conda install numpy scipy pandas matplotlib scikit-learn jupyter 70 | ``` 71 | 72 | We have also provided a requirements.txt file above for use with pip. 73 | 74 | ## Other install options 75 | 76 | There are many different ways to install python and the package ecosystem for machine learning. They are not all going to be covered here, but essentially you have the following choices: 77 | 78 | 1. anaconda/miniconda aka conda (shown above) 79 | 2. download python and pip install packages 80 | 3. use a docker image ([this](https://hub.docker.com/r/wi3o/skflow-jupyternb/) is one for jupyter+sklearn+skflow+tensorflow) 81 | 4. [Google cloud platform](https://cloud.google.com/) has a jupyter notebook service called Datalab (quickstart [here](https://cloud.google.com/datalab/docs/quickstart)). It has tensorflow pre-installed (needed for the next tutorial). 82 | 5. Click the Binder link at the bottom of this page to deploy a notebook setup. 83 | 84 | Or a combination of the above. 85 | 86 | A quick tip: if you are installing in a non-conda way with `pip` on Windows, many of the data analysis packages are tricky (compiled dependencies) to install. A nice "unofficial" repository for binaries of packages like `numpy` and a myriad of others was created and maintained by Christoph Gohlke. This site is [here](http://www.lfd.uci.edu/~gohlke/pythonlibs/). 
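Whichever install route you take, it can help to confirm that the required packages are importable before starting the notebooks. A minimal check script (a sketch — the package list and the `package_version` helper are illustrative, not part of this repo; note `sklearn` is the import name for scikit-learn):

```python
import importlib

def package_version(name):
    """Return a package's __version__ string, 'unknown' if it defines no
    __version__ attribute, or None if the package is not installed."""
    try:
        mod = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(mod, "__version__", "unknown")

if __name__ == "__main__":
    for pkg in ("numpy", "scipy", "pandas", "matplotlib", "sklearn"):
        version = package_version(pkg)
        print(pkg, version if version is not None else "NOT INSTALLED")
```

Run it inside your activated environment; any "NOT INSTALLED" line points at a package to add via `conda install` or `pip`.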
87 | 88 | ## What's next 89 | 90 | The next tutorial in this workshop is on `tensorflow` and the installation instructions are in this [README](https://github.com/PythonWorkshop/intro-to-tensorflow/blob/master/README.md) 91 | 92 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org/repo/PythonWorkshop/intro-to-sklearn) 93 | -------------------------------------------------------------------------------- /Resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Some References\n", 8 | "* [The iris dataset and an intro to sklearn explained on the Kaggle blog](http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machine-learning-first-steps-with-the-iris-dataset/)\n", 9 | "* [sklearn: Conference Notebooks and Presentation from Open Data Science Conf 2015](https://github.com/amueller/odscon-sf-2015) by Andreas Mueller\n", 10 | "* [real-world example set of notebooks for learning ML from Open Data Science Conf 2015](https://github.com/cmmalone/malone_OpenDataSciCon) by Katie Malone\n", 11 | "* [PyCon 2015 Workshop, Scikit-learn tutorial](https://www.youtube.com/watch?v=L7R4HUQ-eQ0) by Jake VanDerplas (Univ of Washington, eScience Dept)\n", 12 | "* [Data Science for the Rest of Us](https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Data-Science-for-Rest-of-Us) great introductory webinar (no math) by Brandon Rohrer (Microsoft)\n", 13 | "* [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) with useful ML \"folk wisdom\" by Pedro Domingos (Univ of Washington, CS Dept)\n", 14 | "* [Machine Learning 101](http://www.astroml.org/sklearn_tutorial/general_concepts.html) associated with `sklearn` docs\n", 15 | "\n", 16 | "### Some Datasets\n", 17 | "* [Machine learning datasets](http://mldata.org/)\n", 18 | "* [Make your own with 
sklearn](http://scikit-learn.org/stable/datasets/index.html#sample-generators)\n", 19 | "* [Kaggle datasets](https://www.kaggle.com/datasets)\n", 20 | "\n", 21 | "### Contact Info\n", 22 | "\n", 23 | "Micheleen Harris
\n", 24 | "email: michhar@microsoft.com" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [] 35 | } 36 | ], 37 | "metadata": { 38 | "kernelspec": { 39 | "display_name": "Python 3", 40 | "language": "python", 41 | "name": "python3" 42 | }, 43 | "language_info": { 44 | "codemirror_mode": { 45 | "name": "ipython", 46 | "version": 3 47 | }, 48 | "file_extension": ".py", 49 | "mimetype": "text/x-python", 50 | "name": "python", 51 | "nbconvert_exporter": "python", 52 | "pygments_lexer": "ipython3", 53 | "version": "3.5.1" 54 | } 55 | }, 56 | "nbformat": 4, 57 | "nbformat_minor": 0 58 | } 59 | -------------------------------------------------------------------------------- /Untitled.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "data": { 12 | "text/plain": [ 13 | "str" 14 | ] 15 | }, 16 | "execution_count": 6, 17 | "metadata": {}, 18 | "output_type": "execute_result" 19 | } 20 | ], 21 | "source": [ 22 | "s1 = u'my string'\n", 23 | "type(s1)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 3, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "from __future__ import unicode_literals" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 7, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 8, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "dict_values([175, 166, 192])" 57 | ] 58 | }, 59 | "execution_count": 8, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": [ 65 | "heights = 
{'Fred': 175, 'Anne': 166, 'Joe': 192}\n", 66 | "heights.values()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [] 77 | } 78 | ], 79 | "metadata": { 80 | "kernelspec": { 81 | "display_name": "Python 3", 82 | "language": "python", 83 | "name": "python3" 84 | }, 85 | "language_info": { 86 | "codemirror_mode": { 87 | "name": "ipython", 88 | "version": 3 89 | }, 90 | "file_extension": ".py", 91 | "mimetype": "text/x-python", 92 | "name": "python", 93 | "nbconvert_exporter": "python", 94 | "pygments_lexer": "ipython3", 95 | "version": "3.4.3" 96 | } 97 | }, 98 | "nbformat": 4, 99 | "nbformat_minor": 0 100 | } 101 | -------------------------------------------------------------------------------- /fig_code/ML_flow_chart.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial Diagrams 3 | ----------------- 4 | 5 | This script plots the flow-charts used in the scikit-learn tutorials. 
6 | """ 7 | 8 | import numpy as np 9 | import pylab as pl 10 | from matplotlib.patches import Circle, Rectangle, Polygon, Arrow, FancyArrow 11 | 12 | def create_base(box_bg = '#CCCCCC', 13 | arrow1 = '#88CCFF', 14 | arrow2 = '#88FF88', 15 | supervised=True): 16 | fig = pl.figure(figsize=(9, 6), facecolor='w') 17 | ax = pl.axes((0, 0, 1, 1), 18 | xticks=[], yticks=[], frameon=False) 19 | ax.set_xlim(0, 9) 20 | ax.set_ylim(0, 6) 21 | 22 | patches = [Rectangle((0.3, 3.6), 1.5, 1.8, zorder=1, fc=box_bg), 23 | Rectangle((0.5, 3.8), 1.5, 1.8, zorder=2, fc=box_bg), 24 | Rectangle((0.7, 4.0), 1.5, 1.8, zorder=3, fc=box_bg), 25 | 26 | Rectangle((2.9, 3.6), 0.2, 1.8, fc=box_bg), 27 | Rectangle((3.1, 3.8), 0.2, 1.8, fc=box_bg), 28 | Rectangle((3.3, 4.0), 0.2, 1.8, fc=box_bg), 29 | 30 | Rectangle((0.3, 0.2), 1.5, 1.8, fc=box_bg), 31 | 32 | Rectangle((2.9, 0.2), 0.2, 1.8, fc=box_bg), 33 | 34 | Circle((5.5, 3.5), 1.0, fc=box_bg), 35 | 36 | Polygon([[5.5, 1.7], 37 | [6.1, 1.1], 38 | [5.5, 0.5], 39 | [4.9, 1.1]], fc=box_bg), 40 | 41 | FancyArrow(2.3, 4.6, 0.35, 0, fc=arrow1, 42 | width=0.25, head_width=0.5, head_length=0.2), 43 | 44 | FancyArrow(3.75, 4.2, 0.5, -0.2, fc=arrow1, 45 | width=0.25, head_width=0.5, head_length=0.2), 46 | 47 | FancyArrow(5.5, 2.4, 0, -0.4, fc=arrow1, 48 | width=0.25, head_width=0.5, head_length=0.2), 49 | 50 | FancyArrow(2.0, 1.1, 0.5, 0, fc=arrow2, 51 | width=0.25, head_width=0.5, head_length=0.2), 52 | 53 | FancyArrow(3.3, 1.1, 1.3, 0, fc=arrow2, 54 | width=0.25, head_width=0.5, head_length=0.2), 55 | 56 | FancyArrow(6.2, 1.1, 0.8, 0, fc=arrow2, 57 | width=0.25, head_width=0.5, head_length=0.2)] 58 | 59 | if supervised: 60 | patches += [Rectangle((0.3, 2.4), 1.5, 0.5, zorder=1, fc=box_bg), 61 | Rectangle((0.5, 2.6), 1.5, 0.5, zorder=2, fc=box_bg), 62 | Rectangle((0.7, 2.8), 1.5, 0.5, zorder=3, fc=box_bg), 63 | FancyArrow(2.3, 2.9, 2.0, 0, fc=arrow1, 64 | width=0.25, head_width=0.5, head_length=0.2), 65 | Rectangle((7.3, 0.85), 1.5, 0.5, fc=box_bg)] 66 
| else: 67 | patches += [Rectangle((7.3, 0.2), 1.5, 1.8, fc=box_bg)] 68 | 69 | for p in patches: 70 | ax.add_patch(p) 71 | 72 | pl.text(1.45, 4.9, "Training\nText,\nDocuments,\nImages,\netc.", 73 | ha='center', va='center', fontsize=14) 74 | 75 | pl.text(3.6, 4.9, "Feature\nVectors", 76 | ha='left', va='center', fontsize=14) 77 | 78 | pl.text(5.5, 3.5, "Machine\nLearning\nAlgorithm", 79 | ha='center', va='center', fontsize=14) 80 | 81 | pl.text(1.05, 1.1, "New Text,\nDocument,\nImage,\netc.", 82 | ha='center', va='center', fontsize=14) 83 | 84 | pl.text(3.3, 1.7, "Feature\nVector", 85 | ha='left', va='center', fontsize=14) 86 | 87 | pl.text(5.5, 1.1, "Predictive\nModel", 88 | ha='center', va='center', fontsize=12) 89 | 90 | if supervised: 91 | pl.text(1.45, 3.05, "Labels", 92 | ha='center', va='center', fontsize=14) 93 | 94 | pl.text(8.05, 1.1, "Expected\nLabel", 95 | ha='center', va='center', fontsize=14) 96 | pl.text(8.8, 5.8, "Supervised Learning Model", 97 | ha='right', va='top', fontsize=18) 98 | 99 | else: 100 | pl.text(8.05, 1.1, 101 | "Likelihood\nor Cluster ID\nor Better\nRepresentation", 102 | ha='center', va='center', fontsize=12) 103 | pl.text(8.8, 5.8, "Unsupervised Learning Model", 104 | ha='right', va='top', fontsize=18) 105 | 106 | 107 | 108 | def plot_supervised_chart(annotate=False): 109 | create_base(supervised=True) 110 | if annotate: 111 | fontdict = dict(color='r', weight='bold', size=14) 112 | pl.text(1.9, 4.55, 'X = vec.fit_transform(input)', 113 | fontdict=fontdict, 114 | rotation=20, ha='left', va='bottom') 115 | pl.text(3.7, 3.2, 'clf.fit(X, y)', 116 | fontdict=fontdict, 117 | rotation=20, ha='left', va='bottom') 118 | pl.text(1.7, 1.5, 'X_new = vec.transform(input)', 119 | fontdict=fontdict, 120 | rotation=20, ha='left', va='bottom') 121 | pl.text(6.1, 1.5, 'y_new = clf.predict(X_new)', 122 | fontdict=fontdict, 123 | rotation=20, ha='left', va='bottom') 124 | 125 | def plot_unsupervised_chart(): 126 | create_base(supervised=False) 127 | 
128 | 129 | if __name__ == '__main__': 130 | plot_supervised_chart(False) 131 | plot_supervised_chart(True) 132 | plot_unsupervised_chart() 133 | pl.show() 134 | 135 | 136 | -------------------------------------------------------------------------------- /fig_code/__init__.py: -------------------------------------------------------------------------------- 1 | from .data import * 2 | from .figures import * 3 | 4 | from .sgd_separator import plot_sgd_separator 5 | from .linear_regression import plot_linear_regression 6 | from .helpers import plot_iris_knn 7 | -------------------------------------------------------------------------------- /fig_code/data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | def linear_data_sample(N=40, rseed=0, m=3, b=-2): 5 | rng = np.random.RandomState(rseed) 6 | 7 | x = 10 * rng.rand(N) 8 | dy = m / 2 * (1 + rng.rand(N)) 9 | y = m * x + b + dy * rng.randn(N) 10 | 11 | return (x, y, dy) 12 | 13 | 14 | def linear_data_sample_big_errs(N=40, rseed=0, m=3, b=-2): 15 | rng = np.random.RandomState(rseed) 16 | 17 | x = 10 * rng.rand(N) 18 | dy = m / 2 * (1 + rng.rand(N)) 19 | dy[20:25] *= 10 20 | y = m * x + b + dy * rng.randn(N) 21 | 22 | return (x, y, dy) 23 | 24 | 25 | def sample_light_curve(phased=True): 26 | from astroML.datasets import fetch_LINEAR_sample 27 | data = fetch_LINEAR_sample() 28 | t, y, dy = data[18525697].T 29 | 30 | if phased: 31 | P_best = 0.580313015651 32 | t /= P_best 33 | 34 | return (t, y, dy) 35 | 36 | 37 | def sample_light_curve_2(phased=True): 38 | from astroML.datasets import fetch_LINEAR_sample 39 | data = fetch_LINEAR_sample() 40 | t, y, dy = data[10022663].T 41 | 42 | if phased: 43 | P_best = 0.61596079804 44 | t /= P_best 45 | 46 | return (t, y, dy) 47 | 48 | -------------------------------------------------------------------------------- /fig_code/figures.py: -------------------------------------------------------------------------------- 
1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import warnings 4 | 5 | 6 | def plot_venn_diagram(): 7 | fig, ax = plt.subplots(subplot_kw=dict(frameon=False, xticks=[], yticks=[])) 8 | ax.add_patch(plt.Circle((0.3, 0.3), 0.3, fc='red', alpha=0.5)) 9 | ax.add_patch(plt.Circle((0.6, 0.3), 0.3, fc='blue', alpha=0.5)) 10 | ax.add_patch(plt.Rectangle((-0.1, -0.1), 1.1, 0.8, fc='none', ec='black')) 11 | ax.text(0.2, 0.3, '$x$', size=30, ha='center', va='center') 12 | ax.text(0.7, 0.3, '$y$', size=30, ha='center', va='center') 13 | ax.text(0.0, 0.6, '$I$', size=30) 14 | ax.axis('equal') 15 | 16 | 17 | def plot_example_decision_tree(): 18 | fig = plt.figure(figsize=(10, 4)) 19 | ax = fig.add_axes([0, 0, 0.8, 1], frameon=False, xticks=[], yticks=[]) 20 | ax.set_title('Example Decision Tree: Animal Classification', size=24) 21 | 22 | def text(ax, x, y, t, size=20, **kwargs): 23 | ax.text(x, y, t, 24 | ha='center', va='center', size=size, 25 | bbox=dict(boxstyle='round', ec='k', fc='w'), **kwargs) 26 | 27 | text(ax, 0.5, 0.9, "How big is\nthe animal?", 20) 28 | text(ax, 0.3, 0.6, "Does the animal\nhave horns?", 18) 29 | text(ax, 0.7, 0.6, "Does the animal\nhave two legs?", 18) 30 | text(ax, 0.12, 0.3, "Are the horns\nlonger than 10cm?", 14) 31 | text(ax, 0.38, 0.3, "Is the animal\nwearing a collar?", 14) 32 | text(ax, 0.62, 0.3, "Does the animal\nhave wings?", 14) 33 | text(ax, 0.88, 0.3, "Does the animal\nhave a tail?", 14) 34 | 35 | text(ax, 0.4, 0.75, "> 1m", 12, alpha=0.4) 36 | text(ax, 0.6, 0.75, "< 1m", 12, alpha=0.4) 37 | 38 | text(ax, 0.21, 0.45, "yes", 12, alpha=0.4) 39 | text(ax, 0.34, 0.45, "no", 12, alpha=0.4) 40 | 41 | text(ax, 0.66, 0.45, "yes", 12, alpha=0.4) 42 | text(ax, 0.79, 0.45, "no", 12, alpha=0.4) 43 | 44 | ax.plot([0.3, 0.5, 0.7], [0.6, 0.9, 0.6], '-k') 45 | ax.plot([0.12, 0.3, 0.38], [0.3, 0.6, 0.3], '-k') 46 | ax.plot([0.62, 0.7, 0.88], [0.3, 0.6, 0.3], '-k') 47 | ax.plot([0.0, 0.12, 0.20], [0.0, 0.3, 0.0], '--k') 48 | ax.plot([0.28, 
0.38, 0.48], [0.0, 0.3, 0.0], '--k') 49 | ax.plot([0.52, 0.62, 0.72], [0.0, 0.3, 0.0], '--k') 50 | ax.plot([0.8, 0.88, 1.0], [0.0, 0.3, 0.0], '--k') 51 | ax.axis([0, 1, 0, 1]) 52 | 53 | 54 | def visualize_tree(estimator, X, y, boundaries=True, 55 | xlim=None, ylim=None): 56 | estimator.fit(X, y) 57 | 58 | if xlim is None: 59 | xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1) 60 | if ylim is None: 61 | ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1) 62 | 63 | x_min, x_max = xlim 64 | y_min, y_max = ylim 65 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 66 | np.linspace(y_min, y_max, 100)) 67 | Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()]) 68 | 69 | # Put the result into a color plot 70 | Z = Z.reshape(xx.shape) 71 | plt.figure() 72 | plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow') 73 | plt.clim(y.min(), y.max()) 74 | 75 | # Plot also the training points 76 | plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow') 77 | plt.axis('off') 78 | 79 | plt.xlim(x_min, x_max) 80 | plt.ylim(y_min, y_max) 81 | plt.clim(y.min(), y.max()) 82 | 83 | # Plot the decision boundaries 84 | def plot_boundaries(i, xlim, ylim): 85 | if i < 0: 86 | return 87 | 88 | tree = estimator.tree_ 89 | 90 | if tree.feature[i] == 0: 91 | plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k') 92 | plot_boundaries(tree.children_left[i], 93 | [xlim[0], tree.threshold[i]], ylim) 94 | plot_boundaries(tree.children_right[i], 95 | [tree.threshold[i], xlim[1]], ylim) 96 | 97 | elif tree.feature[i] == 1: 98 | plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k') 99 | plot_boundaries(tree.children_left[i], xlim, 100 | [ylim[0], tree.threshold[i]]) 101 | plot_boundaries(tree.children_right[i], xlim, 102 | [tree.threshold[i], ylim[1]]) 103 | 104 | if boundaries: 105 | plot_boundaries(0, plt.xlim(), plt.ylim()) 106 | 107 | 108 | def plot_tree_interactive(X, y): 109 | from sklearn.tree import DecisionTreeClassifier 110 | 111 | def interactive_tree(depth=1): 112 | clf = 
DecisionTreeClassifier(max_depth=depth, random_state=0) 113 | visualize_tree(clf, X, y) 114 | 115 | from IPython.html.widgets import interact 116 | return interact(interactive_tree, depth=[1, 5]) 117 | 118 | 119 | def plot_kmeans_interactive(min_clusters=1, max_clusters=6): 120 | from IPython.html.widgets import interact 121 | from sklearn.metrics.pairwise import euclidean_distances 122 | from sklearn.datasets.samples_generator import make_blobs 123 | 124 | with warnings.catch_warnings(): 125 | #warnings.filterwarnings('ignore') 126 | 127 | from sklearn.datasets import load_iris 128 | from sklearn.decomposition import PCA 129 | 130 | iris = load_iris() 131 | X, y = iris.data, iris.target 132 | pca = PCA(n_components=0.95) # keep 95% of variance 133 | X = pca.fit_transform(X) 134 | #X = X[:, 1:3] 135 | 136 | 137 | def _kmeans_step(frame=0, n_clusters=3): 138 | rng = np.random.RandomState(2) 139 | labels = np.zeros(X.shape[0]) 140 | centers = rng.randn(n_clusters, 2) 141 | 142 | nsteps = frame // 3 143 | 144 | for i in range(nsteps + 1): 145 | old_centers = centers 146 | if i < nsteps or frame % 3 > 0: 147 | dist = euclidean_distances(X, centers) 148 | labels = dist.argmin(1) 149 | 150 | if i < nsteps or frame % 3 > 1: 151 | centers = np.array([X[labels == j].mean(0) 152 | for j in range(n_clusters)]) 153 | nans = np.isnan(centers) 154 | centers[nans] = old_centers[nans] 155 | 156 | 157 | # plot the data and cluster centers 158 | plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow', 159 | vmin=0, vmax=n_clusters - 1); 160 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o', 161 | c=np.arange(n_clusters), 162 | s=200, cmap='rainbow') 163 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o', 164 | c='black', s=50) 165 | 166 | # plot new centers if third frame 167 | if frame % 3 == 2: 168 | for i in range(n_clusters): 169 | plt.annotate('', centers[i], old_centers[i], 170 | arrowprops=dict(arrowstyle='->', linewidth=1)) 171 | 
plt.scatter(centers[:, 0], centers[:, 1], marker='o', 172 | c=np.arange(n_clusters), 173 | s=200, cmap='rainbow') 174 | plt.scatter(centers[:, 0], centers[:, 1], marker='o', 175 | c='black', s=50) 176 | 177 | plt.xlim(-4, 4) 178 | plt.ylim(-2, 10) 179 | 180 | if frame % 3 == 1: 181 | plt.text(3.8, 9.5, "1. Reassign points to nearest centroid", 182 | ha='right', va='top', size=14) 183 | elif frame % 3 == 2: 184 | plt.text(3.8, 9.5, "2. Update centroids to cluster means", 185 | ha='right', va='top', size=14) 186 | 187 | 188 | return interact(_kmeans_step, frame=[0, 50], 189 | n_clusters=[min_clusters, max_clusters]) 190 | 191 | 192 | def plot_image_components(x, coefficients=None, mean=0, components=None, 193 | imshape=(8, 8), n_components=6, fontsize=12): 194 | if coefficients is None: 195 | coefficients = x 196 | 197 | if components is None: 198 | components = np.eye(len(coefficients), len(x)) 199 | 200 | mean = np.zeros_like(x) + mean 201 | 202 | 203 | fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2)) 204 | g = plt.GridSpec(2, 5 + n_components, hspace=0.3) 205 | 206 | def show(i, j, x, title=None): 207 | ax = fig.add_subplot(g[i, j], xticks=[], yticks=[]) 208 | ax.imshow(x.reshape(imshape), interpolation='nearest') 209 | if title: 210 | ax.set_title(title, fontsize=fontsize) 211 | 212 | show(slice(2), slice(2), x, "True") 213 | 214 | approx = mean.copy() 215 | show(0, 2, np.zeros_like(x) + mean, r'$\mu$') 216 | show(1, 2, approx, r'$1 \cdot \mu$') 217 | 218 | for i in range(0, n_components): 219 | approx = approx + coefficients[i] * components[i] 220 | show(0, i + 3, components[i], r'$c_{0}$'.format(i + 1)) 221 | show(1, i + 3, approx, 222 | r"${0:.2f} \cdot c_{1}$".format(coefficients[i], i + 1)) 223 | plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom', 224 | transform=plt.gca().transAxes, fontsize=fontsize) 225 | 226 | show(slice(2), slice(-2, None), approx, "Approx") 227 | 228 | 229 | def plot_pca_interactive(data, n_components=6): 230 | from 
sklearn.decomposition import PCA 231 | from IPython.html.widgets import interact 232 | 233 | pca = PCA(n_components=n_components) 234 | Xproj = pca.fit_transform(data) 235 | 236 | def show_decomp(i=0): 237 | plot_image_components(data[i], Xproj[i], 238 | pca.mean_, pca.components_) 239 | 240 | interact(show_decomp, i=(0, data.shape[0] - 1)); 241 | -------------------------------------------------------------------------------- /fig_code/helpers.py: -------------------------------------------------------------------------------- 1 | """ 2 | Small helpers for code that is not shown in the notebooks 3 | """ 4 | 5 | from sklearn import neighbors, datasets, linear_model 6 | import pylab as pl 7 | import numpy as np 8 | from matplotlib.colors import ListedColormap 9 | 10 | # Create color maps for 3-class classification problem, as with iris 11 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) 12 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) 13 | 14 | def plot_iris_knn(): 15 | iris = datasets.load_iris() 16 | X = iris.data[:, :2] # we only take the first two features. 
We could 17 | # avoid this ugly slicing by using a two-dim dataset 18 | y = iris.target 19 | 20 | knn = neighbors.KNeighborsClassifier(n_neighbors=3) 21 | knn.fit(X, y) 22 | 23 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 24 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 25 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 26 | np.linspace(y_min, y_max, 100)) 27 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) 28 | 29 | # Put the result into a color plot 30 | Z = Z.reshape(xx.shape) 31 | pl.figure() 32 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light) 33 | 34 | # Plot also the training points 35 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold) 36 | pl.xlabel('sepal length (cm)') 37 | pl.ylabel('sepal width (cm)') 38 | pl.axis('tight') 39 | 40 | 41 | def plot_polynomial_regression(): 42 | rng = np.random.RandomState(0) 43 | x = 2*rng.rand(100) - 1 44 | 45 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9 46 | y = f(x) + .4 * rng.normal(size=100) 47 | 48 | x_test = np.linspace(-1, 1, 100) 49 | 50 | pl.figure() 51 | pl.scatter(x, y, s=4) 52 | 53 | X = np.array([x**i for i in range(5)]).T 54 | X_test = np.array([x_test**i for i in range(5)]).T 55 | regr = linear_model.LinearRegression() 56 | regr.fit(X, y) 57 | pl.plot(x_test, regr.predict(X_test), label='4th order') 58 | 59 | X = np.array([x**i for i in range(10)]).T 60 | X_test = np.array([x_test**i for i in range(10)]).T 61 | regr = linear_model.LinearRegression() 62 | regr.fit(X, y) 63 | pl.plot(x_test, regr.predict(X_test), label='9th order') 64 | 65 | pl.legend(loc='best') 66 | pl.axis('tight') 67 | pl.title('Fitting a 4th and a 9th order polynomial') 68 | 69 | pl.figure() 70 | pl.scatter(x, y, s=4) 71 | pl.plot(x_test, f(x_test), label="truth") 72 | pl.axis('tight') 73 | pl.title('Ground truth (9th order polynomial)') 74 | 75 | 76 | -------------------------------------------------------------------------------- /fig_code/linear_regression.py: 
-------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import LinearRegression 4 | 5 | 6 | def plot_linear_regression(): 7 | a = 0.5 8 | b = 1.0 9 | 10 | # x from 0 to 30 11 | x = 30 * np.random.random(20) 12 | 13 | # y = a*x + b with noise 14 | y = a * x + b + np.random.normal(size=x.shape) 15 | 16 | # create a linear regression model 17 | clf = LinearRegression() 18 | clf.fit(x[:, None], y) 19 | 20 | # predict y from the data 21 | x_new = np.linspace(0, 30, 100) 22 | y_new = clf.predict(x_new[:, None]) 23 | 24 | # plot the results 25 | ax = plt.axes() 26 | ax.scatter(x, y) 27 | ax.plot(x_new, y_new) 28 | 29 | ax.set_xlabel('x') 30 | ax.set_ylabel('y') 31 | 32 | ax.axis('tight') 33 | 34 | 35 | if __name__ == '__main__': 36 | plot_linear_regression() 37 | plt.show() 38 | -------------------------------------------------------------------------------- /fig_code/sgd_separator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import SGDClassifier 4 | from sklearn.datasets.samples_generator import make_blobs 5 | 6 | def plot_sgd_separator(): 7 | # we create 50 separable points 8 | X, Y = make_blobs(n_samples=50, centers=2, 9 | random_state=0, cluster_std=0.60) 10 | 11 | # fit the model 12 | clf = SGDClassifier(loss="hinge", alpha=0.01, 13 | n_iter=200, fit_intercept=True) 14 | clf.fit(X, Y) 15 | 16 | # plot the line, the points, and the nearest vectors to the plane 17 | xx = np.linspace(-1, 5, 10) 18 | yy = np.linspace(-1, 5, 10) 19 | 20 | X1, X2 = np.meshgrid(xx, yy) 21 | Z = np.empty(X1.shape) 22 | for (i, j), val in np.ndenumerate(X1): 23 | x1 = val 24 | x2 = X2[i, j] 25 | p = clf.decision_function(np.array([[x1, x2]])) 26 | Z[i, j] = p[0] 27 | levels = [-1.0, 0.0, 1.0] 28 | linestyles = ['dashed', 'solid', 'dashed'] 29 | 
colors = 'k' 30 | 31 | ax = plt.axes() 32 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles) 33 | ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired) 34 | 35 | ax.axis('tight') 36 | 37 | 38 | if __name__ == '__main__': 39 | plot_sgd_separator() 40 | plt.show() 41 | -------------------------------------------------------------------------------- /fig_code/svm_gui.py: -------------------------------------------------------------------------------- 1 | """ 2 | ========== 3 | Libsvm GUI 4 | ========== 5 | 6 | A simple graphical frontend for Libsvm mainly intended for didactic 7 | purposes. You can create data points by point and click and visualize 8 | the decision region induced by different kernels and parameter settings. 9 | 10 | To create positive examples click the left mouse button; to create 11 | negative examples click the right button. 12 | 13 | If all examples are from the same class, it uses a one-class SVM. 14 | 15 | """ 16 | from __future__ import division, print_function 17 | 18 | print(__doc__) 19 | 20 | # Author: Peter Prettenhoer 21 | # 22 | # License: BSD 3 clause 23 | 24 | import matplotlib 25 | matplotlib.use('TkAgg') 26 | 27 | from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg 28 | from matplotlib.backends.backend_tkagg import NavigationToolbar2TkAgg 29 | from matplotlib.figure import Figure 30 | from matplotlib.contour import ContourSet 31 | 32 | import Tkinter as Tk 33 | import sys 34 | import numpy as np 35 | 36 | from sklearn import svm 37 | from sklearn.datasets import dump_svmlight_file 38 | from sklearn.externals.six.moves import xrange 39 | 40 | y_min, y_max = -50, 50 41 | x_min, x_max = -50, 50 42 | 43 | 44 | class Model(object): 45 | """The Model which holds the data. It implements the 46 | observable in the observer pattern and notifies the 47 | registered observers on change event. 
48 | """ 49 | 50 | def __init__(self): 51 | self.observers = [] 52 | self.surface = None 53 | self.data = [] 54 | self.cls = None 55 | self.surface_type = 0 56 | 57 | def changed(self, event): 58 | """Notify the observers. """ 59 | for observer in self.observers: 60 | observer.update(event, self) 61 | 62 | def add_observer(self, observer): 63 | """Register an observer. """ 64 | self.observers.append(observer) 65 | 66 | def set_surface(self, surface): 67 | self.surface = surface 68 | 69 | def dump_svmlight_file(self, file): 70 | data = np.array(self.data) 71 | X = data[:, 0:2] 72 | y = data[:, 2] 73 | dump_svmlight_file(X, y, file) 74 | 75 | 76 | class Controller(object): 77 | def __init__(self, model): 78 | self.model = model 79 | self.kernel = Tk.IntVar() 80 | self.surface_type = Tk.IntVar() 81 | # Whether or not a model has been fitted 82 | self.fitted = False 83 | 84 | def fit(self): 85 | print("fit the model") 86 | train = np.array(self.model.data) 87 | X = train[:, 0:2] 88 | y = train[:, 2] 89 | 90 | C = float(self.complexity.get()) 91 | gamma = float(self.gamma.get()) 92 | coef0 = float(self.coef0.get()) 93 | degree = int(self.degree.get()) 94 | kernel_map = {0: "linear", 1: "rbf", 2: "poly"} 95 | if len(np.unique(y)) == 1: 96 | clf = svm.OneClassSVM(kernel=kernel_map[self.kernel.get()], 97 | gamma=gamma, coef0=coef0, degree=degree) 98 | clf.fit(X) 99 | else: 100 | clf = svm.SVC(kernel=kernel_map[self.kernel.get()], C=C, 101 | gamma=gamma, coef0=coef0, degree=degree) 102 | clf.fit(X, y) 103 | if hasattr(clf, 'score'): 104 | print("Accuracy:", clf.score(X, y) * 100) 105 | X1, X2, Z = self.decision_surface(clf) 106 | self.model.clf = clf 107 | self.model.set_surface((X1, X2, Z)) 108 | self.model.surface_type = self.surface_type.get() 109 | self.fitted = True 110 | self.model.changed("surface") 111 | 112 | def decision_surface(self, cls): 113 | delta = 1 114 | x = np.arange(x_min, x_max + delta, delta) 115 | y = np.arange(y_min, y_max + delta, delta) 116 | X1, 
X2 = np.meshgrid(x, y) 117 | Z = cls.decision_function(np.c_[X1.ravel(), X2.ravel()]) 118 | Z = Z.reshape(X1.shape) 119 | return X1, X2, Z 120 | 121 | def clear_data(self): 122 | self.model.data = [] 123 | self.fitted = False 124 | self.model.changed("clear") 125 | 126 | def add_example(self, x, y, label): 127 | self.model.data.append((x, y, label)) 128 | self.model.changed("example_added") 129 | 130 | # update decision surface if already fitted. 131 | self.refit() 132 | 133 | def refit(self): 134 | """Refit the model if already fitted. """ 135 | if self.fitted: 136 | self.fit() 137 | 138 | 139 | class View(object): 140 | """Test docstring. """ 141 | def __init__(self, root, controller): 142 | f = Figure() 143 | ax = f.add_subplot(111) 144 | ax.set_xticks([]) 145 | ax.set_yticks([]) 146 | ax.set_xlim((x_min, x_max)) 147 | ax.set_ylim((y_min, y_max)) 148 | canvas = FigureCanvasTkAgg(f, master=root) 149 | canvas.show() 150 | canvas.get_tk_widget().pack(side=Tk.TOP, fill=Tk.BOTH, expand=1) 151 | canvas._tkcanvas.pack(side=Tk.TOP, fill=Tk.BOTH, expand=1) 152 | canvas.mpl_connect('key_press_event', self.onkeypress) 153 | canvas.mpl_connect('key_release_event', self.onkeyrelease) 154 | canvas.mpl_connect('button_press_event', self.onclick) 155 | toolbar = NavigationToolbar2TkAgg(canvas, root) 156 | toolbar.update() 157 | self.shift_down = False 158 | self.controllbar = ControllBar(root, controller) 159 | self.f = f 160 | self.ax = ax 161 | self.canvas = canvas 162 | self.controller = controller 163 | self.contours = [] 164 | self.c_labels = None 165 | self.plot_kernels() 166 | 167 | def plot_kernels(self): 168 | self.ax.text(-50, -60, "Linear: $u^T v$") 169 | self.ax.text(-20, -60, "RBF: $\exp (-\gamma \| u-v \|^2)$") 170 | self.ax.text(10, -60, "Poly: $(\gamma \, u^T v + r)^d$") 171 | 172 | def onkeypress(self, event): 173 | if event.key == "shift": 174 | self.shift_down = True 175 | 176 | def onkeyrelease(self, event): 177 | if event.key == "shift": 178 | 
self.shift_down = False 179 | 180 | def onclick(self, event): 181 | if event.xdata and event.ydata: 182 | if self.shift_down or event.button == 3: 183 | self.controller.add_example(event.xdata, event.ydata, -1) 184 | elif event.button == 1: 185 | self.controller.add_example(event.xdata, event.ydata, 1) 186 | 187 | def update_example(self, model, idx): 188 | x, y, l = model.data[idx] 189 | if l == 1: 190 | color = 'w' 191 | elif l == -1: 192 | color = 'k' 193 | self.ax.plot([x], [y], "%so" % color, scalex=0.0, scaley=0.0) 194 | 195 | def update(self, event, model): 196 | if event == "examples_loaded": 197 | for i in xrange(len(model.data)): 198 | self.update_example(model, i) 199 | 200 | if event == "example_added": 201 | self.update_example(model, -1) 202 | 203 | if event == "clear": 204 | self.ax.clear() 205 | self.ax.set_xticks([]) 206 | self.ax.set_yticks([]) 207 | self.contours = [] 208 | self.c_labels = None 209 | self.plot_kernels() 210 | 211 | if event == "surface": 212 | self.remove_surface() 213 | self.plot_support_vectors(model.clf.support_vectors_) 214 | self.plot_decision_surface(model.surface, model.surface_type) 215 | 216 | self.canvas.draw() 217 | 218 | def remove_surface(self): 219 | """Remove old decision surface.""" 220 | if len(self.contours) > 0: 221 | for contour in self.contours: 222 | if isinstance(contour, ContourSet): 223 | for lineset in contour.collections: 224 | lineset.remove() 225 | else: 226 | contour.remove() 227 | self.contours = [] 228 | 229 | def plot_support_vectors(self, support_vectors): 230 | """Plot the support vectors by placing circles over the 231 | corresponding data points and adds the circle collection 232 | to the contours list.""" 233 | cs = self.ax.scatter(support_vectors[:, 0], support_vectors[:, 1], 234 | s=80, edgecolors="k", facecolors="none") 235 | self.contours.append(cs) 236 | 237 | def plot_decision_surface(self, surface, type): 238 | X1, X2, Z = surface 239 | if type == 0: 240 | levels = [-1.0, 0.0, 1.0] 241 
| linestyles = ['dashed', 'solid', 'dashed'] 242 | colors = 'k' 243 | self.contours.append(self.ax.contour(X1, X2, Z, levels, 244 | colors=colors, 245 | linestyles=linestyles)) 246 | elif type == 1: 247 | self.contours.append(self.ax.contourf(X1, X2, Z, 10, 248 | cmap=matplotlib.cm.bone, 249 | origin='lower', alpha=0.85)) 250 | self.contours.append(self.ax.contour(X1, X2, Z, [0.0], colors='k', 251 | linestyles=['solid'])) 252 | else: 253 | raise ValueError("surface type unknown") 254 | 255 | 256 | class ControllBar(object): 257 | def __init__(self, root, controller): 258 | fm = Tk.Frame(root) 259 | kernel_group = Tk.Frame(fm) 260 | Tk.Radiobutton(kernel_group, text="Linear", variable=controller.kernel, 261 | value=0, command=controller.refit).pack(anchor=Tk.W) 262 | Tk.Radiobutton(kernel_group, text="RBF", variable=controller.kernel, 263 | value=1, command=controller.refit).pack(anchor=Tk.W) 264 | Tk.Radiobutton(kernel_group, text="Poly", variable=controller.kernel, 265 | value=2, command=controller.refit).pack(anchor=Tk.W) 266 | kernel_group.pack(side=Tk.LEFT) 267 | 268 | valbox = Tk.Frame(fm) 269 | controller.complexity = Tk.StringVar() 270 | controller.complexity.set("1.0") 271 | c = Tk.Frame(valbox) 272 | Tk.Label(c, text="C:", anchor="e", width=7).pack(side=Tk.LEFT) 273 | Tk.Entry(c, width=6, textvariable=controller.complexity).pack( 274 | side=Tk.LEFT) 275 | c.pack() 276 | 277 | controller.gamma = Tk.StringVar() 278 | controller.gamma.set("0.01") 279 | g = Tk.Frame(valbox) 280 | Tk.Label(g, text="gamma:", anchor="e", width=7).pack(side=Tk.LEFT) 281 | Tk.Entry(g, width=6, textvariable=controller.gamma).pack(side=Tk.LEFT) 282 | g.pack() 283 | 284 | controller.degree = Tk.StringVar() 285 | controller.degree.set("3") 286 | d = Tk.Frame(valbox) 287 | Tk.Label(d, text="degree:", anchor="e", width=7).pack(side=Tk.LEFT) 288 | Tk.Entry(d, width=6, textvariable=controller.degree).pack(side=Tk.LEFT) 289 | d.pack() 290 | 291 | controller.coef0 = Tk.StringVar() 292 | 
controller.coef0.set("0") 293 | r = Tk.Frame(valbox) 294 | Tk.Label(r, text="coef0:", anchor="e", width=7).pack(side=Tk.LEFT) 295 | Tk.Entry(r, width=6, textvariable=controller.coef0).pack(side=Tk.LEFT) 296 | r.pack() 297 | valbox.pack(side=Tk.LEFT) 298 | 299 | cmap_group = Tk.Frame(fm) 300 | Tk.Radiobutton(cmap_group, text="Hyperplanes", 301 | variable=controller.surface_type, value=0, 302 | command=controller.refit).pack(anchor=Tk.W) 303 | Tk.Radiobutton(cmap_group, text="Surface", 304 | variable=controller.surface_type, value=1, 305 | command=controller.refit).pack(anchor=Tk.W) 306 | 307 | cmap_group.pack(side=Tk.LEFT) 308 | 309 | train_button = Tk.Button(fm, text='Fit', width=5, 310 | command=controller.fit) 311 | train_button.pack() 312 | fm.pack(side=Tk.LEFT) 313 | Tk.Button(fm, text='Clear', width=5, 314 | command=controller.clear_data).pack(side=Tk.LEFT) 315 | 316 | 317 | def get_parser(): 318 | from optparse import OptionParser 319 | op = OptionParser() 320 | op.add_option("--output", 321 | action="store", type="str", dest="output", 322 | help="Path where to dump data.") 323 | return op 324 | 325 | 326 | def main(argv): 327 | op = get_parser() 328 | opts, args = op.parse_args(argv[1:]) 329 | root = Tk.Tk() 330 | model = Model() 331 | controller = Controller(model) 332 | root.wm_title("Scikit-learn Libsvm GUI") 333 | view = View(root, controller) 334 | model.add_observer(view) 335 | Tk.mainloop() 336 | 337 | if opts.output: 338 | model.dump_svmlight_file(opts.output) 339 | 340 | if __name__ == "__main__": 341 | main(sys.argv) 342 | -------------------------------------------------------------------------------- /imgs/iris-setosa.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/iris-setosa.jpg -------------------------------------------------------------------------------- /imgs/iris_petal_sepal.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/iris_petal_sepal.png -------------------------------------------------------------------------------- /imgs/iris_with_labels.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/iris_with_labels.jpg -------------------------------------------------------------------------------- /imgs/irises.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/irises.png -------------------------------------------------------------------------------- /imgs/ml_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/ml_map.png -------------------------------------------------------------------------------- /imgs/ml_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/ml_process.png -------------------------------------------------------------------------------- /imgs/ml_process_by_micheleenharris.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/ml_process_by_micheleenharris.png -------------------------------------------------------------------------------- /imgs/ml_process_mharris2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/ml_process_mharris2.png -------------------------------------------------------------------------------- /imgs/pca1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/pca1.png -------------------------------------------------------------------------------- /imgs/sgd_boundary_scatter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/1c176aa0ddc5d4de8a58ec21887016eea770fb52/imgs/sgd_boundary_scatter.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.11.0 2 | pandas==0.18.0 3 | scipy==0.17.0 4 | matplotlib==1.5.1 5 | scikit-learn==0.17.1 6 | seaborn==0.7.0 7 | jupyter==1.0.0 8 | ipython==4.1.2 9 | skflow==0.1.0 10 | ipykernel==4.3.1 11 | --------------------------------------------------------------------------------