├── .gitignore
├── Intro to Scikit-Learn.ipynb
├── LICENSE
├── README.md
└── facebook_map.png


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.py[cod]
 2 | 
 3 | # C extensions
 4 | *.so
 5 | 
 6 | # Packages
 7 | *.egg
 8 | *.egg-info
 9 | dist
10 | build
11 | eggs
12 | parts
13 | bin
14 | var
15 | sdist
16 | develop-eggs
17 | .installed.cfg
18 | lib
19 | lib64
20 | __pycache__
21 | 
22 | # Installer logs
23 | pip-log.txt
24 | 
25 | # Unit test / coverage reports
26 | .coverage
27 | .tox
28 | nosetests.xml
29 | 
30 | # Translations
31 | *.mo
32 | 
33 | # Mr Developer
34 | .mr.developer.cfg
35 | .project
36 | .pydevproject
37 | 
38 | *.html
39 | 


--------------------------------------------------------------------------------
/Intro to Scikit-Learn.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "metadata": {
   3 |   "celltoolbar": "Slideshow",
   4 |   "name": ""
   5 |  },
   6 |  "nbformat": 3,
   7 |  "nbformat_minor": 0,
   8 |  "worksheets": [
   9 |   {
  10 |    "cells": [
  11 |     {
  12 |      "cell_type": "markdown",
  13 |      "metadata": {
  14 |       "slideshow": {
  15 |        "slide_type": "slide"
  16 |       }
  17 |      },
  18 |      "source": [
  19 |       "# Introduction to Scikit-Learn \n",
  20 |       "\n",
  21 |       "View this IPython Notebook: \n",
  22 |       "\n",
  23 |       "    j.mp/sklearn\n",
  24 |       "    \n",
  25 |       "Everything is in a Github repo: \n",
  26 |       "    \n",
  27 |       "    github.com/tdhopper/\n",
  28 |       "    \n",
  29 |       "View slides with:\n",
  30 |       "\n",
  31 |       "    ipython nbconvert Intro\\ to\\ Scikit-Learn.ipynb --to slides --post serve"
  32 |      ]
  33 |     },
  34 |     {
  35 |      "cell_type": "markdown",
  36 |      "metadata": {
  37 |       "slideshow": {
  38 |        "slide_type": "slide"
  39 |       }
  40 |      },
  41 |      "source": [
  42 |       "<center>\n",
  43 |       "# Introduction to Scikit-Learn <br /><br />\n",
  44 |       "\n",
  45 |       "__Research Triangle Analysts (1/16/13)__\n",
  46 |       "<img src = \"http://stiglerdiet.com/theme/images/logo.png\" /><br /><br />\n",
  47 |       "\n",
  48 |       "Software Engineer at [parse.ly](http://www.parse.ly) <br />\n",
  49 |       "@tdhopper <br />\n",
  50 |       "tdhopper@gmail.com <br />\n",
  51 |       "</center>"
  52 |      ]
  53 |     },
  54 |     {
  55 |      "cell_type": "heading",
  56 |      "level": 1,
  57 |      "metadata": {
  58 |       "slideshow": {
  59 |        "slide_type": "slide"
  60 |       }
  61 |      },
  62 |      "source": [
  63 |       "What is Scikit-Learn?"
  64 |      ]
  65 |     },
  66 |     {
  67 |      "cell_type": "markdown",
  68 |      "metadata": {},
  69 |      "source": [
  70 |       "\"Machine Learning in Python\""
  71 |      ]
  72 |     },
  73 |     {
  74 |      "cell_type": "markdown",
  75 |      "metadata": {
  76 |       "slideshow": {
  77 |        "slide_type": "fragment"
  78 |       }
  79 |      },
  80 |      "source": [
  81 |       "* Classification \n",
  82 |       "* Regression\n",
  83 |       "* Clustering\n",
  84 |       "* Dimensionality Reduction\n",
  85 |       "* Model Selection\n",
  86 |       "* Preprocessing\n",
  87 |       "\n",
  88 |       "See more: [http://scikit-learn.org/stable/user_guide.html](http://scikit-learn.org/stable/user_guide.html)"
  89 |      ]
  90 |     },
  91 |     {
  92 |      "cell_type": "heading",
  93 |      "level": 1,
  94 |      "metadata": {
  95 |       "slideshow": {
  96 |        "slide_type": "slide"
  97 |       }
  98 |      },
  99 |      "source": [
 100 |       "Why scikit-learn?"
 101 |      ]
 102 |     },
 103 |     {
 104 |      "cell_type": "markdown",
 105 |      "metadata": {},
 106 |      "source": [
 107 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 108 |       "\n",
 109 |       "One: __Commitment to documentation and usability__\n",
 110 |       "\n",
 111 |       "> One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate). \n"
 112 |      ]
 113 |     },
 114 |     {
 115 |      "cell_type": "markdown",
 116 |      "metadata": {
 117 |       "slideshow": {
 118 |        "slide_type": "subslide"
 119 |       }
 120 |      },
 121 |      "source": [
 122 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 123 |       "\n",
 124 |       "Two: __Models are chosen and implemented by a dedicated team of experts__\n",
 125 |       "\n",
 126 |       "> Scikit-learn\u2019s stable of contributors includes experts in machine-learning and software development."
 127 |      ]
 128 |     },
 129 |     {
 130 |      "cell_type": "markdown",
 131 |      "metadata": {
 132 |       "slideshow": {
 133 |        "slide_type": "subslide"
 134 |       }
 135 |      },
 136 |      "source": [
 137 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 138 |       "\n",
 139 |       "Three: __Covers most machine-learning tasks__\n",
 140 |       "\n",
 141 |       "> Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.).\n"
 142 |      ]
 143 |     },
 144 |     {
 145 |      "cell_type": "markdown",
 146 |      "metadata": {
 147 |       "slideshow": {
 148 |        "slide_type": "subslide"
 149 |       }
 150 |      },
 151 |      "source": [
 152 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 153 |       "\n",
 154 |       "Four: __Python and Pydata__\n",
 155 |       "\n",
 156 |       "> An impressive set of Python data tools (pydata) have emerged over the last few years.\n",
 157 |       "\n"
 158 |      ]
 159 |     },
 160 |     {
 161 |      "cell_type": "markdown",
 162 |      "metadata": {
 163 |       "slideshow": {
 164 |        "slide_type": "subslide"
 165 |       }
 166 |      },
 167 |      "source": [
 168 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 169 |       "\n",
 170 |       "Five: __Focus__\n",
 171 |       "\n",
 172 |       "> Scikit-learn is a machine-learning library. Its goal is to provide a set of common algorithms to Python users through a consistent interface.\n",
 173 |       "\n"
 174 |      ]
 175 |     },
 176 |     {
 177 |      "cell_type": "markdown",
 178 |      "metadata": {
 179 |       "slideshow": {
 180 |        "slide_type": "subslide"
 181 |       }
 182 |      },
 183 |      "source": [
 184 |       "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n",
 185 |       "\n",
 186 |       "Six: __scikit-learn scales to most data problems__\n",
 187 |       "\n",
 188 |       "> Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.\n"
 189 |      ]
 190 |     },
 191 |     {
 192 |      "cell_type": "heading",
 193 |      "level": 1,
 194 |      "metadata": {
 195 |       "slideshow": {
 196 |        "slide_type": "slide"
 197 |       }
 198 |      },
 199 |      "source": [
 200 |       "This talk is _not_...\n",
 201 |       "\n"
 202 |      ]
 203 |     },
 204 |     {
 205 |      "cell_type": "markdown",
 206 |      "metadata": {},
 207 |      "source": [
 208 |       "...an introduction to Python"
 209 |      ]
 210 |     },
 211 |     {
 212 |      "cell_type": "markdown",
 213 |      "metadata": {
 214 |       "slideshow": {
 215 |        "slide_type": "fragment"
 216 |       }
 217 |      },
 218 |      "source": [
 219 |       "...an introduction to machine learning"
 220 |      ]
 221 |     },
 222 |     {
 223 |      "cell_type": "heading",
 224 |      "level": 1,
 225 |      "metadata": {
 226 |       "slideshow": {
 227 |        "slide_type": "slide"
 228 |       }
 229 |      },
 230 |      "source": [
 231 |       "Example\n"
 232 |      ]
 233 |     },
 234 |     {
 235 |      "cell_type": "code",
 236 |      "collapsed": false,
 237 |      "input": [
 238 |       "from sklearn import datasets\n",
 239 |       "from numpy import logical_or\n",
 240 |       "from sklearn.lda import LDA\n",
 241 |       "from sklearn.metrics import confusion_matrix"
 242 |      ],
 243 |      "language": "python",
 244 |      "metadata": {
 245 |       "slideshow": {
 246 |        "slide_type": "skip"
 247 |       }
 248 |      },
 249 |      "outputs": [],
 250 |      "prompt_number": 4
 251 |     },
 252 |     {
 253 |      "cell_type": "code",
 254 |      "collapsed": false,
 255 |      "input": [
 256 |       "iris = datasets.load_iris()\n",
 257 |       "subset = logical_or(iris.target == 0, iris.target == 1)\n",
 258 |       "\n",
 259 |       "X = iris.data[subset]\n",
 260 |       "y = iris.target[subset]"
 261 |      ],
 262 |      "language": "python",
 263 |      "metadata": {
 264 |       "slideshow": {
 265 |        "slide_type": "-"
 266 |       }
 267 |      },
 268 |      "outputs": [],
 269 |      "prompt_number": 5
 270 |     },
 271 |     {
 272 |      "cell_type": "code",
 273 |      "collapsed": false,
 274 |      "input": [
 275 |       "print(X[0:5,:])"
 276 |      ],
 277 |      "language": "python",
 278 |      "metadata": {},
 279 |      "outputs": [
 280 |       {
 281 |        "output_type": "stream",
 282 |        "stream": "stdout",
 283 |        "text": [
 284 |         "[[ 5.1  3.5  1.4  0.2]\n",
 285 |         " [ 4.9  3.   1.4  0.2]\n",
 286 |         " [ 4.7  3.2  1.3  0.2]\n",
 287 |         " [ 4.6  3.1  1.5  0.2]\n",
 288 |         " [ 5.   3.6  1.4  0.2]]\n"
 289 |        ]
 290 |       }
 291 |      ],
 292 |      "prompt_number": 6
 293 |     },
 294 |     {
 295 |      "cell_type": "code",
 296 |      "collapsed": false,
 297 |      "input": [
 298 |       "print(y[0:5])"
 299 |      ],
 300 |      "language": "python",
 301 |      "metadata": {},
 302 |      "outputs": [
 303 |       {
 304 |        "output_type": "stream",
 305 |        "stream": "stdout",
 306 |        "text": [
 307 |         "[0 0 0 0 0]\n"
 308 |        ]
 309 |       }
 310 |      ],
 311 |      "prompt_number": 7
 312 |     },
 313 |     {
 314 |      "cell_type": "code",
 315 |      "collapsed": false,
 316 |      "input": [
 317 |       "# Linear Discriminant Analysis\n",
 318 |       "lda = LDA(2)\n",
 319 |       "lda.fit(X, y)\n",
 320 |       "\n",
 321 |       "confusion_matrix(y, lda.predict(X))"
 322 |      ],
 323 |      "language": "python",
 324 |      "metadata": {
 325 |       "slideshow": {
 326 |        "slide_type": "subslide"
 327 |       }
 328 |      },
 329 |      "outputs": [
 330 |       {
 331 |        "metadata": {},
 332 |        "output_type": "pyout",
 333 |        "prompt_number": 8,
 334 |        "text": [
 335 |         "array([[50,  0],\n",
 336 |         "       [ 0, 50]])"
 337 |        ]
 338 |       }
 339 |      ],
 340 |      "prompt_number": 8
 341 |     },
 342 |     {
 343 |      "cell_type": "heading",
 344 |      "level": 1,
 345 |      "metadata": {
 346 |       "slideshow": {
 347 |        "slide_type": "slide"
 348 |       }
 349 |      },
 350 |      "source": [
 351 |       "The Scikit-learn API"
 352 |      ]
 353 |     },
 354 |     {
 355 |      "cell_type": "markdown",
 356 |      "metadata": {},
 357 |      "source": [
 358 |       "Scikit-learn has four main \"interfaces\" (one class can implement several of them):\n",
 359 |       "\n",
 360 |       "__Estimator__: \n",
 361 |       "\n",
 362 |       "    estimator = obj.fit(data, targets) "
 363 |      ]
 364 |     },
 365 |     {
 366 |      "cell_type": "markdown",
 367 |      "metadata": {},
 368 |      "source": [
 369 |       "__Predictor__: \n",
 370 |       "\n",
 371 |       "    prediction = obj.predict(data) "
 372 |      ]
 373 |     },
 374 |     {
 375 |      "cell_type": "markdown",
 376 |      "metadata": {},
 377 |      "source": [
 378 |       "__Transformer__:\n",
 379 |       "\n",
 380 |       "    new_data = obj.transform(data) \n",
 381 |       "    "
 382 |      ]
 383 |     },
 384 |     {
 385 |      "cell_type": "markdown",
 386 |      "metadata": {},
 387 |      "source": [
 388 |       "__Model__:\n",
 389 |       "\n",
 390 |       "    score = obj.score(data)"
 391 |      ]
 392 |     },
 393 |     {
 394 |      "cell_type": "heading",
 395 |      "level": 2,
 396 |      "metadata": {
 397 |       "slideshow": {
 398 |        "slide_type": "slide"
 399 |       }
 400 |      },
 401 |      "source": [
 402 |       "Scikit-learn API: the Estimator"
 403 |      ]
 404 |     },
 405 |     {
 406 |      "cell_type": "markdown",
 407 |      "metadata": {},
 408 |      "source": [
 409 |       "All estimators implement the __fit__ method:\n",
 410 |       "\n",
 411 |       "    estimator.fit(X, y)"
 412 |      ]
 413 |     },
 414 |     {
 415 |      "cell_type": "markdown",
 416 |      "metadata": {
 417 |       "slideshow": {
 418 |        "slide_type": "-"
 419 |       }
 420 |      },
 421 |      "source": [
 422 |       "    \n",
 423 |       "> An estimator is an object that __fits a model__ based on some training data and is __capable of inferring__ some properties on new data."
 424 |      ]
 425 |     },
 426 |     {
 427 |      "cell_type": "code",
 428 |      "collapsed": false,
 429 |      "input": [
 430 |       "from sklearn.linear_model import LogisticRegression"
 431 |      ],
 432 |      "language": "python",
 433 |      "metadata": {
 434 |       "slideshow": {
 435 |        "slide_type": "skip"
 436 |       }
 437 |      },
 438 |      "outputs": [],
 439 |      "prompt_number": 9
 440 |     },
 441 |     {
 442 |      "cell_type": "code",
 443 |      "collapsed": false,
 444 |      "input": [
 445 |       "# Create Model\n",
 446 |       "model = LogisticRegression()\n",
 447 |       "# Fit Model\n",
 448 |       "model.fit(X, y)"
 449 |      ],
 450 |      "language": "python",
 451 |      "metadata": {
 452 |       "slideshow": {
 453 |        "slide_type": "fragment"
 454 |       }
 455 |      },
 456 |      "outputs": [
 457 |       {
 458 |        "metadata": {},
 459 |        "output_type": "pyout",
 460 |        "prompt_number": 10,
 461 |        "text": [
 462 |         "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
 463 |         "          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)"
 464 |        ]
 465 |       }
 466 |      ],
 467 |      "prompt_number": 10
 468 |     },
 469 |     {
 470 |      "cell_type": "heading",
 471 |      "level": 2,
 472 |      "metadata": {
 473 |       "slideshow": {
 474 |        "slide_type": "slide"
 475 |       }
 476 |      },
 477 |      "source": [
 478 |       "(Almost) everything is an estimator"
 479 |      ]
 480 |     },
 481 |     {
 482 |      "cell_type": "heading",
 483 |      "level": 3,
 484 |      "metadata": {},
 485 |      "source": [
 486 |       "Unsupervised Learning"
 487 |      ]
 488 |     },
 489 |     {
 490 |      "cell_type": "code",
 491 |      "collapsed": false,
 492 |      "input": [
 493 |       "from sklearn.cluster import KMeans"
 494 |      ],
 495 |      "language": "python",
 496 |      "metadata": {
 497 |       "slideshow": {
 498 |        "slide_type": "skip"
 499 |       }
 500 |      },
 501 |      "outputs": [],
 502 |      "prompt_number": 11
 503 |     },
 504 |     {
 505 |      "cell_type": "code",
 506 |      "collapsed": false,
 507 |      "input": [
 508 |       "# Create Model\n",
 509 |       "kmeans = KMeans(n_clusters = 2)\n",
 510 |       "# Fit Model\n",
 511 |       "kmeans.fit(X)"
 512 |      ],
 513 |      "language": "python",
 514 |      "metadata": {},
 515 |      "outputs": [
 516 |       {
 517 |        "metadata": {},
 518 |        "output_type": "pyout",
 519 |        "prompt_number": 12,
 520 |        "text": [
 521 |         "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n",
 522 |         "    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n",
 523 |         "    verbose=0)"
 524 |        ]
 525 |       }
 526 |      ],
 527 |      "prompt_number": 12
 528 |     },
 529 |     {
 530 |      "cell_type": "heading",
 531 |      "level": 3,
 532 |      "metadata": {
 533 |       "slideshow": {
 534 |        "slide_type": "slide"
 535 |       }
 536 |      },
 537 |      "source": [
 538 |       "Dimensionality Reduction"
 539 |      ]
 540 |     },
 541 |     {
 542 |      "cell_type": "code",
 543 |      "collapsed": false,
 544 |      "input": [
 545 |       "from sklearn.decomposition import PCA"
 546 |      ],
 547 |      "language": "python",
 548 |      "metadata": {
 549 |       "slideshow": {
 550 |        "slide_type": "skip"
 551 |       }
 552 |      },
 553 |      "outputs": [],
 554 |      "prompt_number": 13
 555 |     },
 556 |     {
 557 |      "cell_type": "code",
 558 |      "collapsed": false,
 559 |      "input": [
 560 |       "# Create Model \n",
 561 |       "pca = PCA(n_components=2)\n",
 562 |       "# Fit Model\n",
 563 |       "pca.fit(X)"
 564 |      ],
 565 |      "language": "python",
 566 |      "metadata": {},
 567 |      "outputs": [
 568 |       {
 569 |        "metadata": {},
 570 |        "output_type": "pyout",
 571 |        "prompt_number": 14,
 572 |        "text": [
 573 |         "PCA(copy=True, n_components=2, whiten=False)"
 574 |        ]
 575 |       }
 576 |      ],
 577 |      "prompt_number": 14
 578 |     },
 579 |     {
 580 |      "cell_type": "markdown",
 581 |      "metadata": {
 582 |       "slideshow": {
 583 |        "slide_type": "fragment"
 584 |       }
 585 |      },
 586 |      "source": [
 587 |       "The __fit__ method accepts a $y$ argument even when the estimator is unsupervised and ignores it. This uniform signature matters later, when estimators are chained in a pipeline."
 588 |      ]
 589 |     },
 590 |     {
 591 |      "cell_type": "code",
 592 |      "collapsed": false,
 593 |      "input": [
 594 |       "from sklearn.decomposition import PCA"
 595 |      ],
 596 |      "language": "python",
 597 |      "metadata": {
 598 |       "slideshow": {
 599 |        "slide_type": "skip"
 600 |       }
 601 |      },
 602 |      "outputs": [],
 603 |      "prompt_number": 15
 604 |     },
 605 |     {
 606 |      "cell_type": "code",
 607 |      "collapsed": false,
 608 |      "input": [
 609 |       "pca = PCA(n_components=2)\n",
 610 |       "pca.fit(X, y)"
 611 |      ],
 612 |      "language": "python",
 613 |      "metadata": {
 614 |       "slideshow": {
 615 |        "slide_type": "fragment"
 616 |       }
 617 |      },
 618 |      "outputs": [
 619 |       {
 620 |        "metadata": {},
 621 |        "output_type": "pyout",
 622 |        "prompt_number": 16,
 623 |        "text": [
 624 |         "PCA(copy=True, n_components=2, whiten=False)"
 625 |        ]
 626 |       }
 627 |      ],
 628 |      "prompt_number": 16
 629 |     },
 630 |     {
 631 |      "cell_type": "heading",
 632 |      "level": 3,
 633 |      "metadata": {
 634 |       "slideshow": {
 635 |        "slide_type": "slide"
 636 |       }
 637 |      },
 638 |      "source": [
 639 |       "Feature Selection"
 640 |      ]
 641 |     },
 642 |     {
 643 |      "cell_type": "code",
 644 |      "collapsed": false,
 645 |      "input": [
 646 |       "from sklearn.feature_selection import SelectKBest\n",
 647 |       "from sklearn.metrics import matthews_corrcoef"
 648 |      ],
 649 |      "language": "python",
 650 |      "metadata": {
 651 |       "slideshow": {
 652 |        "slide_type": "skip"
 653 |       }
 654 |      },
 655 |      "outputs": [],
 656 |      "prompt_number": 17
 657 |     },
 658 |     {
 659 |      "cell_type": "code",
 660 |      "collapsed": false,
 661 |      "input": [
 662 |       "# Create Model\n",
 663 |       "kbest = SelectKBest(k = 1)\n",
 664 |       "# Fit Model\n",
 665 |       "kbest.fit(X, y)"
 666 |      ],
 667 |      "language": "python",
 668 |      "metadata": {},
 669 |      "outputs": [
 670 |       {
 671 |        "metadata": {},
 672 |        "output_type": "pyout",
 673 |        "prompt_number": 18,
 674 |        "text": [
 675 |         "SelectKBest(k=1, score_func=<function f_classif at 0x1139f3398>)"
 676 |        ]
 677 |       }
 678 |      ],
 679 |      "prompt_number": 18
 680 |     },
 681 |     {
 682 |      "cell_type": "heading",
 683 |      "level": 2,
 684 |      "metadata": {
 685 |       "slideshow": {
 686 |        "slide_type": "slide"
 687 |       }
 688 |      },
 689 |      "source": [
 690 |       "(Almost) everything is an estimator!"
 691 |      ]
 692 |     },
 693 |     {
 694 |      "cell_type": "code",
 695 |      "collapsed": false,
 696 |      "input": [
 697 |       "model = LogisticRegression()\n",
 698 |       "model.fit(X, y)\n",
 699 |       "\n",
 700 |       "kbest = SelectKBest(k = 1)\n",
 701 |       "kbest.fit(X, y)\n",
 702 |       "\n",
 703 |       "kmeans = KMeans(n_clusters = 2)\n",
 704 |       "kmeans.fit(X, y)\n",
 705 |       "\n",
 706 |       "pca = PCA(n_components=2)\n",
 707 |       "pca.fit(X, y)"
 708 |      ],
 709 |      "language": "python",
 710 |      "metadata": {},
 711 |      "outputs": [
 712 |       {
 713 |        "metadata": {},
 714 |        "output_type": "pyout",
 715 |        "prompt_number": 83,
 716 |        "text": [
 717 |         "PCA(copy=True, n_components=2, whiten=False)"
 718 |        ]
 719 |       }
 720 |      ],
 721 |      "prompt_number": 83
 722 |     },
 723 |     {
 724 |      "cell_type": "markdown",
 725 |      "metadata": {
 726 |       "slideshow": {
 727 |        "slide_type": "slide"
 728 |       }
 729 |      },
 730 |      "source": [
 731 |       "__What can we do with an estimator?__ \n",
 732 |       "\n",
 733 |       "Inference!"
 734 |      ]
 735 |     },
 736 |     {
 737 |      "cell_type": "code",
 738 |      "collapsed": false,
 739 |      "input": [
 740 |       "model = LogisticRegression()\n",
 741 |       "model.fit(X, y)\n",
 742 |       "print(model.coef_)"
 743 |      ],
 744 |      "language": "python",
 745 |      "metadata": {
 746 |       "slideshow": {
 747 |        "slide_type": "-"
 748 |       }
 749 |      },
 750 |      "outputs": [
 751 |       {
 752 |        "output_type": "stream",
 753 |        "stream": "stdout",
 754 |        "text": [
 755 |         "[[-0.40731745 -1.46092371  2.24004724  1.00841492]]\n"
 756 |        ]
 757 |       }
 758 |      ],
 759 |      "prompt_number": 19
 760 |     },
 761 |     {
 762 |      "cell_type": "code",
 763 |      "collapsed": false,
 764 |      "input": [
 765 |       "kmeans = KMeans(n_clusters = 2)\n",
 766 |       "kmeans.fit(X)\n",
 767 |       "print(kmeans.cluster_centers_)"
 768 |      ],
 769 |      "language": "python",
 770 |      "metadata": {
 771 |       "slideshow": {
 772 |        "slide_type": "fragment"
 773 |       }
 774 |      },
 775 |      "outputs": [
 776 |       {
 777 |        "output_type": "stream",
 778 |        "stream": "stdout",
 779 |        "text": [
 780 |         "[[ 5.936  2.77   4.26   1.326]\n",
 781 |         " [ 5.006  3.418  1.464  0.244]]\n"
 782 |        ]
 783 |       }
 784 |      ],
 785 |      "prompt_number": 20
 786 |     },
 787 |     {
 788 |      "cell_type": "code",
 789 |      "collapsed": false,
 790 |      "input": [
 791 |       "pca = PCA(n_components=2)\n",
 792 |       "pca.fit(X, y)\n",
 793 |       "print(pca.explained_variance_)"
 794 |      ],
 795 |      "language": "python",
 796 |      "metadata": {
 797 |       "slideshow": {
 798 |        "slide_type": "fragment"
 799 |       }
 800 |      },
 801 |      "outputs": [
 802 |       {
 803 |        "output_type": "stream",
 804 |        "stream": "stdout",
 805 |        "text": [
 806 |         "[ 2.73946394  0.22599044]\n"
 807 |        ]
 808 |       }
 809 |      ],
 810 |      "prompt_number": 21
 811 |     },
 812 |     {
 813 |      "cell_type": "code",
 814 |      "collapsed": false,
 815 |      "input": [
 816 |       "kbest = SelectKBest(k = 1)\n",
 817 |       "kbest.fit(X, y)\n",
 818 |       "print(kbest.get_support())"
 819 |      ],
 820 |      "language": "python",
 821 |      "metadata": {
 822 |       "slideshow": {
 823 |        "slide_type": "fragment"
 824 |       }
 825 |      },
 826 |      "outputs": [
 827 |       {
 828 |        "output_type": "stream",
 829 |        "stream": "stdout",
 830 |        "text": [
 831 |         "[False False  True False]\n"
 832 |        ]
 833 |       }
 834 |      ],
 835 |      "prompt_number": 22
 836 |     },
 837 |     {
 838 |      "cell_type": "markdown",
 839 |      "metadata": {
 840 |       "slideshow": {
 841 |        "slide_type": "slide"
 842 |       }
 843 |      },
 844 |      "source": [
 845 |       "__Is that it?__"
 846 |      ]
 847 |     },
 848 |     {
 849 |      "cell_type": "heading",
 850 |      "level": 2,
 851 |      "metadata": {
 852 |       "slideshow": {
 853 |        "slide_type": "slide"
 854 |       }
 855 |      },
 856 |      "source": [
 857 |       "Scikit-learn API: the Predictor"
 858 |      ]
 859 |     },
 860 |     {
 861 |      "cell_type": "code",
 862 |      "collapsed": false,
 863 |      "input": [
 864 |       "model = LogisticRegression()\n",
 865 |       "model.fit(X, y)\n",
 866 |       "\n",
 867 |       "X_test = [[ 5.006,  3.418,  1.464,  0.244], [ 5.936,  2.77 ,  4.26 ,  1.326]]\n",
 868 |       "\n",
 869 |       "model.predict(X_test)"
 870 |      ],
 871 |      "language": "python",
 872 |      "metadata": {},
 873 |      "outputs": [
 874 |       {
 875 |        "metadata": {},
 876 |        "output_type": "pyout",
 877 |        "prompt_number": 23,
 878 |        "text": [
 879 |         "array([0, 1])"
 880 |        ]
 881 |       }
 882 |      ],
 883 |      "prompt_number": 23
 884 |     },
 885 |     {
 886 |      "cell_type": "code",
 887 |      "collapsed": false,
 888 |      "input": [
 889 |       "print(model.predict_proba(X_test))"
 890 |      ],
 891 |      "language": "python",
 892 |      "metadata": {
 893 |       "slideshow": {
 894 |        "slide_type": "fragment"
 895 |       }
 896 |      },
 897 |      "outputs": [
 898 |       {
 899 |        "output_type": "stream",
 900 |        "stream": "stdout",
 901 |        "text": [
 902 |         "[[ 0.97741151  0.02258849]\n",
 903 |         " [ 0.01544837  0.98455163]]\n"
 904 |        ]
 905 |       }
 906 |      ],
 907 |      "prompt_number": 24
 908 |     },
 909 |     {
 910 |      "cell_type": "heading",
 911 |      "level": 2,
 912 |      "metadata": {
 913 |       "slideshow": {
 914 |        "slide_type": "slide"
 915 |       }
 916 |      },
 917 |      "source": [
 918 |       "Scikit-learn API: the Transformer"
 919 |      ]
 920 |     },
 921 |     {
 922 |      "cell_type": "code",
 923 |      "collapsed": false,
 924 |      "input": [
 925 |       "pca = PCA(n_components=2)\n",
 926 |       "pca.fit(X)\n",
 927 |       "\n",
 928 |       "print(pca.transform(X)[0:5,:])"
 929 |      ],
 930 |      "language": "python",
 931 |      "metadata": {},
 932 |      "outputs": [
 933 |       {
 934 |        "output_type": "stream",
 935 |        "stream": "stdout",
 936 |        "text": [
 937 |         "[[-1.65441341 -0.20660719]\n",
 938 |         " [-1.63509488  0.2988347 ]\n",
 939 |         " [-1.82037547  0.27141696]\n",
 940 |         " [-1.66207305  0.43021683]\n",
 941 |         " [-1.70358916 -0.21574051]]\n"
 942 |        ]
 943 |       }
 944 |      ],
 945 |      "prompt_number": 25
 946 |     },
 947 |     {
 948 |      "cell_type": "markdown",
 949 |      "metadata": {
 950 |       "slideshow": {
 951 |        "slide_type": "fragment"
 952 |       }
 953 |      },
 954 |      "source": [
 955 |       "__fit_transform__ is also available (and is sometimes faster)."
 956 |      ]
 957 |     },
 958 |     {
 959 |      "cell_type": "code",
 960 |      "collapsed": false,
 961 |      "input": [
 962 |       "pca = PCA(n_components=2)\n",
 963 |       "print(pca.fit_transform(X)[0:5,:])"
 964 |      ],
 965 |      "language": "python",
 966 |      "metadata": {
 967 |       "slideshow": {
 968 |        "slide_type": "fragment"
 969 |       }
 970 |      },
 971 |      "outputs": [
 972 |       {
 973 |        "output_type": "stream",
 974 |        "stream": "stdout",
 975 |        "text": [
 976 |         "[[-1.65441341 -0.20660719]\n",
 977 |         " [-1.63509488  0.2988347 ]\n",
 978 |         " [-1.82037547  0.27141696]\n",
 979 |         " [-1.66207305  0.43021683]\n",
 980 |         " [-1.70358916 -0.21574051]]\n"
 981 |        ]
 982 |       }
 983 |      ],
 984 |      "prompt_number": 54
 985 |     },
 986 |     {
 987 |      "cell_type": "code",
 988 |      "collapsed": false,
 989 |      "input": [
 990 |       "kbest = SelectKBest(k = 1)\n",
 991 |       "kbest.fit(X, y)\n",
 992 |       "\n",
 993 |       "print(kbest.transform(X)[0:5,:])"
 994 |      ],
 995 |      "language": "python",
 996 |      "metadata": {
 997 |       "slideshow": {
 998 |        "slide_type": "slide"
 999 |       }
1000 |      },
1001 |      "outputs": [
1002 |       {
1003 |        "output_type": "stream",
1004 |        "stream": "stdout",
1005 |        "text": [
1006 |         "[[ 1.4]\n",
1007 |         " [ 1.4]\n",
1008 |         " [ 1.3]\n",
1009 |         " [ 1.5]\n",
1010 |         " [ 1.4]]\n"
1011 |        ]
1012 |       }
1013 |      ],
1014 |      "prompt_number": 26
1015 |     },
1016 |     {
1017 |      "cell_type": "heading",
1018 |      "level": 2,
1019 |      "metadata": {
1020 |       "slideshow": {
1021 |        "slide_type": "slide"
1022 |       }
1023 |      },
1024 |      "source": [
1025 |       "Scikit-learn API: the Model"
1026 |      ]
1027 |     },
1028 |     {
1029 |      "cell_type": "code",
1030 |      "collapsed": false,
1031 |      "input": [
1032 |       "from sklearn.cross_validation import KFold\n",
1033 |       "from numpy import arange\n",
1034 |       "from random import shuffle\n",
1035 |       "from sklearn.dummy import DummyClassifier"
1036 |      ],
1037 |      "language": "python",
1038 |      "metadata": {
1039 |       "slideshow": {
1040 |        "slide_type": "skip"
1041 |       }
1042 |      },
1043 |      "outputs": [],
1044 |      "prompt_number": 27
1045 |     },
1046 |     {
1047 |      "cell_type": "code",
1048 |      "collapsed": false,
1049 |      "input": [
1050 |       "model = DummyClassifier()\n",
1051 |       "model.fit(X, y)\n",
1052 |       "\n",
1053 |       "model.score(X, y)"
1054 |      ],
1055 |      "language": "python",
1056 |      "metadata": {},
1057 |      "outputs": [
1058 |       {
1059 |        "metadata": {},
1060 |        "output_type": "pyout",
1061 |        "prompt_number": 86,
1062 |        "text": [
1063 |         "0.48999999999999999"
1064 |        ]
1065 |       }
1066 |      ],
1067 |      "prompt_number": 86
1068 |     },
1069 |     {
1070 |      "cell_type": "heading",
1071 |      "level": 1,
1072 |      "metadata": {
1073 |       "slideshow": {
1074 |        "slide_type": "slide"
1075 |       }
1076 |      },
1077 |      "source": [
1078 |       "Building Pipelines"
1079 |      ]
1080 |     },
1081 |     {
1082 |      "cell_type": "code",
1083 |      "collapsed": false,
1084 |      "input": [
1085 |       "from sklearn.pipeline import Pipeline"
1086 |      ],
1087 |      "language": "python",
1088 |      "metadata": {
1089 |       "slideshow": {
1090 |        "slide_type": "skip"
1091 |       }
1092 |      },
1093 |      "outputs": [],
1094 |      "prompt_number": 87
1095 |     },
1096 |     {
1097 |      "cell_type": "code",
1098 |      "collapsed": false,
1099 |      "input": [
1100 |       "pipe = Pipeline([\n",
1101 |       "          (\"select\", SelectKBest(k = 3)),\n",
1102 |       "          (\"pca\", PCA(n_components = 1)),\n",
1103 |       "          (\"classify\", LogisticRegression())\n",
1104 |       "          ])\n",
1105 |       "\n",
1106 |       "pipe.fit(X, y)\n",
1107 |       "\n",
1108 |       "pipe.predict(X)"
1109 |      ],
1110 |      "language": "python",
1111 |      "metadata": {},
1112 |      "outputs": [
1113 |       {
1114 |        "metadata": {},
1115 |        "output_type": "pyout",
1116 |        "prompt_number": 55,
1117 |        "text": [
1118 |         "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1119 |         "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1120 |         "       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1121 |         "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1122 |         "       1, 1, 1, 1, 1, 1, 1, 1])"
1123 |        ]
1124 |       }
1125 |      ],
1126 |      "prompt_number": 55
1127 |     },
1128 |     {
1129 |      "cell_type": "markdown",
1130 |      "metadata": {},
1131 |      "source": [
1132 |       "Intermediate steps of the pipeline must be both __Estimators__ and __Transformers__: each one implements `fit` and `transform`.\n",
1133 |       "\n",
1134 |       "The final step needs only to be an __Estimator__: it must implement `fit`."
1135 |      ]
1136 |     },
1137 |     {
1138 |      "cell_type": "heading",
1139 |      "level": 2,
1140 |      "metadata": {
1141 |       "slideshow": {
1142 |        "slide_type": "slide"
1143 |       }
1144 |      },
1145 |      "source": [
1146 |       "Text Pipeline"
1147 |      ]
1148 |     },
1149 |     {
1150 |      "cell_type": "code",
1151 |      "collapsed": false,
1152 |      "input": [
1153 |       "from sklearn.datasets import fetch_20newsgroups\n",
1154 |       "from sklearn.feature_extraction.text import CountVectorizer\n",
1155 |       "from sklearn.feature_extraction.text import TfidfTransformer\n",
1156 |       "from sklearn.linear_model import SGDClassifier\n"
1157 |      ],
1158 |      "language": "python",
1159 |      "metadata": {
1160 |       "slideshow": {
1161 |        "slide_type": "skip"
1162 |       }
1163 |      },
1164 |      "outputs": [],
1165 |      "prompt_number": 78
1166 |     },
1167 |     {
1168 |      "cell_type": "code",
1169 |      "collapsed": false,
1170 |      "input": [
1171 |       "news = fetch_20newsgroups()\n",
1172 |       "data = news.data\n",
1173 |       "category = news.target"
1174 |      ],
1175 |      "language": "python",
1176 |      "metadata": {},
1177 |      "outputs": [],
1178 |      "prompt_number": 71
1179 |     },
1180 |     {
1181 |      "cell_type": "code",
1182 |      "collapsed": false,
1183 |      "input": [
1184 |       "len(data)"
1185 |      ],
1186 |      "language": "python",
1187 |      "metadata": {},
1188 |      "outputs": [
1189 |       {
1190 |        "metadata": {},
1191 |        "output_type": "pyout",
1192 |        "prompt_number": 72,
1193 |        "text": [
1194 |         "11314"
1195 |        ]
1196 |       }
1197 |      ],
1198 |      "prompt_number": 72
1199 |     },
1200 |     {
1201 |      "cell_type": "code",
1202 |      "collapsed": false,
1203 |      "input": [
1204 |       "print \"  \".join(news.target_names)"
1205 |      ],
1206 |      "language": "python",
1207 |      "metadata": {
1208 |       "slideshow": {
1209 |        "slide_type": "slide"
1210 |       }
1211 |      },
1212 |      "outputs": [
1213 |       {
1214 |        "output_type": "stream",
1215 |        "stream": "stdout",
1216 |        "text": [
1217 |         "alt.atheism  comp.graphics  comp.os.ms-windows.misc  comp.sys.ibm.pc.hardware  comp.sys.mac.hardware  comp.windows.x  misc.forsale  rec.autos  rec.motorcycles  rec.sport.baseball  rec.sport.hockey  sci.crypt  sci.electronics  sci.med  sci.space  soc.religion.christian  talk.politics.guns  talk.politics.mideast  talk.politics.misc  talk.religion.misc\n"
1218 |        ]
1219 |       }
1220 |      ],
1221 |      "prompt_number": 92
1222 |     },
1223 |     {
1224 |      "cell_type": "code",
1225 |      "collapsed": false,
1226 |      "input": [
1227 |       "print data[8]"
1228 |      ],
1229 |      "language": "python",
1230 |      "metadata": {},
1231 |      "outputs": [
1232 |       {
1233 |        "output_type": "stream",
1234 |        "stream": "stdout",
1235 |        "text": [
1236 |         "From: holmes7000@iscsvax.uni.edu\n",
1237 |         "Subject: WIn 3.0 ICON HELP PLEASE!\n",
1238 |         "Organization: University of Northern Iowa\n",
1239 |         "Lines: 10\n",
1240 |         "\n",
1241 |         "I have win 3.0 and downloaded several icons and BMP's but I can't figure out\n",
1242 |         "how to change the \"wallpaper\" or use the icons.  Any help would be appreciated.\n",
1243 |         "\n",
1244 |         "\n",
1245 |         "Thanx,\n",
1246 |         "\n",
1247 |         "-Brando\n",
1248 |         "\n",
1249 |         "PS Please E-mail me\n",
1250 |         "\n",
1251 |         "\n"
1252 |        ]
1253 |       }
1254 |      ],
1255 |      "prompt_number": 99
1256 |     },
1257 |     {
1258 |      "cell_type": "code",
1259 |      "collapsed": false,
1260 |      "input": [
1261 |       "pipe = Pipeline([\n",
1262 |       "    ('vect', CountVectorizer(max_features = 100)),\n",
1263 |       "    ('tfidf', TfidfTransformer()),\n",
1264 |       "    ('clf', SGDClassifier()),\n",
1265 |       "])\n",
1266 |       "\n",
1267 |       "pipe.fit(data, category)"
1268 |      ],
1269 |      "language": "python",
1270 |      "metadata": {
1271 |       "slideshow": {
1272 |        "slide_type": "slide"
1273 |       }
1274 |      },
1275 |      "outputs": [
1276 |       {
1277 |        "metadata": {},
1278 |        "output_type": "pyout",
1279 |        "prompt_number": 100,
1280 |        "text": [
1281 |         "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n",
1282 |         "        charset_error=None, decode_error=u'strict',\n",
1283 |         "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
1284 |         "        lowercase=True, max_df=1.0, max_features=100, min_df=1,\n",
1285 |         "        ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5,\n",
1286 |         "       random_state=None, shuffle=False, verbose=0, warm_start=False))])"
1287 |        ]
1288 |       }
1289 |      ],
1290 |      "prompt_number": 100
1291 |     },
1292 |     {
1293 |      "cell_type": "heading",
1294 |      "level": 2,
1295 |      "metadata": {
1296 |       "slideshow": {
1297 |        "slide_type": "slide"
1298 |       }
1299 |      },
1300 |      "source": [
1301 |       "Pandas Pipelines!"
1302 |      ]
1303 |     },
1304 |     {
1305 |      "cell_type": "code",
1306 |      "collapsed": false,
1307 |      "input": [
1308 |       "import pandas as pd\n",
1309 |       "import numpy as np\n",
1310 |       "import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics\n",
1311 |       "from sklearn_pandas import DataFrameMapper, cross_val_score"
1312 |      ],
1313 |      "language": "python",
1314 |      "metadata": {
1315 |       "slideshow": {
1316 |        "slide_type": "skip"
1317 |       }
1318 |      },
1319 |      "outputs": [],
1320 |      "prompt_number": 107
1321 |     },
1322 |     {
1323 |      "cell_type": "code",
1324 |      "collapsed": false,
1325 |      "input": [
1326 |       "data = pd.DataFrame({\n",
1327 |       "    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],\n",
1328 |       "    'children': [4., 6, 3, 3, 2, 3, 5, 4],\n",
1329 |       "    'salary':   [90, 24, 44, 27, 32, 59, 36, 27]\n",
1330 |       "})"
1331 |      ],
1332 |      "language": "python",
1333 |      "metadata": {},
1334 |      "outputs": [],
1335 |      "prompt_number": 117
1336 |     },
1337 |     {
1338 |      "cell_type": "code",
1339 |      "collapsed": false,
1340 |      "input": [
1341 |       "mapper = DataFrameMapper([\n",
1342 |       "     ('pet', sklearn.preprocessing.LabelBinarizer()),\n",
1343 |       "     ('children', sklearn.preprocessing.StandardScaler()),\n",
1344 |       "     ('salary', None)\n",
1345 |       "])"
1346 |      ],
1347 |      "language": "python",
1348 |      "metadata": {
1349 |       "slideshow": {
1350 |        "slide_type": "fragment"
1351 |       }
1352 |      },
1353 |      "outputs": [],
1354 |      "prompt_number": 111
1355 |     },
1356 |     {
1357 |      "cell_type": "code",
1358 |      "collapsed": false,
1359 |      "input": [
1360 |       "mapper.fit_transform(data)"
1361 |      ],
1362 |      "language": "python",
1363 |      "metadata": {
1364 |       "slideshow": {
1365 |        "slide_type": "fragment"
1366 |       }
1367 |      },
1368 |      "outputs": [
1369 |       {
1370 |        "metadata": {},
1371 |        "output_type": "pyout",
1372 |        "prompt_number": 113,
1373 |        "text": [
1374 |         "array([[  1.        ,   0.        ,   0.        ,   0.20851441,  90.        ],\n",
1375 |         "       [  0.        ,   1.        ,   0.        ,   1.87662973,  24.        ],\n",
1376 |         "       [  0.        ,   1.        ,   0.        ,  -0.62554324,  44.        ],\n",
1377 |         "       [  0.        ,   0.        ,   1.        ,  -0.62554324,  27.        ],\n",
1378 |         "       [  1.        ,   0.        ,   0.        ,  -1.4596009 ,  32.        ],\n",
1379 |         "       [  0.        ,   1.        ,   0.        ,  -0.62554324,  59.        ],\n",
1380 |         "       [  1.        ,   0.        ,   0.        ,   1.04257207,  36.        ],\n",
1381 |         "       [  0.        ,   0.        ,   1.        ,   0.20851441,  27.        ]])"
1382 |        ]
1383 |       }
1384 |      ],
1385 |      "prompt_number": 113
1386 |     },
1387 |     {
1388 |      "cell_type": "code",
1389 |      "collapsed": false,
1390 |      "input": [
1391 |       "mapper = DataFrameMapper([\n",
1392 |       "     ('pet', sklearn.preprocessing.LabelBinarizer()),\n",
1393 |       "     ('children', sklearn.preprocessing.StandardScaler()),\n",
1394 |       "     ('salary', None)\n",
1395 |       "])\n",
1396 |       "\n",
1397 |       "pipe = Pipeline([\n",
1398 |       "    (\"mapper\", mapper),\n",
1399 |       "    (\"pca\", PCA(n_components=2))\n",
1400 |       "])\n",
1401 |       "pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!"
1402 |      ],
1403 |      "language": "python",
1404 |      "metadata": {
1405 |       "slideshow": {
1406 |        "slide_type": "slide"
1407 |       }
1408 |      },
1409 |      "outputs": [
1410 |       {
1411 |        "metadata": {},
1412 |        "output_type": "pyout",
1413 |        "prompt_number": 157,
1414 |        "text": [
1415 |         "array([[ -4.76269151e+01,   4.25991055e-01],\n",
1416 |         "       [  1.83856756e+01,   1.86178138e+00],\n",
1417 |         "       [ -1.62747544e+00,  -5.06199939e-01],\n",
1418 |         "       [  1.53796381e+01,  -8.10331853e-01],\n",
1419 |         "       [  1.03575109e+01,  -1.52528125e+00],\n",
1420 |         "       [ -1.66260441e+01,  -4.27845667e-01],\n",
1421 |         "       [  6.37295205e+00,   9.68066902e-01],\n",
1422 |         "       [  1.53846579e+01,   1.38193738e-02]])"
1423 |        ]
1424 |       }
1425 |      ],
1426 |      "prompt_number": 157
1427 |     },
1428 |     {
1429 |      "cell_type": "markdown",
1430 |      "metadata": {
1431 |       "slideshow": {
1432 |        "slide_type": "slide"
1433 |       }
1434 |      },
1435 |      "source": [
1436 |       "Pandas pipelines require the [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) module by [@paulgb](http://www.twitter.com/paulgb)."
1437 |      ]
1438 |     },
1439 |     {
1440 |      "cell_type": "markdown",
1441 |      "metadata": {
1442 |       "slideshow": {
1443 |        "slide_type": "fragment"
1444 |       }
1445 |      },
1446 |      "source": [
1447 |       "Also by Paul:\n",
1448 |       "    \n",
1449 |       "[![](facebook_map.png)](https://www.facebook.com/note.php?note_id=469716398919)"
1450 |      ]
1451 |     },
1452 |     {
1453 |      "cell_type": "heading",
1454 |      "level": 1,
1455 |      "metadata": {
1456 |       "slideshow": {
1457 |        "slide_type": "slide"
1458 |       }
1459 |      },
1460 |      "source": [
1461 |       "Model Evaluation and Selection"
1462 |      ]
1463 |     },
1464 |     {
1465 |      "cell_type": "code",
1466 |      "collapsed": false,
1467 |      "input": [
1468 |       "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n",
1469 |       "from sklearn import datasets\n",
1470 |       "from sklearn.ensemble import RandomForestClassifier"
1471 |      ],
1472 |      "language": "python",
1473 |      "metadata": {
1474 |       "slideshow": {
1475 |        "slide_type": "skip"
1476 |       }
1477 |      },
1478 |      "outputs": [],
1479 |      "prompt_number": 212
1480 |     },
1481 |     {
1482 |      "cell_type": "code",
1483 |      "collapsed": false,
1484 |      "input": [
1485 |       "# Create sample dataset\n",
1486 |       "X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)"
1487 |      ],
1488 |      "language": "python",
1489 |      "metadata": {},
1490 |      "outputs": [],
1491 |      "prompt_number": 137
1492 |     },
1493 |     {
1494 |      "cell_type": "code",
1495 |      "collapsed": false,
1496 |      "input": [
1497 |       "# Pipeline for Feature Selection to Random Forest\n",
1498 |       "pipe = Pipeline([\n",
1499 |       "  (\"select\", SelectKBest()),\n",
1500 |       "  (\"classify\", RandomForestClassifier())\n",
1501 |       "])"
1502 |      ],
1503 |      "language": "python",
1504 |      "metadata": {
1505 |       "slideshow": {
1506 |        "slide_type": "fragment"
1507 |       }
1508 |      },
1509 |      "outputs": [],
1510 |      "prompt_number": 162
1511 |     },
1512 |     {
1513 |      "cell_type": "code",
1514 |      "collapsed": false,
1515 |      "input": [
1516 |       "# Define parameter grid\n",
1517 |       "param_grid = {\n",
1518 |       "  \"select__k\" : [1, 6, 20, 40],\n",
1519 |       "  \"classify__n_estimators\" : [1, 10, 100],\n",
1520 |       "  \n",
1521 |       "}\n",
1522 |       "gs = GridSearchCV(pipe, param_grid)"
1523 |      ],
1524 |      "language": "python",
1525 |      "metadata": {
1526 |       "slideshow": {
1527 |        "slide_type": "fragment"
1528 |       }
1529 |      },
1530 |      "outputs": [],
1531 |      "prompt_number": 175
1532 |     },
1533 |     {
1534 |      "cell_type": "code",
1535 |      "collapsed": false,
1536 |      "input": [
1537 |       "# Search over grid\n",
1538 |       "gs.fit(X, y)\n",
1539 |       "\n",
1540 |       "gs.best_params_"
1541 |      ],
1542 |      "language": "python",
1543 |      "metadata": {
1544 |       "slideshow": {
1545 |        "slide_type": "fragment"
1546 |       }
1547 |      },
1548 |      "outputs": [
1549 |       {
1550 |        "metadata": {},
1551 |        "output_type": "pyout",
1552 |        "prompt_number": 183,
1553 |        "text": [
1554 |         "{'classify__n_estimators': 10, 'select__k': 6}"
1555 |        ]
1556 |       }
1557 |      ],
1558 |      "prompt_number": 183
1559 |     },
1560 |     {
1561 |      "cell_type": "code",
1562 |      "collapsed": false,
1563 |      "input": [
1564 |       "print gs.best_estimator_.predict(X.mean(axis = 0).reshape(1, -1))"
1565 |      ],
1566 |      "language": "python",
1567 |      "metadata": {
1568 |       "slideshow": {
1569 |        "slide_type": "fragment"
1570 |       }
1571 |      },
1572 |      "outputs": [
1573 |       {
1574 |        "output_type": "stream",
1575 |        "stream": "stdout",
1576 |        "text": [
1577 |         "[1]\n"
1578 |        ]
1579 |       }
1580 |      ],
1581 |      "prompt_number": 192
1582 |     },
1583 |     {
1584 |      "cell_type": "heading",
1585 |      "level": 2,
1586 |      "metadata": {
1587 |       "slideshow": {
1588 |        "slide_type": "slide"
1589 |       }
1590 |      },
1591 |      "source": [
1592 |       "Curse of Dimensionality"
1593 |      ]
1594 |     },
1595 |     {
1596 |      "cell_type": "markdown",
1597 |      "metadata": {},
1598 |      "source": [
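1599 |       "The search space grows exponentially with the number of parameters: the grid above already has 4 × 3 = 12 combinations, each fit once per cross-validation fold."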
1600 |      ]
1601 |     },
1602 |     {
1603 |      "cell_type": "code",
1604 |      "collapsed": false,
1605 |      "input": [
1606 |       "gs.grid_scores_"
1607 |      ],
1608 |      "language": "python",
1609 |      "metadata": {},
1610 |      "outputs": [
1611 |       {
1612 |        "metadata": {},
1613 |        "output_type": "pyout",
1614 |        "prompt_number": 185,
1615 |        "text": [
1616 |         "[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},\n",
1617 |         " mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},\n",
1618 |         " mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},\n",
1619 |         " mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},\n",
1620 |         " mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},\n",
1621 |         " mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},\n",
1622 |         " mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},\n",
1623 |         " mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},\n",
1624 |         " mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},\n",
1625 |         " mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},\n",
1626 |         " mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},\n",
1627 |         " mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]"
1628 |        ]
1629 |       }
1630 |      ],
1631 |      "prompt_number": 185
1632 |     },
1633 |     {
1634 |      "cell_type": "heading",
1635 |      "level": 2,
1636 |      "metadata": {
1637 |       "slideshow": {
1638 |        "slide_type": "slide"
1639 |       }
1640 |      },
1641 |      "source": [
1642 |       "Curse of Dimensionality: Parallelization "
1643 |      ]
1644 |     },
1645 |     {
1646 |      "cell_type": "markdown",
1647 |      "metadata": {},
1648 |      "source": [
1649 |       "__GridSearchCV__ on 1 core:"
1650 |      ]
1651 |     },
1652 |     {
1653 |      "cell_type": "code",
1654 |      "collapsed": false,
1655 |      "input": [
1656 |       "param_grid = {\n",
1657 |       "  \"select__k\" : [1, 5, 10, 15, 20, 25, 30, 35, 40],\n",
1658 |       "  \"classify__n_estimators\" : [1, 5, 10, 25, 50, 75, 100],\n",
1659 |       "  \n",
1660 |       "}\n",
1661 |       "gs = GridSearchCV(pipe, param_grid, n_jobs = 1)\n",
1662 |       "%timeit gs.fit(X, y)\n",
1663 |       "print"
1664 |      ],
1665 |      "language": "python",
1666 |      "metadata": {},
1667 |      "outputs": [
1668 |       {
1669 |        "output_type": "stream",
1670 |        "stream": "stdout",
1671 |        "text": [
1672 |         "1 loops, best of 3: 6.31 s per loop\n",
1673 |         "\n"
1674 |        ]
1675 |       }
1676 |      ],
1677 |      "prompt_number": 207
1678 |     },
1679 |     {
1680 |      "cell_type": "markdown",
1681 |      "metadata": {},
1682 |      "source": [
1683 |       "__GridSearchCV__ on 7 cores:"
1684 |      ]
1685 |     },
1686 |     {
1687 |      "cell_type": "code",
1688 |      "collapsed": false,
1689 |      "input": [
1690 |       "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n",
1691 |       "%timeit gs.fit(X, y)\n",
1692 |       "print"
1693 |      ],
1694 |      "language": "python",
1695 |      "metadata": {},
1696 |      "outputs": [
1697 |       {
1698 |        "output_type": "stream",
1699 |        "stream": "stdout",
1700 |        "text": [
1701 |         "1 loops, best of 3: 1.81 s per loop\n",
1702 |         "\n"
1703 |        ]
1704 |       }
1705 |      ],
1706 |      "prompt_number": 208
1707 |     },
1708 |     {
1709 |      "cell_type": "heading",
1710 |      "level": 2,
1711 |      "metadata": {
1712 |       "slideshow": {
1713 |        "slide_type": "slide"
1714 |       }
1715 |      },
1716 |      "source": [
1717 |       "Curse of Dimensionality: Randomization"
1718 |      ]
1719 |     },
1720 |     {
1721 |      "cell_type": "markdown",
1722 |      "metadata": {},
1723 |      "source": [
1724 |       "An exhaustive __GridSearchCV__ over a large grid can be very slow:"
1725 |      ]
1726 |     },
1727 |     {
1728 |      "cell_type": "code",
1729 |      "collapsed": false,
1730 |      "input": [
1731 |       "param_grid = {\n",
1732 |       "  \"select__k\" : range(1, 40),\n",
1733 |       "  \"classify__n_estimators\" : range(1, 100), \n",
1734 |       "}"
1735 |      ],
1736 |      "language": "python",
1737 |      "metadata": {},
1738 |      "outputs": [],
1739 |      "prompt_number": 220
1740 |     },
1741 |     {
1742 |      "cell_type": "code",
1743 |      "collapsed": false,
1744 |      "input": [
1745 |       "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n",
1746 |       "gs.fit(X, y)\n",
1747 |       "print \"Best CV score\", gs.best_score_\n",
1748 |       "print gs.best_params_"
1749 |      ],
1750 |      "language": "python",
1751 |      "metadata": {},
1752 |      "outputs": [
1753 |       {
1754 |        "output_type": "stream",
1755 |        "stream": "stdout",
1756 |        "text": [
1757 |         "0.924\n",
1758 |         "{'classify__n_estimators': 59, 'select__k': 9}\n"
1759 |        ]
1760 |       }
1761 |      ],
1762 |      "prompt_number": 221
1763 |     },
1764 |     {
1765 |      "cell_type": "markdown",
1766 |      "metadata": {},
1767 |      "source": [
1768 |       "We can instead randomly sample from the parameter space with __RandomizedSearchCV__:"
1769 |      ]
1770 |     },
1771 |     {
1772 |      "cell_type": "code",
1773 |      "collapsed": false,
1774 |      "input": [
1775 |       "gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)\n",
1776 |       "gs.fit(X, y)\n",
1777 |       "print \"Best CV score\", gs.best_score_\n",
1778 |       "print gs.best_params_"
1779 |      ],
1780 |      "language": "python",
1781 |      "metadata": {},
1782 |      "outputs": [
1783 |       {
1784 |        "output_type": "stream",
1785 |        "stream": "stdout",
1786 |        "text": [
1787 |         "0.894\n",
1788 |         "{'classify__n_estimators': 58, 'select__k': 7}\n"
1789 |        ]
1790 |       }
1791 |      ],
1792 |      "prompt_number": 229
1793 |     },
1794 |     {
1795 |      "cell_type": "heading",
1796 |      "level": 1,
1797 |      "metadata": {
1798 |       "slideshow": {
1799 |        "slide_type": "slide"
1800 |       }
1801 |      },
1802 |      "source": [
1803 |       "Conclusions"
1804 |      ]
1805 |     },
1806 |     {
1807 |      "cell_type": "markdown",
1808 |      "metadata": {
1809 |       "slideshow": {
1810 |        "slide_type": "-"
1811 |       }
1812 |      },
1813 |      "source": [
1814 |       "* Scikit-learn has an elegant API and is built in a beautiful language."
1815 |      ]
1816 |     },
1817 |     {
1818 |      "cell_type": "markdown",
1819 |      "metadata": {
1820 |       "slideshow": {
1821 |        "slide_type": "fragment"
1822 |       }
1823 |      },
1824 |      "source": [
1825 |       "* Pipelines make it easy to run complex chains of operations.\n",
1826 |       "    * Because every step is refit on each training fold, this helps ensure correct cross-validation (see _Elements of Statistical Learning_ 7.10.2)."
1827 |      ]
1828 |     },
1829 |     {
1830 |      "cell_type": "markdown",
1831 |      "metadata": {
1832 |       "slideshow": {
1833 |        "slide_type": "fragment"
1834 |       }
1835 |      },
1836 |      "source": [
1837 |       "* Pipelines combined with grid search permit easy model selection."
1838 |      ]
1839 |     }
1840 |    ],
1841 |    "metadata": {}
1842 |   }
1843 |  ]
1844 | }
1845 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2014 Timothy Hopper
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Intro to Scikit-Learn
 2 | 
 3 | * Research Triangle Analysts
 4 | * January 2014
 5 | * Presented by Tim Hopper
 6 | 
 7 | __Abstract:__ Scikit-learn is an actively developed Python package providing implementations of many machine learning algorithms (e.g. SVM, kNN, linear models, HMM, k-Means, spectral clustering). However, the benefits of Scikit-learn go well beyond carefully implemented learning algorithms. Being built in Python, it allows easy integration with countless other Python modules for tasks such as plotting, data munging, and application development. Its consistent API across algorithms allows for rapid experimentation with multiple learning methods. Also, Scikit-learn is well documented and provides lots of examples.
 8 | 
 9 | Instead of discussing particular machine learning algorithms provided by the package, I will focus on Scikit-learn and Python as a toolkit for solving data problems from start to finish. I will emphasize the Pipeline tool, which allows the user to chain together all the steps of a machine learning pipeline, including preprocessing, dimensionality reduction, feature selection, and model fitting.
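
The chaining idea described above can be sketched in plain Python. This is a toy illustration only, not Scikit-learn's actual implementation: the classes `MeanCenterer`, `SignClassifier`, and `ToyPipeline` are hypothetical names invented for this sketch, and the data is made up. It assumes only the convention that intermediate steps offer `fit` and `transform`, while the final step offers `fit` and `predict`.

```python
# Toy illustration of the chaining idea; NOT scikit-learn's implementation.
# All class names below are hypothetical and exist only for this sketch.


class MeanCenterer(object):
    """Transformer: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / float(len(X))
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier(object):
    """Final estimator: predicts 1 for positive inputs and 0 otherwise."""

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class ToyPipeline(object):
    """Chains (name, step) pairs so they can be used as a single model."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        # Fit each intermediate step, then pass its transformed output along.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The final step is only fit; it is never asked to transform.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit(); nothing is refit here.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = ToyPipeline([("center", MeanCenterer()), ("classify", SignClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0])               # the learned mean is 3.0
predictions = pipe.predict([2.0, 4.0, 3.0])  # centered to [-1.0, 1.0, 0.0]
print(predictions)
```

Because the mean is learned in `fit` and only reused in `predict`, new data never leaks into the preprocessing step. That separation is the main reason to wrap all steps in one object.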
10 | 
11 | ------
12 | 
13 | A (poor quality) video of this talk is [here](https://www.youtube.com/watch?v=2kx19t8bNMU).
14 | 
15 | ------
16 | 
17 | The slides for this presentation are generated from _Intro to Scikit-Learn.ipynb_.
18 | 
19 | To view the slides in a browser, run the following command:
20 |   
21 | ```
22 | ipython nbconvert Intro\ to\ Scikit-Learn.ipynb --to slides --post serve
23 | ```
24 | 


--------------------------------------------------------------------------------
/facebook_map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tdhopper/intro-to-scikit-learn/1545a1b8579c3e4725ce5ac7076a0e3c9544c231/facebook_map.png


--------------------------------------------------------------------------------