├── .gitignore ├── Intro to Scikit-Learn.ipynb ├── LICENSE ├── README.md └── facebook_map.png /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | __pycache__ 21 | 22 | # Installer logs 23 | pip-log.txt 24 | 25 | # Unit test / coverage reports 26 | .coverage 27 | .tox 28 | nosetests.xml 29 | 30 | # Translations 31 | *.mo 32 | 33 | # Mr Developer 34 | .mr.developer.cfg 35 | .project 36 | .pydevproject 37 | 38 | *.html 39 | -------------------------------------------------------------------------------- /Intro to Scikit-Learn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "celltoolbar": "Slideshow", 4 | "name": "" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": { 14 | "slideshow": { 15 | "slide_type": "slide" 16 | } 17 | }, 18 | "source": [ 19 | "# Introduction to Scikit-Learn \n", 20 | "\n", 21 | "View this IPython Notebook: \n", 22 | "\n", 23 | " j.mp/sklearn\n", 24 | " \n", 25 | "Everything is in a Github repo: \n", 26 | " \n", 27 | " github.com/tdhopper/\n", 28 | " \n", 29 | "View slides with:\n", 30 | "\n", 31 | " ipython nbconvert Intro\\ to\\ Scikit-Learn.ipynb --to slides --post serve" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": { 37 | "slideshow": { 38 | "slide_type": "slide" 39 | } 40 | }, 41 | "source": [ 42 | "
\n", 43 | "# Introduction to Scikit-Learn

\n", 44 | "\n", 45 | "__Research Triangle Analysts (1/16/13)__\n", 46 | "

\n", 47 | "\n", 48 | "Software Engineer at [parse.ly](http://www.parse.ly)
\n", 49 | "@tdhopper
\n", 50 | "tdhopper@gmail.com
\n", 51 | "
" 52 | ] 53 | }, 54 | { 55 | "cell_type": "heading", 56 | "level": 1, 57 | "metadata": { 58 | "slideshow": { 59 | "slide_type": "slide" 60 | } 61 | }, 62 | "source": [ 63 | "What is Scikit-Learn?" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "\"Machine Learning in Python\"" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": { 76 | "slideshow": { 77 | "slide_type": "fragment" 78 | } 79 | }, 80 | "source": [ 81 | "* Classification \n", 82 | "* Regression\n", 83 | "* Clustering\n", 84 | "* Dimensionality Reduction\n", 85 | "* Model Selection\n", 86 | "* Preprocessing\n", 87 | "\n", 88 | "See more: [http://scikit-learn.org/stable/user_guide.html](http://scikit-learn.org/stable/user_guide.html)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "heading", 93 | "level": 1, 94 | "metadata": { 95 | "slideshow": { 96 | "slide_type": "slide" 97 | } 98 | }, 99 | "source": [ 100 | "Why scikit-learn?" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 108 | "\n", 109 | "One: __Commitment to documentation and usability__\n", 110 | "\n", 111 | "> One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate). 
\n" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": { 117 | "slideshow": { 118 | "slide_type": "subslide" 119 | } 120 | }, 121 | "source": [ 122 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 123 | "\n", 124 | "Two: __Models are chosen and implemented by a dedicated team of experts__\n", 125 | "\n", 126 | "> Scikit-learn\u2019s stable of contributors includes experts in machine-learning and software development." 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": { 132 | "slideshow": { 133 | "slide_type": "subslide" 134 | } 135 | }, 136 | "source": [ 137 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 138 | "\n", 139 | "Three: __Covers most machine-learning tasks__\n", 140 | "\n", 141 | "> Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.).\n" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": { 147 | "slideshow": { 148 | "slide_type": "subslide" 149 | } 150 | }, 151 | "source": [ 152 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 153 | "\n", 154 | "Four: __Python and Pydata__\n", 155 | "\n", 156 | "> An impressive set of Python data tools (pydata) have emerged over the last few years.\n", 157 | "\n" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": { 163 | "slideshow": { 164 | "slide_type": "subslide" 165 | } 166 | }, 167 | "source": [ 168 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 169 | "\n", 170 | "Five: 
__Focus__\n", 171 | "\n", 172 | "> Scikit-learn is a machine-learning library. Its goal is to provide a set of common algorithms to Python users through a consistent interface.\n", 173 | "\n" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": { 179 | "slideshow": { 180 | "slide_type": "subslide" 181 | } 182 | }, 183 | "source": [ 184 | "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", 185 | "\n", 186 | "Six: __scikit-learn scales to most data problems__\n", 187 | "\n", 188 | "> Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.\n" 189 | ] 190 | }, 191 | { 192 | "cell_type": "heading", 193 | "level": 1, 194 | "metadata": { 195 | "slideshow": { 196 | "slide_type": "slide" 197 | } 198 | }, 199 | "source": [ 200 | "This talk is _not_...\n", 201 | "\n" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "...an introduction to Python" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": { 214 | "slideshow": { 215 | "slide_type": "fragment" 216 | } 217 | }, 218 | "source": [ 219 | "...an introduction to machine learning" 220 | ] 221 | }, 222 | { 223 | "cell_type": "heading", 224 | "level": 1, 225 | "metadata": { 226 | "slideshow": { 227 | "slide_type": "slide" 228 | } 229 | }, 230 | "source": [ 231 | "Example\n" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "collapsed": false, 237 | "input": [ 238 | "from sklearn import datasets\n", 239 | "from numpy import logical_or\n", 240 | "from sklearn.lda import LDA\n", 241 | "from sklearn.metrics import confusion_matrix" 242 | ], 243 | "language": "python", 244 | "metadata": { 245 | "slideshow": { 246 | "slide_type": "skip" 247 | } 248 | }, 249 | "outputs": [], 250 | "prompt_number": 4 251 | }, 252 | { 253 | "cell_type": 
"code", 254 | "collapsed": false, 255 | "input": [ 256 | "iris = datasets.load_iris()\n", 257 | "subset = logical_or(iris.target == 0, iris.target == 1)\n", 258 | "\n", 259 | "X = iris.data[subset]\n", 260 | "y = iris.target[subset]" 261 | ], 262 | "language": "python", 263 | "metadata": { 264 | "slideshow": { 265 | "slide_type": "-" 266 | } 267 | }, 268 | "outputs": [], 269 | "prompt_number": 5 270 | }, 271 | { 272 | "cell_type": "code", 273 | "collapsed": false, 274 | "input": [ 275 | "print X[0:5,:]" 276 | ], 277 | "language": "python", 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "output_type": "stream", 282 | "stream": "stdout", 283 | "text": [ 284 | "[[ 5.1 3.5 1.4 0.2]\n", 285 | " [ 4.9 3. 1.4 0.2]\n", 286 | " [ 4.7 3.2 1.3 0.2]\n", 287 | " [ 4.6 3.1 1.5 0.2]\n", 288 | " [ 5. 3.6 1.4 0.2]]\n" 289 | ] 290 | } 291 | ], 292 | "prompt_number": 6 293 | }, 294 | { 295 | "cell_type": "code", 296 | "collapsed": false, 297 | "input": [ 298 | "print y[0:5]" 299 | ], 300 | "language": "python", 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "output_type": "stream", 305 | "stream": "stdout", 306 | "text": [ 307 | "[0 0 0 0 0]\n" 308 | ] 309 | } 310 | ], 311 | "prompt_number": 7 312 | }, 313 | { 314 | "cell_type": "code", 315 | "collapsed": false, 316 | "input": [ 317 | "# Linear Discriminant Analysis\n", 318 | "lda = LDA(2)\n", 319 | "lda.fit(X, y)\n", 320 | "\n", 321 | "confusion_matrix(y, lda.predict(X))" 322 | ], 323 | "language": "python", 324 | "metadata": { 325 | "slideshow": { 326 | "slide_type": "subslide" 327 | } 328 | }, 329 | "outputs": [ 330 | { 331 | "metadata": {}, 332 | "output_type": "pyout", 333 | "prompt_number": 8, 334 | "text": [ 335 | "array([[50, 0],\n", 336 | " [ 0, 50]])" 337 | ] 338 | } 339 | ], 340 | "prompt_number": 8 341 | }, 342 | { 343 | "cell_type": "heading", 344 | "level": 1, 345 | "metadata": { 346 | "slideshow": { 347 | "slide_type": "slide" 348 | } 349 | }, 350 | "source": [ 351 | "The Scikit-learn API" 352 | ] 353 | 
}, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "The main \"interfaces\" in scikit-learn are (one class can implement multiple interfaces): \n", 359 | "\n", 360 | "__Estimator__: \n", 361 | "\n", 362 | " estimator = obj.fit(data, targets) " 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "__Predictor__: \n", 370 | "\n", 371 | " prediction = obj.predict(data) " 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "__Transformer__:\n", 379 | "\n", 380 | " new_data = obj.transform(data) \n", 381 | " " 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "__Model__:\n", 389 | "\n", 390 | " score = obj.score(data)" 391 | ] 392 | }, 393 | { 394 | "cell_type": "heading", 395 | "level": 2, 396 | "metadata": { 397 | "slideshow": { 398 | "slide_type": "slide" 399 | } 400 | }, 401 | "source": [ 402 | "Scikit-learn API: the Estimator" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "All estimators implement the __fit__ method:\n", 410 | "\n", 411 | " estimator.fit(X, y)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": { 417 | "slideshow": { 418 | "slide_type": "-" 419 | } 420 | }, 421 | "source": [ 422 | " \n", 423 | "> An estimator is an object that __fits a model__ based on some training data and is __capable of inferring__ some properties on new data." 
424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "collapsed": false, 429 | "input": [ 430 | "from sklearn.linear_model import LogisticRegression" 431 | ], 432 | "language": "python", 433 | "metadata": { 434 | "slideshow": { 435 | "slide_type": "skip" 436 | } 437 | }, 438 | "outputs": [], 439 | "prompt_number": 9 440 | }, 441 | { 442 | "cell_type": "code", 443 | "collapsed": false, 444 | "input": [ 445 | "# Create Model\n", 446 | "model = LogisticRegression()\n", 447 | "# Fit Model\n", 448 | "model.fit(X, y)" 449 | ], 450 | "language": "python", 451 | "metadata": { 452 | "slideshow": { 453 | "slide_type": "fragment" 454 | } 455 | }, 456 | "outputs": [ 457 | { 458 | "metadata": {}, 459 | "output_type": "pyout", 460 | "prompt_number": 10, 461 | "text": [ 462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 463 | " intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)" 464 | ] 465 | } 466 | ], 467 | "prompt_number": 10 468 | }, 469 | { 470 | "cell_type": "heading", 471 | "level": 2, 472 | "metadata": { 473 | "slideshow": { 474 | "slide_type": "slide" 475 | } 476 | }, 477 | "source": [ 478 | "(Almost) everything is an estimator" 479 | ] 480 | }, 481 | { 482 | "cell_type": "heading", 483 | "level": 3, 484 | "metadata": {}, 485 | "source": [ 486 | "Unsupervised Learning" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "collapsed": false, 492 | "input": [ 493 | "from sklearn.cluster import KMeans" 494 | ], 495 | "language": "python", 496 | "metadata": { 497 | "slideshow": { 498 | "slide_type": "skip" 499 | } 500 | }, 501 | "outputs": [], 502 | "prompt_number": 11 503 | }, 504 | { 505 | "cell_type": "code", 506 | "collapsed": false, 507 | "input": [ 508 | "# Create Model\n", 509 | "kmeans = KMeans(n_clusters = 2)\n", 510 | "# Fit Model\n", 511 | "kmeans.fit(X)" 512 | ], 513 | "language": "python", 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "metadata": {}, 518 | "output_type": "pyout", 519 | 
"prompt_number": 12, 520 | "text": [ 521 | "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n", 522 | " n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n", 523 | " verbose=0)" 524 | ] 525 | } 526 | ], 527 | "prompt_number": 12 528 | }, 529 | { 530 | "cell_type": "heading", 531 | "level": 3, 532 | "metadata": { 533 | "slideshow": { 534 | "slide_type": "slide" 535 | } 536 | }, 537 | "source": [ 538 | "Dimensionality Reduction" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "collapsed": false, 544 | "input": [ 545 | "from sklearn.decomposition import PCA" 546 | ], 547 | "language": "python", 548 | "metadata": { 549 | "slideshow": { 550 | "slide_type": "skip" 551 | } 552 | }, 553 | "outputs": [], 554 | "prompt_number": 13 555 | }, 556 | { 557 | "cell_type": "code", 558 | "collapsed": false, 559 | "input": [ 560 | "# Create Model \n", 561 | "pca = PCA(n_components=2)\n", 562 | "# Fit Model\n", 563 | "pca.fit(X)" 564 | ], 565 | "language": "python", 566 | "metadata": {}, 567 | "outputs": [ 568 | { 569 | "metadata": {}, 570 | "output_type": "pyout", 571 | "prompt_number": 14, 572 | "text": [ 573 | "PCA(copy=True, n_components=2, whiten=False)" 574 | ] 575 | } 576 | ], 577 | "prompt_number": 14 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": { 582 | "slideshow": { 583 | "slide_type": "fragment" 584 | } 585 | }, 586 | "source": [ 587 | "The __fit__ method takes a $y$ parameter even if it isn't needed (though $y$ is ignored). This is important later." 
588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "collapsed": false, 593 | "input": [ 594 | "from sklearn.decomposition import PCA" 595 | ], 596 | "language": "python", 597 | "metadata": { 598 | "slideshow": { 599 | "slide_type": "skip" 600 | } 601 | }, 602 | "outputs": [], 603 | "prompt_number": 15 604 | }, 605 | { 606 | "cell_type": "code", 607 | "collapsed": false, 608 | "input": [ 609 | "pca = PCA(n_components=2)\n", 610 | "pca.fit(X, y)" 611 | ], 612 | "language": "python", 613 | "metadata": { 614 | "slideshow": { 615 | "slide_type": "fragment" 616 | } 617 | }, 618 | "outputs": [ 619 | { 620 | "metadata": {}, 621 | "output_type": "pyout", 622 | "prompt_number": 16, 623 | "text": [ 624 | "PCA(copy=True, n_components=2, whiten=False)" 625 | ] 626 | } 627 | ], 628 | "prompt_number": 16 629 | }, 630 | { 631 | "cell_type": "heading", 632 | "level": 3, 633 | "metadata": { 634 | "slideshow": { 635 | "slide_type": "slide" 636 | } 637 | }, 638 | "source": [ 639 | "Feature Selection" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "collapsed": false, 645 | "input": [ 646 | "from sklearn.feature_selection import SelectKBest\n", 647 | "from sklearn.metrics import matthews_corrcoef" 648 | ], 649 | "language": "python", 650 | "metadata": { 651 | "slideshow": { 652 | "slide_type": "skip" 653 | } 654 | }, 655 | "outputs": [], 656 | "prompt_number": 17 657 | }, 658 | { 659 | "cell_type": "code", 660 | "collapsed": false, 661 | "input": [ 662 | "# Create Model\n", 663 | "kbest = SelectKBest(k = 3)\n", 664 | "# Fit Model\n", 665 | "kbest.fit(X, y)" 666 | ], 667 | "language": "python", 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "metadata": {}, 672 | "output_type": "pyout", 673 | "prompt_number": 18, 674 | "text": [ 675 | "SelectKBest(k=1, score_func=)" 676 | ] 677 | } 678 | ], 679 | "prompt_number": 18 680 | }, 681 | { 682 | "cell_type": "heading", 683 | "level": 2, 684 | "metadata": { 685 | "slideshow": { 686 | "slide_type": "slide" 687 | } 688 | 
}, 689 | "source": [ 690 | "(Almost) everything is an estimator!" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "collapsed": false, 696 | "input": [ 697 | "model = LogisticRegression()\n", 698 | "model.fit(X, y)\n", 699 | "\n", 700 | "kbest = SelectKBest(k = 1)\n", 701 | "kbest.fit(X, y)\n", 702 | "\n", 703 | "kmeans = KMeans(n_clusters = 2)\n", 704 | "kmeans.fit(X, y)\n", 705 | "\n", 706 | "pca = PCA(n_components=2)\n", 707 | "pca.fit(X, y)" 708 | ], 709 | "language": "python", 710 | "metadata": {}, 711 | "outputs": [ 712 | { 713 | "metadata": {}, 714 | "output_type": "pyout", 715 | "prompt_number": 83, 716 | "text": [ 717 | "PCA(copy=True, n_components=2, whiten=False)" 718 | ] 719 | } 720 | ], 721 | "prompt_number": 83 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": { 726 | "slideshow": { 727 | "slide_type": "slide" 728 | } 729 | }, 730 | "source": [ 731 | "__What can we do with an estimator?__ \n", 732 | "\n", 733 | "Inference!" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "collapsed": false, 739 | "input": [ 740 | "model = LogisticRegression()\n", 741 | "model.fit(X, y)\n", 742 | "print model.coef_" 743 | ], 744 | "language": "python", 745 | "metadata": { 746 | "slideshow": { 747 | "slide_type": "-" 748 | } 749 | }, 750 | "outputs": [ 751 | { 752 | "output_type": "stream", 753 | "stream": "stdout", 754 | "text": [ 755 | "[[-0.40731745 -1.46092371 2.24004724 1.00841492]]\n" 756 | ] 757 | } 758 | ], 759 | "prompt_number": 19 760 | }, 761 | { 762 | "cell_type": "code", 763 | "collapsed": false, 764 | "input": [ 765 | "kmeans = KMeans(n_clusters = 2)\n", 766 | "kmeans.fit(X)\n", 767 | "print kmeans.cluster_centers_" 768 | ], 769 | "language": "python", 770 | "metadata": { 771 | "slideshow": { 772 | "slide_type": "fragment" 773 | } 774 | }, 775 | "outputs": [ 776 | { 777 | "output_type": "stream", 778 | "stream": "stdout", 779 | "text": [ 780 | "[[ 5.936 2.77 4.26 1.326]\n", 781 | " [ 5.006 3.418 1.464 0.244]]\n" 782 | ] 
783 | } 784 | ], 785 | "prompt_number": 20 786 | }, 787 | { 788 | "cell_type": "code", 789 | "collapsed": false, 790 | "input": [ 791 | "pca = PCA(n_components=2)\n", 792 | "pca.fit(X, y)\n", 793 | "print pca.explained_variance_" 794 | ], 795 | "language": "python", 796 | "metadata": { 797 | "slideshow": { 798 | "slide_type": "fragment" 799 | } 800 | }, 801 | "outputs": [ 802 | { 803 | "output_type": "stream", 804 | "stream": "stdout", 805 | "text": [ 806 | "[ 2.73946394 0.22599044]\n" 807 | ] 808 | } 809 | ], 810 | "prompt_number": 21 811 | }, 812 | { 813 | "cell_type": "code", 814 | "collapsed": false, 815 | "input": [ 816 | "kbest = SelectKBest(k = 1)\n", 817 | "kbest.fit(X, y)\n", 818 | "print kbest.get_support()" 819 | ], 820 | "language": "python", 821 | "metadata": { 822 | "slideshow": { 823 | "slide_type": "fragment" 824 | } 825 | }, 826 | "outputs": [ 827 | { 828 | "output_type": "stream", 829 | "stream": "stdout", 830 | "text": [ 831 | "[False False True False]\n" 832 | ] 833 | } 834 | ], 835 | "prompt_number": 22 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": { 840 | "slideshow": { 841 | "slide_type": "slide" 842 | } 843 | }, 844 | "source": [ 845 | "__Is that it?__" 846 | ] 847 | }, 848 | { 849 | "cell_type": "heading", 850 | "level": 2, 851 | "metadata": { 852 | "slideshow": { 853 | "slide_type": "slide" 854 | } 855 | }, 856 | "source": [ 857 | "Scikit-learn API: the Predictor" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "collapsed": false, 863 | "input": [ 864 | "model = LogisticRegression()\n", 865 | "model.fit(X, y)\n", 866 | "\n", 867 | "X_test = [[ 5.006, 3.418, 1.464, 0.244], [ 5.936, 2.77 , 4.26 , 1.326]]\n", 868 | "\n", 869 | "model.predict(X_test)" 870 | ], 871 | "language": "python", 872 | "metadata": {}, 873 | "outputs": [ 874 | { 875 | "metadata": {}, 876 | "output_type": "pyout", 877 | "prompt_number": 23, 878 | "text": [ 879 | "array([0, 1])" 880 | ] 881 | } 882 | ], 883 | "prompt_number": 23 884 | }, 885 
| { 886 | "cell_type": "code", 887 | "collapsed": false, 888 | "input": [ 889 | "print model.predict_proba(X_test)" 890 | ], 891 | "language": "python", 892 | "metadata": { 893 | "slideshow": { 894 | "slide_type": "fragment" 895 | } 896 | }, 897 | "outputs": [ 898 | { 899 | "output_type": "stream", 900 | "stream": "stdout", 901 | "text": [ 902 | "[[ 0.97741151 0.02258849]\n", 903 | " [ 0.01544837 0.98455163]]\n" 904 | ] 905 | } 906 | ], 907 | "prompt_number": 24 908 | }, 909 | { 910 | "cell_type": "heading", 911 | "level": 2, 912 | "metadata": { 913 | "slideshow": { 914 | "slide_type": "slide" 915 | } 916 | }, 917 | "source": [ 918 | "Scikit-learn API: the Transformer" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "collapsed": false, 924 | "input": [ 925 | "pca = PCA(n_components=2)\n", 926 | "pca.fit(X)\n", 927 | "\n", 928 | "print pca.transform(X)[0:5,:]" 929 | ], 930 | "language": "python", 931 | "metadata": {}, 932 | "outputs": [ 933 | { 934 | "output_type": "stream", 935 | "stream": "stdout", 936 | "text": [ 937 | "[[-1.65441341 -0.20660719]\n", 938 | " [-1.63509488 0.2988347 ]\n", 939 | " [-1.82037547 0.27141696]\n", 940 | " [-1.66207305 0.43021683]\n", 941 | " [-1.70358916 -0.21574051]]\n" 942 | ] 943 | } 944 | ], 945 | "prompt_number": 25 946 | }, 947 | { 948 | "cell_type": "markdown", 949 | "metadata": { 950 | "slideshow": { 951 | "slide_type": "fragment" 952 | } 953 | }, 954 | "source": [ 955 | "__fit_transform__ is also available (and is sometimes faster)." 
956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "collapsed": false, 961 | "input": [ 962 | "pca = PCA(n_components=2)\n", 963 | "print pca.fit_transform(X)[0:5,:]" 964 | ], 965 | "language": "python", 966 | "metadata": { 967 | "slideshow": { 968 | "slide_type": "fragment" 969 | } 970 | }, 971 | "outputs": [ 972 | { 973 | "output_type": "stream", 974 | "stream": "stdout", 975 | "text": [ 976 | "[[-1.65441341 -0.20660719]\n", 977 | " [-1.63509488 0.2988347 ]\n", 978 | " [-1.82037547 0.27141696]\n", 979 | " [-1.66207305 0.43021683]\n", 980 | " [-1.70358916 -0.21574051]]\n" 981 | ] 982 | } 983 | ], 984 | "prompt_number": 54 985 | }, 986 | { 987 | "cell_type": "code", 988 | "collapsed": false, 989 | "input": [ 990 | "kbest = SelectKBest(k = 1)\n", 991 | "kbest.fit(X, y)\n", 992 | "\n", 993 | "print kbest.transform(X)[0:5,:]" 994 | ], 995 | "language": "python", 996 | "metadata": { 997 | "slideshow": { 998 | "slide_type": "slide" 999 | } 1000 | }, 1001 | "outputs": [ 1002 | { 1003 | "output_type": "stream", 1004 | "stream": "stdout", 1005 | "text": [ 1006 | "[[ 1.4]\n", 1007 | " [ 1.4]\n", 1008 | " [ 1.3]\n", 1009 | " [ 1.5]\n", 1010 | " [ 1.4]]\n" 1011 | ] 1012 | } 1013 | ], 1014 | "prompt_number": 26 1015 | }, 1016 | { 1017 | "cell_type": "heading", 1018 | "level": 2, 1019 | "metadata": { 1020 | "slideshow": { 1021 | "slide_type": "slide" 1022 | } 1023 | }, 1024 | "source": [ 1025 | "Scikit-learn API: the Model" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "collapsed": false, 1031 | "input": [ 1032 | "from sklearn.cross_validation import KFold\n", 1033 | "from numpy import arange\n", 1034 | "from random import shuffle\n", 1035 | "from sklearn.dummy import DummyClassifier" 1036 | ], 1037 | "language": "python", 1038 | "metadata": { 1039 | "slideshow": { 1040 | "slide_type": "skip" 1041 | } 1042 | }, 1043 | "outputs": [], 1044 | "prompt_number": 27 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "collapsed": false, 1049 | "input": [ 1050 
| "model = DummyClassifier()\n", 1051 | "model.fit(X, y)\n", 1052 | "\n", 1053 | "model.score(X, y)" 1054 | ], 1055 | "language": "python", 1056 | "metadata": {}, 1057 | "outputs": [ 1058 | { 1059 | "metadata": {}, 1060 | "output_type": "pyout", 1061 | "prompt_number": 86, 1062 | "text": [ 1063 | "0.48999999999999999" 1064 | ] 1065 | } 1066 | ], 1067 | "prompt_number": 86 1068 | }, 1069 | { 1070 | "cell_type": "heading", 1071 | "level": 1, 1072 | "metadata": { 1073 | "slideshow": { 1074 | "slide_type": "slide" 1075 | } 1076 | }, 1077 | "source": [ 1078 | "Building Pipelines" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "collapsed": false, 1084 | "input": [ 1085 | "from sklearn.pipeline import Pipeline" 1086 | ], 1087 | "language": "python", 1088 | "metadata": { 1089 | "slideshow": { 1090 | "slide_type": "skip" 1091 | } 1092 | }, 1093 | "outputs": [], 1094 | "prompt_number": 87 1095 | }, 1096 | { 1097 | "cell_type": "code", 1098 | "collapsed": false, 1099 | "input": [ 1100 | "pipe = Pipeline([\n", 1101 | " (\"select\", SelectKBest(k = 3)),\n", 1102 | " (\"pca\", PCA(n_components = 1)),\n", 1103 | " (\"classify\", LogisticRegression())\n", 1104 | " ])\n", 1105 | "\n", 1106 | "pipe.fit(X, y)\n", 1107 | "\n", 1108 | "pipe.predict(X)" 1109 | ], 1110 | "language": "python", 1111 | "metadata": {}, 1112 | "outputs": [ 1113 | { 1114 | "metadata": {}, 1115 | "output_type": "pyout", 1116 | "prompt_number": 55, 1117 | "text": [ 1118 | "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 1119 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 1120 | " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 1121 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 1122 | " 1, 1, 1, 1, 1, 1, 1, 1])" 1123 | ] 1124 | } 1125 | ], 1126 | "prompt_number": 55 1127 | }, 1128 | { 1129 | "cell_type": "markdown", 1130 | "metadata": {}, 1131 | "source": [ 1132 | "Intermediate 
steps of the pipeline must be both __Estimators__ and __Transformers__ (they implement __fit__ and __transform__).\n", 1133 | "\n", 1134 | "The final step needs only to be an __Estimator__." 1135 | ] 1136 | }, 1137 | { 1138 | "cell_type": "heading", 1139 | "level": 2, 1140 | "metadata": { 1141 | "slideshow": { 1142 | "slide_type": "slide" 1143 | } 1144 | }, 1145 | "source": [ 1146 | "Text Pipeline" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | "collapsed": false, 1152 | "input": [ 1153 | "from sklearn.datasets import fetch_20newsgroups\n", 1154 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1155 | "from sklearn.feature_extraction.text import TfidfTransformer\n", 1156 | "from sklearn.linear_model import SGDClassifier\n" 1157 | ], 1158 | "language": "python", 1159 | "metadata": { 1160 | "slideshow": { 1161 | "slide_type": "skip" 1162 | } 1163 | }, 1164 | "outputs": [], 1165 | "prompt_number": 78 1166 | }, 1167 | { 1168 | "cell_type": "code", 1169 | "collapsed": false, 1170 | "input": [ 1171 | "news = fetch_20newsgroups()\n", 1172 | "data = news.data\n", 1173 | "category = news.target" 1174 | ], 1175 | "language": "python", 1176 | "metadata": {}, 1177 | "outputs": [], 1178 | "prompt_number": 71 1179 | }, 1180 | { 1181 | "cell_type": "code", 1182 | "collapsed": false, 1183 | "input": [ 1184 | "len(data)" 1185 | ], 1186 | "language": "python", 1187 | "metadata": {}, 1188 | "outputs": [ 1189 | { 1190 | "metadata": {}, 1191 | "output_type": "pyout", 1192 | "prompt_number": 72, 1193 | "text": [ 1194 | "11314" 1195 | ] 1196 | } 1197 | ], 1198 | "prompt_number": 72 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "collapsed": false, 1203 | "input": [ 1204 | "print \" \".join(news.target_names)" 1205 | ], 1206 | "language": "python", 1207 | "metadata": { 1208 | "slideshow": { 1209 | "slide_type": "slide" 1210 | } 1211 | }, 1212 | "outputs": [ 1213 | { 1214 | "output_type": "stream", 1215 | "stream": "stdout", 1216 | "text": [ 1217 | "alt.atheism comp.graphics 
comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc\n" 1218 | ] 1219 | } 1220 | ], 1221 | "prompt_number": 92 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "collapsed": false, 1226 | "input": [ 1227 | "print data[8]" 1228 | ], 1229 | "language": "python", 1230 | "metadata": {}, 1231 | "outputs": [ 1232 | { 1233 | "output_type": "stream", 1234 | "stream": "stdout", 1235 | "text": [ 1236 | "From: holmes7000@iscsvax.uni.edu\n", 1237 | "Subject: WIn 3.0 ICON HELP PLEASE!\n", 1238 | "Organization: University of Northern Iowa\n", 1239 | "Lines: 10\n", 1240 | "\n", 1241 | "I have win 3.0 and downloaded several icons and BMP's but I can't figure out\n", 1242 | "how to change the \"wallpaper\" or use the icons. Any help would be appreciated.\n", 1243 | "\n", 1244 | "\n", 1245 | "Thanx,\n", 1246 | "\n", 1247 | "-Brando\n", 1248 | "\n", 1249 | "PS Please E-mail me\n", 1250 | "\n", 1251 | "\n" 1252 | ] 1253 | } 1254 | ], 1255 | "prompt_number": 99 1256 | }, 1257 | { 1258 | "cell_type": "code", 1259 | "collapsed": false, 1260 | "input": [ 1261 | "pipe = Pipeline([\n", 1262 | " ('vect', CountVectorizer(max_features = 100)),\n", 1263 | " ('tfidf', TfidfTransformer()),\n", 1264 | " ('clf', SGDClassifier()),\n", 1265 | "])\n", 1266 | "\n", 1267 | "pipe.fit(data, category)" 1268 | ], 1269 | "language": "python", 1270 | "metadata": { 1271 | "slideshow": { 1272 | "slide_type": "slide" 1273 | } 1274 | }, 1275 | "outputs": [ 1276 | { 1277 | "metadata": {}, 1278 | "output_type": "pyout", 1279 | "prompt_number": 100, 1280 | "text": [ 1281 | "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", 1282 | " charset_error=None, decode_error=u'strict',\n", 1283 | " dtype=, encoding=u'utf-8', 
input=u'content',\n", 1284 | " lowercase=True, max_df=1.0, max_features=100, min_df=1,\n", 1285 | " ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5,\n", 1286 | " random_state=None, shuffle=False, verbose=0, warm_start=False))])" 1287 | ] 1288 | } 1289 | ], 1290 | "prompt_number": 100 1291 | }, 1292 | { 1293 | "cell_type": "heading", 1294 | "level": 2, 1295 | "metadata": { 1296 | "slideshow": { 1297 | "slide_type": "slide" 1298 | } 1299 | }, 1300 | "source": [ 1301 | "Pandas Pipelines!" 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "collapsed": false, 1307 | "input": [ 1308 | "import pandas as pd\n", 1309 | "import numpy as np\n", 1310 | "import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics\n", 1311 | "from sklearn_pandas import DataFrameMapper, cross_val_score" 1312 | ], 1313 | "language": "python", 1314 | "metadata": { 1315 | "slideshow": { 1316 | "slide_type": "skip" 1317 | } 1318 | }, 1319 | "outputs": [], 1320 | "prompt_number": 107 1321 | }, 1322 | { 1323 | "cell_type": "code", 1324 | "collapsed": false, 1325 | "input": [ 1326 | "data = pd.DataFrame({\n", 1327 | " 'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],\n", 1328 | " 'children': [4., 6, 3, 3, 2, 3, 5, 4],\n", 1329 | " 'salary': [90, 24, 44, 27, 32, 59, 36, 27]\n", 1330 | "})" 1331 | ], 1332 | "language": "python", 1333 | "metadata": {}, 1334 | "outputs": [], 1335 | "prompt_number": 117 1336 | }, 1337 | { 1338 | "cell_type": "code", 1339 | "collapsed": false, 1340 | "input": [ 1341 | "mapper = DataFrameMapper([\n", 1342 | " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", 1343 | " ('children', sklearn.preprocessing.StandardScaler()),\n", 1344 | " ('salary', None)\n", 1345 | "])" 1346 | ], 1347 | "language": "python", 1348 | "metadata": { 1349 | "slideshow": { 1350 | "slide_type": "fragment" 1351 | } 1352 | }, 1353 | "outputs": [], 1354 | "prompt_number": 111 1355 | }, 1356 | { 1357 | "cell_type": 
"code", 1358 | "collapsed": false, 1359 | "input": [ 1360 | "mapper.fit_transform(data)" 1361 | ], 1362 | "language": "python", 1363 | "metadata": { 1364 | "slideshow": { 1365 | "slide_type": "fragment" 1366 | } 1367 | }, 1368 | "outputs": [ 1369 | { 1370 | "metadata": {}, 1371 | "output_type": "pyout", 1372 | "prompt_number": 113, 1373 | "text": [ 1374 | "array([[ 1. , 0. , 0. , 0.20851441, 90. ],\n", 1375 | " [ 0. , 1. , 0. , 1.87662973, 24. ],\n", 1376 | " [ 0. , 1. , 0. , -0.62554324, 44. ],\n", 1377 | " [ 0. , 0. , 1. , -0.62554324, 27. ],\n", 1378 | " [ 1. , 0. , 0. , -1.4596009 , 32. ],\n", 1379 | " [ 0. , 1. , 0. , -0.62554324, 59. ],\n", 1380 | " [ 1. , 0. , 0. , 1.04257207, 36. ],\n", 1381 | " [ 0. , 0. , 1. , 0.20851441, 27. ]])" 1382 | ] 1383 | } 1384 | ], 1385 | "prompt_number": 113 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "collapsed": false, 1390 | "input": [ 1391 | "mapper = DataFrameMapper([\n", 1392 | " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", 1393 | " ('children', sklearn.preprocessing.StandardScaler()),\n", 1394 | " ('salary', None)\n", 1395 | "])\n", 1396 | "\n", 1397 | "pipe = Pipeline([\n", 1398 | " (\"mapper\", mapper),\n", 1399 | " (\"pca\", PCA(n_components=2))\n", 1400 | "])\n", 1401 | "pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!" 
1402 | ], 1403 | "language": "python", 1404 | "metadata": { 1405 | "slideshow": { 1406 | "slide_type": "slide" 1407 | } 1408 | }, 1409 | "outputs": [ 1410 | { 1411 | "metadata": {}, 1412 | "output_type": "pyout", 1413 | "prompt_number": 157, 1414 | "text": [ 1415 | "array([[ -4.76269151e+01, 4.25991055e-01],\n", 1416 | " [ 1.83856756e+01, 1.86178138e+00],\n", 1417 | " [ -1.62747544e+00, -5.06199939e-01],\n", 1418 | " [ 1.53796381e+01, -8.10331853e-01],\n", 1419 | " [ 1.03575109e+01, -1.52528125e+00],\n", 1420 | " [ -1.66260441e+01, -4.27845667e-01],\n", 1421 | " [ 6.37295205e+00, 9.68066902e-01],\n", 1422 | " [ 1.53846579e+01, 1.38193738e-02]])" 1423 | ] 1424 | } 1425 | ], 1426 | "prompt_number": 157 1427 | }, 1428 | { 1429 | "cell_type": "markdown", 1430 | "metadata": { 1431 | "slideshow": { 1432 | "slide_type": "slide" 1433 | } 1434 | }, 1435 | "source": [ 1436 | "Pandas pipelines require the [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) module by [@paulgb](http://www.twitter.com/paulgb)."
1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "markdown", 1441 | "metadata": { 1442 | "slideshow": { 1443 | "slide_type": "fragment" 1444 | } 1445 | }, 1446 | "source": [ 1447 | "Also by Paul:\n", 1448 | " \n", 1449 | "[![](facebook_map.png)](https://www.facebook.com/note.php?note_id=469716398919)" 1450 | ] 1451 | }, 1452 | { 1453 | "cell_type": "heading", 1454 | "level": 1, 1455 | "metadata": { 1456 | "slideshow": { 1457 | "slide_type": "slide" 1458 | } 1459 | }, 1460 | "source": [ 1461 | "Model Evaluation and Selection" 1462 | ] 1463 | }, 1464 | { 1465 | "cell_type": "code", 1466 | "collapsed": false, 1467 | "input": [ 1468 | "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n", 1469 | "from sklearn import datasets\n", 1470 | "from sklearn.ensemble import RandomForestClassifier" 1471 | ], 1472 | "language": "python", 1473 | "metadata": { 1474 | "slideshow": { 1475 | "slide_type": "skip" 1476 | } 1477 | }, 1478 | "outputs": [], 1479 | "prompt_number": 212 1480 | }, 1481 | { 1482 | "cell_type": "code", 1483 | "collapsed": false, 1484 | "input": [ 1485 | "# Create sample dataset\n", 1486 | "X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)" 1487 | ], 1488 | "language": "python", 1489 | "metadata": {}, 1490 | "outputs": [], 1491 | "prompt_number": 137 1492 | }, 1493 | { 1494 | "cell_type": "code", 1495 | "collapsed": false, 1496 | "input": [ 1497 | "# Pipeline for Feature Selection to Random Forest\n", 1498 | "pipe = Pipeline([\n", 1499 | " (\"select\", SelectKBest()),\n", 1500 | " (\"classify\", RandomForestClassifier())\n", 1501 | "])" 1502 | ], 1503 | "language": "python", 1504 | "metadata": { 1505 | "slideshow": { 1506 | "slide_type": "fragment" 1507 | } 1508 | }, 1509 | "outputs": [], 1510 | "prompt_number": 162 1511 | }, 1512 | { 1513 | "cell_type": "code", 1514 | "collapsed": false, 1515 | "input": [ 1516 | "# Define parameter grid\n", 1517 | "param_grid = {\n", 1518 | " \"select__k\" : 
[1, 6, 20, 40],\n", 1519 | " \"classify__n_estimators\" : [1, 10, 100],\n", 1520 | " \n", 1521 | "}\n", 1522 | "gs = GridSearchCV(pipe, param_grid)" 1523 | ], 1524 | "language": "python", 1525 | "metadata": { 1526 | "slideshow": { 1527 | "slide_type": "fragment" 1528 | } 1529 | }, 1530 | "outputs": [], 1531 | "prompt_number": 175 1532 | }, 1533 | { 1534 | "cell_type": "code", 1535 | "collapsed": false, 1536 | "input": [ 1537 | "# Search over grid\n", 1538 | "gs.fit(X, y)\n", 1539 | "\n", 1540 | "gs.best_params_" 1541 | ], 1542 | "language": "python", 1543 | "metadata": { 1544 | "slideshow": { 1545 | "slide_type": "fragment" 1546 | } 1547 | }, 1548 | "outputs": [ 1549 | { 1550 | "metadata": {}, 1551 | "output_type": "pyout", 1552 | "prompt_number": 183, 1553 | "text": [ 1554 | "{'classify__n_estimators': 10, 'select__k': 6}" 1555 | ] 1556 | } 1557 | ], 1558 | "prompt_number": 183 1559 | }, 1560 | { 1561 | "cell_type": "code", 1562 | "collapsed": false, 1563 | "input": [ 1564 | "print gs.best_estimator_.predict(X.mean(axis = 0))" 1565 | ], 1566 | "language": "python", 1567 | "metadata": { 1568 | "slideshow": { 1569 | "slide_type": "fragment" 1570 | } 1571 | }, 1572 | "outputs": [ 1573 | { 1574 | "output_type": "stream", 1575 | "stream": "stdout", 1576 | "text": [ 1577 | "[1]\n" 1578 | ] 1579 | } 1580 | ], 1581 | "prompt_number": 192 1582 | }, 1583 | { 1584 | "cell_type": "heading", 1585 | "level": 2, 1586 | "metadata": { 1587 | "slideshow": { 1588 | "slide_type": "slide" 1589 | } 1590 | }, 1591 | "source": [ 1592 | "Curse of Dimensionality" 1593 | ] 1594 | }, 1595 | { 1596 | "cell_type": "markdown", 1597 | "metadata": {}, 1598 | "source": [ 1599 | "Search space grows exponentially with number of parameters." 
1600 | ] 1601 | }, 1602 | { 1603 | "cell_type": "code", 1604 | "collapsed": false, 1605 | "input": [ 1606 | "gs.grid_scores_" 1607 | ], 1608 | "language": "python", 1609 | "metadata": {}, 1610 | "outputs": [ 1611 | { 1612 | "metadata": {}, 1613 | "output_type": "pyout", 1614 | "prompt_number": 185, 1615 | "text": [ 1616 | "[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},\n", 1617 | " mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},\n", 1618 | " mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},\n", 1619 | " mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},\n", 1620 | " mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},\n", 1621 | " mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},\n", 1622 | " mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},\n", 1623 | " mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},\n", 1624 | " mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},\n", 1625 | " mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},\n", 1626 | " mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},\n", 1627 | " mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]" 1628 | ] 1629 | } 1630 | ], 1631 | "prompt_number": 185 1632 | }, 1633 | { 1634 | "cell_type": "heading", 1635 | "level": 2, 1636 | "metadata": { 1637 | "slideshow": { 1638 | "slide_type": "slide" 1639 | } 1640 | }, 1641 | "source": [ 1642 | "Curse of Dimensionality: Parallelization " 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "markdown", 1647 | "metadata": {}, 1648 | "source": [ 1649 | "__GridSearch__ on 1 core:" 1650 | ] 1651 | }, 1652 | { 1653 | "cell_type": "code", 1654 | "collapsed": 
false, 1655 | "input": [ 1656 | "param_grid = {\n", 1657 | " \"select__k\" : [1, 5, 10, 15, 20, 25, 30, 35, 40],\n", 1658 | " \"classify__n_estimators\" : [1, 5, 10, 25, 50, 75, 100],\n", 1659 | " \n", 1660 | "}\n", 1661 | "gs = GridSearchCV(pipe, param_grid, n_jobs = 1)\n", 1662 | "%timeit gs.fit(X, y)\n", 1663 | "print" 1664 | ], 1665 | "language": "python", 1666 | "metadata": {}, 1667 | "outputs": [ 1668 | { 1669 | "output_type": "stream", 1670 | "stream": "stdout", 1671 | "text": [ 1672 | "1 loops, best of 3: 6.31 s per loop\n", 1673 | "\n" 1674 | ] 1675 | } 1676 | ], 1677 | "prompt_number": 207 1678 | }, 1679 | { 1680 | "cell_type": "markdown", 1681 | "metadata": {}, 1682 | "source": [ 1683 | "__GridSearch__ on 7 cores:" 1684 | ] 1685 | }, 1686 | { 1687 | "cell_type": "code", 1688 | "collapsed": false, 1689 | "input": [ 1690 | "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", 1691 | "%timeit gs.fit(X, y)\n", 1692 | "print" 1693 | ], 1694 | "language": "python", 1695 | "metadata": {}, 1696 | "outputs": [ 1697 | { 1698 | "output_type": "stream", 1699 | "stream": "stdout", 1700 | "text": [ 1701 | "1 loops, best of 3: 1.81 s per loop\n", 1702 | "\n" 1703 | ] 1704 | } 1705 | ], 1706 | "prompt_number": 208 1707 | }, 1708 | { 1709 | "cell_type": "heading", 1710 | "level": 2, 1711 | "metadata": { 1712 | "slideshow": { 1713 | "slide_type": "slide" 1714 | } 1715 | }, 1716 | "source": [ 1717 | "Curse of Dimensionality: Randomization" 1718 | ] 1719 | }, 1720 | { 1721 | "cell_type": "markdown", 1722 | "metadata": {}, 1723 | "source": [ 1724 | "__GridSearchCV__ might be very slow:" 1725 | ] 1726 | }, 1727 | { 1728 | "cell_type": "code", 1729 | "collapsed": false, 1730 | "input": [ 1731 | "param_grid = {\n", 1732 | " \"select__k\" : range(1, 40),\n", 1733 | " \"classify__n_estimators\" : range(1, 100), \n", 1734 | "}" 1735 | ], 1736 | "language": "python", 1737 | "metadata": {}, 1738 | "outputs": [], 1739 | "prompt_number": 220 1740 | }, 1741 | { 1742 | "cell_type": 
"code", 1743 | "collapsed": false, 1744 | "input": [ 1745 | "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", 1746 | "gs.fit(X, y)\n", 1747 | "print \"Best CV score\", gs.best_score_\n", 1748 | "print gs.best_params_" 1749 | ], 1750 | "language": "python", 1751 | "metadata": {}, 1752 | "outputs": [ 1753 | { 1754 | "output_type": "stream", 1755 | "stream": "stdout", 1756 | "text": [ 1757 | "0.924\n", 1758 | "{'classify__n_estimators': 59, 'select__k': 9}\n" 1759 | ] 1760 | } 1761 | ], 1762 | "prompt_number": 221 1763 | }, 1764 | { 1765 | "cell_type": "markdown", 1766 | "metadata": {}, 1767 | "source": [ 1768 | "We can instead randomly sample from the parameter space with __RandomizedSearchCV__:" 1769 | ] 1770 | }, 1771 | { 1772 | "cell_type": "code", 1773 | "collapsed": false, 1774 | "input": [ 1775 | "gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)\n", 1776 | "gs.fit(X, y)\n", 1777 | "print \"Best CV score\", gs.best_score_\n", 1778 | "print gs.best_params_" 1779 | ], 1780 | "language": "python", 1781 | "metadata": {}, 1782 | "outputs": [ 1783 | { 1784 | "output_type": "stream", 1785 | "stream": "stdout", 1786 | "text": [ 1787 | "0.894\n", 1788 | "{'classify__n_estimators': 58, 'select__k': 7}\n" 1789 | ] 1790 | } 1791 | ], 1792 | "prompt_number": 229 1793 | }, 1794 | { 1795 | "cell_type": "heading", 1796 | "level": 1, 1797 | "metadata": { 1798 | "slideshow": { 1799 | "slide_type": "slide" 1800 | } 1801 | }, 1802 | "source": [ 1803 | "Conclusions" 1804 | ] 1805 | }, 1806 | { 1807 | "cell_type": "markdown", 1808 | "metadata": { 1809 | "slideshow": { 1810 | "slide_type": "-" 1811 | } 1812 | }, 1813 | "source": [ 1814 | "* Scikit-learn has an elegant API and is built in a beautiful language." 
1815 | ] 1816 | }, 1817 | { 1818 | "cell_type": "markdown", 1819 | "metadata": { 1820 | "slideshow": { 1821 | "slide_type": "fragment" 1822 | } 1823 | }, 1824 | "source": [ 1825 | "* Pipelines allow complex chains of operations to be easily computed.\n", 1826 | " * This helps ensure correct cross validation (see _Elements of Statistical Learning_ 7.10.2). " 1827 | ] 1828 | }, 1829 | { 1830 | "cell_type": "markdown", 1831 | "metadata": { 1832 | "slideshow": { 1833 | "slide_type": "fragment" 1834 | } 1835 | }, 1836 | "source": [ 1837 | "* Pipelines combined with grid search permit easy model selection." 1838 | ] 1839 | } 1840 | ], 1841 | "metadata": {} 1842 | } 1843 | ] 1844 | } 1845 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Timothy Hopper 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Intro to Scikit-Learn 2 | 3 | * Research Triangle Analysts 4 | * January 2014 5 | * Presented by Tim Hopper 6 | 7 | __Abstract:__ Scikit-learn is an actively developed Python package providing implementations of many machine learning algorithms (e.g. SVM, kNN, linear models, HMM, k-Means, spectral clustering). However, the benefits of Scikit-learn go well beyond carefully implemented learning algorithms. Being built in Python, it allows easy integration with countless other Python modules for tasks such as plotting, data munging, and application development. Its consistent API across algorithms allows for rapid experimentation with multiple learning methods. Also, Scikit-learn is well documented and provides lots of examples. 8 | 9 | Instead of discussing particular machine learning algorithms provided by the package, I will focus on Scikit-learn and Python as a toolkit for solving data problems from start to finish. I will emphasize the Pipeline tool, which allows the user to chain together all the steps of a machine learning pipeline, including preprocessing, dimensionality reduction, feature selection, and model fitting. 10 | 11 | ------ 12 | 13 | A (poor quality) video of this talk is [here](https://www.youtube.com/watch?v=2kx19t8bNMU). 14 | 15 | ------ 16 | 17 | The slides for this presentation are generated from _Intro to Scikit-Learn.ipynb_.
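The notebook's central example, a Pipeline tuned by grid search, can be sketched roughly as below for readers on a current scikit-learn release. This is a minimal sketch rather than the notebook's exact code: it assumes Python 3 and imports `GridSearchCV` from `sklearn.model_selection` (the notebook itself imports from the older `sklearn.grid_search` module), and the `random_state=0` and `cv=3` settings are illustrative choices that do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic dataset with the same shape as the one used in the notebook.
# random_state is an assumption added here for repeatability.
X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=6, n_classes=2, random_state=0)

# Chain feature selection and a random forest into a single estimator.
pipe = Pipeline([
    ("select", SelectKBest()),
    ("classify", RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "select__k": [1, 6, 20, 40],
    "classify__n_estimators": [1, 10, 100],
}

# cv=3 is an illustrative choice; every grid point is cross-validated.
gs = GridSearchCV(pipe, param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)

# predict expects a 2-D array, so reshape the single "average" sample.
pred = gs.best_estimator_.predict(X.mean(axis=0).reshape(1, -1))
print(pred)
```

Because feature selection happens inside the pipeline, it is refit on each training fold during the search, which is the cross-validation point made in the talk's conclusions.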
18 | 19 | To view the slides in a browser run the following command: 20 | 21 | ``` 22 | ipython nbconvert Intro\ to\ Scikit-Learn.ipynb --to slides --post serve 23 | ``` 24 | -------------------------------------------------------------------------------- /facebook_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tdhopper/intro-to-scikit-learn/1545a1b8579c3e4725ce5ac7076a0e3c9544c231/facebook_map.png --------------------------------------------------------------------------------