├── .gitignore ├── LICENSE ├── README.md ├── code ├── check_environment.ipynb ├── dataset_brain.txt ├── dataset_iris.txt └── tutorial.ipynb ├── images ├── checkenv-example.png ├── github-download.png └── logo.png └── slides └── PyDataChi2016_sr.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 
| 3 | Copyright (c) 2016 Sebastian Raschka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pydata-chicago2016-ml-tutorial 2 | 3 | ### Learning scikit-learn -- An Introduction to Machine Learning in Python @ PyData Chicago 2016 4 | 5 | *This tutorial provides you with an introduction to machine learning in Python using the popular scikit-learn library.* 6 | 7 | ![](images/logo.png) 8 | 9 | This tutorial will teach you the basics of scikit-learn. I will give you a brief overview of the basic concepts of classification and regression analysis and show you how to build powerful predictive models from labeled data. 
Although it's not a requirement for attending this tutorial, I highly recommend checking out the accompanying GitHub repository at https://github.com/rasbt/pydata-chicago2016-ml-tutorial 1-2 days before the tutorial. During the session, we will not only talk about scikit-learn, but we will also go over some live code examples to get the knack of scikit-learn's API. 10 | 11 | If you have any questions about the tutorial, please don't hesitate to contact me. You can either open an "issue" on GitHub or reach me via email at mail_at_sebastianraschka.com. I am looking forward to meeting you soon! 12 | 13 | --- 14 | 15 | - View the presentation slides [here](slides/PyDataChi2016_sr.pdf) / on [SpeakerDeck](https://speakerdeck.com/rasbt/learning-scikit-learn-an-introduction-to-machine-learning-in-python-at-pydata-chicago-2016) 16 | - View the code notebook [here](code/tutorial.ipynb) / on [nbviewer](http://nbviewer.jupyter.org/github/rasbt/pydata-chicago2016-ml-tutorial/blob/master/code/tutorial.ipynb) 17 | - [Video recording](https://www.youtube.com/watch?v=9fOWryQq9J8) of the talk on YouTube 18 | 19 | --- 20 | 21 | # Schedule 22 | 23 | This repository will contain the teaching material and other info for the *Learning scikit-learn* tutorial at the [PyData Chicago 2016 Conference](http://pydata.org/chicago2016/) held Aug 26-28. 24 | 25 | - When? **Fri Aug. 26, 2016 at 9:00 - 10:30 am** 26 | - Where? **Room 1** 27 | 28 | (I recommend watching the PyData [schedule](http://pydata.org/chicago2016/schedule/) for updates). 
29 | 30 | # Obtaining the Tutorial Material 31 | 32 | If you already have a GitHub account, probably the most convenient way to obtain the tutorial material is to clone this GitHub repository via `git clone https://github.com/rasbt/pydata-chicago2016-ml-tutorial` and fetch updates via `git pull origin master`. 33 | 34 | If you don’t have a GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/rasbt/pydata-chicago2016-ml-tutorial) in your browser and clicking the green “Download” button in the upper right. 35 | 36 | ![](images/github-download.png) 37 | 38 | 39 | # Installation Notes and Requirements 40 | 41 | Please note that installing the following libraries and running the code alongside is **not a hard requirement for attending the tutorial session**; you will be able to follow along just fine (and probably be less distracted :)). Now, the tutorial code should be compatible with both Python 2.7 and Python 3.x but will require recent installations of 42 | 43 | - [NumPy](http://www.numpy.org) 44 | - [SciPy](http://www.scipy.org) 45 | - [matplotlib](http://matplotlib.org) 46 | - [pandas](http://pandas.pydata.org) 47 | - [scikit-learn](http://scikit-learn.org/stable/) 48 | - [IPython](http://ipython.readthedocs.org/en/stable/) 49 | - [Jupyter Notebook](http://jupyter.org) 50 | - [watermark](https://pypi.python.org/pypi/watermark) 51 | - [mlxtend](http://rasbt.github.io/mlxtend/) 52 | 53 | 54 | Please make sure that you have these libraries installed in your current Python environment prior to attending the tutorial if you want to execute the code examples that are run during the talk. Please also note that executing these examples during/after the talk is merely a suggestion, not a requirement. 
**I highly recommend opening the [code/check_environment.ipynb](code/check_environment.ipynb) notebook in Jupyter**, for instance by launching the notebook via 55 | 56 | ```bash 57 | jupyter notebook /pydata-chicago2016-ml-tutorial/code/check_environment.ipynb 58 | ``` 59 | and executing the code cells: 60 | 61 | ![](images/checkenv-example.png) 62 | -------------------------------------------------------------------------------- /code/check_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Jupyter Notebook for Checking Python Package Requirements" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "[OK] numpy 1.11.1\n", 22 | "[OK] scipy 0.17.1\n", 23 | "[OK] matplotlib 1.5.1\n", 24 | "[OK] sklearn 0.17.1\n", 25 | "[OK] pandas 0.18.1\n", 26 | "[OK] mlxtend 0.4.2\n" 27 | ] 28 | } 29 | ], 30 | "source": [ 31 | "def get_packages(pkgs):\n", 32 | "    versions = []\n", 33 | "    for p in pkgs:\n", 34 | "        try:\n", 35 | "            imported = __import__(p)\n", 36 | "            try:\n", 37 | "                versions.append(imported.__version__)\n", 38 | "            except AttributeError:\n", 39 | "                try:\n", 40 | "                    versions.append(imported.version)\n", 41 | "                except AttributeError:\n", 42 | "                    try:\n", 43 | "                        versions.append(imported.version_info)\n", 44 | "                    except AttributeError:\n", 45 | "                        versions.append('0.0')\n", 46 | "        except ImportError:\n", 47 | "            print('[FAIL]: %s is not installed' % p)\n", 48 | "    return versions\n", 49 | "    \n", 50 | "packages = ['numpy', 'scipy', 'matplotlib', 'sklearn', 'pandas', 'mlxtend']\n", 51 | "suggested_v = ['1.10', '0.17', '1.5.1', '0.17.1', '0.17.1', '0.4.2']\n", 52 | "versions = get_packages(packages)\n", 53 | "\n", 54 | "for p, v, s in zip(packages, versions, 
suggested_v):\n", 55 | " if v < s:\n", 56 | " print('[FAIL] %s %s, please upgrade to >= %s' % (p, v, s))\n", 57 | " else:\n", 58 | " print('[OK] %s %s' % (p, v))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "2016-08-24 \n", 73 | "\n", 74 | "numpy 1.11.1\n", 75 | "scipy 0.17.1\n", 76 | "matplotlib 1.5.1\n", 77 | "sklearn 0.17.1\n", 78 | "pandas 0.18.1\n", 79 | "mlxtend 0.4.2\n" 80 | ] 81 | } 82 | ], 83 | "source": [ 84 | "%load_ext watermark\n", 85 | "%watermark -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" 86 | ] 87 | } 88 | ], 89 | "metadata": { 90 | "anaconda-cloud": {}, 91 | "kernelspec": { 92 | "display_name": "Python [Root]", 93 | "language": "python", 94 | "name": "Python [Root]" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 3 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython3", 106 | "version": "3.5.2" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 0 111 | } 112 | -------------------------------------------------------------------------------- /code/dataset_brain.txt: -------------------------------------------------------------------------------- 1 | # Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to 2 | # to the Size of the Head", Biometrika, Vol. 4, pp105-123 3 | # 4 | # Download link: http://www.stat.ufl.edu/~winner/data/brainhead.txt 5 | # 6 | # Description: Brain weight (grams) and head size (cubic cm) for 237 7 | # adults classified by gender and age group. 
8 | # 9 | # Variables/Columns 10 | # Gender 8 /* 1=Male, 2=Female */ 11 | # Age Range 16 /* 1=20-46, 2=46+ */ 12 | # Head size (cm^3) 21-24 13 | # Brain weight (grams) 29-32 14 | # 15 | gender age-group head-size brain-weight 16 | 1 1 4512 1530 17 | 1 1 3738 1297 18 | 1 1 4261 1335 19 | 1 1 3777 1282 20 | 1 1 4177 1590 21 | 1 1 3585 1300 22 | 1 1 3785 1400 23 | 1 1 3559 1255 24 | 1 1 3613 1355 25 | 1 1 3982 1375 26 | 1 1 3443 1340 27 | 1 1 3993 1380 28 | 1 1 3640 1355 29 | 1 1 4208 1522 30 | 1 1 3832 1208 31 | 1 1 3876 1405 32 | 1 1 3497 1358 33 | 1 1 3466 1292 34 | 1 1 3095 1340 35 | 1 1 4424 1400 36 | 1 1 3878 1357 37 | 1 1 4046 1287 38 | 1 1 3804 1275 39 | 1 1 3710 1270 40 | 1 1 4747 1635 41 | 1 1 4423 1505 42 | 1 1 4036 1490 43 | 1 1 4022 1485 44 | 1 1 3454 1310 45 | 1 1 4175 1420 46 | 1 1 3787 1318 47 | 1 1 3796 1432 48 | 1 1 4103 1364 49 | 1 1 4161 1405 50 | 1 1 4158 1432 51 | 1 1 3814 1207 52 | 1 1 3527 1375 53 | 1 1 3748 1350 54 | 1 1 3334 1236 55 | 1 1 3492 1250 56 | 1 1 3962 1350 57 | 1 1 3505 1320 58 | 1 1 4315 1525 59 | 1 1 3804 1570 60 | 1 1 3863 1340 61 | 1 1 4034 1422 62 | 1 1 4308 1506 63 | 1 1 3165 1215 64 | 1 1 3641 1311 65 | 1 1 3644 1300 66 | 1 1 3891 1224 67 | 1 1 3793 1350 68 | 1 1 4270 1335 69 | 1 1 4063 1390 70 | 1 1 4012 1400 71 | 1 1 3458 1225 72 | 1 1 3890 1310 73 | 1 2 4166 1560 74 | 1 2 3935 1330 75 | 1 2 3669 1222 76 | 1 2 3866 1415 77 | 1 2 3393 1175 78 | 1 2 4442 1330 79 | 1 2 4253 1485 80 | 1 2 3727 1470 81 | 1 2 3329 1135 82 | 1 2 3415 1310 83 | 1 2 3372 1154 84 | 1 2 4430 1510 85 | 1 2 4381 1415 86 | 1 2 4008 1468 87 | 1 2 3858 1390 88 | 1 2 4121 1380 89 | 1 2 4057 1432 90 | 1 2 3824 1240 91 | 1 2 3394 1195 92 | 1 2 3558 1225 93 | 1 2 3362 1188 94 | 1 2 3930 1252 95 | 1 2 3835 1315 96 | 1 2 3830 1245 97 | 1 2 3856 1430 98 | 1 2 3249 1279 99 | 1 2 3577 1245 100 | 1 2 3933 1309 101 | 1 2 3850 1412 102 | 1 2 3309 1120 103 | 1 2 3406 1220 104 | 1 2 3506 1280 105 | 1 2 3907 1440 106 | 1 2 4160 1370 107 | 1 2 3318 1192 108 | 1 2 3662 
1230 109 | 1 2 3899 1346 110 | 1 2 3700 1290 111 | 1 2 3779 1165 112 | 1 2 3473 1240 113 | 1 2 3490 1132 114 | 1 2 3654 1242 115 | 1 2 3478 1270 116 | 1 2 3495 1218 117 | 1 2 3834 1430 118 | 1 2 3876 1588 119 | 1 2 3661 1320 120 | 1 2 3618 1290 121 | 1 2 3648 1260 122 | 1 2 4032 1425 123 | 1 2 3399 1226 124 | 1 2 3916 1360 125 | 1 2 4430 1620 126 | 1 2 3695 1310 127 | 1 2 3524 1250 128 | 1 2 3571 1295 129 | 1 2 3594 1290 130 | 1 2 3383 1290 131 | 1 2 3499 1275 132 | 1 2 3589 1250 133 | 1 2 3900 1270 134 | 1 2 4114 1362 135 | 1 2 3937 1300 136 | 1 2 3399 1173 137 | 1 2 4200 1256 138 | 1 2 4488 1440 139 | 1 2 3614 1180 140 | 1 2 4051 1306 141 | 1 2 3782 1350 142 | 1 2 3391 1125 143 | 1 2 3124 1165 144 | 1 2 4053 1312 145 | 1 2 3582 1300 146 | 1 2 3666 1270 147 | 1 2 3532 1335 148 | 1 2 4046 1450 149 | 1 2 3667 1310 150 | 2 1 2857 1027 151 | 2 1 3436 1235 152 | 2 1 3791 1260 153 | 2 1 3302 1165 154 | 2 1 3104 1080 155 | 2 1 3171 1127 156 | 2 1 3572 1270 157 | 2 1 3530 1252 158 | 2 1 3175 1200 159 | 2 1 3438 1290 160 | 2 1 3903 1334 161 | 2 1 3899 1380 162 | 2 1 3401 1140 163 | 2 1 3267 1243 164 | 2 1 3451 1340 165 | 2 1 3090 1168 166 | 2 1 3413 1322 167 | 2 1 3323 1249 168 | 2 1 3680 1321 169 | 2 1 3439 1192 170 | 2 1 3853 1373 171 | 2 1 3156 1170 172 | 2 1 3279 1265 173 | 2 1 3707 1235 174 | 2 1 4006 1302 175 | 2 1 3269 1241 176 | 2 1 3071 1078 177 | 2 1 3779 1520 178 | 2 1 3548 1460 179 | 2 1 3292 1075 180 | 2 1 3497 1280 181 | 2 1 3082 1180 182 | 2 1 3248 1250 183 | 2 1 3358 1190 184 | 2 1 3803 1374 185 | 2 1 3566 1306 186 | 2 1 3145 1202 187 | 2 1 3503 1240 188 | 2 1 3571 1316 189 | 2 1 3724 1280 190 | 2 1 3615 1350 191 | 2 1 3203 1180 192 | 2 1 3609 1210 193 | 2 1 3561 1127 194 | 2 1 3979 1324 195 | 2 1 3533 1210 196 | 2 1 3689 1290 197 | 2 1 3158 1100 198 | 2 1 4005 1280 199 | 2 1 3181 1175 200 | 2 1 3479 1160 201 | 2 1 3642 1205 202 | 2 1 3632 1163 203 | 2 2 3069 1022 204 | 2 2 3394 1243 205 | 2 2 3703 1350 206 | 2 2 3165 1237 207 | 2 2 3354 1204 208 | 2 2 3000 
1090 209 | 2 2 3687 1355 210 | 2 2 3556 1250 211 | 2 2 2773 1076 212 | 2 2 3058 1120 213 | 2 2 3344 1220 214 | 2 2 3493 1240 215 | 2 2 3297 1220 216 | 2 2 3360 1095 217 | 2 2 3228 1235 218 | 2 2 3277 1105 219 | 2 2 3851 1405 220 | 2 2 3067 1150 221 | 2 2 3692 1305 222 | 2 2 3402 1220 223 | 2 2 3995 1296 224 | 2 2 3318 1175 225 | 2 2 2720 955 226 | 2 2 2937 1070 227 | 2 2 3580 1320 228 | 2 2 2939 1060 229 | 2 2 2989 1130 230 | 2 2 3586 1250 231 | 2 2 3156 1225 232 | 2 2 3246 1180 233 | 2 2 3170 1178 234 | 2 2 3268 1142 235 | 2 2 3389 1130 236 | 2 2 3381 1185 237 | 2 2 2864 1012 238 | 2 2 3740 1280 239 | 2 2 3479 1103 240 | 2 2 3647 1408 241 | 2 2 3716 1300 242 | 2 2 3284 1246 243 | 2 2 4204 1380 244 | 2 2 3735 1350 245 | 2 2 3218 1060 246 | 2 2 3685 1350 247 | 2 2 3704 1220 248 | 2 2 3214 1110 249 | 2 2 3394 1215 250 | 2 2 3233 1104 251 | 2 2 3352 1170 252 | 2 2 3391 1120 -------------------------------------------------------------------------------- /code/dataset_iris.txt: -------------------------------------------------------------------------------- 1 | # Download source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data 2 | # 3 | # 1. Title: Iris Plants Database 4 | # Updated Sept 21 by C.Blake - Added discrepency information 5 | # 6 | # 2. Sources: 7 | # (a) Creator: R.A. Fisher 8 | # (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) 9 | # (c) Date: July, 1988 10 | # 11 | # 3. Past Usage: 12 | # - Publications: too many to mention!!! Here are a few. 13 | # 1. Fisher,R.A. "The use of multiple measurements in taxonomic problems" 14 | # Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions 15 | # to Mathematical Statistics" (John Wiley, NY, 1950). 16 | # 2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis. 17 | # (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. 18 | # 3. Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System 19 | # Structure and Classification Rule for Recognition in Partially Exposed 20 | # Environments". IEEE Transactions on Pattern Analysis and Machine 21 | # Intelligence, Vol. PAMI-2, No. 1, 67-71. 22 | # -- Results: 23 | # -- very low misclassification rates (0% for the setosa class) 24 | # 4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE 25 | # Transactions on Information Theory, May 1972, 431-433. 26 | # -- Results: 27 | # -- very low misclassification rates again 28 | # 5. See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II 29 | # conceptual clustering system finds 3 classes in the data. 30 | # 31 | # 4. Relevant Information: 32 | # --- This is perhaps the best known database to be found in the pattern 33 | # recognition literature. Fisher's paper is a classic in the field 34 | # and is referenced frequently to this day. (See Duda & Hart, for 35 | # example.) The data set contains 3 classes of 50 instances each, 36 | # where each class refers to a type of iris plant. One class is 37 | # linearly separable from the other 2; the latter are NOT linearly 38 | # separable from each other. 39 | # --- Predicted attribute: class of iris plant. 40 | # --- This is an exceedingly simple domain. 41 | # --- This data differs from the data presented in Fishers article 42 | # (identified by Steve Chadwick, spchadwick@espeedaz.net ) 43 | # The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" 44 | # where the error is in the fourth feature. 45 | # The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" 46 | # where the errors are in the second and third features. 47 | # 48 | # 5. Number of Instances: 150 (50 in each of three classes) 49 | # 50 | # 6. Number of Attributes: 4 numeric, predictive attributes and the class 51 | # 52 | # 7. Attribute Information: 53 | # 1. sepal length in cm 54 | # 2. sepal width in cm 55 | # 3. petal length in cm 56 | # 4. petal width in cm 57 | # 5. 
class: 58 | # -- Iris Setosa 59 | # -- Iris Versicolour 60 | # -- Iris Virginica 61 | # 62 | # 8. Missing Attribute Values: None 63 | # 64 | # Summary Statistics: 65 | # Min Max Mean SD Class Correlation 66 | # sepal length: 4.3 7.9 5.84 0.83 0.7826 67 | # sepal width: 2.0 4.4 3.05 0.43 -0.4194 68 | # petal length: 1.0 6.9 3.76 1.76 0.9490 (high!) 69 | # petal width: 0.1 2.5 1.20 0.76 0.9565 (high!) 70 | # 71 | # 9. Class Distribution: 33.3% for each of 3 classes. 72 | # 73 | sepal_length,sepal_width,petal_length,petal_width,class 74 | 5.1,3.5,1.4,0.2,Iris-setosa 75 | 4.9,3.0,1.4,0.2,Iris-setosa 76 | 4.7,3.2,1.3,0.2,Iris-setosa 77 | 4.6,3.1,1.5,0.2,Iris-setosa 78 | 5.0,3.6,1.4,0.2,Iris-setosa 79 | 5.4,3.9,1.7,0.4,Iris-setosa 80 | 4.6,3.4,1.4,0.3,Iris-setosa 81 | 5.0,3.4,1.5,0.2,Iris-setosa 82 | 4.4,2.9,1.4,0.2,Iris-setosa 83 | 4.9,3.1,1.5,0.1,Iris-setosa 84 | 5.4,3.7,1.5,0.2,Iris-setosa 85 | 4.8,3.4,1.6,0.2,Iris-setosa 86 | 4.8,3.0,1.4,0.1,Iris-setosa 87 | 4.3,3.0,1.1,0.1,Iris-setosa 88 | 5.8,4.0,1.2,0.2,Iris-setosa 89 | 5.7,4.4,1.5,0.4,Iris-setosa 90 | 5.4,3.9,1.3,0.4,Iris-setosa 91 | 5.1,3.5,1.4,0.3,Iris-setosa 92 | 5.7,3.8,1.7,0.3,Iris-setosa 93 | 5.1,3.8,1.5,0.3,Iris-setosa 94 | 5.4,3.4,1.7,0.2,Iris-setosa 95 | 5.1,3.7,1.5,0.4,Iris-setosa 96 | 4.6,3.6,1.0,0.2,Iris-setosa 97 | 5.1,3.3,1.7,0.5,Iris-setosa 98 | 4.8,3.4,1.9,0.2,Iris-setosa 99 | 5.0,3.0,1.6,0.2,Iris-setosa 100 | 5.0,3.4,1.6,0.4,Iris-setosa 101 | 5.2,3.5,1.5,0.2,Iris-setosa 102 | 5.2,3.4,1.4,0.2,Iris-setosa 103 | 4.7,3.2,1.6,0.2,Iris-setosa 104 | 4.8,3.1,1.6,0.2,Iris-setosa 105 | 5.4,3.4,1.5,0.4,Iris-setosa 106 | 5.2,4.1,1.5,0.1,Iris-setosa 107 | 5.5,4.2,1.4,0.2,Iris-setosa 108 | 4.9,3.1,1.5,0.1,Iris-setosa 109 | 5.0,3.2,1.2,0.2,Iris-setosa 110 | 5.5,3.5,1.3,0.2,Iris-setosa 111 | 4.9,3.1,1.5,0.1,Iris-setosa 112 | 4.4,3.0,1.3,0.2,Iris-setosa 113 | 5.1,3.4,1.5,0.2,Iris-setosa 114 | 5.0,3.5,1.3,0.3,Iris-setosa 115 | 4.5,2.3,1.3,0.3,Iris-setosa 116 | 4.4,3.2,1.3,0.2,Iris-setosa 117 | 
5.0,3.5,1.6,0.6,Iris-setosa 118 | 5.1,3.8,1.9,0.4,Iris-setosa 119 | 4.8,3.0,1.4,0.3,Iris-setosa 120 | 5.1,3.8,1.6,0.2,Iris-setosa 121 | 4.6,3.2,1.4,0.2,Iris-setosa 122 | 5.3,3.7,1.5,0.2,Iris-setosa 123 | 5.0,3.3,1.4,0.2,Iris-setosa 124 | 7.0,3.2,4.7,1.4,Iris-versicolor 125 | 6.4,3.2,4.5,1.5,Iris-versicolor 126 | 6.9,3.1,4.9,1.5,Iris-versicolor 127 | 5.5,2.3,4.0,1.3,Iris-versicolor 128 | 6.5,2.8,4.6,1.5,Iris-versicolor 129 | 5.7,2.8,4.5,1.3,Iris-versicolor 130 | 6.3,3.3,4.7,1.6,Iris-versicolor 131 | 4.9,2.4,3.3,1.0,Iris-versicolor 132 | 6.6,2.9,4.6,1.3,Iris-versicolor 133 | 5.2,2.7,3.9,1.4,Iris-versicolor 134 | 5.0,2.0,3.5,1.0,Iris-versicolor 135 | 5.9,3.0,4.2,1.5,Iris-versicolor 136 | 6.0,2.2,4.0,1.0,Iris-versicolor 137 | 6.1,2.9,4.7,1.4,Iris-versicolor 138 | 5.6,2.9,3.6,1.3,Iris-versicolor 139 | 6.7,3.1,4.4,1.4,Iris-versicolor 140 | 5.6,3.0,4.5,1.5,Iris-versicolor 141 | 5.8,2.7,4.1,1.0,Iris-versicolor 142 | 6.2,2.2,4.5,1.5,Iris-versicolor 143 | 5.6,2.5,3.9,1.1,Iris-versicolor 144 | 5.9,3.2,4.8,1.8,Iris-versicolor 145 | 6.1,2.8,4.0,1.3,Iris-versicolor 146 | 6.3,2.5,4.9,1.5,Iris-versicolor 147 | 6.1,2.8,4.7,1.2,Iris-versicolor 148 | 6.4,2.9,4.3,1.3,Iris-versicolor 149 | 6.6,3.0,4.4,1.4,Iris-versicolor 150 | 6.8,2.8,4.8,1.4,Iris-versicolor 151 | 6.7,3.0,5.0,1.7,Iris-versicolor 152 | 6.0,2.9,4.5,1.5,Iris-versicolor 153 | 5.7,2.6,3.5,1.0,Iris-versicolor 154 | 5.5,2.4,3.8,1.1,Iris-versicolor 155 | 5.5,2.4,3.7,1.0,Iris-versicolor 156 | 5.8,2.7,3.9,1.2,Iris-versicolor 157 | 6.0,2.7,5.1,1.6,Iris-versicolor 158 | 5.4,3.0,4.5,1.5,Iris-versicolor 159 | 6.0,3.4,4.5,1.6,Iris-versicolor 160 | 6.7,3.1,4.7,1.5,Iris-versicolor 161 | 6.3,2.3,4.4,1.3,Iris-versicolor 162 | 5.6,3.0,4.1,1.3,Iris-versicolor 163 | 5.5,2.5,4.0,1.3,Iris-versicolor 164 | 5.5,2.6,4.4,1.2,Iris-versicolor 165 | 6.1,3.0,4.6,1.4,Iris-versicolor 166 | 5.8,2.6,4.0,1.2,Iris-versicolor 167 | 5.0,2.3,3.3,1.0,Iris-versicolor 168 | 5.6,2.7,4.2,1.3,Iris-versicolor 169 | 5.7,3.0,4.2,1.2,Iris-versicolor 170 | 
5.7,2.9,4.2,1.3,Iris-versicolor 171 | 6.2,2.9,4.3,1.3,Iris-versicolor 172 | 5.1,2.5,3.0,1.1,Iris-versicolor 173 | 5.7,2.8,4.1,1.3,Iris-versicolor 174 | 6.3,3.3,6.0,2.5,Iris-virginica 175 | 5.8,2.7,5.1,1.9,Iris-virginica 176 | 7.1,3.0,5.9,2.1,Iris-virginica 177 | 6.3,2.9,5.6,1.8,Iris-virginica 178 | 6.5,3.0,5.8,2.2,Iris-virginica 179 | 7.6,3.0,6.6,2.1,Iris-virginica 180 | 4.9,2.5,4.5,1.7,Iris-virginica 181 | 7.3,2.9,6.3,1.8,Iris-virginica 182 | 6.7,2.5,5.8,1.8,Iris-virginica 183 | 7.2,3.6,6.1,2.5,Iris-virginica 184 | 6.5,3.2,5.1,2.0,Iris-virginica 185 | 6.4,2.7,5.3,1.9,Iris-virginica 186 | 6.8,3.0,5.5,2.1,Iris-virginica 187 | 5.7,2.5,5.0,2.0,Iris-virginica 188 | 5.8,2.8,5.1,2.4,Iris-virginica 189 | 6.4,3.2,5.3,2.3,Iris-virginica 190 | 6.5,3.0,5.5,1.8,Iris-virginica 191 | 7.7,3.8,6.7,2.2,Iris-virginica 192 | 7.7,2.6,6.9,2.3,Iris-virginica 193 | 6.0,2.2,5.0,1.5,Iris-virginica 194 | 6.9,3.2,5.7,2.3,Iris-virginica 195 | 5.6,2.8,4.9,2.0,Iris-virginica 196 | 7.7,2.8,6.7,2.0,Iris-virginica 197 | 6.3,2.7,4.9,1.8,Iris-virginica 198 | 6.7,3.3,5.7,2.1,Iris-virginica 199 | 7.2,3.2,6.0,1.8,Iris-virginica 200 | 6.2,2.8,4.8,1.8,Iris-virginica 201 | 6.1,3.0,4.9,1.8,Iris-virginica 202 | 6.4,2.8,5.6,2.1,Iris-virginica 203 | 7.2,3.0,5.8,1.6,Iris-virginica 204 | 7.4,2.8,6.1,1.9,Iris-virginica 205 | 7.9,3.8,6.4,2.0,Iris-virginica 206 | 6.4,2.8,5.6,2.2,Iris-virginica 207 | 6.3,2.8,5.1,1.5,Iris-virginica 208 | 6.1,2.6,5.6,1.4,Iris-virginica 209 | 7.7,3.0,6.1,2.3,Iris-virginica 210 | 6.3,3.4,5.6,2.4,Iris-virginica 211 | 6.4,3.1,5.5,1.8,Iris-virginica 212 | 6.0,3.0,4.8,1.8,Iris-virginica 213 | 6.9,3.1,5.4,2.1,Iris-virginica 214 | 6.7,3.1,5.6,2.4,Iris-virginica 215 | 6.9,3.1,5.1,2.3,Iris-virginica 216 | 5.8,2.7,5.1,1.9,Iris-virginica 217 | 6.8,3.2,5.9,2.3,Iris-virginica 218 | 6.7,3.3,5.7,2.5,Iris-virginica 219 | 6.7,3.0,5.2,2.3,Iris-virginica 220 | 6.3,2.5,5.0,1.9,Iris-virginica 221 | 6.5,3.0,5.2,2.0,Iris-virginica 222 | 6.2,3.4,5.4,2.3,Iris-virginica 223 | 5.9,3.0,5.1,1.8,Iris-virginica 224 
| 225 | -------------------------------------------------------------------------------- /code/tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Learning scikit-learn " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## An Introduction to Machine Learning in Python" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### at PyData Chicago 2016" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "%load_ext watermark\n", 33 | "%watermark -a \"Sebastian Raschka\" -u -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# Table of Contents\n", 41 | "\n", 42 | "* [1 Introduction to Machine Learning](#1-Introduction-to-Machine-Learning)\n", 43 | "* [2 Linear Regression](#2-Linear-Regression)\n", 44 | " * [Loading the dataset](#Loading-the-dataset)\n", 45 | " * [Preparing the dataset](#Preparing-the-dataset)\n", 46 | " * [Fitting the model](#Fitting-the-model)\n", 47 | " * [Evaluating the model](#Evaluating-the-model)\n", 48 | "* [3 Introduction to Classification](#3-Introduction-to-Classification)\n", 49 | " * [The Iris dataset](#The-Iris-dataset)\n", 50 | " * [Class label encoding](#Class-label-encoding)\n", 51 | " * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)\n", 52 | " * [Test/train splits](#Test/train-splits)\n", 53 | " * [Logistic Regression](#Logistic-Regression)\n", 54 | " * [K-Nearest Neighbors](#K-Nearest-Neighbors)\n", 55 | " * [3 - Exercises](#3---Exercises)\n", 56 | "* [4 - Feature Preprocessing & scikit-learn Pipelines](#4---Feature-Preprocessing-&-scikit-learn-Pipelines)\n", 57 | " * 
[Categorical features: nominal vs ordinal](#Categorical-features:-nominal-vs-ordinal)\n", 58 | " * [Normalization](#Normalization)\n", 59 | " * [Pipelines](#Pipelines)\n", 60 | " * [4 - Exercises](#4---Exercises)\n", 61 | "* [5 - Dimensionality Reduction: Feature Selection & Extraction](#5---Dimensionality-Reduction:-Feature-Selection-&-Extraction)\n", 62 | " * [Recursive Feature Elimination](#Recursive-Feature-Elimination)\n", 63 | " * [Sequential Feature Selection](#Sequential-Feature-Selection)\n", 64 | " * [Principal Component Analysis](#Principal-Component-Analysis)\n", 65 | "* [6 - Model Evaluation & Hyperparameter Tuning](#6---Model-Evaluation-&-Hyperparameter-Tuning)\n", 66 | " * [Wine Dataset](#Wine-Dataset)\n", 67 | " * [Stratified K-Fold](#Stratified-K-Fold)\n", 68 | " * [Grid Search](#Grid-Search)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "%matplotlib inline\n", 80 | "import matplotlib.pyplot as plt\n", 81 | "import numpy as np\n", 82 | "import pandas as pd" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "
" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "
" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "# 1 Introduction to Machine Learning" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "
" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# 2 Linear Regression" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Loading the dataset" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Source: R.J. Gladstone (1905). \"A Study of the Relations of the Brain to \n", 132 | "to the Size of the Head\", Biometrika, Vol. 4, pp105-123\n", 133 | "\n", 134 | "\n", 135 | "Description: Brain weight (grams) and head size (cubic cm) for 237\n", 136 | "adults classified by gender and age group.\n", 137 | "\n", 138 | "\n", 139 | "Variables/Columns\n", 140 | "- Gender (1=Male, 2=Female)\n", 141 | "- Age Range (1=20-46, 2=46+)\n", 142 | "- Head size (cm^3)\n", 143 | "- Brain weight (grams)\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "df = pd.read_csv('dataset_brain.txt', \n", 155 | " encoding='utf-8', \n", 156 | " comment='#',\n", 157 | " sep='\\s+')\n", 158 | "df.tail()" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "plt.scatter(df['head-size'], df['brain-weight'])\n", 170 | "plt.xlabel('Head size (cm^3)')\n", 171 | "plt.ylabel('Brain weight (grams)');" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Preparing the dataset" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "y = df['brain-weight'].values\n", 190 | "y.shape" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | 
}, 199 | "outputs": [], 200 | "source": [ 201 | "X = df['head-size'].values\n", 202 | "X = X[:, np.newaxis]\n", 203 | "X.shape" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "from sklearn.cross_validation import train_test_split\n", 215 | "\n", 216 | "X_train, X_test, y_train, y_test = train_test_split(\n", 217 | "    X, y, test_size=0.3, random_state=123)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "plt.scatter(X_train, y_train, c='blue', marker='o')\n", 229 | "plt.scatter(X_test, y_test, c='red', marker='s')\n", 230 | "plt.xlabel('Head size (cm^3)')\n", 231 | "plt.ylabel('Brain weight (grams)');" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "### Fitting the model" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "from sklearn.linear_model import LinearRegression\n", 250 | "\n", 251 | "lr = LinearRegression()\n", 252 | "lr.fit(X_train, y_train)\n", 253 | "y_pred = lr.predict(X_test)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "### Evaluating the model" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "res_sum_of_squares = ((y_test - y_pred) ** 2).sum()\n", 272 | "tot_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()\n", 273 | "r2_score = 1 - (res_sum_of_squares / tot_sum_of_squares)\n", 274 | "print('R2 score: %.3f' % r2_score)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | 
"metadata": { 281 | "collapsed": false 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "print('R2 score: %.3f' % lr.score(X_test, y_test))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "lr.coef_" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": { 303 | "collapsed": false 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "lr.intercept_" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": { 314 | "collapsed": false 315 | }, 316 | "outputs": [], 317 | "source": [ 318 | "min_pred = X_train.min() * lr.coef_ + lr.intercept_\n", 319 | "max_pred = X_train.max() * lr.coef_ + lr.intercept_\n", 320 | "\n", 321 | "plt.scatter(X_train, y_train, c='blue', marker='o')\n", 322 | "plt.plot([X_train.min(), X_train.max()],\n", 323 | " [min_pred, max_pred],\n", 324 | " color='red',\n", 325 | " linewidth=4)\n", 326 | "plt.xlabel('Head size (cm^3)')\n", 327 | "plt.ylabel('Brain weight (grams)');" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "
" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "# 3 Introduction to Classification" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "### The Iris dataset" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": null, 354 | "metadata": { 355 | "collapsed": false 356 | }, 357 | "outputs": [], 358 | "source": [ 359 | "df = pd.read_csv('dataset_iris.txt', \n", 360 | " encoding='utf-8', \n", 361 | " comment='#',\n", 362 | " sep=',')\n", 363 | "df.tail()" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": { 370 | "collapsed": false 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "X = df.iloc[:, :4].values \n", 375 | "y = df['class'].values\n", 376 | "np.unique(y)" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "### Class label encoding" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": { 390 | "collapsed": false 391 | }, 392 | "outputs": [], 393 | "source": [ 394 | "from sklearn.preprocessing import LabelEncoder\n", 395 | "\n", 396 | "l_encoder = LabelEncoder()\n", 397 | "l_encoder.fit(y)\n", 398 | "l_encoder.classes_" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "collapsed": false 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "y_enc = l_encoder.transform(y)\n", 410 | "np.unique(y_enc)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": { 417 | "collapsed": false 418 | }, 419 | "outputs": [], 420 | "source": [ 421 | "np.unique(l_encoder.inverse_transform(y_enc))" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### Scikit-learn's in-build datasets" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | 
"execution_count": null, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "from sklearn.datasets import load_iris\n", 440 | "\n", 441 | "iris = load_iris()\n", 442 | "print(iris['DESCR'])" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Test/train splits" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "X, y = iris.data[:, :2], iris.target\n", 461 | "# ! We only use 2 features for visual purposes\n", 462 | "\n", 463 | "print('Class labels:', np.unique(y))\n", 464 | "print('Class proportions:', np.bincount(y))" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": { 471 | "collapsed": false 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "from sklearn.cross_validation import train_test_split\n", 476 | "\n", 477 | "X_train, X_test, y_train, y_test = train_test_split(\n", 478 | " X, y, test_size=0.3, random_state=123)\n", 479 | "\n", 480 | "print('Class labels:', np.unique(y_train))\n", 481 | "print('Class proportions:', np.bincount(y_train))" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": { 488 | "collapsed": false 489 | }, 490 | "outputs": [], 491 | "source": [ 492 | "from sklearn.cross_validation import train_test_split\n", 493 | "\n", 494 | "X_train, X_test, y_train, y_test = train_test_split(\n", 495 | " X, y, test_size=0.3, random_state=123,\n", 496 | " stratify=y)\n", 497 | "\n", 498 | "print('Class labels:', np.unique(y_train))\n", 499 | "print('Class proportions:', np.bincount(y_train))" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Logistic Regression" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": { 
513 | "collapsed": false 514 | }, 515 | "outputs": [], 516 | "source": [ 517 | "from sklearn.linear_model import LogisticRegression\n", 518 | "\n", 519 | "lr = LogisticRegression(solver='newton-cg', \n", 520 | "                        multi_class='multinomial', \n", 521 | "                        random_state=1)\n", 522 | "\n", 523 | "lr.fit(X_train, y_train)\n", 524 | "print('Test accuracy %.2f' % lr.score(X_test, y_test))" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "collapsed": false 532 | }, 533 | "outputs": [], 534 | "source": [ 535 | "from mlxtend.evaluate import plot_decision_regions\n", 536 | "\n", 537 | "# plot the decision regions and highlight the test samples\n", 538 | "\n", 539 | "plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)\n", 540 | "plt.xlabel('sepal length [cm]')\n", 541 | "plt.ylabel('sepal width [cm]');" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### K-Nearest Neighbors" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": false 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "from sklearn.neighbors import KNeighborsClassifier\n", 560 | "\n", 561 | "kn = KNeighborsClassifier(n_neighbors=4)\n", 562 | "\n", 563 | "kn.fit(X_train, y_train)\n", 564 | "print('Test accuracy %.2f' % kn.score(X_test, y_test))" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [], 574 | "source": [ 575 | "plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)\n", 576 | "plt.xlabel('sepal length [cm]')\n", 577 | "plt.ylabel('sepal width [cm]');" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "### 3 - Exercises" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "- Which of the two models above would you prefer if you had to choose? 
Why?\n", 592 | "- What would be possible ways to resolve ties in KNN when `n_neighbors` is an even number?\n", 593 | "- Can you find the right spot in the scikit-learn documentation to read about how scikit-learn handles this?\n", 594 | "- Train & evaluate the Logistic Regression and KNN algorithms on the 4-dimensional iris datasets. \n", 595 | " - What performance do you observe? \n", 596 | " - Why is it different vs. using only 2 dimensions? \n", 597 | " - Would adding more dimensions help?" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "
" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "# 4 - Feature Preprocessing & scikit-learn Pipelines" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "### Categorical features: nominal vs ordinal" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": { 625 | "collapsed": false 626 | }, 627 | "outputs": [], 628 | "source": [ 629 | "import pandas as pd\n", 630 | "\n", 631 | "df = pd.DataFrame([\n", 632 | " ['green', 'M', 10.0], \n", 633 | " ['red', 'L', 13.5], \n", 634 | " ['blue', 'XL', 15.3]])\n", 635 | "\n", 636 | "df.columns = ['color', 'size', 'prize']\n", 637 | "df" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "collapsed": false 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "from sklearn.feature_extraction import DictVectorizer\n", 649 | "\n", 650 | "dvec = DictVectorizer(sparse=False)\n", 651 | "\n", 652 | "X = dvec.fit_transform(df.transpose().to_dict().values())\n", 653 | "X" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [], 663 | "source": [ 664 | "size_mapping = {\n", 665 | " 'XL': 3,\n", 666 | " 'L': 2,\n", 667 | " 'M': 1}\n", 668 | "\n", 669 | "df['size'] = df['size'].map(size_mapping)\n", 670 | "df" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "metadata": { 677 | "collapsed": false 678 | }, 679 | "outputs": [], 680 | "source": [ 681 | "X = dvec.fit_transform(df.transpose().to_dict().values())\n", 682 | "X" 683 | ] 684 | }, 685 | { 686 | "cell_type": "markdown", 687 | "metadata": {}, 688 | "source": [ 689 | "### Normalization" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "collapsed": false 697 | }, 698 | "outputs": 
[], 699 | "source": [ 700 | "df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])\n", 701 | "df" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": { 708 | "collapsed": false 709 | }, 710 | "outputs": [], 711 | "source": [ 712 | "from sklearn.preprocessing import MinMaxScaler\n", 713 | "from sklearn.preprocessing import StandardScaler\n", 714 | "\n", 715 | "mmxsc = MinMaxScaler()\n", 716 | "stdsc = StandardScaler()\n", 717 | "\n", 718 | "X = df['feature'].values[:, np.newaxis]\n", 719 | "\n", 720 | "df['minmax'] = mmxsc.fit_transform(X)\n", 721 | "df['z-score'] = stdsc.fit_transform(X)\n", 722 | "\n", 723 | "df" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "### Pipelines" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": { 737 | "collapsed": false 738 | }, 739 | "outputs": [], 740 | "source": [ 741 | "from sklearn.pipeline import make_pipeline\n", 742 | "from sklearn.cross_validation import train_test_split\n", 743 | "from sklearn.datasets import load_iris\n", 744 | "\n", 745 | "iris = load_iris()\n", 746 | "X, y = iris.data, iris.target\n", 747 | "\n", 748 | "X_train, X_test, y_train, y_test = train_test_split(\n", 749 | " X, y, test_size=0.3, random_state=123,\n", 750 | " stratify=y)\n", 751 | "\n", 752 | "lr = LogisticRegression(solver='newton-cg', \n", 753 | " multi_class='multinomial', \n", 754 | " random_state=1)\n", 755 | "\n", 756 | "lr_pipe = make_pipeline(StandardScaler(), lr)\n", 757 | "\n", 758 | "lr_pipe.fit(X_train, y_train)\n", 759 | "lr_pipe.score(X_test, y_test)" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "metadata": { 766 | "collapsed": false 767 | }, 768 | "outputs": [], 769 | "source": [ 770 | "lr_pipe.named_steps" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": { 777 | 
"collapsed": false 778 | }, 779 | "outputs": [], 780 | "source": [ 781 | "lr_pipe.named_steps['standardscaler'].transform(X[:5])" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "### 4 - Exercises" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "- Why is it important that we scale test and training sets separately?\n", 796 | "- Fit a KNN classifier to the standardized Iris dataset. Do you notice difference in the predictive performance of the model compared to the non-standardized one? Why or why not?" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "
" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": {}, 809 | "source": [ 810 | "# 5 - Dimensionality Reduction: Feature Selection & Extraction" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": { 817 | "collapsed": true 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "from sklearn.cross_validation import train_test_split\n", 822 | "from sklearn.datasets import load_iris\n", 823 | "\n", 824 | "iris = load_iris()\n", 825 | "X, y = iris.data, iris.target\n", 826 | "\n", 827 | "X_train, X_test, y_train, y_test = train_test_split(\n", 828 | " X, y, test_size=0.3, random_state=123, stratify=y)" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "### Recursive Feature Elimination" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": null, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [], 845 | "source": [ 846 | "from sklearn.linear_model import LogisticRegression\n", 847 | "from sklearn.feature_selection import RFECV\n", 848 | "\n", 849 | "lr = LogisticRegression()\n", 850 | "rfe = RFECV(lr, step=1, cv=5, scoring='accuracy')\n", 851 | "\n", 852 | "rfe.fit(X_train, y_train)\n", 853 | "print('Number of features:', rfe.n_features_)\n", 854 | "print('Feature ranking', rfe.ranking_)" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "### Sequential Feature Selection" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": null, 867 | "metadata": { 868 | "collapsed": false 869 | }, 870 | "outputs": [], 871 | "source": [ 872 | "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", 873 | "from mlxtend.feature_selection import plot_sequential_feature_selection as plot_sfs\n", 874 | "\n", 875 | "\n", 876 | "sfs = SFS(lr, k_features=4, forward=True, floating=False, cv=5)\n", 877 | "\n", 878 | "sfs.fit(X_train, 
y_train)\n", 879 | "sfs = SFS(lr, \n", 880 | " k_features=4, \n", 881 | " forward=True, \n", 882 | " floating=False, \n", 883 | " scoring='accuracy',\n", 884 | " cv=2)\n", 885 | "\n", 886 | "sfs = sfs.fit(X, y)\n", 887 | "fig1 = plot_sfs(sfs.get_metric_dict())\n", 888 | "\n", 889 | "plt.ylim([0.8, 1])\n", 890 | "plt.title('Sequential Forward Selection (w. StdDev)')\n", 891 | "plt.grid()" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": null, 897 | "metadata": { 898 | "collapsed": false 899 | }, 900 | "outputs": [], 901 | "source": [ 902 | "sfs.subsets_" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "### Principal Component Analysis" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": null, 915 | "metadata": { 916 | "collapsed": false 917 | }, 918 | "outputs": [], 919 | "source": [ 920 | "from sklearn.decomposition import PCA\n", 921 | "from sklearn.preprocessing import StandardScaler\n", 922 | "\n", 923 | "sc = StandardScaler()\n", 924 | "pca = PCA(n_components=4)\n", 925 | "\n", 926 | "pca.fit_transform(X_train, y_train)\n", 927 | "\n", 928 | "var_exp = pca.explained_variance_ratio_\n", 929 | "cum_var_exp = np.cumsum(var_exp)\n", 930 | "\n", 931 | "idx = [i for i in range(len(var_exp))]\n", 932 | "labels = [str(i + 1) for i in idx]\n", 933 | "with plt.style.context('seaborn-whitegrid'):\n", 934 | " plt.bar(range(4), var_exp, alpha=0.5, align='center',\n", 935 | " label='individual explained variance')\n", 936 | " plt.step(range(4), cum_var_exp, where='mid',\n", 937 | " label='cumulative explained variance')\n", 938 | " plt.ylabel('Explained variance ratio')\n", 939 | " plt.xlabel('Principal components')\n", 940 | " plt.xticks(idx, labels)\n", 941 | " plt.legend(loc='center right')\n", 942 | " plt.tight_layout()\n", 943 | " plt.show()" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": { 950 | "collapsed": 
false 951 | }, 952 | "outputs": [], 953 | "source": [ 954 | "X_train_pca = pca.transform(X_train)\n", 955 | "\n", 956 | "for lab, col, mar in zip((0, 1, 2),\n", 957 | " ('blue', 'red', 'green'),\n", 958 | " ('o', 's', '^')):\n", 959 | " plt.scatter(X_train_pca[y_train == lab, 0],\n", 960 | " X_train_pca[y_train == lab, 1],\n", 961 | " label=lab,\n", 962 | " marker=mar,\n", 963 | " c=col)\n", 964 | "plt.xlabel('Principal Component 1')\n", 965 | "plt.ylabel('Principal Component 2')\n", 966 | "plt.legend(loc='lower right')\n", 967 | "plt.tight_layout()" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "
" 975 | ] 976 | }, 977 | { 978 | "cell_type": "markdown", 979 | "metadata": {}, 980 | "source": [ 981 | "# 6 - Model Evaluation & Hyperparameter Tuning" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "metadata": {}, 987 | "source": [ 988 | "### Wine Dataset" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": null, 994 | "metadata": { 995 | "collapsed": true 996 | }, 997 | "outputs": [], 998 | "source": [ 999 | "from mlxtend.data import wine_data\n", 1000 | "\n", 1001 | "X, y = wine_data()" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Wine dataset.\n", 1009 | "\n", 1010 | "Source : https://archive.ics.uci.edu/ml/datasets/Wine\n", 1011 | "\n", 1012 | "Number of samples : 178\n", 1013 | "\n", 1014 | "Class labels : {0, 1, 2}, distribution: [59, 71, 48]\n", 1015 | "\n", 1016 | "Dataset Attributes:\n", 1017 | "\n", 1018 | "1. Alcohol\n", 1019 | "2. Malic acid\n", 1020 | "3. Ash\n", 1021 | "4. Alcalinity of ash\n", 1022 | "5. Magnesium\n", 1023 | "6. Total phenols\n", 1024 | "7. Flavanoids\n", 1025 | "8. Nonflavanoid phenols\n", 1026 | "9. Proanthocyanins\n", 1027 | "10. Color intensity\n", 1028 | "11. Hue\n", 1029 | "12. OD280/OD315 of diluted wines\n", 1030 | "13. 
Proline\n" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "markdown", 1035 | "metadata": {}, 1036 | "source": [ 1037 | "### Stratified K-Fold" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": null, 1043 | "metadata": { 1044 | "collapsed": false 1045 | }, 1046 | "outputs": [], 1047 | "source": [ 1048 | "from sklearn.preprocessing import StandardScaler\n", 1049 | "from sklearn.decomposition import PCA\n", 1050 | "from sklearn.pipeline import make_pipeline\n", 1051 | "from sklearn.cross_validation import StratifiedKFold\n", 1052 | "from sklearn.neighbors import KNeighborsClassifier as KNN\n", 1053 | "\n", 1054 | "X_train, X_test, y_train, y_test = train_test_split(\n", 1055 | " X, y, test_size=0.3, random_state=123, stratify=y)\n", 1056 | "\n", 1057 | "pipe_kn = make_pipeline(StandardScaler(), \n", 1058 | " PCA(n_components=1),\n", 1059 | " KNN(n_neighbors=3))\n", 1060 | "\n", 1061 | "kfold = StratifiedKFold(y=y_train, \n", 1062 | " n_folds=10,\n", 1063 | " random_state=1)\n", 1064 | "\n", 1065 | "scores = []\n", 1066 | "for k, (train, test) in enumerate(kfold):\n", 1067 | " pipe_kn.fit(X_train[train], y_train[train])\n", 1068 | " score = pipe_kn.score(X_train[test], y_train[test])\n", 1069 | " scores.append(score)\n", 1070 | " print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,\n", 1071 | " np.bincount(y_train[train]), score))\n", 1072 | " \n", 1073 | "print('\\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": null, 1079 | "metadata": { 1080 | "collapsed": false 1081 | }, 1082 | "outputs": [], 1083 | "source": [ 1084 | "from sklearn.cross_validation import cross_val_score\n", 1085 | "\n", 1086 | "scores = cross_val_score(estimator=pipe_kn,\n", 1087 | " X=X_train,\n", 1088 | " y=y_train,\n", 1089 | " cv=10,\n", 1090 | " n_jobs=2)\n", 1091 | "\n", 1092 | "print('CV accuracy scores: %s' % scores)\n", 1093 | "print('CV accuracy: %.3f +/- 
%.3f' % (np.mean(scores), np.std(scores)))" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "### Grid Search" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "code", 1105 | "execution_count": null, 1106 | "metadata": { 1107 | "collapsed": false 1108 | }, 1109 | "outputs": [], 1110 | "source": [ 1111 | "pipe_kn.named_steps" 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "code", 1116 | "execution_count": null, 1117 | "metadata": { 1118 | "collapsed": false 1119 | }, 1120 | "outputs": [], 1121 | "source": [ 1122 | "from sklearn.grid_search import GridSearchCV\n", 1123 | "\n", 1124 | "\n", 1125 | "param_grid = {'pca__n_components': [1, 2, 3, 4, 5, 6, None],\n", 1126 | " 'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11]}\n", 1127 | "\n", 1128 | "gs = GridSearchCV(estimator=pipe_kn, \n", 1129 | " param_grid=param_grid, \n", 1130 | " scoring='accuracy', \n", 1131 | " cv=10,\n", 1132 | " n_jobs=2,\n", 1133 | " refit=True)\n", 1134 | "gs = gs.fit(X_train, y_train)\n", 1135 | "print(gs.best_score_)\n", 1136 | "print(gs.best_params_)" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": null, 1142 | "metadata": { 1143 | "collapsed": false 1144 | }, 1145 | "outputs": [], 1146 | "source": [ 1147 | "gs.score(X_test, y_test)" 1148 | ] 1149 | } 1150 | ], 1151 | "metadata": { 1152 | "anaconda-cloud": {}, 1153 | "kernelspec": { 1154 | "display_name": "Python 3", 1155 | "language": "python", 1156 | "name": "python3" 1157 | }, 1158 | "language_info": { 1159 | "codemirror_mode": { 1160 | "name": "ipython", 1161 | "version": 3 1162 | }, 1163 | "file_extension": ".py", 1164 | "mimetype": "text/x-python", 1165 | "name": "python", 1166 | "nbconvert_exporter": "python", 1167 | "pygments_lexer": "ipython3", 1168 | "version": "3.5.2" 1169 | } 1170 | }, 1171 | "nbformat": 4, 1172 | "nbformat_minor": 0 1173 | } 1174 | -------------------------------------------------------------------------------- 
/images/checkenv-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/checkenv-example.png -------------------------------------------------------------------------------- /images/github-download.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/github-download.png -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/logo.png -------------------------------------------------------------------------------- /slides/PyDataChi2016_sr.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/slides/PyDataChi2016_sr.pdf --------------------------------------------------------------------------------