├── .gitignore ├── LICENSE ├── README.md ├── code ├── check_environment.ipynb ├── dataset_brain.txt ├── dataset_iris.txt └── tutorial.ipynb ├── images ├── checkenv-example.png ├── github-download.png └── logo.png └── slides └── PyDataChi2016_sr.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 
| 3 | Copyright (c) 2016 Sebastian Raschka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pydata-chicago2016-ml-tutorial 2 | 3 | ### Learning scikit-learn -- An Introduction to Machine Learning in Python @ PyData Chicago 2016 4 | 5 | *This tutorial provides you with an introduction to machine learning in Python using the popular scikit-learn library.* 6 | 7 | ![](images/logo.png) 8 | 9 | This tutorial will teach you the basics of scikit-learn. I will give you a brief overview of the basic concepts of classification and regression analysis and show you how to build powerful predictive models from labeled data. 
Although it's not a requirement for attending this tutorial, I highly recommend checking out the accompanying GitHub repository at https://github.com/rasbt/pydata-chicago2016-ml-tutorial 1-2 days before the tutorial. During the session, we will not only talk about scikit-learn, but we will also go over some live code examples to get the knack of scikit-learn's API. 10 | 11 | If you have any questions about the tutorial, please don't hesitate to contact me. You can either open an "issue" on GitHub or reach me via email at mail_at_sebastianraschka.com. I am looking forward to meeting you soon! 12 | 13 | --- 14 | 15 | - View the presentation slides [here](slides/PyDataChi2016_sr.pdf) / on [SpeakerDeck](https://speakerdeck.com/rasbt/learning-scikit-learn-an-introduction-to-machine-learning-in-python-at-pydata-chicago-2016) 16 | - View the code notebook [here](code/tutorial.ipynb) / on [nbviewer](http://nbviewer.jupyter.org/github/rasbt/pydata-chicago2016-ml-tutorial/blob/master/code/tutorial.ipynb) 17 | - [Video recording](https://www.youtube.com/watch?v=9fOWryQq9J8) of the talk on YouTube 18 | 19 | --- 20 | 21 | # Schedule 22 | 23 | This repository will contain the teaching material and other info for the *Learning scikit-learn* tutorial at the [PyData Chicago 2016 Conference](http://pydata.org/chicago2016/) held Aug 26-28. 24 | 25 | - When? **Fri Aug. 26, 2016 at 9:00 - 10:30 am** 26 | - Where? **Room 1** 27 | 28 | (I recommend watching the PyData [schedule](http://pydata.org/chicago2016/schedule/) for updates). 
29 | 30 | # Obtaining the Tutorial Material 31 | 32 | If you already have a GitHub account, probably the most convenient way to obtain the tutorial material is to clone this GitHub repository via `git clone https://github.com/rasbt/pydata-chicago2016-ml-tutorial` and fetch updates via `git pull origin master`. 33 | 34 | If you don’t have a GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/rasbt/pydata-chicago2016-ml-tutorial) in your browser and clicking the green “Download” button in the upper right. 35 | 36 | ![](images/github-download.png) 37 | 38 | 39 | # Installation Notes and Requirements 40 | 41 | Please note that installing the following libraries and running the code alongside is **not a hard requirement for attending the tutorial session**; you will be able to follow along just fine (and probably be less distracted :)). Now, the tutorial code should be compatible with both Python 2.7 and Python 3.x but will require recent installations of 42 | 43 | - [NumPy](http://www.numpy.org) 44 | - [SciPy](http://www.scipy.org) 45 | - [matplotlib](http://matplotlib.org) 46 | - [pandas](http://pandas.pydata.org) 47 | - [scikit-learn](http://scikit-learn.org/stable/) 48 | - [IPython](http://ipython.readthedocs.org/en/stable/) 49 | - [Jupyter Notebook](http://jupyter.org) 50 | - [watermark](https://pypi.python.org/pypi/watermark) 51 | - [mlxtend](http://rasbt.github.io/mlxtend/) 52 | 53 | 54 | Please make sure that you have these libraries installed in your current Python environment prior to attending the tutorial if you want to execute the code examples that are run during the talk. Please also note that executing these examples during/after the talk is merely a suggestion, not a requirement. 
**I highly recommend opening the [code/check_environment.ipynb](code/check_environment.ipynb) notebook in Jupyter**, for instance by launching the notebook via 55 | 56 | ```bash 57 | jupyter notebook /pydata-chicago2016-ml-tutorial/code/check_environment.ipynb 58 | ``` 59 | and executing the code cells: 60 | 61 | ![](images/checkenv-example.png) 62 | -------------------------------------------------------------------------------- /code/check_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Jupyter Notebook for Checking Python Package Requirements" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "[OK] numpy 1.11.1\n", 22 | "[OK] scipy 0.17.1\n", 23 | "[OK] matplotlib 1.5.1\n", 24 | "[OK] sklearn 0.17.1\n", 25 | "[OK] pandas 0.18.1\n", 26 | "[OK] mlxtend 0.4.2\n" 27 | ] 28 | } 29 | ], 30 | "source": [ 31 | "def get_packages(pkgs):\n", 32 | "    versions = []\n", 33 | "    for p in pkgs:\n", 34 | "        try:\n", 35 | "            imported = __import__(p)\n", 36 | "            try:\n", 37 | "                versions.append(imported.__version__)\n", 38 | "            except AttributeError:\n", 39 | "                try:\n", 40 | "                    versions.append(imported.version)\n", 41 | "                except AttributeError:\n", 42 | "                    try:\n", 43 | "                        versions.append(imported.version_info)\n", 44 | "                    except AttributeError:\n", 45 | "                        versions.append('0.0')\n", 46 | "        except ImportError:\n", 47 | "            print('[FAIL]: %s is not installed' % p)\n", 48 | "    return versions\n", 49 | "    \n", 50 | "packages = ['numpy', 'scipy', 'matplotlib', 'sklearn', 'pandas', 'mlxtend']\n", 51 | "suggested_v = ['1.10', '0.17', '1.5.1', '0.17.1', '0.17.1', '0.4.2']\n", 52 | "versions = get_packages(packages)\n", 53 | "\n", 54 | "for p, v, s in zip(packages, versions, 
suggested_v):\n", 55 | " if v < s:\n", 56 | " print('[FAIL] %s %s, please upgrade to >= %s' % (p, v, s))\n", 57 | " else:\n", 58 | " print('[OK] %s %s' % (p, v))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "2016-08-24 \n", 73 | "\n", 74 | "numpy 1.11.1\n", 75 | "scipy 0.17.1\n", 76 | "matplotlib 1.5.1\n", 77 | "sklearn 0.17.1\n", 78 | "pandas 0.18.1\n", 79 | "mlxtend 0.4.2\n" 80 | ] 81 | } 82 | ], 83 | "source": [ 84 | "%load_ext watermark\n", 85 | "%watermark -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" 86 | ] 87 | } 88 | ], 89 | "metadata": { 90 | "anaconda-cloud": {}, 91 | "kernelspec": { 92 | "display_name": "Python [Root]", 93 | "language": "python", 94 | "name": "Python [Root]" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 3 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython3", 106 | "version": "3.5.2" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 0 111 | } 112 | -------------------------------------------------------------------------------- /code/dataset_brain.txt: -------------------------------------------------------------------------------- 1 | # Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to 2 | # to the Size of the Head", Biometrika, Vol. 4, pp105-123 3 | # 4 | # Download link: http://www.stat.ufl.edu/~winner/data/brainhead.txt 5 | # 6 | # Description: Brain weight (grams) and head size (cubic cm) for 237 7 | # adults classified by gender and age group. 
8 | # 9 | # Variables/Columns 10 | # Gender 8 /* 1=Male, 2=Female */ 11 | # Age Range 16 /* 1=20-46, 2=46+ */ 12 | # Head size (cm^3) 21-24 13 | # Brain weight (grams) 29-32 14 | # 15 | gender age-group head-size brain-weight 16 | 1 1 4512 1530 17 | 1 1 3738 1297 18 | 1 1 4261 1335 19 | 1 1 3777 1282 20 | 1 1 4177 1590 21 | 1 1 3585 1300 22 | 1 1 3785 1400 23 | 1 1 3559 1255 24 | 1 1 3613 1355 25 | 1 1 3982 1375 26 | 1 1 3443 1340 27 | 1 1 3993 1380 28 | 1 1 3640 1355 29 | 1 1 4208 1522 30 | 1 1 3832 1208 31 | 1 1 3876 1405 32 | 1 1 3497 1358 33 | 1 1 3466 1292 34 | 1 1 3095 1340 35 | 1 1 4424 1400 36 | 1 1 3878 1357 37 | 1 1 4046 1287 38 | 1 1 3804 1275 39 | 1 1 3710 1270 40 | 1 1 4747 1635 41 | 1 1 4423 1505 42 | 1 1 4036 1490 43 | 1 1 4022 1485 44 | 1 1 3454 1310 45 | 1 1 4175 1420 46 | 1 1 3787 1318 47 | 1 1 3796 1432 48 | 1 1 4103 1364 49 | 1 1 4161 1405 50 | 1 1 4158 1432 51 | 1 1 3814 1207 52 | 1 1 3527 1375 53 | 1 1 3748 1350 54 | 1 1 3334 1236 55 | 1 1 3492 1250 56 | 1 1 3962 1350 57 | 1 1 3505 1320 58 | 1 1 4315 1525 59 | 1 1 3804 1570 60 | 1 1 3863 1340 61 | 1 1 4034 1422 62 | 1 1 4308 1506 63 | 1 1 3165 1215 64 | 1 1 3641 1311 65 | 1 1 3644 1300 66 | 1 1 3891 1224 67 | 1 1 3793 1350 68 | 1 1 4270 1335 69 | 1 1 4063 1390 70 | 1 1 4012 1400 71 | 1 1 3458 1225 72 | 1 1 3890 1310 73 | 1 2 4166 1560 74 | 1 2 3935 1330 75 | 1 2 3669 1222 76 | 1 2 3866 1415 77 | 1 2 3393 1175 78 | 1 2 4442 1330 79 | 1 2 4253 1485 80 | 1 2 3727 1470 81 | 1 2 3329 1135 82 | 1 2 3415 1310 83 | 1 2 3372 1154 84 | 1 2 4430 1510 85 | 1 2 4381 1415 86 | 1 2 4008 1468 87 | 1 2 3858 1390 88 | 1 2 4121 1380 89 | 1 2 4057 1432 90 | 1 2 3824 1240 91 | 1 2 3394 1195 92 | 1 2 3558 1225 93 | 1 2 3362 1188 94 | 1 2 3930 1252 95 | 1 2 3835 1315 96 | 1 2 3830 1245 97 | 1 2 3856 1430 98 | 1 2 3249 1279 99 | 1 2 3577 1245 100 | 1 2 3933 1309 101 | 1 2 3850 1412 102 | 1 2 3309 1120 103 | 1 2 3406 1220 104 | 1 2 3506 1280 105 | 1 2 3907 1440 106 | 1 2 4160 1370 107 | 1 2 3318 1192 108 | 1 2 3662 
1230 109 | 1 2 3899 1346 110 | 1 2 3700 1290 111 | 1 2 3779 1165 112 | 1 2 3473 1240 113 | 1 2 3490 1132 114 | 1 2 3654 1242 115 | 1 2 3478 1270 116 | 1 2 3495 1218 117 | 1 2 3834 1430 118 | 1 2 3876 1588 119 | 1 2 3661 1320 120 | 1 2 3618 1290 121 | 1 2 3648 1260 122 | 1 2 4032 1425 123 | 1 2 3399 1226 124 | 1 2 3916 1360 125 | 1 2 4430 1620 126 | 1 2 3695 1310 127 | 1 2 3524 1250 128 | 1 2 3571 1295 129 | 1 2 3594 1290 130 | 1 2 3383 1290 131 | 1 2 3499 1275 132 | 1 2 3589 1250 133 | 1 2 3900 1270 134 | 1 2 4114 1362 135 | 1 2 3937 1300 136 | 1 2 3399 1173 137 | 1 2 4200 1256 138 | 1 2 4488 1440 139 | 1 2 3614 1180 140 | 1 2 4051 1306 141 | 1 2 3782 1350 142 | 1 2 3391 1125 143 | 1 2 3124 1165 144 | 1 2 4053 1312 145 | 1 2 3582 1300 146 | 1 2 3666 1270 147 | 1 2 3532 1335 148 | 1 2 4046 1450 149 | 1 2 3667 1310 150 | 2 1 2857 1027 151 | 2 1 3436 1235 152 | 2 1 3791 1260 153 | 2 1 3302 1165 154 | 2 1 3104 1080 155 | 2 1 3171 1127 156 | 2 1 3572 1270 157 | 2 1 3530 1252 158 | 2 1 3175 1200 159 | 2 1 3438 1290 160 | 2 1 3903 1334 161 | 2 1 3899 1380 162 | 2 1 3401 1140 163 | 2 1 3267 1243 164 | 2 1 3451 1340 165 | 2 1 3090 1168 166 | 2 1 3413 1322 167 | 2 1 3323 1249 168 | 2 1 3680 1321 169 | 2 1 3439 1192 170 | 2 1 3853 1373 171 | 2 1 3156 1170 172 | 2 1 3279 1265 173 | 2 1 3707 1235 174 | 2 1 4006 1302 175 | 2 1 3269 1241 176 | 2 1 3071 1078 177 | 2 1 3779 1520 178 | 2 1 3548 1460 179 | 2 1 3292 1075 180 | 2 1 3497 1280 181 | 2 1 3082 1180 182 | 2 1 3248 1250 183 | 2 1 3358 1190 184 | 2 1 3803 1374 185 | 2 1 3566 1306 186 | 2 1 3145 1202 187 | 2 1 3503 1240 188 | 2 1 3571 1316 189 | 2 1 3724 1280 190 | 2 1 3615 1350 191 | 2 1 3203 1180 192 | 2 1 3609 1210 193 | 2 1 3561 1127 194 | 2 1 3979 1324 195 | 2 1 3533 1210 196 | 2 1 3689 1290 197 | 2 1 3158 1100 198 | 2 1 4005 1280 199 | 2 1 3181 1175 200 | 2 1 3479 1160 201 | 2 1 3642 1205 202 | 2 1 3632 1163 203 | 2 2 3069 1022 204 | 2 2 3394 1243 205 | 2 2 3703 1350 206 | 2 2 3165 1237 207 | 2 2 3354 1204 208 | 2 2 3000 
1090 209 | 2 2 3687 1355 210 | 2 2 3556 1250 211 | 2 2 2773 1076 212 | 2 2 3058 1120 213 | 2 2 3344 1220 214 | 2 2 3493 1240 215 | 2 2 3297 1220 216 | 2 2 3360 1095 217 | 2 2 3228 1235 218 | 2 2 3277 1105 219 | 2 2 3851 1405 220 | 2 2 3067 1150 221 | 2 2 3692 1305 222 | 2 2 3402 1220 223 | 2 2 3995 1296 224 | 2 2 3318 1175 225 | 2 2 2720 955 226 | 2 2 2937 1070 227 | 2 2 3580 1320 228 | 2 2 2939 1060 229 | 2 2 2989 1130 230 | 2 2 3586 1250 231 | 2 2 3156 1225 232 | 2 2 3246 1180 233 | 2 2 3170 1178 234 | 2 2 3268 1142 235 | 2 2 3389 1130 236 | 2 2 3381 1185 237 | 2 2 2864 1012 238 | 2 2 3740 1280 239 | 2 2 3479 1103 240 | 2 2 3647 1408 241 | 2 2 3716 1300 242 | 2 2 3284 1246 243 | 2 2 4204 1380 244 | 2 2 3735 1350 245 | 2 2 3218 1060 246 | 2 2 3685 1350 247 | 2 2 3704 1220 248 | 2 2 3214 1110 249 | 2 2 3394 1215 250 | 2 2 3233 1104 251 | 2 2 3352 1170 252 | 2 2 3391 1120 -------------------------------------------------------------------------------- /code/dataset_iris.txt: -------------------------------------------------------------------------------- 1 | # Download source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data 2 | # 3 | # 1. Title: Iris Plants Database 4 | # Updated Sept 21 by C.Blake - Added discrepency information 5 | # 6 | # 2. Sources: 7 | # (a) Creator: R.A. Fisher 8 | # (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) 9 | # (c) Date: July, 1988 10 | # 11 | # 3. Past Usage: 12 | # - Publications: too many to mention!!! Here are a few. 13 | # 1. Fisher,R.A. "The use of multiple measurements in taxonomic problems" 14 | # Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions 15 | # to Mathematical Statistics" (John Wiley, NY, 1950). 16 | # 2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis. 17 | # (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. 18 | # 3. Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System 19 | # Structure and Classification Rule for Recognition in Partially Exposed 20 | # Environments". IEEE Transactions on Pattern Analysis and Machine 21 | # Intelligence, Vol. PAMI-2, No. 1, 67-71. 22 | # -- Results: 23 | # -- very low misclassification rates (0% for the setosa class) 24 | # 4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE 25 | # Transactions on Information Theory, May 1972, 431-433. 26 | # -- Results: 27 | # -- very low misclassification rates again 28 | # 5. See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II 29 | # conceptual clustering system finds 3 classes in the data. 30 | # 31 | # 4. Relevant Information: 32 | # --- This is perhaps the best known database to be found in the pattern 33 | # recognition literature. Fisher's paper is a classic in the field 34 | # and is referenced frequently to this day. (See Duda & Hart, for 35 | # example.) The data set contains 3 classes of 50 instances each, 36 | # where each class refers to a type of iris plant. One class is 37 | # linearly separable from the other 2; the latter are NOT linearly 38 | # separable from each other. 39 | # --- Predicted attribute: class of iris plant. 40 | # --- This is an exceedingly simple domain. 41 | # --- This data differs from the data presented in Fishers article 42 | # (identified by Steve Chadwick, spchadwick@espeedaz.net ) 43 | # The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" 44 | # where the error is in the fourth feature. 45 | # The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" 46 | # where the errors are in the second and third features. 47 | # 48 | # 5. Number of Instances: 150 (50 in each of three classes) 49 | # 50 | # 6. Number of Attributes: 4 numeric, predictive attributes and the class 51 | # 52 | # 7. Attribute Information: 53 | # 1. sepal length in cm 54 | # 2. sepal width in cm 55 | # 3. petal length in cm 56 | # 4. petal width in cm 57 | # 5. 
class: 58 | # -- Iris Setosa 59 | # -- Iris Versicolour 60 | # -- Iris Virginica 61 | # 62 | # 8. Missing Attribute Values: None 63 | # 64 | # Summary Statistics: 65 | # Min Max Mean SD Class Correlation 66 | # sepal length: 4.3 7.9 5.84 0.83 0.7826 67 | # sepal width: 2.0 4.4 3.05 0.43 -0.4194 68 | # petal length: 1.0 6.9 3.76 1.76 0.9490 (high!) 69 | # petal width: 0.1 2.5 1.20 0.76 0.9565 (high!) 70 | # 71 | # 9. Class Distribution: 33.3% for each of 3 classes. 72 | # 73 | sepal_length,sepal_width,petal_length,petal_width,class 74 | 5.1,3.5,1.4,0.2,Iris-setosa 75 | 4.9,3.0,1.4,0.2,Iris-setosa 76 | 4.7,3.2,1.3,0.2,Iris-setosa 77 | 4.6,3.1,1.5,0.2,Iris-setosa 78 | 5.0,3.6,1.4,0.2,Iris-setosa 79 | 5.4,3.9,1.7,0.4,Iris-setosa 80 | 4.6,3.4,1.4,0.3,Iris-setosa 81 | 5.0,3.4,1.5,0.2,Iris-setosa 82 | 4.4,2.9,1.4,0.2,Iris-setosa 83 | 4.9,3.1,1.5,0.1,Iris-setosa 84 | 5.4,3.7,1.5,0.2,Iris-setosa 85 | 4.8,3.4,1.6,0.2,Iris-setosa 86 | 4.8,3.0,1.4,0.1,Iris-setosa 87 | 4.3,3.0,1.1,0.1,Iris-setosa 88 | 5.8,4.0,1.2,0.2,Iris-setosa 89 | 5.7,4.4,1.5,0.4,Iris-setosa 90 | 5.4,3.9,1.3,0.4,Iris-setosa 91 | 5.1,3.5,1.4,0.3,Iris-setosa 92 | 5.7,3.8,1.7,0.3,Iris-setosa 93 | 5.1,3.8,1.5,0.3,Iris-setosa 94 | 5.4,3.4,1.7,0.2,Iris-setosa 95 | 5.1,3.7,1.5,0.4,Iris-setosa 96 | 4.6,3.6,1.0,0.2,Iris-setosa 97 | 5.1,3.3,1.7,0.5,Iris-setosa 98 | 4.8,3.4,1.9,0.2,Iris-setosa 99 | 5.0,3.0,1.6,0.2,Iris-setosa 100 | 5.0,3.4,1.6,0.4,Iris-setosa 101 | 5.2,3.5,1.5,0.2,Iris-setosa 102 | 5.2,3.4,1.4,0.2,Iris-setosa 103 | 4.7,3.2,1.6,0.2,Iris-setosa 104 | 4.8,3.1,1.6,0.2,Iris-setosa 105 | 5.4,3.4,1.5,0.4,Iris-setosa 106 | 5.2,4.1,1.5,0.1,Iris-setosa 107 | 5.5,4.2,1.4,0.2,Iris-setosa 108 | 4.9,3.1,1.5,0.1,Iris-setosa 109 | 5.0,3.2,1.2,0.2,Iris-setosa 110 | 5.5,3.5,1.3,0.2,Iris-setosa 111 | 4.9,3.1,1.5,0.1,Iris-setosa 112 | 4.4,3.0,1.3,0.2,Iris-setosa 113 | 5.1,3.4,1.5,0.2,Iris-setosa 114 | 5.0,3.5,1.3,0.3,Iris-setosa 115 | 4.5,2.3,1.3,0.3,Iris-setosa 116 | 4.4,3.2,1.3,0.2,Iris-setosa 117 | 
5.0,3.5,1.6,0.6,Iris-setosa 118 | 5.1,3.8,1.9,0.4,Iris-setosa 119 | 4.8,3.0,1.4,0.3,Iris-setosa 120 | 5.1,3.8,1.6,0.2,Iris-setosa 121 | 4.6,3.2,1.4,0.2,Iris-setosa 122 | 5.3,3.7,1.5,0.2,Iris-setosa 123 | 5.0,3.3,1.4,0.2,Iris-setosa 124 | 7.0,3.2,4.7,1.4,Iris-versicolor 125 | 6.4,3.2,4.5,1.5,Iris-versicolor 126 | 6.9,3.1,4.9,1.5,Iris-versicolor 127 | 5.5,2.3,4.0,1.3,Iris-versicolor 128 | 6.5,2.8,4.6,1.5,Iris-versicolor 129 | 5.7,2.8,4.5,1.3,Iris-versicolor 130 | 6.3,3.3,4.7,1.6,Iris-versicolor 131 | 4.9,2.4,3.3,1.0,Iris-versicolor 132 | 6.6,2.9,4.6,1.3,Iris-versicolor 133 | 5.2,2.7,3.9,1.4,Iris-versicolor 134 | 5.0,2.0,3.5,1.0,Iris-versicolor 135 | 5.9,3.0,4.2,1.5,Iris-versicolor 136 | 6.0,2.2,4.0,1.0,Iris-versicolor 137 | 6.1,2.9,4.7,1.4,Iris-versicolor 138 | 5.6,2.9,3.6,1.3,Iris-versicolor 139 | 6.7,3.1,4.4,1.4,Iris-versicolor 140 | 5.6,3.0,4.5,1.5,Iris-versicolor 141 | 5.8,2.7,4.1,1.0,Iris-versicolor 142 | 6.2,2.2,4.5,1.5,Iris-versicolor 143 | 5.6,2.5,3.9,1.1,Iris-versicolor 144 | 5.9,3.2,4.8,1.8,Iris-versicolor 145 | 6.1,2.8,4.0,1.3,Iris-versicolor 146 | 6.3,2.5,4.9,1.5,Iris-versicolor 147 | 6.1,2.8,4.7,1.2,Iris-versicolor 148 | 6.4,2.9,4.3,1.3,Iris-versicolor 149 | 6.6,3.0,4.4,1.4,Iris-versicolor 150 | 6.8,2.8,4.8,1.4,Iris-versicolor 151 | 6.7,3.0,5.0,1.7,Iris-versicolor 152 | 6.0,2.9,4.5,1.5,Iris-versicolor 153 | 5.7,2.6,3.5,1.0,Iris-versicolor 154 | 5.5,2.4,3.8,1.1,Iris-versicolor 155 | 5.5,2.4,3.7,1.0,Iris-versicolor 156 | 5.8,2.7,3.9,1.2,Iris-versicolor 157 | 6.0,2.7,5.1,1.6,Iris-versicolor 158 | 5.4,3.0,4.5,1.5,Iris-versicolor 159 | 6.0,3.4,4.5,1.6,Iris-versicolor 160 | 6.7,3.1,4.7,1.5,Iris-versicolor 161 | 6.3,2.3,4.4,1.3,Iris-versicolor 162 | 5.6,3.0,4.1,1.3,Iris-versicolor 163 | 5.5,2.5,4.0,1.3,Iris-versicolor 164 | 5.5,2.6,4.4,1.2,Iris-versicolor 165 | 6.1,3.0,4.6,1.4,Iris-versicolor 166 | 5.8,2.6,4.0,1.2,Iris-versicolor 167 | 5.0,2.3,3.3,1.0,Iris-versicolor 168 | 5.6,2.7,4.2,1.3,Iris-versicolor 169 | 5.7,3.0,4.2,1.2,Iris-versicolor 170 | 
5.7,2.9,4.2,1.3,Iris-versicolor 171 | 6.2,2.9,4.3,1.3,Iris-versicolor 172 | 5.1,2.5,3.0,1.1,Iris-versicolor 173 | 5.7,2.8,4.1,1.3,Iris-versicolor 174 | 6.3,3.3,6.0,2.5,Iris-virginica 175 | 5.8,2.7,5.1,1.9,Iris-virginica 176 | 7.1,3.0,5.9,2.1,Iris-virginica 177 | 6.3,2.9,5.6,1.8,Iris-virginica 178 | 6.5,3.0,5.8,2.2,Iris-virginica 179 | 7.6,3.0,6.6,2.1,Iris-virginica 180 | 4.9,2.5,4.5,1.7,Iris-virginica 181 | 7.3,2.9,6.3,1.8,Iris-virginica 182 | 6.7,2.5,5.8,1.8,Iris-virginica 183 | 7.2,3.6,6.1,2.5,Iris-virginica 184 | 6.5,3.2,5.1,2.0,Iris-virginica 185 | 6.4,2.7,5.3,1.9,Iris-virginica 186 | 6.8,3.0,5.5,2.1,Iris-virginica 187 | 5.7,2.5,5.0,2.0,Iris-virginica 188 | 5.8,2.8,5.1,2.4,Iris-virginica 189 | 6.4,3.2,5.3,2.3,Iris-virginica 190 | 6.5,3.0,5.5,1.8,Iris-virginica 191 | 7.7,3.8,6.7,2.2,Iris-virginica 192 | 7.7,2.6,6.9,2.3,Iris-virginica 193 | 6.0,2.2,5.0,1.5,Iris-virginica 194 | 6.9,3.2,5.7,2.3,Iris-virginica 195 | 5.6,2.8,4.9,2.0,Iris-virginica 196 | 7.7,2.8,6.7,2.0,Iris-virginica 197 | 6.3,2.7,4.9,1.8,Iris-virginica 198 | 6.7,3.3,5.7,2.1,Iris-virginica 199 | 7.2,3.2,6.0,1.8,Iris-virginica 200 | 6.2,2.8,4.8,1.8,Iris-virginica 201 | 6.1,3.0,4.9,1.8,Iris-virginica 202 | 6.4,2.8,5.6,2.1,Iris-virginica 203 | 7.2,3.0,5.8,1.6,Iris-virginica 204 | 7.4,2.8,6.1,1.9,Iris-virginica 205 | 7.9,3.8,6.4,2.0,Iris-virginica 206 | 6.4,2.8,5.6,2.2,Iris-virginica 207 | 6.3,2.8,5.1,1.5,Iris-virginica 208 | 6.1,2.6,5.6,1.4,Iris-virginica 209 | 7.7,3.0,6.1,2.3,Iris-virginica 210 | 6.3,3.4,5.6,2.4,Iris-virginica 211 | 6.4,3.1,5.5,1.8,Iris-virginica 212 | 6.0,3.0,4.8,1.8,Iris-virginica 213 | 6.9,3.1,5.4,2.1,Iris-virginica 214 | 6.7,3.1,5.6,2.4,Iris-virginica 215 | 6.9,3.1,5.1,2.3,Iris-virginica 216 | 5.8,2.7,5.1,1.9,Iris-virginica 217 | 6.8,3.2,5.9,2.3,Iris-virginica 218 | 6.7,3.3,5.7,2.5,Iris-virginica 219 | 6.7,3.0,5.2,2.3,Iris-virginica 220 | 6.3,2.5,5.0,1.9,Iris-virginica 221 | 6.5,3.0,5.2,2.0,Iris-virginica 222 | 6.2,3.4,5.4,2.3,Iris-virginica 223 | 5.9,3.0,5.1,1.8,Iris-virginica 224 
| 225 | -------------------------------------------------------------------------------- /code/tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Learning scikit-learn " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## An Introduction to Machine Learning in Python" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### at PyData Chicago 2016" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "%load_ext watermark\n", 33 | "%watermark -a \"Sebastian Raschka\" -u -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# Table of Contents\n", 41 | "\n", 42 | "* [1 Introduction to Machine Learning](#1-Introduction-to-Machine-Learning)\n", 43 | "* [2 Linear Regression](#2-Linear-Regression)\n", 44 | " * [Loading the dataset](#Loading-the-dataset)\n", 45 | " * [Preparing the dataset](#Preparing-the-dataset)\n", 46 | " * [Fitting the model](#Fitting-the-model)\n", 47 | " * [Evaluating the model](#Evaluating-the-model)\n", 48 | "* [3 Introduction to Classification](#3-Introduction-to-Classification)\n", 49 | " * [The Iris dataset](#The-Iris-dataset)\n", 50 | " * [Class label encoding](#Class-label-encoding)\n", 51 | " * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)\n", 52 | " * [Test/train splits](#Test/train-splits)\n", 53 | " * [Logistic Regression](#Logistic-Regression)\n", 54 | " * [K-Nearest Neighbors](#K-Nearest-Neighbors)\n", 55 | " * [3 - Exercises](#3---Exercises)\n", 56 | "* [4 - Feature Preprocessing & scikit-learn Pipelines](#4---Feature-Preprocessing-&-scikit-learn-Pipelines)\n", 57 | " * 
[Categorical features: nominal vs ordinal](#Categorical-features:-nominal-vs-ordinal)\n", 58 | " * [Normalization](#Normalization)\n", 59 | " * [Pipelines](#Pipelines)\n", 60 | " * [4 - Exercises](#4---Exercises)\n", 61 | "* [5 - Dimensionality Reduction: Feature Selection & Extraction](#5---Dimensionality-Reduction:-Feature-Selection-&-Extraction)\n", 62 | " * [Recursive Feature Elimination](#Recursive-Feature-Elimination)\n", 63 | " * [Sequential Feature Selection](#Sequential-Feature-Selection)\n", 64 | " * [Principal Component Analysis](#Principal-Component-Analysis)\n", 65 | "* [6 - Model Evaluation & Hyperparameter Tuning](#6---Model-Evaluation-&-Hyperparameter-Tuning)\n", 66 | " * [Wine Dataset](#Wine-Dataset)\n", 67 | " * [Stratified K-Fold](#Stratified-K-Fold)\n", 68 | " * [Grid Search](#Grid-Search)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "%matplotlib inline\n", 80 | "import matplotlib.pyplot as plt\n", 81 | "import numpy as np\n", 82 | "import pandas as pd" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "
" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "
" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "# 1 Introduction to Machine Learning" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "
" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# 2 Linear Regression" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Loading the dataset" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Source: R.J. Gladstone (1905). \"A Study of the Relations of the Brain to \n", 132 | "to the Size of the Head\", Biometrika, Vol. 4, pp105-123\n", 133 | "\n", 134 | "\n", 135 | "Description: Brain weight (grams) and head size (cubic cm) for 237\n", 136 | "adults classified by gender and age group.\n", 137 | "\n", 138 | "\n", 139 | "Variables/Columns\n", 140 | "- Gender (1=Male, 2=Female)\n", 141 | "- Age Range (1=20-46, 2=46+)\n", 142 | "- Head size (cm^3)\n", 143 | "- Brain weight (grams)\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "df = pd.read_csv('dataset_brain.txt', \n", 155 | " encoding='utf-8', \n", 156 | " comment='#',\n", 157 | " sep='\\s+')\n", 158 | "df.tail()" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "plt.scatter(df['head-size'], df['brain-weight'])\n", 170 | "plt.xlabel('Head size (cm^3)')\n", 171 | "plt.ylabel('Brain weight (grams)');" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Preparing the dataset" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "y = df['brain-weight'].values\n", 190 | "y.shape" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | 
}, 199 | "outputs": [], 200 | "source": [ 201 | "X = df['head-size'].values\n", 202 | "X = X[:, np.newaxis]\n", 203 | "X.shape" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "from sklearn.cross_validation import train_test_split\n", 215 | "\n", 216 | "X_train, X_test, y_train, y_test = train_test_split(\n", 217 | "    X, y, test_size=0.3, random_state=123)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "plt.scatter(X_train, y_train, c='blue', marker='o')\n", 229 | "plt.scatter(X_test, y_test, c='red', marker='s')\n", 230 | "plt.xlabel('Head size (cm^3)')\n", 231 | "plt.ylabel('Brain weight (grams)');" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "### Fitting the model" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "from sklearn.linear_model import LinearRegression\n", 250 | "\n", 251 | "lr = LinearRegression()\n", 252 | "lr.fit(X_train, y_train)\n", 253 | "y_pred = lr.predict(X_test)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "### Evaluating the model" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "res_sum_of_squares = ((y_test - y_pred) ** 2).sum()\n", 272 | "tot_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()\n", 273 | "r2_score = 1 - (res_sum_of_squares / tot_sum_of_squares)\n", 274 | "print('R2 score: %.3f' % r2_score)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | 
"metadata": { 281 | "collapsed": false 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "print('R2 score: %.3f' % lr.score(X_test, y_test))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "lr.coef_" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": { 303 | "collapsed": false 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "lr.intercept_" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": { 314 | "collapsed": false 315 | }, 316 | "outputs": [], 317 | "source": [ 318 | "min_pred = X_train.min() * lr.coef_ + lr.intercept_\n", 319 | "max_pred = X_train.max() * lr.coef_ + lr.intercept_\n", 320 | "\n", 321 | "plt.scatter(X_train, y_train, c='blue', marker='o')\n", 322 | "plt.plot([X_train.min(), X_train.max()],\n", 323 | " [min_pred, max_pred],\n", 324 | " color='red',\n", 325 | " linewidth=4)\n", 326 | "plt.xlabel('Head size (cm^3)')\n", 327 | "plt.ylabel('Brain weight (grams)');" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "
" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "# 3 Introduction to Classification" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "### The Iris dataset" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": null, 354 | "metadata": { 355 | "collapsed": false 356 | }, 357 | "outputs": [], 358 | "source": [ 359 | "df = pd.read_csv('dataset_iris.txt', \n", 360 | " encoding='utf-8', \n", 361 | " comment='#',\n", 362 | " sep=',')\n", 363 | "df.tail()" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": { 370 | "collapsed": false 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "X = df.iloc[:, :4].values \n", 375 | "y = df['class'].values\n", 376 | "np.unique(y)" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "### Class label encoding" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": { 390 | "collapsed": false 391 | }, 392 | "outputs": [], 393 | "source": [ 394 | "from sklearn.preprocessing import LabelEncoder\n", 395 | "\n", 396 | "l_encoder = LabelEncoder()\n", 397 | "l_encoder.fit(y)\n", 398 | "l_encoder.classes_" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "collapsed": false 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "y_enc = l_encoder.transform(y)\n", 410 | "np.unique(y_enc)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": { 417 | "collapsed": false 418 | }, 419 | "outputs": [], 420 | "source": [ 421 | "np.unique(l_encoder.inverse_transform(y_enc))" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### Scikit-learn's in-build datasets" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | 
"execution_count": null, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "from sklearn.datasets import load_iris\n", 440 | "\n", 441 | "iris = load_iris()\n", 442 | "print(iris['DESCR'])" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Test/train splits" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "X, y = iris.data[:, :2], iris.target\n", 461 | "# ! We only use 2 features for visual purposes\n", 462 | "\n", 463 | "print('Class labels:', np.unique(y))\n", 464 | "print('Class proportions:', np.bincount(y))" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": { 471 | "collapsed": false 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "from sklearn.cross_validation import train_test_split\n", 476 | "\n", 477 | "X_train, X_test, y_train, y_test = train_test_split(\n", 478 | " X, y, test_size=0.3, random_state=123)\n", 479 | "\n", 480 | "print('Class labels:', np.unique(y_train))\n", 481 | "print('Class proportions:', np.bincount(y_train))" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": { 488 | "collapsed": false 489 | }, 490 | "outputs": [], 491 | "source": [ 492 | "from sklearn.cross_validation import train_test_split\n", 493 | "\n", 494 | "X_train, X_test, y_train, y_test = train_test_split(\n", 495 | " X, y, test_size=0.3, random_state=123,\n", 496 | " stratify=y)\n", 497 | "\n", 498 | "print('Class labels:', np.unique(y_train))\n", 499 | "print('Class proportions:', np.bincount(y_train))" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Logistic Regression" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": { 
513 | "collapsed": false 514 | }, 515 | "outputs": [], 516 | "source": [ 517 | "from sklearn.linear_model import LogisticRegression\n", 518 | "\n", 519 | "lr = LogisticRegression(solver='newton-cg', \n", 520 | "                        multi_class='multinomial', \n", 521 | "                        random_state=1)\n", 522 | "\n", 523 | "lr.fit(X_train, y_train)\n", 524 | "print('Test accuracy %.2f' % lr.score(X_test, y_test))" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "collapsed": false 532 | }, 533 | "outputs": [], 534 | "source": [ 535 | "from mlxtend.evaluate import plot_decision_regions\n", 536 | "\n", 537 | "# plot the decision regions and highlight the test samples\n", 538 | "\n", 539 | "plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)\n", 540 | "plt.xlabel('sepal length [cm]')\n", 541 | "plt.ylabel('sepal width [cm]');" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### K-Nearest Neighbors" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": false 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "from sklearn.neighbors import KNeighborsClassifier\n", 560 | "\n", 561 | "kn = KNeighborsClassifier(n_neighbors=4)\n", 562 | "\n", 563 | "kn.fit(X_train, y_train)\n", 564 | "print('Test accuracy %.2f' % kn.score(X_test, y_test))" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [], 574 | "source": [ 575 | "plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)\n", 576 | "plt.xlabel('sepal length [cm]')\n", 577 | "plt.ylabel('sepal width [cm]');" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "### 3 - Exercises" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "- Which of the two models above would you prefer if you had to choose? 
Why?\n", 592 | "- What would be possible ways to resolve ties in KNN when `n_neighbors` is an even number?\n", 593 | "- Can you find the right spot in the scikit-learn documentation to read about how scikit-learn handles this?\n", 594 | "- Train & evaluate the Logistic Regression and KNN algorithms on the 4-dimensional iris datasets. \n", 595 | " - What performance do you observe? \n", 596 | " - Why is it different vs. using only 2 dimensions? \n", 597 | " - Would adding more dimensions help?" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "
" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "# 4 - Feature Preprocessing & scikit-learn Pipelines" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "### Categorical features: nominal vs ordinal" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": { 625 | "collapsed": false 626 | }, 627 | "outputs": [], 628 | "source": [ 629 | "import pandas as pd\n", 630 | "\n", 631 | "df = pd.DataFrame([\n", 632 | " ['green', 'M', 10.0], \n", 633 | " ['red', 'L', 13.5], \n", 634 | " ['blue', 'XL', 15.3]])\n", 635 | "\n", 636 | "df.columns = ['color', 'size', 'prize']\n", 637 | "df" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "collapsed": false 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "from sklearn.feature_extraction import DictVectorizer\n", 649 | "\n", 650 | "dvec = DictVectorizer(sparse=False)\n", 651 | "\n", 652 | "X = dvec.fit_transform(df.transpose().to_dict().values())\n", 653 | "X" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [], 663 | "source": [ 664 | "size_mapping = {\n", 665 | " 'XL': 3,\n", 666 | " 'L': 2,\n", 667 | " 'M': 1}\n", 668 | "\n", 669 | "df['size'] = df['size'].map(size_mapping)\n", 670 | "df" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "metadata": { 677 | "collapsed": false 678 | }, 679 | "outputs": [], 680 | "source": [ 681 | "X = dvec.fit_transform(df.transpose().to_dict().values())\n", 682 | "X" 683 | ] 684 | }, 685 | { 686 | "cell_type": "markdown", 687 | "metadata": {}, 688 | "source": [ 689 | "### Normalization" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "collapsed": false 697 | }, 698 | "outputs": 
[], 699 | "source": [ 700 | "df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])\n", 701 | "df" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": { 708 | "collapsed": false 709 | }, 710 | "outputs": [], 711 | "source": [ 712 | "from sklearn.preprocessing import MinMaxScaler\n", 713 | "from sklearn.preprocessing import StandardScaler\n", 714 | "\n", 715 | "mmxsc = MinMaxScaler()\n", 716 | "stdsc = StandardScaler()\n", 717 | "\n", 718 | "X = df['feature'].values[:, np.newaxis]\n", 719 | "\n", 720 | "df['minmax'] = mmxsc.fit_transform(X)\n", 721 | "df['z-score'] = stdsc.fit_transform(X)\n", 722 | "\n", 723 | "df" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "### Pipelines" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": { 737 | "collapsed": false 738 | }, 739 | "outputs": [], 740 | "source": [ 741 | "from sklearn.pipeline import make_pipeline\n", 742 | "from sklearn.cross_validation import train_test_split\n", 743 | "from sklearn.datasets import load_iris\n", 744 | "\n", 745 | "iris = load_iris()\n", 746 | "X, y = iris.data, iris.target\n", 747 | "\n", 748 | "X_train, X_test, y_train, y_test = train_test_split(\n", 749 | " X, y, test_size=0.3, random_state=123,\n", 750 | " stratify=y)\n", 751 | "\n", 752 | "lr = LogisticRegression(solver='newton-cg', \n", 753 | " multi_class='multinomial', \n", 754 | " random_state=1)\n", 755 | "\n", 756 | "lr_pipe = make_pipeline(StandardScaler(), lr)\n", 757 | "\n", 758 | "lr_pipe.fit(X_train, y_train)\n", 759 | "lr_pipe.score(X_test, y_test)" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "metadata": { 766 | "collapsed": false 767 | }, 768 | "outputs": [], 769 | "source": [ 770 | "lr_pipe.named_steps" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": { 777 | 
"collapsed": false 778 | }, 779 | "outputs": [], 780 | "source": [ 781 | "lr_pipe.named_steps['standardscaler'].transform(X[:5])" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "### 4 - Exercises" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "- Why is it important that we scale test and training sets separately?\n", 796 | "- Fit a KNN classifier to the standardized Iris dataset. Do you notice difference in the predictive performance of the model compared to the non-standardized one? Why or why not?" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "
" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": {}, 809 | "source": [ 810 | "# 5 - Dimensionality Reduction: Feature Selection & Extraction" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": { 817 | "collapsed": true 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "from sklearn.cross_validation import train_test_split\n", 822 | "from sklearn.datasets import load_iris\n", 823 | "\n", 824 | "iris = load_iris()\n", 825 | "X, y = iris.data, iris.target\n", 826 | "\n", 827 | "X_train, X_test, y_train, y_test = train_test_split(\n", 828 | " X, y, test_size=0.3, random_state=123, stratify=y)" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "### Recursive Feature Elimination" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": null, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [], 845 | "source": [ 846 | "from sklearn.linear_model import LogisticRegression\n", 847 | "from sklearn.feature_selection import RFECV\n", 848 | "\n", 849 | "lr = LogisticRegression()\n", 850 | "rfe = RFECV(lr, step=1, cv=5, scoring='accuracy')\n", 851 | "\n", 852 | "rfe.fit(X_train, y_train)\n", 853 | "print('Number of features:', rfe.n_features_)\n", 854 | "print('Feature ranking', rfe.ranking_)" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "### Sequential Feature Selection" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": null, 867 | "metadata": { 868 | "collapsed": false 869 | }, 870 | "outputs": [], 871 | "source": [ 872 | "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", 873 | "from mlxtend.feature_selection import plot_sequential_feature_selection as plot_sfs\n", 874 | "\n", 875 | "\n", 876 | "sfs = SFS(lr, k_features=4, forward=True, floating=False, cv=5)\n", 877 | "\n", 878 | "sfs.fit(X_train, 
y_train)\n", 879 | "sfs = SFS(lr, \n", 880 | " k_features=4, \n", 881 | " forward=True, \n", 882 | " floating=False, \n", 883 | " scoring='accuracy',\n", 884 | " cv=2)\n", 885 | "\n", 886 | "sfs = sfs.fit(X, y)\n", 887 | "fig1 = plot_sfs(sfs.get_metric_dict())\n", 888 | "\n", 889 | "plt.ylim([0.8, 1])\n", 890 | "plt.title('Sequential Forward Selection (w. StdDev)')\n", 891 | "plt.grid()" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": null, 897 | "metadata": { 898 | "collapsed": false 899 | }, 900 | "outputs": [], 901 | "source": [ 902 | "sfs.subsets_" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "### Principal Component Analysis" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": null, 915 | "metadata": { 916 | "collapsed": false 917 | }, 918 | "outputs": [], 919 | "source": [ 920 | "from sklearn.decomposition import PCA\n", 921 | "from sklearn.preprocessing import StandardScaler\n", 922 | "\n", 923 | "sc = StandardScaler()\n", 924 | "pca = PCA(n_components=4)\n", 925 | "\n", 926 | "pca.fit_transform(X_train, y_train)\n", 927 | "\n", 928 | "var_exp = pca.explained_variance_ratio_\n", 929 | "cum_var_exp = np.cumsum(var_exp)\n", 930 | "\n", 931 | "idx = [i for i in range(len(var_exp))]\n", 932 | "labels = [str(i + 1) for i in idx]\n", 933 | "with plt.style.context('seaborn-whitegrid'):\n", 934 | " plt.bar(range(4), var_exp, alpha=0.5, align='center',\n", 935 | " label='individual explained variance')\n", 936 | " plt.step(range(4), cum_var_exp, where='mid',\n", 937 | " label='cumulative explained variance')\n", 938 | " plt.ylabel('Explained variance ratio')\n", 939 | " plt.xlabel('Principal components')\n", 940 | " plt.xticks(idx, labels)\n", 941 | " plt.legend(loc='center right')\n", 942 | " plt.tight_layout()\n", 943 | " plt.show()" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": { 950 | "collapsed": 
false 951 | }, 952 | "outputs": [], 953 | "source": [ 954 | "X_train_pca = pca.transform(X_train)\n", 955 | "\n", 956 | "for lab, col, mar in zip((0, 1, 2),\n", 957 | " ('blue', 'red', 'green'),\n", 958 | " ('o', 's', '^')):\n", 959 | " plt.scatter(X_train_pca[y_train == lab, 0],\n", 960 | " X_train_pca[y_train == lab, 1],\n", 961 | " label=lab,\n", 962 | " marker=mar,\n", 963 | " c=col)\n", 964 | "plt.xlabel('Principal Component 1')\n", 965 | "plt.ylabel('Principal Component 2')\n", 966 | "plt.legend(loc='lower right')\n", 967 | "plt.tight_layout()" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "
" 975 | ] 976 | }, 977 | { 978 | "cell_type": "markdown", 979 | "metadata": {}, 980 | "source": [ 981 | "# 6 - Model Evaluation & Hyperparameter Tuning" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "metadata": {}, 987 | "source": [ 988 | "### Wine Dataset" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": null, 994 | "metadata": { 995 | "collapsed": true 996 | }, 997 | "outputs": [], 998 | "source": [ 999 | "from mlxtend.data import wine_data\n", 1000 | "\n", 1001 | "X, y = wine_data()" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Wine dataset.\n", 1009 | "\n", 1010 | "Source : https://archive.ics.uci.edu/ml/datasets/Wine\n", 1011 | "\n", 1012 | "Number of samples : 178\n", 1013 | "\n", 1014 | "Class labels : {0, 1, 2}, distribution: [59, 71, 48]\n", 1015 | "\n", 1016 | "Dataset Attributes:\n", 1017 | "\n", 1018 | "1. Alcohol\n", 1019 | "2. Malic acid\n", 1020 | "3. Ash\n", 1021 | "4. Alcalinity of ash\n", 1022 | "5. Magnesium\n", 1023 | "6. Total phenols\n", 1024 | "7. Flavanoids\n", 1025 | "8. Nonflavanoid phenols\n", 1026 | "9. Proanthocyanins\n", 1027 | "10. Color intensity\n", 1028 | "11. Hue\n", 1029 | "12. OD280/OD315 of diluted wines\n", 1030 | "13. 
Proline\n" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "markdown", 1035 | "metadata": {}, 1036 | "source": [ 1037 | "### Stratified K-Fold" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": null, 1043 | "metadata": { 1044 | "collapsed": false 1045 | }, 1046 | "outputs": [], 1047 | "source": [ 1048 | "from sklearn.preprocessing import StandardScaler\n", 1049 | "from sklearn.decomposition import PCA\n", 1050 | "from sklearn.pipeline import make_pipeline\n", 1051 | "from sklearn.cross_validation import StratifiedKFold\n", 1052 | "from sklearn.neighbors import KNeighborsClassifier as KNN\n", 1053 | "\n", 1054 | "X_train, X_test, y_train, y_test = train_test_split(\n", 1055 | " X, y, test_size=0.3, random_state=123, stratify=y)\n", 1056 | "\n", 1057 | "pipe_kn = make_pipeline(StandardScaler(), \n", 1058 | " PCA(n_components=1),\n", 1059 | " KNN(n_neighbors=3))\n", 1060 | "\n", 1061 | "kfold = StratifiedKFold(y=y_train, \n", 1062 | " n_folds=10,\n", 1063 | " random_state=1)\n", 1064 | "\n", 1065 | "scores = []\n", 1066 | "for k, (train, test) in enumerate(kfold):\n", 1067 | " pipe_kn.fit(X_train[train], y_train[train])\n", 1068 | " score = pipe_kn.score(X_train[test], y_train[test])\n", 1069 | " scores.append(score)\n", 1070 | " print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,\n", 1071 | " np.bincount(y_train[train]), score))\n", 1072 | " \n", 1073 | "print('\\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": null, 1079 | "metadata": { 1080 | "collapsed": false 1081 | }, 1082 | "outputs": [], 1083 | "source": [ 1084 | "from sklearn.cross_validation import cross_val_score\n", 1085 | "\n", 1086 | "scores = cross_val_score(estimator=pipe_kn,\n", 1087 | " X=X_train,\n", 1088 | " y=y_train,\n", 1089 | " cv=10,\n", 1090 | " n_jobs=2)\n", 1091 | "\n", 1092 | "print('CV accuracy scores: %s' % scores)\n", 1093 | "print('CV accuracy: %.3f +/- 
%.3f' % (np.mean(scores), np.std(scores)))" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "### Grid Search" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "code", 1105 | "execution_count": null, 1106 | "metadata": { 1107 | "collapsed": false 1108 | }, 1109 | "outputs": [], 1110 | "source": [ 1111 | "pipe_kn.named_steps" 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "code", 1116 | "execution_count": null, 1117 | "metadata": { 1118 | "collapsed": false 1119 | }, 1120 | "outputs": [], 1121 | "source": [ 1122 | "from sklearn.grid_search import GridSearchCV\n", 1123 | "\n", 1124 | "\n", 1125 | "param_grid = {'pca__n_components': [1, 2, 3, 4, 5, 6, None],\n", 1126 | " 'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11]}\n", 1127 | "\n", 1128 | "gs = GridSearchCV(estimator=pipe_kn, \n", 1129 | " param_grid=param_grid, \n", 1130 | " scoring='accuracy', \n", 1131 | " cv=10,\n", 1132 | " n_jobs=2,\n", 1133 | " refit=True)\n", 1134 | "gs = gs.fit(X_train, y_train)\n", 1135 | "print(gs.best_score_)\n", 1136 | "print(gs.best_params_)" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": null, 1142 | "metadata": { 1143 | "collapsed": false 1144 | }, 1145 | "outputs": [], 1146 | "source": [ 1147 | "gs.score(X_test, y_test)" 1148 | ] 1149 | } 1150 | ], 1151 | "metadata": { 1152 | "anaconda-cloud": {}, 1153 | "kernelspec": { 1154 | "display_name": "Python 3", 1155 | "language": "python", 1156 | "name": "python3" 1157 | }, 1158 | "language_info": { 1159 | "codemirror_mode": { 1160 | "name": "ipython", 1161 | "version": 3 1162 | }, 1163 | "file_extension": ".py", 1164 | "mimetype": "text/x-python", 1165 | "name": "python", 1166 | "nbconvert_exporter": "python", 1167 | "pygments_lexer": "ipython3", 1168 | "version": "3.5.2" 1169 | } 1170 | }, 1171 | "nbformat": 4, 1172 | "nbformat_minor": 0 1173 | } 1174 | -------------------------------------------------------------------------------- 
/images/checkenv-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/checkenv-example.png -------------------------------------------------------------------------------- /images/github-download.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/github-download.png -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/images/logo.png -------------------------------------------------------------------------------- /slides/PyDataChi2016_sr.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rasbt/pydata-chicago2016-ml-tutorial/115c706033c82daf563b7a440218b933fcadb228/slides/PyDataChi2016_sr.pdf --------------------------------------------------------------------------------