├── .gitignore
├── Breakout-Classification.ipynb
├── Breakout-Regression.ipynb
├── Datasets.ipynb
├── Index.ipynb
├── Intro.ipynb
├── LICENSE
├── README.md
├── fig_code
│   ├── __init__.py
│   ├── data.py
│   ├── figures.py
│   ├── helpers.py
│   ├── kmeans.py
│   ├── linear_regression.py
│   └── sgd_separator.py
├── indepth_GMM.ipynb
├── indepth_KMeans.ipynb
├── indepth_PCA.ipynb
├── indepth_RandomForests.ipynb
└── indepth_SVM.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # C extensions
6 | *.so
7 |
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 |
26 | # PyInstaller
27 | # Usually these files are written by a python script from a template
28 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 |
32 | # Installer logs
33 | pip-log.txt
34 | pip-delete-this-directory.txt
35 |
36 | # Unit test / coverage reports
37 | htmlcov/
38 | .tox/
39 | .coverage
40 | .coverage.*
41 | .cache
42 | nosetests.xml
43 | coverage.xml
44 | *.cover
45 |
46 | # Translations
47 | *.mo
48 | *.pot
49 |
50 | # Django stuff:
51 | *.log
52 |
53 | # Sphinx documentation
54 | docs/_build/
55 |
56 | # PyBuilder
57 | target/
58 |
59 | # IPython
60 | .ipynb_checkpoints
61 |
62 | # Emacs
63 | *~
--------------------------------------------------------------------------------
/Breakout-Classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:024628f96f6dce5ffc22f9a799cac725908533b1cce78b87a9ac904e65e4bca5"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Exploring Supervised Machine Learning: Classification"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "This notebook explores a classification task on astronomical data with Scikit-Learn.\n",
23 | "Much of the practice of machine learning relies on searching through documentation to learn (or remind yourself) of how things are done. Recall that in IPython, you can see the documentation of any function using the ``?`` character. For example:"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "collapsed": false,
29 | "input": [
30 | "import numpy as np\n",
31 | "np.linspace?"
32 | ],
33 | "language": "python",
34 | "metadata": {},
35 | "outputs": []
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "In addition, Google is your friend. Do you want to know how to import a Gaussian mixture model in scikit-learn? Search for \"scikit-learn Gaussian Mixture\" and go from there."
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "In this notebook, we'll look at a *stellar photometry* dataset for classification, where the labels distinguish between variable and non-variable stars (based on multiple observations for the training set).\n",
49 | "\n",
50 | "(see also the [Regression breakout](Breakout-Regression.ipynb))"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "We'll start with the normal notebook imports & preliminaries:"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "collapsed": false,
63 | "input": [
64 | "%matplotlib inline\n",
65 | "import matplotlib as mpl\n",
66 | "import matplotlib.pyplot as plt\n",
67 | "import numpy as np\n",
68 | "\n",
69 | "# set matplotlib figure style\n",
70 | "mpl.style.use('ggplot') # only works in matplotlib 1.4+\n",
71 | "mpl.rc('figure', figsize=(8, 6))"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "## The Data: Stellar Photometry\n",
82 | "\n",
83 | "The stellar photometry data is a combination of photometry for RR Lyrae and main sequence stars from SDSS Stripe 82, with RR Lyrae flagged based on their temporal variation."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "collapsed": false,
89 | "input": [
90 | "from astroML.datasets import fetch_rrlyrae_combined\n",
91 | "from sklearn.cross_validation import train_test_split\n",
92 | "\n",
93 | "X, y = fetch_rrlyrae_combined()"
94 | ],
95 | "language": "python",
96 | "metadata": {},
97 | "outputs": []
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 |       "The data consist of 93141 objects, each with four features representing the [u-g, g-r, r-i, i-z] colors:"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "collapsed": false,
109 | "input": [
110 | "print(X.shape)\n",
111 | "print(X[:10])"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": []
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "There are two labels, 0 = non-variable, 1 = variable star."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "collapsed": false,
127 | "input": [
128 | "print(y.shape)\n",
129 | "print(np.unique(y))"
130 | ],
131 | "language": "python",
132 | "metadata": {},
133 | "outputs": []
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "The data have a large imbalance: that is, there are many more background stars than there are RR Lyrae stars.\n",
140 | "This is important to keep in mind when performing a machine learning task on them:"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "collapsed": false,
146 | "input": [
147 | "np.sum(y == 0), np.sum(y == 1)"
148 | ],
149 | "language": "python",
150 | "metadata": {},
151 | "outputs": []
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "As before, we can plot the data to better understand what we're looking at:"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "collapsed": false,
163 | "input": [
164 | "plt.scatter(X[:, 0], X[:, 1], c=y,\n",
165 | " linewidth=0, alpha=0.3)\n",
166 | "plt.xlabel('u-g')\n",
167 | "plt.ylabel('g-r')\n",
168 | "plt.axis('tight');"
169 | ],
170 | "language": "python",
171 | "metadata": {},
172 | "outputs": []
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "## Classification: Labeling Variable Sources\n",
179 | "\n",
180 | "Here we'll explore a classification task using the variable sources shown above."
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "### 1. Explore the Data\n",
188 | "\n",
189 | "It's good to start by visually exploring the data you're working with.\n",
190 | "Plot various combinations of the inputs: if you were doing a classic manual separation of the data based on this, what would you do? Treating the variable stars as your \"signal\" and the rest as your \"background\", roughly what would you expect for the completeness and contamination of such a method?"
191 | ]
192 | },
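    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For example, a first look might be sketched like this (the column pairs below are an arbitrary choice, and ``X`` and ``y`` are the arrays loaded above); the empty cells that follow are for your own exploration:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: a few color-color diagrams with the two classes overplotted.\n",
      "# Assumes X (colors) and y (labels) from the cells above.\n",
      "color_names = ['u-g', 'g-r', 'r-i', 'i-z']\n",
      "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
      "for ax, (i, j) in zip(axes, [(0, 1), (1, 2), (2, 3)]):\n",
      "    ax.scatter(X[:, i], X[:, j], c=y, linewidth=0, alpha=0.3)\n",
      "    ax.set_xlabel(color_names[i])\n",
      "    ax.set_ylabel(color_names[j])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },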
193 | {
194 | "cell_type": "code",
195 | "collapsed": false,
196 | "input": [],
197 | "language": "python",
198 | "metadata": {},
199 | "outputs": []
200 | },
201 | {
202 | "cell_type": "code",
203 | "collapsed": false,
204 | "input": [],
205 | "language": "python",
206 | "metadata": {},
207 | "outputs": []
208 | },
209 | {
210 | "cell_type": "code",
211 | "collapsed": false,
212 | "input": [],
213 | "language": "python",
214 | "metadata": {},
215 | "outputs": []
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "### 2. Applying a Fast & Simple Model\n",
222 | "\n",
223 | "Import the Gaussian Naive Bayes estimator from scikit-learn (try a Google search \u2013 the sklearn documentation will pop up, and you should be able to find the correct import statement pretty easily). This is a particularly useful estimator as a first-pass: it essentially fits each input class with an axis-aligned normal distribution, and thus is very fast (though usually not very accurate).\n",
224 | "\n",
225 | "Fit this model to the data, and then use cross-validation to determine the accuracy of the result."
226 | ]
227 | },
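    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A minimal sketch of what this could look like (assuming ``GaussianNB`` from ``sklearn.naive_bayes`` and ``cross_val_score`` from ``sklearn.cross_validation``, matching the imports used earlier in this notebook):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: Gaussian Naive Bayes with default 3-fold cross-validation\n",
      "from sklearn.naive_bayes import GaussianNB\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = GaussianNB()\n",
      "scores = cross_val_score(model, X, y)  # default scoring is accuracy\n",
      "print(scores)\n",
      "print(\"mean accuracy: {0:.3f}\".format(scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },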
228 | {
229 | "cell_type": "code",
230 | "collapsed": false,
231 | "input": [],
232 | "language": "python",
233 | "metadata": {},
234 | "outputs": []
235 | },
236 | {
237 | "cell_type": "code",
238 | "collapsed": false,
239 | "input": [],
240 | "language": "python",
241 | "metadata": {},
242 | "outputs": []
243 | },
244 | {
245 | "cell_type": "code",
246 | "collapsed": false,
247 | "input": [],
248 | "language": "python",
249 | "metadata": {},
250 | "outputs": []
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "Note that the typical accuracy score may not be relevant here: if our goal is to find RR Lyrae from the background, you'll want to use a different scoring method (the ``scoring`` argument to ``cross_val_score``).\n",
257 | "The default scoring here is ``\"accuracy\"``, which basically means\n",
258 | "\n",
259 | "$$\n",
260 |       "{\\rm accuracy} \\equiv \\frac{\\rm \\#~correct~labels}{\\rm total~labels}\n",
261 | "$$\n",
262 | "\n",
263 |       "But for unbalanced data, accuracy is actually a very poor metric! You can create a very accurate classifier by simply returning 0 for everything: the result will have an accuracy of $(93141 - 483) / 93141 \\approx 99.5\\%$.\n",
264 | "Needless to say, this model isn't very useful.\n",
265 | "\n",
266 | "Instead, we should look at the ``\"precision\"`` and ``\"recall\"`` scores.\n",
267 | "\n",
268 | "This is what the terms mean:\n",
269 | "\n",
270 | "$$\n",
271 | "{\\rm precision} \\equiv \\frac{\\rm true~positives}{\\rm true~positives + false~positives}\n",
272 | "$$\n",
273 | "\n",
274 | "$$\n",
275 | "{\\rm recall} \\equiv \\frac{\\rm true~positives}{\\rm true~positives + false~negatives}\n",
276 | "$$\n",
277 | "\n",
278 | "$$\n",
279 |       "F_1 \\equiv 2 \\cdot \\frac{\\rm precision \\cdot recall}{\\rm precision + recall}\n",
280 | "$$\n",
281 | "\n",
282 |       "(Note that Wikipedia has a nice visual of these quantities.)\n",
283 | "The range for precision, recall, and F1 score is 0 to 1, with 1 being perfect.\n",
284 | "Often in astronomy, we call the recall the *completeness*, and (1 - precision) the *contamination*.\n",
285 | "\n",
286 | "Use the ``scoring`` argument to the sklearn cross validation routine to find the optimal accuracy, precision, and recall attained with Gaussian Naive Bayes."
287 | ]
288 | },
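    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For instance, one way to organize this (reusing ``GaussianNB`` and the full ``X``, ``y`` arrays) is a simple loop over the built-in scoring strings:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: compare accuracy, precision, and recall for Gaussian Naive Bayes\n",
      "from sklearn.naive_bayes import GaussianNB\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = GaussianNB()\n",
      "for scoring in ['accuracy', 'precision', 'recall']:\n",
      "    scores = cross_val_score(model, X, y, scoring=scoring)\n",
      "    print(\"{0}: {1:.3f}\".format(scoring, scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },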
289 | {
290 | "cell_type": "code",
291 | "collapsed": false,
292 | "input": [],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": []
296 | },
297 | {
298 | "cell_type": "code",
299 | "collapsed": false,
300 | "input": [],
301 | "language": "python",
302 | "metadata": {},
303 | "outputs": []
304 | },
305 | {
306 | "cell_type": "code",
307 | "collapsed": false,
308 | "input": [],
309 | "language": "python",
310 | "metadata": {},
311 | "outputs": []
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "### 3. Applying a more Flexible Model\n",
318 | "\n",
319 | "Gaussian Naive Bayes is an example of a **generative classifier**, meaning that each individual class distribution is modeled, and the classification is achieved by comparing these.\n",
320 | "Another realm of classification algorithms are **discriminative classifiers** which, roughly speaking, try to draw a line or curve (or more generally, hyperplane or manifold) through parameter space to separate the classes.\n",
321 | "\n",
322 | "A good example of a discriminative classifier is the Support Vector Machine classifier, which you can read about in the [SVM Notebook](indepth_SVM.ipynb).\n",
323 |       "One thing to note is that the SVM routines tend to be fairly slow for large datasets, so while experimenting below it may be helpful to work with a subset of your data, e.g.:"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "collapsed": false,
329 | "input": [
330 | "# subset consisting of 10% of full data\n",
331 | "Xsub = X[::10]\n",
332 | "ysub = y[::10]"
333 | ],
334 | "language": "python",
335 | "metadata": {},
336 | "outputs": []
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "You can always go back and run the algorithm on the full dataset once you've tried a few things.\n",
343 | "\n",
344 | "Here, import the support vector machine classifier (again, Google is your friend) and fit it to this data. What is the accuracy, precision, and recall of the default estimator? How does changing the ``kernel`` parameter affect things?"
345 | ]
346 | },
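    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A rough sketch on the subset defined above might look like the following (the ``rbf`` kernel is simply the default choice; try others and compare):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: support vector classification on the 10% subset\n",
      "from sklearn.svm import SVC\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = SVC(kernel='rbf')  # try kernel='linear', 'poly', ... as well\n",
      "for scoring in ['accuracy', 'precision', 'recall']:\n",
      "    scores = cross_val_score(model, Xsub, ysub, scoring=scoring)\n",
      "    print(\"{0}: {1:.3f}\".format(scoring, scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },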
347 | {
348 | "cell_type": "code",
349 | "collapsed": false,
350 | "input": [],
351 | "language": "python",
352 | "metadata": {},
353 | "outputs": []
354 | },
355 | {
356 | "cell_type": "code",
357 | "collapsed": false,
358 | "input": [],
359 | "language": "python",
360 | "metadata": {},
361 | "outputs": []
362 | },
363 | {
364 | "cell_type": "code",
365 | "collapsed": false,
366 | "input": [],
367 | "language": "python",
368 | "metadata": {},
369 | "outputs": []
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "You may notice that there are some issues here: at times the SVC fails to even predict a single positive sample, which makes precision and recall ill-defined!\n",
376 | "One of the issues with discriminative classification is that it does not do a good job with dataset imbalance (i.e. where background objects dominate).\n",
377 | "This can be adjusted by weighting the instances of each class, here using the ``class_weight`` parameter.\n",
378 | "Set this to ``'auto'``: how does this change the results?"
379 | ]
380 | },
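    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For example, the change might be as simple as this (again on the subset, to keep the runtime short):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: re-weight the classes so the rare RR Lyrae are not ignored\n",
      "from sklearn.svm import SVC\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = SVC(kernel='rbf', class_weight='auto')\n",
      "for scoring in ['accuracy', 'precision', 'recall']:\n",
      "    scores = cross_val_score(model, Xsub, ysub, scoring=scoring)\n",
      "    print(\"{0}: {1:.3f}\".format(scoring, scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },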
381 | {
382 | "cell_type": "code",
383 | "collapsed": false,
384 | "input": [],
385 | "language": "python",
386 | "metadata": {},
387 | "outputs": []
388 | },
389 | {
390 | "cell_type": "code",
391 | "collapsed": false,
392 | "input": [],
393 | "language": "python",
394 | "metadata": {},
395 | "outputs": []
396 | },
397 | {
398 | "cell_type": "code",
399 | "collapsed": false,
400 | "input": [],
401 | "language": "python",
402 | "metadata": {},
403 | "outputs": []
404 | },
405 | {
406 | "cell_type": "markdown",
407 | "metadata": {},
408 | "source": [
409 | "### 4. Non-parametrics: Random Forests\n",
410 | "\n",
411 |       "Both the generative and discriminative models above are examples of *parametric* models. Either the data distribution or the separating hyperplane is parametrized in some way, which means that the model is not as flexible as it could be.\n",
412 | "Often better results can be found with a non-parametric model such as a Random Forest.\n",
413 | "For some insight into what's happening with this model, see the [Random Forest Notebook](indepth_RandomForests.ipynb).\n",
414 | "\n",
415 | "Try a Random Forest classifier on this data (again: see Google). What accuracy, precision, and recall can you attain?"
416 | ]
417 | },
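    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One possible sketch (the ``n_estimators`` value below is just a starting point):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: random forest classification with mostly-default hyperparameters\n",
      "from sklearn.ensemble import RandomForestClassifier\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = RandomForestClassifier(n_estimators=10, random_state=0)\n",
      "for scoring in ['accuracy', 'precision', 'recall']:\n",
      "    scores = cross_val_score(model, X, y, scoring=scoring)\n",
      "    print(\"{0}: {1:.3f}\".format(scoring, scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },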
418 | {
419 | "cell_type": "code",
420 | "collapsed": false,
421 | "input": [],
422 | "language": "python",
423 | "metadata": {},
424 | "outputs": []
425 | },
426 | {
427 | "cell_type": "code",
428 | "collapsed": false,
429 | "input": [],
430 | "language": "python",
431 | "metadata": {},
432 | "outputs": []
433 | },
434 | {
435 | "cell_type": "code",
436 | "collapsed": false,
437 | "input": [],
438 | "language": "python",
439 | "metadata": {},
440 | "outputs": []
441 | },
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {},
445 | "source": [
446 | "One of the most important pieces of machine learning is *hyperparameter optimization*.\n",
447 | "Hyperparameters are the parameters given at model instantiation, which are set by the user rather than tuned by the fit to the data.\n",
448 | "Hyperparameter optimization is the process of determining the best choice of these parameters for your dataset. Keep in mind that the results of your fit may be extremely sensitive to these!\n",
449 | "\n",
450 |       "We saw a hint of this previously, when we used cross-validation to compare different options for the SGD classifier.\n",
451 |       "The random forest classifier has a number of hyperparameters to tune, but here let's focus on the ``max_depth`` and ``n_estimators`` parameters.\n",
452 | "Try a few of these, compute the cross-validation score, and see how it changes. Do you notice any patterns?"
453 | ]
454 | },
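    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One way to start the exploration, using the subset from earlier for speed and the F1 score as a single summary number:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: scan a few (max_depth, n_estimators) combinations by hand\n",
      "from sklearn.ensemble import RandomForestClassifier\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "for max_depth in [5, 10, 20]:\n",
      "    for n_estimators in [10, 50]:\n",
      "        model = RandomForestClassifier(max_depth=max_depth,\n",
      "                                       n_estimators=n_estimators,\n",
      "                                       random_state=0)\n",
      "        f1 = cross_val_score(model, Xsub, ysub, scoring='f1').mean()\n",
      "        print(\"max_depth={0}, n_estimators={1}: F1={2:.3f}\".format(\n",
      "            max_depth, n_estimators, f1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },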
455 | {
456 | "cell_type": "code",
457 | "collapsed": false,
458 | "input": [],
459 | "language": "python",
460 | "metadata": {},
461 | "outputs": []
462 | },
463 | {
464 | "cell_type": "code",
465 | "collapsed": false,
466 | "input": [],
467 | "language": "python",
468 | "metadata": {},
469 | "outputs": []
470 | },
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {},
474 | "source": [
475 | "Searching through a grid of hyperparameters for the optimal model is a very common (and useful!) task in performing a machine learning analysis.\n",
476 | "Scikit-learn provides a *Grid Search* interface to enable this to be done quickly. Take a look at the scikit-learn [Grid Search Documentation](http://scikit-learn.org/stable/modules/grid_search.html) and use the grid search tools to find the best combination of ``n_estimators`` and ``max_depth`` for your particular data.\n",
477 | "What's the best accuracy/precision/recall that you can attain with RandomForests on this dataset?"
478 | ]
479 | },
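    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A sketch of the setup (this uses the ``sklearn.grid_search`` module of the scikit-learn version current at the time; newer releases move ``GridSearchCV`` to ``sklearn.model_selection``):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: grid search over max_depth and n_estimators, optimizing F1\n",
      "from sklearn.ensemble import RandomForestClassifier\n",
      "from sklearn.grid_search import GridSearchCV\n",
      "\n",
      "param_grid = {'n_estimators': [10, 50, 100],\n",
      "              'max_depth': [5, 10, 20]}\n",
      "grid = GridSearchCV(RandomForestClassifier(random_state=0),\n",
      "                    param_grid, scoring='f1')\n",
      "grid.fit(Xsub, ysub)\n",
      "print(grid.best_params_)\n",
      "print(grid.best_score_)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },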
480 | {
481 | "cell_type": "code",
482 | "collapsed": false,
483 | "input": [],
484 | "language": "python",
485 | "metadata": {},
486 | "outputs": []
487 | },
488 | {
489 | "cell_type": "code",
490 | "collapsed": false,
491 | "input": [],
492 | "language": "python",
493 | "metadata": {},
494 | "outputs": []
495 | },
496 | {
497 | "cell_type": "code",
498 | "collapsed": false,
499 | "input": [],
500 | "language": "python",
501 | "metadata": {},
502 | "outputs": []
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {},
507 | "source": [
508 | "### 5. Other classification models\n",
509 | "\n",
510 | "If you have more time, take a look at scikit-learn's online documentation and read about the other classification models available. Try a few of them on this dataset: are there any that can outperform your random forest on one of these metrics?"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "collapsed": false,
516 | "input": [],
517 | "language": "python",
518 | "metadata": {},
519 | "outputs": []
520 | },
521 | {
522 | "cell_type": "code",
523 | "collapsed": false,
524 | "input": [],
525 | "language": "python",
526 | "metadata": {},
527 | "outputs": []
528 | },
529 | {
530 | "cell_type": "code",
531 | "collapsed": false,
532 | "input": [],
533 | "language": "python",
534 | "metadata": {},
535 | "outputs": []
536 | }
537 | ],
538 | "metadata": {}
539 | }
540 | ]
541 | }
--------------------------------------------------------------------------------
/Breakout-Regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:d4626a838a71b8679af7e1334182e84960d06dc3870a8ce6953ce72b8e60a7bd"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Exploring Supervised Machine Learning: Regression"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "This notebook explores a regression task on astronomical data with Scikit-Learn.\n",
23 | "Much of the practice of machine learning relies on searching through documentation to learn (or remind yourself) of how things are done. Recall that in IPython, you can see the documentation of any function using the ``?`` character. For example:"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "collapsed": false,
29 | "input": [
30 | "import numpy as np\n",
31 | "np.linspace?"
32 | ],
33 | "language": "python",
34 | "metadata": {},
35 | "outputs": []
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "In addition, Google is your friend. Do you want to know how to import a Gaussian mixture model in scikit-learn? Search for \"scikit-learn Gaussian Mixture\" and go from there."
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "Below we'll explore a *galaxy photometry* dataset for regression, where the labels give the redshift (based on SDSS spectra for the training set).\n",
49 | "\n",
50 | "(see also the [Classification breakout](Breakout-Classification.ipynb))"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "collapsed": false,
56 | "input": [
57 | "%matplotlib inline\n",
58 | "import matplotlib as mpl\n",
59 | "import matplotlib.pyplot as plt\n",
60 | "import numpy as np\n",
61 | "\n",
62 | "# set matplotlib figure style\n",
63 | "mpl.style.use('ggplot') # only works in matplotlib 1.4+\n",
64 | "mpl.rc('figure', figsize=(8, 6))"
65 | ],
66 | "language": "python",
67 | "metadata": {},
68 | "outputs": []
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "## The Data: Photometric Redshifts\n",
75 | "\n",
76 | "The photometric redshift data comes from the spectroscopic galaxy catalog from SDSS DR7.\n",
77 | "There is a wealth of data in this dataset, but we will focus on photometry and redshift measurements:"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "collapsed": false,
83 | "input": [
84 | "from astroML.datasets import fetch_sdss_specgals\n",
85 | "data = fetch_sdss_specgals()"
86 | ],
87 | "language": "python",
88 | "metadata": {},
89 | "outputs": []
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "This is a much more involved dataset than used in the classification example; it's in the form of a NumPy structured array."
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "collapsed": false,
101 | "input": [
102 | "data.dtype.names"
103 | ],
104 | "language": "python",
105 | "metadata": {},
106 | "outputs": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "We'll extract a matrix of (u,g,r,i,z) model magnitudes to tackle the \"classic\" photometric redshift problem:"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "collapsed": false,
118 | "input": [
119 | "# put magnitudes in a matrix\n",
120 | "mags = np.vstack([data['modelMag_%s' % f] for f in 'ugriz']).T\n",
121 | "z = data['z']"
122 | ],
123 | "language": "python",
124 | "metadata": {},
125 | "outputs": []
126 | },
127 | {
128 | "cell_type": "code",
129 | "collapsed": false,
130 | "input": [
131 | "print(mags.shape)\n",
132 | "print(z.shape)"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 |       "More often, photometric redshifts are computed from galaxy *colors*, because this minimizes the dependence on galaxy mass, which can cause problems, and also helps with data calibration issues."
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "collapsed": false,
148 | "input": [
149 | "colors = mags[:, 1:] - mags[:, :-1]"
150 | ],
151 | "language": "python",
152 | "metadata": {},
153 | "outputs": []
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Nevertheless, this step does discard some potentially useful data, so it may not be the best choice!\n",
160 | "\n",
161 |       "Note that we're explicitly ignoring the measurement errors in the data, which is generally a bad idea.\n",
162 | "We'll think about how you might address this problem further below.\n",
163 | "\n",
164 | "Let's take a look at color-vs-redshift for a subset of these galaxies:"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "collapsed": false,
170 | "input": [
171 | "fig, ax = plt.subplots()\n",
172 | "\n",
173 | "N = 50000\n",
174 | "ax.plot(colors[:N, 1], z[:N], ',', alpha=0.3)\n",
175 | "\n",
176 | "ax.set(xlim=(-2, 0),\n",
177 | " ylim=(0, 0.4),\n",
178 | " xlabel='g - r color',\n",
179 | " ylabel='redshift');"
180 | ],
181 | "language": "python",
182 | "metadata": {},
183 | "outputs": []
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "From this, it appears that a regression applied to the photometry should give us some amount of information on the redshift."
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "## Regression: Determining Photometric Redshift\n",
197 | "\n",
198 | "Here we will explore a regression task using the above data."
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "### 1. Exploring the Data\n",
206 | "\n",
207 | "With any machine learning task, it's a good idea to start by exploring the data.\n",
208 | "Plot some of the colors vs the redshift. Are there any patterns you see? If you were to create some sort of rough photometric redshift by hand, what would you do? How accurate would you expect to be?\n",
209 | "\n",
210 | "Note that plotting the full dataset can be a bit slow; you might try just plotting the first 10,000 points or so."
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "collapsed": false,
216 | "input": [],
217 | "language": "python",
218 | "metadata": {},
219 | "outputs": []
220 | },
221 | {
222 | "cell_type": "code",
223 | "collapsed": false,
224 | "input": [],
225 | "language": "python",
226 | "metadata": {},
227 | "outputs": []
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [],
233 | "language": "python",
234 | "metadata": {},
235 | "outputs": []
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### 2. Simple Regression\n",
242 | "\n",
243 | "It's always a good idea to start out with a simple regression algorithm to get an idea for how it performs.\n",
244 | "In Scikit-learn, the ``sklearn.linear_model.LinearRegression`` is a good candidate.\n",
245 | "Import and fit the model, and use cross-validation to determine the accuracy of the result."
246 | ]
247 | },
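    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A minimal sketch (using the ``colors`` and ``z`` arrays constructed above); the empty cells below are for your own experiments:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: baseline linear regression on the colors\n",
      "from sklearn.linear_model import LinearRegression\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = LinearRegression()\n",
      "scores = cross_val_score(model, colors, z)  # default scoring is the R2 score\n",
      "print(scores)\n",
      "print(\"mean R2: {0:.3f}\".format(scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },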
248 | {
249 | "cell_type": "code",
250 | "collapsed": false,
251 | "input": [],
252 | "language": "python",
253 | "metadata": {},
254 | "outputs": []
255 | },
256 | {
257 | "cell_type": "code",
258 | "collapsed": false,
259 | "input": [],
260 | "language": "python",
261 | "metadata": {},
262 | "outputs": []
263 | },
264 | {
265 | "cell_type": "code",
266 | "collapsed": false,
267 | "input": [],
268 | "language": "python",
269 | "metadata": {},
270 | "outputs": []
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "The default scoring for regression cross-validation is the \"R2 score\", in which 1.0 is perfect and lower values are worse.\n",
277 | "Other options are ``mean_absolute_error``, ``mean_squared_error``, and ``median_absolute_error``.\n",
278 | "\n",
279 | "For the photometric redshift problem in particular, often we're less interested in average performance and more interested in minimizing things like catastrophic outliers (i.e. predictions which are very far from the true value).\n",
280 | "Scikit-learn doesn't have any scoring methods which account for this, but we can define our own scoring method to measure whatever we'd like. For example, here is a scoring function which computes the percentage of inputs within $\\pm 0.05$ of the true redshift:"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "collapsed": false,
286 | "input": [
287 | "def outlier_score(model, X, y, thresh=0.05):\n",
288 | " y_pred = model.predict(X)\n",
289 | " diff = abs(y - y_pred)\n",
290 |       "    return np.mean(diff <= thresh)  # fraction within thresh; avoids integer division under Python 2"
291 | ],
292 | "language": "python",
293 | "metadata": {},
294 | "outputs": []
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "Pass this scoring function to the ``scoring`` parameter of the cross validation. What percentage does this model attain?"
301 | ]
302 | },
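    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For example, something along these lines; note that ``outlier_score`` already has the ``scorer(estimator, X, y)`` signature that scikit-learn expects for a custom ``scoring`` callable:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: cross-validate linear regression with the custom outlier score\n",
      "from sklearn.linear_model import LinearRegression\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "model = LinearRegression()\n",
      "scores = cross_val_score(model, colors, z, scoring=outlier_score)\n",
      "print(\"fraction within 0.05: {0:.3f}\".format(scores.mean()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },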
303 | {
304 | "cell_type": "code",
305 | "collapsed": false,
306 | "input": [],
307 | "language": "python",
308 | "metadata": {},
309 | "outputs": []
310 | },
311 | {
312 | "cell_type": "code",
313 | "collapsed": false,
314 | "input": [],
315 | "language": "python",
316 | "metadata": {},
317 | "outputs": []
318 | },
319 | {
320 | "cell_type": "code",
321 | "collapsed": false,
322 | "input": [],
323 | "language": "python",
324 | "metadata": {},
325 | "outputs": []
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 |       "This illustrates one of the key aspects of doing machine learning with astronomical data: we can't simply rely on the defaults. We have to think about what we're after, and customize the scoring to measure it."
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "### 3. Moving to a more complex model\n",
339 | "\n",
340 | "There are a large number of regression routines available in scikit-learn; one of the most interesting is the Random Forest regressor (for some insight into Random Forests, see the [Random Forest Notebook](indepth_RandomForests.ipynb)).\n",
341 | "Repeat the above experiments using the random forest estimator. How does the R2 score and the outlier score compare to those obtained with simple linear regression?"
342 | ]
343 | },
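    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One possible sketch; a 10% subset is used only to keep the runtime short, and the hyperparameters are arbitrary starting values:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: random forest regression, scored with R2 and with the outlier score\n",
      "from sklearn.ensemble import RandomForestRegressor\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "colors_sub, z_sub = colors[::10], z[::10]\n",
      "model = RandomForestRegressor(n_estimators=10, random_state=0)\n",
      "print(cross_val_score(model, colors_sub, z_sub).mean())  # R2 score\n",
      "print(cross_val_score(model, colors_sub, z_sub, scoring=outlier_score).mean())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },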
344 | {
345 | "cell_type": "code",
346 | "collapsed": false,
347 | "input": [],
348 | "language": "python",
349 | "metadata": {},
350 | "outputs": []
351 | },
352 | {
353 | "cell_type": "code",
354 | "collapsed": false,
355 | "input": [],
356 | "language": "python",
357 | "metadata": {},
358 | "outputs": []
359 | },
360 | {
361 | "cell_type": "code",
362 | "collapsed": false,
363 | "input": [],
364 | "language": "python",
365 | "metadata": {},
366 | "outputs": []
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "### 4. Tuning the HyperParameters\n",
373 | "\n",
374 | "For random forests in particular, the choice of hyperparameters can be very important in finding the best fit.\n",
375 | "Adjust the ``n_estimators`` and ``max_depth`` parameters and see what the effect is.\n",
376 | "What's the best model for attaining high R2 score? For attaining a low outlier fraction?"
377 | ]
378 | },
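    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One way to start, scanning ``max_depth`` at fixed ``n_estimators`` on the 10% subset:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: vary max_depth and watch both scores\n",
      "from sklearn.ensemble import RandomForestRegressor\n",
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "colors_sub, z_sub = colors[::10], z[::10]\n",
      "for max_depth in [5, 10, 20]:\n",
      "    model = RandomForestRegressor(n_estimators=10, max_depth=max_depth,\n",
      "                                  random_state=0)\n",
      "    r2 = cross_val_score(model, colors_sub, z_sub).mean()\n",
      "    frac = cross_val_score(model, colors_sub, z_sub,\n",
      "                           scoring=outlier_score).mean()\n",
      "    print(\"max_depth={0}: R2={1:.3f}, outlier score={2:.3f}\".format(\n",
      "        max_depth, r2, frac))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },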
379 | {
380 | "cell_type": "code",
381 | "collapsed": false,
382 | "input": [],
383 | "language": "python",
384 | "metadata": {},
385 | "outputs": []
386 | },
387 | {
388 | "cell_type": "code",
389 | "collapsed": false,
390 | "input": [],
391 | "language": "python",
392 | "metadata": {},
393 | "outputs": []
394 | },
395 | {
396 | "cell_type": "code",
397 | "collapsed": false,
398 | "input": [],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Searching through a grid of hyperparameters for the optimal model is a very common (and useful!) task in performing a machine learning analysis.\n",
408 | "Scikit-learn provides a *Grid Search* interface to enable this to be done quickly. Take a look at the scikit-learn [Grid Search Documentation](http://scikit-learn.org/stable/modules/grid_search.html) and use the grid search tools to find the best combination of ``n_estimators`` and ``max_depth`` for your particular data.\n",
409 | "What's the best ``r2`` score & outlier score that you can attain with RandomForests on this dataset?"
410 | ]
411 | },
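    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A sketch of the setup (again via ``sklearn.grid_search``; newer scikit-learn versions move ``GridSearchCV`` to ``sklearn.model_selection``), scored with the custom ``outlier_score``:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch: grid search for the random forest regressor, scored by outlier fraction\n",
      "from sklearn.ensemble import RandomForestRegressor\n",
      "from sklearn.grid_search import GridSearchCV\n",
      "\n",
      "param_grid = {'n_estimators': [10, 50],\n",
      "              'max_depth': [5, 10, 20]}\n",
      "grid = GridSearchCV(RandomForestRegressor(random_state=0),\n",
      "                    param_grid, scoring=outlier_score)\n",
      "grid.fit(colors[::10], z[::10])\n",
      "print(grid.best_params_)\n",
      "print(grid.best_score_)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },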
412 | {
413 | "cell_type": "code",
414 | "collapsed": false,
415 | "input": [],
416 | "language": "python",
417 | "metadata": {},
418 | "outputs": []
419 | },
420 | {
421 | "cell_type": "code",
422 | "collapsed": false,
423 | "input": [],
424 | "language": "python",
425 | "metadata": {},
426 | "outputs": []
427 | },
428 | {
429 | "cell_type": "code",
430 | "collapsed": false,
431 | "input": [],
432 | "language": "python",
433 | "metadata": {},
434 | "outputs": []
435 | },
436 | {
437 | "cell_type": "markdown",
438 | "metadata": {},
439 | "source": [
440 | "### 5. More Regression models\n",
441 | "\n",
442 | "If you have more time, read through the scikit-learn documentation and take a look at other regression models that are available. Try some of them on this dataset \u2013 keep in mind that some don't scale all that well to large numbers of samples, so it may be beneficial to use only a subset of the training data.\n",
443 | "\n",
444 | "Did you find any model which out-performs the random forest?"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "collapsed": false,
450 | "input": [],
451 | "language": "python",
452 | "metadata": {},
453 | "outputs": []
454 | },
455 | {
456 | "cell_type": "code",
457 | "collapsed": false,
458 | "input": [],
459 | "language": "python",
460 | "metadata": {},
461 | "outputs": []
462 | },
463 | {
464 | "cell_type": "code",
465 | "collapsed": false,
466 | "input": [],
467 | "language": "python",
468 | "metadata": {},
469 | "outputs": []
470 | }
471 | ],
472 | "metadata": {}
473 | }
474 | ]
475 | }
--------------------------------------------------------------------------------
/Datasets.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:16bf6ad2c65d1494d779e4eee3613226dbed01395cdd9890dc5839dc9eb93be5"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Getting the Datasets\n",
16 | "\n",
17 | "AstroML includes commands for loading various datasets.\n",
18 | "The first time the command is run, it fetches the data from the web, does some preprocessing, and caches the result on disk.\n",
19 | "\n",
20 |       "In order not to inadvertently bring down the network, we need to do a bit of prep to import the files from the memory stick provided by the conference organizers.\n",
21 | "The following function will do this, if you specify the correct path of the memory stick:"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "collapsed": false,
27 | "input": [
28 | "import os\n",
29 | "import sys\n",
30 | "import shutil\n",
31 | "import numpy as np\n",
32 | "from astroML.datasets import get_data_home\n",
33 | "\n",
34 | "def load_LGAStat_data(src_dir):\n",
35 | " data_dir = get_data_home()\n",
36 | " assert(os.path.exists(src_dir))\n",
37 | " assert(os.path.isdir(src_dir))\n",
38 | " \n",
39 | " # specgals\n",
40 | " filename = 'SDSSspecgalsDR8.fit'\n",
41 | " print(\"copying {0}\".format(filename))\n",
42 | " sys.stdout.flush()\n",
43 | " shutil.copyfile(os.path.join(src_dir, filename),\n",
44 | " os.path.join(data_dir, filename))\n",
45 | " \n",
46 | " # RR Lyrae mags\n",
47 | " filename = 'RRLyrae.fit'\n",
48 | " print(\"copying {0}\".format(filename))\n",
49 | " sys.stdout.flush()\n",
50 | " shutil.copyfile(os.path.join(src_dir, filename),\n",
51 | " os.path.join(data_dir, filename))\n",
52 | " \n",
53 | " # Stripe 82\n",
54 | " filename = 'stripe82calibStars_v2.6.dat.gz'\n",
55 | " print(\"processing {0}\".format(filename))\n",
56 | " sys.stdout.flush()\n",
57 | " archive_file = 'sdss_S82standards.npy'\n",
58 | " DTYPE = [('RA', 'f8'),\n",
59 | " ('DEC', 'f8'),\n",
60 | " ('RArms', 'f4'),\n",
61 | " ('DECrms', 'f4'),\n",
62 | " ('Ntot', 'i4'),\n",
63 | " ('A_r', 'f4')]\n",
64 | "\n",
65 | " for band in 'ugriz':\n",
66 | " DTYPE += [('Nobs_%s' % band, 'i4')]\n",
67 | " DTYPE += map(lambda s: (s + '_' + band, 'f4'),\n",
68 | " ['mmed', 'mmu', 'msig', 'mrms', 'mchi2'])\n",
69 | " \n",
70 | " # first column is 'CALIBSTARS'. We'll ignore this.\n",
71 | " COLUMNS = range(1, len(DTYPE) + 1)\n",
72 |       "    kwargs = dict(usecols=COLUMNS, dtype=DTYPE)\n",
73 | "\n",
74 | " data = np.loadtxt(os.path.join(src_dir, filename), **kwargs)\n",
75 | " np.save(os.path.join(data_dir, archive_file), data)\n",
76 | " \n",
77 | " # make sure that it worked\n",
78 | " from astroML.datasets import fetch_rrlyrae_combined\n",
79 | " from astroML.datasets import fetch_sdss_specgals\n",
80 | " X, y = fetch_rrlyrae_combined(download_if_missing=False)\n",
81 | " photoz_gals = fetch_sdss_specgals(download_if_missing=False)\n",
82 | " \n",
83 | " assert(X.shape == (93141, 4))\n",
84 | " assert(photoz_gals.shape == (661598,))\n",
85 | " \n",
86 | " print(\"Finished!\")"
87 | ],
88 | "language": "python",
89 | "metadata": {},
90 | "outputs": []
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "### To Get the Data Without Killing The Network\n",
97 | "\n",
98 | "1. Copy ``LocalGroupAstrostatistics2015/obs/astroML`` directory to the directory where this notebook is\n",
99 | "2. Run the following code:"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "collapsed": false,
105 | "input": [
106 | "load_LGAStat_data('./astroML/')"
107 | ],
108 | "language": "python",
109 | "metadata": {},
110 | "outputs": []
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "(This will take a bit of time: it is reading and re-formatting some of the binary data).\n",
117 | "If that code block runs without raising an error, then you have the data cached on your system."
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Now you can run these commands & astroML will load the cached version of the datasets we'll use:"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "collapsed": false,
130 | "input": [
131 | "from astroML.datasets import fetch_rrlyrae_combined\n",
132 | "X, y = fetch_rrlyrae_combined(download_if_missing=False)"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "code",
140 | "collapsed": false,
141 | "input": [
142 | "from astroML.datasets import fetch_sdss_specgals\n",
143 | "photoz_gals = fetch_sdss_specgals(download_if_missing=False)"
144 | ],
145 | "language": "python",
146 | "metadata": {},
147 | "outputs": []
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 |       "The default value of ``download_if_missing`` is ``True``; if you leave out the ``download_if_missing=False`` argument and have not already cached the data, the call will start a ~200MB download!"
154 | ]
155 | }
156 | ],
157 | "metadata": {}
158 | }
159 | ]
160 | }
--------------------------------------------------------------------------------
/Index.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:6c2f30df22b2f3029e2ad19c9cc41a10b60f57a893c72a0c92eea770ecf730ae"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Machine Learning for Astronomy\n",
16 | "\n",
17 | "*Jake VanderPlas, June 2015*\n",
18 | "\n",
19 | "## 1. Introduction\n",
20 | "\n",
21 | "- [Basics of Machine Learning for Astronomy](Intro.ipynb)\n",
22 | "\n",
23 | "## 2. Breakouts\n",
24 | "\n",
25 | "You'll probably have time for only one: choose your favorite!\n",
26 | "\n",
27 | "- [Classification Breakout: Photometric identification of RR Lyrae](Breakout-Classification.ipynb)\n",
28 | "- [Regression Breakout: Photometric redshift determination](Breakout-Regression.ipynb)\n",
29 | " \n",
30 | "## 3. Additional content: in depth intro to specific algorithms\n",
31 | "\n",
32 | "We won't cover this as a group, but you might wish to read through these yourself.\n",
33 | "\n",
34 | "- [Random Forests](indepth_RandomForests.ipynb)\n",
35 | "- [Support Vector Machines](indepth_SVM.ipynb)\n",
36 | "- [K Means](indepth_KMeans.ipynb)\n",
37 | "- [Principal Component Analysis](indepth_PCA.ipynb)\n",
38 | "\n",
39 | "These are taken from my [PyCon 2015 Scikit-Learn tutorial](https://github.com/jakevdp/sklearn_pycon2015)"
40 | ]
41 | }
42 | ],
43 | "metadata": {}
44 | }
45 | ]
46 | }
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2015, LocalGroupAstrostatistics2015
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | * Neither the name of MachineLearning nor the names of its
15 | contributors may be used to endorse or promote products derived from
16 | this software without specific prior written permission.
17 |
18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28 |
29 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MachineLearning
2 | Materials for the Wednesday Afternoon Machine Learning workshop
3 |
4 | See the [Index](Index.ipynb) for a listing of the content.
--------------------------------------------------------------------------------
/fig_code/__init__.py:
--------------------------------------------------------------------------------
1 | from .data import *
2 | from .figures import *
3 |
4 | from .sgd_separator import plot_sgd_separator
5 | from .linear_regression import plot_linear_regression
6 | from .kmeans import plot_kmeans
7 | from .helpers import plot_iris_knn
8 |
--------------------------------------------------------------------------------
/fig_code/data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | def linear_data_sample(N=40, rseed=0, m=3, b=-2):
5 | rng = np.random.RandomState(rseed)
6 |
7 | x = 10 * rng.rand(N)
8 | dy = m / 2 * (1 + rng.rand(N))
9 | y = m * x + b + dy * rng.randn(N)
10 |
11 | return (x, y, dy)
12 |
13 |
14 | def linear_data_sample_big_errs(N=40, rseed=0, m=3, b=-2):
15 | rng = np.random.RandomState(rseed)
16 |
17 | x = 10 * rng.rand(N)
18 | dy = m / 2 * (1 + rng.rand(N))
19 | dy[20:25] *= 10
20 | y = m * x + b + dy * rng.randn(N)
21 |
22 | return (x, y, dy)
23 |
24 |
25 | def sample_light_curve(phased=True):
26 | from astroML.datasets import fetch_LINEAR_sample
27 | data = fetch_LINEAR_sample()
28 | t, y, dy = data[18525697].T
29 |
30 | if phased:
31 | P_best = 0.580313015651
32 | t /= P_best
33 |
34 | return (t, y, dy)
35 |
36 |
37 | def sample_light_curve_2(phased=True):
38 | from astroML.datasets import fetch_LINEAR_sample
39 | data = fetch_LINEAR_sample()
40 | t, y, dy = data[10022663].T
41 |
42 | if phased:
43 | P_best = 0.61596079804
44 | t /= P_best
45 |
46 | return (t, y, dy)
47 |
48 |
--------------------------------------------------------------------------------
/fig_code/figures.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | import warnings
4 |
5 |
6 | def plot_venn_diagram():
7 | fig, ax = plt.subplots(subplot_kw=dict(frameon=False, xticks=[], yticks=[]))
8 | ax.add_patch(plt.Circle((0.3, 0.3), 0.3, fc='red', alpha=0.5))
9 | ax.add_patch(plt.Circle((0.6, 0.3), 0.3, fc='blue', alpha=0.5))
10 | ax.add_patch(plt.Rectangle((-0.1, -0.1), 1.1, 0.8, fc='none', ec='black'))
11 | ax.text(0.2, 0.3, '$x$', size=30, ha='center', va='center')
12 | ax.text(0.7, 0.3, '$y$', size=30, ha='center', va='center')
13 | ax.text(0.0, 0.6, '$I$', size=30)
14 | ax.axis('equal')
15 |
16 |
17 | def plot_example_decision_tree():
18 | fig = plt.figure(figsize=(10, 4))
19 | ax = fig.add_axes([0, 0, 0.8, 1], frameon=False, xticks=[], yticks=[])
20 | ax.set_title('Example Decision Tree: Animal Classification', size=24)
21 |
22 | def text(ax, x, y, t, size=20, **kwargs):
23 | ax.text(x, y, t,
24 | ha='center', va='center', size=size,
25 | bbox=dict(boxstyle='round', ec='k', fc='w'), **kwargs)
26 |
27 | text(ax, 0.5, 0.9, "How big is\nthe animal?", 20)
28 | text(ax, 0.3, 0.6, "Does the animal\nhave horns?", 18)
29 | text(ax, 0.7, 0.6, "Does the animal\nhave two legs?", 18)
30 | text(ax, 0.12, 0.3, "Are the horns\nlonger than 10cm?", 14)
31 | text(ax, 0.38, 0.3, "Is the animal\nwearing a collar?", 14)
32 | text(ax, 0.62, 0.3, "Does the animal\nhave wings?", 14)
33 | text(ax, 0.88, 0.3, "Does the animal\nhave a tail?", 14)
34 |
35 | text(ax, 0.4, 0.75, "> 1m", 12, alpha=0.4)
36 | text(ax, 0.6, 0.75, "< 1m", 12, alpha=0.4)
37 |
38 | text(ax, 0.21, 0.45, "yes", 12, alpha=0.4)
39 | text(ax, 0.34, 0.45, "no", 12, alpha=0.4)
40 |
41 | text(ax, 0.66, 0.45, "yes", 12, alpha=0.4)
42 | text(ax, 0.79, 0.45, "no", 12, alpha=0.4)
43 |
44 | ax.plot([0.3, 0.5, 0.7], [0.6, 0.9, 0.6], '-k')
45 | ax.plot([0.12, 0.3, 0.38], [0.3, 0.6, 0.3], '-k')
46 | ax.plot([0.62, 0.7, 0.88], [0.3, 0.6, 0.3], '-k')
47 | ax.plot([0.0, 0.12, 0.20], [0.0, 0.3, 0.0], '--k')
48 | ax.plot([0.28, 0.38, 0.48], [0.0, 0.3, 0.0], '--k')
49 | ax.plot([0.52, 0.62, 0.72], [0.0, 0.3, 0.0], '--k')
50 | ax.plot([0.8, 0.88, 1.0], [0.0, 0.3, 0.0], '--k')
51 | ax.axis([0, 1, 0, 1])
52 |
53 |
54 | def visualize_tree(estimator, X, y, boundaries=True,
55 | xlim=None, ylim=None):
56 | estimator.fit(X, y)
57 |
58 | if xlim is None:
59 | xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
60 | if ylim is None:
61 | ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)
62 |
63 | x_min, x_max = xlim
64 | y_min, y_max = ylim
65 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
66 | np.linspace(y_min, y_max, 100))
67 | Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
68 |
69 | # Put the result into a color plot
70 | Z = Z.reshape(xx.shape)
71 | plt.figure()
72 | plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
73 | plt.clim(y.min(), y.max())
74 |
75 | # Plot also the training points
76 | plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')
77 | plt.axis('off')
78 |
79 | plt.xlim(x_min, x_max)
80 | plt.ylim(y_min, y_max)
81 | plt.clim(y.min(), y.max())
82 |
83 | # Plot the decision boundaries
84 | def plot_boundaries(i, xlim, ylim):
85 | if i < 0:
86 | return
87 |
88 | tree = estimator.tree_
89 |
90 | if tree.feature[i] == 0:
91 | plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
92 | plot_boundaries(tree.children_left[i],
93 | [xlim[0], tree.threshold[i]], ylim)
94 | plot_boundaries(tree.children_right[i],
95 | [tree.threshold[i], xlim[1]], ylim)
96 |
97 | elif tree.feature[i] == 1:
98 | plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
99 | plot_boundaries(tree.children_left[i], xlim,
100 | [ylim[0], tree.threshold[i]])
101 | plot_boundaries(tree.children_right[i], xlim,
102 | [tree.threshold[i], ylim[1]])
103 |
104 | if boundaries:
105 | plot_boundaries(0, plt.xlim(), plt.ylim())
106 |
107 |
108 | def plot_tree_interactive(X, y):
109 | from sklearn.tree import DecisionTreeClassifier
110 |
111 | def interactive_tree(depth=1):
112 | clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
113 | visualize_tree(clf, X, y)
114 |
115 | from IPython.html.widgets import interact
116 | return interact(interactive_tree, depth=[1, 5])
117 |
118 |
119 | def plot_kmeans_interactive(min_clusters=1, max_clusters=6):
120 | from IPython.html.widgets import interact
121 | from sklearn.metrics.pairwise import euclidean_distances
122 | from sklearn.datasets.samples_generator import make_blobs
123 |
124 | X, y = make_blobs(n_samples=300, centers=4,
125 | random_state=0, cluster_std=0.60)
126 |
127 | def _kmeans_step(frame=0, n_clusters=4):
128 | rng = np.random.RandomState(2)
129 | labels = np.zeros(X.shape[0])
130 | centers = rng.randn(n_clusters, 2)
131 |
132 | nsteps = frame // 3
133 |
134 | for i in range(nsteps + 1):
135 | old_centers = centers
136 | if i < nsteps or frame % 3 > 0:
137 | dist = euclidean_distances(X, centers)
138 | labels = dist.argmin(1)
139 |
140 | if i < nsteps or frame % 3 > 1:
141 | with warnings.catch_warnings():
142 | warnings.filterwarnings('ignore',
143 | message='Mean of empty slice')
144 | centers = np.array([X[labels == j].mean(0)
145 | for j in range(n_clusters)])
146 | nans = np.isnan(centers)
147 | centers[nans] = old_centers[nans]
148 |
149 |
150 | # plot the data and cluster centers
151 | plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow',
152 | vmin=0, vmax=n_clusters - 1);
153 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o',
154 | c=np.arange(n_clusters),
155 | s=200, cmap='rainbow')
156 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o',
157 | c='black', s=50)
158 |
159 | # plot new centers if third frame
160 | if frame % 3 == 2:
161 | for i in range(n_clusters):
162 | plt.annotate('', centers[i], old_centers[i],
163 | arrowprops=dict(arrowstyle='->', linewidth=1))
164 | plt.scatter(centers[:, 0], centers[:, 1], marker='o',
165 | c=np.arange(n_clusters),
166 | s=200, cmap='rainbow')
167 | plt.scatter(centers[:, 0], centers[:, 1], marker='o',
168 | c='black', s=50)
169 |
170 | plt.xlim(-4, 4)
171 | plt.ylim(-2, 10)
172 |
173 | if frame % 3 == 1:
174 | plt.text(3.8, 9.5, "1. Reassign points to nearest centroid",
175 | ha='right', va='top', size=14)
176 | elif frame % 3 == 2:
177 | plt.text(3.8, 9.5, "2. Update centroids to cluster means",
178 | ha='right', va='top', size=14)
179 |
180 | return interact(_kmeans_step, frame=[0, 50],
181 | n_clusters=[min_clusters, max_clusters])
182 |
183 |
184 | def plot_image_components(x, coefficients=None, mean=0, components=None,
185 | imshape=(8, 8), n_components=6, fontsize=12):
186 | if coefficients is None:
187 | coefficients = x
188 |
189 | if components is None:
190 | components = np.eye(len(coefficients), len(x))
191 |
192 | mean = np.zeros_like(x) + mean
193 |
194 |
195 | fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2))
196 | g = plt.GridSpec(2, 5 + n_components, hspace=0.3)
197 |
198 | def show(i, j, x, title=None):
199 | ax = fig.add_subplot(g[i, j], xticks=[], yticks=[])
200 | ax.imshow(x.reshape(imshape), interpolation='nearest')
201 | if title:
202 | ax.set_title(title, fontsize=fontsize)
203 |
204 | show(slice(2), slice(2), x, "True")
205 |
206 | approx = mean.copy()
207 | show(0, 2, np.zeros_like(x) + mean, r'$\mu$')
208 | show(1, 2, approx, r'$1 \cdot \mu$')
209 |
210 | for i in range(0, n_components):
211 | approx = approx + coefficients[i] * components[i]
212 | show(0, i + 3, components[i], r'$c_{0}$'.format(i + 1))
213 | show(1, i + 3, approx,
214 | r"${0:.2f} \cdot c_{1}$".format(coefficients[i], i + 1))
215 | plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom',
216 | transform=plt.gca().transAxes, fontsize=fontsize)
217 |
218 | show(slice(2), slice(-2, None), approx, "Approx")
219 |
220 |
221 | def plot_pca_interactive(data, n_components=6):
222 | from sklearn.decomposition import PCA
223 | from IPython.html.widgets import interact
224 |
225 | pca = PCA(n_components=n_components)
226 | Xproj = pca.fit_transform(data)
227 |
228 | def show_decomp(i=0):
229 | plot_image_components(data[i], Xproj[i],
230 | pca.mean_, pca.components_)
231 |
232 | interact(show_decomp, i=(0, data.shape[0] - 1));
233 |
--------------------------------------------------------------------------------
/fig_code/helpers.py:
--------------------------------------------------------------------------------
1 | """
2 | Small helpers for code that is not shown in the notebooks
3 | """
4 |
5 | from sklearn import neighbors, datasets, linear_model
6 | import pylab as pl
7 | import numpy as np
8 | from matplotlib.colors import ListedColormap
9 |
10 | # Create color maps for 3-class classification problem, as with iris
11 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
12 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
13 |
14 | def plot_iris_knn():
15 | iris = datasets.load_iris()
16 | X = iris.data[:, :2] # we only take the first two features. We could
17 | # avoid this ugly slicing by using a two-dim dataset
18 | y = iris.target
19 |
20 | knn = neighbors.KNeighborsClassifier(n_neighbors=3)
21 | knn.fit(X, y)
22 |
23 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
24 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
25 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
26 | np.linspace(y_min, y_max, 100))
27 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
28 |
29 | # Put the result into a color plot
30 | Z = Z.reshape(xx.shape)
31 | pl.figure()
32 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light)
33 |
34 | # Plot also the training points
35 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
36 | pl.xlabel('sepal length (cm)')
37 | pl.ylabel('sepal width (cm)')
38 | pl.axis('tight')
39 |
40 |
41 | def plot_polynomial_regression():
42 | rng = np.random.RandomState(0)
43 | x = 2*rng.rand(100) - 1
44 |
45 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9
46 | y = f(x) + .4 * rng.normal(size=100)
47 |
48 | x_test = np.linspace(-1, 1, 100)
49 |
50 | pl.figure()
51 | pl.scatter(x, y, s=4)
52 |
53 | X = np.array([x**i for i in range(5)]).T
54 | X_test = np.array([x_test**i for i in range(5)]).T
55 | regr = linear_model.LinearRegression()
56 | regr.fit(X, y)
57 | pl.plot(x_test, regr.predict(X_test), label='4th order')
58 |
59 | X = np.array([x**i for i in range(10)]).T
60 | X_test = np.array([x_test**i for i in range(10)]).T
61 | regr = linear_model.LinearRegression()
62 | regr.fit(X, y)
63 | pl.plot(x_test, regr.predict(X_test), label='9th order')
64 |
65 | pl.legend(loc='best')
66 | pl.axis('tight')
67 | pl.title('Fitting a 4th and a 9th order polynomial')
68 |
69 | pl.figure()
70 | pl.scatter(x, y, s=4)
71 | pl.plot(x_test, f(x_test), label="truth")
72 | pl.axis('tight')
73 | pl.title('Ground truth (9th order polynomial)')
74 |
75 |
76 |
--------------------------------------------------------------------------------
/fig_code/kmeans.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | from sklearn.datasets.samples_generator import make_blobs
3 | from sklearn.cluster import KMeans
4 |
5 |
6 |
7 | def plot_kmeans():
8 | X, y = make_blobs(n_samples=300, centers=4,
9 | random_state=0, cluster_std=0.60)
10 |
11 | y_pred = KMeans(4).fit(X).predict(X)
12 |
13 | fig, ax = plt.subplots(1, 2, figsize=(12, 6))
14 |
15 | ax[0].scatter(X[:, 0], X[:, 1])
16 | ax[0].set_title('Input')
17 |
18 | ax[1].scatter(X[:, 0], X[:, 1], c=y)
19 | ax[1].set_title('Labels determined by K Means')
20 |
--------------------------------------------------------------------------------
/fig_code/linear_regression.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from sklearn.linear_model import LinearRegression
4 |
5 |
6 | def plot_linear_regression():
7 | a = 0.5
8 | b = 1.0
9 |
10 |     # x from 0 to 30
11 | x = 30 * np.random.random(20)
12 |
13 | # y = a*x + b with noise
14 | y = a * x + b + np.random.normal(size=x.shape)
15 |
16 | # create a linear regression classifier
17 | clf = LinearRegression()
18 | clf.fit(x[:, None], y)
19 |
20 | # predict y from the data
21 | x_new = np.linspace(0, 30, 100)
22 | y_new = clf.predict(x_new[:, None])
23 |
24 | # plot the results
25 | ax = plt.axes()
26 | ax.scatter(x, y, s=60)
27 | ax.plot(x_new, y_new)
28 |
29 | ax.set_xlabel('x')
30 | ax.set_ylabel('y')
31 |
32 | ax.axis('tight')
33 |
34 |
35 | if __name__ == '__main__':
36 | plot_linear_regression()
37 | plt.show()
38 |
--------------------------------------------------------------------------------
/fig_code/sgd_separator.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from sklearn.linear_model import SGDClassifier
4 | from sklearn.datasets.samples_generator import make_blobs
5 |
6 | def plot_sgd_separator():
7 | # we create 50 separable points
8 | X, Y = make_blobs(n_samples=50, centers=2,
9 | random_state=0, cluster_std=0.60)
10 |
11 | # fit the model
12 | clf = SGDClassifier(loss="hinge", alpha=0.01,
13 | n_iter=200, fit_intercept=True)
14 | clf.fit(X, Y)
15 |
16 | # plot the line, the points, and the nearest vectors to the plane
17 | xx = np.linspace(-1, 5, 10)
18 | yy = np.linspace(-1, 5, 10)
19 |
20 | X1, X2 = np.meshgrid(xx, yy)
21 | Z = np.empty(X1.shape)
22 | for (i, j), val in np.ndenumerate(X1):
23 | x1 = val
24 | x2 = X2[i, j]
25 | p = clf.decision_function([x1, x2])
26 | Z[i, j] = p[0]
27 | levels = [-1.0, 0.0, 1.0]
28 | linestyles = ['dashed', 'solid', 'dashed']
29 | colors = 'k'
30 |
31 | ax = plt.axes()
32 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
33 | ax.scatter(X[:, 0], X[:, 1], c=Y, s=60)
34 |
35 | ax.axis('tight')
36 |
37 |
38 | if __name__ == '__main__':
39 | plot_sgd_separator()
40 | plt.show()
41 |
--------------------------------------------------------------------------------
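
``plot_sgd_separator`` above evaluates ``clf.decision_function`` one grid point at a time; the method also accepts a 2-D array of samples (and newer scikit-learn releases require one), so the whole grid can be scored in a single call. A minimal sketch of the equivalent computation, assuming a recent scikit-learn (not part of the repository):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
clf = SGDClassifier(loss="hinge", alpha=0.01, fit_intercept=True).fit(X, Y)

xx = np.linspace(-1, 5, 10)
X1, X2 = np.meshgrid(xx, xx)

# score every grid point at once instead of looping with np.ndenumerate
Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape)
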
/indepth_GMM.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:de851a6e13d685da444ca25d51086aeef554469384c3a456dacba39fa84228bd"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "# Density Estimation: Gaussian Mixture Models"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Here we'll explore **Gaussian Mixture Models**, an unsupervised clustering and density estimation technique.\n",
30 | "\n",
31 | "We'll start with our standard set of initial imports"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "collapsed": false,
37 | "input": [
38 | "%matplotlib inline\n",
39 | "import numpy as np\n",
40 | "import matplotlib.pyplot as plt\n",
41 | "from scipy import stats\n",
42 | "\n",
43 | "# use seaborn plotting defaults\n",
44 | "import seaborn as sns; sns.set()"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Introducing Gaussian Mixture Models\n",
55 | "\n",
56 | "We previously saw an example of K-Means, a clustering algorithm that is most often fit using an expectation-maximization approach.\n",
57 | "\n",
58 | "Here we'll consider an extension to this which is suitable for both **clustering** and **density estimation**.\n",
59 | "\n",
60 | "For example, imagine we have some one-dimensional data in a particular distribution:"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "collapsed": false,
66 | "input": [
67 | "np.random.seed(2)\n",
68 | "x = np.concatenate([np.random.normal(0, 2, 2000),\n",
69 | " np.random.normal(5, 5, 2000),\n",
70 | " np.random.normal(3, 0.5, 600)])\n",
71 | "plt.hist(x, 80, normed=True)\n",
72 | "plt.xlim(-10, 20);"
73 | ],
74 | "language": "python",
75 | "metadata": {},
76 | "outputs": []
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "Gaussian mixture models will allow us to approximate this density:"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "collapsed": false,
88 | "input": [
89 | "from sklearn.mixture import GMM\n",
90 | "clf = GMM(4, n_iter=500, random_state=3).fit(x)\n",
91 | "xpdf = np.linspace(-10, 20, 1000)\n",
92 | "density = np.exp(clf.score(xpdf))\n",
93 | "\n",
94 | "plt.hist(x, 80, normed=True, alpha=0.5)\n",
95 | "plt.plot(xpdf, density, '-r')\n",
96 | "plt.xlim(-10, 20);"
97 | ],
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": []
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "Note that this density is fit using a **mixture of Gaussians**, which we can examine by looking at the ``means_``, ``covars_``, and ``weights_`` attributes:"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "collapsed": false,
112 | "input": [
113 | "clf.means_"
114 | ],
115 | "language": "python",
116 | "metadata": {},
117 | "outputs": []
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "clf.covars_"
124 | ],
125 | "language": "python",
126 | "metadata": {},
127 | "outputs": []
128 | },
129 | {
130 | "cell_type": "code",
131 | "collapsed": false,
132 | "input": [
133 | "clf.weights_"
134 | ],
135 | "language": "python",
136 | "metadata": {},
137 | "outputs": []
138 | },
139 | {
140 | "cell_type": "code",
141 | "collapsed": false,
142 | "input": [
143 | "plt.hist(x, 80, normed=True, alpha=0.3)\n",
144 | "plt.plot(xpdf, density, '-r')\n",
145 | "\n",
146 | "for i in range(clf.n_components):\n",
147 | " pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],\n",
148 | " np.sqrt(clf.covars_[i, 0])).pdf(xpdf)\n",
149 | " plt.fill(xpdf, pdf, facecolor='gray',\n",
150 | " edgecolor='none', alpha=0.3)\n",
151 | "plt.xlim(-10, 20);"
152 | ],
153 | "language": "python",
154 | "metadata": {},
155 | "outputs": []
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "These individual Gaussian distributions are fit using an expectation-maximization method, much as in K means, except that rather than explicit cluster assignment, the **posterior probability** is used to compute the weighted mean and covariance.\n",
162 | "Somewhat surprisingly, this algorithm **provably** converges to the optimum (though the optimum is not necessarily global)."
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "## How many Gaussians?\n",
170 | "\n",
171 | "Given a model, we can use one of several means to evaluate how well it fits the data.\n",
172 | "For example, there are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "collapsed": false,
178 | "input": [
179 | "print(clf.bic(x))\n",
180 | "print(clf.aic(x))"
181 | ],
182 | "language": "python",
183 | "metadata": {},
184 | "outputs": []
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "Let's take a look at these as a function of the number of Gaussians:"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "collapsed": false,
196 | "input": [
197 | "n_estimators = np.arange(1, 10)\n",
198 | "clfs = [GMM(n, n_iter=1000).fit(x) for n in n_estimators]\n",
199 | "bics = [clf.bic(x) for clf in clfs]\n",
200 | "aics = [clf.aic(x) for clf in clfs]\n",
201 | "\n",
202 | "plt.plot(n_estimators, bics, label='BIC')\n",
203 | "plt.plot(n_estimators, aics, label='AIC')\n",
204 | "plt.legend();"
205 | ],
206 | "language": "python",
207 | "metadata": {},
208 | "outputs": []
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "It appears that for both the AIC and BIC, 4 components is preferred."
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "## Example: GMM For Outlier Detection\n",
222 | "\n",
223 | "GMM is what's known as a **Generative Model**: it's a probabilistic model from which a dataset can be generated.\n",
224 | "One thing that generative models can be useful for is **outlier detection**: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where \"suitable\" is up to your own bias/variance preference) can be labeled as outliers.\n",
225 | "\n",
226 | "Let's take a look at this by defining a new dataset with some outliers:"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [
233 | "np.random.seed(0)\n",
234 | "\n",
235 | "# Add 20 outliers\n",
236 | "true_outliers = np.sort(np.random.randint(0, len(x), 20))\n",
237 | "y = x.copy()\n",
238 | "y[true_outliers] += 50 * np.random.randn(20)"
239 | ],
240 | "language": "python",
241 | "metadata": {},
242 | "outputs": []
243 | },
244 | {
245 | "cell_type": "code",
246 | "collapsed": false,
247 | "input": [
248 | "clf = GMM(4, n_iter=500, random_state=0).fit(y)\n",
249 | "xpdf = np.linspace(-10, 20, 1000)\n",
250 | "density_noise = np.exp(clf.score(xpdf))\n",
251 | "\n",
252 | "plt.hist(y, 80, normed=True, alpha=0.5)\n",
253 | "plt.plot(xpdf, density_noise, '-r')\n",
254 | "#plt.xlim(-10, 20);"
255 | ],
256 | "language": "python",
257 | "metadata": {},
258 | "outputs": []
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "Now let's evaluate the log-likelihood of each point under the model, and plot these as a function of ``y``:"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "collapsed": false,
270 | "input": [
271 | "log_likelihood = clf.score_samples(y)[0]\n",
272 | "plt.plot(y, log_likelihood, '.k');"
273 | ],
274 | "language": "python",
275 | "metadata": {},
276 | "outputs": []
277 | },
278 | {
279 | "cell_type": "code",
280 | "collapsed": false,
281 | "input": [
282 | "detected_outliers = np.where(log_likelihood < -9)[0]\n",
283 | "\n",
284 | "print(\"true outliers:\")\n",
285 | "print(true_outliers)\n",
286 | "print(\"\\ndetected outliers:\")\n",
287 | "print(detected_outliers)"
288 | ],
289 | "language": "python",
290 | "metadata": {},
291 | "outputs": []
292 | },
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "The algorithm misses a few of these points, which is to be expected (some of the \"outliers\" actually land in the middle of the distribution!)\n",
298 | "\n",
299 | "Here are the outliers that were missed:"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "collapsed": false,
305 | "input": [
306 | "set(true_outliers) - set(detected_outliers)"
307 | ],
308 | "language": "python",
309 | "metadata": {},
310 | "outputs": []
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "And here are the non-outliers which were spuriously labeled outliers:"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "collapsed": false,
322 | "input": [
323 | "set(detected_outliers) - set(true_outliers)"
324 | ],
325 | "language": "python",
326 | "metadata": {},
327 | "outputs": []
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session."
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "## Other Density Estimators\n",
341 | "\n",
342 | "The other main density estimator that you might find useful is *Kernel Density Estimation*, which is available via ``sklearn.neighbors.KernelDensity``. In some ways, this can be thought of as a generalization of GMM where there is a gaussian placed at the location of *every* training point!"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "collapsed": false,
348 | "input": [
349 | "from sklearn.neighbors import KernelDensity\n",
350 | "kde = KernelDensity(0.15).fit(x[:, None])\n",
351 | "density_kde = np.exp(kde.score_samples(xpdf[:, None]))\n",
352 | "\n",
353 | "plt.hist(x, 80, normed=True, alpha=0.5)\n",
354 | "plt.plot(xpdf, density, '-b', label='GMM')\n",
355 | "plt.plot(xpdf, density_kde, '-r', label='KDE')\n",
356 | "plt.xlim(-10, 20)\n",
357 | "plt.legend();"
358 | ],
359 | "language": "python",
360 | "metadata": {},
361 | "outputs": []
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "All of these density estimators can be viewed as **Generative models** of the data: that is, the model tells us how more data can be created which fits the model."
368 | ]
369 | }
370 | ],
371 | "metadata": {}
372 | }
373 | ]
374 | }
--------------------------------------------------------------------------------
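
The GMM notebook above uses the old ``sklearn.mixture.GMM`` class, which was removed in later scikit-learn releases. A minimal sketch of the same one-dimensional density fit with the replacement ``GaussianMixture`` API (this sketch assumes scikit-learn 0.18 or newer and is not part of the repository):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture   # replacement for the removed sklearn.mixture.GMM

np.random.seed(2)
x = np.concatenate([np.random.normal(0, 2, 2000),
                    np.random.normal(5, 5, 2000),
                    np.random.normal(3, 0.5, 600)])

# GaussianMixture expects 2-D input; score_samples() already returns
# per-sample log-likelihoods, so no extra indexing is needed
gmm = GaussianMixture(n_components=4, max_iter=500, random_state=3).fit(x[:, None])
xpdf = np.linspace(-10, 20, 1000)
density = np.exp(gmm.score_samples(xpdf[:, None]))

print(gmm.means_.ravel())                     # component means
print(gmm.weights_)                           # mixing weights (covariances live in gmm.covariances_)
print(gmm.bic(x[:, None]), gmm.aic(x[:, None]))

plt.hist(x, 80, density=True, alpha=0.5)      # use normed=True on very old matplotlib
plt.plot(xpdf, density, '-r')
plt.xlim(-10, 20)
plt.show()
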
/indepth_KMeans.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:52d0366aa82af43633deaad52d9a5b52a4e649150bdccb048354423038a9636c"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "# Clustering: K-Means In-Depth"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Here we'll explore **K Means Clustering**, which is an unsupervised clustering technique.\n",
30 | "\n",
31 | "We'll start with our standard set of initial imports"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "collapsed": false,
37 | "input": [
38 | "%matplotlib inline\n",
39 | "import numpy as np\n",
40 | "import matplotlib.pyplot as plt\n",
41 | "from scipy import stats\n",
42 | "\n",
43 | "# use seaborn plotting defaults\n",
44 | "import seaborn as sns; sns.set()"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Introducing K-Means"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "K Means is an algorithm for **unsupervised clustering**: that is, finding clusters in data based on the data attributes alone (not the labels).\n",
62 | "\n",
63 | "K Means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.\n",
64 | "\n",
65 | "Let's look at how KMeans operates on the simple clusters we looked at previously. To emphasize that this is unsupervised, we'll not plot the colors of the clusters:"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "collapsed": false,
71 | "input": [
72 | "from sklearn.datasets.samples_generator import make_blobs\n",
73 | "X, y = make_blobs(n_samples=300, centers=4,\n",
74 | " random_state=0, cluster_std=0.60)\n",
75 | "plt.scatter(X[:, 0], X[:, 1], s=50);"
76 | ],
77 | "language": "python",
78 | "metadata": {},
79 | "outputs": []
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "By eye, it is relatively easy to pick out the four clusters. If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points. Fortunately, there is a well-known *Expectation Maximization (EM)* procedure which scikit-learn implements, so that KMeans can be solved relatively quickly."
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "collapsed": false,
91 | "input": [
92 | "from sklearn.cluster import KMeans\n",
93 | "est = KMeans(4) # 4 clusters\n",
94 | "est.fit(X)\n",
95 | "y_kmeans = est.predict(X)\n",
96 | "plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');"
97 | ],
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": []
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "The algorithm identifies the four clusters of points in a manner very similar to what we would do by eye!"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "## The K-Means Algorithm: Expectation Maximization\n",
114 | "\n",
115 | "K-Means is an example of an algorithm which uses an *Expectation-Maximization* approach to arrive at the solution.\n",
116 | "*Expectation-Maximization* is a two-step approach which works as follows:\n",
117 | "\n",
118 | "1. Guess some cluster centers\n",
119 | "2. Repeat until converged\n",
120 | "\n",
121 | " 1. Assign points to the nearest cluster center\n",
122 | " 2. Set the cluster centers to the mean of the points assigned to them\n",
123 | " \n",
124 | "Let's quickly visualize this process (a bare-bones NumPy sketch of this loop also appears after this notebook's listing):"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "collapsed": false,
130 | "input": [
131 | "from fig_code import plot_kmeans_interactive\n",
132 | "plot_kmeans_interactive();"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "This algorithm will (often) converge to the optimal cluster centers."
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### KMeans Caveats\n",
150 | "\n",
151 | "Convergence to the globally optimal clustering is not guaranteed; for that reason, scikit-learn by default runs the algorithm with several random initializations and keeps the best result.\n",
152 | "\n",
153 | "Also, the number of clusters must be set beforehand... there are other clustering algorithms for which this requirement may be lifted."
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "## Application of KMeans to Digits\n",
161 | "\n",
162 | "For a closer-to-real-world example, let's again take a look at the digits data. Here we'll use KMeans to automatically cluster the data in 64 dimensions, and then look at the cluster centers to see what the algorithm has found."
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "collapsed": false,
168 | "input": [
169 | "from sklearn.datasets import load_digits\n",
170 | "digits = load_digits()"
171 | ],
172 | "language": "python",
173 | "metadata": {},
174 | "outputs": []
175 | },
176 | {
177 | "cell_type": "code",
178 | "collapsed": false,
179 | "input": [
180 | "est = KMeans(n_clusters=10)\n",
181 | "clusters = est.fit_predict(digits.data)\n",
182 | "est.cluster_centers_.shape"
183 | ],
184 | "language": "python",
185 | "metadata": {},
186 | "outputs": []
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "We see ten clusters in 64 dimensions. Let's visualize each of these cluster centers to see what they represent:"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "collapsed": false,
198 | "input": [
199 | "fig = plt.figure(figsize=(8, 3))\n",
200 | "for i in range(10):\n",
201 | " ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])\n",
202 | " ax.imshow(est.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)"
203 | ],
204 | "language": "python",
205 | "metadata": {},
206 | "outputs": []
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "We see that *even without the labels*, KMeans is able to find clusters whose means are recognizable digits (with apologies to the number 8)!\n",
213 | "\n",
214 | "The cluster labels are permuted; let's fix this:"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "collapsed": false,
220 | "input": [
221 | "from scipy.stats import mode\n",
222 | "\n",
223 | "labels = np.zeros_like(clusters)\n",
224 | "for i in range(10):\n",
225 | " mask = (clusters == i)\n",
226 | " labels[mask] = mode(digits.target[mask])[0]"
227 | ],
228 | "language": "python",
229 | "metadata": {},
230 | "outputs": []
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "For good measure, let's use our PCA visualization and look at the true cluster labels and K-means cluster labels:"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "collapsed": false,
242 | "input": [
243 | "from sklearn.decomposition import PCA\n",
244 | "\n",
245 | "X = PCA(2).fit_transform(digits.data)\n",
246 | "\n",
247 | "kwargs = dict(cmap = plt.cm.get_cmap('rainbow', 10),\n",
248 | " edgecolor='none', alpha=0.6)\n",
249 | "fig, ax = plt.subplots(1, 2, figsize=(8, 4))\n",
250 | "ax[0].scatter(X[:, 0], X[:, 1], c=labels, **kwargs)\n",
251 | "ax[0].set_title('learned cluster labels')\n",
252 | "\n",
253 | "ax[1].scatter(X[:, 0], X[:, 1], c=digits.target, **kwargs)\n",
254 | "ax[1].set_title('true labels');"
255 | ],
256 | "language": "python",
257 | "metadata": {},
258 | "outputs": []
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "Just for kicks, let's see how accurate our K-Means classifier is **with no label information:**"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "collapsed": false,
270 | "input": [
271 | "from sklearn.metrics import accuracy_score\n",
272 | "accuracy_score(digits.target, labels)"
273 | ],
274 | "language": "python",
275 | "metadata": {},
276 | "outputs": []
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "80% \u2013 not bad! Let's check out the confusion matrix for this:"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "collapsed": false,
288 | "input": [
289 | "from sklearn.metrics import confusion_matrix\n",
290 | "print(confusion_matrix(digits.target, labels))\n",
291 | "\n",
292 | "plt.imshow(confusion_matrix(digits.target, labels),\n",
293 | " cmap='Blues', interpolation='nearest')\n",
294 | "plt.colorbar()\n",
295 | "plt.grid(False)\n",
296 | "plt.ylabel('true')\n",
297 | "plt.xlabel('predicted');"
298 | ],
299 | "language": "python",
300 | "metadata": {},
301 | "outputs": []
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "Again, this is an 80% classification accuracy for an **entirely unsupervised estimator** which knew nothing about the labels."
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "## Example: KMeans for Color Compression\n",
315 | "\n",
316 | "One interesting application of clustering is in color image compression. For example, imagine you have an image with millions of colors. In most images, a large number of the colors will be unused, and conversely a large number of pixels will have similar or identical colors.\n",
317 | "\n",
318 | "Scikit-learn has a number of images that you can play with, accessed through the datasets module. For example:"
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "collapsed": false,
324 | "input": [
325 | "from sklearn.datasets import load_sample_image\n",
326 | "china = load_sample_image(\"china.jpg\")\n",
327 | "plt.imshow(china)\n",
328 | "plt.grid(False);"
329 | ],
330 | "language": "python",
331 | "metadata": {},
332 | "outputs": []
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "The image itself is stored in a 3-dimensional array, of size ``(height, width, RGB)``:"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "collapsed": false,
344 | "input": [
345 | "china.shape"
346 | ],
347 | "language": "python",
348 | "metadata": {},
349 | "outputs": []
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "We can envision this image as a cloud of points in a 3-dimensional color space. We'll rescale the colors so they lie between 0 and 1, then reshape the array to be a typical scikit-learn input:"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "collapsed": false,
361 | "input": [
362 | "X = (china / 255.0).reshape(-1, 3)\n",
363 | "print(X.shape)"
364 | ],
365 | "language": "python",
366 | "metadata": {},
367 | "outputs": []
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "We now have 273,280 points in 3 dimensions.\n",
374 | "\n",
375 | "Our task is to use KMeans to compress the $256^3$ colors into a smaller number (say, 64 colors). Basically, we want to find $N_{color}$ clusters in the data, and create a new image where the true input color is replaced by the color of the closest cluster."
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "collapsed": false,
381 | "input": [
382 | "# reduce the size of the image for speed\n",
383 | "image = china[::3, ::3]\n",
384 | "n_colors = 64\n",
385 | "\n",
386 | "X = (image / 255.0).reshape(-1, 3)\n",
387 | " \n",
388 | "model = KMeans(n_colors)\n",
389 | "labels = model.fit_predict(X)\n",
390 | "colors = model.cluster_centers_\n",
391 | "new_image = colors[labels].reshape(image.shape)\n",
392 | "new_image = (255 * new_image).astype(np.uint8)\n",
393 | "\n",
394 | "# create and plot the new image\n",
395 | "with sns.axes_style('white'):\n",
396 | " plt.figure()\n",
397 | " plt.imshow(image)\n",
398 | " plt.title('input')\n",
399 | "\n",
400 | " plt.figure()\n",
401 | " plt.imshow(new_image)\n",
402 | " plt.title('{0} colors'.format(n_colors))"
403 | ],
404 | "language": "python",
405 | "metadata": {},
406 | "outputs": []
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "Compare the input and output image: we've reduced the $256^3$ colors to just 64."
413 | ]
414 | }
415 | ],
416 | "metadata": {}
417 | }
418 | ]
419 | }
--------------------------------------------------------------------------------
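
The K-Means notebook above describes the two-step expectation-maximization loop in prose and visualizes it with ``plot_kmeans_interactive``. For concreteness, here is a bare-bones NumPy sketch of that loop (a hypothetical ``simple_kmeans`` helper, not part of ``fig_code``; it ignores details such as empty clusters and multiple restarts):

import numpy as np
from sklearn.datasets import make_blobs

def simple_kmeans(X, n_clusters, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # 1. guess some cluster centers (random points drawn from the data)
    centers = X[rng.permutation(len(X))[:n_clusters]]
    for _ in range(n_iter):
        # 2a. assign each point to the nearest cluster center
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(axis=1)
        # 2b. move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)
centers, labels = simple_kmeans(X, 4)
print(centers)   # close to the centers found by sklearn.cluster.KMeans(4) on this data
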
/indepth_PCA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:f5a85bf5bf354025ca4bbae625d8e4bde4d0936513cee5d134e347f0c898433f"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "# Dimensionality Reduction: Principal Component Analysis in-depth\n",
23 | "\n",
24 | "Here we'll explore **Principal Component Analysis**, which is an extremely useful linear dimensionality reduction technique.\n",
25 | "\n",
26 | "We'll start with our standard set of initial imports:"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "collapsed": false,
32 | "input": [
33 | "from __future__ import print_function, division\n",
34 | "\n",
35 | "%matplotlib inline\n",
36 | "import numpy as np\n",
37 | "import matplotlib.pyplot as plt\n",
38 | "from scipy import stats\n",
39 | "\n",
40 | "# use seaborn plotting style defaults\n",
41 | "import seaborn as sns; sns.set()"
42 | ],
43 | "language": "python",
44 | "metadata": {},
45 | "outputs": []
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## Introducing Principal Component Analysis\n",
52 | "\n",
53 | "Principal Component Analysis is a very powerful unsupervised method for *dimensionality reduction* in data. It's easiest to visualize by looking at a two-dimensional dataset:"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "collapsed": false,
59 | "input": [
60 | "np.random.seed(1)\n",
61 | "X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T\n",
62 | "plt.plot(X[:, 0], X[:, 1], 'o')\n",
63 | "plt.axis('equal');"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "We can see that there is a definite trend in the data. What PCA seeks to do is to find the **Principal Axes** in the data, and explain how important those axes are in describing the data distribution:"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "collapsed": false,
79 | "input": [
80 | "from sklearn.decomposition import PCA\n",
81 | "pca = PCA(n_components=2)\n",
82 | "pca.fit(X)\n",
83 | "print(pca.explained_variance_ratio_)\n",
84 | "print(pca.components_)"
85 | ],
86 | "language": "python",
87 | "metadata": {},
88 | "outputs": []
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "To see what these numbers mean, let's view them as vectors plotted on top of the data:"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "collapsed": false,
100 | "input": [
101 | "plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)\n",
102 | "for length, vector in zip(pca.explained_variance_, pca.components_):\n",
103 | " v = vector * 3 * np.sqrt(length)\n",
104 | " plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)\n",
105 | "plt.axis('equal');"
106 | ],
107 | "language": "python",
108 | "metadata": {},
109 | "outputs": []
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "Notice that one vector is longer than the other. In a sense, this tells us that that direction in the data is somehow more \"important\" than the other direction.\n",
116 | "The explained variance quantifies how \"important\" each direction is.\n",
117 | "\n",
118 | "Another way to think of it is that the second principal component could be **completely ignored** without much loss of information! Let's see what our data look like if we only keep 95% of the variance:"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "collapsed": false,
124 | "input": [
125 | "clf = PCA(0.95) # keep 95% of variance\n",
126 | "X_trans = clf.fit_transform(X)\n",
127 | "print(X.shape)\n",
128 | "print(X_trans.shape)"
129 | ],
130 | "language": "python",
131 | "metadata": {},
132 | "outputs": []
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "By specifying that we want to throw away 5% of the variance, we've compressed the data to half its original dimensionality! Let's see what the data look like after this compression:"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "collapsed": false,
144 | "input": [
145 | "X_new = clf.inverse_transform(X_trans)\n",
146 | "plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)\n",
147 | "plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)\n",
148 | "plt.axis('equal');"
149 | ],
150 | "language": "python",
151 | "metadata": {},
152 | "outputs": []
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "The light points are the original data, while the dark points are the projected version. We see that after truncating 5% of the variance of this dataset and then reprojecting it, the \"most important\" features of the data are maintained, and we've compressed the data by 50%!\n",
159 | "\n",
160 | "This is the sense in which \"dimensionality reduction\" works: if you can approximate a data set in a lower dimension, you can often have an easier time visualizing it or fitting complicated models to the data."
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "### Application of PCA to Digits\n",
168 | "\n",
169 | "The dimensionality reduction might seem a bit abstract in two dimensions, but the projection and dimensionality reduction can be extremely useful when visualizing high-dimensional data. Let's take a quick look at the application of PCA to the digits data we looked at before:"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "collapsed": false,
175 | "input": [
176 | "from sklearn.datasets import load_digits\n",
177 | "digits = load_digits()\n",
178 | "X = digits.data\n",
179 | "y = digits.target"
180 | ],
181 | "language": "python",
182 | "metadata": {},
183 | "outputs": []
184 | },
185 | {
186 | "cell_type": "code",
187 | "collapsed": false,
188 | "input": [
189 | "pca = PCA(2) # project from 64 to 2 dimensions\n",
190 | "Xproj = pca.fit_transform(X)\n",
191 | "print(X.shape)\n",
192 | "print(Xproj.shape)"
193 | ],
194 | "language": "python",
195 | "metadata": {},
196 | "outputs": []
197 | },
198 | {
199 | "cell_type": "code",
200 | "collapsed": false,
201 | "input": [
202 | "plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,\n",
203 | " cmap=plt.cm.get_cmap('nipy_spectral', 10))\n",
204 | "plt.colorbar();"
205 | ],
206 | "language": "python",
207 | "metadata": {},
208 | "outputs": []
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "This gives us an idea of the relationship between the digits. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits, **without reference** to the labels."
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "### What do the Components Mean?\n",
222 | "\n",
223 | "PCA is a very useful dimensionality reduction algorithm, because it has a very intuitive interpretation via *eigenvectors*.\n",
224 | "The input data is represented as a vector: in the case of the digits, our data is\n",
225 | "\n",
226 | "$$\n",
227 | "x = [x_1, x_2, x_3 \\cdots]\n",
228 | "$$\n",
229 | "\n",
230 | "but what this really means is\n",
231 | "\n",
232 | "$$\n",
233 | "image(x) = x_1 \\cdot{\\rm (pixel~1)} + x_2 \\cdot{\\rm (pixel~2)} + x_3 \\cdot{\\rm (pixel~3)} \\cdots\n",
234 | "$$\n",
235 | "\n",
236 | "If we reduce the dimensionality in the pixel space to (say) 6, we recover only a partial image:"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "collapsed": false,
242 | "input": [
243 | "from fig_code.figures import plot_image_components\n",
244 | "\n",
245 | "sns.set_style('white')\n",
246 | "plot_image_components(digits.data[0])"
247 | ],
248 | "language": "python",
249 | "metadata": {},
250 | "outputs": []
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "But the pixel-wise representation is not the only choice. We can also use other *basis functions*, and write something like\n",
257 | "\n",
258 | "$$\n",
259 | "image(x) = {\\rm mean} + x_1 \\cdot{\\rm (basis~1)} + x_2 \\cdot{\\rm (basis~2)} + x_3 \\cdot{\\rm (basis~3)} \\cdots\n",
260 | "$$\n",
261 | "\n",
262 | "What PCA does is to choose optimal **basis functions** so that only a few are needed to get a reasonable approximation.\n",
263 | "The low-dimensional representation of our data is the coefficients of this series, and the approximate reconstruction is the result of the sum:"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "collapsed": false,
269 | "input": [
270 | "from fig_code.figures import plot_pca_interactive\n",
271 | "plot_pca_interactive(digits.data)"
272 | ],
273 | "language": "python",
274 | "metadata": {},
275 | "outputs": []
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "Here we see that with only six PCA components, we recover a reasonable approximation of the input!\n",
282 | "\n",
283 | "Thus we see that PCA can be viewed from two angles. It can be viewed as **dimensionality reduction**, or it can be viewed as a form of **lossy data compression** where the loss favors noise. In this way, PCA can be used as a **filtering** process as well."
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "### Choosing the Number of Components\n",
291 | "\n",
292 | "But how much information have we thrown away? We can figure this out by looking at the **explained variance** as a function of the components:"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "collapsed": false,
298 | "input": [
299 | "sns.set()\n",
300 | "pca = PCA().fit(X)\n",
301 | "plt.plot(np.cumsum(pca.explained_variance_ratio_))\n",
302 | "plt.xlabel('number of components')\n",
303 | "plt.ylabel('cumulative explained variance');"
304 | ],
305 | "language": "python",
306 | "metadata": {},
307 | "outputs": []
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations."
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "### PCA as data compression\n",
321 | "\n",
322 | "As we mentioned, PCA can be used as a sort of data compression. Using a small ``n_components`` allows you to represent a high-dimensional point as a sum of just a few principal vectors.\n",
323 | "\n",
324 | "Here's what a single digit looks like as you change the number of components:"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "collapsed": false,
330 | "input": [
331 | "fig, axes = plt.subplots(8, 8, figsize=(8, 8))\n",
332 | "fig.subplots_adjust(hspace=0.1, wspace=0.1)\n",
333 | "\n",
334 | "for i, ax in enumerate(axes.flat):\n",
335 | " pca = PCA(i + 1).fit(X)\n",
336 | " im = pca.inverse_transform(pca.transform(X[20:21]))\n",
337 | "\n",
338 | " ax.imshow(im.reshape((8, 8)), cmap='binary')\n",
339 | " ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',\n",
340 | " transform=ax.transAxes, color='green')\n",
341 | " ax.set_xticks([])\n",
342 | " ax.set_yticks([])"
343 | ],
344 | "language": "python",
345 | "metadata": {},
346 | "outputs": []
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "Let's take another look at this by using IPython's ``interact`` functionality to view the reconstruction of several images at once:"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "collapsed": false,
358 | "input": [
359 | "from IPython.html.widgets import interact\n",
360 | "\n",
361 | "def plot_digits(n_components):\n",
362 | " fig = plt.figure(figsize=(8, 8))\n",
363 | " plt.subplot(1, 1, 1, frameon=False, xticks=[], yticks=[])\n",
364 | " nside = 10\n",
365 | " \n",
366 | " pca = PCA(n_components).fit(X)\n",
367 | " Xproj = pca.inverse_transform(pca.transform(X[:nside ** 2]))\n",
368 | " Xproj = np.reshape(Xproj, (nside, nside, 8, 8))\n",
369 | " total_var = pca.explained_variance_ratio_.sum()\n",
370 | " \n",
371 | " im = np.vstack([np.hstack([Xproj[i, j] for j in range(nside)])\n",
372 | " for i in range(nside)])\n",
373 | " plt.imshow(im)\n",
374 | " plt.grid(False)\n",
375 | " plt.title(\"n = {0}, variance = {1:.2f}\".format(n_components, total_var),\n",
376 | " size=18)\n",
377 | " plt.clim(0, 16)\n",
378 | " \n",
379 | "interact(plot_digits, n_components=[1, 64]);  # plot_digits takes only n_components"
380 | ],
381 | "language": "python",
382 | "metadata": {},
383 | "outputs": []
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "## Other Dimensionality Reduction Routines\n",
390 | "\n",
391 | "Note that scikit-learn contains many other unsupervised dimensionality reduction routines.\n",
392 | "Some that you might wish to try include:\n",
393 | "\n",
394 | "- [sklearn.decomposition.PCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.PCA.html): \n",
395 | " Principal Component Analysis\n",
396 | "- [sklearn.decomposition.RandomizedPCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.RandomizedPCA.html):\n",
397 | " extremely fast approximate PCA implementation based on a randomized algorithm\n",
398 | "- [sklearn.decomposition.SparsePCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.SparsePCA.html):\n",
399 | " PCA variant including L1 penalty for sparsity\n",
400 | "- [sklearn.decomposition.FastICA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.FastICA.html):\n",
401 | " Independent Component Analysis\n",
402 | "- [sklearn.decomposition.NMF](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.NMF.html):\n",
403 | " non-negative matrix factorization\n",
404 | "- [sklearn.manifold.LocallyLinearEmbedding](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):\n",
405 | " nonlinear manifold learning technique based on local neighborhood geometry\n",
406 | "- [sklearn.manifold.IsoMap](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.Isomap.html):\n",
407 | " nonlinear manifold learning technique based on a sparse graph algorithm\n",
408 | " \n",
409 | "Each of these has its own strengths & weaknesses, and areas of application. You can read about them on the [scikit-learn website](http://sklearn.org)."
410 | ]
411 | }
412 | ],
413 | "metadata": {}
414 | }
415 | ]
416 | }
--------------------------------------------------------------------------------
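
The PCA notebook above describes the principal components as an optimal set of basis vectors with an interpretation via eigenvectors. As a small numerical check (a sketch, not part of the repository), the components and explained variances reported by scikit-learn match the eigenvectors and eigenvalues of the data's covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.random_sample((2, 2)), rng.normal(size=(2, 200))).T   # same construction as in the notebook

# eigendecomposition of the covariance matrix, sorted by decreasing eigenvalue
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(evals)[::-1]

pca = PCA(n_components=2).fit(X)
print(pca.components_)            # rows agree with evecs[:, order].T up to sign
print(evecs[:, order].T)
print(pca.explained_variance_)    # agrees with evals[order]
print(evals[order])
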
/indepth_RandomForests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:1cfbdbd1a84e1e50abb22003cd560039a56bb0fb5ab295943913b86a53fd0a61"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "# Supervised Learning In-Depth: Random Forests"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Previously we saw a powerful discriminative classifier, **Support Vector Machines**.\n",
30 | "Here we'll motivate another powerful algorithm: a *non-parametric* method called **Random Forests**."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "collapsed": false,
36 | "input": [
37 | "%matplotlib inline\n",
38 | "import numpy as np\n",
39 | "import matplotlib.pyplot as plt\n",
40 | "from scipy import stats\n",
41 | "\n",
42 | "# use seaborn plotting defaults\n",
43 | "import seaborn as sns; sns.set()"
44 | ],
45 | "language": "python",
46 | "metadata": {},
47 | "outputs": []
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Motivating Random Forests: Decision Trees"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "Random forests are an example of an *ensemble learner* built on decision trees.\n",
61 | "For this reason we'll start by discussing decision trees themselves.\n",
62 | "\n",
63 | "Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification:"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "collapsed": false,
69 | "input": [
70 | "import fig_code\n",
71 | "fig_code.plot_example_decision_tree()"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "The binary splitting makes this extremely efficient.\n",
82 | "As always, though, the trick is to *ask the right questions*.\n",
83 | "This is where the algorithmic process comes in: in training a decision tree classifier, the algorithm looks at the features and decides which questions (or \"splits\") contain the most information.\n",
84 | "\n",
85 | "### Creating a Decision Tree\n",
86 | "\n",
87 | "Here's an example of a decision tree classifier in scikit-learn. We'll start by defining some two-dimensional labeled data:"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "collapsed": false,
93 | "input": [
94 | "from sklearn.datasets import make_blobs\n",
95 | "\n",
96 | "X, y = make_blobs(n_samples=300, centers=4,\n",
97 | " random_state=0, cluster_std=1.0)\n",
98 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');"
99 | ],
100 | "language": "python",
101 | "metadata": {},
102 | "outputs": []
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "We have some convenience functions in the repository that help with this:"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "collapsed": false,
114 | "input": [
115 | "from fig_code import visualize_tree, plot_tree_interactive"
116 | ],
117 | "language": "python",
118 | "metadata": {},
119 | "outputs": []
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "Now using IPython's ``interact`` (available in IPython 2.0+; it requires a live kernel) we can view the decision tree splits:"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "collapsed": false,
131 | "input": [
132 | "plot_tree_interactive(X, y);"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "Notice that at each increase in depth, every node is split in two **except** those nodes which contain only a single class.\n",
143 | "The result is a very fast **non-parametric** classification, and can be extremely useful in practice.\n",
144 | "\n",
145 | "**Question: Do you see any problems with this?**"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "### Decision Trees and over-fitting\n",
153 | "\n",
154 | "One issue with decision trees is that it is very easy to create trees which **over-fit** the data. That is, they are flexible enough that they can learn the structure of the noise in the data rather than the signal! For example, take a look at two trees built on two subsets of this dataset:"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "collapsed": false,
160 | "input": [
161 | "from sklearn.tree import DecisionTreeClassifier\n",
162 | "clf = DecisionTreeClassifier()\n",
163 | "\n",
164 | "plt.figure()\n",
165 | "visualize_tree(clf, X[:200], y[:200], boundaries=False)\n",
166 | "plt.figure()\n",
167 | "visualize_tree(clf, X[-200:], y[-200:], boundaries=False)"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": []
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "The details of the classifications are completely different! That is an indication of **over-fitting**: when you predict the value for a new point, the result is more reflective of the noise in the data than of the underlying signal."
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "## Ensembles of Estimators: Random Forests\n",
185 | "\n",
186 | "One possible way to address over-fitting is to use an **Ensemble Method**: this is a meta-estimator which essentially averages the results of many individual estimators which over-fit the data. Somewhat surprisingly, the resulting estimates are much more robust and accurate than the individual estimates which make them up!\n",
187 | "\n",
188 | "One of the most common ensemble methods is the **Random Forest**, in which the ensemble is made up of many decision trees which are in some way perturbed.\n",
189 | "\n",
190 | "There are volumes of theory and precedent about how to randomize these trees, but as an example, let's imagine an ensemble of estimators fit on subsets of the data. We can get an idea of what these might look like as follows:"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "collapsed": false,
196 | "input": [
197 | "def fit_randomized_tree(random_state=0):\n",
198 | " X, y = make_blobs(n_samples=300, centers=4,\n",
199 | " random_state=0, cluster_std=2.0)\n",
200 | " clf = DecisionTreeClassifier(max_depth=15)\n",
201 | " \n",
202 | " rng = np.random.RandomState(random_state)\n",
203 | " i = np.arange(len(y))\n",
204 | " rng.shuffle(i)\n",
205 | " visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,\n",
206 | " xlim=(X[:, 0].min(), X[:, 0].max()),\n",
207 | " ylim=(X[:, 1].min(), X[:, 1].max()))\n",
208 | " \n",
209 | "from IPython.html.widgets import interact\n",
210 | "interact(fit_randomized_tree, random_state=[0, 100]);"
211 | ],
212 | "language": "python",
213 | "metadata": {},
214 | "outputs": []
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "See how the details of the model change as a function of the sample, while the larger characteristics remain the same!\n",
221 | "The random forest classifier will do something similar to this, but use a combined version of all these trees to arrive at a final answer:"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "collapsed": false,
227 | "input": [
228 | "from sklearn.ensemble import RandomForestClassifier\n",
229 | "clf = RandomForestClassifier(n_estimators=1000, random_state=0)\n",
230 | "visualize_tree(clf, X, y, boundaries=False);"
231 | ],
232 | "language": "python",
233 | "metadata": {},
234 | "outputs": []
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "By averaging over 1000 randomly perturbed models, we end up with an overall model which is a much better fit to our data!\n",
241 | "\n",
242 | "*(Note: above we randomized the model through sub-sampling... Random Forests use more sophisticated means of randomization, which you can read about in, e.g. the [scikit-learn documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest)*)"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "## Quick Example: Moving to Regression\n",
250 | "\n",
251 | "Above we were considering random forests within the context of classification.\n",
252 | "Random forests can also be made to work in the case of regression (that is, continuous rather than categorical variables). The estimator to use for this is ``sklearn.ensemble.RandomForestRegressor``.\n",
253 | "\n",
254 | "Let's quickly demonstrate how this can be used:"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "collapsed": false,
260 | "input": [
261 | "from sklearn.ensemble import RandomForestRegressor\n",
262 | "\n",
263 | "x = 10 * np.random.rand(100)\n",
264 | "\n",
265 | "def model(x, sigma=0.3):\n",
266 | " fast_oscillation = np.sin(5 * x)\n",
267 | " slow_oscillation = np.sin(0.5 * x)\n",
268 | " noise = sigma * np.random.randn(len(x))\n",
269 | "\n",
270 | " return slow_oscillation + fast_oscillation + noise\n",
271 | "\n",
272 | "y = model(x)\n",
273 | "plt.errorbar(x, y, 0.3, fmt='o');"
274 | ],
275 | "language": "python",
276 | "metadata": {},
277 | "outputs": []
278 | },
279 | {
280 | "cell_type": "code",
281 | "collapsed": false,
282 | "input": [
283 | "xfit = np.linspace(0, 10, 1000)\n",
284 | "yfit = RandomForestRegressor(100).fit(x[:, None], y).predict(xfit[:, None])\n",
285 | "ytrue = model(xfit, 0)\n",
286 | "\n",
287 | "plt.errorbar(x, y, 0.3, fmt='o')\n",
288 | "plt.plot(xfit, yfit, '-r');\n",
289 | "plt.plot(xfit, ytrue, '-k', alpha=0.5);"
290 | ],
291 | "language": "python",
292 | "metadata": {},
293 | "outputs": []
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data, without us even specifying a multi-period model!"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "## Example: Random Forest for Classifying Digits\n",
307 | "\n",
308 | "We previously saw the **hand-written digits** data. Let's use that here to test the efficacy of the SVM and Random Forest classifiers."
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "collapsed": false,
314 | "input": [
315 | "from sklearn.datasets import load_digits\n",
316 | "digits = load_digits()\n",
317 | "digits.keys()"
318 | ],
319 | "language": "python",
320 | "metadata": {},
321 | "outputs": []
322 | },
323 | {
324 | "cell_type": "code",
325 | "collapsed": false,
326 | "input": [
327 | "X = digits.data\n",
328 | "y = digits.target\n",
329 | "print(X.shape)\n",
330 | "print(y.shape)"
331 | ],
332 | "language": "python",
333 | "metadata": {},
334 | "outputs": []
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "To remind us what we're looking at, we'll visualize the first few data points:"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "collapsed": false,
346 | "input": [
347 | "# set up the figure\n",
348 | "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n",
349 | "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n",
350 | "\n",
351 | "# plot the digits: each image is 8x8 pixels\n",
352 | "for i in range(64):\n",
353 | " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n",
354 | " ax.imshow(digits.images[i], cmap=plt.cm.binary)\n",
355 | " \n",
356 | " # label the image with the target value\n",
357 | " ax.text(0, 7, str(digits.target[i]))"
358 | ],
359 | "language": "python",
360 | "metadata": {},
361 | "outputs": []
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "We can quickly classify the digits using a decision tree as follows:"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "collapsed": false,
373 | "input": [
374 | "from sklearn.cross_validation import train_test_split\n",
375 | "from sklearn import metrics\n",
376 | "\n",
377 | "Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)\n",
378 | "clf = DecisionTreeClassifier(max_depth=11)\n",
379 | "clf.fit(Xtrain, ytrain)\n",
380 | "ypred = clf.predict(Xtest)"
381 | ],
382 | "language": "python",
383 | "metadata": {},
384 | "outputs": []
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "We can check the accuracy of this classifier:"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "collapsed": false,
396 | "input": [
397 | "metrics.accuracy_score(ypred, ytest)"
398 | ],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "and for good measure, plot the confusion matrix:"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "collapsed": false,
413 | "input": [
414 | "plt.imshow(metrics.confusion_matrix(ypred, ytest),\n",
415 | " interpolation='nearest', cmap=plt.cm.binary)\n",
416 | "plt.grid(False)\n",
417 | "plt.colorbar()\n",
418 | "plt.xlabel(\"predicted label\")\n",
419 | "plt.ylabel(\"true label\");"
420 | ],
421 | "language": "python",
422 | "metadata": {},
423 | "outputs": []
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "### Exercise\n",
430 | "1. Repeat this classification task with ``sklearn.ensemble.RandomForestClassifier``. How does the ``max_depth``, ``max_features``, and ``n_estimators`` affect the results?\n",
431 | "2. Try this classification with ``sklearn.svm.SVC``, adjusting ``kernel``, ``C``, and ``gamma``. Which classifier performs optimally?\n",
432 | "3. Try a few sets of parameters for each model and check the F1 score (``sklearn.metrics.f1_score``) on your results. What's the best F1 score you can reach? (One possible starting point is sketched after this notebook's listing.)"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "collapsed": false,
438 | "input": [],
439 | "language": "python",
440 | "metadata": {},
441 | "outputs": []
442 | }
443 | ],
444 | "metadata": {}
445 | }
446 | ]
447 | }
--------------------------------------------------------------------------------
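
For the exercise at the end of the random-forests notebook, one possible starting point is sketched below (it reuses the digits data and the same ``train_test_split`` import the notebook already uses; on newer scikit-learn releases that import lives in ``sklearn.model_selection``):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.cross_validation import train_test_split   # sklearn.model_selection on newer versions

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=0)

# try varying n_estimators, max_depth, and max_features as the exercise suggests
clf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print(f1_score(ytest, ypred, average='macro'))
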
/indepth_SVM.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:c63557555e0a4e14924e8e1d9a88e798cc1143102cf0c93e9af353571250919a"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "# Supervised Learning In-Depth: Support Vector Machines"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Previously we introduced supervised machine learning.\n",
30 | "There are many supervised learning algorithms available; here we'll go into brief detail one of the most powerful and interesting methods: **Support Vector Machines (SVMs)**."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "collapsed": false,
36 | "input": [
37 | "%matplotlib inline\n",
38 | "import numpy as np\n",
39 | "import matplotlib.pyplot as plt\n",
40 | "from scipy import stats\n",
41 | "\n",
42 | "# use seaborn plotting defaults\n",
43 | "import seaborn as sns; sns.set()"
44 | ],
45 | "language": "python",
46 | "metadata": {},
47 | "outputs": []
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Motivating Support Vector Machines"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for **classification** or for **regression**. SVMs are a **discriminative** classifier: that is, they draw a boundary between clusters of data.\n",
61 | "\n",
62 | "Let's show a quick example of support vector classification. First we need to create a dataset:"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "collapsed": false,
68 | "input": [
69 | "from sklearn.datasets.samples_generator import make_blobs\n",
70 | "X, y = make_blobs(n_samples=50, centers=2,\n",
71 | " random_state=0, cluster_std=0.60)\n",
72 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring');"
73 | ],
74 | "language": "python",
75 | "metadata": {},
76 | "outputs": []
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "A discriminative classifier attempts to draw a line between the two sets of data. Immediately we see a problem: such a line is ill-posed! For example, we could come up with several possibilities which perfectly discriminate between the classes in this example:"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "collapsed": false,
88 | "input": [
89 | "xfit = np.linspace(-1, 3.5)\n",
90 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
91 | "\n",
92 | "for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:\n",
93 | " plt.plot(xfit, m * xfit + b, '-k')\n",
94 | "\n",
95 | "plt.xlim(-1, 3.5);"
96 | ],
97 | "language": "python",
98 | "metadata": {},
99 | "outputs": []
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "These are three *very* different separaters which perfectly discriminate between these samples. Depending on which you choose, a new data point will be classified almost entirely differently!\n",
106 | "\n",
107 | "How can we improve on this?"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "### Support Vector Machines: Maximizing the *Margin*\n",
115 | "\n",
116 | "Support vector machines are one way to address this.\n",
117 | "What support vector machined do is to not only draw a line, but consider a *region* about the line of some given width. Here's an example of what it might look like:"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "collapsed": false,
123 | "input": [
124 | "xfit = np.linspace(-1, 3.5)\n",
125 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
126 | "\n",
127 | "for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:\n",
128 | " yfit = m * xfit + b\n",
129 | " plt.plot(xfit, yfit, '-k')\n",
130 | " plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)\n",
131 | "\n",
132 | "plt.xlim(-1, 3.5);"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "Notice here that if we want to maximize this width, the middle fit is clearly the best.\n",
143 | "This is the intuition of **support vector machines**, which optimize a linear discriminant model in conjunction with a **margin** representing the perpendicular distance between the datasets."
144 | ]
145 | },
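{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is one standard (textbook) way to write that objective; this is generic SVM notation rather than anything specific to scikit-learn. With a linear decision function $f(x) = w \\cdot x + b$, the hard-margin SVM picks the $w$ and $b$ that maximize the margin width while keeping every training point on its correct side:\n",
"\n",
"$$\\max_{w,\\,b} \\frac{2}{\\|w\\|} \\quad \\text{subject to} \\quad y_i\\,(w \\cdot x_i + b) \\ge 1 \\ \\text{for all } i.$$\n",
"\n",
"(In practice scikit-learn solves the equivalent soft-margin version, which adds a regularization parameter ``C`` that allows some points to violate the margin.)"
]
},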
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "#### Fitting a Support Vector Machine\n",
151 | "\n",
152 | "Now we'll fit a Support Vector Machine Classifier to these points. While the mathematical details of the likelihood model are interesting, we'll let you read about those elsewhere. Instead, we'll just treat the scikit-learn algorithm as a black box which accomplishes the above task."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "collapsed": false,
158 | "input": [
159 | "from sklearn.svm import SVC # \"Support Vector Classifier\"\n",
160 | "clf = SVC(kernel='linear')\n",
161 | "clf.fit(X, y)"
162 | ],
163 | "language": "python",
164 | "metadata": {},
165 | "outputs": []
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "collapsed": false,
177 | "input": [
178 | "def plot_svc_decision_function(clf, ax=None):\n",
179 | " \"\"\"Plot the decision function for a 2D SVC\"\"\"\n",
180 | " if ax is None:\n",
181 | " ax = plt.gca()\n",
182 | " x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)\n",
183 | " y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)\n",
184 | " Y, X = np.meshgrid(y, x)\n",
185 | " P = np.zeros_like(X)\n",
186 | " for i, xi in enumerate(x):\n",
187 | " for j, yj in enumerate(y):\n",
188 | " P[i, j] = clf.decision_function([xi, yj])\n",
189 | " # plot the margins\n",
190 | " ax.contour(X, Y, P, colors='k',\n",
191 | " levels=[-1, 0, 1], alpha=0.5,\n",
192 | " linestyles=['--', '-', '--'])"
193 | ],
194 | "language": "python",
195 | "metadata": {},
196 | "outputs": []
197 | },
198 | {
199 | "cell_type": "code",
200 | "collapsed": false,
201 | "input": [
202 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
203 | "plot_svc_decision_function(clf);"
204 | ],
205 | "language": "python",
206 | "metadata": {},
207 | "outputs": []
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "Notice that the dashed lines touch a couple of the points: these points are the pivotal pieces of this fit, and are known as the *support vectors* (giving the algorithm its name).\n",
214 | "In scikit-learn, these are stored in the ``support_vectors_`` attribute of the classifier:"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "collapsed": false,
220 | "input": [
221 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
222 | "plot_svc_decision_function(clf)\n",
223 | "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n",
224 | " s=200, facecolors='none');"
225 | ],
226 | "language": "python",
227 | "metadata": {},
228 | "outputs": []
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "Let's use IPython's ``interact`` functionality to explore how the distribution of points affects the support vectors and the discriminative fit.\n",
235 | "(This is only available in IPython 2.0+, and will not work in a static view)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "collapsed": false,
241 | "input": [
242 | "from IPython.html.widgets import interact\n",
243 | "\n",
244 | "def plot_svm(N=10):\n",
245 | " X, y = make_blobs(n_samples=200, centers=2,\n",
246 | " random_state=0, cluster_std=0.60)\n",
247 | " X = X[:N]\n",
248 | " y = y[:N]\n",
249 | " clf = SVC(kernel='linear')\n",
250 | " clf.fit(X, y)\n",
251 | " plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
252 | " plt.xlim(-1, 4)\n",
253 | " plt.ylim(-1, 6)\n",
254 | " plot_svc_decision_function(clf, plt.gca())\n",
255 | " plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n",
256 | " s=200, facecolors='none')\n",
257 | " \n",
258 | "interact(plot_svm, N=[10, 200], kernel='linear');"
259 | ],
260 | "language": "python",
261 | "metadata": {},
262 | "outputs": []
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "Notice the unique thing about SVM is that only the support vectors matter: that is, if you moved any of the other points without letting them cross the decision boundaries, they would have no effect on the classification results!"
269 | ]
270 | },
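{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick (optional) sanity check of that claim, the sketch below assumes ``clf``, ``X``, and ``y`` from the linear fit above are still defined: it copies the data, nudges one point that is *not* a support vector by a small amount (small enough that it stays outside the margin), refits, and compares the fitted coefficients:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sanity-check sketch: moving a non-support point (without entering the margin)\n",
"# should leave the fitted line essentially unchanged.\n",
"X2 = X.copy()\n",
"non_sv = [i for i in range(len(X2)) if i not in clf.support_]\n",
"X2[non_sv[0]] += 0.1  # small nudge to a point that is not a support vector\n",
"\n",
"clf2 = SVC(kernel='linear').fit(X2, y)\n",
"print(np.allclose(clf.coef_, clf2.coef_) and np.allclose(clf.intercept_, clf2.intercept_))"
],
"language": "python",
"metadata": {},
"outputs": []
},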
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "#### Going further: Kernel Methods\n",
276 | "\n",
277 | "Where SVM gets incredibly exciting is when it is used in conjunction with *kernels*.\n",
278 | "To motivate the need for kernels, let's look at some data which is not linearly separable:"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "collapsed": false,
284 | "input": [
285 | "from sklearn.datasets.samples_generator import make_circles\n",
286 | "X, y = make_circles(100, factor=.1, noise=.1)\n",
287 | "\n",
288 | "clf = SVC(kernel='linear').fit(X, y)\n",
289 | "\n",
290 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
291 | "plot_svc_decision_function(clf);"
292 | ],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": []
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "Clearly, no linear discrimination will ever separate these data.\n",
302 | "One way we can adjust this is to apply a **kernel**, which is some functional transformation of the input data.\n",
303 | "\n",
304 | "For example, one simple model we could use is a **radial basis function**"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "collapsed": false,
310 | "input": [
311 | "r = np.exp(-(X[:, 0] ** 2 + X[:, 1] ** 2))"
312 | ],
313 | "language": "python",
314 | "metadata": {},
315 | "outputs": []
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "If we plot this along with our data, we can see the effect of it:"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "collapsed": false,
327 | "input": [
328 | "from mpl_toolkits import mplot3d\n",
329 | "\n",
330 | "def plot_3D(elev=30, azim=30):\n",
331 | " ax = plt.subplot(projection='3d')\n",
332 | " ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='spring')\n",
333 | " ax.view_init(elev=elev, azim=azim)\n",
334 | " ax.set_xlabel('x')\n",
335 | " ax.set_ylabel('y')\n",
336 | " ax.set_zlabel('r')\n",
337 | "\n",
338 | "interact(plot_3D, elev=[-90, 90], azip=(-180, 180));"
339 | ],
340 | "language": "python",
341 | "metadata": {},
342 | "outputs": []
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "We can see that with this additional dimension, the data becomes trivially linearly separable!\n",
349 | "This is a relatively simple kernel; SVM has a more sophisticated version of this kernel built-in to the process. This is accomplished by using ``kernel='rbf'``, short for *radial basis function*:"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "collapsed": false,
355 | "input": [
356 | "clf = SVC(kernel='rbf')\n",
357 | "clf.fit(X, y)\n",
358 | "\n",
359 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n",
360 | "plot_svc_decision_function(clf)\n",
361 | "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n",
362 | " s=200, facecolors='none');"
363 | ],
364 | "language": "python",
365 | "metadata": {},
366 | "outputs": []
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "Here there are effectively $N$ basis functions: one centered at each point! Through a clever mathematical trick, this computation proceeds very efficiently using the \"Kernel Trick\", without actually constructing the matrix of kernel evaluations.\n",
373 | "\n",
374 | "We'll leave SVMs for the time being and take a look at another classification algorithm: Random Forests."
375 | ]
376 | }
377 | ],
378 | "metadata": {}
379 | }
380 | ]
381 | }
--------------------------------------------------------------------------------